Title: Visualization of Heterogeneous Data
1Visualization of Heterogeneous Data
- Mike Cammarano
- Xin (Luna) Dong
- Bryan Chan
- Jeff Klingner
- Justin Talbot
- Alon Halevy
- Pat Hanrahan
2Homogeneous data is easy.
Company    Founded  Headquarters     Logo
Microsoft  1975     47.6 N, 122.1 W  (image)
Enron      1985     29.7 N, 95.3 W   (image)
Google     1998     37.4 N, 122.0 W  (image)
3Homogeneous data is easy.
[Figure: the same company table, shown with a timeline marking the founding years 1975, 1985, and 1998 on an axis from 1970 to 2000]
5Multiple sources?
- Collaborative content
- Semi-structured data
{{Infobox Writer
| bgcolour    = silver
| name        = Edgar Allan Poe
| image       = Edgar_Allan_Poe_2.jpg
| caption     = This daguerreotype of Poe was taken in 1848 ...
| birth_date  = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date  = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation  = Poet, short story writer, editor, literary critic
| movement    = [[Romanticism]], [[Dark romanticism]]
| genre       = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = [[The Raven]]
| spouse      = [[Virginia Eliza Clemm Poe]]
...
6DBpedia.org
According to DBpedia.org:
- DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.
- The DBpedia dataset currently provides information about more than 1.95 million things, including at least
  - 80,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
7Database size
- We use a subset of DBpedia, mostly infoboxes and geonames.
  - 30 M triples
  - 2.5 GB
- We currently use an in-memory database.
- Hardware: dual-processor, dual-core AMD Opteron 280s with 8 GB RAM.
8A glimpse inside DBpedia
dbp PLACE_OF_BIRTH → dbp latitude: 39° 41′ 45″ N
dbp birth_place → w3c owl:sameAs → geonames latitude: 42.358403
10Heterogeneity
- Types
  - Decimal vs. sexagesimal coordinates: 39.70 vs. 39° 41′ 45″ N
- Names
  - PLACE_OF_BIRTH vs. birth_place
- Paths
  - dbp PLACE_OF_BIRTH → dbp latitude
  - vs.
  - dbp birth_place → w3c owl:sameAs → geonames latitude
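The type mismatch on this slide can be bridged with a small conversion. The sketch below is a minimal illustration, not the authors' code; the function name `dms_to_decimal` and the sign convention (south and west negative) are assumptions made for the example.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert sexagesimal degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Assumed convention: southern and western hemispheres are negative.
    return -value if hemisphere.upper() in ("S", "W") else value


# The slide's example: 39 degrees, 41 minutes, 45 seconds north.
latitude = dms_to_decimal(39, 41, 45, "N")
print(f"{latitude:.2f}")  # prints 39.70
```

With both coordinate styles normalized to decimal degrees, values reached through different paths can be plotted on the same map axis.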
11Scenario / Demo
18Vision: Self-configuring data
19Contributions
- Visualize heterogeneous data represented as a graph of relationships between objects
- Describe inputs to a visualization
  - Visualization template
  - Set of keywords per attribute
- Find attributes needed for a visualization by searching paths
  - Within an iterative process of search, visualization, and refinement
- Present algorithm for finding and ranking paths based on keywords
  - Efficiently enumerate paths
    - A*
    - Random sampling
  - Rank according to
    - Keywords
    - Heuristics about graph structure
20Integrate searching and visualization
- Search for potentially desirable paths
- Refine path Visualize results
- selections in context
21Matching problem
- Find the best path to a number for state latitude
[Figure: example graph around the nodes DianneFeinstein and HarryReid. Edge labels: state, capital, governor, children, pop, birthplace, spouse, party, name, houseleader, color, latitude. Literal values: 42.4, 39.0, 4, 6349000, blue.]
22Basic algorithm
- Find the best path to a number for state latitude
[Figure: same example graph as on the Matching problem slide]
1. Explore graph
2. Find paths ending in a number
3. Score and rank paths using TF/IDF
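The three steps of the basic algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph, the node names (`Senator1`, `State1`, and so on), and the particular TF/IDF weighting (term frequency within a path, inverse document frequency across the candidate paths) are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy graph: node -> list of (edge label, target).
# A target that is not itself a key of GRAPH is treated as a literal.
GRAPH = {
    "Senator1": [("state", "State1"), ("spouse", "Person2"),
                 ("party", "PartyX"), ("name", "Senator One")],
    "State1": [("latitude", 39.0), ("pop", 6349000),
               ("capital", "City1"), ("governor", "Person3")],
    "City1": [("latitude", 42.4)],
    "Person2": [("children", 4)],
    "Person3": [("birthplace", "City1")],
}


def enumerate_paths(graph, start, max_depth):
    """Step 1: yield (labels, literal) for each path of at most max_depth edges."""
    stack = [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, target in graph.get(node, []):
            path = labels + (label,)
            if target in graph:
                if len(path) < max_depth:  # interior node: keep exploring
                    stack.append((target, path))
            else:
                yield path, target  # literal: the path ends here


def is_number(value):
    """Step 2 filter: keep numeric literals (bool excluded)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def rank_paths(paths, keywords):
    """Step 3: sort paths by the summed TF*IDF of the keywords in their labels."""
    n = len(paths)
    doc_freq = Counter(label for labels, _ in paths for label in set(labels))
    scored = []
    for labels, value in paths:
        term_freq = Counter(labels)
        score = sum(
            (term_freq[k] / len(labels)) * math.log((1 + n) / (1 + doc_freq[k]))
            for k in keywords
        )
        scored.append((score, labels, value))
    return sorted(scored, key=lambda item: item[0], reverse=True)


numeric = [(p, v) for p, v in enumerate_paths(GRAPH, "Senator1", 3) if is_number(v)]
ranked = rank_paths(numeric, ["state", "latitude"])
for score, labels, value in ranked:
    print(f"{score:.3f}  {' / '.join(labels)} -> {value}")
```

On this toy graph the path state / latitude ranks first, ahead of the longer state / capital / latitude path, because both keywords make up a larger share of its labels.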
23Improving execution time
- New pruning techniques since the paper submission
  - A*
  - Bidirectional search on terms
  - Random sampling
24Pruning techniques
- Most paths do not correspond to a state latitude
- How can we avoid such bad paths?
25Pruning techniques / A* Search
- Use a scoring function that penalizes unrelated terms
- Then an A* search ignores paths with many such terms
[Figure: same example graph as on the Matching problem slide]
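One way to read this slide is as a best-first search ordered by an accumulated penalty. The sketch below is an assumption-laden illustration rather than the authors' method: it charges one unit per edge label outside the keyword set, uses a zero heuristic (so A* reduces to uniform-cost search), and prunes with a hypothetical `max_penalty` cutoff. The toy graph and node names are invented.

```python
import heapq
from itertools import count

# Hypothetical toy graph: node -> list of (edge label, target).
# A target that is not itself a key of GRAPH is treated as a literal.
GRAPH = {
    "Senator1": [("state", "State1"), ("spouse", "Person2"),
                 ("party", "PartyX"), ("name", "Senator One")],
    "State1": [("latitude", 39.0), ("pop", 6349000),
               ("capital", "City1"), ("governor", "Person3")],
    "City1": [("latitude", 42.4)],
    "Person2": [("children", 4)],
    "Person3": [("birthplace", "City1")],
}


def penalized_search(graph, start, keywords, max_depth, max_penalty):
    """Return (paths found, edges examined); paths are (penalty, labels, literal)."""
    tie = count()  # tie-breaker so the heap never has to compare node names
    frontier = [(0, next(tie), start, ())]
    found, examined = [], 0
    while frontier:
        penalty, _, node, labels = heapq.heappop(frontier)
        for label, target in graph.get(node, []):
            examined += 1
            cost = penalty + (0 if label in keywords else 1)
            if cost > max_penalty:
                continue  # too many unrelated terms: prune this path
            path = labels + (label,)
            if target in graph:
                if len(path) < max_depth:
                    heapq.heappush(frontier, (cost, next(tie), target, path))
            else:
                found.append((cost, path, target))
    found.sort(key=lambda item: item[0])
    return found, examined


keywords = {"state", "latitude"}
pruned, pruned_edges = penalized_search(GRAPH, "Senator1", keywords, 4, 0)
full, full_edges = penalized_search(GRAPH, "Senator1", keywords, 4, 10)
print("edges examined with pruning:", pruned_edges)
print("edges examined without pruning:", full_edges)
print("best path:", pruned[0][1], "->", pruned[0][2])
```

Setting `max_penalty` high enough disables pruning, which gives a full-enumeration baseline to compare edge counts against.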
26A* pruning results
- Senators on map
- Average number of edges examined at each depth, full enumeration

Depth     1   2     3       4
Image     66  5409  134226  1393766
Name      66  5446  168673  5245035
latitude  66  5408  145549  1009247

- Average number of edges examined at each depth, using A*

Depth     1   2     3     4
Image     66  2049  1615  198
Name      66  9     5092  228
latitude  66  598   2272  2148
28Pruning techniques / Random Sampling
- Do normal A* search for n randomly chosen nodes
- Only search known hits for the remaining nodes
- Prevents repeatedly checking where there are likely no paths
[Figure: example graph with edge labels and literal values, similar to the Matching problem slide]
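The sampling idea can be sketched as below. This is a hedged illustration, not the authors' code: plain enumeration stands in for the full A* search, a "hit" is assumed to mean a path that contains a keyword and ends in a number, and the toy data (six structurally identical senators) and all function names are hypothetical.

```python
import random

# Hypothetical toy data: six senators with the same graph structure.
GRAPH = {}
for i in range(6):
    GRAPH[f"Senator{i}"] = [("state", f"State{i}"), ("party", "PartyX")]
    GRAPH[f"State{i}"] = [("latitude", 30.0 + i), ("pop", 1000000 * (i + 1))]


def full_search(graph, start, max_depth):
    """Enumerate every path from start to a literal (stand-in for full A* search)."""
    hits, stack = {}, [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, target in graph.get(node, []):
            path = labels + (label,)
            if target in graph:
                if len(path) < max_depth:
                    stack.append((target, path))
            else:
                hits[path] = target
    return hits


def follow(graph, start, labels):
    """Follow one known label sequence; return the literal reached, or None."""
    node = start
    for label in labels:
        targets = [t for l, t in graph.get(node, []) if l == label]
        if not targets:
            return None
        node = targets[0]  # simplification: take the first matching edge
    return None if node in graph else node


def sampled_search(graph, instances, keywords, n_seeds, max_depth, rng):
    """Search fully from n_seeds random nodes, then only replay known hits."""
    seeds = set(rng.sample(instances, n_seeds))
    results, known = {}, set()
    for node in seeds:
        hits = {p: v for p, v in full_search(graph, node, max_depth).items()
                if keywords & set(p) and isinstance(v, (int, float))}
        results[node] = hits
        known.update(hits)
    for node in instances:
        if node not in seeds:
            replayed = {p: follow(graph, node, p) for p in known}
            results[node] = {p: v for p, v in replayed.items() if v is not None}
    return results


instances = [f"Senator{i}" for i in range(6)]
results = sampled_search(GRAPH, instances, {"latitude"}, 2, 3, random.Random(0))
for name in instances:
    print(name, results[name])
```

The trade-off is that a path which exists only for non-seed nodes is never discovered, so the number of seeds controls how much recall is risked for the speedup.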
29Sampling results
- Average edges examined at all depths

           Seed nodes (10)  Others (89)
Image      920              82
Name       40               35
State      200              175
Latitude   3100             144
Longitude  3100             144
TOTAL      7360             580

- Total edges examined
  - without sampling: 7360 × 99 = 728640
  - with sampling: 7360 × 10 + 580 × 89 = 125220
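The totals on this slide can be rechecked from the per-field averages. The short sketch below assumes the slide's figures read as 7360 × 99 without sampling and 7360 × 10 + 580 × 89 with sampling, an interpretation that matches the table; the variable names are illustrative.

```python
# Average edges examined per instance, per field: (seed nodes, other nodes).
per_field = {
    "Image": (920, 82),
    "Name": (40, 35),
    "State": (200, 175),
    "Latitude": (3100, 144),
    "Longitude": (3100, 144),
}
n_seeds, n_others = 10, 89

seed_total = sum(seed for seed, _ in per_field.values())     # 7360
other_total = sum(other for _, other in per_field.values())  # 580

# Without sampling, every instance pays the full search cost.
without_sampling = seed_total * (n_seeds + n_others)            # 728640
# With sampling, only the seeds pay it; the rest replay known hits.
with_sampling = seed_total * n_seeds + other_total * n_others   # 125220

print(without_sampling, with_sampling)
print(f"reduction: {without_sampling / with_sampling:.1f}x")
```

Under this reading, sampling cuts the edges examined by a factor of a little under six for the senators example.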
30Performance
- Runtime for senators example

State latitude  State longitude  Image  Name   Instances  Total
0.911           0.854            0.542  0.513  0.187      3.007 sec

- Runtime for astronauts example

Mission launch  Mission insignia  Name   Instances  Total
1.109           1.151             0.743  1.102      4.105 sec

- Runtime for each field in countries example

GDP per capita  Inflation  Flag   Name   Instances  Total
1.142           2.228      0.867  1.108  1.136      6.481 sec

- Performance now interactive
- With new pruning techniques, 100x faster than reported in paper.
31Variations: senators, flags versus birth places
32Timeline of manned spaceflight
33Scatterplot of inflation vs. GDP
34Precision / Recall
Senators image
          Correct  Incorrect
Accepted  86       6
Rejected  0        6

Senators state latitude
          Correct  Incorrect
Accepted  64       34
Rejected  1        0

Countries gdp per capita
          Correct  Incorrect
Accepted  206      58
Rejected  9        0
35Summary
- Visualize heterogeneous data represented as a graph of relationships between objects
- Produce visualizations conforming to templates by searching for needed attributes
- Present algorithm for finding and ranking paths based on keywords
  - Efficiently enumerate paths
  - Rank
- Now fast enough for interactive use
- High precision and recall
36Future work
- Improvements
  - UI support for initial discovery and query refinement
  - Robustness of terms / Improved ranking
  - Automatic selection of visualization
  - Visualizing missing data
  - Visualizations that reflect result relevance (selective emphasis)
- Deploy on the web
  - Wikipedia
  - The whole web
37Acknowledgements
- Funding sources
- Boeing
- RVAC
- CALO
- Tools and data
- DBpedia
- MIT SIMILE project timeline
- Tom Patterson's map artwork
38The end!
39Pruning techniques / Bidirectional Search
- Before A*, search one step back from each literal, following only edges that match keywords
- This saves one step during forward A* search
[Figure: same example graph as on the Matching problem slide]
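The backward step can be sketched as a precomputed index that the forward search consults. This is an illustrative sketch under assumptions, not the authors' implementation: plain depth-limited traversal stands in for the forward A* search, and the toy graph, node names, and function names are hypothetical.

```python
# Hypothetical toy graph: node -> list of (edge label, target).
# A target that is not itself a key of GRAPH is treated as a literal.
GRAPH = {
    "Senator1": [("state", "State1"), ("spouse", "Person2"),
                 ("party", "PartyX"), ("name", "Senator One")],
    "State1": [("latitude", 39.0), ("pop", 6349000),
               ("capital", "City1"), ("governor", "Person3")],
    "City1": [("latitude", 42.4)],
    "Person2": [("children", 4)],
    "Person3": [("birthplace", "City1")],
}


def backward_step(graph, keywords):
    """One step back from each literal, keeping only keyword-matching edges.

    Returns node -> list of (label, literal) pairs.
    """
    index = {}
    for node, edges in graph.items():
        for label, target in edges:
            if target not in graph and label in keywords:
                index.setdefault(node, []).append((label, target))
    return index


def forward_search(graph, start, index, max_depth):
    """Explore at most max_depth - 1 edges forward, finishing paths via the index."""
    results, stack = [], [(start, ())]
    while stack:
        node, labels = stack.pop()
        # The index supplies the final edge, so it is never explored forward.
        for label, literal in index.get(node, []):
            results.append((labels + (label,), literal))
        if len(labels) < max_depth - 1:
            for label, target in graph.get(node, []):
                if target in graph:
                    stack.append((target, labels + (label,)))
    return results


index = backward_step(GRAPH, {"latitude"})
results = forward_search(GRAPH, "Senator1", index, 3)
for labels, literal in sorted(results):
    print(" / ".join(labels), "->", literal)
```

Because the last hop comes from the index, the forward search stops one level earlier and never visits literal-valued edges that do not match a keyword.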
40Need for multiple paths