Visualization of Heterogeneous Data - PowerPoint PPT Presentation

1
Visualization of Heterogeneous Data
  • Mike Cammarano
  • Xin (Luna) Dong
  • Bryan Chan
  • Jeff Klingner
  • Justin Talbot
  • Alon Halevy
  • Pat Hanrahan

2
Homogeneous data is easy.
Company    Founded  Headquarters      Logo
Microsoft  1975     47.6 N, 122.1 W   [image]
Enron      1985     29.7 N, 95.3 W    [image]
Google     1998     37.4 N, 122.0 W   [image]
3
Homogeneous data is easy.
Company    Founded  Headquarters      Logo
Microsoft  1975     47.6 N, 122.1 W   [image]
Enron      1985     29.7 N, 95.3 W    [image]
Google     1998     37.4 N, 122.0 W   [image]
[Timeline: founding years 1975, 1985, and 1998 on an axis from 1970 to 2000]
4
Homogeneous data is easy.
Company    Founded  Headquarters      Logo
Microsoft  1975     47.6 N, 122.1 W   [image]
Enron      1985     29.7 N, 95.3 W    [image]
Google     1998     37.4 N, 122.0 W   [image]
[Figure: axis from 1970 to 2000]
5
Multiple sources?
  • Collaborative content
  • Semi-structured data

{{Infobox Writer
| bgcolour    = silver
| name        = Edgar Allan Poe
| image       = Edgar_Allan_Poe_2.jpg
| caption     = This [[daguerreotype]] of Poe was taken in 1848 ...
| birth_date  = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date  = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation  = Poet, short story writer, editor, literary critic
| movement    = [[Romanticism]], [[Dark romanticism]]
| genre       = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = [[The Raven]]
| spouse      = [[Virginia Eliza Clemm Poe]] ...
6
DBpedia.org
According to DBpedia.org
  • DBpedia is a community effort to extract
    structured information from Wikipedia and to make
    this information available on the Web.
  • The DBpedia dataset currently provides
    information about more than 1.95 million
    things, including at least:
      • 80,000 persons
      • 70,000 places
      • 35,000 music albums
      • 12,000 films

7
Database size
  • We use a subset of DBpedia, mostly infoboxes and
    geonames.
      • 30 M triples
      • 2.5 GB
  • We currently use an in-memory database.
  • Hardware: dual-processor, dual-core AMD Opteron
    280s with 8 GB RAM.

8
A glimpse inside DBpedia
9
A glimpse inside DBpedia
  • Kerry: dbp:PLACE_OF_BIRTH → dbp:latitude → 39° 41′ 45″ N
  • Poe: dbp:birth_place → owl:sameAs → geonames:latitude → 42.358403
10
Heterogeneity
  • Types
      • Decimal vs. sexagesimal coordinates
      • 39° 41′ 45″ N vs. 39.70
  • Names
      • PLACE_OF_BIRTH vs. birth_place
  • Paths
      • dbp:PLACE_OF_BIRTH → dbp:latitude
      • vs.
      • dbp:birth_place → owl:sameAs → geonames:latitude
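
The type mismatch on this slide (sexagesimal vs. decimal coordinates) can be bridged with a small conversion. A minimal sketch in Python, assuming the coordinate has already been split into degrees, minutes, seconds, and a hemisphere letter; the function name `dms_to_decimal` is illustrative and does not come from the talk:

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # By the usual convention, south and west are negative.
    return -value if hemisphere.upper() in ("S", "W") else value


# The slide's example: 39 deg 41 min 45 sec N is about 39.70.
print(round(dms_to_decimal(39, 41, 45, "N"), 2))
```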
11-17
Scenario / Demo
18
Vision: Self-configuring data
19
Contributions
  • Visualize heterogeneous data represented as a
    graph of relationships between objects
  • Describe inputs to a visualization
      • Visualization template
      • Set of keywords per attribute
  • Find attributes needed for a visualization by
    searching paths
      • Within an iterative process of search,
        visualization, and refinement
  • Present algorithm for finding and ranking paths
    based on keywords
      • Efficiently enumerate paths
          • A*
          • Random sampling
      • Rank according to
          • Keywords
          • Heuristics about graph structure
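
The inputs described on this slide (a visualization template plus a set of keywords per attribute) could be represented roughly as follows. This is only a sketch: the class names `AttributeSpec` and `VisualizationTemplate` and the field layout are assumptions made for illustration. The attribute names are borrowed from the senators example later in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeSpec:
    """One attribute a visualization needs (hypothetical structure)."""
    name: str
    value_type: type
    keywords: list = field(default_factory=list)


@dataclass
class VisualizationTemplate:
    """A visualization kind plus the attributes it must be supplied with."""
    kind: str
    attributes: list


# Assumed template for plotting senators on a map.
senators_map = VisualizationTemplate(
    kind="map",
    attributes=[
        AttributeSpec("latitude", float, ["state", "latitude"]),
        AttributeSpec("longitude", float, ["state", "longitude"]),
        AttributeSpec("image", str, ["image"]),
        AttributeSpec("name", str, ["name"]),
    ],
)

for spec in senators_map.attributes:
    print(spec.name, spec.value_type.__name__, spec.keywords)
```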

20
Integrate searching and visualization
  • Search for potentially desirable paths
  • Visualize results in context
  • Refine path selections

21
Matching problem
  • Find the best path to a number for state
    latitude

[Figure: example graph. Labels shown: state, capital, governor, children, pop, birthplace, spouse, party, houseleader, name, color, latitude (twice). Values shown: 42.4, 39.0, 4, 6349000, blue, DianneFeinstein, HarryReid.]
22
Basic algorithm
  • Find the best path to a number for state
    latitude

[Figure: example graph. Labels shown: state, capital, governor, children, pop, birthplace, spouse, party, houseleader, name, color, latitude (twice). Values shown: 42.4, 39.0, 4, 6349000, blue, DianneFeinstein, HarryReid.]
1. Explore graph
2. Find paths ending in a number
3. Score and rank paths using TF/IDF
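
The three steps can be sketched on a toy graph. Everything below is an assumption made for illustration: the graph data is invented (loosely following the labels in the figure), and the scoring is a simplified, smoothed TF-IDF that stands in for whatever weighting the authors actually used.

```python
import math

# Toy graph (hypothetical): node -> list of (edge label, target).
# A target that is not a key of GRAPH is treated as a literal.
GRAPH = {
    "senator": [("state", "home_state"), ("birthplace", "birth_city"),
                ("party", "party_node")],
    "home_state": [("latitude", 42.4), ("pop", 6349000),
                   ("governor", "governor_node")],
    "birth_city": [("latitude", 39.0)],
    "governor_node": [("children", 4)],
    "party_node": [("color", "blue")],
}


def enumerate_paths(graph, start, max_depth=3):
    """Step 1: yield (labels, literal) for each path from start to a literal."""
    stack = [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, target in graph.get(node, []):
            path = labels + (label,)
            if isinstance(target, str) and target in graph:
                if len(path) < max_depth:
                    stack.append((target, path))
            else:
                yield path, target


def is_number(value):
    """Step 2 filter: keep only numeric literals."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def rank_paths(paths, keywords):
    """Step 3: score each label path by a smoothed TF-IDF sum over keywords."""
    n = len(paths)

    def idf(term):
        df = sum(term in labels for labels in paths)
        return math.log(1 + n / df) if df else 0.0

    scored = [
        (sum(labels.count(k) / len(labels) * idf(k) for k in keywords), labels)
        for labels in paths
    ]
    return sorted(scored, reverse=True)


numeric = [labels for labels, value in enumerate_paths(GRAPH, "senator")
           if is_number(value)]
ranking = rank_paths(numeric, ["state", "latitude"])
best_score, best_path = ranking[0]
print(best_path)
```

On this toy data the path state → latitude ranks first, ahead of birthplace → latitude, because it matches both keywords.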
23
Improving execution time
  • New pruning techniques since the paper submission
      • A*
      • Bidirectional search on terms
      • Random sampling

24
Pruning techniques
  • Most paths do not correspond to a state
    latitude
  • How can we avoid such bad paths?

25
Pruning techniques / A* Search
  • Use a scoring function that penalizes unrelated
    terms
  • Then an A* search ignores paths with many such
    terms

[Figure: example graph. Labels shown: state, capital, governor, children, pop, birthplace, spouse, party, houseleader, name, color, latitude (twice). Values shown: 42.4, 39.0, 4, 6349000, blue, DianneFeinstein, HarryReid.]
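
A rough sketch of the idea on this slide: order the frontier by a penalty that counts edge labels unrelated to the keywords, and drop paths whose penalty exceeds a budget. This version uses a zero heuristic, so it is the uniform-cost special case of A*, not necessarily the authors' exact formulation. The graph, the penalty of 1 per unrelated label, and the `max_penalty` budget are all assumptions.

```python
import heapq

# Toy graph (hypothetical): node -> list of (edge label, target).
GRAPH = {
    "senator": [("state", "home_state"), ("birthplace", "birth_city"),
                ("party", "party_node")],
    "home_state": [("latitude", 42.4), ("pop", 6349000),
                   ("governor", "governor_node")],
    "birth_city": [("latitude", 39.0)],
    "governor_node": [("children", 4)],
    "party_node": [("color", "blue")],
}


def penalty_search(graph, start, keywords, max_penalty=1, max_depth=4):
    """Return sorted (penalty, labels, number) for paths within the budget."""
    frontier = [(0, 0, start, ())]  # (penalty, tie-breaker, node, labels)
    counter = 1
    results = []
    while frontier:
        penalty, _, node, labels = heapq.heappop(frontier)
        for label, target in graph.get(node, []):
            cost = penalty + (0 if label in keywords else 1)
            if cost > max_penalty:
                continue  # pruned: too many unrelated terms on this path
            path = labels + (label,)
            if isinstance(target, str) and target in graph:
                if len(path) < max_depth:
                    heapq.heappush(frontier, (cost, counter, target, path))
                    counter += 1
            elif isinstance(target, (int, float)):
                results.append((cost, path, target))
    return sorted(results)


results = penalty_search(GRAPH, "senator", {"state", "latitude"})
for cost, path, value in results:
    print(cost, path, value)
```

With a budget of 1, the paths through party → color and governor → children are never completed, which is the pruning effect the slide describes.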
26
A* pruning results
  • Senators on map
  • Average number of edges examined at each depth,
    full enumeration

Depth     1    2     3       4
Image     66   5409  134226  1393766
Name      66   5446  168673  5245035
latitude  66   5408  145549  1009247

  • Average number of edges examined at each depth,
    using A*

Depth     1    2     3     4
Image     66   2049  1615  198
Name      66   9     5092  228
latitude  66   598   2272  2148
27
Pruning techniques / Random Sampling
  • Do normal A* search for n randomly chosen nodes

[Figure: example graph. Labels shown: state, capital, governor, children, pop, birthplace, spouse, party, houseleader, name, color, latitude (twice). Values shown: 42.4, 39.0, 4, 6349000, blue, HarryReid.]
28
Pruning techniques / Random Sampling
  • Do normal A* search for n randomly chosen nodes
  • Only search known hits for the remaining nodes
  • Prevents repeatedly checking where there are
    likely no paths

[Figure: example graph. Labels shown: state, capital, governor, children, pop, birthplace, spouse, party, houseleader, name, color, latitude (twice). Values shown: 42.4, 39.0, 4, 6349000, blue, HarryReid.]
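
The sampling idea can be sketched as follows: run a full search from a few randomly chosen seed nodes, collect the label paths that reached a number, and for every instance follow only those known paths. This is an illustrative sketch. The toy graph is invented, a plain depth-first search stands in for the A* search, and for simplicity the known paths are re-followed for the seeds as well.

```python
import random

# Toy graph (hypothetical data): node -> list of (edge label, target).
GRAPH = {
    "senator_1": [("state", "state_a"), ("name", "A")],
    "senator_2": [("state", "state_b"), ("name", "B")],
    "senator_3": [("state", "state_c"), ("name", "C")],
    "state_a": [("latitude", 42.4)],
    "state_b": [("latitude", 39.0)],
    "state_c": [("latitude", 37.0)],
}


def all_numeric_paths(graph, start, max_depth=3):
    """Full search from one node: every label path that ends in a number."""
    found, stack = set(), [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, target in graph.get(node, []):
            path = labels + (label,)
            if isinstance(target, str) and target in graph:
                if len(path) < max_depth:
                    stack.append((target, path))
            elif isinstance(target, (int, float)):
                found.add(path)
    return found


def follow(graph, start, labels):
    """Follow one known label path and return the numbers it reaches."""
    frontier = [start]
    for label in labels:
        frontier = [t for node in frontier
                    for l, t in graph.get(node, []) if l == label]
    return [v for v in frontier if isinstance(v, (int, float))]


def sampled_search(graph, instances, n_seeds, rng):
    """Search seeds fully, then follow only the known hits for each instance."""
    known = set()
    for seed in rng.sample(instances, n_seeds):
        known |= all_numeric_paths(graph, seed)
    hits = {inst: {p: follow(graph, inst, p) for p in known}
            for inst in instances}
    return known, hits


instances = ["senator_1", "senator_2", "senator_3"]
known, hits = sampled_search(GRAPH, instances, n_seeds=1, rng=random.Random(0))
print(known)
print(hits["senator_3"])
```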
29
Sampling results
  • Average edges examined at all depths

           Seed nodes (10)  Others (89)
Image      920              82
Name       40               35
State      200              175
Latitude   3100             144
Longitude  3100             144
TOTAL      7360             580

  • Total edges examined
      • without sampling: 7360 × 99 = 728640
      • with sampling: 7360 × 10 + 580 × 89 = 125220
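
The totals on this slide follow from the per-node averages in the table. A short check in Python, using only the figures shown on the slide (the variable names are illustrative):

```python
# Average edges examined per node, copied from the table on this slide.
seed_avg = {"Image": 920, "Name": 40, "State": 200,
            "Latitude": 3100, "Longitude": 3100}
other_avg = {"Image": 82, "Name": 35, "State": 175,
             "Latitude": 144, "Longitude": 144}
n_seeds, n_others = 10, 89

per_seed = sum(seed_avg.values())    # full search cost per node
per_other = sum(other_avg.values())  # cost per node when following known hits

without_sampling = per_seed * (n_seeds + n_others)
with_sampling = per_seed * n_seeds + per_other * n_others

print(without_sampling, with_sampling)
print(round(without_sampling / with_sampling, 1))
```

On these numbers, sampling cuts the edges examined by a factor of roughly 5.8.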
30
Performance
  • Runtime for senators example

State latitude  State longitude  Image  Name   Instances  Total
0.911           0.854            0.542  0.513  0.187      3.007 sec

  • Runtime for astronauts example

Mission launch  Mission insignia  Name   Instances  Total
1.109           1.151             0.743  1.102      4.105 sec

  • Runtime for each field in countries example

GDP per capita  Inflation  Flag   Name   Instances  Total
1.142           2.228      0.867  1.108  1.136      6.481 sec

  • Performance now interactive
  • With new pruning techniques, 100x faster than
    reported in paper.
31
Variations: senators, flags versus birth places
32
Timeline of manned spaceflight
33
Scatterplot of inflation vs. GDP
34
Precision / Recall
Senators image
          Correct  Incorrect
Accepted  86       6
Rejected  0        6

Senators state latitude
          Correct  Incorrect
Accepted  64       34
Rejected  1        0

Countries gdp per capita
          Correct  Incorrect
Accepted  206      58
Rejected  9        0
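
Precision and recall can be derived from these counts under one reading of the tables. The reading is an assumption, not something the slide states: accepted and correct counts as a true positive, accepted and incorrect as a false positive, and rejected and correct as a false negative.

```python
def precision_recall(accepted_correct, accepted_incorrect, rejected_correct):
    """Precision and recall under the assumed reading of the tables."""
    precision = accepted_correct / (accepted_correct + accepted_incorrect)
    recall = accepted_correct / (accepted_correct + rejected_correct)
    return precision, recall


# (accepted correct, accepted incorrect, rejected correct) from the slide.
tables = {
    "Senators image": (86, 6, 0),
    "Senators state latitude": (64, 34, 1),
    "Countries gdp per capita": (206, 58, 9),
}

for name, counts in tables.items():
    p, r = precision_recall(*counts)
    print(f"{name}: precision {p:.2f}, recall {r:.2f}")
```

Under that reading, the three examples give precision of about 0.93, 0.65, and 0.78, and recall of about 1.00, 0.98, and 0.96.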
35
Summary
  • Visualize heterogeneous data represented as a
    graph of relationships between objects
  • Produce visualizations conforming to templates by
    searching for needed attributes
  • Present algorithm for finding and ranking paths
    based on keywords
      • Efficiently enumerate paths
      • Rank
  • Now fast enough for interactive use
  • High precision and recall

36
Future work
  • Improvements
      • UI support for initial discovery and query
        refinement
      • Robustness of terms / Improved ranking
      • Automatic selection of visualization
      • Visualizing missing data
      • Visualizations that reflect result relevance
        (selective emphasis)
  • Deploy on the web
      • Wikipedia
      • The whole web

37
Acknowledgements
  • Funding sources
      • Boeing
      • RVAC
      • CALO
  • Tools and data
      • DBpedia
      • MIT SIMILE project Timeline
      • Tom Patterson's map artwork

38
The end!
39
Pruning techniques / Bidirectional Search
  • Before A*, search one step back from each
    literal, following only edges that match
    keywords
  • This saves one step during forward A* search

[Figure: example graph. Labels shown: state, capital, governor, children, pop, birthplace, spouse, party, houseleader, name, color, latitude (twice). Values shown: 42.4, 39.0, 4, 6349000, blue, DianneFeinstein, HarryReid.]
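
A minimal sketch of the backward step: first index every node that has a keyword-matching edge into a numeric literal, then let the forward search finish as soon as it reaches one of those nodes. The toy graph is invented, and a plain depth-first search stands in for the forward A* search. The function names are hypothetical.

```python
# Toy graph (hypothetical): node -> list of (edge label, target).
GRAPH = {
    "senator": [("state", "home_state"), ("birthplace", "birth_city"),
                ("party", "party_node")],
    "home_state": [("latitude", 42.4), ("pop", 6349000),
                   ("governor", "governor_node")],
    "birth_city": [("latitude", 39.0)],
    "governor_node": [("children", 4)],
    "party_node": [("color", "blue")],
}


def backward_step(graph, keywords):
    """One step back from each numeric literal, along keyword edges only."""
    goals = {}
    for node, edges in graph.items():
        for label, target in edges:
            if label in keywords and isinstance(target, (int, float)):
                goals.setdefault(node, []).append((label, target))
    return goals


def forward_search(graph, start, goals, max_depth=3):
    """Forward search that completes a path on reaching a goal node."""
    results, stack = [], [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, literal in goals.get(node, []):
            results.append((labels + (label,), literal))
        # Node-to-node expansion stops one level early: the final hop to a
        # literal comes from the precomputed goals index.
        if len(labels) < max_depth - 1:
            for label, target in graph.get(node, []):
                if isinstance(target, str) and target in graph:
                    stack.append((target, labels + (label,)))
    return results


goals = backward_step(GRAPH, {"latitude"})
found = sorted(forward_search(GRAPH, "senator", goals))
print(found)
```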
40-41
Need for multiple paths