Title: TANGO: Table ANalysis for Generating Ontologies
1 TANGO: Table ANalysis for Generating Ontologies
- Yuri A. Tijerino,
- David W. Embley,
- Deryle W. Lonsdale and
- George Nagy
- Brigham Young University
- Rensselaer Polytechnic Institute
2 List of contents
- Motivation
- Applications
- Table understanding
- Concept matching
- Ontology merging/growing
- Example
- Future direction
3 Motivation
- Semi-automated ontological engineering through Table Analysis for Generating Ontologies (TANGO)
- Keyword or link analysis search not enough to search for information in tables
- Structure in tables can lead to domain knowledge which includes concepts, relationships and constraints (ontologies)
- Tables on web created for human use can lead to robust domain ontologies
4 TANGO Applications
- Extraction ontologies (generation)
- Data integration
- Semantic web
- Multiple-source query processing
- Document image analysis for documents that
contain tables
5 Table understanding
- What is a table?
- Why table normalization?
- What is table understanding?
- What is mini-ontology generation?
6 Table understanding: What is a table?
- "A two-dimensional assembly of cells used to present information" (Lopresti and Nagy)
- Normalized tables (row-column format)
- Small paper (using OCR) and/or electronic tables
(marked up) intended for human use
7 Table understanding: What is table normalization?
- Table normalization means to take any table and produce a standard row-column table with all data cells containing expanded values and type information
[Figure: raw table]
Normalized table:
Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | 21,000,000,000 | 800 | ? | ?
Albania | 13,200,000,000 | 3,800 | 7.3 | 3.0
Algeria | 177,000,000,000 | 5,600 | 3.8 | 3.0
Andorra | 1,300,000,000 | 19,000 | 3.8 | 4.3
Angola | 13,300,000,000 | 1,330 | 5.4 | 110.0
Antigua and Barbuda | 674,000,000 | 10,000 | 3.5 | 0.4
8 Table understanding: What is table normalization?
9 Table understanding: What is table normalization?
?? | Population | Population Growth rate | Population Density | Birth Rate | Death Rate | Migration Rate | Life Expectancy Male | Life Expectancy Female | Infant Mortality
Afghanistan | 25,824,882 | 3.95 | 39.88 persons/km2 | 4.19 | 1.70 | 1.46 | 47.82 years | 46.82 years | 14.06
Albania | 3,364,571 | 1.05 | 122.79 persons/km2 | 2.07 | 0.74 | -0.29 | 65.92 years | 72.33 years | 4.29
Algeria | 31,133,486 | 2.10 | 13.07 persons/km2 | 2.70 | 0.55 | -0.05 | 68.07 years | 70.46 years | 4.38
American Samoa | 63,786 | 2.64 | 320.53 persons/km2 | 2.65 | 0.40 | 0.39 | 71.23 years | 79.95 years | 1.02
Andorra | 65,939 | 2.24 | 146.53 persons/km2 | 1.03 | 0.55 | 1.76 | 80.55 years | 86.55 years | 0.41
Angola | 11,510 | 2.84 | 8.97 persons/km2 | 4.31 | 1.64 | 0.16 | 46.08 years | 50.82 years | 12.92
Western Sahara | 239,333 | 2.34 | 0.90 persons/km2 | 4.54 | 1.66 | -0.54 | 47.98 years | 50.57 years | 13.67
World | 5,995,544,836 | 1.30 | 14.42 persons/km2 | 2.20 | 0.90 | ? | 61.00 years | 65.00 years | 5.60
Yemen | 16,942,230 | 3.34 | 32.09 persons/km2 | 4.33 | 0.99 | 0.00 | 58.17 years | 61.88 years | 6.98
Zambia | 9,663,535 | 2.12 | 13.05 persons/km2 | 4.45 | 2.26 | 0.08 | 36.72 years | 37.21 years | 9.19
Zimbabwe | 11,163,160 | 1.02 | 28.87 persons/km2 | 3.06 | 2.04 | ? | 38.77 years | 38.94 years | 6.12
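The "expanded values and type information" that normalization calls for can be illustrated with a small cell parser. This is only a sketch under stated assumptions: the names `Cell` and `normalize_cell` are hypothetical, and the rule that "?" means an unknown value is inferred from the tables above, not taken from the TANGO implementation.

```python
import re
from typing import NamedTuple, Optional

class Cell(NamedTuple):
    value: Optional[float]  # None for unknown values such as "?"
    unit: Optional[str]     # e.g. "years" or "persons/km2"; None if no unit is present

# A number (optionally signed, with thousands separators) followed by an optional unit.
_CELL = re.compile(r"^\s*(-?[\d,]*\.?\d+)\s*(.*?)\s*$")

def normalize_cell(text: str) -> Cell:
    """Expand one raw numeric cell into a value plus unit information."""
    if text.strip() in {"", "?"}:
        return Cell(None, None)
    match = _CELL.match(text)
    if match is None:
        raise ValueError(f"not a numeric cell: {text!r}")
    number, unit = match.groups()
    return Cell(float(number.replace(",", "")), unit or None)

print(normalize_cell("39.88 persons/km2"))
print(normalize_cell("25,824,882"))
print(normalize_cell("?"))
```

Name columns (the left-most column here) would need a separate recognizer; this sketch deliberately rejects non-numeric text.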
10 Table understanding: Information useful for normalization
- Captions in vicinity of table (above, below, etc.)
- Footnotes on annotated column labels or data cells
- Embedded information in rows, columns or cells, e.g., (1,000), billions, etc.
- Links to other views of the table, possibly with new information
11 What is table understanding?
- Normalize table
- Take a table as an input and produce standard records in the form of attribute-value pairs as output
- Discover constraints among columns
- Understand the data values
Example record:
<Country: Afghanistan>, <GDP/PPP: 21,000,000,000>, <GDP/PPP per capita: 800>, <Real-growth rate: ?>, <Inflation: ?>
Relationships:
has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita), has(Country, Real-growth rate), has(Country, Inflation)
Annotated table:
Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | 21,000,000,000 | 800 | ? | ?
Albania | 13,200,000,000 | 3,800 | 7.3 | 3.0
Algeria | 177,000,000,000 | 5,600 | 3.8 | 3.0
Andorra | 1,300,000,000 | 19,000 | 3.8 | 4.3
Angola | 13,300,000,000 | 1,330 | 5.4 | 110.0
Antigua and Barbuda | 674,000,000 | 10,000 | 3.5 | 0.4
Annotations:
- Left-most, primary key
- Dollar amount (from data frame)
- Percentage (from data frame)
- Country names (from data frame)
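The step from a normalized table to attribute-value records and has(...) relationships can be sketched as follows. The function names `to_records` and `has_relationships` are hypothetical, and treating the left-most column as the primary key follows the slide's annotation; this is a minimal illustration, not the TANGO implementation.

```python
def to_records(header, rows):
    """Turn a normalized table into one list of (attribute, value) pairs per row."""
    return [list(zip(header, row)) for row in rows]

def has_relationships(header):
    """Relate the left-most column (assumed to be the primary key) to every other column."""
    key, *others = header
    return [f"has({key}, {attribute})" for attribute in others]

header = ["Country", "GDP/PPP", "GDP/PPP Per Capita", "Real-Growth Rate", "Inflation"]
rows = [
    ["Afghanistan", "21,000,000,000", "800", "?", "?"],
    ["Albania", "13,200,000,000", "3,800", "7.3", "3.0"],
]

records = to_records(header, rows)
relationships = has_relationships(header)

for attribute, value in records[0]:
    print(f"<{attribute}: {value}>")
print(relationships)
```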
12 Example: Creating a domain ontology
[Figure: domain ontology diagram. Labels: Geopolitical Entity, Country, City, Name, Location, Latitude, Longitude, Distances, Time, Duration between Time zones. Relationship labels: has, names, Has GMT. Annotations: "Latitude and longitude designates location", "Includes procedural knowledge", "Has associated data frames"]
13 Example: Table understanding to mini-ontology generation
Agglomeration | Population | Continent | Country
Tokyo | 31,139,900 | Asia | Japan
New York-Philadelphia | 30,286,900 | The Americas | United States of America
Mexico | 21,233,900 | The Americas | Mexico
Seoul | 19,969,100 | Asia | Korea (South)
Sao Paulo | 18,847,400 | The Americas | Brazil
Jakarta | 17,891,000 | Asia | Indonesia
Osaka-Kobe-Kyoto | 17,621,500 | Asia | Japan
Niigata | 503,500 | Asia | Japan
Raurkela | 503,300 | Asia | India
Homjel | 502,200 | Europe | Belarus
Zunyi | 501,900 | Asia | China
Santiago | 501,800 | The Americas | Dominican Republic
Pingdingshan | 501,500 | Asia | China
Fargona | 501,000 | Asia | Uzbekistan
Kirov | 500,200 | Europe | Russia
Newcastle | 500,000 | Australia/Oceania | Australia
14 Example: Concept matching to ontology merging
[Figure: ontology merge diagram. Labels: Merge, Results, Has GMT (twice)]
15 Concept matching
We use exhaustive concept matching techniques to match concepts from different mini-ontologies, including:
- Lexical and Natural Language Processing
- Value Similarity
- Value Features
- Data Frame Comparison
- Constraints
16 Concept Matching (Lexical & NLP)
- Lexical
  - Direct comparisons (substring/superstring)
  - WordNet (Synonyms, Word Senses, Hypernyms/Hyponyms)
- Natural Language Processing
  - Phrases in column headers
  - Footnotes (for columns, rows, values)
  - Explanations of symbols, rows, columns
  - Titles and subtitles
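The direct substring/superstring comparison of labels can be sketched as below. The helper names `normalize_label` and `lexical_match` are hypothetical, and the normalization (lower-casing, dropping punctuation) is an assumption; the WordNet lookups listed above are not shown.

```python
import re

def normalize_label(label: str) -> str:
    """Lower-case a label and collapse punctuation and whitespace to single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", label.lower()))

def lexical_match(a: str, b: str) -> str:
    """Classify label a against label b as 'equal', 'substring', 'superstring' or 'none'."""
    na, nb = normalize_label(a), normalize_label(b)
    if not na or not nb:
        return "none"
    if na == nb:
        return "equal"
    if na in nb:
        return "substring"    # a is contained in b
    if nb in na:
        return "superstring"  # a contains b
    return "none"

print(lexical_match("Real-Growth Rate", "real growth rate"))
print(lexical_match("Population", "Population Growth rate"))
print(lexical_match("GDP/PPP Per Capita", "GDP/PPP"))
```

A substring hit is only one piece of evidence: "Population" and "Population Growth rate" share text but name different concepts, which is why the value-based techniques are needed as well.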
17 Concept Matching (Value Similarity)
- Compute overlap for string values comparing data sets
- Compute overlap for numeric values comparing Gaussian Probability Distributions
- Compute similarity of numeric values using regression
18 Concept Matching (Value Similarity)
A: Afghanistan, Albania, Algeria, Andorra, Yemen, Zambia, Zimbabwe
B: Afghanistan, Albania, Algeria, American Samoa, World, Yemen, Zambia, Zimbabwe
[Figure labels: "In A not in B", "In B not in A"]
Real-world example:
- Total of 193 cells in A
- Total of 267 cells in B
- 77 fields in B not in A
- 3 fields in A not in B
- 190 total matches
- Proportion of matches with respect to A: 190/193 = 98%
- Proportion of matches with respect to B: 190/267 = 71%
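The string-value overlap on this slide amounts to set arithmetic. A minimal sketch, assuming exact string equality counts as a match (the function name `value_overlap` and the two short example columns are illustrative only):

```python
def value_overlap(a, b):
    """Compare two columns of string values as sets and report matches and proportions."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        raise ValueError("both columns need at least one value")
    matches = set_a & set_b
    return {
        "matches": len(matches),
        "in_a_not_b": sorted(set_a - set_b),
        "in_b_not_a": sorted(set_b - set_a),
        "proportion_a": len(matches) / len(set_a),
        "proportion_b": len(matches) / len(set_b),
    }

column_a = ["Afghanistan", "Albania", "Algeria", "Andorra", "Yemen", "Zambia", "Zimbabwe"]
column_b = ["Afghanistan", "Albania", "Algeria", "American Samoa", "World",
            "Yemen", "Zambia", "Zimbabwe"]

overlap = value_overlap(column_a, column_b)
print(overlap)
```

With the real-world counts from the slide, the same proportions are 190/193 (about 0.98) with respect to A and 190/267 (about 0.71) with respect to B.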
19 Concept Matching (Value Similarity)
Gaussian PDF
A: 31,900,600; 30,521,550; 25,335,200; 12,300,555; 3,567,203; 2,300,531; 1,400,112
B: 31,500,900; 30,400,111; 25,500,100; 21,000,900; 7,000,000; 3,500,050; 2,300,000; 1,500,000
[Figure labels: "In A not in B", "In B not in A"]
- Total of 170 cells in A
- Total of 240 cells in B
- 50 fields in B not in A
- 2 fields in A not in B
- 168 total matches
- Proportion of matches with respect to A: 168/170 = 99%
- Proportion of matches with respect to B: 168/240 = 70%
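One way to compare numeric columns through Gaussian distributions is to fit a normal distribution to each column and measure how much the two density curves overlap. The slide does not say how TANGO computes this, so the sketch below is an assumption: it uses the overlap coefficient provided by Python's standard `statistics.NormalDist` (Python 3.8+), and `gaussian_similarity` is a hypothetical name.

```python
from statistics import NormalDist

def gaussian_similarity(a, b):
    """Overlap (0.0 to 1.0) of the normal distributions fitted to two numeric columns."""
    return NormalDist.from_samples(a).overlap(NormalDist.from_samples(b))

# Values copied from the slide.
column_a = [31_900_600, 30_521_550, 25_335_200, 12_300_555,
            3_567_203, 2_300_531, 1_400_112]
column_b = [31_500_900, 30_400_111, 25_500_100, 21_000_900,
            7_000_000, 3_500_050, 2_300_000, 1_500_000]

similarity = gaussian_similarity(column_a, column_b)
print(f"overlap of fitted Gaussians: {similarity:.3f}")
```

Each column needs at least two distinct values for the fit to succeed; columns with very different scales (for example, populations versus percentages) yield an overlap close to zero.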
20 Concept Matching (Value Features)
We can also compute similarities from value characteristics such as:
- Character/numeric length, ratio
- Numeric values: mean, variance, standard deviation
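The value characteristics listed above can be computed per column and then compared across columns. The sketch below is one possible reading: "length" is taken as the mean number of characters per cell and "ratio" as the share of digit characters, both of which are assumptions, as is the name `value_features`.

```python
from statistics import mean, pstdev, pvariance

def value_features(values):
    """Summarize a non-empty column of raw cell strings with simple features."""
    lengths = [len(v) for v in values]
    total_chars = sum(lengths)
    digits = sum(ch.isdigit() for v in values for ch in v)
    numbers = []
    for v in values:
        try:
            numbers.append(float(v.replace(",", "")))
        except ValueError:
            pass  # non-numeric cell such as "?" or a name
    features = {
        "mean_length": mean(lengths),
        "digit_ratio": digits / total_chars if total_chars else 0.0,
    }
    if numbers:
        features.update(
            mean=mean(numbers),
            variance=pvariance(numbers),
            std_dev=pstdev(numbers),
        )
    return features

features = value_features(["800", "3,800", "5,600", "?"])
print(features)
```

Two columns whose feature dictionaries are close (similar lengths, digit ratios, means and spreads) are candidates for the same concept even when their headers differ.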
21 Concept Matching (Data frames)
- Snippets of real-world knowledge about data (type, length, nearby keywords, patterns as in regexps, functional, etc.)
- We have used data frames to:
  - Recognize data types
  - Include recognizers for values (dates, times, longitude, latitude, countries, cities, etc.)
  - Provide conversion routines
  - Match headers, labels, footnotes and values
  - Compose or split columns (e.g., addresses)
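A data frame in this sense (a bundle of knowledge about one kind of value, not a pandas DataFrame) can be approximated by a pattern, a few nearby keywords and a conversion routine. Everything in the sketch below is hypothetical: the class `DataFrameSketch`, the two example frames, their regular expressions and keywords, and the scoring rule in `recognize` are illustrative assumptions, not the authors' data frames.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DataFrameSketch:
    name: str
    pattern: re.Pattern              # recognizer for individual values
    keywords: tuple                  # words expected in nearby headers or labels
    convert: Callable[[str], float]  # conversion routine to a canonical value

FRAMES = [
    DataFrameSketch(
        "Percentage",
        re.compile(r"-?\d+(\.\d+)?\s*%?"),
        ("rate", "inflation", "percent"),
        lambda s: float(s.rstrip("% ")),
    ),
    DataFrameSketch(
        "Dollar amount",
        re.compile(r"\$?\d{1,3}(,\d{3})*"),
        ("gdp", "dollar", "income"),
        lambda s: float(s.lstrip("$").replace(",", "")),
    ),
]

def recognize(header: str, values) -> Optional[str]:
    """Return the name of the best-scoring frame for a column, or None if none applies."""
    known = [v.strip() for v in values if v.strip() != "?"]
    best_name, best_score = None, 0.0
    for frame in FRAMES:
        hits = sum(bool(frame.pattern.fullmatch(v)) for v in known)
        score = hits / len(known) if known else 0.0
        if any(keyword in header.lower() for keyword in frame.keywords):
            score += 1.0  # assumed weighting: a header keyword counts as strong evidence
        if score > best_score:
            best_name, best_score = frame.name, score
    return best_name

print(recognize("GDP/PPP Per Capita", ["800", "3,800", "5,600"]))
print(recognize("Inflation", ["?", "3.0", "110.0"]))
```

Combining value patterns with header keywords is what lets a bare value such as "800" be read as a dollar amount under a GDP header rather than as a percentage.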
22 Concept Matching (Constraints)
- Keys in tables (as well as nonkeys)
- Functional relationships
- 1-1, 1-*, *-1 or *-* correspondences
- Subset/superset of value sets
- Unknown and null values
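Functional relationships and correspondence types can be estimated from the data by checking whether each value in one column always appears with the same value in the other. In this sketch the names `is_functional` and `correspondence` are hypothetical, "*-1" is assumed to mean many left values per right value, and rows with unknown (None) values are skipped.

```python
from collections import defaultdict

def is_functional(pairs):
    """True if every left value maps to exactly one right value."""
    images = defaultdict(set)
    for left, right in pairs:
        images[left].add(right)
    return all(len(rights) == 1 for rights in images.values())

def correspondence(pairs):
    """Classify a pair of columns as '1-1', '*-1', '1-*' or '*-*'."""
    # Unknown or null values carry no evidence either way, so drop those rows.
    pairs = [(l, r) for l, r in pairs if l is not None and r is not None]
    forward = is_functional(pairs)
    backward = is_functional([(r, l) for l, r in pairs])
    if forward and backward:
        return "1-1"
    if forward:
        return "*-1"
    if backward:
        return "1-*"
    return "*-*"

country_continent = [
    ("Japan", "Asia"), ("India", "Asia"), ("Belarus", "Europe"),
    ("Japan", "Asia"), ("Russia", None),
]
print(correspondence(country_continent))
```

A dependency that holds in the sample rows is evidence rather than proof; a larger table may still contain a counterexample.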
23 Ontology merging/growing
- Direct merge (no conflicts)
  - Use results of matching phase to find similar concepts in ontologies (e.g., data value similarities, data frames, NLP, etc.)
- Conflict resolution
  - Interactively identify evidence and counter evidence of functional relationships among mini-ontologies using constraint resolution
- IDS: Interaction with human knowledge engineer
  - Issues: identify
  - Default strategy: apply
  - Suggestions: make
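The merge step with IDS statements can be sketched with two small data structures. This is an illustration under stated assumptions only: `MiniOntology`, `IDS` and `merge` are hypothetical names, relationships are reduced to a functional/non-functional flag, and the default strategy chosen here (keep the weaker constraint) is a placeholder rather than the strategy TANGO applies.

```python
from dataclasses import dataclass

@dataclass
class MiniOntology:
    concepts: set
    relationships: dict  # (source, target) -> True if the relationship is functional

@dataclass
class IDS:
    issue: str
    default_strategy: str
    suggestion: str

def merge(base, other, matches):
    """Merge `other` into a copy of `base`; `matches` maps other's concept names to base's."""
    def rename(concept):
        return matches.get(concept, concept)

    merged = MiniOntology(set(base.concepts), dict(base.relationships))
    statements = []
    merged.concepts.update(rename(c) for c in other.concepts)
    for (source, target), functional in other.relationships.items():
        key = (rename(source), rename(target))
        known = merged.relationships.get(key)
        if known is None or known == functional:
            merged.relationships[key] = functional  # direct merge, no conflict
        else:
            merged.relationships[key] = False  # assumed default: keep the weaker constraint
            statements.append(IDS(
                issue=f"{key[0]} -> {key[1]} is functional in only one source",
                default_strategy="treat the relationship as non-functional",
                suggestion="ask the knowledge engineer to confirm or reject the dependency",
            ))
    return merged, statements

base = MiniOntology({"Country", "Population"}, {("Country", "Population"): True})
other = MiniOntology(
    {"Nation", "Continent", "Population"},
    {("Nation", "Continent"): True, ("Nation", "Population"): False},
)
merged, statements = merge(base, other, {"Nation": "Country"})
print(sorted(merged.concepts))
for statement in statements:
    print(statement.issue)
```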
24 Example: Another mini-ontology generation
25 Example: Another mini-ontology generation
[Figure: merged ontology diagram. Labels: Merge, Geopolitical Entity, Country, City, Agglomeration, Continent, Name, Location, Latitude, Longitude, Population, Time. Relationship labels: has, names, has GMT. Annotation: "Latitude and longitude designates location"]
26 Example: Concept Mapping to Ontology Merging
[Figure: merged ontology diagram. Labels: Geopolitical Entity, Geopolitical Entity with population, Country, State, City/town, Agglomeration, Continent, Place, Lake, Mine, Reservoir, Name, Location, Latitude, Longitude, Elevation, Area, USGS Quad, Population, Time, ?. Relationship labels: has, names, has GMT. Annotation: "Latitude and longitude designates location"]
27 Future direction
- Start with multiple tables (or URLs) and generate mini-ontologies
- Identify most suitable mini-ontologies to merge by calculating which tables have most overlap of concepts
- Generate multiple domain ontologies
- Integrate with form-based data extraction tools (smarter Web search engines)
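Choosing which mini-ontologies to merge first by concept overlap can be sketched as a ranking over all pairs. The slides do not name an overlap measure, so Jaccard similarity is an assumption here, as are the function names and the three example concept sets.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of concepts common to both sets, relative to all concepts in either."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_merge_candidates(ontologies):
    """Return (score, name_a, name_b) for every pair, highest overlap first."""
    scored = [
        (jaccard(ontologies[x], ontologies[y]), x, y)
        for x, y in combinations(sorted(ontologies), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)

ontologies = {
    "economy": {"Country", "GDP/PPP", "Inflation"},
    "population": {"Country", "Population", "Birth Rate"},
    "cities": {"Agglomeration", "Population", "Continent", "Country"},
}
ranked = rank_merge_candidates(ontologies)
for score, first, second in ranked:
    print(f"{first} + {second}: {score:.2f}")
```

In practice the concept sets would come from the matching phase, so that differently labeled but equivalent concepts count as shared.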