Title: Cui Tao
1Ontology Generation, Information Harvesting and
Semantic Annotation For Machine-Generated Web
Pages
- Cui Tao
- PhD Dissertation Defense
2Motivation
- Birth date of my great grandpa
- Price and mileage of red Nissans, 1990 or newer
- Protein and amino acids information of gene cdk-4?
- US states with property crime rates above 1
3Search by Search Engine
4Search the Hidden Web
- The Hidden Web
- Hidden behind forms
- Hard to query
5Query for Data
- The Hidden Web
- Hidden behind forms
- Hard to query
"Find the protein and the amino-acids information for gene cdk-4"
6A Web of Pages → A Web of Knowledge
- Web of Knowledge
- Machine-understandable
- Publicly accessible
- Queriable by standard query languages
- Semantic annotation
- Domain ontologies
- Populated conceptual model
- Problems to resolve
- How do we create ontologies?
- How do we annotate pages for ontologies?
7Contributions of Dissertation Work
- Web of Pages → Web of Knowledge
- Knowledge and meta-knowledge extraction
- Reformulation as machine-understandable knowledge
- Automatic and semi-automatic solutions via
- Sibling tables (TISP/TISP++)
- User-created forms (FOCIH)
8Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
- Recognize tables (discard non-tables)
- Locate table labels
- Locate table values
- Find label/value associations
9Recognize Tables
Layout Tables (discard)
Data Table
Nested Data Tables
10Find Label/Value Associations
Example (Identification.Gene model(s).Protein,
Identification.Gene model(s).2) WPCE28918
1 2
11Interpretation Technique: Sibling Page Comparison
12Interpretation Technique: Sibling Page Comparison
Same
13Interpretation Technique: Sibling Page Comparison
Almost Same
14Interpretation Technique: Sibling Page Comparison
Different
Same
15Technique Details
- Unnest tables
- Match tables in sibling pages
- Perfect match (table for layout → discard)
- Reasonable match (sibling table)
- Determine and use table-structure pattern
- Discover pattern
- Pattern usage
- Dynamic pattern adjustment
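The sibling-table matching step can be illustrated with a minimal sketch. It assumes each table has already been unnested and flattened into rows of cell text, and that both tables have the same shape. The function names, the `low` threshold, and the car-ad cell values are hypothetical illustrations, not TISP's actual implementation or data.

```python
from typing import List, Tuple

Table = List[List[str]]  # rows of cell text (hypothetical flattened form)
Pos = Tuple[int, int]

def compare_sibling_tables(a: Table, b: Table) -> Tuple[float, List[Pos], List[Pos]]:
    """Return (match ratio, label positions, value positions)."""
    labels: List[Pos] = []
    values: List[Pos] = []
    for r, (row_a, row_b) in enumerate(zip(a, b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            # Cells identical across sibling pages are label candidates;
            # cells that differ are value candidates.
            (labels if cell_a == cell_b else values).append((r, c))
    total = len(labels) + len(values)
    ratio = len(labels) / total if total else 0.0
    return ratio, labels, values

def classify(ratio: float, low: float = 0.2) -> str:
    """Assumed thresholds: a perfect match suggests a layout table."""
    if ratio == 1.0:
        return "layout table (discard)"
    if ratio >= low:
        return "sibling data table"
    return "no match"

# Made-up cell values for two sibling pages.
page1 = [["Make", "Nissan"], ["Year", "1994"], ["Price", "$3,200"]]
page2 = [["Make", "Honda"], ["Year", "1999"], ["Price", "$5,100"]]
ratio, labels, values = compare_sibling_tables(page1, page2)
print(classify(ratio), labels, values)
```

Here the first column is identical on both pages and is proposed as labels, while the second column varies and is proposed as values.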
16Table Unnesting
17Table Structure Patterns
- Regularity Expectations
- (<tr><(td|th)> L <(td|th)> V)^n
- <tr>(<(td|th)> L)^n (<tr>(<(td|th)> V)^n)
- Pattern combinations are also possible.
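As a rough sketch of how such structure patterns could be checked, suppose a table has been reduced to a token string in which every cell is marked `L` (label) or `V` (value). That encoding and the function names are assumptions for illustration. Allowing more than one value row in the column-wise pattern is also an assumption.

```python
import re

# Row-wise pattern: every row is one label cell followed by one value cell.
ROW_PATTERN = re.compile(r"(?:<tr><t[dh]>L<t[dh]>V)+")

# Column-wise pattern: one row of n labels, then rows of n values.
# Accepting one or more value rows is an assumption of this sketch.
COLUMN_PATTERN = re.compile(r"<tr>((?:<t[dh]>L)+)((?:<tr>(?:<t[dh]>V)+)+)")

def matches_row_pattern(seq: str) -> bool:
    return ROW_PATTERN.fullmatch(seq) is not None

def matches_column_pattern(seq: str) -> bool:
    m = COLUMN_PATTERN.fullmatch(seq)
    if m is None:
        return False
    n = len(re.findall(r"<t[dh]>L", m.group(1)))
    value_rows = m.group(2).split("<tr>")[1:]
    # A regex alone cannot enforce equal counts, so check n per value row.
    return all(len(re.findall(r"<t[dh]>V", row)) == n for row in value_rows)

row_table = "<tr><th>L<td>V<tr><th>L<td>V"
column_table = "<tr><th>L<th>L<tr><td>V<td>V<tr><td>V<td>V"
print(matches_row_pattern(row_table), matches_column_pattern(column_table))
```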
18Table Structure Patterns
<tr>(<(td|th)> L)^n (<tr>(<(td|th)> V)^n)
19Pattern Usage
20Dynamic Pattern Adjustment
21TISP++
- Automatic ontology generation
- Automatic information annotation
22Ontology Generation: OSM
- Object set: table labels
- Lexical: labels that associate with actual values
- Non-lexical: labels that associate with other tables
- Relationship set: table nesting
- Constraints: updates based on observation
23Ontology Generation: OWL
- Object set: OWL class
- Relationship set: OWL object property
- Lexical object set
- OWL data type property
- Different annotation properties to keep track of the provenance
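The mapping from conceptual-model elements to OWL can be sketched as a small generator that emits Turtle text. The function name, the `example.org` base IRI, and the `Gene`/`GeneModel` names are hypothetical. The sketch covers only the class, object-property, and datatype-property mappings listed here and leaves out constraints and provenance annotations.

```python
def to_owl_turtle(object_sets, lexical_sets, relationship_sets,
                  base="http://example.org/onto#"):
    """Emit a minimal Turtle document for the given model elements.

    object_sets: list of class names
    lexical_sets: list of (owner class, lexical object-set name)
    relationship_sets: list of (source class, property name, target class)
    """
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for name in object_sets:
        # Object set -> OWL class
        lines.append(f":{name} a owl:Class .")
    for owner, name in lexical_sets:
        # Lexical object set -> OWL datatype property on its owner
        lines.append(f":{name} a owl:DatatypeProperty ; rdfs:domain :{owner} .")
    for src, name, dst in relationship_sets:
        # Relationship set -> OWL object property
        lines.append(
            f":{name} a owl:ObjectProperty ; rdfs:domain :{src} ; rdfs:range :{dst} ."
        )
    return "\n".join(lines)

turtle = to_owl_turtle(
    object_sets=["Gene", "GeneModel"],
    lexical_sets=[("GeneModel", "protein")],
    relationship_sets=[("Gene", "hasGeneModel", "GeneModel")],
)
print(turtle)
```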
24Generated Ontology
25Generated Ontology
26RDF Graph
27Query the Data
"Find the protein and the amino-acids information for gene cdk-4"
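Once the data is held as RDF triples, a query like this one reduces to triple-pattern matching. The toy sketch below stands in for a standard query language with a plain in-memory matcher. The property names (`name`, `protein`, `aminoAcids`) and the placeholder values are hypothetical and are not taken from the harvested data.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def gene_info(triples: List[Triple], gene_name: str) -> Dict[str, List[str]]:
    """Collect protein and amino-acid values for the named gene."""
    info: Dict[str, List[str]] = {"protein": [], "aminoAcids": []}
    for gene, _, _ in match(triples, p="name", o=gene_name):
        for prop in info:
            info[prop] += [obj for _, _, obj in match(triples, s=gene, p=prop)]
    return info

# Placeholder triples; the values are not real data.
graph = [
    ("gene1", "name", "cdk-4"),
    ("gene1", "protein", "PROTEIN-PLACEHOLDER"),
    ("gene1", "aminoAcids", "AMINO-ACIDS-PLACEHOLDER"),
    ("gene2", "name", "other-gene"),
    ("gene2", "protein", "OTHER-PLACEHOLDER"),
]
print(gene_info(graph, "cdk-4"))
```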
28TISP Evaluation
- Applications
- Commercial: car ads
- Scientific: molecular biology
- Geopolitical: US states and countries
- Data: > 2,000 tables in 35 sites
- Evaluation
- Initial two sibling pages
- Correct separation of data tables from layout tables?
- Correct pattern recognition?
- Remaining tables in site
- Information properly extracted?
- Able to detect and adjust for pattern variations?
29Experimental Results
- Table recognition: correctly discarded 157 of 158 layout tables
- Pattern recognition: correctly found 69 of 72 structure patterns
- Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
30TISP++ Performance
- Performance depends on TISP
- TISP test set
- Generates all ontologies correctly
- Annotates all information in tables correctly
31Form-based Ontology Creation and Information
Harvesting (FOCIH)
- Personalized ontology creation by form
- General familiarity
- Reasonable conceptual framework
- Appropriate correspondence
- Transformable to ontological descriptions
- Capable of accepting source data
- Automated ontology creation
- Automated information harvesting
32Form Creation
33Created Sample Form
34Generated Ontology View
35Source-to-Form Mapping
36Source-to-Form Mapping
37Source-to-Form Mapping
38Source-to-Form Mapping
39Almost Ready to Harvest
- Need reading path: DOM-tree structure
- Need to resolve mapping problems
- Pattern recognition
- Instance recognition
40Reading Path
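One way to picture a reading path is as the chain of DOM elements leading from the document root to the cell holding a value. The sketch below records such a path with Python's standard `html.parser`. It assumes well-formed markup in which every opened tag is closed. The class name, the path notation, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class PathFinder(HTMLParser):
    """Record the element path to the first text node equal to `target`."""

    def __init__(self, target: str):
        super().__init__()
        self.target = target
        self.stack = []      # open elements, e.g. "td[2]"
        self.counts = [{}]   # per-depth counters of child tags seen so far
        self.path = None

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.stack.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Assumes well-formed input: each end tag closes the latest open tag.
        self.stack.pop()
        self.counts.pop()

    def handle_data(self, data):
        if self.path is None and data.strip() == self.target:
            self.path = "/" + "/".join(self.stack)

page = ("<html><body><table><tr><td>Price</td><td>$3,200</td></tr>"
        "</table></body></html>")
finder = PathFinder("$3,200")
finder.feed(page)
print(finder.path)
```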
41Pattern Instance Recognition
42Pattern Instance Recognition
43Pattern Instance Recognition
44Pattern Instance Recognition
list pattern, delimiter is ","
45Pattern Instance Recognition
list pattern, delimiter is regular expression
for percentage numbers and a comma
46Pattern Instance Recognition
list pattern, delimiter is regular expression
for percentage numbers and a comma
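Both list patterns, the plain comma and the percentage number followed by a comma, amount to splitting a source string on a regular-expression delimiter. A minimal sketch, where the function name, the exact regular expression, and the sample strings are assumptions made for illustration:

```python
import re

def split_list(text: str, delimiter: str) -> list:
    """Split a list-valued string on a regex delimiter, dropping empty pieces."""
    return [piece.strip() for piece in re.split(delimiter, text) if piece.strip()]

COMMA = r","
# A percentage number such as 40% or 35.5%, optionally followed by a comma.
PERCENT_AND_COMMA = r"\d+(?:\.\d+)?%\s*,?"

colors = split_list("red, blue, green", COMMA)
crops = split_list("wheat 40%, corn 35.5%, rice 24.5%", PERCENT_AND_COMMA)
print(colors, crops)
```

With the second delimiter the percentages are consumed as part of the separator, so only the list items themselves remain.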
47Can Now Harvest
48Can Now Harvest
49Can Now Harvest
50Semantic Annotation
51Semantic Annotation
52Semantic Annotation
53Semantic Annotation
54Semantic Annotation
55Semantic Query
56FOCIH Performance
- Ontology creation
- Semantic annotation
- Depends on TISP performance
- Depends on pattern and instance recognition
performance
57FOCIH Performance
- Pattern and instance recognition
- Works with highly regular data
- Tested 71 mappings
- 25 full-string values (25/25 correct)
- 38 substring values (29/38 correct)
- 8 list patterns (6/8 correct)
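The per-category counts above can be turned into rates and an overall figure with a few lines of arithmetic. The sketch uses only the numbers reported on this slide. The overall rate is derived here and is not a figure stated in the source.

```python
results = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

def rate(correct: int, total: int) -> float:
    """Fraction of mappings handled correctly."""
    return correct / total

total_correct = sum(correct for correct, _ in results.values())
total_tested = sum(total for _, total in results.values())

for name, (correct, total) in results.items():
    print(f"{name}: {correct}/{total} = {rate(correct, total):.1%}")
print(f"overall: {total_correct}/{total_tested} = {rate(total_correct, total_tested):.1%}")
```

The three categories sum to the 71 tested mappings, of which 60 were recognized correctly.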
58FOCIH Difficulties
59FOCIH Difficulties
60FOCIH Difficulties
No selection
61WoK via TISP
62WoK via TISP
63WoK via FOCIH
64WoK via FOCIH
65Contributions
- TISP: automatic sibling table interpretation
- TISP++
- Automatic ontology generation based on interpreted tables
- Automatic semantic annotation for interpreted tables
- FOCIH
- Semi-automatic personalized ontology creation
- Automatic personalized information harvesting and semantic annotation
- Together, these contribute to turning the current web of pages into a web of knowledge
66Future Work
- Sibling pages in addition to sibling tables
- Reverse engineer from ontologies to forms as a
basis for information harvesting for already
defined ontologies.