Cui Tao - PowerPoint PPT Presentation

About This Presentation
Title:

Cui Tao

Description:

Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages. Cui Tao, PhD Dissertation Defense.

Slides: 67
Provided by: Cui52

Transcript and Presenter's Notes


1
Ontology Generation, Information Harvesting and
Semantic Annotation For Machine-Generated Web
Pages
  • Cui Tao
  • PhD Dissertation Defense

2
Motivation
  • Birth date of my great grandpa
  • Price and mileage of red Nissans, 1990 or newer
  • Protein and amino acids information of gene
    cdk-4?
  • US states with property crime rates above 1

3
Search by Search Engine
4
Search the Hidden Web
  • The Hidden Web
  • Hidden behind forms
  • Hard to query

5
Query for Data
  • The Hidden Web
  • Hidden behind forms
  • Hard to query

"Find the protein and the amino-acids
information for gene cdk-4"
6
A Web of Pages → A Web of Knowledge
  • Web of Knowledge
  • Machine-understandable
  • Publicly accessible
  • Queriable by standard query languages
  • Semantic annotation
  • Domain ontologies
  • Populated conceptual model
  • Problems to resolve
  • How do we create ontologies?
  • How do we annotate pages for ontologies?

7
Contributions of Dissertation Work
  • Web of Pages → Web of Knowledge
  • Knowledge & meta-knowledge extraction
  • Reformulation as machine-understandable
    knowledge
  • Automatic & semi-automatic solutions via
  • Sibling tables (TISP/TISP++)
  • User-created forms (FOCIH)

8
Automatic Annotation with TISP (Table
Interpretation with Sibling Pages)
  • Recognize tables (discard non-tables)
  • Locate table labels
  • Locate table values
  • Find label/value associations

9
Recognize Tables
Layout Tables (discard)
Data Table
Nested Data Tables
10
Find Label/Value Associations
Example: the label path (Identification.Gene model(s).Protein,
Identification.Gene model(s).2) is associated with the value WPCE28918
1 2
11
Interpretation Technique: Sibling Page Comparison
12
Interpretation Technique: Sibling Page Comparison
Same
13
Interpretation Technique: Sibling Page Comparison
Almost Same
14
Interpretation Technique: Sibling Page Comparison
Different
Same
15
Technique Details
  • Unnest tables
  • Match tables in sibling pages
  • Perfect match (table for layout → discard)
  • Reasonable match (sibling table)
  • Determine & use table-structure pattern
  • Discover pattern
  • Pattern usage
  • Dynamic pattern adjustment
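
The matching step can be illustrated with a minimal sketch. It assumes each table has already been unnested and reduced to rows of cell text, and it simplifies "reasonable match" to "same shape, some cells differ"; the slides do not state the actual matching criterion. The function name `compare_siblings` and the car-ad cell values are made up for illustration.

```python
from typing import List, Tuple

Table = List[List[str]]  # a table as rows of cell text


def compare_siblings(a: Table, b: Table) -> Tuple[str, List[List[str]]]:
    """Compare two tables from sibling pages cell by cell.

    Cells that are identical on both pages are marked "L" (candidate label);
    cells that differ are marked "V" (candidate value).
    Assumption: tables of different shape are simply reported as "no match".
    """
    same_shape = len(a) == len(b) and all(
        len(ra) == len(rb) for ra, rb in zip(a, b)
    )
    if not same_shape:
        return "no match", []

    marks = [
        ["L" if ca == cb else "V" for ca, cb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]
    if all(m == "L" for row in marks for m in row):
        # Nothing varies between the pages: treat as a layout table.
        return "perfect match (layout table, discard)", marks
    return "reasonable match (sibling table)", marks


page1 = [["Make", "Nissan"], ["Price", "$4,500"]]
page2 = [["Make", "Toyota"], ["Price", "$6,200"]]
kind, marks = compare_siblings(page1, page2)
print(kind)   # reasonable match (sibling table)
print(marks)  # [['L', 'V'], ['L', 'V']]
```

The design choice here is that content which stays fixed across sibling pages is treated as structure (labels or layout), while content that varies is treated as data.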

16
Table Unnesting
17
Table Structure Patterns
  • Regularity Expectations
  • (<tr><(td|th)> L <(td|th)> V)^n
  • <tr>(<(td|th)> L)^n
  • (<tr>(<(td|th)> V)^n)+

Pattern combinations are also possible.
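
As a rough illustration of checking such regularity expectations, the sketch below encodes a table as one string of L (label) and V (value) marks per row and tests it against two shapes: rows of label/value pairs, and a header row of n labels followed by rows of n values. The encoding, the function name `detect_pattern`, and the returned descriptions are assumptions for this example, not the notation used in the dissertation.

```python
import re
from typing import List, Optional


def detect_pattern(marks: List[List[str]]) -> Optional[str]:
    """Return a description of the structure pattern the marks follow, if any.

    `marks` holds one list per table row; each entry is "L" or "V".
    """
    rows = ["".join(row) for row in marks]
    encoded = "/".join(rows)  # e.g. "LV/LV/LV" or "LLL/VVV/VVV"

    # Each row is one label cell followed by one value cell.
    if re.fullmatch(r"LV(?:/LV)*", encoded):
        return "label/value pair in every row"

    # First row holds n labels; every later row holds n values.
    header = re.fullmatch(r"(L+)(?:/V+)+", encoded)
    if header:
        n = len(header.group(1))
        if all(len(row) == n for row in rows[1:]):
            return "header row of labels, then rows of values"

    return None


print(detect_pattern([["L", "V"], ["L", "V"]]))
print(detect_pattern([["L", "L", "L"], ["V", "V", "V"], ["V", "V", "V"]]))
print(detect_pattern([["L", "L"], ["V", "V", "V"]]))  # ragged rows: None
```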
18
Table Structure Patterns
<tr>(<(td|th)> L)^n (<tr>(<(td|th)> V)^n)+
19
Pattern Usage
20
Dynamic Pattern Adjustment
21
TISP++
  • Automatic ontology generation
  • Automatic information annotation

22
Ontology Generation: OSM
  • Object set: table labels
  • Lexical labels that associate with actual values
  • Non-lexical labels that associate with other
    tables
  • Relationship set: table nesting
  • Constraints: updates based on observation

23
Ontology Generation: OWL
  • Object set: OWL class
  • Relationship set: OWL object property
  • Lexical object set: OWL data type property
  • Different annotation properties to keep track of
    the provenance
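
A minimal sketch of this mapping, assuming the interpreted table is available as a nested dictionary in which a label maps to `None` when it carries plain values (lexical) and to another dictionary when it leads to a nested table (non-lexical). The `ex:` prefix, the `has...` property naming, and the helper names are hypothetical, and the output is a Turtle fragment without prefix declarations or provenance annotations.

```python
import re
from typing import List


def ident(label: str) -> str:
    """Turn a table label into a CamelCase identifier (illustrative only)."""
    return "".join(w.capitalize() for w in re.findall(r"[A-Za-z0-9]+", label))


def to_turtle(name: str, labels: dict, prefix: str = "ex") -> List[str]:
    """Emit one Turtle statement per class or property."""
    cls = f"{prefix}:{ident(name)}"
    lines = [f"{cls} a owl:Class ."]
    for label, nested in labels.items():
        if nested is None:
            # Lexical object set -> OWL datatype property on the class.
            lines.append(
                f"{prefix}:{ident(label)} a owl:DatatypeProperty ; "
                f"rdfs:domain {cls} ."
            )
        else:
            # Table nesting -> OWL object property to a nested class.
            target = f"{prefix}:{ident(label)}"
            lines.append(
                f"{prefix}:has{ident(label)} a owl:ObjectProperty ; "
                f"rdfs:domain {cls} ; rdfs:range {target} ."
            )
            lines.extend(to_turtle(label, nested, prefix))
    return lines


gene = {"Protein": None, "Identification": {"Gene model": None}}
print("\n".join(to_turtle("Gene", gene)))
```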

24
Generated Ontology
25
Generated Ontology
26
RDF Graph
27
Query the Data
"Find the protein and the amino-acids
information for gene cdk-4"
28
TISP Evaluation
  • Applications
  • Commercial: car ads
  • Scientific: molecular biology
  • Geopolitical: US states and countries
  • Data: > 2,000 tables in 35 sites
  • Evaluation
  • Initial two sibling pages
  • Correct separation of data tables from layout
    tables?
  • Correct pattern recognition?
  • Remaining tables in site
  • Information properly extracted?
  • Able to detect and adjust for pattern variations?

29
Experimental Results
  • Table recognition: correctly discarded 157 of 158
    layout tables
  • Pattern recognition: correctly found 69 of 72
    structure patterns
  • Extraction and adjustments: 5 path adjustments
    and 34 label adjustments, all correct

30
TISP++ Performance
  • Performance depends on TISP
  • TISP test set
  • Generates all ontologies correctly
  • Annotates all information in tables correctly

31
Form-based Ontology Creation and Information
Harvesting (FOCIH)
  • Personalized ontology creation by form
  • General familiarity
  • Reasonable conceptual framework
  • Appropriate correspondence
  • Transformable to ontological descriptions
  • Capable of accepting source data
  • Automated ontology creation
  • Automated information harvesting

32
Form Creation
33
Created Sample Form
34
Generated Ontology View
35
Source-to-Form Mapping
36
Source-to-Form Mapping
37
Source-to-Form Mapping
38
Source-to-Form Mapping
39
Almost Ready to Harvest
  • Need reading path: DOM-tree structure
  • Need to resolve mapping problems
  • Pattern recognition
  • Instance recognition
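
One way to picture a reading path is as the chain of DOM nodes leading from the page root to the cell a user selected. The sketch below is an assumption about what such a path could look like, not FOCIH's actual representation: it uses Python's standard `xml.etree.ElementTree`, so it only handles well-formed markup, and the function name `reading_path` and the sample page are made up.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def reading_path(root: ET.Element, text: str) -> Optional[str]:
    """Return an XPath-like path to the first element whose text equals `text`."""

    def walk(node: ET.Element, path: str) -> Optional[str]:
        if (node.text or "").strip() == text:
            return path
        seen: Dict[str, int] = {}  # position of each child among same-tag siblings
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")
            if found:
                return found
        return None

    return walk(root, f"/{root.tag}")


page = ET.fromstring(
    "<html><body><table>"
    "<tr><td>Price</td><td>$4,500</td></tr>"
    "</table></body></html>"
)
print(reading_path(page, "$4,500"))  # /html/body[1]/table[1]/tr[1]/td[2]
```

Because sibling pages share the same generated structure, a path recorded on one page can plausibly be replayed on the others to locate the corresponding value.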

40
Reading Path
41
Pattern & Instance Recognition
42
Pattern & Instance Recognition
43
Pattern & Instance Recognition
44
Pattern & Instance Recognition
List pattern; delimiter is ","
45
Pattern & Instance Recognition
List pattern; delimiter is a regular expression
for percentage numbers and a comma
46
Pattern & Instance Recognition
List pattern; delimiter is a regular expression
for percentage numbers and a comma
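
The two list patterns above can be sketched with Python's `re.split`. The regular expression for "percentage number and a comma" is a guess at what such a delimiter might look like, and the sample strings are invented.

```python
import re
from typing import List


def split_list(text: str, delimiter: str) -> List[str]:
    """Split `text` on a regex delimiter and drop empty or blank items."""
    return [part.strip() for part in re.split(delimiter, text) if part.strip()]


# Plain comma-delimited list.
print(split_list("Alabama, Alaska, Arizona", r","))
# ['Alabama', 'Alaska', 'Arizona']

# Delimiter is a percentage number optionally followed by a comma (assumed form).
PERCENT_AND_COMMA = r"\s*\d+(?:\.\d+)?%\s*,?"
print(split_list("red 50%, blue 30%, green 20%", PERCENT_AND_COMMA))
# ['red', 'blue', 'green']
```

Treating the percentage as part of the delimiter keeps only the list items themselves, which is the point of recognizing the delimiter as a pattern rather than a fixed character.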
47
Can Now Harvest
48
Can Now Harvest
49
Can Now Harvest
50
Semantic Annotation
51
Semantic Annotation
52
Semantic Annotation
53
Semantic Annotation
54
Semantic Annotation
55
Semantic Query
56
FOCIH Performance
  • Ontology creation
  • Semantic annotation
  • Depends on TISP performance
  • Depends on pattern and instance recognition
    performance

57
FOCIH Performance
  • Pattern and instance recognition
  • Works with highly regular data
  • Tested 71 mappings
  • 25 full-string values (25/25 correct)
  • 38 substring values (29/38 correct)
  • 8 list patterns (6/8 correct)
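
The counts above can be checked with a few lines of arithmetic: the three categories sum to the 71 tested mappings, and the per-category and overall accuracy follow directly. The function name `summarize` is only for this example.

```python
from typing import Dict, Tuple

# (correct, tested) per mapping category, as reported on the slide.
RESULTS: Dict[str, Tuple[int, int]] = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}


def summarize(results: Dict[str, Tuple[int, int]]) -> Tuple[int, int, Dict[str, float]]:
    """Return total correct, total tested, and accuracy per category."""
    correct = sum(c for c, _ in results.values())
    tested = sum(n for _, n in results.values())
    rates = {name: c / n for name, (c, n) in results.items()}
    return correct, tested, rates


correct, tested, rates = summarize(RESULTS)
print(tested)                        # 71
print(f"{correct / tested:.1%}")     # 84.5%
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")     # 100.0%, 76.3%, 75.0%
```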

58
FOCIH Difficulties
59
FOCIH Difficulties
60
FOCIH Difficulties
No selection
61
WoK via TISP
62
WoK via TISP
63
WoK via FOCIH
64
WoK via FOCIH
65
Contributions
  • TISP: automatic sibling table interpretation
  • TISP++
  • Automatic ontology generation based on
    interpreted tables
  • Automatic semantic annotation for interpreted
    tables
  • FOCIH
  • Semi-automatic personalized ontology creation
  • Automatic personalized information harvesting and
    semantic annotation
  • Together, these contribute to turning the current
    web of pages into a web of knowledge

66
Future Work
  • Sibling pages in addition to sibling tables
  • Reverse engineer from ontologies to forms as a
    basis for information harvesting for already
    defined ontologies.