David W. Embley - PowerPoint PPT Presentation

About This Presentation
Title:

David W. Embley

Description:

WoK: A Web of Knowledge David W. Embley Brigham Young University Provo, Utah, USA – PowerPoint PPT presentation

Slides: 78
Provided by: Davi1925
Learn more at: https://tango.byu.edu

Transcript and Presenter's Notes

1
WoK: A Web of Knowledge
  • David W. Embley
  • Brigham Young University
  • Provo, Utah, USA

2
A Web of Pages → A Web of Facts
  • Birthdate of my great grandpa Orson
  • Price and mileage of red Nissans, 1990 or newer
  • Location and size of chromosome 17
  • US states with property crime rates above 1

3
Toward a Web of Knowledge
  • Fundamental questions
  • What is knowledge?
  • What are facts?
  • How does one know?
  • Philosophy
  • Ontology
  • Epistemology
  • Logic and reasoning

4
Ontology
  • Existence → asks "What exists?"
  • Concepts, relationships, and constraints with
    formal foundation

5
Epistemology
  • The nature of knowledge → asks "What is
    knowledge?" and "How is knowledge acquired?"
  • Populated conceptual model

6
Logic and Reasoning
  • Principles of valid inference → asks "What is
    known?" and "What can be inferred?"
  • For us, it answers what can be inferred (in a
    formal sense) from conceptualized data.

Find price and mileage of red Nissans, 1990 or
newer
7
Making this Work → How?
  • Distill knowledge from the wealth of digital web
    data
  • Annotate web pages
  • Need a computational alembic to algorithmically
    turn raw symbols contained in web pages into
    knowledge

[Figure labels: Annotation, Fact]
8
Turning Raw Symbols into Knowledge
  • Symbols: 11,500 117K Nissan CD AC
  • Data: price(11,500) mileage(117K)
    make(Nissan)
  • Conceptualized data
  • Car(C123) has Price(11,500)
  • Car(C123) has Mileage(117,000)
  • Car(C123) has Make(Nissan)
  • Car(C123) has Feature(AC)
  • Knowledge
  • Correct facts
  • Provenance
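
The progression on this slide (symbols, data, conceptualized data, knowledge with provenance) can be sketched in a few lines of Python. This is a minimal illustration and not the WoK implementation: the recognizer patterns, the `Fact` tuple, and the source name `hypothetical-ad.html` are assumptions made for the example. Only the symbol string and the identifier C123 come from the slide.

```python
import re
from typing import NamedTuple

class Fact(NamedTuple):
    subject: str    # non-lexical object, e.g. Car(C123)
    relation: str   # relationship-set name, e.g. "has Price"
    value: str      # lexical value
    source: str     # provenance: where the symbol was found

# Hypothetical recognizers: one regular expression per lexical object set.
RECOGNIZERS = {
    "Price": re.compile(r"\b\d{1,3},\d{3}\b"),
    "Mileage": re.compile(r"\b\d+K\b"),
    "Make": re.compile(r"\bNissan\b"),
    "Feature": re.compile(r"\bAC\b"),
}

def normalize(object_set: str, text: str) -> str:
    """Convert a recognized symbol to a canonical value."""
    if object_set == "Mileage":          # 117K -> 117,000
        return f"{int(text[:-1]) * 1000:,}"
    return text

def conceptualize(symbols: str, car_id: str, source: str) -> list:
    """Turn raw symbols into facts about one car, keeping the source."""
    facts = []
    for object_set, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(symbols):
            facts.append(Fact(f"Car({car_id})", f"has {object_set}",
                              normalize(object_set, match.group()), source))
    return facts

facts = conceptualize("11,500 117K Nissan CD AC", "C123", "hypothetical-ad.html")
for fact in facts:
    print(f"{fact.subject} {fact.relation}({fact.value})  [from {fact.source}]")
```

Keeping the source on every fact is what allows a query result to link back to the page it came from. Whether a fact is correct is a separate question that this sketch does not address.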

9
Actualization (with Extraction Ontologies)
Find me the price and mileage of all red Nissans.
I want a 1990 or newer.
10
Data Extraction Demo
11
Semantic Annotation Demo
12
Free-Form Query Demo
13
Explanation: How it Works
  • Extraction Ontologies
  • Semantic Annotation
  • Free-Form Query Interpretation

14
Extraction Ontologies
  • Object sets
  • Relationship sets
  • Participation constraints
  • Lexical
  • Non-lexical
  • Primary object set
  • Aggregation
  • Generalization/Specialization
15
Extraction Ontologies
Data Frame
Internal Representation: float
Values
  External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
  Left Context
  Key Word Phrase
    Key Words: ([Pp]rice)|([Cc]ost)
Operators
  Operator: >
  Key Words: (more\sthan)|(more\scostly)
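
A data frame of this kind can be approximated in Python as a small class that bundles a value pattern, context keywords, and operator keywords. This is a sketch under assumptions rather than the format used by the actual system: the class and method names are hypothetical, and the value pattern is a simplified dollar-amount expression written for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Operator:
    symbol: str            # e.g. ">"
    keywords: re.Pattern   # phrases that signal the operator

@dataclass
class DataFrame:
    name: str
    internal_type: type        # internal representation, e.g. float
    external_rep: re.Pattern   # how values appear in text
    keywords: re.Pattern       # context words that confirm the concept
    operators: list = field(default_factory=list)

    def extract(self, text: str) -> list:
        """Return typed values, but only if a context keyword is present."""
        if not self.keywords.search(text):
            return []
        return [self.internal_type(m.group(1).replace(",", ""))
                for m in self.external_rep.finditer(text)]

    def find_operator(self, text: str) -> Optional[str]:
        """Return the symbol of the first operator whose keywords appear."""
        for op in self.operators:
            if op.keywords.search(text):
                return op.symbol
        return None

price = DataFrame(
    name="Price",
    internal_type=float,
    # Assumed pattern: a dollar sign, then digits with optional commas and cents.
    external_rep=re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)"),
    keywords=re.compile(r"[Pp]rice|[Cc]ost"),
    operators=[Operator(">", re.compile(r"more\sthan|more\scostly"))],
)

print(price.extract("Price: $11,500 or best offer"))
print(price.find_operator("cars that cost more than $5,000"))
```

Requiring a context keyword before accepting a match is one simple way to keep a generic number pattern from firing on unrelated text such as phone numbers.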
16
Generality & Resiliency of Extraction Ontologies
  • Generality: assumptions about web pages
  • Data rich
  • Narrow domain
  • Document types
  • Single-record documents (hard, but doable)
  • Multiple-record documents (harder)
  • Records with scattered components (even harder)
  • Resiliency: declarative
  • Still works when web pages change
  • Works for new, unseen pages in the same domain
  • Scalable, but takes work to declare the
    extraction ontology

17
Semantic Annotation
18
Free-Form Query Interpretation
  • Parse Free-Form Query
  • (with respect to data extraction ontology)
  • Select Ontology
  • Formulate Query Expression
  • Run Query Over Semantically Annotated Data

19
Parse Free-Form Query
Find me the [price] and [mileage] of all [red]
[Nissan]s. I want a [1996] [or newer].
[or newer]: > Operator
20
Select Ontology
Find me the price and mileage of all red Nissans
I want a 1996 or newer
21
Formulate Query Expression
  • Conjunctive queries and aggregate queries
  • Projection on mentioned object sets
  • Selection via values and operator keywords
  • Color = red
  • Make = Nissan
  • Year > 1996

> Operator
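
The interpretation steps (projection on mentioned object sets, selection via recognized values and operator keywords) can be sketched as follows. This is a toy illustration under assumptions: the recognizer tables and the `interpret` function are hypothetical, and mapping "or newer" to `>` simply follows the operator shown on the slide.

```python
import re

# Hypothetical keyword and value recognizers for a car-ad ontology.
OBJECT_SET_KEYWORDS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
VALUE_RECOGNIZERS = {
    "Color": r"\b(red|blue|green|black|white)\b",
    "Make": r"\b(Nissan|Toyota|Honda|Ford)s?\b",
    "Year": r"\b((?:19|20)\d{2})\b",
}
# Operator keywords; ">" for "or newer" follows the slide's transcription.
OPERATOR_KEYWORDS = {">": r"or newer\b"}

def interpret(query: str) -> dict:
    """Build a conjunctive query: what to project and which conditions to apply."""
    # Projection: object sets mentioned by keyword rather than by value.
    projection = [name for name, kw in OBJECT_SET_KEYWORDS.items()
                  if re.search(kw, query, re.IGNORECASE)]
    # Selection: recognized values, with "=" unless an operator keyword follows.
    selection = []
    for name, pattern in VALUE_RECOGNIZERS.items():
        match = re.search(pattern, query, re.IGNORECASE)
        if match is None:
            continue
        operator = "="
        tail = query[match.end():]
        for symbol, kw in OPERATOR_KEYWORDS.items():
            if re.match(r"\s+" + kw, tail, re.IGNORECASE):
                operator = symbol
        selection.append((name, operator, match.group(1)))
    return {"project": projection, "select": selection}

result = interpret("Find me the price and mileage of all red Nissans. "
                   "I want a 1996 or newer")
print(result)
```

The words of the query that match nothing in the ontology ("Find me the", "I want a") are simply ignored, which is what lets the query stay free-form.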
22
Formulate Query Expression
For
Let
Where
Return
23
Run Query Over Semantically Annotated Data
24
Great! But Problems Still Need Resolution
  • How do we create extraction ontologies?
  • Manual creation requires several dozen
    person-hours
  • Semi-automatic creation
  • TISP (Table Interpretation by Sibling Pages)
  • TANGO (Table ANalysis for Generating Ontologies)
  • Nested Schemas with Regular Expressions
  • Synergistic Bootstrapping
  • Form-based Information Harvesting
  • How do we scale up?
  • Practicalities of technology transfer and usage
  • Millions of queries over zillions of facts for
    thousands of ontologies

25
Manual Creation
26
Manual Creation
27
Manual Creation
  • Library of instance recognizers
  • Library of lexicons

28
Automatic Annotation with TISP (Table
Interpretation by Sibling Pages)
  • Recognize tables (discard non-tables)
  • Locate table labels
  • Locate table values
  • Find label/value associations

29
Recognize Tables
Layout Tables (discard)
Data Table
Nested Data Tables
30
Locate Table Labels
Examples:
  • Identification.Gene model(s).Protein
  • Identification.Gene model(s).2
31
Locate Table Labels
Examples:
  • Identification.Gene model(s).Gene Model
  • Identification.Gene model(s).2
32
Locate Table Values
Value
33
Find Label/Value Associations
Example: (Identification.Gene model(s).Protein,
Identification.Gene model(s).2) WPCE28918
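
One way to read this example is that each cell value is keyed by a pair of label paths: the path to its column label and the path to its row label. The sketch below builds such pairs for a small table. It is an illustration under assumptions: the `associate` function is hypothetical, and every cell except WPCE28918 is a placeholder rather than data from the slide.

```python
def associate(section_path, header, rows):
    """Map (column-label path, row-label path) pairs to cell values."""
    prefix = ".".join(section_path)
    pairs = {}
    for row_number, row in enumerate(rows, start=1):   # rows are labeled 1, 2, ...
        for column_label, value in zip(header, row):
            key = (f"{prefix}.{column_label}", f"{prefix}.{row_number}")
            pairs[key] = value
    return pairs

# Placeholder cells are marked as such; only WPCE28918 appears in the slide.
pairs = associate(
    section_path=["Identification", "Gene model(s)"],
    header=["Gene Model", "Protein"],
    rows=[["(placeholder)", "(placeholder)"],
          ["(placeholder)", "WPCE28918"]],
)
print(pairs[("Identification.Gene model(s).Protein",
             "Identification.Gene model(s).2")])
```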
34
Interpretation Technique: Sibling Page Comparison
35
Interpretation Technique: Sibling Page Comparison
Same
36
Interpretation Technique: Sibling Page Comparison
Almost Same
37
Interpretation Technique: Sibling Page Comparison
Different
Same
38
Technique Details
  • Unnest tables
  • Match tables in sibling pages
  • Perfect match (table for layout → discard)
  • Reasonable match (sibling table)
  • Determine and use table-structure pattern
  • Discover pattern
  • Pattern usage
  • Dynamic pattern adjustment
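
The core idea of sibling page comparison can be sketched with tables modeled as lists of rows. Cells that are identical on two sibling pages are treated as labels, and cells that differ are treated as values. This is a deliberately simplified illustration, not the TISP algorithm: the cell-by-cell match ratio, the 0.4 threshold, and the sample tables are all assumptions.

```python
def match_ratio(table_a, table_b):
    """Fraction of positions holding identical cells (0.0 if shapes differ)."""
    cells_a = [cell for row in table_a for cell in row]
    cells_b = [cell for row in table_b for cell in row]
    if not cells_a or len(cells_a) != len(cells_b):
        return 0.0
    same = sum(a == b for a, b in zip(cells_a, cells_b))
    return same / len(cells_a)

def classify(table_a, table_b, threshold=0.4):
    """Perfect match: layout table (discard). Reasonable match: sibling table."""
    ratio = match_ratio(table_a, table_b)
    if ratio == 1.0:
        return "layout"
    if ratio >= threshold:
        return "sibling"
    return "unrelated"

def split_labels_values(table_a, table_b):
    """Identical cells are labels; differing cells are values (taken from table_a)."""
    labels, values = [], []
    for r, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            (labels if a == b else values).append((r, c, a))
    return labels, values

# Hypothetical tables taken from two pages of the same site.
page_1 = [["Make", "Nissan"], ["Price", "11,500"]]
page_2 = [["Make", "Toyota"], ["Price", "9,000"]]
menu = [["Home", "Contact"]]

print(classify(menu, menu))       # identical on both pages
print(classify(page_1, page_2))   # same labels, different values
print(split_labels_values(page_1, page_2))
```

A value that happens to be identical on both pages would be misread as a label here. Comparing more than two sibling pages, and adjusting the pattern as further pages are seen, reduces that risk.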

39
Generated RDF
40
WoK Demo (via TISP)
41
Semi-Automatic Annotation with TANGO (Table
Analysis for Generating Ontologies)
  • Recognize and normalize table information
  • Construct mini-ontologies from tables
  • Discover inter-ontology mappings
  • Merge mini-ontologies into a growing ontology

42
Recognize Table Information

                                 Religion
             Population          Albanian            Roman     Shia    Sunni
Country      (July 2001 est.)    Orthodox   Muslim   Catholic  Muslim  Muslim  other
Afghanistan  26,813,057                                        15      84      1
Albania      3,510,484           20         70       10
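
Normalizing a table like this one means turning each non-empty cell into a statement about the row's key object. The sketch below does that and then lists the relationship sets a mini-ontology might contain. It is a minimal illustration, not TANGO itself: the function names are hypothetical, and the assignment of numbers to religion columns is an assumption based on reading the flattened transcript.

```python
# Column labels as paths; nested headers become multi-element tuples.
HEADER = [
    ("Country",),
    ("Population (July 2001 est.)",),
    ("Religion", "Albanian Orthodox"),
    ("Religion", "Muslim"),
    ("Religion", "Roman Catholic"),
    ("Religion", "Shia Muslim"),
    ("Religion", "Sunni Muslim"),
    ("Religion", "other"),
]
# Assumed column assignment; None marks an empty cell.
ROWS = [
    ["Afghanistan", "26,813,057", None, None, None, "15", "84", "1"],
    ["Albania", "3,510,484", "20", "70", "10", None, None, None],
]

def normalize(header, rows, key=("Country",)):
    """Return (key value, label path, cell) triples for every non-empty cell."""
    k = header.index(key)
    return [(row[k], label, cell)
            for row in rows
            for label, cell in zip(header, row)
            if label != key and cell is not None]

def relationship_sets(triples, key="Country"):
    """Top-level labels become object sets related to the key object set."""
    return sorted({key + " has " + label[0] for _, label, _ in triples})

triples = normalize(HEADER, ROWS)
for country, label, cell in triples:
    print(country, ".".join(label), cell)
print(relationship_sets(triples))
```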
43
Construct Mini-Ontology
44
Discover Mappings
45
Merge
46
Semi-Automatic Annotation via Synergistic
Bootstrapping (Based on Nested Schemas with
Regular Expressions)
  • Build a page-layout, pattern-based annotator
  • Automate layout recognition based on examples
  • Auto-generate examples with extraction ontologies
  • Synergistically run pattern-based annotator
    and extraction-ontology annotator

47
(No Transcript)
48
Synergistic Execution
Extraction Ontology
Partially Annotated Document
Conceptual Annotator (ontology-based annotation)
Pattern Generation
Document
Layout Patterns
Annotated Document
Structural Annotator (layout-driven annotation)
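
The loop in this diagram can be sketched in miniature: an ontology-based pass annotates what it recognizes, the surrounding layout is generalized into a pattern, and a layout-driven pass then picks up items the first pass missed. This is a toy illustration under assumptions, not the system's method: the price recognizer, the fixed-width context window, and the sample markup are all hypothetical.

```python
import re
from collections import Counter

# Conceptual recognizer (assumption): dollar amounts only.
PRICE = re.compile(r"\$\d{1,3}(?:,\d{3})*")

def conceptual_annotate(document):
    """Ontology-based pass: spans the price recognizer is sure about."""
    return [m.span() for m in PRICE.finditer(document)]

def generate_pattern(document, spans, width=3):
    """Generalize the most common left/right context into a layout pattern."""
    if not spans:
        return None
    contexts = Counter((document[max(0, start - width):start],
                        document[end:end + width]) for start, end in spans)
    (left, right), _ = contexts.most_common(1)[0]
    return re.compile(re.escape(left) + r"(.*?)" + re.escape(right))

def structural_annotate(document, pattern):
    """Layout-driven pass: everything that sits in the learned position."""
    return [m.group(1) for m in pattern.finditer(document)]

# Hypothetical page fragment: the third price lacks a dollar sign.
doc = ("<b>$11,500</b> <i>1996</i> "
       "<b>$9,000</b> <i>1998</i> "
       "<b>7500 USD</b> <i>1994</i>")

spans = conceptual_annotate(doc)
pattern = generate_pattern(doc, spans)
print([doc[s:e] for s, e in spans])
print(structural_annotate(doc, pattern))
```

The conceptual pass finds only the two dollar amounts, while the structural pass also returns the third price because it occupies the same layout position.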
49
Form-Based Information Harvesting
  • Forms
  • General familiarity
  • Reasonable conceptual framework
  • Appropriate correspondence
  • Transformable to ontological descriptions
  • Capable of accepting source data
  • Instance recognizers
  • Some pre-existing instance recognizers
  • Lexicons
  • Automated extraction ontology creation?

50
Form Creation
  • Basic form-construction facilities
  • single-entry field
  • multiple-entry field
  • nested form

51
Created Sample Form
52
Generated Ontology View
53
Source-to-Form Mapping
54
Source-to-Form Mapping
55
Source-to-Form Mapping
56
Source-to-Form Mapping
57
Almost Ready to Harvest
  • Need reading path: DOM-tree structure
  • Need to resolve mapping problems
  • Split/Merge
  • Union/Selection

58
Almost Ready to Harvest
  • Need reading path: DOM-tree structure
  • Need to resolve mapping problems
  • Split/Merge
  • Union/Selection

Name
  • Voltage-dependent anion-selective channel protein 3
  • VDAC-3
  • hVDAC3
  • Outer mitochondrial membrane Protein porin 3
60
Almost Ready to Harvest
  • Need reading path: DOM-tree structure
  • Need to resolve mapping problems
  • Split/Merge
  • Union/Selection

Name
  • T-complex protein 1 subunit theta
  • TCP-1-theta
  • CCT-theta
  • Renal carcinoma antigen NY-REN-15
62
Can Now Harvest
Name
63
Can Now Harvest
Name
  • 14-3-3 protein epsilon
  • Mitochondrial import stimulation factor L subunit
  • Protein kinase C inhibitor protein-1
  • KCIP-1
  • 14-3-3E
64
Can Now Harvest
Name
  • Voltage-dependent anion-selective channel protein 3
  • VDAC-3
  • hVDAC3
  • Outer mitochondrial membrane Protein porin 3
65
Can Now Harvest
Name
  • Tryptophanyl-tRNA synthetase, mitochondrial precursor
  • EC 6.1.1.2
  • Tryptophan--tRNA ligase
  • TrpRS
  • (Mt)TrpRS
66
Harvesting Populates Ontology
67
Harvesting Populates Ontology
Also helps adjust ontology constraints
68
Can Harvest from Additional Sites
Name
  • T-complex protein 1 subunit theta
  • TCP-1-theta
  • CCT-theta
  • Renal carcinoma antigen NY-REN-15
69
Automating Extraction Ontology Creation
Name
  • 14-3-3 protein epsilon
  • Mitochondrial import stimulation factor L subunit
  • Protein kinase C inhibitor protein-1
  • KCIP-1
  • 14-3-3E
Name
  • T-complex protein 1 subunit theta
  • TCP-1-theta
  • CCT-theta
  • Renal carcinoma antigen NY-REN-15
Name
  • Tryptophanyl-tRNA synthetase, mitochondrial precursor
  • EC 6.1.1.2
  • Tryptophan--tRNA ligase
  • TrpRS
  • (Mt)TrpRS
Lexicons
  • 14-3-3 protein epsilon
  • Mitochondrial import stimulation factor L subunit
  • Protein kinase C inhibitor protein-1
  • KCIP-1
  • 14-3-3E
  • T-complex protein 1 subunit theta
  • TCP-1-theta
  • CCT-theta
  • Renal carcinoma antigen NY-REN-15
  • Tryptophanyl-tRNA synthetase, mitochondrial precursor
  • EC 6.1.1.2
  • Tryptophan--tRNA ligase
  • TrpRS
  • (Mt)TrpRS
70
Automating Extraction Ontology Creation
Instance Recognizers
Number Patterns
Context Keywords and Phrases
71
Automatic Source-to-Form Mapping
72
Automatic Semantic Annotation
Recognize and annotate with respect to an ontology
73
Ontology Transformations
Transformations to and from all
74
Practicalities: WoK Query Interfaces
(Future Work)
  • Advanced free-form queries with disjunction and
    negation
  • Form-based query language
  • Table-based query languages
  • Graphical query languages

75
Practicalities: Bootstrapping the WoK
(Future Work)
  • Won't just happen without sufficient content
  • Niche applications
  • Historical Data (e.g. Genealogy)
  • Topical Blogs
  • Local WoKs
  • Intra-organizational effort
  • Individual interests

76
Practicalities: Scalability
(Future Work)
  • Potential: rapid growth
  • Thousands of ontologies
  • Millions of simultaneous queries
  • Billions of annotated pages
  • Trillions of facts
  • Search-engine-like caching and query processing

77
Key to Success: Simplicity via Automation
  • Automatic (or near automatic) creation of
    extraction ontologies
  • Automatic (or near automatic) annotation of web
    pages
  • Simple but accurate query specification without
    specialized training

www.deg.byu.edu