Title: David W. Embley
1WoK: A Web of Knowledge
- David W. Embley
- Brigham Young University
- Provo, Utah, USA
2A Web of Pages → A Web of Facts
- Birthdate of my great grandpa Orson
- Price and mileage of red Nissans, 1990 or newer
- Location and size of chromosome 17
- US states with property crime rates above 1%
3Toward a Web of Knowledge
- Fundamental questions
- What is knowledge?
- What are facts?
- How does one know?
- Philosophy
- Ontology
- Epistemology
- Logic and reasoning
4Ontology
- Existence: asks "What exists?"
- Concepts, relationships, and constraints with
formal foundation
5Epistemology
- The nature of knowledge: asks "What is
knowledge?" and "How is knowledge acquired?"
- Populated conceptual model
6Logic and Reasoning
- Principles of valid inference: asks "What is
known?" and "What can be inferred?"
- For us, it answers what can be inferred (in a
formal sense) from conceptualized data.
Find price and mileage of red Nissans, 1990 or
newer
7Making this Work: How?
- Distill knowledge from the wealth of digital web
data
- Annotate web pages
- Need a computational alembic to algorithmically
turn raw symbols contained in web pages into
knowledge
[Figure labels: Annotation, Fact]
8Turning Raw Symbols into Knowledge
- Symbols: 11,500 117K Nissan CD AC
- Data: price(11,500) mileage(117K) make(Nissan)
- Conceptualized data
- Car(C123) has Price(11,500)
- Car(C123) has Mileage(117,000)
- Car(C123) has Make(Nissan)
- Car(C123) has Feature(AC)
- Knowledge
- Correct facts
- Provenance
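The progression on this slide (symbols, data, conceptualized data, knowledge with provenance) can be sketched in a few lines. This is only an illustrative sketch, not the deck's implementation: the regular expressions in `RECOGNIZERS`, the `Fact` record, and the source name `ad-1.html` are all assumptions made up for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical recognizers, one regular expression per object set.
RECOGNIZERS = {
    "Price": r"\b\d{1,3},\d{3}\b",
    "Mileage": r"\b\d{1,3}K\b",
    "Make": r"\b(?:Nissan|Toyota|Ford)\b",
    "Feature": r"\b(?:AC|CD)\b",
}

@dataclass(frozen=True)
class Fact:
    subject: str     # non-lexical object, e.g. Car(C123)
    object_set: str  # e.g. Price
    value: str       # normalized value
    source: str      # provenance: which document the symbol came from
    offset: int      # provenance: where in the document it was found

def normalize(object_set, symbol):
    """Turn a raw symbol into a canonical value (117K becomes 117,000)."""
    if object_set == "Mileage":
        return f"{int(symbol[:-1]) * 1000:,}"
    return symbol

def conceptualize(text, car_id, source):
    """Attach every recognized symbol to the car as a fact with provenance."""
    facts = []
    for object_set, pattern in RECOGNIZERS.items():
        for m in re.finditer(pattern, text):
            value = normalize(object_set, m.group(0))
            facts.append(Fact(f"Car({car_id})", object_set, value, source, m.start()))
    return facts

facts = conceptualize("11,500 117K Nissan CD AC", "C123", "ad-1.html")
for f in facts:
    print(f"{f.subject} has {f.object_set}({f.value})  [from {f.source} @ {f.offset}]")
```

Keeping the source and offset on every fact is what lets a user click back from an answer to the annotated page it came from.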
9Actualization (with Extraction Ontologies)
Find me the price and mileage of all red Nissans.
I want a 1990 or newer.
10Data Extraction Demo
11Semantic Annotation Demo
12Free-Form Query Demo
13Explanation: How it Works
- Extraction Ontologies
- Semantic Annotation
- Free-Form Query Interpretation
14Extraction Ontologies
- Object sets
- Relationship sets
- Participation constraints
- Lexical
- Non-lexical
- Primary object set
- Aggregation
- Generalization/Specialization
15Extraction Ontologies
Data Frame
Internal Representation: float
Values
External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
Left Context
Key Word Phrase
Key Words: ([Pp]rice)|([Cc]ost)
Operators
Operator: >
Key Words: (more\sthan)|(more\scostly)
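A data frame of this kind can be sketched as a small record that bundles value recognizers, keywords, and operator keywords. The class name `DataFrameDecl`, its fields, and the price regular expression below are assumptions for illustration; the regular expression is a simplified stand-in, not the slide's exact external representation.

```python
import re
from dataclasses import dataclass, field

@dataclass
class DataFrameDecl:
    """Hypothetical declaration of a data frame for one object set."""
    name: str
    internal_type: type   # internal representation, e.g. float
    external_rep: str     # regex whose group 1 captures the value text
    keywords: str         # regex for words that signal this object set
    operators: dict = field(default_factory=dict)  # symbol -> keyword regex

    def values(self, text):
        """Extract values and convert them to the internal representation."""
        found = []
        for m in re.finditer(self.external_rep, text):
            found.append(self.internal_type(m.group(1).replace(",", "")))
        return found

    def mentions(self, text):
        """True if a keyword for this object set appears in the text."""
        return re.search(self.keywords, text) is not None

    def operator_in(self, text):
        """Return the symbol of the first operator whose keywords appear."""
        for symbol, pattern in self.operators.items():
            if re.search(pattern, text):
                return symbol
        return None

price = DataFrameDecl(
    name="Price",
    internal_type=float,
    external_rep=r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)",  # assumed pattern
    keywords=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\sthan|more\scostly"},
)

print(price.values("Asking price $11,500 obo"))        # [11500.0]
print(price.mentions("What does it cost?"))            # True
print(price.operator_in("anything more than $8,000"))  # >
```

Because the recognizers are declared rather than tied to page layout, the same declaration applies to any page in the domain.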
16Generality & Resiliency of Extraction Ontologies
- Generality: assumptions about web pages
- Data rich
- Narrow domain
- Document types
- Single-record documents (hard, but doable)
- Multiple-record documents (harder)
- Records with scattered components (even harder)
- Resiliency: declarative
- Still works when web pages change
- Works for new, unseen pages in the same domain
- Scalable, but takes work to declare the
extraction ontology
17Semantic Annotation
18Free-Form Query Interpretation
- Parse Free-Form Query
- (with respect to data extraction ontology)
- Select Ontology
- Formulate Query Expression
- Run Query Over Semantically Annotated Data
19Parse Free-Form Query
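Find me the price and mileage of all red Nissans
I want a 1996 or newer
- Recognized object sets: price, mileage
- Recognized values: red, Nissan, 1996
- Recognized operator keyword: or newer (> Operator)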
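Parsing a free-form query against an extraction ontology amounts to running the ontology's recognizers over the query text and keeping what matches. The sketch below assumes a tiny hand-written ontology (`CAR_ONTOLOGY`, with made-up patterns) and is not the system's actual parser.

```python
import re

# Hypothetical mini extraction ontology for car ads.
CAR_ONTOLOGY = {
    "object_sets": {
        "Price": r"\bprice\b",
        "Mileage": r"\bmileage\b",
    },
    "values": {
        "Color": r"\b(?:red|blue|white|black)\b",
        "Make": r"\b(?:Nissan|Toyota|Ford)",
        "Year": r"\b(?:19\d{2}|20\d{2})\b",
    },
    "operators": {
        ">": r"\bor newer\b",
    },
}

def parse_query(query, ontology):
    """Return (offset, kind, name, matched text) for every recognized span."""
    marks = []
    for kind in ("object_sets", "values", "operators"):
        for name, pattern in ontology[kind].items():
            for m in re.finditer(pattern, query, flags=re.IGNORECASE):
                marks.append((m.start(), kind, name, m.group(0)))
    return sorted(marks)

query = "Find me the price and mileage of all red Nissans I want a 1996 or newer"
marks = parse_query(query, CAR_ONTOLOGY)
for offset, kind, name, text in marks:
    print(f"{offset:3d}  {kind:12s} {name:8s} {text!r}")
```

Words the ontology does not recognize ("Find me the", "I want a") are simply ignored, which is why no grammar for the query language is needed.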
20Select Ontology
Find me the price and mileage of all red Nissans
I want a 1996 or newer
21Formulate Query Expression
- Conjunctive queries and aggregate queries
- Mentioned object sets are all of interest.
- Values and operator keywords determine
conditions.
- Color = red
- Make = Nissan
- Year > 1996 (> Operator)
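A conjunctive query of this shape can be sketched as a list of (object set, operator, value) conditions that every answer must satisfy, plus the list of requested object sets. The sample records in `cars` are invented for illustration, and a plain Python filter stands in for whatever query engine runs over the annotated data.

```python
import operator

# Comparison operators the conditions may use.
OPS = {"=": operator.eq, ">": operator.gt}

# Conditions from the running example: red Nissans with year after 1996.
conditions = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">", 1996)]
# Mentioned object sets that the user wants returned.
requested = ["Price", "Mileage"]

# Hypothetical semantically annotated records (made-up values).
cars = [
    {"Color": "red", "Make": "Nissan", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Color": "red", "Make": "Nissan", "Year": 1994, "Price": 4000, "Mileage": 160000},
    {"Color": "blue", "Make": "Nissan", "Year": 2001, "Price": 9000, "Mileage": 80000},
]

def run_query(records, conditions, requested):
    """Keep records that satisfy every condition; project the requested sets."""
    results = []
    for rec in records:
        if all(name in rec and OPS[op](rec[name], value)
               for name, op, value in conditions):
            results.append({name: rec.get(name) for name in requested})
    return results

answers = run_query(cars, conditions, requested)
print(answers)  # [{'Price': 11500, 'Mileage': 117000}]
```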
22Formulate Query Expression
For
Let
Where
Return
23Run Query Over Semantically Annotated Data
24Great! But Problems Still Need Resolution
- Automating content annotation
- Extraction-ontology creation: a few dozen
person-hours
- Semi-automatic creation
- FOCIH (Form-based Ontology Creation and
Information Harvesting)
- TISP (Table Interpretation by Sibling Pages)
- TANGO (Table ANalysis for Generating Ontologies)
- Stepping up to the envisioned Web of Knowledge
- Current and future work
- More challenging annotation projects
- Semi-automatic annotation via synergistic
bootstrapping
- Knowledge bundles for research studies
- Practicalities
25Manual Creation
26Manual Creation
27Manual Creation
- Library of instance recognizers
- Library of lexicons
28Craig's List Alerter
- Constructed as a short class project
- Nine applications
- A few dozen hours
- Demo
29Alert! Alert! I found your Jeep Liberty for under
8,000.
7,995
Toll free 1-800-423-0334
30FOCIH: Form-based Ontology Creation and
Information Harvesting
- Forms (general familiarity)
- Information Harvesting
- Semi-automatic extraction ontology creation
- Form-based generation of conceptual model
- Instance-recognizer creation
- Lexicons
- Some pre-existing instance recognizers
31FOCIH Form Creation
32FOCIH Ontology Generation
33FOCIH Information Harvesting
34FOCIH Information-Harvesting Demo
35TISP: Table Interpretation with Sibling Pages
36Interpretation Technique: Sibling Page Comparison
Same
37Interpretation Technique: Sibling Page Comparison
Almost Same
38Interpretation Technique: Sibling Page Comparison
Different
Same
39Technique Details
- Unnest tables
- Match tables in sibling pages
- Perfect match (table for layout → discard)
- Reasonable match (sibling table)
- Determine and use table-structure pattern
- Discover pattern
- Pattern usage
- Dynamic pattern adjustment
40Table Unnesting
41Simple Tree Matching Algorithm
[Yang91]
Match Score Categorization: Exact/Near-Exact,
Sibling-Table, False
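For reference, the simple tree matching algorithm is a dynamic program that counts the largest number of node pairs that can be matched top-down between two ordered trees. The sketch below follows the textbook formulation; the score normalization (matched nodes over average tree size) and the thresholds in `categorize` are placeholder assumptions, not values taken from TISP.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)

def simple_tree_matching(a, b):
    """Number of node pairs in a maximum top-down matching of trees a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best matching of the first i children of a
    # against the first j children of b, preserving sibling order.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots

def size(t):
    return 1 + sum(size(c) for c in t.children)

def match_score(a, b):
    """Assumed normalization: matched pairs over the average tree size."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)

def categorize(score, exact=0.95, sibling=0.5):
    """Placeholder thresholds; real cut-offs would be tuned on sibling pages."""
    if score >= exact:
        return "exact/near-exact"
    if score >= sibling:
        return "sibling-table"
    return "false"

def row(n_cells):
    return Node("tr", [Node("td") for _ in range(n_cells)])

t1 = Node("table", [row(2), row(2)])
t2 = Node("table", [row(2), row(3)])
print(simple_tree_matching(t1, t2), round(match_score(t1, t2), 3))
```

Under the reading on the "Technique Details" slide, a perfect match between sibling pages signals a layout table to discard, while a reasonable match signals a data table whose varying cells hold values.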
42Table Structure Patterns
- Regularity Expectations
- (<tr><(td|th)> L <(td|th)> V)^n
- <tr>(<(td|th)> L)^n
- (<tr>(<(td|th)> V)^n)
Pattern combinations are also possible.
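Once a pattern has been chosen, interpreting a table is mechanical. The sketch below covers two cases: rows that each hold one label/value pair, and a row of labels followed by rows of values. It assumes the table is already unnested and given as lists of cell strings, and that the choice of which cells are labels (L) and which are values (V) has been made elsewhere; the function names are made up for the example.

```python
def interpret_label_value_rows(rows):
    """Each row is one label cell followed by one value cell."""
    if not rows or any(len(r) != 2 for r in rows):
        return None  # pattern does not apply
    return {label: value for label, value in rows}

def interpret_header_then_records(rows):
    """First row holds n labels; every later row holds n values."""
    if len(rows) < 2:
        return None
    labels, *value_rows = rows
    if any(len(r) != len(labels) for r in value_rows):
        return None  # pattern does not apply
    return [dict(zip(labels, r)) for r in value_rows]

vertical = [
    ["Genetic Position", "X12.69 +/- 0.000 cM"],
    ["Genomic Position", "X13518823..13515773 bp"],
]
horizontal = [
    ["Make", "Year", "Price"],
    ["Nissan", "1998", "11,500"],
    ["Jeep", "2004", "7,995"],
]

print(interpret_label_value_rows(vertical))
print(interpret_header_then_records(horizontal))
```

Returning `None` when the shape does not fit gives a caller room to try another pattern or adjust the current one.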
43Pattern Usage
(Location.Genetic Position) X12.69 +/- 0.000 cM mapping data
(Location.Genomic Position) X13518823..13515773 bp
44Dynamic Pattern Adjustment
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)
<tr>(<(td|th)> L)^6 (<tr>(<(td|th)> V)^6)
45TISP Demo
46TISP/FOCIH Extraction Ontology Creation
- Reverse engineer with TISP
- Adjust with FOCIH
- Data frames
- Initialize lexicons with harvested data
- Library of data frames: select and specialize
47TISP/FOCIH Extraction Ontology Creation
48TISP/FOCIH Extraction Ontology Creation
49TISP/FOCIH Extraction Ontology Creation
50TISP/FOCIH Extraction Ontology Creation
51TISP/FOCIH Extraction Ontology Creation
52TISP/FOCIH Extraction Ontology Creation
53TANGO: Table Analysis for Generating Ontologies
- Recognize and normalize table information
- Construct mini-ontologies from tables
- Discover inter-ontology mappings
- Merge mini-ontologies into a growing ontology
54Recognize Table Information
Country | Population (July 2001 est.) | Religion: Albanian Orthodox | Religion: Muslim | Religion: Roman Catholic | Religion: Shia Muslim | Religion: Sunni Muslim | Religion: other
Afghanistan | 26,813,057 | | | | 15 | 84 | 1
Albania | 3,510,484 | 20 | 70 | 10 | | |
55Construct Mini-Ontology
56Discover Mappings
57Merge
58TANGO Demo
59Some More Challenging Annotation Applications
- Multimedia Annotation
- Art Images
- Music and Lyrics
- Closed-captioning Video
- Historical Document Images
- Names, Dates, Places, Events
- Learned rules for OCR'd Named Entity Recognition
- Open Question Answering
60Find me an image that is red, dark, scary, and
beautiful.
61Find something soothing but energetic. Good for
recovering hospital patients.
How about Mozart's 40th Symphony? (Here it is.)
62U.S. President Barack Obama visited Iraq Monday
in a stop that was overshadowed by the question
of when U.S. troops should go home. Obama made
his opposition to the U.S.-led invasion of Iraq
five years ago a centerpiece of his campaign and
was in Baghdad to assess security in Iraq, where
violence has fallen to its lowest level since
early 2004.
When has Barack Obama visited Iraq?
Which U.S. Presidents have visited Iraq?
63Find names, locations, events, and dates and
associations among them for my great grandma
Margaret Haines.
I GERTRUDE SMITH (Mrs William E Haines deceased)
Married shortly after graduation Died at age of
22 Was musician and taught piano lessons 1898
HOBART L BENEDICT Millburn Essex County N J
Graduated from Rutgers 1902 and from New York Law
School in 1904 with degrees of B Sc M Sc and LL B
Married April 9 1907 to Martha C Bunnell One
daughter Elizabeth Benedict Counsellor at law
with offices in Elizabeth and Millburn MARTHA
BUNNELL (Mrs Hobart L Benedict) Millburn Essex
County N J Married to Hobart L Benedict on date
above 1899 CORA SMITH (Mrs Louis Slingerland) 557
Third St South St Peters- burg Florida Married
Louis Slingerland a former pupil of Connec- Farms
High School Mr Slingerland is engaged in building
business in St Petersburg JENNIE HAINES Elmwood
Ave Union Union Co N J Graduated from State
Normal School Trenton N J in 190 5 Principal of
Hurden Looker School in Hillside Township
formerly a part of Union Town- ship STELLA
ILLSLEY (Mrs Harry Engel) Hollis Long Island N Y
WALTER BOSCHEN Morris Ave Union N J Completed
fourth year at Battin High School in 1900
Attended Rutgers College taking up civil
engineering course Has been successful in the
business world President of the W G Boschen Sales
Co Inc manufacturer general agents for mechanical
line GEORGE McQUAIDE Springfield N J Was employed
by Morris County Traction Company 1900 No
graduates 1901 No graduates 1902 MARGARET HAINES
Elmwood Ave Union N J Took up stenography and
typewriting and is now employed as private
secretary of the Correspondence Department of the
Singer Manufacturing Com- pany of Elizabeth N J
ABBY HEADLEY (Mrs Leslie Ward) 5 Rose St Newark N
J CLARENCE GRIGGS Stuyvesant Ave Union N J
Graduated from Trenton State Normal School in
1905 having specialized in manual training Taught
one half year at Neshanic N J one year at Lin-
coin School Roselle N J Teaching manual
39training and mechanical drawing in Newark N
J Has taken special courses in Columbia
University 34
64Learn rules to recognize names, even in
less-than-ideal OCR'd documents.
- Seed models
- Prefix: Mrs, Miss, Mr
- Initials: A, B, C, ...
- Given Name: Charles, Francis, Herbert
- Surname: Goodrich, Wells, White
- Stopword: Jewell, Graves
- Updates
- Prefix: first token in line
- Given Name: between Prefix and Initial
- Surname: between Initial and </S>
M RS CHARLES A JEWELL MRS FRANCIS B COOI
EN MRS P W ELILSWVORT MRs HERBERT C
ADSWVORTH MRS HENRY E TAINTOR MR DANIEl H
WELLS MRS ARTHUR L GOODRICH Miss JOSEPHINE
WHITE Mss JULIA A GRAVES Ms H B LANGDON Miss
MARY H ADAMS Miss ELIZA F Mix 'MRs MIARY C ST
)NEC MIRS AI I ERT H PITKIN
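The seed-and-update idea can be sketched as a small bootstrapping loop: tag tokens with the seed lexicons, apply the positional update rules to untagged tokens, then add the newly tagged tokens back into the lexicons. This is only a rough sketch under several assumptions: each name is taken to sit on its own line, the Stopword model is left out, and the function names are invented for the example.

```python
# Seed lexicons from the slide, upper-cased for case-insensitive lookup.
SEED = {
    "Prefix": {"MRS", "MISS", "MR"},
    "GivenName": {"CHARLES", "FRANCIS", "HERBERT"},
    "Surname": {"GOODRICH", "WELLS", "WHITE"},
}

def tag_line(line, lexicons):
    """Tag each token from the lexicons; single letters count as initials."""
    tokens = line.split()
    tags = []
    for tok in tokens:
        up = tok.upper()
        tag = next((name for name, words in lexicons.items() if up in words), None)
        if tag is None and len(tok) == 1 and tok.isalpha():
            tag = "Initial"
        tags.append(tag)
    return tokens, tags

def apply_updates(tokens, tags):
    """Positional update rules for tokens the lexicons did not recognize."""
    tags = list(tags)
    if tags and tags[0] is None:
        tags[0] = "Prefix"  # first token in line
    for i in range(1, len(tokens) - 1):
        if tags[i] is None and tags[i - 1] == "Prefix" and tags[i + 1] == "Initial":
            tags[i] = "GivenName"  # between Prefix and Initial
    if len(tokens) >= 2 and tags[-1] is None and tags[-2] == "Initial":
        tags[-1] = "Surname"  # between Initial and end of line
    return tags

def learn(lexicons, tokens, tags):
    """Add newly tagged tokens to the lexicons (initials are not stored)."""
    for tok, tag in zip(tokens, tags):
        if tag in lexicons:
            lexicons[tag].add(tok.upper())

lexicons = {name: set(words) for name, words in SEED.items()}
for line in ["MRS HENRY E TAINTOR", "Mss JULIA A GRAVES"]:
    tokens, tags = tag_line(line, lexicons)
    tags = apply_updates(tokens, tags)
    learn(lexicons, tokens, tags)
    print(list(zip(tokens, tags)))
```

Note how the OCR error "Mss" is absorbed as a prefix by the first-token rule, which then lets the given-name rule fire on the same line.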
65Who was the first person to land on the Moon?
66Semi-Automatic Annotation via Synergistic
Bootstrapping (Based on Nested Schemas with
Regular Expressions)
- Build a page-layout, pattern-based annotator
- Automate layout recognition based on examples
- Auto-generate examples with extraction ontologies
- Synergistically run pattern-based annotator
and extraction-ontology annotator
67PatML Editor
Information Structure Tree
Page Source Text
Browser-Rendered Page
69Synergistic Execution
Extraction Ontology
Partially Annotated Document
Conceptual Annotator (ontology-based annotation)
Pattern Generation
Document
Layout Patterns
Annotated Document
Structural Annotator (layout-driven annotation)
70Knowledge Bundles for Research Studies
True Story
To do a recent study about associations between
lung cancer and tp53 polymorphism, researchers
needed to (1) do a keyword-based search on the
SNP data repository for "tp53" within organism
"homo sapiens"; (2) from the returned records,
open each record page one by one and find those
coding SNPs that have a minor allele frequency
greater than 1%; (3) for each qualifying SNP,
record the SNP ID and many properties of the SNP;
(4) perform a keyword search in PubMed and skim
the hundreds of manuscripts found to determine
which manuscripts are related to the SNPs of
interest and fit their search criteria; (5)
extract the information of interest (e.g., the
statistical information, patient information, and
treatment information); and (6) organize it.
71Knowledge Bundles for Research Studies
(1) Search, (2) Filter, (3) Record information
72Knowledge Bundles for Research Studies
(4) High precision literature search
73Knowledge Bundles for Research Studies
(5) Extract by reverse engineering
74Knowledge Bundles for Research Studies
(6) Organize harvested information
75Knowledge Bundles for Research Studies
76Knowledge Bundles for Research Studies
Research Challenge: "I believe that a good
biomedical scenario would be to select a topic
which already has a large structured database
(gene extraction, vitamins, blood), and then
search for and find web pages that augment,
support or refute specific aspects of that
database." GN
77Practicalities: Bootstrapping the WoK
(Future Work)
- Won't just happen without sufficient content
- Niche applications
- Historical Data (e.g. Genealogy)
- Bio-research studies
- Local WoKs
- Intra-organizational effort
- Individual interests
78Practicalities: Scalability
(Future Work)
- Potential: rapid growth
- Thousands of ontologies
- Millions of simultaneous queries
- Billions of annotated pages
- Trillions of facts
- Search-engine-like caching and query processing
79Key to Success: Simplicity via Automation
- Automatic (or near automatic) creation of
extraction ontologies
- Automatic (or near automatic) annotation of web
pages
- Simple but accurate query specification without
specialized training
www.deg.byu.edu www.tango.byu.edu