Title: David W. Embley
1WoK: A Web of Knowledge
- David W. Embley
- Brigham Young University
- Provo, Utah, USA
2A Web of Pages → A Web of Facts
- Birthdate of my great grandpa Orson
- Price and mileage of red Nissans, 1990 or newer
- Location and size of chromosome 17
- US states with property crime rates above 1%
3Find me an image that is red, dark, scary, and
beautiful.
4Learn rules to recognize names, even in
less-than-ideal OCR'd documents.
- Seed models
- Prefix: Mrs, Miss, Mr
- Initials: A, B, C, ...
- Given Name: Charles, Francis, Herbert
- Surname: Goodrich, Wells, White
- Stopword: Jewell, Graves
- Updates
- Prefix: first token in line
- Given Name: between Prefix and Initial
- Surname: between Initial and lt/Sgt
M RS CHARLES A JEWELL MRS FRANCIS B COOI
EN MRS P W ELILSWVORT MRs HERBERT C
ADSWVORTH MRS HENRY E TAINTOR MR DANIEl H
WELLS MRS ARTHUR L GOODRICH Miss JOSEPHINE
WHITE Mss JULIA A GRAVES Ms H B LANGDON Miss
MARY H ADAMS Miss ELIZA F Mix 'MRs MIARY C ST
)NEC MIRS AI I ERT H PITKIN
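The seed-model idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the learner used in the talk: the function names `label` and `learn_given_names` are hypothetical, the sample line is a cleaned-up version of the OCR text, and only the update rule "Given Name: between Prefix and Initial" is implemented.

```python
# Seed lexicons taken from the slide; everything else is an illustrative assumption.
SEED = {
    "Prefix": {"MRS", "MISS", "MR"},
    "GivenName": {"CHARLES", "FRANCIS", "HERBERT"},
    "Surname": {"GOODRICH", "WELLS", "WHITE"},
}


def label(token, lexicon):
    """Return the category of a token, or None if the seed model does not know it."""
    t = token.upper()
    for category, words in lexicon.items():
        if t in words:
            return category
    # A single letter is treated as an initial.
    if len(t) == 1 and t.isalpha():
        return "Initial"
    return None


def learn_given_names(tokens, lexicon):
    """Apply one update rule: an unknown token between a Prefix and an Initial
    is proposed as a new Given Name."""
    labels = [label(t, lexicon) for t in tokens]
    learned = set()
    for i in range(1, len(tokens) - 1):
        if labels[i] is None and labels[i - 1] == "Prefix" and labels[i + 1] == "Initial":
            learned.add(tokens[i].upper())
    return learned


line = "MRS HENRY E TAINTOR MR DANIEL H WELLS MRS ARTHUR L GOODRICH"
print(sorted(learn_given_names(line.split(), SEED)))
```

Names learned this way could be added to the lexicon and the rules rerun, which is one plausible reading of "Updates" on the slide.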
5Annotating Music and Lyrics
Find something soothing but energetic. Good for
recovering patients. How about Mozart's 40th
Symphony?
6Build a knowledge bundle for checking the
association between tp53 polymorphism and lung
cancer.
7U.S. President Barack Obama visited Iraq Monday
in a stop that was overshadowed by the question
of when U.S. troops should go home. Obama made
his opposition to the U.S.-led invasion of Iraq
five years ago a centerpiece of his campaign and
was in Baghdad to assess security in Iraq, where
violence has fallen to its lowest level since
early 2004.
When has Barack Obama visited Iraq?
Which U.S. Presidents have visited Iraq?
8Find names, locations, events, and dates and
associations among them for my great grandma
Margaret Haines.
I GERTRUDE SMITH (Mrs William E Haines deceased)
Married shortly after graduation Died at age of
22 Was musician and taught piano lessons 1898
HOBART L BENEDICT Millburn Essex County N J
Graduated from Rutgers 1902 and from New York Law
School in 1904 with degrees of B Sc M Sc and LL B
Married April 9 1907 to Martha C Bunnell One
daughter Elizabeth Benedict Counsellor at law
with offices in Elizabeth and Millburn MARTHA
BUNNELL (Mrs Hobart L Benedict) Millburn Essex
County N J Married to Hobart L Benedict on date
above 1899 CORA SMITH (Mrs Louis Slingerland) 557
Third St South St Peters- burg Florida Married
Louis Slingerland a former pupil of Connec- Farms
High School Mr Slingerland is engaged in building
business in St Petersburg JENNIE HAINES Elmwood
Ave Union Union Co N J Graduated from State
Normal School Trenton N J in 190 5 Principal of
Hurden Looker School in Hillside Township
formerly a part of Union Town- ship STELLA
ILLSLEY (Mrs Harry Engel) Hollis Long Island N Y
WALTER BOSCHEN Morris Ave Union N J Completed
fourth year at Battin High School in 1900
Attended Rutgers College taking up civil
engineering course Has been successful in the
business world President of the W G Boschen Sales
Co Inc manufacturer general agents for mechanical
line GEORGE McQUAIDE Springfield N J Was employed
by Morris County Traction Company 1900 No
graduates 1901 No graduates 1902 MARGARET HAINES
Elmwood Ave Union N J Took up stenography and
typewriting and is now employed as private
secretary of the Correspondence Department of the
Singer Manufacturing Com- pany of Elizabeth N J
ABBY HEADLEY (Mrs Leslie Ward) 5 Rose St Newark N
J CLARENCE GRIGGS Stuyvesant Ave Union N J
Graduated from Trenton State Normal School in
1905 having specialized in manual training Taught
one half year at Neshanic N J one year at Lin-
coin School Roselle N J Teaching manual
39training and mechanical drawing in Newark N
J Has taken special courses in Columbia
University 34
9Who was the first person to land on the Moon?
10Alert! Alert! I found your Jeep Liberty for under
$8,000.
2002 Jeep Liberty
$7,995
Toll free 1-800-423-0334
11Toward a Web of Knowledge
- Fundamental questions
- What is knowledge?
- What are facts?
- How does one know?
- Philosophy
- Ontology
- Epistemology
- Logic and reasoning
12Ontology
- Existence → asks "What exists?"
- Concepts, relationships, and constraints with
formal foundation
13Epistemology
- The nature of knowledge → asks "What is knowledge?" and "How is knowledge acquired?"
- Populated conceptual model
14Logic and Reasoning
- Principles of valid inference → asks "What is known?" and "What can be inferred?"
- For us, it answers what can be inferred (in a formal sense) from conceptualized data.
Find price and mileage of red Nissans, 1990 or
newer
15Making this Work → How?
- Distill knowledge from the wealth of digital web data
- Annotate web pages
- Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge
(Figure labels: Annotation, Fact)
16Turning Raw Symbols into Knowledge
- Symbols: $11,500 117K Nissan CD AC
- Data: price($11,500) mileage(117K) make(Nissan)
- Conceptualized data
- Car(C123) has Price($11,500)
- Car(C123) has Mileage(117,000)
- Car(C123) has Make(Nissan)
- Car(C123) has Feature(AC)
- Knowledge
- Correct facts
- Provenance
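The progression from symbols to conceptualized data with provenance might look roughly like the Python sketch below. It is a toy illustration, not the system shown in the talk: the `Fact` class, the `RECOGNIZERS` table, the `extract_facts` function, the source identifier "ad-1", and the assumption that the ad text carries a dollar sign are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str      # e.g. the car identifier C123
    relation: str     # e.g. Price, Mileage
    value: object     # normalized value
    source: str       # provenance: where the symbols were found


# Hypothetical recognizers: a pattern plus a converter to a normalized value.
RECOGNIZERS = {
    "Price": (re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"), lambda s: int(s.replace(",", ""))),
    "Mileage": (re.compile(r"\b(\d+)K\b"), lambda s: int(s) * 1000),
    "Make": (re.compile(r"\b(Nissan)\b"), str),
    "Feature": (re.compile(r"\b(AC|CD)\b"), str),
}


def extract_facts(text, car_id, source):
    """Turn raw symbols into 'Car(id) has Relation(value)' facts tagged with their source."""
    facts = []
    for relation, (pattern, convert) in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            facts.append(Fact(car_id, relation, convert(match.group(1)), source))
    return facts


for fact in extract_facts("$11,500 117K Nissan CD AC", "C123", "ad-1"):
    print(f"Car({fact.subject}) has {fact.relation}({fact.value})  [from {fact.source}]")
```

Keeping the source with every fact is what lets a later query show where an answer came from.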
17Actualization (with Extraction Ontologies)
Find me the price and mileage of all red Nissans.
I want a 1990 or newer.
18Data Extraction Demo
19Semantic Annotation Demo
20Free-Form Query Demo
21Explanation: How it Works
- Extraction Ontologies
- Semantic Annotation
- Free-Form Query Interpretation
22Extraction Ontologies
Object sets Relationship sets Participation
constraints Lexical Non-lexical Primary object
set Aggregation Generalization/Specialization
23Extraction Ontologies
Data Frame
Internal Representation: float
Values
External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
Left Context
Key Word Phrase
Key Words: ([Pp]rice)|([Cc]ost)
Operators
Operator: >
Key Words: (more\sthan)|(more\scostly)
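A data frame of this kind can be approximated in Python as a small bundle of regular expressions. The sketch below is an assumption-laden illustration: `PRICE_FRAME` and `recognize_price` are hypothetical names, the dollar sign in the value pattern is my reading of the slide, and the value pattern is a variation that also accepts thousands separators rather than the slide's exact expression.

```python
import re

# Hypothetical encoding of a Price data frame.
PRICE_FRAME = {
    "internal_type": float,
    # Variation on the slide's external representation, with thousands separators.
    "value": re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)"),
    "keywords": re.compile(r"[Pp]rice|[Cc]ost"),
    "operators": {">": re.compile(r"more\sthan|more\scostly")},
}


def recognize_price(text, frame=PRICE_FRAME):
    """Find a price value, note whether a keyword supports it, and detect an operator phrase."""
    match = frame["value"].search(text)
    if match is None:
        return None
    value = frame["internal_type"](match.group(1).replace(",", ""))
    operator_symbol = "="
    for symbol, pattern in frame["operators"].items():
        # Only operator phrases that appear before the value are considered.
        if pattern.search(text[: match.start()]):
            operator_symbol = symbol
    return {
        "value": value,
        "operator": operator_symbol,
        "keyword_present": bool(frame["keywords"].search(text)),
    }


print(recognize_price("Price: $11,500"))
print(recognize_price("costs more than $8,000.00"))
```

Because the frame is declarative data rather than page-specific code, the same frame can be reused across any page in the domain.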
24Generality and Resiliency of Extraction Ontologies
- Generality: assumptions about web pages
- Data rich
- Narrow domain
- Document types
- Single-record documents (hard, but doable)
- Multiple-record documents (harder)
- Records with scattered components (even harder)
- Resiliency: declarative
- Still works when web pages change
- Works for new, unseen pages in the same domain
- Scalable, but takes work to declare the
extraction ontology
25Semantic Annotation
26Free-Form Query Interpretation
- Parse Free-Form Query
- (with respect to data extraction ontology)
- Select Ontology
- Formulate Query Expression
- Run Query Over Semantically Annotated Data
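27Parse Free-Form Query
Find me the [price] and [mileage] of all [red] [Nissan]s. I want a [1996] [or newer].
(Bracketed terms are the phrases recognized in the query; "or newer" is recognized as the ">" Operator.)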
28Select Ontology
Find me the price and mileage of all red Nissans.
I want a 1996 or newer.
29Formulate Query Expression
- Conjunctive queries and aggregate queries
- Mentioned object sets are all of interest.
- Values and operator keywords determine conditions.
- Color = red
- Make = Nissan
- Year > 1996
">" Operator
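The step from recognized conditions to a conjunctive query over annotated data can be sketched as follows. This is a minimal Python illustration, not the query language used in the talk: `run_query`, the record layout, and the sample prices and mileages are hypothetical, and "or newer" is treated as an inclusive comparison, which is my assumption.

```python
import operator

# Comparison operators that a recognized operator keyword may map to.
OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge}


def run_query(records, wanted, conditions):
    """Return the wanted attributes of every record that satisfies all conditions.

    conditions is a list of (attribute, operator symbol, value) triples,
    combined conjunctively.
    """
    results = []
    for record in records:
        if all(attr in record and OPS[op](record[attr], value)
               for attr, op, value in conditions):
            results.append({attr: record.get(attr) for attr in wanted})
    return results


# Hypothetical annotated records; the values are invented for illustration.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1996, "Price": 4500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1989, "Price": 900, "Mileage": 210000},
    {"Make": "Ford", "Color": "red", "Year": 2001, "Price": 6000, "Mileage": 80000},
]
conditions = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
print(run_query(cars, ["Price", "Mileage"], conditions))
```

The mentioned object sets (price, mileage) become the returned attributes, and the recognized values and operator keywords become the conditions.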
30Formulate Query Expression
For
Let
Where
Return
31Run Query Over Semantically Annotated Data
32Great! But Problems Still Need Resolution
- Automating content annotation
- Extraction-ontology creation: a few dozen person-hours
- Semi-automatic creation
- FOCIH (Form-based Ontology Creation and Information Harvesting)
- TISP (Table Interpretation by Sibling Pages)
- TANGO (Table ANalysis for Generating Ontologies)
- Stepping up to the envisioned Web of Knowledge
- Current and future work
- Semi-automatic annotation via synergistic bootstrapping
- Knowledge bundles for research studies
- Practicalities
33Manual Creation
34Manual Creation
35Manual Creation
- Library of instance recognizers
- Library of lexicons
36Craigslist Alerter
- Constructed as a short class project
- 10 applications
- a few dozen hours
- Demo
37FOCIH: Form-based Ontology Creation and
Information Harvesting
- Forms (general familiarity)
- Information Harvesting
- Semi-automatic extraction ontology creation
- Form-based generation of conceptual model
- Instance-recognizer creation
- Lexicons
- Some pre-existing instance recognizers
38FOCIH Form Creation
39FOCIH Ontology Generation
40FOCIH Information Harvesting
41FOCIH Information-Harvesting Demo
42TISP: Table Interpretation with Sibling Pages
43Interpretation Technique: Sibling Page Comparison
Same
44Interpretation Technique: Sibling Page Comparison
Almost Same
45Interpretation Technique: Sibling Page Comparison
Different
Same
46Technique Details
- Unnest tables
- Match tables in sibling pages
- Perfect match (table for layout → discard)
- Reasonable match (sibling table)
- Determine and use table-structure pattern
- Discover pattern
- Pattern usage
- Dynamic pattern adjustment
47Table Unnesting
48Simple Tree Matching Algorithm
[Yang91]
Match Score Categorization: Exact/Near-Exact,
Sibling-Table, False
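For reference, the simple tree matching algorithm can be written as a short dynamic program. The Python sketch below follows the standard formulation (a maximum top-down matching that preserves sibling order); the tree encoding as `(label, children)` pairs and the `similarity` normalization are assumptions of mine, and the slides do not give the score thresholds that separate Exact/Near-Exact, Sibling-Table, and False, so none are invented here.

```python
def simple_tree_matching(a, b):
    """Size of the maximum top-down, order-preserving matching between trees a and b.

    A tree is a (label, children) pair, where children is a list of trees.
    """
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: best matching between the first i children of a and the first j of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(kid) for kid in tree[1])


def similarity(a, b):
    """Matching size normalized by the average tree size (an assumed normalization)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


page_a = ("table", [("tr", [("td", []), ("td", [])]), ("tr", [("td", [])])])
page_b = ("table", [("tr", [("td", []), ("td", [])])])
print(simple_tree_matching(page_a, page_b), similarity(page_a, page_b))
```

A normalized score near 1 would suggest identical layout tables, while a middling score is the interesting case of sibling tables that share structure but differ in content.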
49Table Structure Patterns
- Regularity Expectations
- (<tr><(td|th)> L <(td|th)> V)^n
- <tr>(<(td|th)> L)^n (<tr>(<(td|th)> V)^n)+
- ...
Pattern combinations are also possible.
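One way to picture how sibling pages reveal these patterns: cells whose text is the same in both siblings behave like labels (L), and cells that differ behave like values (V). The Python sketch below is a simplified illustration under that assumption, with hypothetical names `label_cells` and `classify`; it works on already-unnested tables given as lists of rows and ignores the HTML tags themselves.

```python
import re


def label_cells(table_a, table_b):
    """Mark each cell of table_a 'L' if the sibling table has the same text
    at the same position, otherwise 'V'. Returns one string per row."""
    marks = []
    for i, row in enumerate(table_a):
        row_marks = ""
        for j, cell in enumerate(row):
            same = i < len(table_b) and j < len(table_b[i]) and table_b[i][j] == cell
            row_marks += "L" if same else "V"
        marks.append(row_marks)
    return marks


def classify(marks):
    """Match the row marks against two of the regularity expectations."""
    rows = "/".join(marks)
    # Every row is a label followed by a value.
    if re.fullmatch(r"LV(/LV)*", rows):
        return "label-value rows"
    # One row of n labels followed by one or more rows of n values.
    match = re.fullmatch(r"(L+)((?:/V+)+)", rows)
    if match:
        n = len(match.group(1))
        if all(len(r) == n for r in match.group(2).strip("/").split("/")):
            return "label row, then value rows"
    return None


a = [["Make", "Nissan"], ["Year", "1996"]]
b = [["Make", "Ford"], ["Year", "2001"]]
print(label_cells(a, b), classify(label_cells(a, b)))
```

A value that happens to coincide across the two siblings would be mislabeled as L here, which is one reason comparing more than two sibling pages helps.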
50Pattern Usage
(Location.Genetic Position): X:12.69 +/- 0.000 cM mapping data
(Location.Genomic Position): X:13518823..13515773 bp
51Dynamic Pattern Adjustment
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)+
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)+
<tr>(<(td|th)> L)^6 (<tr>(<(td|th)> V)^6)+
52TISP Demo
53TISP/FOCIH Extraction Ontology Creation
- Reverse engineer with TISP
- Adjust with FOCIH
- Data frames
- Initialize lexicons with harvested data
- Library of data frames: select and specialize
54TISP/FOCIH Extraction Ontology Creation
55TISP/FOCIH Extraction Ontology Creation
56TISP/FOCIH Extraction Ontology Creation
57TISP/FOCIH Extraction Ontology Creation
58TISP/FOCIH Extraction Ontology Creation
59TISP/FOCIH Extraction Ontology Creation
60TANGO: Table Analysis for Generating Ontologies
- Recognize and normalize table information
- Construct mini-ontologies from tables
- Discover inter-ontology mappings
- Merge mini-ontologies into a growing ontology
61Recognize Table Information
                                             Religion
Country      Population        Albanian  Muslim  Roman     Shia    Sunni   other
             (July 2001 est.)  Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                    15      84      1
Albania      3,510,484         20        70      10
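Normalizing a recognized table into individual facts is the kind of step that precedes building a mini-ontology. The Python sketch below shows only that normalization and is not TANGO itself: `table_to_facts` is a hypothetical helper, and the column alignment of the sample rows is my reading of the flattened table above.

```python
def table_to_facts(header, rows, key_column):
    """Flatten a table into (key, attribute, value) triples, skipping empty cells."""
    facts = []
    for row in rows:
        record = dict(zip(header, row))
        key = record[key_column]
        for attribute, value in record.items():
            if attribute != key_column and value is not None:
                facts.append((key, attribute, value))
    return facts


header = ["Country", "Population", "Albanian Orthodox", "Muslim",
          "Roman Catholic", "Shia Muslim", "Sunni Muslim", "other"]
# Assumed alignment of the values; None marks an empty cell.
rows = [
    ["Afghanistan", 26813057, None, None, None, 15, 84, 1],
    ["Albania", 3510484, 20, 70, 10, None, None, None],
]

for fact in table_to_facts(header, rows, "Country"):
    print(fact)
```

Each triple names an object (the country), a relationship (the column), and a value, which is roughly the raw material a conceptual model is built from.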
62Construct Mini-Ontology
63Discover Mappings
64Merge
65TANGO Demo
66Semi-Automatic Annotation via Synergistic
Bootstrapping (Based on Nested Schemas with
Regular Expressions)
- Build a page-layout, pattern-based annotator
- Automate layout recognition based on examples
- Auto-generate examples with extraction ontologies
- Synergistically run pattern-based annotator and
extraction-ontology annotator
67PatML Editor
Information Structure Tree
Page Source Text
Browser-Rendered Page
69Synergistic Execution
Extraction Ontology
Partially Annotated Document
Conceptual Annotator (ontology-based annotation)
Pattern Generation
Document
Layout Patterns
Annotated Document
Structural Annotator (layout-driven annotation)
70Knowledge Bundles for Research Studies
True Story
To do a recent study about associations between
lung cancer and tp53 polymorphism, researchers
needed to (1) do a keyword-based search on the
SNP data repository for "tp53" within organism
"homo sapiens"; (2) from the returned records,
open each record page one by one and find those
coding SNPs that have a minor allele frequency
greater than 1%; (3) for each qualifying SNP,
record the SNP ID and many properties of the SNP;
(4) perform a keyword search in PubMed and skim
the hundreds of manuscripts found to determine
which manuscripts are related to the SNPs of
interest and fit their search criteria; and (5)
extract the information of interest (e.g., the
statistical information, patient information, and
treatment information) and organize it.
71Knowledge Bundles for Research Studies
(1) Search, (2) Filter, (3) Record information
72Knowledge Bundles for Research Studies
(4) High precision literature search
73Knowledge Bundles for Research Studies
(5) Extract and organize
74Knowledge Bundles for Research Studies
75Knowledge Bundles for Research Studies
Research Challenge: "I believe that a good
biomedical scenario would be to select a topic
which already has a large structured database (gene
extraction, vitamins, blood), and then search for
and find web pages that augment, support or
refute specific aspects of that
database." (GN)
76Practicalities: Bootstrapping the WoK
(Future Work)
- Won't just happen without sufficient content
- Niche applications
- Historical Data (e.g. Genealogy)
- Bio-research studies
- Local WoKs
- Intra-organizational effort
- Individual interests
77Practicalities: Scalability
(Future Work)
- Potential rapid growth
- Thousands of ontologies
- Millions of simultaneous queries
- Billions of annotated pages
- Trillions of facts
- Search-engine-like caching and query processing
78Key to Success: Simplicity via Automation
- Automatic (or near automatic) creation of extraction ontologies
- Automatic (or near automatic) annotation of web pages
- Simple but accurate query specification without specialized training
www.deg.byu.edu www.tango.byu.edu