David W. Embley - PowerPoint PPT Presentation

About This Presentation
Title:

David W. Embley

Description:

WoK: A Web of Knowledge David W. Embley Brigham Young University Provo, Utah, USA – PowerPoint PPT presentation


Transcript and Presenter's Notes

1
WoK: A Web of Knowledge
  • David W. Embley
  • Brigham Young University
  • Provo, Utah, USA

2
A Web of Pages → A Web of Facts
  • Birthdate of my great grandpa Orson
  • Price and mileage of red Nissans, 1990 or newer
  • Location and size of chromosome 17
  • US states with property crime rates above 1

3
Find me an image that is red, dark, scary, and
beautiful.
4
Learn rules to recognize names, even in
less-than-ideal OCR'd documents.
  • Seed models
  • Prefix: Mrs, Miss, Mr
  • Initials: A, B, C, ...
  • Given Name: Charles, Francis, Herbert
  • Surname: Goodrich, Wells, White
  • Stopword: Jewell, Graves
  • Updates
  • Prefix: first token in line
  • Given Name: between Prefix and Initial
  • Surname: between Initial and lt/Sgt

M RS CHARLES A JEWELL MRS FRANCIS B COOI
EN MRS P W ELILSWVORT MRs HERBERT C
ADSWVORTH MRS HENRY E TAINTOR MR DANIEl H
WELLS MRS ARTHUR L GOODRICH Miss JOSEPHINE
WHITE Mss JULIA A GRAVES Ms H B LANGDON Miss
MARY H ADAMS Miss ELIZA F Mix 'MRs MIARY C ST
)NEC MIRS AI I ERT H PITKIN
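
The seed models and update rules on this slide can be pictured with a small sketch. It is only an illustration: the lexicon contents are the slide's own examples, while the function names and the way the "Given Name between Prefix and Initial" rule is applied are assumptions, not the presenter's implementation.

```python
# Seed lexicons copied from the slide's examples (upper-cased for OCR text).
SEED = {
    "Prefix": {"MRS", "MISS", "MR"},
    "GivenName": {"CHARLES", "FRANCIS", "HERBERT"},
    "Surname": {"GOODRICH", "WELLS", "WHITE"},
}


def label_tokens(line):
    """Label each token with a seed category, 'Initial', or None (hypothetical helper)."""
    labels = []
    for token in line.split():
        key = token.upper()
        label = next((cat for cat, words in SEED.items() if key in words), None)
        if label is None and len(key) == 1 and key.isalpha():
            label = "Initial"
        labels.append((token, label))
    return labels


def learn_given_names(labeled):
    """Apply the update rule 'Given Name between Prefix and Initial'."""
    learned = set()
    for i in range(1, len(labeled) - 1):
        token, label = labeled[i]
        between = labeled[i - 1][1] == "Prefix" and labeled[i + 1][1] == "Initial"
        if label is None and between:
            learned.add(token.upper())
    return learned


labeled = label_tokens("MRS HENRY E TAINTOR MR DANIEl H WELLS")
new_given_names = learn_given_names(labeled)
```

Names learned this way could be added back to the seed lexicon, so that later passes recognize more of the noisy text.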
5
Annotating Music and Lyrics
Find something soothing but energetic. Good for
recovering patients. How about Mozart's 40th
Symphony?
6
Build a knowledge bundle for checking the
association between tp53 polymorphism and lung
cancer.
7
U.S. President Barack Obama visited Iraq Monday
in a stop that was overshadowed by the question
of when U.S. troops should go home. Obama made
his opposition to the U.S.-led invasion of Iraq
five years ago a centerpiece of his campaign and
was in Baghdad to assess security in Iraq, where
violence has fallen to its lowest level since
early 2004.
When has Barack Obama visited Iraq?
Which U.S. Presidents have visited Iraq?
8
Find names, locations, events, and dates and
associations among them for my great grandma
Margaret Haines.
I GERTRUDE SMITH (Mrs William E Haines deceased)
Married shortly after graduation Died at age of
22 Was musician and taught piano lessons 1898
HOBART L BENEDICT Millburn Essex County N J
Graduated from Rutgers 1902 and from New York Law
School in 1904 with degrees of B Sc M Sc and LL B
Married April 9 1907 to Martha C Bunnell One
daughter Elizabeth Benedict Counsellor at law
with offices in Elizabeth and Millburn MARTHA
BUNNELL (Mrs Hobart L Benedict) Millburn Essex
County N J Married to Hobart L Benedict on date
above 1899 CORA SMITH (Mrs Louis Slingerland) 557
Third St South St Peters- burg Florida Married
Louis Slingerland a former pupil of Connec- Farms
High School Mr Slingerland is engaged in building
business in St Petersburg JENNIE HAINES Elmwood
Ave Union Union Co N J Graduated from State
Normal School Trenton N J in 190 5 Principal of
Hurden Looker School in Hillside Township
formerly a part of Union Town- ship STELLA
ILLSLEY (Mrs Harry Engel) Hollis Long Island N Y
WALTER BOSCHEN Morris Ave Union N J Completed
fourth year at Battin High School in 1900
Attended Rutgers College taking up civil
engineering course Has been successful in the
business world President of the W G Boschen Sales
Co Inc manufacturer general agents for mechanical
line GEORGE McQUAIDE Springfield N J Was employed
by Morris County Traction Company 1900 No
graduates 1901 No graduates 1902 MARGARET HAINES
Elmwood Ave Union N J Took up stenography and
typewriting and is now employed as private
secretary of the Correspondence Department of the
Singer Manufacturing Com- pany of Elizabeth N J
ABBY HEADLEY (Mrs Leslie Ward) 5 Rose St Newark N
J CLARENCE GRIGGS Stuyvesant Ave Union N J
Graduated from Trenton State Normal School in
1905 having specialized in manual training Taught
one half year at Neshanic N J one year at Lin-
coin School Roselle N J Teaching manual
training and mechanical drawing in Newark N
J Has taken special courses in Columbia
University
9
Who was the first person to land on the Moon?
10
Alert! Alert! I found your Jeep Liberty for under
$8,000.
2002 Jeep Liberty
$7,995
Toll free 1-800-423-0334
11
Toward a Web of Knowledge
  • Fundamental questions
  • What is knowledge?
  • What are facts?
  • How does one know?
  • Philosophy
  • Ontology
  • Epistemology
  • Logic and reasoning

12
Ontology
  • Existence: asks "What exists?"
  • Concepts, relationships, and constraints with
    formal foundation

13
Epistemology
  • The nature of knowledge: asks "What is
    knowledge?" and "How is knowledge acquired?"
  • Populated conceptual model

14
Logic and Reasoning
  • Principles of valid inference: asks "What is
    known?" and "What can be inferred?"
  • For us, it answers what can be inferred (in a
    formal sense) from conceptualized data.

Find price and mileage of red Nissans, 1990 or
newer
15
Making this Work: How?
  • Distill knowledge from the wealth of digital web
    data
  • Annotate web pages
  • Need a computational alembic to algorithmically
    turn raw symbols contained in web pages into
    knowledge

[Figure labels: Annotation, Fact]
16
Turning Raw Symbols into Knowledge
  • Symbols: 11,500 117K Nissan CD AC
  • Data: price(11,500) mileage(117K)
    make(Nissan)
  • Conceptualized data
  • Car(C123) has Price(11,500)
  • Car(C123) has Mileage(117,000)
  • Car(C123) has Make(Nissan)
  • Car(C123) has Feature(AC)
  • Knowledge
  • Correct facts
  • Provenance
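
As a rough sketch of the progression on this slide, the code below turns the raw symbols into typed data and then into facts about one car, each carrying a provenance tag. The recognition rules, the function names, and the `source` label are all assumptions made for illustration; a real extraction ontology would supply far richer recognizers.

```python
import re


def to_data(symbols):
    """Map raw symbol strings to typed values using simplified, assumed rules."""
    data = {"feature": []}
    for s in symbols:
        if re.fullmatch(r"\d{1,3}(,\d{3})+", s):      # e.g. 11,500
            data["price"] = int(s.replace(",", ""))
        elif re.fullmatch(r"\d+K", s):                # e.g. 117K
            data["mileage"] = int(s[:-1]) * 1000
        elif s in {"Nissan"}:                         # tiny stand-in lexicon of makes
            data["make"] = s
        else:
            data["feature"].append(s)
    return data


def conceptualize(car_id, data, source):
    """Attach each value to a Car object and record where it came from."""
    facts = []
    for key in ("price", "mileage", "make"):
        if key in data:
            facts.append((f"Car({car_id})", key.capitalize(), data[key], source))
    for feature in data["feature"]:
        facts.append((f"Car({car_id})", "Feature", feature, source))
    return facts


data = to_data(["11,500", "117K", "Nissan", "CD", "AC"])
facts = conceptualize("C123", data, source="ad-1")   # "ad-1" is a hypothetical page id
```

Keeping the source alongside every fact is what lets a reader check the fact against the original page.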

17
Actualization (with Extraction Ontologies)
Find me the price and mileage of all red Nissans.
I want a 1990 or newer.
18
Data Extraction Demo
19
Semantic Annotation Demo
20
Free-Form Query Demo
21
Explanation: How it Works
  • Extraction Ontologies
  • Semantic Annotation
  • Free-Form Query Interpretation

22
Extraction Ontologies
  • Object sets
  • Relationship sets
  • Participation constraints
  • Lexical
  • Non-lexical
  • Primary object set
  • Aggregation
  • Generalization/Specialization
23
Extraction Ontologies
Data Frame
Internal Representation: float
Values
External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
Left Context
Key Word Phrase
Key Words: ([Pp]rice)|([Cc]ost)
Operators
Operator: >
Key Words: (more\sthan)|(more\scostly)
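
A data frame like the one on this slide bundles an internal type, value recognizers, context keywords, and operator keywords. The sketch below shows one way such a bundle might look in code. The class name `DataFrameSketch`, its methods, and the value pattern are assumptions chosen for readability; the pattern is a simplified stand-in and not the slide's expression.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrameSketch:
    """Illustrative stand-in for an extraction-ontology data frame."""
    name: str
    value_pattern: str                              # external representation (regex)
    keyword_pattern: str                            # context keywords (regex)
    operators: dict = field(default_factory=dict)   # operator symbol -> keyword regex

    def find_values(self, text):
        """Return every substring that looks like a value of this object set."""
        return [m.group(0) for m in re.finditer(self.value_pattern, text)]

    def has_keyword(self, text):
        """Report whether a context keyword for this object set appears."""
        return re.search(self.keyword_pattern, text) is not None

    def find_operators(self, text):
        """Return the operator symbols whose keywords appear in the text."""
        return [sym for sym, pat in self.operators.items() if re.search(pat, text)]

    @staticmethod
    def to_internal(value):
        """Convert the external string to the internal float representation."""
        return float(re.sub(r"[^\d.]", "", value))


price = DataFrameSketch(
    name="Price",
    value_pattern=r"\$\s*\d{1,3}(?:,\d{3})*(?:\.\d{2})?",   # assumed pattern
    keyword_pattern=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\sthan|more\scostly"},
)

found = price.find_values("Asking price: $11,500 obo")
```

Because the recognizers are declared rather than tied to page layout, the same frame can be tried on any page in the domain.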
24
Generality and Resiliency of Extraction Ontologies
  • Generality: assumptions about web pages
  • Data rich
  • Narrow domain
  • Document types
  • Single-record documents (hard, but doable)
  • Multiple-record documents (harder)
  • Records with scattered components (even harder)
  • Resiliency: declarative
  • Still works when web pages change
  • Works for new, unseen pages in the same domain
  • Scalable, but takes work to declare the
    extraction ontology

25
Semantic Annotation
26
Free-Form Query Interpretation
  • Parse Free-Form Query
  • (with respect to data extraction ontology)
  • Select Ontology
  • Formulate Query Expression
  • Run Query Over Semantically Annotated Data

27
Parse Free-Form Query
Find me the price and mileage of all red Nissans
I want a 1996 or newer

Recognized terms:
  • price
  • mileage
  • red
  • Nissan
  • 1996
  • or newer (> Operator)
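
Parsing a free-form query against an extraction ontology amounts to running the ontology's recognizers over the query text. The sketch below is a minimal illustration: the three dictionaries stand in for a car-ad ontology and are entirely hypothetical. It also reads "or newer" inclusively (`>=`), which is an assumption; swap in a strict comparison if the slide's operator is meant literally.

```python
import re

# Hypothetical recognizers standing in for a car-ad extraction ontology.
OBJECT_SETS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
VALUE_PATTERNS = {
    "Color": r"\b(red|blue|black|white)\b",
    "Make": r"\b(Nissan|Toyota|Honda)s?\b",
    "Year": r"\b((?:19|20)\d{2})\b",
}
OPERATOR_KEYWORDS = {">=": r"or\snewer"}   # assumed inclusive reading


def parse_query(query):
    """Return (requested object sets, conditions) recognized in the query."""
    wanted = [name for name, pat in OBJECT_SETS.items()
              if re.search(pat, query, re.I)]
    conditions = []
    for name, pattern in VALUE_PATTERNS.items():
        for m in re.finditer(pattern, query, re.I):
            op = "="
            rest = query[m.end():]
            # An operator keyword directly after a value replaces plain equality.
            for symbol, keyword in OPERATOR_KEYWORDS.items():
                if re.match(r"\s*" + keyword, rest, re.I):
                    op = symbol
            conditions.append((name, op, m.group(1)))
    return wanted, conditions


wanted, conditions = parse_query(
    "Find me the price and mileage of all red Nissans I want a 1996 or newer"
)
```

The words the recognizers do not claim ("Find me the", "of all", "I want a") are simply ignored, which is why the query needs no special syntax.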
28
Select Ontology
Find me the price and mileage of all red Nissans
I want a 1996 or newer
29
Formulate Query Expression
  • Conjunctive queries and aggregate queries
  • Mentioned object sets are all of interest.
  • Values and operator keywords determine
    conditions.
  • Color = red
  • Make = Nissan
  • Year > 1996

> Operator
30
Formulate Query Expression
For
Let
Where
Return
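
The four clause names on this slide follow the shape of an XQuery-style FLWOR expression; the expression itself is not in the transcript. As a loose analogue only, the sketch below evaluates a conjunctive query over a few annotated records in Python, with comments marking which part plays which clause. The record fields, the sample cars, and the inclusive year comparison are all made up for illustration.

```python
import operator

OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge}


def run_query(cars, conditions, returned):
    """Evaluate a conjunctive query over annotated car records."""
    results = []
    for car in cars:                                            # For
        checks = [attr in car and OPS[op](car[attr], value)     # Let
                  for attr, op, value in conditions]
        if all(checks):                                         # Where
            results.append({k: car.get(k) for k in returned})   # Return
    return results


# Hypothetical annotated data; none of these records come from the slides.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1994, "Price": 2900, "Mileage": 150000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 6200, "Mileage": 90000},
]
conditions = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
matches = run_query(cars, conditions, returned=["Price", "Mileage"])
```

Every condition must hold for a record to be returned, which is what makes the query conjunctive.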
31
Run Query Over Semantically Annotated Data
32
Great! But Problems Still Need Resolution
  • Automating content annotation
  • Extraction-ontology creation: a few dozen
    person-hours
  • Semi-automatic creation
  • FOCIH (Form-based Ontology Creation and
    Information Harvesting)
  • TISP (Table Interpretation by Sibling Pages)
  • TANGO (Table ANalysis for Generating Ontologies)
  • Stepping up to the envisioned Web of Knowledge
  • Current and future work
  • Semi-automatic annotation via synergistic
    bootstrapping
  • Knowledge bundles for research studies
  • Practicalities

33
Manual Creation
34
Manual Creation
35
Manual Creation
  • Library of instance recognizers
  • Library of lexicons

36
Craig's List Alerter
  • Constructed as a short class project
  • 10 applications
  • a few dozen hours
  • Demo

37
FOCIH: Form-based Ontology Creation and
Information Harvesting
  • Forms (general familiarity)
  • Information Harvesting
  • Semi-automatic extraction ontology creation
  • Form-based generation of conceptual model
  • Instance-recognizer creation
  • Lexicons
  • Some pre-existing instance recognizers

38
FOCIH: Form Creation
39
FOCIH: Ontology Generation
40
FOCIH: Information Harvesting
41
FOCIH: Information-Harvesting Demo
42
TISP: Table Interpretation by Sibling Pages
43
Interpretation Technique: Sibling Page Comparison
Same
44
Interpretation Technique: Sibling Page Comparison
Almost Same
45
Interpretation Technique: Sibling Page Comparison
Different
Same
46
Technique Details
  • Unnest tables
  • Match tables in sibling pages
  • Perfect match (table for layout → discard)
  • Reasonable match (sibling table)
  • Determine and use table-structure pattern
  • Discover pattern
  • Pattern usage
  • Dynamic pattern adjustment

47
Table Unnesting
48
Simple Tree Matching Algorithm
[Yang91]
Match Score Categorization: Exact/Near-Exact,
Sibling-Table, False
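
For readers who have not met it, simple tree matching finds the largest order-preserving match between two labeled trees with a dynamic program over their children. The sketch below follows that standard recursion. The tuple tree representation, the normalization by average tree size, and the cut-off values in `categorize` are assumptions for illustration, not values taken from the slides.

```python
def simple_tree_matching(a, b):
    """Size of the maximum matching between two ordered trees.

    A tree is a (label, [children]) tuple.
    """
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(kids_a[i - 1], kids_b[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1          # +1 for the matched roots


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def match_score(a, b):
    """Matching size normalized by average tree size (assumed normalization)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


def categorize(score, exact=0.95, sibling=0.5):
    """Bucket a score; both thresholds are placeholders, not published values."""
    if score >= exact:
        return "Exact/Near-Exact"
    return "Sibling-Table" if score >= sibling else "False"


row = ("tr", [("td", []), ("td", [])])
t1 = ("table", [row, row])
t2 = ("table", [row, ("tr", [("td", [])])])
```

Under this reading, a table that matches its counterpart exactly on every sibling page is likely layout and can be discarded, while a close but imperfect match suggests the same table holding different data.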
49
Table Structure Patterns
  • Regularity Expectations
  • (<tr><(td|th)> L <(td|th)> V)^n
  • <tr>(<(td|th)> L)^n (<tr>(<(td|th)> V)^n)+

Pattern combinations are also possible.
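
The two regularity patterns pair label cells (L) with value cells (V) either row by row or as a header row followed by record rows. The sketch below applies each reading to an already-unnested table. The class and function names are hypothetical, and the parser is deliberately minimal: it assumes simple, well-formed markup and no nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of td/th cells, grouped by tr."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def rows_of(html):
    """Parse a table into a list of rows of stripped cell strings."""
    parser = TableRows()
    parser.feed(html)
    return [[cell.strip() for cell in row] for row in parser.rows]


def label_value_rows(rows):
    """First pattern: every row holds one label cell and one value cell."""
    return [(row[0], row[1]) for row in rows if len(row) == 2]


def header_then_records(rows):
    """Second pattern: one row of n labels, then rows of n values."""
    labels, *records = rows
    return [dict(zip(labels, rec)) for rec in records if len(rec) == len(labels)]


vertical = ("<table><tr><th>Make</th><td>Nissan</td></tr>"
            "<tr><th>Year</th><td>1996</td></tr></table>")
horizontal = ("<table><tr><th>Make</th><th>Year</th></tr>"
              "<tr><td>Nissan</td><td>1996</td></tr>"
              "<tr><td>Jeep</td><td>2002</td></tr></table>")
```

Which cells are labels is exactly what sibling-page comparison is meant to reveal: labels stay the same across sibling pages while values change.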
50
Pattern Usage
(Location.Genetic Position) X:12.69 +/- 0.000 cM mapping data
(Location.Genomic Position) X:13518823..13515773 bp
51
Dynamic Pattern Adjustment
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)+
<tr>(<(td|th)> L)^5 (<tr>(<(td|th)> V)^5)+
<tr>(<(td|th)> L)^6 (<tr>(<(td|th)> V)^6)+
52
TISP Demo
53
TISP/FOCIH: Extraction Ontology Creation
  • Reverse engineer with TISP
  • Adjust with FOCIH
  • Data frames
  • Initialize lexicons with harvested data
  • Library of data frames: select and specialize

54
TISP/FOCIH: Extraction Ontology Creation
55
TISP/FOCIH: Extraction Ontology Creation
56
TISP/FOCIH: Extraction Ontology Creation
57
TISP/FOCIH: Extraction Ontology Creation
58
TISP/FOCIH: Extraction Ontology Creation
59
TISP/FOCIH: Extraction Ontology Creation
60
TANGO: Table Analysis for Generating Ontologies
  • Recognize and normalize table information
  • Construct mini-ontologies from tables
  • Discover inter-ontology mappings
  • Merge mini-ontologies into a growing ontology

61
Recognize Table Information

                                Religion
Country      Population         Albanian  Muslim  Roman     Shia    Sunni   other
             (July 2001 est.)   Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                     15      84      1
Albania      3,510,484          20        70      10
62
Construct Mini-Ontology
63
Discover Mappings
64
Merge
65
TANGO Demo
66
Semi-Automatic Annotation via Synergistic
Bootstrapping (Based on Nested Schemas with
Regular Expressions)
  • Build a page-layout, pattern-based annotator
  • Automate layout recognition based on examples
  • Auto-generate examples with extraction ontologies
  • Synergistically run pattern-based annotator
    and extraction-ontology annotator

67
PatML Editor
Information Structure Tree
Page Source Text
Browser-Rendered Page
69
Synergistic Execution
Extraction Ontology
Partially Annotated Document
Conceptual Annotator (ontology-based annotation)
Pattern Generation
Document
Layout Patterns
Annotated Document
Structural Annotator (layout-driven annotation)
70
Knowledge Bundles for Research Studies
True Story
To do a recent study about associations between
lung cancer and tp53 polymorphism, researchers
needed to (1) do a keyword-based search on the
SNP data repository for "tp53" within organism
"homo sapiens"; (2) from the returned records,
open each record page one by one and find those
coding SNPs that have a minor allele frequency
greater than 1%; (3) for each qualifying SNP,
record the SNP ID and many properties of the SNP;
(4) perform a keyword search in PubMed and skim
the hundreds of manuscripts found to determine
which manuscripts are related to the SNPs of
interest and fit their search criteria; and (5)
extract the information of interest (e.g., the
statistical information, patient information, and
treatment information) and organize it.
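
Step (2) is the kind of manual filtering that becomes a one-line query once the record pages have been annotated. The sketch below assumes hypothetical field names and placeholder records, and it reads the frequency threshold as 1% (0.01); none of the values are real SNP data.

```python
def qualifying_snps(records, min_maf=0.01):
    """Return IDs of coding SNPs whose minor allele frequency exceeds min_maf."""
    return [r["snp_id"] for r in records
            if r["coding"] and r["maf"] > min_maf]


# Placeholder records with made-up identifiers and frequencies.
records = [
    {"snp_id": "snp-A", "coding": True, "maf": 0.12},
    {"snp_id": "snp-B", "coding": True, "maf": 0.004},
    {"snp_id": "snp-C", "coding": False, "maf": 0.30},
]
selected = qualifying_snps(records)
```

The selected IDs would then drive steps (3) to (5): recording properties, searching the literature, and organizing what is extracted.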
71
Knowledge Bundles for Research Studies
(1) Search, (2) Filter, (3) Record information
72
Knowledge Bundles for Research Studies
(4) High precision literature search
73
Knowledge Bundles for Research Studies
(5) Extract and organize
74
Knowledge Bundles for Research Studies
75
Knowledge Bundles for Research Studies
Research Challenge: "I believe that a good
biomedical scenario would be to select a topic
which already [has a] large structured database
(gene extraction, vitamins, blood), and then
search for and find web pages that augment,
support or refute specific aspects of that
database." (GN)
76
Practicalities: Bootstrapping the WoK
(Future Work)
  • Won't just happen without sufficient content
  • Niche applications
  • Historical Data (e.g. Genealogy)
  • Bio-research studies
  • Local WoKs
  • Intra-organizational effort
  • Individual interests

77
Practicalities: Scalability
(Future Work)
  • Potential Rapid growth
  • Thousands of ontologies
  • Millions of simultaneous queries
  • Billions of annotated pages
  • Trillions of facts
  • Search-engine-like caching and query processing

78
Key to Success: Simplicity via Automation
  • Automatic (or near automatic) creation of
    extraction ontologies
  • Automatic (or near automatic) annotation of web
    pages
  • Simple but accurate query specification without
    specialized training

www.deg.byu.edu www.tango.byu.edu