Title: Machine Reading of Web Text
1. Machine Reading of Web Text
- Oren Etzioni
- Turing Center
- University of Washington
- http://turing.cs.washington.edu
2. Rorschach Test
3. Rorschach Test for CS
4. Moore's Law?
5. Storage Capacity?
6. Number of Web Pages?
7. Number of Facebook Users?
9. Turing Center Foci
- Scale MT to 49,000,000 language pairs
- 2,500,000 word translation graph
- P(V → F → C)?
- PanImages
- Accumulate knowledge from the Web
- A new paradigm for Web Search
10. Outline
- A New Paradigm for Search
- Open Information Extraction
- Tractable Inference
- Conclusions
11. Web Search in 2020?
- Type key words into a search box?
- Social or human-powered search?
- The Semantic Web?
- What about our technology exponentials?
- The best way to predict the future is to invent it!
12. Intelligent Search
- Instead of merely retrieving Web pages, read 'em!
- Machine Reading = Information Extraction (IE) + tractable inference
- IE(sentence) = who did what?
- speaker(Alon Halevy, UW)
- Inference: uncover implicit information
- Will Alon visit Seattle?
13. Application: Information Fusion
- What kills bacteria?
- What west-coast, nano-technology companies are hiring?
- Compare Obama's buzz versus Hillary's?
- What is a quiet, inexpensive, 4-star hotel in Vancouver?
14. Opinion Mining
- Opine (Popescu & Etzioni, EMNLP '05)
- IE(product reviews)
- Informative
- Abundant, but varied
- Textual
- Summarize reviews without any prior knowledge of the product category
17. But Reading the Web is Tough
- Traditional IE is narrow
- IE has been applied to small, homogeneous corpora
- No parser achieves high accuracy
- No named-entity taggers
- No supervised learning
- How about semi-supervised learning?
18. Semi-Supervised Learning
- Few hand-labeled examples per concept!
- → Limit on the number of concepts
- → Concepts are pre-specified
- → Problematic for the Web
- Alternative: self-supervised learning
- Learner discovers concepts on the fly
- Learner automatically labels examples
19. 2. Open IE = Self-Supervised IE (Banko, Cafarella, Soderland, et al., IJCAI '07)
20. Extractor Overview (Banko & Etzioni, '08)
- Use a simple model of relationships in English to label extractions
- Bootstrap a general model of relationships in English sentences, encoded as a CRF
- Decompose each sentence into one or more (NP1, VP, NP2) chunks
- Use the CRF model to retain the relevant parts of each NP and VP
- The extractor is relation-independent!
21. TextRunner Extraction
- Extract a triple representing a binary relation (Arg1, Relation, Arg2) from each sentence:
- "Internet powerhouse, EBay, was originally founded by Pierre Omidyar."
- (EBay, founded by, Pierre Omidyar); a toy pattern-based sketch follows
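To make the output form concrete, here is a minimal, pattern-based sketch of triple extraction. It is not TextRunner's CRF extractor; the `extract_triple` helper is hypothetical and hard-codes a single passive-voice pattern, purely to illustrate the (Arg1, Relation, Arg2) shape.

```python
import re

def extract_triple(sentence):
    """Toy sketch of Open IE triple extraction (not TextRunner's CRF).
    Matches one passive pattern and drops non-essential adverbs."""
    # <Arg1>, was [adverb] <verb> by <Arg2>
    pattern = re.compile(
        r"(?P<arg1>[A-Z]\w+),? was (?:\w+ly )?(?P<rel>\w+ by) "
        r"(?P<arg2>[A-Z]\w+(?: [A-Z]\w+)*)"
    )
    m = pattern.search(sentence)
    return (m.group("arg1"), m.group("rel"), m.group("arg2")) if m else None

print(extract_triple(
    "Internet powerhouse, EBay, was originally founded by Pierre Omidyar."))
# -> ('EBay', 'founded by', 'Pierre Omidyar')
```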
22. Numerous Extraction Challenges
- Drop non-essential info:
- "was originally founded by" → "founded by"
- Retain key distinctions:
- "EBay founded by Pierre" ≠ "EBay founded Pierre"
- Non-verb relationships:
- "George Bush, president of the U.S."
- Synonymy & aliasing:
- "Albert Einstein" = "Einstein" ≠ "Einstein Bros."
23. TextRunner (the Web's 1st Open IE System)
- Self-Supervised Learner: automatically labels example extractions & learns an extractor
- Single-Pass Extractor: makes a single pass over the corpus, identifying extractions in each sentence
- Query Processor: indexes extractions → enables queries at interactive speeds (a toy index is sketched below)
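A toy sketch of the third component, assuming nothing beyond the triple format above: an inverted index from tokens to triples, so keyword lookups return matching extractions without rescanning the corpus. The class and method names are hypothetical, not TextRunner's actual implementation.

```python
from collections import defaultdict

class TripleIndex:
    """Toy inverted index over (arg1, relation, arg2) extractions."""

    def __init__(self):
        self.postings = defaultdict(set)   # token -> set of triple ids
        self.triples = []

    def add(self, triple):
        tid = len(self.triples)
        self.triples.append(triple)
        for field in triple:
            for token in field.lower().split():
                self.postings[token].add(tid)

    def query(self, *tokens):
        # Intersect the posting lists of all query tokens
        ids = set.intersection(*(self.postings[t.lower()] for t in tokens))
        return [self.triples[i] for i in sorted(ids)]

idx = TripleIndex()
idx.add(("EBay", "founded by", "Pierre Omidyar"))
idx.add(("Oppenheimer", "taught at", "Berkeley"))
print(idx.query("founded", "ebay"))
# -> [('EBay', 'founded by', 'Pierre Omidyar')]
```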
24. TextRunner Demo
27. Sample of 9 Million Web Pages
- Concrete facts: (Oppenheimer, taught at, Berkeley)
- Abstract facts: (fruit, contain, vitamins)
28. 3. Tractable Inference
- Much textual information is implicit
- Entity and predicate resolution
- Probability of correctness
- Composing facts to draw conclusions
29. I. Entity Resolution
- Resolver (Yates & Etzioni, HLT '07) determines synonymy based on relations found by TextRunner (cf. Pantel & Lin '01)
- (X, born in, 1941) and (M, born in, 1941)
- (X, citizen of, US) and (M, citizen of, US)
- (X, friend of, Joe) vs. (M, friend of, Mary)
- P(X = M) ∝ shared relations (a toy scoring sketch follows)
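A minimal sketch of the shared-relations signal, assuming facts are (arg1, relation, arg2) string triples. The real Resolver model is probabilistic; this version only counts the overlap that the probability is based on.

```python
def shared_relation_score(facts, x, m):
    """Count the (relation, other-argument) properties that strings x
    and m share; Resolver turns this overlap into P(x = m)."""
    props = lambda e: {(r, a2) for (a1, r, a2) in facts if a1 == e}
    return len(props(x) & props(m))

facts = [("X", "born in", "1941"), ("M", "born in", "1941"),
         ("X", "citizen of", "US"), ("M", "citizen of", "US"),
         ("X", "friend of", "Joe"), ("M", "friend of", "Mary")]
print(shared_relation_score(facts, "X", "M"))  # 2 shared properties
```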
30. Relation Synonymy
- (1, R, 2), (2, R, 4), (4, R, 8), etc.
- (1, R', 2), (2, R', 4), (4, R', 8), etc.
- P(R = R') ∝ shared argument pairs (sketched below)
- Unsupervised probabilistic model
- O(N log N) algorithm, run on millions of docs
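The same idea applied to relations, under the same assumed triple format: count the argument pairs that two relation strings share.

```python
def shared_argpair_score(facts, r1, r2):
    """Count the (arg1, arg2) pairs two relation strings share;
    Resolver turns this overlap into P(r1 = r2)."""
    pairs = lambda r: {(a1, a2) for (a1, rel, a2) in facts if rel == r}
    return len(pairs(r1) & pairs(r2))

facts = [("1", "R", "2"), ("2", "R", "4"), ("4", "R", "8"),
         ("1", "R'", "2"), ("2", "R'", "4"), ("4", "R'", "8")]
print(shared_argpair_score(facts, "R", "R'"))  # 3 shared argument pairs
```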
31. II. Probability of Correctness
- How likely is an extraction to be correct?
- Factors to consider include
- Authoritativeness of source
- Confidence in extraction method
- Number of independent extractions
32. Counting Extractions
- Lexico-syntactic patterns (Hearst '92)
- "cities such as Seattle, Boston, and ..."
- Turney's PMI-IR (ACL '02)
- PMI from co-occurrence frequency → search-engine result counts
- Result counts → confidence in class membership (illustrated below)
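A sketch of the PMI computation from hit counts. The numbers below are made up, and treating hit counts over n documents as probability estimates is the PMI-IR approximation, not an exact probability.

```python
from math import log

def pmi(hits_joint, hits_x, hits_pattern, n):
    """PMI between a candidate x and a discriminator phrase, estimated
    from hit counts over n documents (the PMI-IR approximation)."""
    return log((hits_joint / n) / ((hits_x / n) * (hits_pattern / n)))

# Illustrative counts only, not real search-engine numbers:
print(pmi(hits_joint=1200, hits_x=50_000, hits_pattern=900_000, n=10**9))
# positive PMI -> the candidate co-occurs with the pattern more than chance
```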
33. Formal Problem Statement
- If an extraction x appears k times in a set of n distinct sentences, each suggesting that x belongs to C, what is the probability that x ∈ C?
- C is a class (cities) or a relation (mayor of)
- Note: we only count distinct sentences!
34. Combinatorial Model (Urns)
Odds increase exponentially with k, but decrease exponentially with n. See Downey et al.'s IJCAI '05 paper for formal details; a simplified version is sketched below.
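A deliberately simplified, two-Bernoulli sketch of the Urns intuition. The parameters p, q, and the prior odds are illustrative assumptions; the actual model integrates over Zipf-distributed label frequencies.

```python
def urns_prob(k, n, p=0.9, q=0.1, prior_odds=1/9):
    """Simplified sketch of the Urns idea: if correct members of C match
    a suggesting sentence with probability p and errors with probability
    q < p, then after k matches in n distinct sentences

        odds(x in C) = prior_odds * (p/q)**k * ((1-p)/(1-q))**(n-k)

    so the odds grow exponentially in k and shrink exponentially in n,
    matching the slide's summary."""
    odds = prior_odds * (p / q) ** k * ((1 - p) / (1 - q)) ** (n - k)
    return odds / (1 + odds)   # convert odds to a probability

print(urns_prob(k=5, n=6))    # strong repeated support -> ~0.999
print(urns_prob(k=1, n=40))   # one hit in many chances -> ~0
```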
35. Performance (15x Improvement)
Self-supervised, domain-independent method
36. Urns Is Limited on Sparse Facts
37. Language Models to the Rescue (Downey, Schoenmackers, & Etzioni, ACL '07)
- Instead of only lexico-syntactic patterns, leverage all contexts of a particular entity
- Statistical type check:
- Does Pickerington behave like a city?
- Does Shaver behave like a mayor?
- Language model = HMM (built once per corpus)
- Project each string to a point in 20-dimensional space
- Measure proximity of Pickerington to Seattle, Boston, etc. (a toy proximity check follows)
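A toy version of the proximity check, assuming each string has already been mapped to a distribution over latent HMM states (the 20-dimensional projection in the talk; here, 5 states and made-up counts).

```python
import numpy as np

def state_distribution(counts):
    """Normalize latent-state counts for a string into a distribution."""
    v = np.asarray(counts, dtype=float)
    return v / v.sum()

def proximity(p, q, eps=1e-9):
    """Symmetrized KL divergence between two state distributions;
    smaller means the two strings behave alike in context."""
    p, q = p + eps, q + eps
    return float(0.5 * (np.sum(p * np.log(p / q)) +
                        np.sum(q * np.log(q / p))))

# Made-up 5-state counts (the talk uses 20 dimensions):
seattle      = state_distribution([40, 5, 1, 1, 1])
pickerington = state_distribution([35, 8, 2, 1, 1])
microsoft    = state_distribution([2, 3, 30, 10, 5])

print(proximity(pickerington, seattle))  # small: behaves like a city
print(proximity(microsoft, seattle))     # large: does not
```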
38. III. Compositional Inference (work in progress; Schoenmackers, Etzioni, & Weld)
- Implicit information
- TextRunner: (Turing, born in, London)
- WordNet: (London, part of, England)
- Rule: born in is transitive through part of
- Conclusion: (Turing, born in, England)
- Mechanism: an MLN instantiated on the fly (a toy forward-chaining version follows)
- Rules learned from corpus (future work)
- Inference Demo
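A deterministic, one-rule forward-chaining sketch of the example above. The actual system instantiates a Markov Logic Network on the fly, so probabilities are omitted here; this only shows the rule firing.

```python
def infer_born_in(facts):
    """Apply born_in(X, Y) and part_of(Y, Z) => born_in(X, Z)
    to closure over a set of (predicate, arg1, arg2) facts."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, y) in [f[1:] for f in facts if f[0] == "born_in"]:
            for (y2, z) in [f[1:] for f in facts if f[0] == "part_of"]:
                if y == y2 and ("born_in", x, z) not in facts:
                    facts.add(("born_in", x, z))
                    changed = True
    return facts

kb = {("born_in", "Turing", "London"),     # from TextRunner
      ("part_of", "London", "England")}    # from WordNet
print(("born_in", "Turing", "England") in infer_born_in(kb))  # True
```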
39. KnowItAll Family Tree
- Mulder '01
- WebKB '99
- PMI-IR '01
- KnowItAll '04
- Opine '05
- BE '05
- Urns
- Woodward '06
- KnowItNow '05
- Resolver '07
- TextRunner '07
- REALM '07
- Inference '08
40. KnowItAll Team
- Michele Banko
- Michael Cafarella
- Doug Downey
- Alan Ritter
- Dr. Stephen Soderland
- Stefan Schoenmackers
- Prof. Dan Weld
- Mausam
- Alumni: Dr. Ana-Maria Popescu, Dr. Alex Yates, and others
41. Related Work
- Sekine's preemptive IE
- Powerset
- Textual Entailment
- AAAI '07 Symposium on Machine Reading
- Growing body of work on IE from the Web
42. 4. Conclusions
- Imagine search systems that operate over a (more) semantic space:
- Key words, documents → extractions
- TF-IDF, PageRank → relational models
- Web pages, hyperlinks → entities, relations
- Reading the Web → new search paradigm
43. Thank You
44. Machine Reading = Unsupervised Understanding of Text
- Much is implicit → tractable inference is key!
45. HMM in More Detail
Training: seek to maximize the probability of the corpus w given latent states t, using EM (a forward-algorithm sketch follows the diagram).
[Diagram: HMM lattice with latent states t_i, t_{i+1}, ..., t_{i+4} emitting the words w_i, ..., w_{i+4}; example phrase: "cities such as Los Angeles".]
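For concreteness, here is a standard scaled forward-algorithm sketch of the quantity EM pushes up: the log-likelihood of the observed words. The 20-state size matches the talk, while the vocabulary size and the parameters below are random stand-ins, not trained values.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a word-id sequence under an HMM via the scaled
    forward algorithm. pi: initial state probs (T,); A: transitions
    (T, T); B: emissions (T, V). EM training adjusts pi, A, B to raise
    this quantity summed over the corpus."""
    alpha = pi * B[:, obs[0]]
    logp = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for w in obs[1:]:
        alpha = (alpha @ A) * B[:, w]   # propagate, then weight by emission
        logp += np.log(alpha.sum())
        alpha = alpha / alpha.sum()     # rescale to prevent underflow
    return logp

rng = np.random.default_rng(0)
T, V = 20, 1000                         # 20 latent states, toy vocabulary
pi = np.full(T, 1.0 / T)
A = rng.dirichlet(np.ones(T), size=T)   # random row-stochastic transitions
B = rng.dirichlet(np.ones(V), size=T)   # random emission distributions
print(forward_loglik([17, 42, 42, 7], pi, A, B))
```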
46. Using the HMM at Query Time
- Given a set of extractions (Arg1, Rln, Arg2):
- Seeds = the most frequent Args for Rln
- The distribution over t is read from the HMM
- Compute KL divergence via f(arg, seeds)
- For each extraction, average f over Arg1 & Arg2
- Sort sparse extractions in ascending order (a toy ranking sketch follows)
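A sketch of this ranking step, assuming a lookup table from strings to their HMM state distributions. The score follows the recipe above (KL divergence against the seeds' mean distribution, averaged over both arguments); the distributions below are made-up numbers.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL divergence with smoothing to avoid log(0)."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def rank_extractions(extractions, state_dist, seeds1, seeds2):
    """Score each (arg1, rel, arg2) by the average KL divergence between
    each argument's state distribution and the mean distribution of that
    slot's seeds; ascending sort puts well-typed triples first."""
    mean1 = np.mean([state_dist[s] for s in seeds1], axis=0)
    mean2 = np.mean([state_dist[s] for s in seeds2], axis=0)
    score = lambda t: 0.5 * (kl(state_dist[t[0]], mean1) +
                             kl(state_dist[t[2]], mean2))
    return sorted(extractions, key=score)

# Made-up 3-state distributions:
sd = {"Shaver": np.array([.7, .2, .1]), "Giuliani": np.array([.8, .1, .1]),
      "Spice Girls": np.array([.1, .2, .7]), "Seattle": np.array([.7, .2, .1]),
      "Pickerington": np.array([.6, .3, .1]), "Microsoft": np.array([.1, .1, .8])}
print(rank_extractions([("Spice Girls", "mayor of", "Microsoft"),
                        ("Shaver", "mayor of", "Pickerington")],
                       sd, seeds1=["Giuliani"], seeds2=["Seattle"]))
# The Shaver/Pickerington triple ranks first
```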
47. Language Modeling + Open IE
- Self-supervised
- Illuminating phrases → full context
- Handles sparse extractions
48. Focus Open IE on Web Text
- Advantages:
- Semantically tractable sentences
- Redundancy
- Search engines
- Challenges:
- Difficult, ungrammatical sentences
- Unreliable information
- Heterogeneous corpus
49. II. Probability of Correctness
- How likely is an extraction to be correct?
- Distributional Hypothesis: words that occur in the same contexts tend to have similar meanings
- KnowItAll Hypothesis: extractions that occur more frequently in the same informative contexts are more likely to be correct
50. Argument Type Checking via HMM
- Relations' arguments are typed:
- (Person, Mayor Of, City)
- Training: model the distribution of Person & City contexts in the corpus (Distributional Hypothesis)
- Query time: rank sparse triples by how well each argument's context distribution matches that of its type
51. Silly Example
- Prefer (Shaver, Mayor of, Pickerington) over (Spice Girls, Mayor of, Microsoft)
- Because:
- Shaver's contexts are more like other mayors' than the Spice Girls', and
- Pickerington's contexts are more like other cities' than Microsoft's
52. Utilizing HMMs to Check Types
- Challenges:
- Argument types are not known
- Can't build a model for each argument type
- Textual types are fuzzy
- Solution: train an HMM for the corpus using EM & bootstrapping
- REALM improves precision by 90%
53. Query: Was Turing born in England?
[Flowchart: inference walkthrough]
- Knowledge bases: TextRunner, WordNet
- Query formula: BornIn(Turing, England)?
- Inference rules: e.g., BornIn(X, city) → BornIn(X, country)
- Find best KB query:
- TextRunner: "Turing born in X"
- WordNet: "X is in England"
- Run query; query results:
- "Turing was born in London" → BornIn(Turing, London)
- "London is in England" → In(London, England)
- Find implied nodes & cliques: BornIn(Turing, London) and In(London, England) imply BornIn(Turing, England)
- Result: Yes! Turing was born in England!