Title: Machine Reading of Web Text
1. Machine Reading of Web Text
- Oren Etzioni
- Turing Center
- University of Washington
- http://turing.cs.washington.edu
2. Rorschach Test
3. Rorschach Test for CS
4. Moore's Law?
5. Storage Capacity?
6. Number of Web Pages?
7. Number of Facebook Users?
9. Turing Center Foci
- Scale MT to 49,000,000 language pairs
- 2,500,000 word translation graph
- P(V → F → C)?
- PanImages
- Accumulate knowledge from the Web
- A new paradigm for Web Search
10. Outline
- A New Paradigm for Search
- Open Information Extraction
- Tractable Inference
- Conclusions
11. Web Search in 2020?
- Type key words into a search box?
- Social or human-powered search?
- The Semantic Web?
- What about our technology exponentials?
- The best way to predict the future is to invent it!
12. Intelligent Search
- Instead of merely retrieving Web pages, read 'em!
- Machine Reading = Information Extraction (IE) + tractable inference
- IE(sentence) = who did what?
- speaker(Alon Halevy, UW)
- Inference: uncover implicit information
- Will Alon visit Seattle?
13. Application: Information Fusion
- What kills bacteria?
- What west-coast, nano-technology companies are hiring?
- Compare Obama's buzz versus Hillary's?
- What is a quiet, inexpensive, 4-star hotel in Vancouver?
14. Opinion Mining
- Opine (Popescu & Etzioni, EMNLP '05)
- IE(product reviews)
- Informative
- Abundant, but varied
- Textual
- Summarize reviews without any prior knowledge of the product category
17. But Reading the Web is Tough
- Traditional IE is narrow
- IE has been applied to small, homogeneous corpora
- No parser achieves high accuracy
- No named-entity taggers
- No supervised learning
- How about semi-supervised learning?
18. Semi-Supervised Learning
- Few hand-labeled examples per concept!
- → Limit on the number of concepts
- → Concepts are pre-specified
- → Problematic for the Web
- Alternative: self-supervised learning
- Learner discovers concepts on the fly
- Learner automatically labels examples
19. 2. Open IE = Self-Supervised IE (Banko, Cafarella, Soderland, et al., IJCAI '07)
20. Extractor Overview (Banko & Etzioni, '08)
- Use a simple model of relationships in English to label extractions
- Bootstrap a general model of relationships in English sentences, encoded as a CRF
- Decompose each sentence into one or more (NP1, VP, NP2) chunks
- Use the CRF model to retain the relevant parts of each NP and VP
- The extractor is relation-independent!
21. TextRunner Extraction
- Extract a triple representing a binary relation (Arg1, Relation, Arg2) from each sentence:
- "Internet powerhouse, EBay, was originally founded by Pierre Omidyar."
- (EBay, founded by, Pierre Omidyar); a toy pattern-based sketch follows
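To make the output form concrete, here is a minimal, pattern-based sketch of triple extraction. It is not TextRunner's CRF extractor; the `extract_triple` helper is hypothetical and hard-codes a single passive-voice pattern, purely to illustrate the (Arg1, Relation, Arg2) shape.

```python
import re

def extract_triple(sentence):
    """Toy sketch of Open IE triple extraction (not TextRunner's CRF).
    Matches one passive pattern and drops non-essential adverbs."""
    # <Arg1>, was [adverb] <verb> by <Arg2>
    pattern = re.compile(
        r"(?P<arg1>[A-Z]\w+),? was (?:\w+ly )?(?P<rel>\w+ by) "
        r"(?P<arg2>[A-Z]\w+(?: [A-Z]\w+)*)"
    )
    m = pattern.search(sentence)
    return (m.group("arg1"), m.group("rel"), m.group("arg2")) if m else None

print(extract_triple(
    "Internet powerhouse, EBay, was originally founded by Pierre Omidyar."))
# -> ('EBay', 'founded by', 'Pierre Omidyar')
```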
22. Numerous Extraction Challenges
- Drop non-essential info:
- "was originally founded by" → "founded by"
- Retain key distinctions:
- "EBay founded by Pierre" ≠ "EBay founded Pierre"
- Non-verb relationships:
- "George Bush, president of the U.S."
- Synonymy & aliasing:
- "Albert Einstein" = "Einstein" ≠ "Einstein Bros."
23. TextRunner (the Web's 1st Open IE System)
- Self-Supervised Learner: automatically labels example extractions & learns an extractor
- Single-Pass Extractor: makes a single pass over the corpus, identifying extractions in each sentence
- Query Processor: indexes extractions → enables queries at interactive speeds (a toy index is sketched below)
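A toy sketch of the third component, assuming nothing beyond the triple format above: an inverted index from tokens to triples, so keyword lookups return matching extractions without rescanning the corpus. The class and method names are hypothetical, not TextRunner's actual implementation.

```python
from collections import defaultdict

class TripleIndex:
    """Toy inverted index over (arg1, relation, arg2) extractions."""

    def __init__(self):
        self.postings = defaultdict(set)   # token -> set of triple ids
        self.triples = []

    def add(self, triple):
        tid = len(self.triples)
        self.triples.append(triple)
        for field in triple:
            for token in field.lower().split():
                self.postings[token].add(tid)

    def query(self, *tokens):
        # Intersect the posting lists of all query tokens
        ids = set.intersection(*(self.postings[t.lower()] for t in tokens))
        return [self.triples[i] for i in sorted(ids)]

idx = TripleIndex()
idx.add(("EBay", "founded by", "Pierre Omidyar"))
idx.add(("Oppenheimer", "taught at", "Berkeley"))
print(idx.query("founded", "ebay"))
# -> [('EBay', 'founded by', 'Pierre Omidyar')]
```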
24. TextRunner Demo
27. Sample of 9 Million Web Pages
- Concrete facts: (Oppenheimer, taught at, Berkeley)
- Abstract facts: (fruit, contain, vitamins)
28. 3. Tractable Inference
- Much textual information is implicit
- Entity and predicate resolution
- Probability of correctness
- Composing facts to draw conclusions
29. I. Entity Resolution
- Resolver (Yates & Etzioni, HLT '07) determines synonymy based on relations found by TextRunner (cf. Pantel & Lin '01)
- (X, born in, 1941) and (M, born in, 1941)
- (X, citizen of, US) and (M, citizen of, US)
- (X, friend of, Joe) vs. (M, friend of, Mary)
- P(X = M) ∝ shared relations (a toy scoring sketch follows)
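A minimal sketch of the shared-relations signal, assuming facts are (arg1, relation, arg2) string triples. The real Resolver model is probabilistic; this version only counts the overlap that the probability is based on.

```python
def shared_relation_score(facts, x, m):
    """Count the (relation, other-argument) properties that strings x
    and m share; Resolver turns this overlap into P(x = m)."""
    props = lambda e: {(r, a2) for (a1, r, a2) in facts if a1 == e}
    return len(props(x) & props(m))

facts = [("X", "born in", "1941"), ("M", "born in", "1941"),
         ("X", "citizen of", "US"), ("M", "citizen of", "US"),
         ("X", "friend of", "Joe"), ("M", "friend of", "Mary")]
print(shared_relation_score(facts, "X", "M"))  # 2 shared properties
```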
30. Relation Synonymy
- (1, R, 2), (2, R, 4), (4, R, 8), etc.
- (1, R', 2), (2, R', 4), (4, R', 8), etc.
- P(R = R') ∝ shared argument pairs (sketched below)
- Unsupervised probabilistic model
- O(N log N) algorithm, run on millions of docs
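The same idea applied to relations, under the same assumed triple format: count the argument pairs that two relation strings share.

```python
def shared_argpair_score(facts, r1, r2):
    """Count the (arg1, arg2) pairs two relation strings share;
    Resolver turns this overlap into P(r1 = r2)."""
    pairs = lambda r: {(a1, a2) for (a1, rel, a2) in facts if rel == r}
    return len(pairs(r1) & pairs(r2))

facts = [("1", "R", "2"), ("2", "R", "4"), ("4", "R", "8"),
         ("1", "R'", "2"), ("2", "R'", "4"), ("4", "R'", "8")]
print(shared_argpair_score(facts, "R", "R'"))  # 3 shared argument pairs
```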
31. II. Probability of Correctness
- How likely is an extraction to be correct?
- Factors to consider include
- Authoritativeness of source
- Confidence in extraction method
- Number of independent extractions
32. Counting Extractions
- Lexico-syntactic patterns (Hearst '92)
- "cities such as Seattle, Boston, and ..."
- Turney's PMI-IR (ACL '02)
- PMI from co-occurrence frequency → search-engine result counts
- Result counts → confidence in class membership (illustrated below)
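A sketch of the PMI computation from hit counts. The numbers below are made up, and treating hit counts over n documents as probability estimates is the PMI-IR approximation, not an exact probability.

```python
from math import log

def pmi(hits_joint, hits_x, hits_pattern, n):
    """PMI between a candidate x and a discriminator phrase, estimated
    from hit counts over n documents (the PMI-IR approximation)."""
    return log((hits_joint / n) / ((hits_x / n) * (hits_pattern / n)))

# Illustrative counts only, not real search-engine numbers:
print(pmi(hits_joint=1200, hits_x=50_000, hits_pattern=900_000, n=10**9))
# positive PMI -> the candidate co-occurs with the pattern more than chance
```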
33. Formal Problem Statement
- If an extraction x appears k times in a set of n distinct sentences, each suggesting that x belongs to C, what is the probability that x ∈ C?
- C is a class (cities) or a relation (mayor of)
- Note: we only count distinct sentences!
34. Combinatorial Model (Urns)
Odds increase exponentially with k, but decrease exponentially with n. See Downey et al.'s IJCAI '05 paper for formal details; a simplified version is sketched below.
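A deliberately simplified, two-Bernoulli sketch of the Urns intuition. The parameters p, q, and the prior odds are illustrative assumptions; the actual model integrates over Zipf-distributed label frequencies.

```python
def urns_prob(k, n, p=0.9, q=0.1, prior_odds=1/9):
    """Simplified sketch of the Urns idea: if correct members of C match
    a suggesting sentence with probability p and errors with probability
    q < p, then after k matches in n distinct sentences

        odds(x in C) = prior_odds * (p/q)**k * ((1-p)/(1-q))**(n-k)

    so the odds grow exponentially in k and shrink exponentially in n,
    matching the slide's summary."""
    odds = prior_odds * (p / q) ** k * ((1 - p) / (1 - q)) ** (n - k)
    return odds / (1 + odds)   # convert odds to a probability

print(urns_prob(k=5, n=6))    # strong repeated support -> ~0.999
print(urns_prob(k=1, n=40))   # one hit in many chances -> ~0
```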
35. Performance (15x Improvement)
Self-supervised, domain-independent method
36. Urns Is Limited on Sparse Facts
37. Language Models to the Rescue (Downey, Schoenmackers, & Etzioni, ACL '07)
- Instead of only lexico-syntactic patterns, leverage all contexts of a particular entity
- Statistical type check:
- Does Pickerington behave like a city?
- Does Shaver behave like a mayor?
- Language model = HMM (built once per corpus)
- Project each string to a point in 20-dimensional space
- Measure proximity of Pickerington to Seattle, Boston, etc. (a toy proximity check follows)
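A toy version of the proximity check, assuming each string has already been mapped to a distribution over latent HMM states (the 20-dimensional projection in the talk; here, 5 states and made-up counts).

```python
import numpy as np

def state_distribution(counts):
    """Normalize latent-state counts for a string into a distribution."""
    v = np.asarray(counts, dtype=float)
    return v / v.sum()

def proximity(p, q, eps=1e-9):
    """Symmetrized KL divergence between two state distributions;
    smaller means the two strings behave alike in context."""
    p, q = p + eps, q + eps
    return float(0.5 * (np.sum(p * np.log(p / q)) +
                        np.sum(q * np.log(q / p))))

# Made-up 5-state counts (the talk uses 20 dimensions):
seattle      = state_distribution([40, 5, 1, 1, 1])
pickerington = state_distribution([35, 8, 2, 1, 1])
microsoft    = state_distribution([2, 3, 30, 10, 5])

print(proximity(pickerington, seattle))  # small: behaves like a city
print(proximity(microsoft, seattle))     # large: does not
```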
38. III. Compositional Inference (work in progress; Schoenmackers, Etzioni, & Weld)
- Implicit information
- TextRunner: (Turing, born in, London)
- WordNet: (London, part of, England)
- Rule: born in is transitive through part of
- Conclusion: (Turing, born in, England)
- Mechanism: an MLN instantiated on the fly (a toy forward-chaining version follows)
- Rules learned from corpus (future work)
- Inference Demo
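A deterministic, one-rule forward-chaining sketch of the example above. The actual system instantiates a Markov Logic Network on the fly, so probabilities are omitted here; this only shows the rule firing.

```python
def infer_born_in(facts):
    """Apply born_in(X, Y) and part_of(Y, Z) => born_in(X, Z)
    to closure over a set of (predicate, arg1, arg2) facts."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, y) in [f[1:] for f in facts if f[0] == "born_in"]:
            for (y2, z) in [f[1:] for f in facts if f[0] == "part_of"]:
                if y == y2 and ("born_in", x, z) not in facts:
                    facts.add(("born_in", x, z))
                    changed = True
    return facts

kb = {("born_in", "Turing", "London"),     # from TextRunner
      ("part_of", "London", "England")}    # from WordNet
print(("born_in", "Turing", "England") in infer_born_in(kb))  # True
```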
39. KnowItAll Family Tree
- Mulder '01
- WebKB '99
- PMI-IR '01
- KnowItAll '04
- Opine '05
- BE '05
- Urns
- Woodward '06
- KnowItNow '05
- Resolver '07
- TextRunner '07
- REALM '07
- Inference '08
40. KnowItAll Team
- Michele Banko
- Michael Cafarella
- Doug Downey
- Alan Ritter
- Dr. Stephen Soderland
- Stefan Schoenmackers
- Prof. Dan Weld
- Mausam
- Alumni: Dr. Ana-Maria Popescu, Dr. Alex Yates, and others
41. Related Work
- Sekine's preemptive IE
- Powerset
- Textual Entailment
- AAAI '07 Symposium on Machine Reading
- Growing body of work on IE from the Web
42. 4. Conclusions
- Imagine search systems that operate over a (more) semantic space:
- Key words, documents → extractions
- TF-IDF, PageRank → relational models
- Web pages, hyperlinks → entities, relations
- Reading the Web → new search paradigm
43. Thank You
44. Machine Reading = Unsupervised Understanding of Text
- Much is implicit → tractable inference is key!
45. HMM in More Detail
Training: seek to maximize the probability of the corpus w given latent states t, using EM (a forward-algorithm sketch follows the diagram).
[Diagram: HMM lattice with latent states t_i, t_{i+1}, ..., t_{i+4} emitting the words w_i, ..., w_{i+4}; example phrase: "cities such as Los Angeles".]
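For concreteness, here is a standard scaled forward-algorithm sketch of the quantity EM pushes up: the log-likelihood of the observed words. The 20-state size matches the talk, while the vocabulary size and the parameters below are random stand-ins, not trained values.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a word-id sequence under an HMM via the scaled
    forward algorithm. pi: initial state probs (T,); A: transitions
    (T, T); B: emissions (T, V). EM training adjusts pi, A, B to raise
    this quantity summed over the corpus."""
    alpha = pi * B[:, obs[0]]
    logp = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for w in obs[1:]:
        alpha = (alpha @ A) * B[:, w]   # propagate, then weight by emission
        logp += np.log(alpha.sum())
        alpha = alpha / alpha.sum()     # rescale to prevent underflow
    return logp

rng = np.random.default_rng(0)
T, V = 20, 1000                         # 20 latent states, toy vocabulary
pi = np.full(T, 1.0 / T)
A = rng.dirichlet(np.ones(T), size=T)   # random row-stochastic transitions
B = rng.dirichlet(np.ones(V), size=T)   # random emission distributions
print(forward_loglik([17, 42, 42, 7], pi, A, B))
```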
46. Using the HMM at Query Time
- Given a set of extractions (Arg1, Rln, Arg2):
- Seeds = the most frequent Args for Rln
- The distribution over t is read from the HMM
- Compute KL divergence via f(arg, seeds)
- For each extraction, average f over Arg1 & Arg2
- Sort sparse extractions in ascending order (a toy ranking sketch follows)
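A sketch of this ranking step, assuming a lookup table from strings to their HMM state distributions. The score follows the recipe above (KL divergence against the seeds' mean distribution, averaged over both arguments); the distributions below are made-up numbers.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL divergence with smoothing to avoid log(0)."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def rank_extractions(extractions, state_dist, seeds1, seeds2):
    """Score each (arg1, rel, arg2) by the average KL divergence between
    each argument's state distribution and the mean distribution of that
    slot's seeds; ascending sort puts well-typed triples first."""
    mean1 = np.mean([state_dist[s] for s in seeds1], axis=0)
    mean2 = np.mean([state_dist[s] for s in seeds2], axis=0)
    score = lambda t: 0.5 * (kl(state_dist[t[0]], mean1) +
                             kl(state_dist[t[2]], mean2))
    return sorted(extractions, key=score)

# Made-up 3-state distributions:
sd = {"Shaver": np.array([.7, .2, .1]), "Giuliani": np.array([.8, .1, .1]),
      "Spice Girls": np.array([.1, .2, .7]), "Seattle": np.array([.7, .2, .1]),
      "Pickerington": np.array([.6, .3, .1]), "Microsoft": np.array([.1, .1, .8])}
print(rank_extractions([("Spice Girls", "mayor of", "Microsoft"),
                        ("Shaver", "mayor of", "Pickerington")],
                       sd, seeds1=["Giuliani"], seeds2=["Seattle"]))
# The Shaver/Pickerington triple ranks first
```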
47. Language Modeling + Open IE
- Self-supervised
- Illuminating phrases → full context
- Handles sparse extractions
48. Focus Open IE on Web Text
- Advantages:
- Semantically tractable sentences
- Redundancy
- Search engines
- Challenges:
- Difficult, ungrammatical sentences
- Unreliable information
- Heterogeneous corpus
49. II. Probability of Correctness
- How likely is an extraction to be correct?
- Distributional Hypothesis: words that occur in the same contexts tend to have similar meanings
- KnowItAll Hypothesis: extractions that occur more frequently in the same informative contexts are more likely to be correct
50. Argument Type Checking via HMM
- Relations' arguments are typed:
- (Person, Mayor Of, City)
- Training: model the distribution of Person & City contexts in the corpus (Distributional Hypothesis)
- Query time: rank sparse triples by how well each argument's context distribution matches that of its type
51. Silly Example
- Prefer (Shaver, Mayor of, Pickerington) over (Spice Girls, Mayor of, Microsoft)
- Because:
- Shaver's contexts are more like other mayors' than the Spice Girls', and
- Pickerington's contexts are more like other cities' than Microsoft's
52. Utilizing HMMs to Check Types
- Challenges:
- Argument types are not known
- Can't build a model for each argument type
- Textual types are fuzzy
- Solution: train an HMM for the corpus using EM & bootstrapping
- REALM improves precision by 90%
53. Query: Was Turing born in England?
[Flowchart: inference walkthrough]
- Knowledge bases: TextRunner, WordNet
- Query formula: BornIn(Turing, England)?
- Inference rules: e.g., BornIn(X, city) → BornIn(X, country)
- Find best KB query:
- TextRunner: "Turing born in X"
- WordNet: "X is in England"
- Run query; query results:
- "Turing was born in London" → BornIn(Turing, London)
- "London is in England" → In(London, England)
- Find implied nodes & cliques: BornIn(Turing, London) and In(London, England) imply BornIn(Turing, England)
- Result: Yes! Turing was born in England!