Machine Reading of Web Text - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Machine Reading of Web Text


1
Machine Reading of Web Text
  • Oren Etzioni
  • Turing Center
  • University of Washington
  • http://turing.cs.washington.edu

2
Rorschach Test
3
Rorschach Test for CS
4
Moore's Law?
5
Storage Capacity?
6
Number of Web Pages?
7
Number of Facebook Users?
8
(No Transcript)
9
Turing Center Foci
  • Scale MT to 49,000,000 language pairs
  • 2,500,000 word translation graph
  • P(V → F → C)?
  • PanImages
  • Accumulate knowledge from the Web
  • A new paradigm for Web Search

10
Outline
  • A New Paradigm for Search
  • Open Information Extraction
  • Tractable Inference
  • Conclusions

11
Web Search in 2020?
  • Type key words into a search box?
  • Social or human powered Search?
  • The Semantic Web?
  • What about our technology exponentials?
  • The best way to predict the future is to invent
    it!

12
Intelligent Search
  • Instead of merely retrieving Web pages, read 'em!
  • Machine Reading = Information Extraction (IE) +
    tractable inference
  • IE(sentence) = who did what?
  • speaker(Alon Halevy, UW)
  • Inference = uncover implicit information
  • Will Alon visit Seattle?

13
Application: Information Fusion
  • What kills bacteria?
  • What west coast, nano-technology companies are
    hiring?
  • Compare Obama's 'buzz' versus Hillary's?
  • What is a quiet, inexpensive, 4-star hotel in
    Vancouver?

14

Opinion Mining
  • Opine (Popescu & Etzioni, EMNLP '05)
  • IE(product reviews)
  • Informative
  • Abundant, but varied
  • Textual
  • Summarize reviews without any prior knowledge of
    product category

15
(No Transcript)
16
(No Transcript)
17
But Reading the Web is Tough
  • Traditional IE is narrow
  • IE has been applied to small, homogeneous corpora
  • No parser achieves high accuracy
  • No named-entity taggers
  • No supervised learning
  • How about semi-supervised learning?

18
Semi-Supervised Learning
  • Few hand-labeled examples per concept!
  • ⇒ Limit on the number of concepts
  • ⇒ Concepts are pre-specified
  • ⇒ Problematic for the Web
  • Alternative: self-supervised learning
  • Learner discovers concepts on the fly
  • Learner automatically labels examples

19
2. Open IE = Self-supervised IE (Banko,
Cafarella, Soderland, et al., IJCAI '07)
20
Extractor Overview (Banko & Etzioni, '08)
  • Use a simple model of relationships in English to
    label extractions
  • Bootstrap a general model of relationships in
    English sentences, encoded as a CRF
  • Decompose each sentence into one or more (NP1,
    VP, NP2) chunks
  • Use CRF model to retain relevant parts of each NP
    and VP.
  • The extractor is relation-independent!

21
TextRunner Extraction
  • Extract Triple representing binary relation
    (Arg1, Relation, Arg2) from sentence.
  • Internet powerhouse, EBay, was originally founded
    by Pierre Omidyar.
  • (EBay, founded by, Pierre Omidyar)

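The (NP1, VP, NP2) decomposition above can be sketched in a few lines. This is a toy stand-in, not the actual system: TextRunner uses a relation-independent CRF trained by self-supervision, whereas the regex pattern and the helper name `extract_triple` here are purely illustrative assumptions.

```python
import re

def extract_triple(sentence):
    """Toy sketch of TextRunner-style triple extraction: find an
    (NP1, VP, NP2) pattern and keep only the essential relation
    phrase.  The real extractor is a learned CRF; this single
    hand-written regex is for illustration only."""
    # Hypothetical pattern: a capitalized NP, "was <adverb>?
    # <verb>ed by", then a capitalized NP.
    m = re.search(
        r"(\b[A-Z]\w+(?:\s\w+)?),?\s+was\s+(?:\w+ly\s+)?"
        r"(\w+ed\s+by)\s+([A-Z]\w+(?:\s[A-Z]\w+)*)",
        sentence)
    if not m:
        return None
    arg1, rel, arg2 = m.groups()  # drops "was originally"
    return (arg1, rel, arg2)

print(extract_triple(
    "Internet powerhouse, EBay, was originally founded by Pierre Omidyar."))
# → ('EBay', 'founded by', 'Pierre Omidyar')
```

Note how "was originally founded by" collapses to "founded by", matching the "drop non-essential info" challenge on the next slide.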
22
Numerous Extraction Challenges
  • Drop non-essential info:
  • was originally founded by → founded by
  • Retain key distinctions:
  • EBay founded by Pierre ≠ EBay founded Pierre
  • Non-verb relationships:
  • George Bush, president of the U.S.
  • Synonymy & aliasing:
  • Albert Einstein = Einstein ≠ Einstein Bros.

23
TextRunner (Web's 1st Open IE system)
  • Self-Supervised Learner: automatically labels
    example extractions and learns an extractor
  • Single-Pass Extractor: single pass over corpus,
    identifying extractions in each sentence
  • Query Processor: indexes extractions → enables
    queries at interactive speeds

24
TextRunner Demo
25
(No Transcript)
26
(No Transcript)
27
Sample of 9 million Web Pages
  • Concrete facts: (Oppenheimer, taught at,
    Berkeley)
  • Abstract facts: (fruit, contain, vitamins)

28
3. Tractable Inference
  • Much of textual information is implicit
  • Entity and predicate resolution
  • Probability of correctness
  • Composing facts to draw conclusions

29
I. Entity Resolution
  • Resolver (Yates & Etzioni, HLT '07) determines
    synonymy based on relations found by TextRunner
    (cf. Pantel & Lin '01)
  • (X, born in, 1941) (M, born in, 1941)
  • (X, citizen of, US) (M, citizen of, US)
  • (X, friend of, Joe) (M, friend of, Mary)
  • P(X = M) ∝ shared relations

30
Relation Synonymy
  • (1, R, 2)
  • (2, R, 4)
  • (4, R, 8)
  • Etc.
  • (1, R', 2)
  • (2, R', 4)
  • (4, R', 8)
  • Etc.
  • P(R = R') ∝ shared argument pairs
  • Unsupervised probabilistic model
  • O(N log N) algorithm run on millions of docs

31
II. Probability of Correctness
  • How likely is an extraction to be correct?
  • Factors to consider include
  • Authoritativeness of source
  • Confidence in extraction method
  • Number of independent extractions

32
Counting Extractions
  • Lexico-syntactic patterns (Hearst '92)
  • cities such as Seattle, Boston, and
  • Turney's PMI-IR (ACL '02)
  • PMI = co-occurrence frequency → # of results
  • # of results → confidence in class membership.

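The PMI-IR idea above reduces to a ratio of hit counts: how often a candidate appears inside an informative pattern, normalized by how often it appears at all. The function and all counts below are made-up illustrations of that ratio, not Turney's exact formulation.

```python
def pmi_score(hits_with_pattern, hits_alone):
    """Sketch of the PMI-IR signal (cf. Turney, ACL '02): hit count
    for a candidate co-occurring with a discriminator phrase such as
    "cities such as X", divided by the candidate's own hit count.
    A higher ratio → more confidence in class membership."""
    return hits_with_pattern / hits_alone

# Hypothetical search-engine hit counts (invented for illustration):
# "cities such as Seattle" is common relative to "Seattle" alone,
# while a non-city almost never appears in that pattern.
seattle_score = pmi_score(120_000, 9_000_000)
noncity_score = pmi_score(3, 2_000_000)
print(seattle_score > noncity_score)  # → True
```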
33
Formal Problem Statement
  • If an extraction x appears k times in a set of n
    distinct sentences each suggesting that x belongs
    to C, what is the probability that x ∈ C ?
  • C is a class (cities) or a relation (mayor
    of)
  • Note: we only count distinct sentences!

34
Combinatorial Model (Urns)
Odds increase exponentially with k, but decrease
exponentially with n. See Downey et al.'s IJCAI
'05 paper for formal details.
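A heavily simplified stand-in for the Urns model makes the k-vs-n behavior concrete: treat each of the n sentences as a draw that repeats a correct extraction with probability p_c and an erroneous one with probability p_e, then apply Bayes' rule. The parameters p_c, p_e, and the prior are illustrative assumptions; the real model integrates over frequency distributions of labels in the urn.

```python
def prob_in_class(k, n, p_c=0.9, p_e=0.1, prior=0.5):
    """Two-outcome sketch of the Urns idea (Downey et al., IJCAI
    '05): P(x in C | x extracted k times from n distinct
    sentences).  A correct extraction repeats with prob p_c per
    sentence, an error with prob p_e; binomial coefficients cancel
    in the posterior ratio, so they are omitted."""
    like_c = (p_c ** k) * ((1 - p_c) ** (n - k))   # x is correct
    like_e = (p_e ** k) * ((1 - p_e) ** (n - k))   # x is an error
    return (like_c * prior) / (like_c * prior + like_e * (1 - prior))

# Odds rise with k but fall with n, as the slide states:
print(prob_in_class(5, 5))    # 5 hits in 5 sentences: near 1
print(prob_in_class(5, 100))  # 5 hits in 100 sentences: near 0
```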
35
Performance (15x Improvement)
Self-supervised, domain-independent method
36
URNS is limited on sparse facts
37
Language Models to the Rescue (Downey,
Schoenmackers, & Etzioni, ACL '07)
  • Instead of only lexico-syntactic patterns,
    leverage all contexts of a particular entity
  • Statistical type check:
  • does Pickerington behave like a city?
  • does Shaver behave like a mayor?
  • Language model = HMM (built once per corpus)
  • Project string to point in 20-dimensional space
  • Measure proximity of Pickerington to Seattle,
    Boston, etc.

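The proximity check above can be sketched by comparing context distributions with KL divergence. The three hand-written distributions and their state labels are invented for illustration; in the real system each string's distribution over the HMM's 20 latent states comes from the corpus-trained model.

```python
from math import log

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete context distributions,
    smoothed so missing contexts don't divide by zero."""
    return sum(pi * log((pi + eps) / (q.get(ctx, 0.0) + eps))
               for ctx, pi in p.items())

# Hypothetical distributions over latent states (labels invented):
pickerington = {"loc": 0.7, "org": 0.2, "per": 0.1}
seattle      = {"loc": 0.8, "org": 0.1, "per": 0.1}
microsoft    = {"loc": 0.1, "org": 0.8, "per": 0.1}

# Pickerington's contexts sit closer to a known city's than to a
# company's, so it "behaves like a city":
print(kl_divergence(pickerington, seattle) <
      kl_divergence(pickerington, microsoft))  # → True
```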
38
III. Compositional Inference (work in progress,
Schoenmackers, Etzioni, & Weld)
  • Implicit information
  • TextRunner: (Turing, born in, London)
  • WordNet: (London, part of, England)
  • Rule: born in is transitive through part of
  • Conclusion: (Turing, born in, England)
  • Mechanism: MLN instantiated on the fly
  • Rules learned from corpus (future work)
  • Inference Demo

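The transitivity rule above can be demonstrated with plain forward chaining over triples. This is only a minimal stand-in: the actual mechanism instantiates a Markov Logic Network on the fly, while the function below simply applies one hard rule.

```python
def apply_transitivity(facts, rel="born in", via="part of"):
    """Forward-chain one rule for illustration: if (X, born in, Y)
    and (Y, part of, Z), conclude (X, born in, Z).  `facts` is a
    set of (arg1, relation, arg2) triples."""
    derived = set()
    for (a, r1, b) in facts:
        if r1 != rel:
            continue
        for (c, r2, d) in facts:
            if r2 == via and c == b:
                derived.add((a, rel, d))
    return derived

facts = {("Turing", "born in", "London"),     # from TextRunner
         ("London", "part of", "England")}    # from WordNet
print(apply_transitivity(facts))
# → {('Turing', 'born in', 'England')}
```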
39
KnowItAll Family Tree
Mulder '01
WebKB '99, PMI-IR '01
KnowItAll '04
Opine '05
BE '05
Urns
Woodward '06
KnowItNow '05
Resolver '07
TextRunner '07
REALM '07
Inference '08
40
KnowItAll Team
  • Michele Banko
  • Michael Cafarella
  • Doug Downey
  • Alan Ritter
  • Dr. Stephen Soderland
  • Stefan Schoenmackers
  • Prof. Dan Weld
  • Mausam
  • Alumni: Dr. Ana-Maria Popescu, Dr. Alex Yates,
    and others.

41
Related Work
  • Sekine's pre-emptive IE
  • Powerset
  • Textual Entailment
  • AAAI 07 Symposium on Machine Reading
  • Growing body of work on IE from the Web

42
4. Conclusions
  • Imagine search systems that operate over a (more)
    semantic space
  • Key words, documents → extractions
  • TF-IDF, PageRank → relational models
  • Web pages, hyperlinks → entities, relations
  • Reading the Web → new Search Paradigm

43
Thank you
44
  • Machine Reading = unsupervised understanding of
    text
  • Much is implicit → tractable inference is
    key!

45
HMM in more detail
Training: seek to maximize the probability of the
corpus w given latent states t, using EM.
[HMM lattice figure: hidden states t_i ... t_i+4
emitting words w_i ... w_i+4, shown over the phrase
"cities such as Los Angeles"]
46
Using the HMM at Query Time
  • Given a set of extractions (Arg1, Rln, Arg2)
  • Seeds = most frequent Args for Rln
  • Distribution over t is read from the HMM
  • Compute KL divergence via f(arg, seeds)
  • For each extraction, average f over Arg1 & Arg2
  • Sort sparse extractions in ascending order

47
Language Modeling + Open IE
  • Self-supervised
  • Illuminating phrases → full context
  • Handles sparse extractions

48
Focus Open IE on Web Text
  • Advantages:
  • Semantically tractable sentences
  • Redundancy
  • Search engines
  • Challenges:
  • Difficult, ungrammatical sentences
  • Unreliable information
  • Heterogeneous corpus

49
II. Probability of Correctness
  • How likely is an extraction to be correct?
  • Distributional Hypothesis: words that occur in
    the same contexts tend to have similar meanings
  • KnowItAll Hypothesis: extractions that occur in
    the same informative contexts more frequently are
    more likely to be correct.

50
Argument Type Checking via HMM
  • Relation arguments are typed:
  • (Person, Mayor Of, City)
  • Training: model the distribution of Person & City
    contexts in the corpus (Distributional Hypothesis)
  • Query time: rank sparse triples by how well each
    argument's context distribution matches that of
    its type

51
Silly Example
  • (Shaver, Mayor of, Pickerington) over (Spice
    Girls, Mayor of, Microsoft)
  • Because:
  • Shaver's contexts are more like other mayors'
    than the Spice Girls', and
  • Pickerington's contexts are more like other
    cities' than Microsoft's

52
Utilizing HMMs to Check Types
  • Challenges:
  • Argument types are not known
  • Can't build a model for each argument type
  • Textual types are fuzzy
  • Solution: train an HMM for the corpus using EM +
    bootstrapping
  • REALM improves precision by 90%

53
Query: Was Turing born in England?
  • Query formula: BornIn(Turing, England)?
  • Knowledge bases: TextRunner, WordNet
  • Inference rule: BornIn(X, city) -> BornIn(X, country)
  • Find best KB query:
  • WordNet: "X is in England"
  • TextRunner: "Turing born in X"
  • Run query:
  • TextRunner: "Turing was born in London"
    -> BornIn(Turing, London)
  • WordNet: "London is in England"
    -> In(London, England)
  • Find implied nodes & cliques:
  • BornIn(Turing, London) & In(London, England)
    -> BornIn(Turing, England)
  • Result: Yes! Turing was born in England!