1
Graph-Based Methods for Open Domain
Information Extraction
  • William W. Cohen
  • Machine Learning Dept. and Language Technologies
    Institute
  • School of Computer Science
  • Carnegie Mellon University

Joint work with Richard Wang
2
Traditional IE vs Open Domain IE
  • Traditional IE
  • Goal: recognize people, places, companies, times,
    dates, ... in NL text.
  • Supervised learning from a corpus completely
    annotated with the target entity class (e.g.,
    people)
  • Linear-chain CRFs
  • Language- and genre-specific extractors
  • Open-domain IE
  • Goal: recognize arbitrary entity sets in text
  • Minimal info about entity class
  • Example 1: ICML, NIPS
  • Example 2: Machine learning conferences
  • Semi-supervised learning
  • from very large corpora (WWW)
  • Graph-based learning methods
  • Techniques are largely language-independent (!)
  • Graph abstraction fits many languages

3
Examples with three seeds
4
Outline
  • History
  • Open-domain IE by pattern-matching
  • The bootstrapping-with-noise problem
  • Bootstrapping as a graph walk
  • Open-domain IE as finding nodes near seeds on a
    graph
  • Set expansion - from a few clean seeds
  • Iterative set expansion from many noisy seeds
  • Relational set expansion
  • Multilingual set expansion
  • Iterative set expansion from a concept name
    alone

5
History: Open-domain IE by pattern-matching
(Hearst, 92)
  • Start with seeds: NIPS, ICML
  • Look through a corpus for certain patterns
  • "... at NIPS, AISTATS, KDD and other learning
    conferences ..."
  • Expand from seeds to new instances
  • Repeat ... until ___
  • "on PC of KDD, SIGIR, and ..."
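
The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of Hearst (92): the corpus is just the three context strings quoted in this talk (elisions kept), the rule "any all-caps token of three or more letters is a candidate instance" is a crude stand-in for real lexical patterns, and stopping when a round adds nothing new is an assumption, since the slide leaves the stopping condition blank.

```python
import re

# Toy corpus: context strings quoted on the slides, with their elisions kept.
CORPUS = [
    "at NIPS, AISTATS, KDD and other learning conferences",
    "on PC of KDD, SIGIR, and ...",
    "For skiers, NIPS, SNOWBIRD, and ...",
]

# Assumption: an all-caps token of 3+ letters is a candidate instance.
CANDIDATE = re.compile(r"\b[A-Z]{3,}\b")

def bootstrap(corpus, seeds, max_rounds=10):
    """Grow the seed set from contexts that mention a known instance."""
    known = set(seeds)
    for _ in range(max_rounds):
        new = set()
        for context in corpus:
            candidates = set(CANDIDATE.findall(context))
            if candidates & known:           # context mentions a known instance
                new |= candidates - known    # so its other candidates are added
        if not new:                          # assumed stopping rule
            break
        known |= new
    return known

expanded = bootstrap(CORPUS, {"NIPS", "ICML"})
print(sorted(expanded))
```

Note that SNOWBIRD enters the set through the skiing context, and nothing later removes it. That is the bootstrapping-with-noise problem named in the outline.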

6
Bootstrapping as graph proximity
7
Set Expansion for Any Language (SEAL) (Wang &
Cohen, ICDM 07)
  • Basic ideas
  • Dynamically build the graph using queries to the
    web
  • Constrain the graph to be as useful as possible
  • Be smart about queries
  • Be smart about patterns: use clever methods for
    finding meaningful structure on web pages

8
System Architecture
Seeds (input):
  1. Canon
  2. Nikon
  3. Olympus
Ranked expansion (output):
  1. Pentax
  2. Sony
  3. Kodak
  4. Minolta
  5. Panasonic
  6. Casio
  7. Leica
  8. Fuji
  9. Samsung
  • Fetcher: download web pages from the Web that
    contain all the seeds
  • Extractor: learn wrappers from web pages
  • Ranker: rank entities extracted by wrappers

9
The Extractor
  • Learn wrappers from web documents and seeds on
    the fly
  • Utilize semi-structured documents
  • Wrappers defined at character level
  • Very fast
  • No tokenization required, thus
    language-independent
  • Wrappers derived from doc d are applied to d only
  • See ICDM 2007 paper for details

10
... Generally <a ref=finance/ford>Ford</a> sales
compared to <a ref=finance/honda>Honda</a>
while <a href=finance/gm>General Motors</a> and
<a href=finance/bentley>Bentley</a> ...
  • Find prefix of each seed and put in reverse
    order
  • ford1: /ecnanif=fer a< yllareneG ...
  • Ford2: >drof/ecnanif=fer a< yllareneG ...
  • honda1: /ecnanif=fer a< ot derapmoc ...
  • Honda2: >adnoh/ecnanif=fer a< ot ...
  • Organize these into a trie, tagging each node
    with a set of seeds

Trie of reversed prefixes (f1, f2 = the two Ford
instances; h1, h2 = the two Honda instances):
(root)                                   f1,f2,h1,h2
  /ecnanif=fer a<                        f1,h1
    yllareneG ...                        f1
    ot derapmoc ...                      h1
  >                                      f2,h2
    drof/ecnanif=fer a< yllareneG ...    f2
    adnoh/ecnanif=fer a< ot ...          h2
11
  1. Find prefix of each seed and put in reverse
    order
  2. Organize these into a trie, tagging each node
    with a set of seeds.
  3. A left context for a valid wrapper is a node
    tagged with one instance of each seed.

12
  1. Find prefix of each seed and put in reverse
    order
  2. Organize these into a trie, tagging each node
    with a set of seeds.
  3. A left context for a valid wrapper is a node
    tagged with one instance of each seed.
  4. The corresponding right context is the longest
    common string that immediately follows the
    corresponding seed instances.

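Trie of reversed prefixes, with the right context
of each valid left-context node:
(root)                                   f1,f2,h1,h2
  /ecnanif=fer a<                        f1,h1   right context: >
    yllareneG ...                        f1      followed by: >Ford</a> sales
    ot derapmoc ...                      h1      followed by: >Honda</a> while
  >                                      f2,h2   right context: </a>
    drof/ecnanif=fer a< yllareneG ...    f2
    adnoh/ecnanif=fer a< ot ...          h2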
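
The four steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation from the ICDM 2007 paper: the trie is stored as a dictionary keyed by the path from the root, left contexts are capped at a hypothetical `max_left` characters, seeds are matched case-insensitively, and wrappers are applied with a plain non-greedy regular expression.

```python
import os
import re
from collections import defaultdict

DOC = ("... Generally <a ref=finance/ford>Ford</a> sales compared to "
       "<a ref=finance/honda>Honda</a> while "
       "<a href=finance/gm>General Motors</a> and "
       "<a href=finance/bentley>Bentley</a> ...")

def find_occurrences(doc, seeds):
    """Return (seed, start, end) for each case-insensitive seed occurrence."""
    lowered = doc.lower()
    found = []
    for seed in seeds:
        start = lowered.find(seed.lower())
        while start != -1:
            found.append((seed, start, start + len(seed)))
            start = lowered.find(seed.lower(), start + 1)
    return found

def learn_wrappers(doc, seeds, max_left=30):
    """Return (left, right) contexts shared by >= 1 instance of every seed."""
    occurrences = find_occurrences(doc, seeds)
    # Steps 1-2: trie over reversed left contexts. Each key is the path from
    # the root to a node; each value is the set of occurrences below it.
    trie = defaultdict(set)
    for index, (seed, start, end) in enumerate(occurrences):
        reversed_left = doc[max(0, start - max_left):start][::-1]
        for depth in range(1, len(reversed_left) + 1):
            trie[reversed_left[:depth]].add(index)
    # Step 3: keep nodes tagged with an instance of each seed; for each
    # distinct occurrence set, keep the deepest node (longest left context).
    deepest = {}
    for path, members in trie.items():
        if {occurrences[i][0] for i in members} == set(seeds):
            key = frozenset(members)
            if len(path) > len(deepest.get(key, "")):
                deepest[key] = path
    # Step 4: right context = longest common text following those instances.
    wrappers = set()
    for members, path in deepest.items():
        following = [doc[occurrences[i][2]:] for i in members]
        right = os.path.commonprefix(following)
        if right:
            wrappers.add((path[::-1], right))
    return wrappers

def apply_wrappers(doc, wrappers):
    """Extract every string that sits between a wrapper's left and right context."""
    extracted = set()
    for left, right in wrappers:
        pattern = re.escape(left) + r"(.+?)" + re.escape(right)
        extracted.update(match.lower() for match in re.findall(pattern, doc))
    return extracted

wrappers = learn_wrappers(DOC, ["ford", "honda"])
print(sorted(wrappers))
print(sorted(apply_wrappers(DOC, wrappers)))
```

On this document the sketch finds two wrappers, one for the {f1,h1} node and one for the {f2,h2} node. The second one also extracts "General Motors" and "Bentley", which were not seeds.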
13
  • Nice properties
  • There are relatively few nodes in the trie
  • O((#seeds) × (document length))
  • You can tag every node with the complete set of
    seeds that it covers
  • You can rank or filter nodes by any predicate
    over this set of seeds you want, e.g.,
  • covers all seed instances that appear on the
    page?
  • covers at least one instance of each seed?
  • covers at least k instances, instances with
    weight > w, ...

14
I am noise
Me too!
15
Differences from prior work
  • Fast character-level wrapper learning
  • Language-independent
  • Trie structure allows flexibility in goals
  • Cover one copy of each seed, cover all instances
    of seeds, ...
  • Works well for semi-structured pages
  • Lists and tables, pull-down menus, javascript
    data structures, word documents, ...
  • High-precision, low-recall data integration vs.
    High-precision, low-recall information extraction

16
The Ranker
  • Rank candidate entity mentions based on
    similarity to seeds
  • Noisy mentions should be ranked lower
  • Random Walk with Restart (GW)
  • ?

17
Google's PageRank
Inlinks are good (recommendations). Inlinks from
a good site are better than inlinks from a
bad site, but inlinks from sites with many
outlinks are not as good... Good and bad
are relative.
[Figure: linked web sites]
18
Google's PageRank
  • Imagine a pagehopper that always either
  • follows a random link, or
  • jumps to a random page

[Figure: linked web sites]
19
Google's PageRank (Brin & Page,
http://www-db.stanford.edu/backrub/google.html)
  • Imagine a pagehopper that always either
  • follows a random link, or
  • jumps to a random page
  • PageRank ranks pages by the amount of time the
    pagehopper spends on a page
  • or, if there were many pagehoppers, PageRank is
    the expected crowd size

[Figure: linked web sites]
20
Personalized PageRank (aka Random Walk with
Restart)
  • Imagine a pagehopper that always either
  • follows a random link, or
  • jumps to a particular page

[Figure: linked web sites]
21
Personalized PageRank: Random Walk with Restart
  • Imagine a pagehopper that always either
  • follows a random link, or
  • jumps to a particular page P0
  • this ranks pages by the total number of paths
    connecting them to P0
  • with each path downweighted exponentially with
    length

[Figure: linked web sites]
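
The pagehopper can be simulated by power iteration. The sketch below is a generic textbook formulation, not code from the talk: the restart probability of 0.15, the four-page cycle, and the handling of pages without outlinks (their mass goes back to the restart pages) are all assumptions made for illustration. Restarting uniformly over all pages gives ordinary PageRank; restarting only at P0 gives the personalized version.

```python
def random_walk_with_restart(edges, restart_nodes, restart_prob=0.15,
                             iterations=100):
    """Approximate how much time the pagehopper spends on each page."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    # Where the hopper lands when it jumps instead of following a link.
    restart = {node: (1.0 / len(restart_nodes) if node in restart_nodes else 0.0)
               for node in nodes}
    prob = dict(restart)
    for _ in range(iterations):
        nxt = {node: restart_prob * restart[node] for node in nodes}
        for node in nodes:
            walk_mass = (1.0 - restart_prob) * prob[node]
            targets = out_links[node]
            if targets:
                for target in targets:       # follow a random outlink
                    nxt[target] += walk_mass / len(targets)
            else:
                for other in nodes:          # dead end: jump to a restart page
                    nxt[other] += walk_mass * restart[other]
        prob = nxt
    return prob

# Hypothetical four-page cycle: P0 -> a -> b -> c -> P0.
EDGES = [("P0", "a"), ("a", "b"), ("b", "c"), ("c", "P0")]
rwr = random_walk_with_restart(EDGES, {"P0"})
pagerank = random_walk_with_restart(EDGES, {"P0", "a", "b", "c"})
print(rwr)
print(pagerank)
```

With the restart fixed at P0, the scores fall off with path length from P0 (P0, then a, then b, then c). With a uniform restart the cycle is symmetric and every page gets the same score.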
22
The Ranker
  • Rank candidate entity mentions based on
    similarity to seeds
  • Noisy mentions should be ranked lower
  • Random Walk with Restart (GW)
  • On what graph?

23
Building a Graph
[Figure: example graph. Seed node "ford, nissan,
toyota"; document nodes northpointcars.com and
curryauto.com; wrapper nodes Wrapper 1 to Wrapper 4;
mention nodes "chevrolet" 22.5, "volvo chicago"
8.4, "honda" 26.1, "acura" 34.6, "bmw pittsburgh"
8.4; edge labels find, derive, extract.]
  • A graph consists of a fixed set of
  • Node Types: seeds, document, wrapper, mention
  • Labeled Directed Edges: find, derive, extract
  • Each edge asserts that a binary relation r holds
  • Each edge has an inverse relation r^-1 (graph is
    cyclic)
  • Intuition: good extractions are extracted by many
    good wrappers, and good wrappers extract many
    good extractions, ...
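
A small sketch of this graph and of ranking mentions on it. The node names come from the slide, but which wrapper extracts which mention is invented here for illustration, as are the restart probability and iteration count; the scores it produces are not the numbers shown in the figure.

```python
from collections import defaultdict

def build_graph(seed_node, records):
    """records: (document, wrapper, mentions) triples -> adjacency sets."""
    graph = defaultdict(set)

    def link(a, b):
        graph[a].add(b)   # relation r
        graph[b].add(a)   # inverse relation r^-1

    for document, wrapper, mentions in records:
        link(seed_node, document)      # find
        link(document, wrapper)        # derive
        for mention in mentions:
            link(wrapper, mention)     # extract
    return graph

def walk_scores(graph, start, restart_prob=0.15, iterations=100):
    """Random walk with restart at `start`, by power iteration."""
    prob = {node: 0.0 for node in graph}
    prob[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        nxt[start] = restart_prob
        for node, neighbours in graph.items():
            share = (1.0 - restart_prob) * prob[node] / len(neighbours)
            for neighbour in neighbours:
                nxt[neighbour] += share
        prob = nxt
    return prob

# Hypothetical extraction results (wrapper-to-mention assignments are made up).
RECORDS = [
    ("northpointcars.com", "Wrapper 1", ["honda", "acura"]),
    ("northpointcars.com", "Wrapper 2", ["honda", "acura", "chevrolet"]),
    ("curryauto.com", "Wrapper 3", ["honda", "acura", "volvo chicago"]),
    ("curryauto.com", "Wrapper 4", ["acura", "bmw pittsburgh"]),
]
graph = build_graph("seeds", RECORDS)
scores = walk_scores(graph, "seeds")
mentions = {m for _, _, extracted in RECORDS for m in extracted}
ranking = sorted(mentions, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph "acura" is extracted by all four wrappers and ranks first, while a mention extracted by a single wrapper, such as "chevrolet", scores below any mention that shares that wrapper and has others besides.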

24
Differences from prior work
  • Graph-based distances vs. bootstrapping
  • Graph constructed on-the-fly
  • So it's not different?
  • But there is a clear principle about how to
    combine results from earlier/later rounds of
    bootstrapping
  • i.e., graph proximity
  • Fewer parameters to consider
  • Robust to bad wrappers

25
Evaluation Datasets closed sets
26
Evaluation Method
  • Mean Average Precision
  • Commonly used for evaluating ranked lists in IR
  • Contains recall and precision-oriented aspects
  • Sensitive to the entire ranking
  • Mean of average precisions for each ranked list

AvgPrec(L) = \frac{\sum_{r=1}^{|L|} Prec(r) \times NewEntity(r)}{\#True Entities}
where L = ranked list of extracted mentions, r =
rank
Prec(r) = precision at rank r
NewEntity(r) = 1 if both (a) and (b) hold, and 0
otherwise:
(a) Extracted mention at r matches any true
mention
(b) There exists no other extracted mention at rank
less than r that is of the same entity as the one
at r
#True Entities = total number of true entities
in this dataset
  • Evaluation Procedure (per dataset)
  1. Randomly select three true entities and use their
    first listed mentions as seeds
  2. Expand the three seeds obtained from step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP for the five ranked lists
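
A sketch of the metric in Python, under one labeled assumption: the slide does not say how Prec(r) treats repeated mentions of the same entity, so here Prec(r) is taken to be the number of distinct true entities found in the top r, divided by r. The camera-maker data is made up for the example.

```python
def average_precision(ranked_mentions, mention_to_entity, num_true_entities):
    """Sum Prec(r) over ranks where a new true entity first appears."""
    seen_entities = set()
    total = 0.0
    for rank, mention in enumerate(ranked_mentions, start=1):
        entity = mention_to_entity.get(mention)   # None: not a true mention
        if entity is not None and entity not in seen_entities:
            seen_entities.add(entity)
            # Assumed Prec(rank): distinct true entities so far / rank.
            total += len(seen_entities) / rank
    return total / num_true_entities

def mean_average_precision(ranked_lists, mention_to_entity, num_true_entities):
    """Mean of the average precisions of several ranked lists."""
    values = [average_precision(ranked, mention_to_entity, num_true_entities)
              for ranked in ranked_lists]
    return sum(values) / len(values)

# Hypothetical dataset with 4 true entities; Nikon has two mentions.
MENTION_TO_ENTITY = {
    "canon": "Canon", "nikon": "Nikon", "nikon corp": "Nikon",
    "olympus": "Olympus", "pentax": "Pentax",
}
run_1 = ["canon", "nikon", "xyz", "nikon corp", "olympus"]
run_2 = ["xyz", "canon"]
ap_1 = average_precision(run_1, MENTION_TO_ENTITY, 4)
map_score = mean_average_precision([run_1, run_2], MENTION_TO_ENTITY, 4)
print(ap_1, map_score)
```

In `run_1` the second Nikon mention adds nothing because its entity was already found at rank 2, which is condition (b) in action.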
27
Experimental Results 3 seeds
  • Vary: Extractor, Ranker, Top N URLs
  • Extractor
  • E1: Baseline Extractor (longest common context
    for all seed occurrences)
  • E2: Smarter Extractor (longest common context
    for 1 occurrence of each seed)
  • Ranker: EF = Baseline (Most Frequent), GW =
    Graph Walk
  • N URLs: 100, 200, 300

28
Side by side comparisons
Talukdar, Brants, Liberman, Pereira, CoNLL 06
29
Side by side comparisons
EachMovie vs WWW
NIPS vs WWW
Ghahramani & Heller, NIPS 2005
30
Why does SEAL do so well?
Free-text wrappers are only 10-15% of all
wrappers learned:
"Used ... Van Pricing"
"Used ... Engines"
"Bell Road ... "
"Alaska ... dealership"
"www.sunnyking....com"
"engine ... used engines"
"accessories, ... parts"
"is better ... or"
  • Hypotheses
  • More information appears in semi-structured
    documents than in free text
  • More semi-structured documents can be (partially)
    understood with character-level wrappers than
    with HTML-level wrappers

31
Comparing character tries to HTML-based structures
32
Outline
  • History
  • Open-domain IE by pattern-matching
  • The bootstrapping-with-noise problem
  • Bootstrapping as a graph walk
  • Open-domain IE as finding nodes near seeds on a
    graph
  • Set expansion - from a few clean seeds
  • Iterative set expansion from many noisy seeds
  • Iterative set expansion from a concept name
    alone
  • Multilingual set expansion
  • Relational set expansion

33
A limitation of the original SEAL
34
Proposed Solution: Iterative SEAL (iSEAL) (Wang &
Cohen, ICDM 2008)
  • Makes several calls to SEAL, each call:
  • Expands a couple of seeds
  • Aggregates statistics
  • Evaluate iSEAL using
  • Two iterative processes
  • Supervised vs. Unsupervised (Bootstrapping)
  • Two seeding strategies
  • Fixed Seed Size vs. Increasing Seed Size
  • Five ranking methods

35
iSEAL (Fixed Seed Size, Supervised)
Initial Seeds
  • Finally rank nodes by proximity to seeds in the
    full graph
  • Refinement (ISS): increase size of seed set for
    each expansion over time: 2, 3, 4, 4, ...
  • Variant (Bootstrap): use high-confidence
    extractions when seeds run out

36
Ranking Methods
  • Random Graph Walk with Restart
  • H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random
    walk with restart and its application. In ICDM,
    2006.
  • PageRank
  • L. Page, S. Brin, R. Motwani, and T. Winograd.
    The PageRank citation ranking Bringing order to
    the web. 1998.
  • Bayesian Sets (over flattened graph)
  • Z. Ghahramani and K. A. Heller. Bayesian sets. In
    NIPS, 2005.
  • Wrapper Length
  • Weights each item based on the length of common
    contextual string of that item and the seeds
  • Wrapper Frequency
  • Weights each item based on the number of wrappers
    that extract the item
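
The last two ranking methods are simple enough to sketch directly. The wrappers and extracted items below are hypothetical, and summing left plus right context lengths over the wrappers that extract an item is an assumed reading of "length of common contextual string", not a formula taken from the paper.

```python
from collections import Counter

# Hypothetical wrappers: (left context, right context, extracted items).
WRAPPERS = [
    (" <a ref=finance/", ">", ["ford", "honda"]),
    (">", "</a> ", ["ford", "honda", "general motors", "bentley"]),
]

def wrapper_frequency(wrappers):
    """Score each item by the number of wrappers that extract it."""
    scores = Counter()
    for _, _, items in wrappers:
        for item in set(items):
            scores[item] += 1
    return scores

def wrapper_length(wrappers):
    """Score each item by the total context length of its wrappers (assumed)."""
    scores = Counter()
    for left, right, items in wrappers:
        for item in set(items):
            scores[item] += len(left) + len(right)
    return scores

frequency = wrapper_frequency(WRAPPERS)
length = wrapper_length(WRAPPERS)
print(frequency.most_common())
print(length.most_common())
```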

37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
Little difference between ranking methods for
supervised case (all seeds correct); large
differences when bootstrapping.
Increasing seed size (2, 3, 4, 4, ...) makes all
ranking methods improve steadily in the
bootstrapping case.
42
(No Transcript)
43
Outline
  • History
  • Open-domain IE by pattern-matching
  • The bootstrapping-with-noise problem
  • Bootstrapping as a graph walk
  • Open-domain IE as finding nodes near seeds on a
    graph
  • Set expansion - from a few clean seeds
  • Iterative set expansion from many noisy seeds
  • Relational set expansion
  • Multilingual set expansion
  • Iterative set expansion from a concept name
    alone

44
Relational Set Expansion (Wang & Cohen, EMNLP
2009)
  • Seed examples are pairs
  • E.g., (audi, germany), (acura, japan), ...
  • Extension: find wrappers in which pairs of seeds
    occur
  • With specific left & right contexts
  • In specific order (audi before germany, ...)
  • With specific string between them
  • Variant of trie-based algorithm
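
A rough sketch of a relational wrapper with left, middle, and right contexts. It is deliberately simpler than the trie-based variant the slide refers to: it looks only at the first occurrence of each seed pair, computes the common contexts directly, and requires the string between the two members to be identical for all pairs. The document, the `max_gap` limit, and the function names are all illustrative assumptions.

```python
import os
import re

DOC = "Makers: Audi (Germany), Acura (Japan), Volvo (Sweden), Fiat (Italy)."

def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_relational_wrapper(doc, seed_pairs, max_gap=10):
    """Return (left, middle, right) shared by all seed pairs, or None."""
    lowered = doc.lower()
    lefts, middles, rights = [], [], []
    for first, second in seed_pairs:
        start1 = lowered.find(first.lower())
        if start1 == -1:
            return None
        end1 = start1 + len(first)
        # Order matters: the second member must follow the first.
        start2 = lowered.find(second.lower(), end1)
        if start2 == -1 or start2 - end1 > max_gap:
            return None
        lefts.append(doc[:start1])
        middles.append(doc[end1:start2])
        rights.append(doc[start2 + len(second):])
    if len(set(middles)) != 1:
        return None
    left, right = common_suffix(lefts), os.path.commonprefix(rights)
    if not left or not right:
        return None
    return left, middles[0], right

def apply_relational_wrapper(doc, wrapper):
    """Extract (first, second) pairs; the right context is a lookahead."""
    left, middle, right = wrapper
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)(?=" + re.escape(right) + r")")
    return re.findall(pattern, doc)

wrapper = learn_relational_wrapper(DOC, [("audi", "germany"), ("acura", "japan")])
pairs = apply_relational_wrapper(DOC, wrapper)
print(wrapper)
print(pairs)
```

The right context is matched as a lookahead so that it can overlap with the left context of the next pair. The last pair in the list, Fiat (Italy), is missed here because it is followed by ")." rather than the learned right context.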

45
Results
First iteration
Tenth iteration
46
Outline
  • History
  • Open-domain IE by pattern-matching
  • The bootstrapping-with-noise problem
  • Bootstrapping as a graph walk
  • Open-domain IE as finding nodes near seeds on a
    graph
  • Set expansion - from a few clean seeds
  • Iterative set expansion from many noisy seeds
  • Relational set expansion
  • Multilingual set expansion
  • Iterative set expansion from a concept name
    alone

47
Multilingual Set Expansion
48
Multilingual Set Expansion
  • Basic idea
  • Expand in language 1 (English) with seeds s1,s2
    to S1
  • Expand in language 2 (Spanish) with seeds t1,t2
    to T1.
  • Find first seed s3 in S1 that has a translation
    t3 in T1.
  • Expand in language 1 (English) with seeds
    s1,s2,s3 to S2
  • Find first seed t4 in T1 that has a translation
    s4 in S2.
  • Expand in language 2 (Sp.) with seeds t1,t2,t3 to
    T2.
  • Continue.

49
Multilingual Set Expansion
  • What's needed
  • Set expansion in two languages
  • A way to decide if s is a translation of t

50
Multilingual Set Expansion
  • Submit s as a query and ask for results in
    language T.
  • Find chunks in language T in the snippets that
    frequently co-occur with s
  • Bounded by change in character set (e.g., English
    to Chinese) or punctuation
  • Rank chunks by combination of proximity &
    frequency
  • Consider top 3 chunks t1, t2, t3 as likely
    translations of s.

51
Multilingual Set Expansion
52
Multilingual Set Expansion
53
Outline
  • History
  • Open-domain IE by pattern-matching
  • The bootstrapping-with-noise problem
  • Bootstrapping as a graph walk
  • Open-domain IE as finding nodes near seeds on a
    graph
  • Set expansion - from a few clean seeds
  • Iterative set expansion from many noisy seeds
  • Relational set expansion
  • Multilingual set expansion
  • Iterative set expansion from a concept name
    alone

54
ASIA: Automatic Set Instance Acquisition (Wang &
Cohen, ACL 2009)
  • Start with name of concept (e.g., NFL teams)
  • Look for instances using (language-dependent)
    patterns
  • "... for successful NFL teams (e.g., Pittsburgh
    Steelers, New York Giants, ...)"
  • Take most frequent answers as seeds
  • Run bootstrapping iSEAL
  • with seed sizes 2, 3, 4, 4, ...
  • and extended for noise-resistance
  • wrappers should cover as many distinct seeds as
    possible (not all seeds)
  • subject to a limit on size
  • Modified trie method
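
The first steps (patterns around a concept name, then most frequent answers as seeds) might look like the sketch below. The two English patterns, the snippets, and the `top_k` cutoff are illustrative assumptions; the slide shows only the "(e.g., ...)" pattern and does not list the patterns ASIA actually uses.

```python
import re
from collections import Counter

# Hypothetical text snippets mentioning the concept name.
SNIPPETS = [
    "for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants)",
    "NFL teams such as New York Giants, Dallas Cowboys and Pittsburgh Steelers",
    "tickets for NFL teams (e.g., Pittsburgh Steelers, New York Giants)",
]

def candidate_seeds(concept, snippets, top_k=2):
    """Return the top_k most frequent instances found next to the concept name."""
    name = re.escape(concept)
    patterns = [
        re.compile(name + r" \(e\.g\., ([^)]*)\)"),   # "<concept> (e.g., A, B)"
        re.compile(name + r" such as (.*)"),           # "<concept> such as A, B and C"
    ]
    counts = Counter()
    for snippet in snippets:
        for pattern in patterns:
            for listing in pattern.findall(snippet):
                for item in re.split(r",\s*|\s+and\s+", listing):
                    if item.strip():
                        counts[item.strip()] += 1
    return [item for item, _ in counts.most_common(top_k)]

seeds = candidate_seeds("NFL teams", SNIPPETS)
print(seeds)
```

The returned names would then be handed to the bootstrapping set expander as its initial seeds.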

55
(No Transcript)
56
Datasets with concept names
57
(No Transcript)
58
Experimental results
Direct use of text patterns
59
Comparison to Kozareva, Riloff & Hovy (which uses
concept name plus a single instance as seed); no
seed used.
60
Comparison to Pasca (using web search queries,
CIKM 07)
61
Comparison to WordNet + Nk
  • Snow et al, ACL 2005: series of experiments
    learning hyper/hyponyms
  • Bootstrap from WordNet examples
  • Use dependency-parsed free text
  • E.g., added 30k new instances with fairly high
    precision
  • Many are concepts named-entity instances
  • Experiments with ASIA on concepts from WordNet
    show a fairly common problem
  • E.g., "movies" gives as instances comedy,
    action/adventure, family, drama, ...
  • I.e., ASIA finds a lower level in a hierarchy,
    maybe not the one you want

62
Comparison to WordNet + Nk
  • Filter: a simulated sanity check
  • Consider only concepts expanded in WordNet + 30k
    that seem to have named-entities as instances and
    have at least instances
  • Run ASIA on each concept
  • Discard result if less than 50% of the WordNet
    instances are in ASIA's output

63
  • Summary
  • Some are good
  • Some of Snow's concepts are low-precision
    relative to ASIA (4.7 → 100)
  • For the rest ASIA has 2x → 100x the coverage (in
    number of instances)

64
Two More Systems to Compare to
  • Van Durme & Pasca, 2008
  • Requires an English part-of-speech tagger.
  • Analyzed 100 million cached Web documents in
    English (for many classes).
  • Talukdar et al, 2008
  • Requires 5 seed instances as input (for each
    class).
  • Utilizes output from Van Durme's system and 154
    million tables from the WebTables database (for
    many classes).
  • ASIA
  • Does not require any part-of-speech tagger
    (nearly language-independent).
  • Supports multiple languages such as English,
    Chinese, and Japanese.
  • Analyzes around 200-400 Web documents (for each
    class).
  • Requires only the class name as input.
  • Given a class name, extraction usually finishes
    within a minute (including network latency of
    fetching web pages).

65
  • Precisions of Talukdar's and Van Durme's systems
    were obtained from Figure 2 in Talukdar et al,
    2008.

66
(for your reference)
67
Top 10 Instances from ASIA
68
(No Transcript)
69
  • Joint work with Tom Mitchell, Weam AbuZaki,
    Justin Betteridge, Andrew Carlson, Estevam R.
    Hruschka Jr., Bryan Kisiel, Burr Settles
  • Learn a large number of concepts at once

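[Figure: ontology fragment. Categories: person,
sport, team, athlete, coach. Relations:
teamPlaysSport(t,s), playsForTeam(a,t),
playsSport(a,s), coachesTeam(c,t). Example sentence
with noun phrases NP1 and NP2: "Krzyzewski coaches
the Blue Devils."]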
70
Coupled learning of text and HTML patterns
[Figure components: evidence integration; CBL:
Free-text extraction patterns; SEAL: HTML extraction
patterns; Ontology and populated KB; the Web]
71
(No Transcript)
72
Summary/Conclusions
  • Open-domain IE as finding nodes near seeds on a
    graph

[Figure: graph linking instance nodes NIPS,
SNOWBIRD, AISTATS, SIGIR, KDD to the contexts
"at NIPS, AISTATS, KDD and other learning
conferences", "For skiers, NIPS, SNOWBIRD, and ...",
"on PC of KDD, SIGIR, and ...", "... AISTATS, KDD, ..."]
  • RWR as robust proximity measure
  • Character tries as flexible pattern language
  • high-coverage
  • modifiable to handle expectations of noise

73
Summary/Conclusions
  • Open-domain IE as finding nodes near seeds on a
    graph
  • Graph built on-the-fly with web queries
  • A good graph matters!
  • A big graph matters!
  • character-level tries >> HTML heuristics
  • Rank the whole graph
  • Don't confuse iteratively building the graph with
    ranking!
  • Off-the-shelf distance metrics work
  • Differences are minimal for clean seeds
  • Much bigger differences with noisy seeds
  • Bootstrapping (especially from free-text
    patterns) is noisy

74
Thanks to
  • DARPA PAL program
  • Cohen, Wang
  • Google Research Grant program
  • Wang

Sponsored links: http://boowa.com (Richard's
demo)