Title: Snowball : Extracting Relations from Large Plain-Text Collections
1Snowball Extracting Relations from Large
Plain-Text Collections
- Eugene Agichtein, Luis Gravano
- Department of Computer Science, Columbia University
2Extracting Relations from Documents
Text documents hide valuable structured
information.
- If we manage to extract this information
- We can answer user queries more accurately
- We can run data mining tasks (e.g., finding
trends)
3GOAL Extract All Tuples Hidden in the Document
Collection
- System must
- Require minimal training for each new task
- Recover from noise
- Exploit redundancy of information in documents
4Example Task Organization/Location
Organization      Location
Microsoft         Redmond
Apple Computer    Cupertino
Nike              Portland
Redundancy
Microsoft's central headquarters in Redmond is
home to almost every product group and division.
Brent Barlow, 27, a software analyst and
beta-tester at Apple Computer headquarters in
Cupertino, was fired Monday for "thinking a
little too different."
Apple's programmers "think different" on a
"campus" in Cupertino, Cal. Nike employees "just
do it" at what the company refers to as its
"World Campus," near Portland, Ore.
5Extracting Relations from Text Collections
- Related Work
- The Snowball System
- Evaluation Metrics
- Experimental Results
6Related Work
- Traditional Information Extraction
- MUCs (Message Understanding Conferences)
- Significant (manual) training for each new task
- Bootstrapping
- Riloff et al. (99), Collins & Singer (99)
- (Named-entity recognition)
- Brin (DIPRE) (98)
7Extracting Relations from Text DIPRE
Initial Seed Tuples -> Occurrences of Seed Tuples -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table (repeat)
8Extracting Relations from Text DIPRE
Computer servers at Microsoft's headquarters in Redmond ...
In mid-afternoon trading, shares of Redmond-based Microsoft fell ...
The Armonk-based IBM introduced a new line ...
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Occurrences of seed tuples
9Extracting Relations from Text DIPRE
- <STRING1>'s headquarters in <STRING2>
- <STRING2>-based <STRING1>
- <STRING1>, <STRING2>
DIPRE Patterns
10Extracting Relations from Text DIPRE
Generate new seed tuples; start new iteration
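The DIPRE-style loop walked through on the preceding slides can be sketched roughly as follows. This is a toy illustration, not Brin's implementation: the helper names (`find_contexts`, `apply_contexts`, `bootstrap`), the 30-character context limit, and the capitalized-word heuristic for the two strings are all assumptions made for the example.

```python
import re

def find_contexts(seeds, sentences):
    """Collect the text found between the two strings of each seed tuple."""
    contexts = set()
    for org, loc in seeds:
        for s in sentences:
            # Assumption: keep only short middle contexts (at most 30 characters).
            m = re.search(re.escape(org) + r"(.{1,30}?)" + re.escape(loc), s)
            if m:
                contexts.add(m.group(1))
    return contexts

def apply_contexts(contexts, sentences):
    """Treat each middle context as a pattern '<STRING1> middle <STRING2>'."""
    found = set()
    for middle in contexts:
        # Assumption: both strings are single capitalized words.
        pat = re.compile(r"([A-Z]\w+)" + re.escape(middle) + r"([A-Z]\w+)")
        for s in sentences:
            found.update(pat.findall(s))
    return found

def bootstrap(seeds, sentences, iterations=2):
    """Seed tuples -> occurrences -> patterns -> new tuples -> augment table."""
    table = set(seeds)
    for _ in range(iterations):
        contexts = find_contexts(table, sentences)
        table |= apply_contexts(contexts, sentences)
    return table

sentences = [
    "Computer servers at Microsoft's headquarters in Redmond",
    "The combined company will operate from Boeing's headquarters in Seattle.",
]
print(sorted(bootstrap({("Microsoft", "Redmond")}, sentences)))
# [('Boeing', 'Seattle'), ('Microsoft', 'Redmond')]
```

Starting from the single seed <Microsoft, Redmond>, the sketch learns the middle context "'s headquarters in" and uses it to pick up <Boeing, Seattle>.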
11Extracting Relations from Text Potential
Pitfalls
- Invalid tuples generated
- Degrade quality of tuples on subsequent iterations
- Must have automatic way to select high-quality tuples to use as new seed
- Pattern representation
- Patterns must generalize
12Extracting Relations from Text Collections
- Related Work
- DIPRE
- The Snowball System
- Pattern representation and generation
- Tuple generation
- Automatic pattern and tuple evaluation
- Evaluation Metrics
- Experimental Results
13Extracting Relations from Text Snowball
Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table (repeat)
14Extracting Relations from Text Snowball
Computer servers at Microsoft's headquarters in Redmond ...
In mid-afternoon trading, shares of Redmond-based Microsoft fell ...
The Armonk-based IBM introduced a new line ...
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Occurrences of seed tuples
15Problem Patterns Excessively General
Pattern: <STRING2>-based <STRING1>
Today's merger with McDonnell Douglas positions
Seattle -based Boeing to make major money in
space.
..., a producer of apple-based jelly, ...
Incorrect!
<jelly, apple>
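A small sketch of why an untyped pattern over-generates. The regular expression below is a hypothetical rendering of the "<STRING2>-based <STRING1>" pattern (any word may fill either slot); it is an assumption for illustration and not how DIPRE encodes its patterns.

```python
import re

# Hypothetical regex for "<STRING2>-based <STRING1>" with untyped wildcards.
PATTERN = re.compile(r"(?P<string2>\w+)\s*-based\s+(?P<string1>\w+)")

def extract(text):
    """Return (STRING1, STRING2) pairs matched by the untyped pattern."""
    return [(m.group("string1"), m.group("string2")) for m in PATTERN.finditer(text)]

good = ("Today's merger with McDonnell Douglas positions "
        "Seattle -based Boeing to make major money in space.")
bad = "..., a producer of apple-based jelly, ..."

print(extract(good))  # [('Boeing', 'Seattle')]
print(extract(bad))   # [('jelly', 'apple')]  <- not an organization/location pair
```

Nothing in the pattern restricts the slots to organizations and locations, so the second sentence yields the invalid tuple.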
16Extracting Relations from Text Snowball
Computer servers at Microsoft's headquarters in Redmond ...
In mid-afternoon trading, shares of Redmond-based Microsoft fell ...
The Armonk-based IBM introduced a new line ...
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Tag Entities
Use MITRE's Alembic Named Entity tagger
17Extracting Relations from Text
- <ORGANIZATION>'s headquarters in <LOCATION>
- <LOCATION>-based <ORGANIZATION>
- <ORGANIZATION>, <LOCATION>
PROBLEM: Patterns too specific; have to match
text exactly.
18Snowball Pattern Representation
- A Snowball pattern vector is a 5-tuple <left, tag1, middle, tag2, right>
- tag1, tag2 are named-entity tags
- left, middle, and right are vectors of weighted terms.
ORGANIZATION 's central headquarters in
LOCATION is home to...
tag1:   ORGANIZATION
middle: <'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>
tag2:   LOCATION
right:  <is 0.75>, <home 0.75>
< left , tag1 , middle , tag2 , right >
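One possible way to hold this 5-tuple in code, as a rough sketch. The class name `SnowballPattern`, the field layout, and the use of dicts for the weighted term vectors are assumptions for illustration; the left vector is left empty because the slide lists no left terms for this example.

```python
from dataclasses import dataclass, field
from typing import Dict

TermVector = Dict[str, float]  # term -> weight

@dataclass
class SnowballPattern:
    """Hypothetical container for <left, tag1, middle, tag2, right>."""
    tag1: str
    tag2: str
    left: TermVector = field(default_factory=dict)
    middle: TermVector = field(default_factory=dict)
    right: TermVector = field(default_factory=dict)

    def as_tuple(self):
        """Return the fields in the order used on the slide."""
        return (self.left, self.tag1, self.middle, self.tag2, self.right)

# "ORGANIZATION 's central headquarters in LOCATION is home to..."
example = SnowballPattern(
    tag1="ORGANIZATION",
    tag2="LOCATION",
    middle={"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
    right={"is": 0.75, "home": 0.75},
)
print(example.as_tuple())
```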
19Snowball Pattern Generation
Tagged Occurrences of seed tuples
Computer servers at Microsoft's central
headquarters in Redmond
In mid-afternoon trading, shares of Redmond-based
Microsoft fell
The Armonk-based IBM introduced a new line
The combined company will operate from Boeing's
headquarters in Seattle.
20Snowball Pattern Generation Cluster Similar
Occurrences
Occurrences of seed tuples converted to Snowball
representation
{<servers 0.75>, <at 0.75>} ORGANIZATION {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION
{<shares 0.75>, <of 0.75>} LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION {<fell 1>}
{<the 1>} LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION {<introduced 0.75>, <a 0.75>}
{<operate 0.75>, <from 0.75>} ORGANIZATION {<'s 0.7>, <headquarters 0.7>, <in 0.7>} LOCATION
21Similarity Metric
$P = \langle L_p, tag_1, M_p, tag_2, R_p \rangle$
$S = \langle L_s, tag_1, M_s, tag_2, R_s \rangle$
$\mathrm{Match}(P, S) = L_p \cdot L_s + M_p \cdot M_s + R_p \cdot R_s$ if the tags match
$\mathrm{Match}(P, S) = 0$ otherwise
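A minimal sketch of this similarity metric, summing the dot products of the left, middle, and right vectors when the tags agree. Representing each term vector as a dict and each 5-tuple as a plain Python tuple is an assumption for the example; the weights are taken from the occurrences on the previous slide, with the tag order inferred from the sentences.

```python
def dot(u, v):
    """Dot product of two sparse term vectors (dicts mapping term -> weight)."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def match(p, s):
    """Match(P, S) for 5-tuples (left, tag1, middle, tag2, right)."""
    lp, p_tag1, mp, p_tag2, rp = p
    ls, s_tag1, ms, s_tag2, rs = s
    if (p_tag1, p_tag2) != (s_tag1, s_tag2):
        return 0.0  # tags do not match
    return dot(lp, ls) + dot(mp, ms) + dot(rp, rs)

occ1 = ({"servers": 0.75, "at": 0.75}, "ORGANIZATION",
        {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5}, "LOCATION", {})
occ2 = ({"shares": 0.75, "of": 0.75}, "LOCATION",
        {"-": 0.75, "based": 0.75}, "ORGANIZATION", {"fell": 1.0})
occ4 = ({"operate": 0.75, "from": 0.75}, "ORGANIZATION",
        {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {})

print(match(occ1, occ4))  # about 1.05: three shared middle terms, 0.5 * 0.7 each
print(match(occ1, occ2))  # 0.0: the tag order differs
```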
22Snowball Pattern Generation Clustering
Cluster 1
{<servers 0.75>, <at 0.75>} ORGANIZATION {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION
{<operate 0.75>, <from 0.75>} ORGANIZATION {<'s 0.7>, <headquarters 0.7>, <in 0.7>} LOCATION
Cluster 2
{<shares 0.75>, <of 0.75>} LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION {<fell 1>}
{<the 1>} LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION {<introduced 0.75>, <a 0.75>}
23Snowball Pattern Generation
Patterns are formed as centroids of the clusters.
Filtered by minimum number of supporting tuples.
Pattern1: ORGANIZATION {<'s 0.7>, <in 0.7>, <headquarters 0.7>} LOCATION
Pattern2: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION
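A rough sketch of forming patterns as cluster centroids with a minimum-support filter. The centroid here is a plain arithmetic mean of the member vectors, and support is counted in occurrences rather than distinct tuples; both are simplifying assumptions, so the weights will not necessarily reproduce the ones shown on the slide. The names `centroid`, `build_patterns`, and `min_support` are hypothetical.

```python
from collections import defaultdict

def centroid(vectors):
    """Arithmetic mean of sparse term vectors (dicts mapping term -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()}

def build_patterns(clusters, min_support=2):
    """clusters: lists of (left, tag1, middle, tag2, right) with equal tags."""
    patterns = []
    for members in clusters:
        if len(members) < min_support:
            continue  # too few supporting occurrences
        _, tag1, _, tag2, _ = members[0]
        patterns.append((
            centroid([m[0] for m in members]), tag1,
            centroid([m[2] for m in members]), tag2,
            centroid([m[4] for m in members]),
        ))
    return patterns

cluster2 = [
    ({"shares": 0.75, "of": 0.75}, "LOCATION",
     {"-": 0.75, "based": 0.75}, "ORGANIZATION", {"fell": 1.0}),
    ({"the": 1.0}, "LOCATION",
     {"-": 0.75, "based": 0.75}, "ORGANIZATION", {"introduced": 0.75, "a": 0.75}),
]
singleton = [({}, "ORGANIZATION", {",": 1.0}, "LOCATION", {})]

patterns = build_patterns([cluster2, singleton])
print(patterns[0][2])  # {'-': 0.75, 'based': 0.75}
```

The single-member cluster is dropped by the support filter, so only one pattern is produced.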
24Snowball Tuple Extraction
Using the patterns, scan the collection to
generate new seed tuples
25Snowball Tuple Extraction
- Represent each new text segment in the collection as the context 5-tuple
- Find most similar pattern (if any)
Netscape 's flashy headquarters in Mountain View is near
Text segment: ORGANIZATION {<'s 0.5>, <flashy 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<is 0.75>, <near 0.75>}
Most similar pattern: ORGANIZATION {<'s 0.7>, <headquarters 0.7>, <in 0.7>} LOCATION
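The "find most similar pattern (if any)" step might look roughly like this. The function name `most_similar_pattern` and the similarity threshold `min_sim` (here 0.6) are assumptions; the slide does not state a threshold value. The vectors are the ones from the Netscape example.

```python
def dot(u, v):
    """Dot product of two sparse term vectors (dicts mapping term -> weight)."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def match(p, s):
    """Sum of left/middle/right dot products if the tags agree, else 0."""
    if (p[1], p[3]) != (s[1], s[3]):
        return 0.0
    return dot(p[0], s[0]) + dot(p[2], s[2]) + dot(p[4], s[4])

def most_similar_pattern(segment, patterns, min_sim):
    """Return (index, score) of the best pattern scoring >= min_sim, else None."""
    best = None
    for i, pattern in enumerate(patterns):
        score = match(pattern, segment)
        if score >= min_sim and (best is None or score > best[1]):
            best = (i, score)
    return best

patterns = [
    ({}, "ORGANIZATION", {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {}),
    ({}, "LOCATION", {"-": 0.75, "based": 0.75}, "ORGANIZATION", {}),
]
# "Netscape 's flashy headquarters in Mountain View is near"
segment = ({}, "ORGANIZATION",
           {"'s": 0.5, "flashy": 0.5, "headquarters": 0.5, "in": 0.5},
           "LOCATION", {"is": 0.75, "near": 0.75})

best = most_similar_pattern(segment, patterns, min_sim=0.6)
print(best)  # index 0 (the "headquarters in" pattern), score about 1.05
```

When a pattern is selected, the tagged entities of the segment (here Netscape and Mountain View) become a candidate tuple.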
26Snowball Automatic Pattern Evaluation
Seed tuples
Pattern "<ORGANIZATION>, <LOCATION>" in action:
Boeing, Seattle, said ... Positive
Intel, Santa Clara, cut prices ... Positive
invest in Microsoft, New York-based analyst Jane Smith said ... Negative
- Automatically estimate probability of a pattern generating valid tuples
$\mathrm{Conf}(Pattern) = \frac{Positive}{Positive + Negative}$
e.g., Conf(Pattern) = 2/3, about 66%
Pattern Confidence
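A small sketch of this confidence estimate: count the pattern's matches that agree with the seed table (positive) and those that contradict it (negative). The seed table below is assumed from the earlier examples, and skipping organizations that do not appear in the seed table is an assumption of this sketch.

```python
def pattern_confidence(extracted, seeds):
    """Positive / (Positive + Negative) for one pattern.

    extracted: list of (organization, location) pairs produced by the pattern
    seeds: dict organization -> known location
    """
    positive = negative = 0
    for org, loc in extracted:
        if org not in seeds:
            continue  # assumption: unknown organizations are neither positive nor negative
        if seeds[org] == loc:
            positive += 1
        else:
            negative += 1
    total = positive + negative
    return positive / total if total else 0.0

# Seed table assumed for illustration.
seeds = {"Boeing": "Seattle", "Intel": "Santa Clara", "Microsoft": "Redmond"}
# Matches of the "<ORGANIZATION>, <LOCATION>" pattern from the slide.
extracted = [("Boeing", "Seattle"), ("Intel", "Santa Clara"), ("Microsoft", "New York")]

print(pattern_confidence(extracted, seeds))  # 2 positive, 1 negative -> 2/3
```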
27Snowball Automatic Tuple Evaluation
Brent Barlow, 27, a software analyst and
beta-tester at Apple Computer headquarters in
Cupertino, was fired Monday for "thinking a
little too different."
<Apple Computer, Cupertino>
Apple's programmers "think different" on a
"campus" in Cupertino, Cal.
- $\mathrm{Conf}(Tuple) = 1 - \prod_i (1 - \mathrm{Conf}(P_i))$
- Estimation of Probability(Correct(Tuple))
- A tuple will have high confidence if generated by multiple high-confidence patterns ($P_i$).
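The tuple confidence formula, one minus the product of (1 - Conf(Pi)) over the patterns that generated the tuple, can be sketched as follows. The function name and the example confidence values are assumptions for illustration.

```python
def tuple_confidence(pattern_confs):
    """1 - product over i of (1 - Conf(P_i)) for the generating patterns."""
    miss = 1.0
    for conf in pattern_confs:
        miss *= 1.0 - conf
    return 1.0 - miss

# Illustrative values: one pattern with confidence 2/3, another with 0.5.
print(tuple_confidence([2 / 3]))       # about 0.667
print(tuple_confidence([2 / 3, 0.5]))  # about 0.833, higher with a second supporting pattern
```

Each additional supporting pattern can only raise the estimate, which is why tuples found in several different contexts end up with high confidence.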
28Snowball Filtering Seed Tuples
Generate new seed tuples
29Extracting Relations from Text Collections
- Related Work
- The Snowball System
- Pattern representation and generation
- Tuple generation
- Automatic pattern and tuple evaluation
- Evaluation Metrics
- Experimental Results
30Task Evaluation Methodology
- Data: Large collection; extracted tables contain many tuples (> 80,000)
- Need scalable methodology
- Ideal set of tuples
- Automatic recall/precision estimation
- Estimated precision using sampling
31Collections used in Experiments
More than 300,000 real newspaper articles
32The Ideal Metric (1)
- Creating the Ideal set of tuples
All tuples mentioned in the collection
Hoover's directory (13K organizations)
Ideal
A perfect (ideal) system would be able to
extract all these tuples
33The Ideal Metric (2)
Correct location found
Extracted
Ideal
- Precision $= \frac{|Correct(Extracted \cap Ideal)|}{|Extracted \cap Ideal|}$
- Recall $= \frac{|Correct(Extracted \cap Ideal)|}{|Ideal|}$
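A sketch of how these two ratios might be computed, assuming both tables are keyed by organization: the overlap is the set of organizations present in both tables, and a tuple counts as correct when the locations agree. The function name and the toy tables (including the deliberately wrong Nike location and the made-up "Acme" entry) are illustrative assumptions, not experimental data.

```python
def ideal_metrics(extracted, ideal):
    """Return (precision, recall) of an Extracted table against the Ideal table.

    Both arguments are dicts mapping organization -> location.
    """
    overlap = [org for org in extracted if org in ideal]
    correct = sum(1 for org in overlap if extracted[org] == ideal[org])
    precision = correct / len(overlap) if overlap else 0.0
    recall = correct / len(ideal) if ideal else 0.0
    return precision, recall

ideal = {
    "Microsoft": "Redmond",
    "Apple Computer": "Cupertino",
    "Nike": "Portland",
    "Boeing": "Seattle",
}
extracted = {
    "Microsoft": "Redmond",
    "Apple Computer": "Cupertino",
    "Nike": "Seattle",      # deliberately wrong location
    "Acme": "Springfield",  # not in Ideal, so it affects neither ratio
}

precision, recall = ideal_metrics(extracted, ideal)
print(precision, recall)  # 2 of 3 overlapping tuples correct; 2 of 4 Ideal tuples found
```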
34Estimate Precision by Sampling
- Sample extracted table
- Random samples, each 100 tuples
- Manually check validity of tuples in each sample
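The sample-and-check procedure could be sketched like this. The `is_valid` callback stands in for the manual judgment of each tuple; the function name, the number of samples, and the fixed random seed are assumptions, since the slide only fixes the sample size at 100.

```python
import random

def estimate_precision(table, is_valid, sample_size=100, num_samples=3, seed=0):
    """Average fraction of valid tuples over several random samples.

    table: list of extracted tuples
    is_valid: callable standing in for the manual check of one tuple
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_samples):
        sample = rng.sample(table, min(sample_size, len(table)))
        valid = sum(1 for t in sample if is_valid(t))
        estimates.append(valid / len(sample))
    return sum(estimates) / len(estimates)

# Toy table: every fifth tuple is marked invalid by the stand-in checker.
table = [("Org%d" % i, "City%d" % i) for i in range(1000)]
checker = lambda t: int(t[0][3:]) % 5 != 0

print(estimate_precision(table, checker))  # close to 0.8 for this toy table
```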
35Extracting Relations from Text Collections
- Related Work
- The Snowball System
- Pattern representation and generation
- Tuple generation
- Automatic pattern and tuple validation
- Evaluation Metrics
- Experimental Results
36Experimental results Test Collection
Recall (a) and precision (b) using the Ideal
metric, plotted against the minimal number of
occurrences of test tuples in the collection
37Experimental results Sample and Check
Recall (a) and precision (b) for varying minimum
confidence threshold Tt. NOTE: Recall is
estimated using the Ideal metric; precision is
estimated by manually checking random samples of
the result table.
38Conclusions
- We presented
- Our Snowball system
- Requires minimal training (handful of seed tuples)
- Uses a flexible pattern representation
- Achieves high recall/precision: > 80% of test tuples extracted
- Scalable evaluation methodology
39Recent and Future Work
- Recent (presented in DMKD00 workshop)
- Alternative pattern representation
- Combining representations
- Future Work
- Evaluation on other extraction tasks
- Extensions
- Non-binary relations
- Relations with no key
- HTML documents
40Snowball Extracting Relations from Large
Plain-Text Collections
- Eugene Agichtein (eugene_at_cs.columbia.edu), Luis Gravano
- Department of Computer Science, Columbia University
41Backup Slides
42Snowball Solutions
- Flexible pattern representation
- Pattern generation
- Automatic pattern and tuple evaluation
- Able to recover from noise
- Keeps only high quality tuples as new seed
43Experimental Results Training
Recall (a) and precision (b) using the Ideal
metric (training collection)
44Sampling Results Error Analysis
The tuples in the random samples were checked by
hand to pinpoint the culprits responsible for
incorrect tuples. Sample size is 100.
45Sample Discovered Patterns
46Convergence of Snowball and DIPRE
Precision (a) and recall (b) of DIPRE and
Snowball with increasing iterations
47Approximate Matching of Organizations
- Use Whirl (W. Cohen, AT&T) to match similar organization names
- Self-join the Extracted table on the Organization attribute
- Join resulting table with the Test table, and compare values of Location attributes
48References
- Blum & Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of the 1998 Conference on Computational Learning Theory.
- Brin. Extracting patterns and relations from the World-Wide Web. Proceedings of the 1998 International Workshop on Web and Databases (WebDB 1998).
- Collins & Singer. Unsupervised models for named entity classification. EMNLP 1999.
- Riloff & Jones. Learning dictionaries for information extraction by multi-level bootstrapping. AAAI 1999.
- Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL 1995.