Title: Information Extraction from Wikipedia: Moving Down the Long Tail
1Information Extraction from Wikipedia: Moving Down the Long Tail
Fei Wu, Raphael Hoffmann, Daniel S. Weld
Department of Computer Science & Engineering, University of Washington, Seattle, WA, USA
Intelligence in Wikipedia
Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers, Michael Skinner
2Motivating Vision
- Next-Generation Search
- Information Extraction + Ontology + Inference
Which performing artists were born in Chicago?
3Next-Generation Search
- Information Extraction
- <Bob, Born-In, NMH>
- <Bob Black, ISA, actor>
- <NMH, in Chicago>
- Ontology
- Actor ISA Performing Artist
- Inference
- Born-In(A) ∧ PartOf(A,B) => Born-In(B)
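A minimal sketch of how the three ingredients on this slide could answer "Which performing artists were born in Chicago?", assuming a tiny hand-built triple store. The facts come from the slide, but the names `classes_of`, `born_in` and `query`, the use of "Bob Black" for both person triples, and the encoding of "NMH in Chicago" as a `PartOf` triple are illustrative assumptions, not part of Kylin.

```python
# Facts from the slide as (subject, relation, object) triples.
# Assumption: "Bob" and "Bob Black" are the same entity, and
# "NMH in Chicago" is encoded as a PartOf relation.
TRIPLES = {
    ("Bob Black", "Born-In", "NMH"),
    ("Bob Black", "ISA", "actor"),
    ("NMH", "PartOf", "Chicago"),
}
ONTOLOGY = {"actor": "performing artist"}  # child class -> parent class


def classes_of(entity):
    """Return the entity's asserted classes plus all ancestor classes."""
    found = {o for s, r, o in TRIPLES if s == entity and r == "ISA"}
    frontier = list(found)
    while frontier:
        parent = ONTOLOGY.get(frontier.pop())
        if parent and parent not in found:
            found.add(parent)
            frontier.append(parent)
    return found


def born_in(entity):
    """Apply Born-In(A) and PartOf(A, B) => Born-In(B) until nothing new appears."""
    places = {o for s, r, o in TRIPLES if s == entity and r == "Born-In"}
    frontier = list(places)
    while frontier:
        here = frontier.pop()
        for s, r, o in TRIPLES:
            if s == here and r == "PartOf" and o not in places:
                places.add(o)
                frontier.append(o)
    return places


def query(cls, place):
    """Entities that belong to cls (via the ontology) and were born in place (via inference)."""
    entities = {s for s, _, _ in TRIPLES}
    return sorted(e for e in entities if cls in classes_of(e) and place in born_in(e))


print(query("performing artist", "Chicago"))  # ['Bob Black']
```

Neither the extracted triples nor the ontology alone answer the question; the answer only appears once the ISA chain and the PartOf rule are both applied.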
4Wikipedia Bootstrap for the Web
- Goal: search over the Web
- Now: search over Wikipedia
- Comprehensive
- High-quality
- (Semi-)Structured data
5Infoboxes
- Infoboxes are designed to present summary information about an article's subject, so that similar subjects have a uniform look and a common format
- An infobox is a generalization of a taxobox (from taxonomy), which summarizes information for an organism or group of organisms.
6Infobox examples
Basic infobox
Taxobox Plant species
7More examples
Infobox: People - Actor
Infobox: Convention Center
8Outline
- Background: Kylin Extraction
- Long-Tailed Challenges
- Sparse infobox classes
- Incomplete articles
- Moving Down the Long Tails
- Shrinkage
- Retraining
- Extracting from the Web
- Problem with information Extraction
- IWP (Intelligence in Wikipedia)
- CCC and IE
- Virtuous Cycle
- IWP (Shrinkage, Retraining and Extracting from the Web)
- Multilingual Extraction
- Summary
9Kylin: Autonomously Semantifying Wikipedia
- Totally autonomous with no additional human effort
- Forms a training dataset based on infoboxes
- Extracts semantic relations from Wikipedia articles
Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)
10Kylin
- It is a prototype of a self-supervised machine learning system
- It looks for classes of pages with similar infoboxes
- It determines common attributes
- It creates training examples
11Infobox Generation
12Preprocessor
- Schema Refinement
- Free edit -> schema drift
- Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)
- Low usage of attribute
- Duplicate attributes: Census Yr, Census Estimate Yr, Census Est., Census Year
- Kylin
- Strict name match
- 15% occurrences
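A rough sketch of the attribute filter suggested by this slide, assuming each infobox is a dict from attribute name to value. Reading the slide's "15 occurrences" threshold as 15% of the class's articles is an assumption, and `refine_schema` and the sample infoboxes are made up for illustration.

```python
from collections import Counter


def refine_schema(infoboxes, min_fraction=0.15):
    """Keep attributes whose exact name appears in at least min_fraction of infoboxes."""
    counts = Counter(name for box in infoboxes for name in box)
    threshold = min_fraction * len(infoboxes)
    return sorted(name for name, n in counts.items() if n >= threshold)


# Hypothetical infoboxes for one class. "census yr" and "census year" are
# duplicate attributes that strict name matching does not merge, so the
# rarely used spelling falls below the threshold and is dropped.
boxes = (
    [{"county seat": "Clearfield", "census yr": "2000"}] * 9
    + [{"county seat": "Bellefonte", "census year": "2000"}]
)
print(refine_schema(boxes))  # ['census yr', 'county seat']
```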
13Preprocessor
- Training Dataset Construction
Clearfield County was created on 1804 from parts
of Huntingdon and Lycoming Counties but was
administered as part of Centre County until 1812.
Its county seat is Clearfield.
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
As of 2005, the population density was 28.2/km².
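One way to picture training dataset construction is to label each article sentence with the infobox attributes whose values it mentions. The sketch below assumes plain case-insensitive substring matching and a naive sentence splitter. Kylin's actual matching heuristics are not given on the slide, and `build_training_set` and the infobox values are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_training_set(article_text, infobox):
    """Label each sentence with the infobox attributes whose value it mentions."""
    examples = []
    for sentence in split_sentences(article_text):
        lowered = sentence.lower()
        labels = [attr for attr, value in infobox.items() if value.lower() in lowered]
        examples.append((sentence, labels))
    return examples


article = (
    "Its county seat is Clearfield. "
    "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it is water. "
    "As of 2005, the population density was 28.2/km²."
)
# Hypothetical infobox values for the article above.
infobox = {"seat": "Clearfield", "land_area": "2,972 km²", "density": "28.2/km²"}

for sentence, labels in build_training_set(article, infobox):
    print(labels, "<-", sentence)
```

Sentences that match no attribute value come back with an empty label list and can serve as negative examples.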
14Classifier
- Document Classifier
- List and Category
- Fast
- Precision (98.5%)
- Recall (68.8%)
- Sentence Classifier
- Predicts which attribute values are contained in a given sentence.
- It uses a maximum entropy model.
- To cope with the noisy and incomplete training dataset, Kylin applies bagging.
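A small sketch of bagging around a maximum entropy sentence classifier, under several assumptions: one binary classifier per attribute (for which maximum entropy reduces to logistic regression), bag-of-words features, plain stochastic gradient ascent, and made-up training sentences. None of these details are stated on the slide.

```python
import math
import random


def featurize(sentence):
    """Bag-of-words features: the set of lowercased whitespace tokens."""
    return set(sentence.lower().split())


def train_logreg(examples, epochs=200, lr=0.5):
    """Binary maximum entropy (logistic regression) model trained by SGD."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for words, label in examples:
            score = bias + sum(weights.get(w, 0.0) for w in words)
            error = label - 1.0 / (1.0 + math.exp(-score))
            bias += lr * error
            for w in words:
                weights[w] = weights.get(w, 0.0) + lr * error
    return weights, bias


def predict(model, sentence):
    """Probability that the sentence contains the attribute value."""
    weights, bias = model
    score = bias + sum(weights.get(w, 0.0) for w in featurize(sentence))
    return 1.0 / (1.0 + math.exp(-score))


def train_bagged(sentences, labels, n_bags=10, seed=0):
    """Train one model per bootstrap resample of the (noisy) training set."""
    rng = random.Random(seed)
    data = [(featurize(s), y) for s, y in zip(sentences, labels)]
    return [train_logreg([rng.choice(data) for _ in data]) for _ in range(n_bags)]


def bagged_probability(models, sentence):
    """Average the per-model probabilities."""
    return sum(predict(m, sentence) for m in models) / len(models)


# Made-up training data: 1 = sentence contains the county seat value.
sentences = [
    "Its county seat is Clearfield.",
    "The county seat is Bellefonte.",
    "The population density was 28.2/km².",
    "The county was created in 1804.",
]
labels = [1, 1, 0, 0]
models = train_bagged(sentences, labels)
print(bagged_probability(models, "Its county seat is Erie."))
print(bagged_probability(models, "The density was low."))
```

Averaging over resampled training sets means a single mislabeled sentence influences only the bags that happen to draw it.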
15CRF Extractor
- Conditional Random Fields Model
- Attribute value extraction: sequential data labeling
- CRF model for each attribute independently
- Relabel: filter false negative training examples
- 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
- Preprocessor: Water_area
- Classifier: Water_area, Land_area
- Though Kylin is successful on popular classes, its performance decreases on sparse classes where there is insufficient training data.
17Long-Tail 1: Sparse Infobox Classes
- Kylin Performs Well on Popular Classes
- Precision: mid 70% to high 90%
- Recall: low 50% to mid 90%
- Kylin Flounders on Sparse Classes: Little Training Data
E.g., for the US County class Kylin achieves 97.3% precision and 95.9% recall, while many other classes, such as Irish Newspaper, have only a small number of articles with infoboxes.
18Long-Tail 2: Incomplete Articles
- Desired Information Missing from Wikipedia
- Among the 1.8 million Wikipedia pages (July 2007), many are short articles, and almost 800,000 (44.2%) are marked as stub pages, indicating that much-needed information is missing.
20Shrinkage
- Attempt to improve Kylin's performance using shrinkage.
- We use shrinkage when training an extractor for an instance-sparse infobox class by aggregating data from its parent and children classes
21Shrinkage
McCallum et al., ICML98
22Shrinkage
- KOG (Kylin Ontology Generator) [Wu & Weld, WWW08]
- Classes: person (1201), performer (44), actor (8738), comedian (106)
- Attribute names across these classes: .birth_place, .location, .birthplace, .birth_place, .cityofbirth, .origin
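A sketch of pooling training data for a sparse class from its parent and children. The parent/child links, the class-to-attribute mapping, the distance-based decay weight, and the example sentences are all illustrative assumptions. The slides do not specify how the aggregated examples are weighted.

```python
# Hypothetical ontology fragment (child -> parent), in the spirit of KOG's output.
PARENT = {"performer": "person", "actor": "performer", "comedian": "performer"}

# Hypothetical mapping of each class's attribute onto performer.location.
ATTRIBUTE_MAP = {
    "person": "birth_place",
    "performer": "location",
    "actor": "birthplace",
    "comedian": "origin",
}


def related_classes(target, max_distance=1):
    """Return {class: distance} for the target, its ancestors and its descendants."""
    related = {target: 0}
    node, dist = target, 0
    while node in PARENT and dist < max_distance:  # walk up to ancestors
        node, dist = PARENT[node], dist + 1
        related[node] = dist
    frontier = [(target, 0)]
    while frontier:  # walk down to descendants
        node, dist = frontier.pop()
        if dist == max_distance:
            continue
        for child, parent in PARENT.items():
            if parent == node:
                related[child] = dist + 1
                frontier.append((child, dist + 1))
    return related


def shrinkage_training_set(target, examples_by_class, decay=0.5):
    """Pool (sentence, value, weight) examples; farther classes get smaller weights."""
    pooled = []
    for cls, dist in related_classes(target).items():
        attr = ATTRIBUTE_MAP[cls]
        for sentence, value in examples_by_class.get(cls, {}).get(attr, []):
            pooled.append((sentence, value, decay ** dist))
    return pooled


# Made-up training examples keyed by class and attribute name.
examples = {
    "performer": {"location": [("She performs in Seattle.", "Seattle")]},
    "actor": {"birthplace": [("He was born in Chicago.", "Chicago")]},
    "person": {"birth_place": [("She was born in Boston.", "Boston")]},
}
for row in shrinkage_training_set("performer", examples):
    print(row)
```

The sparse performer class keeps its own examples at full weight and borrows down-weighted examples from person and actor through the attribute mapping.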
24Retraining
Complementary to Shrinkage: harvest extra training data from the broader Web
- Key
- How to identify relevant sentences given the sea of Web data?
Andrew Murray was born in Scotland in 1828
<Andrew Murray, was born in, Scotland>
<Andrew Murray, was born in, 1828>
25Retraining
Kylin Extraction / TextRunner Extraction
Query TextRunner for relevant sentences
t = <Ada Cambridge, location, St Germans, Norfolk, England>
- r1 = <Ada Cambridge, was born in, England>
- Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870.
- r2 = <Ada Cambridge, was born in, Norfolk, England>
- Ada Cambridge was born in Norfolk, England, in 1844.
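A sketch of how sentences might be harvested for retraining, assuming TextRunner results arrive as (tuple, sentence) pairs. The matching rule used here (same subject, and the TextRunner object occurring inside the infobox value) and the names `normalize` and `harvest_sentences` are assumptions for illustration. The third result is a made-up distractor.

```python
def normalize(text):
    """Lowercase and drop the spaces that tokenization leaves before commas."""
    return " ".join(text.lower().replace(" ,", ",").split())


def harvest_sentences(kylin_tuple, textrunner_results):
    """Keep sentences whose extraction shares the subject and whose object occurs in the value."""
    subject, _attribute, value = kylin_tuple
    harvested = []
    for (arg1, _predicate, arg2), sentence in textrunner_results:
        if normalize(arg1) == normalize(subject) and normalize(arg2) in normalize(value):
            harvested.append(sentence)
    return harvested


t = ("Ada Cambridge", "location", "St Germans , Norfolk , England")
results = [
    (("Ada Cambridge", "was born in", "England"),
     "Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870."),
    (("Ada Cambridge", "was born in", "Norfolk , England"),
     "Ada Cambridge was born in Norfolk , England , in 1844 ."),
    (("Ada Cambridge", "moved to", "Australia"),  # made-up distractor
     "Ada Cambridge moved to Australia in 1870."),
]
print(len(harvest_sentences(t, results)))  # 2
```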
26Effect of Shrinkage & Retraining
27Effect of Shrinkage & Retraining
1755% improvement for a sparse class
13.7% improvement for a popular class
29Extraction from the Web
- Idea: apply Kylin extractors trained on Wikipedia to general Web pages
- Challenge: maintain high precision
- General Web pages are noisy
- Many Web pages describe multiple objects
- Key: retrieve relevant sentences
- Procedure
- Generate a set of search engine queries
- Retrieve top-k pages from Google
- Weight extractions from these pages
30Choosing Queries
Example: get the birth date attribute for the article titled "Andrew Murray (minister)"
- "andrew murray"
- "andrew murray" birth date (attribute name)
- "andrew murray" was born in (predicates from TextRunner)
- "andrew murray" ...
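The query set in this example can be sketched as follows. The function `build_queries`, the stripping of the parenthesized disambiguation suffix, and the quoting of the title are assumptions about how such queries might be formed. The predicates are simply passed in, since the TextRunner lookup itself is not modeled.

```python
import re


def build_queries(article_title, attribute_name, predicates):
    """Build queries: title alone, title + attribute name, title + each predicate."""
    # Drop a trailing disambiguation suffix such as " (minister)" and lowercase.
    title = re.sub(r"\s*\([^)]*\)\s*$", "", article_title).lower()
    quoted = f'"{title}"'
    queries = [quoted, f"{quoted} {attribute_name}"]
    queries += [f"{quoted} {p}" for p in predicates]
    return queries


for q in build_queries("Andrew Murray (minister)", "birth date", ["was born in"]):
    print(q)
```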
31Weighting Extractions
- Which extractions are more relevant?
- Features
- Number of sentences between the sentence and the closest occurrence of the title (andrew murray)
- Rank of the page on Google's result list
- Kylin's extractor confidence
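A sketch of a weighted combination over the three features listed above. The feature transforms, the uniform default weights, and the candidate values are assumptions made for illustration. The slides do not give the actual weights or how they were chosen.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    value: str
    sentence_distance: int  # sentences between the extraction and the closest title mention
    page_rank: int          # 1-based position of the page in the search results
    confidence: float       # extractor confidence in [0, 1]


def score(e, w_distance=1.0, w_rank=1.0, w_confidence=1.0):
    """Weighted combination; distance and rank are mapped to (0, 1] so larger is better."""
    closeness = 1.0 / (1.0 + e.sentence_distance)
    rank_quality = 1.0 / e.page_rank
    return w_distance * closeness + w_rank * rank_quality + w_confidence * e.confidence


# Made-up candidate extractions for the birth date attribute.
candidates = [
    Extraction("9 May 1828", sentence_distance=0, page_rank=1, confidence=0.7),
    Extraction("18 January 1917", sentence_distance=6, page_rank=8, confidence=0.9),
]
best = max(candidates, key=score)
print(best.value)  # 9 May 1828
```

Here the second candidate has the higher extractor confidence but sits far from any mention of the title on a low-ranked page, so the combined score prefers the first.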
32Web Extraction Experiment
- Extractor confidence alone performs poorly
- Weighted combination is the best
33Combining Wikipedia & Web
Recall benefit from Shrinkage / Retraining
34Combining Wikipedia & Web
Benefit from Shrinkage + Retraining + Web
36Problem
- Information Extraction is Imprecise
- Wikipedians Don't Want 90% Precision
- How to Improve Precision?
- People!
38Intelligence in Wikipedia
- What is IWP?
- A project/system that aims to combine
- IE (Information Extraction)
- CCC (communal content creation)
39Information Extraction
- Examples
- Zoominfo.com
- Flipdog.com
- Citeseer
- Google
- Advantage: autonomy
- Disadvantage: expensive
40IE system contributors
- Contributors in this room?
- Wikipedia
- IE systems
- Citeseer
- Rexa
- DBlife
41Communal Content Creation
- Examples
- Wikipedia
- Ebay
- Netflix
- Advantage: more accurate than IE
- Disadvantage: bootstrapping, incentives, and management
43Virtuous Cycle
44Contributing as a Non-Primary Task
- Encourage contributions
- Without annoying or abusing readers
- Compared 5 different interfaces
46Results
- Contribution Rate
- 1.6% -> 13%
- 90% of positive labels were correct
48IWP and Shrinkage, Retraining, and Extracting from the Web
- Shrinkage improves IWP's precision and recall
- Retraining improves the robustness of IWP's extractors
- Extraction from the Web further helps IWP's performance
49Multi-Lingual Extraction
- Idea: further leverage the virtuous feedback cycle
- Utilize IE methods to add or update missing information by copying from one language to another
- Utilize CCC to validate and improve updates.
- Example
- Nombre = Jerry Seinfeld and Name = Jerry Seinfeld
- Cónyuge = Jessica Sklar and Spouse = Jessica Seinfeld
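A sketch of the cross-language step, assuming infoboxes are dicts and that an attribute alignment between the two languages is already available. The mapping `ES_TO_EN` and the function `propose_updates` are hypothetical. Missing values become copy candidates, while disagreements are set aside for human validation in the spirit of CCC.

```python
# Hypothetical attribute alignment between Spanish and English infoboxes.
ES_TO_EN = {"Nombre": "Name", "Cónyuge": "Spouse"}


def propose_updates(es_infobox, en_infobox, mapping=ES_TO_EN):
    """Return (missing, conflicts): values to copy over and disagreements needing review."""
    missing, conflicts = {}, {}
    for es_attr, value in es_infobox.items():
        en_attr = mapping.get(es_attr)
        if en_attr is None:
            continue  # no known counterpart for this attribute
        if en_attr not in en_infobox:
            missing[en_attr] = value
        elif en_infobox[en_attr] != value:
            conflicts[en_attr] = (en_infobox[en_attr], value)
    return missing, conflicts


es = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
en = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}
print(propose_updates(es, en))
```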
50Summary
- Kylin's initial performance is unacceptable
- Methods for increasing recall
- Shrinkage
- Retraining
- Extraction from the web
51Summary
- IWP: developing AI methods to facilitate the growth, operation and use of Wikipedia
- Initial goal: extraction of a giant knowledge base of semantic triples
- Faceted browsing
- Input to a reasoning-based question-answering system
- How
- IE
- CCC
52Questions?