Information Extraction from Wikipedia: - PowerPoint PPT Presentation

About This Presentation
Title:

Information Extraction from Wikipedia:

Description:

Information Extraction from Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld Department of Computer Science & Engineering – PowerPoint PPT presentation

Slides: 53
Provided by: OfficeofI97
Learn more at: https://crystal.uta.edu

Transcript and Presenter's Notes

Title: Information Extraction from Wikipedia:


1
Information Extraction from Wikipedia: Moving
Down the Long Tail
Fei Wu, Raphael Hoffmann, Daniel S. Weld
Department of Computer Science & Engineering
University of Washington, Seattle, WA, USA
Intelligence in Wikipedia
Fei Wu, Eytan Adar, Saleema Amershi, Oren
Etzioni, James Fogarty, Raphael Hoffmann, Kayur
Patel, Stef Schoenmackers, Michael Skinner
2
Motivating Vision
  • Next-Generation Search
  • Information Extraction, Ontology, Inference

Which performing artists were born in Chicago?
3
Next-Generation Search
  • Information Extraction
  • <Bob, Born-In, NMH>
  • <Bob Black, ISA, actor>
  • <NMH, in Chicago>
  • Ontology
  • Actor ISA Performing Artist
  • Inference
  • Born-In(A) ∧ PartOf(A,B) => Born-In(B)
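The three ingredients on this slide (extracted triples, an ontology, and an inference rule) can be combined to answer the motivating question. The following is a minimal Python sketch, not the system's implementation: the triple store, the `PartOf` reading of the NMH/Chicago fact, and all function names are assumptions made for illustration.

```python
# Toy triple store; facts mirror the slide's example, names are illustrative.
facts = {
    ("Bob Black", "Born-In", "NMH"),
    ("Bob Black", "ISA", "actor"),
    ("NMH", "PartOf", "Chicago"),  # assumed reading of "NMH, in Chicago"
}
# Ontology: subclass edges (child -> parent).
subclass_of = {"actor": "performing artist"}


def is_a(entity, target_class):
    """True if entity belongs to target_class directly or via subclass edges."""
    for subj, pred, cls in facts:
        if subj == entity and pred == "ISA":
            while cls is not None:
                if cls == target_class:
                    return True
                cls = subclass_of.get(cls)
    return False


def born_in(entity):
    """Places the entity was born in, closing over
    Born-In(A) and PartOf(A, B) => Born-In(B)."""
    places = {o for s, p, o in facts if s == entity and p == "Born-In"}
    frontier = list(places)
    while frontier:
        place = frontier.pop()
        for s, p, o in facts:
            if s == place and p == "PartOf" and o not in places:
                places.add(o)
                frontier.append(o)
    return places


def performing_artists_born_in(city):
    entities = {s for s, _, _ in facts}
    return sorted(e for e in entities
                  if is_a(e, "performing artist") and city in born_in(e))


print(performing_artists_born_in("Chicago"))  # ['Bob Black']
```

Neither the extracted facts nor the ontology alone mentions a performing artist born in Chicago; the answer only appears once the subclass edge and the part-of rule are applied.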

4
Wikipedia Bootstrap for the Web
  • Goal: search over the Web
  • Now: search over Wikipedia
  • Comprehensive
  • High-quality
  • (Semi-)Structured data

5
Infoboxes
  • Infoboxes are designed to present summary
    information about an article's subject, so that
    similar subjects have a uniform look and a
    common format
  • An infobox is a generalization of a taxobox (from
    taxonomy) which summarizes information for an
    organism or group of organisms.

6
Infobox examples
Basic infobox
Taxobox: Plant species
7
More examples
Infobox: People - Actor
Infobox: Convention Center
8
Outline
  • Background: Kylin Extraction
  • Long-Tailed Challenges
  • Sparse infobox classes
  • Incomplete articles
  • Moving Down the Long Tails
  • Shrinkage
  • Retraining
  • Extracting from the Web
  • Problem with Information Extraction
  • IWP (Intelligence in Wikipedia)
  • CCC and IE
  • Virtuous Cycle
  • IWP (Shrinkage, Retraining and Extracting from
    Web)
  • Multilingual Extraction
  • Summary

9
Kylin: Autonomously Semantifying Wikipedia
  • Totally autonomous, with no additional human
    effort
  • Form training dataset based on infoboxes
  • Extract semantic relations from Wikipedia articles

Kylin: a mythical hooved Chinese chimerical
creature that is said to appear in conjunction
with the arrival of a sage. (Wikipedia)
10
Kylin
  • It is a prototype of a self-supervised machine
    learning system
  • It looks for classes of pages with similar
    infoboxes
  • It determines common attributes
  • It creates training examples

11
Infobox Generation
12
Preprocessor
  • Schema Refinement: free editing -> schema drift
  • Duplicate templates: U.S. County (1428), US
    County (574), Counties (50), County (19)
  • Low usage of attributes
  • Duplicate attributes: Census Yr, Census
    Estimate Yr, Census Est., Census Year
  • Kylin
  • Strict name match
  • 15% occurrences
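The strict-name-match and minimum-usage filtering described above might look like the sketch below. It reads the slide's "15" as a minimum usage fraction of 0.15, which is an assumption; the function name `refine_schema`, the `min_usage` parameter, and the sample infoboxes are hypothetical.

```python
from collections import Counter


def refine_schema(infoboxes, min_usage=0.15):
    """Keep attributes used by at least min_usage of the class's infoboxes.

    infoboxes: list of dicts (attribute name -> value), one per article.
    Attribute names are compared by strict string equality, so near-duplicates
    such as "Census Yr" and "Census Year" are counted separately.
    """
    usage = Counter(attr for box in infoboxes for attr in box)
    total = len(infoboxes)
    return sorted(a for a, n in usage.items() if n / total >= min_usage)


# Hypothetical infoboxes for a county class (10 articles).
boxes = (
    [{"county_seat": "x", "population": "1"}] * 8
    + [{"county_seat": "y", "Census Yr": "2000"}]
    + [{"county_seat": "z", "Census Year": "2000"}]
)
print(refine_schema(boxes))  # ['county_seat', 'population']
```

Because of the strict name match, the two census-year spellings each fall below the threshold and are dropped, even though together they describe the same attribute.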

13
Preprocessor
  • Training Dataset Construction

Clearfield County was created on 1804 from parts
of Huntingdon and Lycoming Counties but was
administered as part of Centre County until 1812.
Its county seat is Clearfield.
2,972 km² (1,147 mi²) of it is land and 17 km²
(7 mi²) of it (0.56%) is water.
As of 2005, the population density was 28.2/km².
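Training dataset construction pairs infobox values with the article sentences that mention them. The sketch below shows one simple way to do that, assuming a naive sentence splitter and case-insensitive substring matching; the infobox attribute names and the helper functions are hypothetical, not Kylin's actual heuristics.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_training_set(article_text, infobox):
    """Label each sentence with the attributes whose value it mentions."""
    examples = []
    for sentence in split_sentences(article_text):
        labels = sorted(attr for attr, value in infobox.items()
                        if value.lower() in sentence.lower())
        examples.append((sentence, labels))
    return examples


article = ("Clearfield County was created on 1804 from parts of Huntingdon "
           "and Lycoming Counties. Its county seat is Clearfield. "
           "As of 2005, the population density was 28.2/km².")
# Hypothetical infobox; attribute names are illustrative.
infobox = {"established": "1804", "seat": "Clearfield", "density": "28.2/km²"}

for sentence, labels in build_training_set(article, infobox):
    print(labels, "|", sentence)
```

Note that the first sentence is also labeled `seat`, because "Clearfield" occurs in the county's own name. Automatically built training data is noisy in exactly this way.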
14
Classifier
  • Document Classifier
  • List and Category
  • Fast
  • Precision (98.5%)
  • Recall (68.8%)
  • Sentence Classifier
  • Predicts which attribute values are contained in
    a given sentence.
  • It uses a maximum entropy model.
  • To cope with the noisy and incomplete training
    dataset, Kylin applies bagging.
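A rough illustration of a bagged maximum entropy sentence classifier follows. It is a sketch under simplifying assumptions: one binary classifier for a single attribute, bag-of-words features, plain stochastic gradient updates, and a tiny made-up training set. None of the names or hyperparameters come from Kylin.

```python
import math
import random


def featurize(sentence):
    return set(sentence.lower().split())


def train_maxent(examples, epochs=50, lr=0.5):
    """Binary maximum entropy (logistic regression) on bag-of-words features."""
    weights = {}
    for _ in range(epochs):
        for sentence, label in examples:
            feats = featurize(sentence)
            score = sum(weights.get(f, 0.0) for f in feats)
            prob = 1.0 / (1.0 + math.exp(-score))
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * (label - prob)
    return weights


def predict_proba(weights, sentence):
    score = sum(weights.get(f, 0.0) for f in featurize(sentence))
    return 1.0 / (1.0 + math.exp(-score))


def train_bagged(examples, n_models=5, seed=0):
    """Train each model on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    return [train_maxent([rng.choice(examples) for _ in examples])
            for _ in range(n_models)]


def bagged_proba(models, sentence):
    """Average the member models' probabilities."""
    return sum(predict_proba(w, sentence) for w in models) / len(models)


# Hypothetical labels: 1 if the sentence states the county seat, else 0.
examples = [
    ("Its county seat is Clearfield", 1),
    ("The county seat is Bellefonte", 1),
    ("The seat of the county is Erie", 1),
    ("The population density was 28.2 per square km", 0),
    ("It was created in 1804", 0),
    ("The county has a total area of 2989 square km", 0),
]
models = train_bagged(examples)
print(round(bagged_proba(models, "Its county seat is Harrisburg"), 2))
```

Averaging over bootstrap resamples means a single mislabeled sentence influences only some of the member models, which is the usual motivation for bagging on noisy data.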

15
CRF Extractor
  • Conditional Random Fields Model
  • Attribute value extraction: sequential data
    labeling
  • CRF model for each attribute independently
  • Relabel: filter false negative training examples
  • 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²)
    of it (0.56%) is water.
  • Preprocessor: Water_area
  • Classifier: Water_area, Land_area
  • Though Kylin is successful on popular classes,
    its performance decreases on sparse classes where
    there is insufficient training data.
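Treating extraction as sequential data labeling means each training sentence becomes a sequence of (token, label) pairs for the attribute's model. The sketch below shows only that labeling step, assuming whitespace tokenization and exact token matching; it is not a CRF, and `label_tokens` is a hypothetical helper.

```python
def label_tokens(sentence, value, attribute):
    """Tag tokens that spell out `value` with the attribute name, others 'O'."""
    tokens = sentence.split()
    target = value.split()
    labels = ["O"] * len(tokens)
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            labels[i:i + len(target)] = [attribute] * len(target)
            break  # label only the first occurrence
    return list(zip(tokens, labels))


sentence = "2,972 km² of it is land and 17 km² of it is water."
for token, label in label_tokens(sentence, "17 km²", "water_area"):
    print(f"{token}\t{label}")
```

A sequence model trained on such pairs learns from context (for example the nearby word "water") which number in the sentence fills the attribute.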

17
Long-Tail 1: Sparse Infobox Classes
  • Kylin Performs Well on Popular Classes
  • Precision: mid 70% to high 90%
  • Recall: low 50% to mid 90%
  • Kylin Flounders on Sparse Classes: Little
    Training Data

E.g., for the US County class Kylin has 97.3%
precision and 95.9% recall, while many other
classes, such as Irish Newspaper, contain only a
very small number of articles with infoboxes.
18
Long-Tail 2: Incomplete Articles
  • Desired Information Missing from Wikipedia
  • Among the 1.8 million pages of Wikipedia (July
    2007), many are short articles, and almost 800,000
    (44.2%) are marked as stub pages, indicating that
    much-needed information is missing.

20
Shrinkage
  • Attempt to improve Kylin's performance using
    shrinkage.
  • We use shrinkage when training an extractor of an
    instance-sparse infobox class by aggregating data
    from its parent and children classes

21
Shrinkage
McCallum et al., ICML '98
22
Shrinkage
  • KOG (Kylin Ontology Generator): Wu & Weld, WWW '08

[Figure: class hierarchy person (1201), performer (44),
actor (8738), comedian (106), with attributes .birth_place,
.location, .birthplace, .birth_place, .cityofbirth, .origin]
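Shrinkage, as described on these slides, borrows training data for a sparse class from its parent and children in the ontology. A minimal sketch of that aggregation follows. The attribute alignment, the down-weighting factor, and all names are assumptions for illustration; the slides do not specify how borrowed examples are weighted.

```python
# Hypothetical ontology fragment, loosely following the slide's
# person / performer / actor / comedian example (child -> parent).
parent = {"performer": "person", "actor": "performer", "comedian": "performer"}
# Attributes assumed to be equivalent to performer.location (illustrative).
mapped_attribute = {
    "person": "birth_place",
    "actor": "birthplace",
    "comedian": "origin",
}


def related_classes(target):
    """Parent (if any) followed by the children of target."""
    related = []
    if target in parent:
        related.append(parent[target])
    related += [child for child, par in parent.items() if par == target]
    return related


def shrinkage_training_set(target, target_attr, examples, related_weight=0.5):
    """Collect (sentence, weight) pairs for target.target_attr.

    examples: dict mapping (class, attribute) -> list of labeled sentences.
    Sentences from the target class get weight 1.0; sentences borrowed from
    the parent and children are down-weighted by related_weight.
    """
    out = [(s, 1.0) for s in examples.get((target, target_attr), [])]
    for cls in related_classes(target):
        attr = mapped_attribute.get(cls)
        out += [(s, related_weight) for s in examples.get((cls, attr), [])]
    return out


examples = {
    ("performer", "location"): ["A was born in Seattle."],
    ("person", "birth_place"): ["B was born in Boston."],
    ("actor", "birthplace"): ["C was born in Chicago.", "D was born in Denver."],
}
result = shrinkage_training_set("performer", "location", examples)
for sentence, weight in result:
    print(weight, sentence)
```

In this toy data the sparse performer class has one example of its own but ends up with four, three of them borrowed at reduced weight.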
24
Retraining
Complementary to Shrinkage: harvest extra
training data from the broader Web
  • Key
  • How to identify relevant sentences in the sea of
    Web data?

Andrew Murray was born in Scotland in 1828
<Andrew Murray, was born in, Scotland>
<Andrew Murray, was born in, 1828>
25
Retraining
Kylin Extraction, TextRunner Extraction
Query TextRunner for relevant sentences
t = <Ada Cambridge, location, St Germans,
Norfolk, England>
  • r1 = <Ada Cambridge, was born in, England>
  • Ada Cambridge was born in England in 1844 and
    moved to Australia with her curate husband in
    1870.
  • r2 = <Ada Cambridge, was born in, Norfolk,
    England>
  • Ada Cambridge was born in Norfolk, England, in
    1844.
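One plausible way to decide which TextRunner tuples line up with an infobox tuple, as in the Ada Cambridge example, is to require the same subject and an object contained in the infobox value. The sketch below implements only that matching rule; the rule, the function names, and the third candidate tuple are assumptions, not the published procedure.

```python
def normalize(text):
    """Lowercase and make comma spacing uniform for loose matching."""
    return " ".join(text.lower().replace(",", " , ").split())


def matching_extractions(infobox_tuple, textrunner_tuples):
    """Return TextRunner tuples that agree with an infobox tuple.

    A TextRunner tuple matches when its subject equals the infobox subject
    and its object appears inside the infobox value.
    """
    subject, _attribute, value = infobox_tuple
    matches = []
    for arg1, predicate, arg2 in textrunner_tuples:
        if (normalize(arg1) == normalize(subject)
                and normalize(arg2) in normalize(value)):
            matches.append((arg1, predicate, arg2))
    return matches


t = ("Ada Cambridge", "location", "St Germans, Norfolk, England")
candidates = [
    ("Ada Cambridge", "was born in", "England"),
    ("Ada Cambridge", "was born in", "Norfolk, England"),
    ("Ada Cambridge", "moved to", "Australia"),  # hypothetical non-match
]
for match in matching_extractions(t, candidates):
    print(match)
```

The sentences behind the matching tuples can then serve as extra positive training examples for the `location` extractor. Plain substring matching is crude (a short object could match inside an unrelated word), so a real system would need token-level comparison.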

26
Effect of Shrinkage & Retraining
27
Effect of Shrinkage & Retraining
1755% improvement for a sparse class
13.7% improvement for a popular class
29
Extraction from the Web
  • Idea: apply Kylin extractors trained on Wikipedia
    to general Web pages
  • Challenge: maintain high precision
  • General Web pages are noisy
  • Many Web pages describe multiple objects
  • Key: retrieve relevant sentences
  • Procedure:
  • Generate a set of search engine queries
  • Retrieve top-k pages from Google
  • Weight extractions from these pages

30
Choosing Queries
Example: get the birth date attribute for the article
titled Andrew Murray (minister)
  • andrew murray
  • andrew murray birth date (attribute name)
  • andrew murray was born in (predicates from TextRunner)
  • andrew murray
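Query generation for one attribute of one article can be sketched as follows. The handling of the parenthetical disambiguation and the quoting of the name are assumptions about how the queries were formed, and `build_queries` is a hypothetical helper.

```python
import re


def build_queries(title, attribute_name, predicates):
    """Build search queries for one attribute of one article."""
    # Drop a trailing disambiguation such as " (minister)" and quote the name.
    name = re.sub(r"\s*\([^)]*\)\s*$", "", title).lower()
    quoted = f'"{name}"'
    queries = [quoted, f"{quoted} {attribute_name}"]
    queries += [f"{quoted} {p}" for p in predicates]
    return queries


for q in build_queries("Andrew Murray (minister)", "birth date", ["was born in"]):
    print(q)
# "andrew murray"
# "andrew murray" birth date
# "andrew murray" was born in
```

The first query is the broadest; adding the attribute name or a predicate narrows the results toward pages likely to state the wanted value.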
31
Weighting Extractions
  • Which extractions are more relevant?
  • Features
  • Number of sentences between the sentence and the
    closest occurrence of the title (andrew murray)
  • Rank of the page in Google's result lists
  • Kylin's extractor confidence
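The three features above can be folded into a single relevance score with a weighted combination. The sketch below uses made-up weights and simple reciprocal transforms purely for illustration; the slides do not give the actual weights or how they were obtained, and the candidate values are hypothetical.

```python
def extraction_score(sentence_distance, page_rank, confidence,
                     weights=(0.3, 0.2, 0.5)):
    """Combine the three features into one relevance score.

    sentence_distance: sentences between the extraction and the closest
        mention of the article title (0 = same sentence).
    page_rank: 1-based position of the page in the search results.
    confidence: extractor confidence in [0, 1].
    weights: illustrative values only, not taken from the source.
    """
    w_dist, w_rank, w_conf = weights
    closeness = 1.0 / (1.0 + sentence_distance)
    rank_score = 1.0 / page_rank
    return w_dist * closeness + w_rank * rank_score + w_conf * confidence


# Hypothetical candidate values for one attribute.
candidates = [
    ("1828", extraction_score(0, 1, 0.6)),
    ("1917", extraction_score(4, 8, 0.9)),
]
best = max(candidates, key=lambda c: c[1])
print(best[0])  # 1828
```

Here the second candidate has the higher extractor confidence but sits far from any mention of the title on a low-ranked page, so the combined score prefers the first.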

32
Web Extraction Experiment
  • Extractor confidence alone performs poorly
  • Weighted combination performs best

33
Combining Wikipedia & Web
Recall: Benefit from Shrinkage / Retraining
34
Combining Wikipedia & Web
Benefit from Shrinkage + Retraining + Web
36
Problem
  • Information Extraction is Imprecise
  • Wikipedians Don't Want 90% Precision
  • How to Improve Precision?
  • People!

38
Intelligence in Wikipedia
  • What is IWP?
  • A project/system that aims to combine
  • IE (Information Extraction)
  • CCC (communal content creation)

39
Information Extraction
  • Examples
  • Zoominfo.com
  • Flipdog.com
  • Citeseer
  • Google
  • Advantage: autonomy
  • Disadvantage: expensive

40
IE system contributors
  • Contributors in this room?
  • Wikipedia
  • IE systems
  • Citeseer
  • Rexa
  • DBLife

41
Communal Content Creation
  • Examples
  • Wikipedia
  • eBay
  • Netflix
  • Advantage: more accurate than IE
  • Disadvantage: bootstrapping, incentives, and
    management

43
Virtuous Cycle
44
Contributing as a Non-Primary Task
  • Encourage contributions
  • Without annoying or abusing readers
  • Compared 5 different interfaces

46
Results
  • Contribution Rate
  • 1.6% -> 13%
  • 90% of positive labels were correct

48
IWP and Shrinkage, Retraining, and Extracting
from the Web
  • Shrinkage improves IWP's precision and recall
  • Retraining improves the robustness of IWP's
    extractors
  • Extraction from the Web further helps IWP's
    performance

49
Multi-Lingual Extraction
  • Idea: further leverage the virtuous feedback
    cycle
  • Utilize IE methods to add or update missing
    information by copying from one language to
    another
  • Utilize CCC to validate and improve updates.
  • Example:
  • Nombre = Jerry Seinfeld and Name = Jerry
    Seinfeld
  • Cónyuge = Jessica Sklar and Spouse = Jessica
    Seinfeld
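The cross-language idea can be sketched as a comparison of two infoboxes through an attribute-name alignment: missing values become proposed additions, and disagreements are flagged for human validation (the CCC step). The alignment table and the function below are hypothetical; how the real system aligns attributes across languages is not stated on the slide.

```python
# Hypothetical attribute-name alignment between Spanish and English infoboxes.
es_to_en = {"Nombre": "Name", "Cónyuge": "Spouse"}


def propose_updates(es_infobox, en_infobox):
    """Compare two language versions of an infobox.

    Returns (additions, conflicts): values missing from the English infobox,
    and values present in both that disagree and need human review.
    """
    additions, conflicts = {}, {}
    for es_attr, value in es_infobox.items():
        en_attr = es_to_en.get(es_attr)
        if en_attr is None:
            continue  # no known alignment for this attribute
        if en_attr not in en_infobox:
            additions[en_attr] = value
        elif en_infobox[en_attr] != value:
            conflicts[en_attr] = (value, en_infobox[en_attr])
    return additions, conflicts


es = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
en = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}
additions, conflicts = propose_updates(es, en)
print(additions)  # {}
print(conflicts)  # {'Spouse': ('Jessica Sklar', 'Jessica Seinfeld')}
```

The name agrees across languages, while the spouse values differ, which is the kind of case where a human contributor should decide which version to keep.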

50
Summary
  • Kylin's initial performance is unacceptable
  • Methods for increasing recall
  • Shrinkage
  • Retraining
  • Extraction from the web

51
Summary
  • IWP: developing AI methods to facilitate the
    growth, operation and use of Wikipedia
  • Initial goal: extraction of a giant knowledge
    base of semantic triples
  • Faceted browsing
  • Input to a reasoning-based question-answering
    system
  • How:
  • IE
  • CCC

52
Questions?