Using the Web to Reduce Data Sparseness in Pattern-based Information Extraction

1
Using the Web to Reduce Data Sparseness
in Pattern-based Information Extraction
PKDD 2007, September 20th 2007
  • Sebastian Blohm, Philipp Cimiano
  • Universität Karlsruhe (TH), Institut AIFB
  • blohm_at_aifb.uni-karlsruhe.de

2

3
Pattern-based Extraction
  • Assumption: the way relation instances occur in text exhibits
  • Regularity: occurrences of relation instances
    have something in common
  • Redundancy: relation instances occur frequently
    and in different forms
  • Paradigms of pattern-based extraction:
  • Fixed patterns (extremely successful for
    relations like "is-a") [Hearst92]
  • Induction of patterns
  • Sources:
  • Large corpora (recently: [Espresso06])
  • Web: [DIPRE98]
  • Methods:
  • Bootstrapping-based learning [DIPRE98]
  • Bottom-up learning [LP201]
  • Vector space model [Snowball00]

"The happiest people in Germany live in Osnabrück."
"The richest people in America live in Hollywood."
Learned pattern: "The ... people in <ARG1> live in <ARG2>."
4
Problem statement
  • Assumption: the way relation instances occur in text exhibits
  • Regularity: occurrences of relation instances
    have something in common
  • Redundancy: relation instances occur frequently
    and in different forms
  • Problem: high-quality knowledge sources
    frequently lack redundancy.
  • Our goal:
  • We use the Web as "background knowledge" to help
    the extraction process bootstrap.
  • We show that our method of integrating Web
    extraction has a similar effect to strongly
    increasing the number of training samples.

5
Outline
  • General method learning relations from Wikipedia
  • Comparing Web and Wikipedia
  • Integrating the Web as background knowledge
  • Experimental evaluation
  • Conclusion

6
(No Transcript)
7
Iterative Pattern Induction
[Diagram: the induction cycle]
  • Match seed instances on the wiki, e.g.
    (Bass, America), (Minnow, St. Lawrence),
    (Haddock, North Atlantic), (Bluegill, Québec)
  • Retrieve contexts, e.g.
    "... is from the St. Lawrence and Lake Ontario south ..."
    "... of which are native to North America and ..."
    "... on both sides of the North Atlantic. Haddock is ..."
    "... rarely found outside New England and New York ..."
  • Learn patterns, e.g.
    "... is from the <ARG2> and ..."
    "... of which are native to <ARG2> and ..."
    "... on both sides of the <ARG2>. <ARG1> is ..."
    "... rarely found outside <ARG2> and ..."
  • Match patterns
  • Harvest instances, e.g.
    (Scrod, New England), (Salmon, Atlantic),
    (Hypostomus, South America), (Tuna, Japan)
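The cycle above can be sketched in a few lines. This is a toy illustration (toy corpus, naive regex-based patterns, and helper names of my own invention), not the system described in the paper:

```python
# Toy sketch of iterative pattern induction: seeds -> contexts -> patterns
# -> new instances, repeated. Illustrative only; not the authors' code.
import re

CORPUS = [
    "Haddock is rarely found outside New England.",
    "Bluegill is rarely found outside Quebec.",
    "Minnow is rarely found outside North America.",
]

def retrieve_contexts(seeds, corpus):
    """Sentences in which both arguments of a seed tuple occur."""
    return [(sent, t) for t in seeds for sent in corpus
            if t[0] in sent and t[1] in sent]

def learn_pattern(sentence, tup):
    """Generalize a context into a pattern by slotting out the arguments."""
    pat = re.escape(sentence)
    for arg in tup:
        pat = pat.replace(re.escape(arg), r"(\w[\w ]*?)")
    return pat

def match_pattern(pattern, corpus):
    """Harvest new instance tuples by matching the pattern."""
    return [m.groups() for sent in corpus
            for m in [re.fullmatch(pattern, sent)] if m]

seeds = {("Haddock", "New England")}
for _ in range(2):  # two bootstrapping iterations
    patterns = {learn_pattern(s, t) for s, t in retrieve_contexts(seeds, CORPUS)}
    for p in patterns:
        seeds.update(match_pattern(p, CORPUS))
print(sorted(seeds))
```

Starting from a single seed, the learned pattern harvests the two other instances on this toy corpus, which is exactly the bootstrapping effect the next slide shows.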
8
The Bootstrapping Effect
[Diagram: instances and patterns feed each other; both grow over iterations]
9
... but when redundancy is low
[Diagram: with low redundancy, the instance/pattern cycle stalls]
10
How often do relation instances co-occur?
  • Wikipedia
  • 1.6 million articles (Dec 2006)
  • One article on each topic.
  • Redundancy strictly avoided.
  • Median co-occurrence count: 15
  • Web
  • Billions of pages
  • No structure
  • Highly redundant
  • Median co-occurrence count: 48,000
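The redundancy gap can be made concrete by counting, for one relation instance, in how many documents both arguments co-occur. A toy sketch (the medians on this slide come from the study itself, not from this code):

```python
# Count documents in which both arguments of a relation instance co-occur
# (toy corpus and toy instance; illustrative only).
corpus = [
    "Haddock is common in the North Atlantic.",
    "The North Atlantic is home to haddock and cod.",
    "Cod live in colder waters.",
]

def cooccurrence_count(arg1, arg2, docs):
    """Number of documents mentioning both arguments (case-insensitive)."""
    return sum(1 for d in docs
               if arg1.lower() in d.lower() and arg2.lower() in d.lower())

print(cooccurrence_count("Haddock", "North Atlantic", corpus))
```

A pattern can only be induced from documents counted here, so a median of 15 such documents (Wikipedia) versus 48,000 (Web) directly limits how many contexts are available for learning.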
11
Extended Algorithm
Idea: "Imprecise knowledge can without any harm
be used to get bootstrapping going as long as
it can be eliminated at later stages." We
distinguish three modes of operation, combining
two cycles:
  • Web cycle:
  • Find seeds on the Web
  • Learn patterns
  • Match patterns on the Web
  • Keep only tuples also present on the wiki
  • Wiki cycle:
  • Find seeds in the wiki
  • Learn patterns
  • Match patterns on the wiki
  • Filter new tuples

12
"Dual" Extraction
[Diagram: both cycles (learn patterns, match tuples, match patterns) run in
every iteration; Web-matched tuples are filtered for wiki presence]
13
"Wiki Only" Extraction
[Diagram: only the wiki cycle (learn patterns, match tuples, match patterns)
runs; new tuples are filtered after each iteration]
14
"Web Once" Extraction
[Diagram: the Web cycle runs in iteration 1 only; afterwards only the wiki
cycle runs, with Web-matched tuples filtered for wiki presence]
15
The Testbed
  • Large relation extensions taken from (manually
    compiled/reviewed) lists from Wikipedia, the
    SmartWeb consortium and the CIA World Factbook.
  • albumBy: 19,852 musicians and their albums
    (occasionally other types of musical works)
  • bornInYear: 172,696 persons and their year of
    birth
  • currencyOf: 221 countries and their official
    currency
  • headquarteredIn: 14,762 companies and the country
    of their headquarters
  • locatedIn: 34,047 cities and the country they lie
    in
  • productOf: 2,650 product names and the brand names
    of their makers
  • teamOf: 8,307 sportspersons and the team or
    country they play for
  • Available for download:
    http://www.aifb.uni-karlsruhe.de/WBS/seb/datasets

16
Results: Correct Extractions over Iterations
[Chart: correct extractions for "web once", "dual" and "wiki only"
after 3, 6 and 9 iterations]
  • Wiki only initially generates more correct
    information but shows a poor development over
    further iterations.
  • Note: extractions from the Web are not counted as
    additional correct extractions.
  • Web extractions enable a more steady development
    of correct extractions.
  • "Web once" is more productive because the wiki
    matches are more precise.

17
Precision, Recall, F-Measure
[Charts: precision, recall and F-measure for seed sets of 10, 50 and 100]
  • Solely wiki-based extraction fails to bootstrap
    with a small seed set.
  • The "Web once" strategy compensates for a small
    seed set.
  • Continuous Web matching does not lead to further
    benefit (in the 9 iterations considered).
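For reference, precision, recall and F-measure here are computed against the compiled gold lists, so recall is relative to those lists. A generic sketch with toy numbers, not the paper's figures:

```python
# Precision, recall and F1 of an extracted set against a gold list
# (standard definitions; toy data for illustration).
def precision_recall_f1(extracted, gold):
    correct = len(extracted & gold)                      # true positives
    p = correct / len(extracted) if extracted else 0.0   # precision
    r = correct / len(gold) if gold else 0.0             # (relative) recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0           # harmonic mean
    return p, r, f1

extracted = {("Paris", "France"), ("Bagdad", "Kentucky"), ("Berlin", "Germany")}
gold = {("Paris", "France"), ("Berlin", "Germany"),
        ("Rome", "Italy"), ("Madrid", "Spain")}
p, r, f1 = precision_recall_f1(extracted, gold)
print(p, r, f1)
```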

18
Conclusion
  • Analyzed why it is difficult to extract
    information from Wikipedia with the classical
    bootstrapping-based approach.
  • Presented a method to extract from Wikipedia
    through supplementary extraction from the Web:
  • Use the Web to retrieve further examples.
  • Use them for pattern induction but not as target
    instances.
  • This generalizes to a method for using noisy
    background knowledge to overcome data sparseness:
  • Induce further seeds in a noisy but rich dataset.
  • Use them as seeds only, not as output.
  • In particular, it is interesting to see that the
    effect of employing the Web is similar to that
    of providing more seeds.

19
Outlook: this subject
  • Analyze the properties of the corpora to tell more
    precisely what makes the difference.
  • Add further Wikipedia features (e.g. categories).
  • Investigate results over more iterations.

Outlook: further developments
  • Improve pattern induction efficiency through
    optimized mining algorithms.
  • Richer structure in patterns (POS, semantic
    annotation etc.)
  • More automatic adaptivity of the system through
    automatic parametrisation.
  • Use negative examples where available (in
    particular for relations that are very close).

20
Thank you for your attention
  • Using the Web to Reduce Data Sparseness
    in Pattern-based Information Extraction
  • Sebastian Blohm, Philipp Cimiano
  • Universität Karlsruhe (TH), Institut AIFB
  • blohm_at_aifb.uni-karlsruhe.de
  • Knowledge?
  • located-in(dispute, resolution),
    product-of(football, college)
  • product-of(child, mapping),
    product-of(web service, xml)
  • located-in(bankruptcy, switzerland),
    located-in(bagdad, kentucky)
  • located-in(california, mississippi),
    located-in(her, in his)
  • currency-of(euro, france),
    currency-of(eurocent, portugal)

21
References
  • [Hearst92] M. A. Hearst, "Automatic acquisition
    of hyponyms from large text corpora," in
    Proceedings of the 14th Conference on
    Computational Linguistics. Morristown, NJ, USA:
    Association for Computational Linguistics, 1992,
    pp. 539-545.
  • [DIPRE98] S. Brin, "Extracting patterns and
    relations from the world wide web," in WebDB
    Workshop at the 6th International Conference on
    Extending Database Technology (EDBT'98), 1998.
  • [KnowItAll05] O. Etzioni, M. Cafarella, D.
    Downey, A.-M. Popescu, T. Shaked, S. Soderland,
    D. S. Weld, and A. Yates, "Unsupervised
    named-entity extraction from the web: an
    experimental study," Artificial Intelligence,
    vol. 165, no. 1, pp. 91-134, 2005.
  • [Snowball00] E. Agichtein and L. Gravano,
    "Snowball: extracting relations from large
    plain-text collections," in DL '00: Proceedings
    of the fifth ACM Conference on Digital Libraries.
    New York, NY, USA: ACM Press, 2000.
  • [Espresso06] M. Pennacchiotti and P. Pantel, "A
    bootstrapping algorithm for automatically
    harvesting semantic relations," in Proceedings of
    Inference in Computational Semantics (ICoS-06),
    Buxton, England.
  • [LP201] F. Ciravegna, "Adaptive information
    extraction from text by rule induction and
    generalisation," in Proceedings of the 17th
    International Joint Conference on Artificial
    Intelligence (IJCAI 2001), Seattle, August 2001.

22
TODO Keep this? Challenges
  • Most relation instances occur very sparsely.
  • Good trade-off between precision and recall is
    required.
  • Low redundancy in high-quality corpora like
    Wikipedia makes bootstrapping difficult.
  • Search-pattern based extraction allows extraction
    of infrequent instances without processing the
    entire corpus.
  • Pattern filtering allows trading precision and
    recall.
  • Pattern support is a good predictor for
    precision.
  • Bootstrapping can be boosted by using background
    knowledge.

23
TODO Determining Factors for Extraction
  • TODO use Skype Icons again (to mark what is the
    case in Wikipedia and what not)
  • TODO re-read conclusion by Uszkoreit
  • TODO re-read paper
  • Data richness
  • How often do relation instances occur in similar
    contexts?
  • Do the same relation instances re-appear in
    different contexts?
  • Types of target relations
  • Overlapping relations are hard to tell apart
    (e.g. )
  • Sub- and super-relations are hard to tell apart
    (e.g. capitalOf and locatedIn).

24
Feedback
  • We should analyze the properties of the corpora
    (e.g. replace the Web by books, a smaller subset
    of the web, ...)
  • Explain the problem as there being sets of
    different contexts that you would like to bridge.
  • Better treat the precision bars on slide 14.
  • Better explain the notion of relative recall.
  • Slide 3 is very vague.

25
Possible Optimizations to improve F-Measure
  • Apply part-of-speech and named-entity tagging
    where applicable.
  • Make use of sentence structure (e.g. "... live in
    Hollywood or Beverly Hills").
  • Use negative examples where available (in
    particular for relations that are very close).
  • Take into account knowledge about the structure
    of the target relation (e.g. a city can only be
    in one country).

26
TODO Extended Algorithm
27
When redundancy is low...
The cycle allows generating patterns from
instances and vice versa. But for a pattern to be
generated, you need at least one instance that
appears in it. Our approach: if we get some
possibly noisy set of instances, chances are
higher that we are able to generate patterns.