Using the Web to Reduce Data Sparseness in Pattern-based Information Extraction


1
Using the Web to Reduce Data Sparseness
in Pattern-based Information Extraction
PKDD 2007, September 20th 2007

Sebastian Blohm, Philipp Cimiano
Universität Karlsruhe (TH), Institut AIFB
blohm_at_aifb.uni-karlsruhe.de

2

3
Pattern-based Extraction

Assumption: The way relation instances occur in
Regularity: Occurrences of relation instances
have something in common
Redundancy: Relation instances occur frequently
and in different forms
Paradigms of pattern-based extraction
Fixed patterns (extremely successful for
relations like "is-a") [Hearst92]
Induction of patterns
Sources
Large corpora (recent: [Espresso06])
Web [DIPRE98]
Methods
Bootstrapping-based learning [DIPRE98]
Bottom-up learning [LP201]
Vector space model [Snowball00]

The happiest people in Germany live in Osnabrück.
The richest people in America live in Hollywood.
The ___ people in ___ live in ___.
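The step from the two example sentences to a shared pattern can be sketched as follows. This is only an illustration, not the induction method used in the talk: it assumes whitespace tokenization, sentences of equal token length, and a hypothetical `*` wildcard for every position where the two sentences disagree.

```python
def generalize(sentence_a: str, sentence_b: str, wildcard: str = "*") -> str:
    """Keep tokens on which both sentences agree; replace the rest by a wildcard."""
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()
    if len(tokens_a) != len(tokens_b):
        # Aligning sentences of different length is out of scope for this sketch.
        raise ValueError("this sketch only handles sentences of equal token length")
    return " ".join(a if a == b else wildcard for a, b in zip(tokens_a, tokens_b))


pattern = generalize(
    "The happiest people in Germany live in Osnabrück .",
    "The richest people in America live in Hollywood .",
)
print(pattern)  # The * people in * live in * .
```

A real system would additionally have to decide which wildcard positions are argument slots of the relation and which are free text.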
4
Problem statement

Assumption: The way relation instances occur in
Regularity: Occurrences of relation instances
have something in common
Redundancy: Relation instances occur frequently
and in different forms
Problem: High-quality knowledge sources
frequently lack redundancy.
Our Goal
We use the Web as "background knowledge" to help
the extraction process bootstrap.
We show that our method of integrating Web
extraction has an effect similar to strongly
increasing the number of training samples.

5
Outline

General method learning relations from Wikipedia
Comparing Web and Wikipedia
Integrating the Web as background knowledge
Experimental evaluation
Conclusion

6
(No Transcript)
7
Iterative Pattern Induction
Instances
(Bass, America), (Minnow, St. Lawrence), (Haddock, North Atlantic), (Bluegill, Québec)
match on wiki
retrieve contexts
... is from the St. Lawrence and Lake Ontario south ...
... of which are native to North America and ...
... on both sides of the North Atlantic. Haddock is ...
... rarely found outside New England and New York...
learn patterns
... is from the ___ and ___ south ...
... of which are native to ___ and ...
... on both sides of the ___. ___ is ...
... rarely found outside ___ and ___ ...
match patterns
harvest instances
(Scrod, New England), (Salmon, Atlantic), (Hypostomus, South America), (Tuna, Japan)
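The cycle on this slide (match instances, retrieve contexts, learn patterns, match patterns, harvest instances) can be sketched in a few lines. This is a toy sketch under strong assumptions, not the authors' system: a whole sentence serves as context, a pattern is that sentence with the two arguments replaced by the hypothetical placeholders `<X>` and `<Y>`, and no pattern filtering is done. The corpus sentences are invented for illustration.

```python
import re


def learn_patterns(sentences, instances):
    """Turn each sentence that mentions both arguments of a known instance into a template."""
    patterns = set()
    for sentence in sentences:
        for x, y in instances:
            if x in sentence and y in sentence:
                template = sentence.replace(x, "<X>", 1).replace(y, "<Y>", 1)
                if "<X>" in template and "<Y>" in template:
                    patterns.add(template)
    return patterns


def match_patterns(sentences, patterns):
    """Apply every template to every sentence and collect the slot fillers."""
    found = set()
    for template in patterns:
        regex = re.escape(template)
        regex = regex.replace(re.escape("<X>"), "(?P<x>.+?)")
        regex = regex.replace(re.escape("<Y>"), "(?P<y>.+?)")
        for sentence in sentences:
            hit = re.fullmatch(regex, sentence)
            if hit:
                found.add((hit.group("x"), hit.group("y")))
    return found


def bootstrap(sentences, seeds, iterations=5):
    """Alternate pattern learning and pattern matching until nothing new is found."""
    instances = set(seeds)
    for _ in range(iterations):
        harvested = match_patterns(sentences, learn_patterns(sentences, instances))
        if harvested <= instances:
            break
        instances |= harvested
    return instances


# Invented toy corpus; only the seed pair is taken from the slide.
corpus = [
    "Haddock is native to North Atlantic waters.",
    "Salmon is native to Atlantic waters.",
    "Salmon is rarely found outside Atlantic waters.",
    "Scrod is rarely found outside New England waters.",
]
result = bootstrap(corpus, {("Haddock", "North Atlantic")})
print(sorted(result))
```

On this toy corpus the first round picks up (Salmon, Atlantic) through the "is native to" pattern, and the second round uses the new instance to learn the "is rarely found outside" pattern and harvest (Scrod, New England).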
8
The Bootstrapping Effect
Instances
Patterns
9
... but when redundancy is low
Instances
Patterns
10
How often do relation instances co-occur?

Wikipedia
1.6 million articles (Dec 2006)
One article on each topic.
Redundancy strictly avoided.
Web
Billions of pages
No structure
Highly redundant

Median number of co-occurrences: 15 (Wikipedia), 48000 (Web)
11
Extended Algorithm
Idea: "Imprecise knowledge can without any harm
be used to get bootstrapping going as long as
it can be eliminated at later stages."
We distinguish three modes of operation.

Find seeds on the Web
Learn patterns
Match patterns on Web
Keep only tuples also present on the Wiki

Find seeds in wiki
Learn patterns
Match patterns on wiki
Filter new tuples
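One way to read the three modes named on the following slides ("dual", "wiki only", "web once") is sketched below. The control flow is a guess from the slide labels, not the authors' implementation: patterns are whole-sentence templates with hypothetical `<X>`/`<Y>` placeholders, "present on the Wiki" is approximated by both arguments co-occurring in one wiki sentence, and the final "filter new tuples" step is omitted. All sentences are invented toy data.

```python
import re


def learn(sentences, tuples):
    """Sentences containing both arguments of a tuple become templates with two slots."""
    return {s.replace(x, "<X>", 1).replace(y, "<Y>", 1)
            for s in sentences for x, y in tuples if x in s and y in s}


def match(sentences, patterns):
    """Collect the slot fillers of every template on every sentence."""
    found = set()
    for template in patterns:
        regex = re.escape(template)
        regex = regex.replace(re.escape("<X>"), "(?P<x>.+?)")
        regex = regex.replace(re.escape("<Y>"), "(?P<y>.+?)")
        for sentence in sentences:
            hit = re.fullmatch(regex, sentence)
            if hit:
                found.add((hit.group("x"), hit.group("y")))
    return found


def extract(wiki, web, seeds, mode="web once", iterations=3):
    """mode: 'wiki only', 'web once' (Web in iteration 1 only) or 'dual' (Web every time)."""
    output = set(seeds)   # tuples reported as extractions; Web-only finds never end up here
    support = set(seeds)  # tuples used for pattern learning; may include Web finds
    for i in range(iterations):
        if mode == "dual" or (mode == "web once" and i == 0):
            web_tuples = match(web, learn(web, support))
            # Keep only Web tuples whose arguments also co-occur in the wiki.
            support |= {(x, y) for x, y in web_tuples
                        if any(x in s and y in s for s in wiki)}
        wiki_tuples = match(wiki, learn(wiki, support))
        output |= wiki_tuples
        support |= wiki_tuples
    return output


wiki = ["Osnabrück lies in Germany.",
        "The city of Karlsruhe is situated in Germany.",
        "The city of Lyon is situated in France."]
web = ["Osnabrück lies in Germany.",
       "Karlsruhe lies in Germany.",
       "The truth lies in wine."]
seeds = {("Osnabrück", "Germany")}

for mode in ("wiki only", "web once", "dual"):
    print(mode, sorted(extract(wiki, web, seeds, mode)))
```

In this toy setting the wiki alone never gets past the seed, because the seed's only context occurs once. The Web contributes (Karlsruhe, Germany), which is confirmed by the wiki and unlocks the "is situated in" pattern there, while the noisy Web tuple from "The truth lies in wine." is discarded.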

12
"Dual" Extraction
Learn Patterns
Match Tuples
Match Patterns
Match Patterns
Filter for Wiki presence
Learn Patterns
Match Tuples
13
"Wiki Only" Extraction
Learn Patterns
Match Tuples
Match Patterns
Match Patterns
Filter for Wiki presence
Learn Patterns
Match Tuples
14
"Web Once" Extraction
Learn Patterns
Match Tuples
Match Patterns
Iteration 1 only
Match Patterns
Filter for Wiki presence
Learn Patterns
Match Tuples
15
The Testbed

Large relation extensions taken from (manually
compiled/reviewed) lists from Wikipedia, the
SmartWeb consortium and the CIA World Factbook.
albumBy: 19852 musicians and their albums
(occasionally other types of musical works)
bornInYear: 172696 persons and their year of
birth
currencyOf: 221 countries and their official
currency
headquarteredIn: 14762 companies and the country
of their headquarters
locatedIn: 34047 cities and the country they lie
in
productOf: 2650 product names and the brand names
of their makers
teamOf: 8307 sportspersons and the team or
country they are playing for
Available for download: http://www.aifb.uni-karlsruhe.de/WBS/seb/datasets

16
Results Correct Extractions over Iterations
[Figure: number of correct extractions (axis from 0 to 900) after 3, 6 and 9 iterations for "web once", "dual" and "wiki only"]

"Wiki only" initially generates more correct
information but develops poorly over
further iterations.
Note: Extractions from the Web are not counted as
additional correct extractions.
Web extractions enable a steadier development
of correct extractions.
"Web once" is more productive because the wiki
matches are more precise.

17
Precision, Recall, F-Measure
[Figure: precision, recall and F-measure for 10, 50 and 100 seeds]

Solely wiki-based extraction fails to bootstrap
with a small seed set.
The "Web once" strategy compensates for a small seed
set.
Continuous Web matching does not lead to further
benefit (in the 9 iterations considered).
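For reference, the three measures on this slide can be computed as below. This is the standard set-based definition against a gold relation extension; the transcript does not spell out how the talk's "relative recall" is defined, so plain recall is used here as an assumption. The tuples are invented toy data.

```python
def precision_recall_f1(extracted: set, gold: set):
    """Standard precision, recall and F1 of an extracted tuple set against a gold set."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


extracted = {("Osnabrück", "Germany"), ("Hollywood", "America"),
             ("Karlsruhe", "Germany"), ("Bagdad", "Kentucky")}
gold = {("Osnabrück", "Germany"), ("Hollywood", "America"),
        ("Karlsruhe", "Germany"), ("Lyon", "France"), ("Tokyo", "Japan")}

p, r, f = precision_recall_f1(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```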

18
Conclusion

Analyzed why it is difficult to extract
information from Wikipedia with the classical
bootstrapping-based approach.
Presented a method to extract from Wikipedia
through supplementary extraction from the Web.
Use the Web to retrieve further examples.
Use them for pattern induction but not as target
instances.
This generalizes to a method for using noisy
background knowledge to overcome data sparseness.
Induce further seeds in a noisy but rich dataset.
Use them as seeds only, not as output.
In particular, it is interesting to see that the
effect of employing the Web is similar to that
of providing more seeds.

19
Outlook: this subject

Analyze the properties of the corpora to tell more
precisely what makes the difference.
Add further Wikipedia features (e.g. categories).
Investigate results over more iterations.

Outlook: further developments

Improve pattern induction efficiency through
optimized mining algorithms.
Richer structure in patterns (POS, semantic
annotation etc.)
More automatic adaptivity of the system through
automatic parametrisation.
Use negative examples where available (in
particular relations that are very close).

20
Thank you for your attention

Knowledge?
located-in(dispute,resolution)
product-of(football,college)
product-of(child, mapping)
product-of(web service, xml)
located-in(bankruptcy, switzerland)
located-in(bagdad, kentucky)
located-in(california, mississippi)
located-in(her, in his)
currency-of(euro,france), currency-of(eurocent,portugal)

21
References

[Hearst92] M. A. Hearst, "Automatic acquisition of hyponyms from large text corpora," in Proceedings of the 14th Conference on Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics, 1992, pp. 539-545.
[DIPRE98] S. Brin, "Extracting patterns and relations from the world wide web," in WebDB Workshop at 6th International Conference on Extending Database Technology, EDBT'98, 1998.
[KnowItAll05] O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, "Unsupervised named-entity extraction from the web: an experimental study," Artificial Intelligence, vol. 165, no. 1, pp. 91-134, 2005.
[Snowball00] E. Agichtein and L. Gravano, "Snowball: extracting relations from large plain-text collections," in DL '00: Proceedings of the fifth ACM conference on Digital libraries. New York, NY, USA: ACM Press, 2000.
[Espresso06] M. Pennacchiotti and P. Pantel, "A bootstrapping algorithm for automatically harvesting semantic relations," in Proceedings of Inference in Computational Semantics (ICoS-06), Buxton, England.
[LP201] F. Ciravegna, "Adaptive information extraction from text by rule induction and generalisation," in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, August 2001.

22
TODO Keep this? Challenges

Most relation instances occur very sparsely.
Good trade-off between precision and recall is
required.
Low redundancy in high-quality corpora like
Wikipedia makes bootstrapping difficult.
Search-pattern based extraction allows extraction
of infrequent instances without processing the
entire corpus.
Pattern filtering allows trading precision and
recall.
Pattern support is a good predictor for
precision.
Bootstrapping can be boosted by using background
knowledge.
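The remark that pattern support predicts precision suggests a simple filter, sketched below. The definition of support used here (the number of distinct known instances whose contexts produced a pattern) and the threshold `min_support` are assumptions for illustration; the transcript does not give the talk's exact definition.

```python
from collections import defaultdict


def filter_by_support(pattern_occurrences, min_support=2):
    """Keep patterns that were generated from at least `min_support` distinct instances."""
    supporters = defaultdict(set)
    for pattern, instance in pattern_occurrences:
        supporters[pattern].add(instance)
    return {p for p, instances in supporters.items() if len(instances) >= min_support}


# Invented (pattern, generating instance) pairs.
occurrences = [
    ("<X> lies in <Y>.", ("Osnabrück", "Germany")),
    ("<X> lies in <Y>.", ("Karlsruhe", "Germany")),
    ("<X> , <Y>", ("Osnabrück", "Germany")),
]
kept = filter_by_support(occurrences, min_support=2)
print(kept)
```

Raising `min_support` trades recall for precision: fewer patterns survive, but each is backed by more evidence.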

Results
23
TODO Determining Factors for Extraction

TODO use Skype Icons again (to mark what is the
case in Wikipedia and what not)
TODO re-read conclusion by Uszkoreit
TODO re-read paper
Data richness
How often do relation instances occur in similar
contexts?
Do the same relation instances re-appear in
different contexts?
Types of Target Relations
Overlapping relations are hard to tell apart
(e.g. )
Sub- and super-relations are hard to tell apart.
(e.g. capitalOf and locatedIn)

24
Feedback

We should analyze the properties of the corpora
(e.g. replace Web by books, smaller subset of the
web...)
Explain the problem as there being sets of
different contexts that you would like to bridge
Better treat the precision bars on slide 14
Better explain the notion of relative recall
Slide 3 is very vague

25
Possible Optimizations to improve F-Measure

Apply Part-of-Speech and Named-Entity tagging
where applicable.
Make use of sentence structure (e.g. "... live in
Hollywood or Beverly Hills")
Use negative examples where available (in
particular relations that are very close).
Take into account knowledge about the structure
of the target relation (e.g. a city can only be
in one country).
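The last point (a city can only be in one country) can be turned into a post-filter, sketched here under the assumption that candidate tuples arrive with repetitions, so that the most frequently extracted country per city can be kept. The voting scheme is an illustrative choice, not something stated on the slides, and the candidates are invented.

```python
from collections import Counter, defaultdict


def enforce_functional(candidates):
    """For a functional relation, keep only the most frequently extracted value per key."""
    votes = defaultdict(Counter)
    for key, value in candidates:
        votes[key][value] += 1
    return {(key, counts.most_common(1)[0][0]) for key, counts in votes.items()}


candidates = [
    ("Karlsruhe", "Germany"),
    ("Karlsruhe", "Germany"),
    ("Karlsruhe", "France"),
    ("Lyon", "France"),
]
located_in = enforce_functional(candidates)
print(sorted(located_in))
```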

26
TODO Extended Algorithm
27
When redundancy is low...
The cycle allows generating patterns from
instances
and vice versa.
But for a pattern to be generated, you need at
least one instance that appears in it.
Our approach: If we get a possibly noisy set
of instances, chances are higher that we can
generate patterns.
