Title: Using the Web to Reduce Data Sparseness in Pattern-based Information Extraction
1 Using the Web to Reduce Data Sparseness in Pattern-based Information Extraction
PKDD 2007, September 20th 2007
- Sebastian Blohm, Philipp Cimiano
- Universität Karlsruhe (TH), Institut AIFB
- blohm_at_aifb.uni-karlsruhe.de
3 Pattern-based Extraction
- Assumption: The way relation instances occur in
- Regularity: Occurrences of relation instances have something in common
- Redundancy: Relation instances occur frequently and in different forms
- Paradigms of pattern-based extraction
- Fixed patterns (extremely successful for relations like "is-a") [Hearst92]
- Induction of patterns
- Sources
- Large corpora (recent: [Espresso06])
- Web [DIPRE98]
- Methods
- Bootstrapping-based learning [DIPRE98]
- Bottom-up learning [LP201]
- Vector space model [Snowball00]
The happiest people in Germany live in Osnabrück.
The richest people in America live in Hollywood.
The ___ people in ___ live in ___.
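The step from the two example sentences to a shared pattern can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the function names, the `<ARG1>`/`<ARG2>` slot markers, the `*` wildcard, and the (city, country) argument order are all assumptions, and arguments are limited to single tokens.

```python
def to_pattern(sentence, instance):
    """Replace the two arguments of a relation instance by slot markers.

    Assumption: both arguments are single whitespace-delimited tokens.
    """
    first, second = instance
    tokens = sentence.rstrip(".").split()
    return ["<ARG1>" if t == first else "<ARG2>" if t == second else t
            for t in tokens]


def generalize(p, q):
    """Merge two equally long token lists; differing tokens become '*'."""
    if len(p) != len(q):
        return None  # this simple sketch only aligns same-length contexts
    return [a if a == b else "*" for a, b in zip(p, q)]


s1 = "The happiest people in Germany live in Osnabrück."
s2 = "The richest people in America live in Hollywood."

# Hypothetical instances, written as (city, country).
p1 = to_pattern(s1, ("Osnabrück", "Germany"))
p2 = to_pattern(s2, ("Hollywood", "America"))

pattern = generalize(p1, p2)
print(" ".join(pattern))
```

The two occurrences differ only in the adjective and in the argument slots, so the merged pattern keeps the shared words and abstracts over the rest.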
4 Problem statement
- Assumption: The way relation instances occur in
- Regularity: Occurrences of relation instances have something in common
- Redundancy: Relation instances occur frequently and in different forms
- Problem: High-quality knowledge sources frequently lack redundancy.
- Our goal
- We use the Web as "background knowledge" to help the extraction process bootstrap.
- We show that our method of integrating Web extraction has a similar effect to strongly increasing the number of training samples.
5 Outline
- General method: learning relations from Wikipedia
- Comparing Web and Wikipedia
- Integrating the Web as background knowledge
- Experimental evaluation
- Conclusion
7 Iterative Pattern Induction
Known instances:
(Bass, America), (Minnow, St. Lawrence), (Haddock, North Atlantic), (Bluegill, Québec)
Match on wiki, retrieve contexts:
... is from the St. Lawrence and Lake Ontario south ...
... of which are native to North America and ...
... on both sides of the North Atlantic. Haddock is ...
... rarely found outside New England and New York ...
Learn patterns:
... is from the ___ and south ...
... of which are native to ___ and ...
... on both sides of the ___. ___ is ...
... rarely found outside ___ and ...
Match patterns, harvest instances:
(Scrod, New England), (Salmon, Atlantic), (Hypostomus, South America), (Tuna, Japan)
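The cycle on this slide (retrieve contexts, learn patterns, match patterns, harvest instances) can be sketched as a small loop. The following is a toy illustration under stated assumptions, not the authors' system: the corpus sentences are made up, a pattern is reduced to the text between the two arguments, an argument is assumed to be a run of capitalized words, and no pattern filtering is applied.

```python
import re

# Toy corpus, made up for illustration; a real run would query Wikipedia.
CORPUS = [
    "Bass is native to North America.",
    "Hypostomus is native to South America.",
    "Hypostomus is rarely found outside South America.",
    "Scrod is rarely found outside New England.",
]

# Assumption: an argument is a run of capitalized words.
ARG = r"([A-Z]\w+(?: [A-Z]\w+)*)"


def learn_patterns(corpus, instances):
    """Collect the text between the two arguments of each known instance."""
    patterns = set()
    for x, y in instances:
        for sentence in corpus:
            m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns


def match_patterns(corpus, patterns):
    """Apply each pattern to the corpus and harvest the argument pairs."""
    found = set()
    for middle in patterns:
        regex = re.compile(ARG + re.escape(middle) + ARG)
        for sentence in corpus:
            found.update(m.groups() for m in regex.finditer(sentence))
    return found


def bootstrap(corpus, seeds, iterations=3):
    """Alternate between pattern learning and instance harvesting."""
    instances = set(seeds)
    for _ in range(iterations):
        patterns = learn_patterns(corpus, instances)
        instances |= match_patterns(corpus, patterns)
    return instances


result = bootstrap(CORPUS, {("Bass", "North America")})
print(sorted(result))
```

With a single iteration only the Hypostomus tuple is added. The Scrod tuple needs a second round, because its pattern can only be learned once Hypostomus is known. That dependency is the bootstrapping effect the following slides discuss.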
8 The Bootstrapping Effect
(Figure: instances and patterns)
9 ... but when redundancy is low
(Figure: instances and patterns)
10 How often do relation instances co-occur?
- Wikipedia
- 1.6 million articles (Dec 2006)
- One article on each topic.
- Redundancy strictly avoided.
- Median: 15
- Web
- Billions of pages
- No structure
- Highly redundant
- Median: 48000
11 Extended Algorithm
Idea: "Imprecise knowledge can without any harm be used to get bootstrapping going as long as it can be eliminated at later stages."
We distinguish three modes of operation:
Dual (Web stage and wiki stage in every iteration)
- Find seeds on the Web
- Learn patterns
- Match patterns on Web
- Keep only tuples also present on the wiki
- Find seeds in wiki
- Learn patterns
- Match patterns on wiki
- Filter new tuples
Web once (Web stage in iteration 1 only)
- Find seeds on the Web
- Learn patterns
- Match patterns on Web
- Keep only tuples also present on the wiki
- Find seeds in wiki
- Learn patterns
- Match patterns on wiki
- Filter new tuples
Wiki only
- Find seeds in wiki
- Learn patterns
- Match patterns on wiki
- Filter new tuples
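The three modes of operation can be sketched as one function with a switch for the Web stage. This is a rough sketch under several assumptions: all names are hypothetical, both corpora are made-up sentence lists, "present on the wiki" is approximated by a substring test, a pattern is just the text between two arguments, and the final "filter new tuples" step is left out.

```python
import re

ARG = r"([A-Z]\w+(?: [A-Z]\w+)*)"  # assumption: arguments are capitalized


def learn(corpus, instances):
    """Patterns = text between the two arguments of a known instance."""
    return {m.group(1) for x, y in instances for s in corpus
            for m in [re.search(re.escape(x) + "(.+?)" + re.escape(y), s)]
            if m}


def match(corpus, patterns):
    """Argument pairs found by applying the patterns to the corpus."""
    return {m.groups() for p in patterns for s in corpus
            for m in re.finditer(ARG + re.escape(p) + ARG, s)}


def present_in(corpus, instance):
    """Crude presence test: both arguments occur in one sentence."""
    x, y = instance
    return any(x in s and y in s for s in corpus)


def extract(mode, seeds, wiki, web, iterations=3):
    """mode is 'dual', 'web_once' or 'wiki_only' (hypothetical labels)."""
    known = set(seeds)
    for i in range(iterations):
        use_web = mode == "dual" or (mode == "web_once" and i == 0)
        pool = set(known)
        if use_web:
            web_tuples = match(web, learn(web, known))
            # Web tuples only serve as extra seeds; they are never output.
            pool |= {t for t in web_tuples if present_in(wiki, t)}
        known |= match(wiki, learn(wiki, pool))
    return known - set(seeds)


# Made-up corpora: the wiki states each fact once, the Web repeats phrasing.
WIKI = ["Bass is native to North America.",
        "Haddock is caught off New England.",
        "Tuna is caught off Japan."]
WEB = ["Bass is native to North America.",
       "Haddock is native to New England."]
SEEDS = {("Bass", "North America")}

for mode in ("wiki_only", "web_once", "dual"):
    print(mode, sorted(extract(mode, SEEDS, WIKI, WEB)))
```

In this toy setting the wiki-only run finds nothing new, because no second wiki sentence shares the seed's context. The Web supplies one extra tuple that is also present on the wiki, and that tuple unlocks a wiki pattern.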
12 "Dual" Extraction
(Diagram: Match Tuples, Learn Patterns, Match Patterns, Filter for Wiki presence)
13 "Wiki Only" Extraction
(Diagram: Match Tuples, Learn Patterns, Match Patterns, Filter for Wiki presence)
14 "Web Once" Extraction
(Diagram: Match Tuples, Learn Patterns, Match Patterns, Filter for Wiki presence; annotated "Iteration 1 only")
15 The Testbed
- Large relation extensions taken from (manually compiled/reviewed) lists from Wikipedia, the SmartWeb consortium and the CIA World Factbook.
- albumBy: 19852 musicians and their albums (occasionally other types of musical works)
- bornInYear: 172696 persons and their year of birth
- currencyOf: 221 countries and their official currency
- headquarteredIn: 14762 companies and the country of their headquarters
- locatedIn: 34047 cities and the country they lie in
- productOf: 2650 product names and the brand names of their makers
- teamOf: 8307 sportspersons and the team or country they are playing for
- Available for download: http://www.aifb.uni-karlsruhe.de/WBS/seb/datasets
16 Results: Correct Extractions over Iterations
(Figure: number of correct extractions, axis from 0 to 900, for "web once", "dual" and "wiki only" after 3, 6 and 9 iterations)
- Wiki only initially generates more correct information but develops poorly over further iterations.
- Note: Extractions from the Web are not counted as additional correct extractions.
- Web extractions enable a steadier development of correct extractions.
- "Web once" is more productive because the wiki matches are more precise.
17 Precision, Recall, F-Measure
(Figure: precision, recall and F-measure for 10, 50 and 100 seeds)
- Solely wiki-based extraction fails to bootstrap with a small seed set.
- The "Web once" strategy compensates for a small seed set.
- Continuous Web matching does not lead to further benefit (in the 9 iterations considered).
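For reference, the three measures named on this slide can be computed as follows. This is a generic sketch using the textbook definitions against a gold list. The deck speaks of relative recall elsewhere without defining it, so the recall here is an assumption about the evaluation, and the tuples are invented.

```python
def evaluate(extracted, gold):
    """Return (precision, recall, F1) of extracted tuples against gold."""
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: 4 extractions, 3 of them in a gold list of 6 tuples.
gold = {("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5), ("f", 6)}
extracted = {("a", 1), ("b", 2), ("c", 3), ("x", 9)}

precision, recall, f1 = evaluate(extracted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```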
18 Conclusion
- Analyzed why it is difficult to extract information from Wikipedia with the classical bootstrapping-based approach.
- Presented a method to extract from Wikipedia through supplementary extraction from the Web.
- Use the Web to retrieve further examples.
- Use them for pattern induction but not as target instances.
- This generalizes to a method for using noisy background knowledge to overcome data sparseness.
- Induce further seeds in a noisy but rich dataset.
- Use them as seeds only, not as output.
- In particular, it is interesting to see that the effect of employing the Web is similar to that of providing more seeds.
19 Outlook: this subject
- Analyze the properties of the corpora to tell more precisely what makes the difference.
- Add further Wikipedia features (e.g. categories).
- Investigate results over more iterations.
Outlook: further developments
- Improve pattern induction efficiency through optimized mining algorithms.
- Richer structure in patterns (POS, semantic annotation etc.)
- More automatic adaptivity of the system through automatic parametrisation.
- Use negative examples where available (in particular relations that are very close).
20 Thank you for your attention
- Knowledge?
- located-in(dispute, resolution)
- product-of(football, college)
- product-of(child, mapping)
- product-of(web service, xml)
- located-in(bankruptcy, switzerland)
- located-in(bagdad, kentucky)
- located-in(california, mississippi)
- located-in(her, in his)
- currency-of(euro, france), currency-of(eurocent, portugal)
21 References
- [Hearst92] M. A. Hearst, "Automatic acquisition of hyponyms from large text corpora," in Proceedings of the 14th conference on Computational linguistics. Morristown, NJ, USA: Association for Computational Linguistics, 1992, pp. 539-545.
- [DIPRE98] S. Brin, "Extracting patterns and relations from the world wide web," in WebDB Workshop at 6th International Conference on Extending Database Technology, EDBT'98, 1998.
- [KnowItAll05] O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, "Unsupervised named-entity extraction from the web: an experimental study," Artif. Intell., vol. 165, no. 1, pp. 91-134, 2005.
- [Snowball00] E. Agichtein and L. Gravano, "Snowball: extracting relations from large plain-text collections," in DL '00: Proceedings of the fifth ACM conference on Digital libraries. New York, NY, USA: ACM Press, 2000.
- [Espresso06] M. Pennacchiotti and P. Pantel, "A bootstrapping algorithm for automatically harvesting semantic relations," in Proceedings of Inference in Computational Semantics (ICoS-06), Buxton, England.
- [LP201] F. Ciravegna, "Adaptive Information Extraction from Text by Rule Induction and Generalisation," in Proceedings of 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, August 2001.
22 TODO Keep this?
Challenges
- Most relation instances occur very sparsely.
- A good trade-off between precision and recall is required.
- Low redundancy in high-quality corpora like Wikipedia makes bootstrapping difficult.
Results
- Search-pattern based extraction allows extraction of infrequent instances without processing the entire corpus.
- Pattern filtering allows trading precision and recall.
- Pattern support is a good predictor for precision.
- Bootstrapping can be boosted by using background knowledge.
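The points on pattern filtering and pattern support can be illustrated with a small sketch. The slide does not define support, so the definition used here (the number of distinct known instances a pattern was induced from) is an assumption, as are the function name, the threshold parameter and the sample data.

```python
from collections import defaultdict


def filter_by_support(pattern_sources, min_support):
    """Keep patterns induced from at least min_support distinct instances.

    pattern_sources: iterable of (pattern, instance) pairs, one pair for
    each context in which a known instance produced the pattern.
    """
    support = defaultdict(set)
    for pattern, instance in pattern_sources:
        support[pattern].add(instance)
    return {p for p, instances in support.items()
            if len(instances) >= min_support}


# Invented observations: which instance gave rise to which pattern.
observations = [
    ("<ARG1> is native to <ARG2>", ("Bass", "North America")),
    ("<ARG1> is native to <ARG2>", ("Hypostomus", "South America")),
    ("<ARG1> and <ARG2>", ("Bass", "North America")),
]

strict = filter_by_support(observations, min_support=2)
lenient = filter_by_support(observations, min_support=1)
print(sorted(strict))
print(sorted(lenient))
```

Raising the threshold discards patterns seen with a single instance, which tends to favour precision, while a low threshold keeps more patterns and favours recall.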
23 TODO Determining Factors for Extraction
- TODO use Skype icons again (to mark what is the case in Wikipedia and what not)
- TODO re-read conclusion by Uszkoreit
- TODO re-read paper
- Data richness
- How often do relation instances occur in similar contexts?
- Do the same relation instances re-appear in different contexts?
- Types of target relations
- Overlapping relations are hard to tell apart (e.g. )
- Sub- and super-relations are hard to tell apart (e.g. capitalOf and locatedIn).
24 Feedback
- We should analyze the properties of the corpora (e.g. replace Web by books, smaller subset of the Web...)
- Explain the problem as there being a set of different contexts that you would like to bridge
- Better treat the precision bars on slide 14
- Better explain the notion of relative recall
- Slide 3 is very vague
25 Possible Optimizations to improve F-Measure
- Apply part-of-speech and named-entity tagging where applicable.
- Make use of sentence structure (e.g. "... live in Hollywood or Beverly Hills")
- Use negative examples where available (in particular relations that are very close).
- Take into account knowledge about the structure of the target relation (e.g. a city can only be in one country).
26 TODO Extended Algorithm
27 When redundancy is low...
The cycle allows generating patterns from instances and vice versa.
But for a pattern to be generated, you need at least one instance that appears in it.
Our approach: if we get some possibly noisy set of instances, chances are higher that patterns can be generated.