Title: Querying Text Databases for Efficient Information Extraction
Slide 1: Querying Text Databases for Efficient Information Extraction
- Eugene Agichtein, Luis Gravano
- Columbia University
Slide 2: Extracting Structured Information Buried in Text Documents

Example passages:
- "Microsoft's central headquarters in Redmond is home to almost every product group and division."
- "Brent Barlow, 27, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for 'thinking a little too different.'"
- "Apple's programmers 'think different' on a 'campus' in Cupertino, Cal. Nike employees 'just do it' at what the company refers to as its 'World Campus,' near Portland, Ore."

Extracted tuples:
Organization    | Location
Microsoft       | Redmond
Apple Computer  | Cupertino
Nike            | Portland
Slide 3: Information Extraction Applications

- Over a corporation's customer report or email complaint database: enabling sophisticated querying and analysis
- Over biomedical literature: identifying drug/condition interactions
- Over newspaper archives: tracking disease outbreaks, terrorist attacks (intelligence)
- Significant progress over the last decade (MUC)
Slide 4: Information Extraction Example: Organizations' Headquarters

Input Documents -> Named-Entity Tagging -> Pattern Matching -> Output Tuples
Slide 5: Goal: Extract All Tuples of a Relation from a Document Database

Information Extraction System -> Extracted Tuples

- One approach: feed every document to the information extraction system
- Problem: efficiency!
Slide 6: Information Extraction is Expensive

- Efficiency is a problem even after training the information extraction system
- Example: NYU's Proteus extraction system takes around 9 seconds per document
  - Over 15 days to process 135,000 news articles
- Filtering before further processing a document might help
- Can't afford to scan the web to process each page!
- Hidden-Web databases don't allow crawling
Slide 7: Information Extraction Without Processing All Documents

- Observation: Often only a small fraction of the database is relevant for an extraction task
- Our approach: Exploit the database search engine to retrieve and process only promising documents
Slide 8: Architecture of our QXtract System

User-Provided Seed Tuples (e.g., (Microsoft, Redmond), (Apple, Cupertino))
  -> Query Generation -> Queries -> Promising Documents
  -> Information Extraction
  -> Extracted Relation: (Microsoft, Redmond), (Apple, Cupertino), (Exxon, Irving), (IBM, Armonk), (Intel, Santa Clara)

Key problem: Learn queries to retrieve promising documents.
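A minimal Python sketch of this retrieve-then-extract loop follows; the search engine and the information extraction system are treated as black boxes, and the names and interfaces (qxtract, search_engine.search, extraction_system.extract, max_docs) are illustrative assumptions rather than QXtract's actual API. Query learning itself is sketched after Slides 9-13.

    def qxtract(queries, search_engine, extraction_system, max_docs):
        """Issue the learned queries, collect up to max_docs promising documents,
        and run the expensive extraction system only on those documents."""
        promising, seen = [], set()
        for q in queries:
            for doc in search_engine.search(q):   # assumed: returns a ranked list of documents
                if doc not in seen:
                    seen.add(doc)
                    promising.append(doc)
                if len(promising) >= max_docs:
                    break
            if len(promising) >= max_docs:
                break
        relation = set()
        for doc in promising:
            relation.update(extraction_system.extract(doc))   # assumed: returns a set of tuples
        return relation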
Slide 9: Generating Queries to Retrieve Promising Documents

Pipeline: User-Provided Seed Tuples -> Seed Sampling -> Information Extraction -> Classifier Training -> Query Generation -> Queries

- Get a document sample with likely negative and likely positive examples.
- Label the sample documents using the information extraction system as an oracle.
- Train classifiers to recognize useful documents.
- Generate queries from the classifier models/rules.
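As a rough sketch, these four steps can be composed end to end using the helper functions sketched after the next four slides; all names and interfaces here are ours, not QXtract's.

    def generate_queries(seed_tuples, search_engine, extraction_system, vocabulary):
        """Compose the four query-generation steps of Slide 9 (a sketch)."""
        sample = get_training_sample(seed_tuples, search_engine, vocabulary)   # Slide 10
        positives, negatives = label_sample(sample, extraction_system)         # Slide 11
        _, weights = train_word_classifier(positives, negatives)               # Slide 12
        return queries_from_weights(weights)                                   # Slide 13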
Slide 10: Getting a Training Document Sample (Seed Sampling)

User-Provided Seed Tuples yield queries such as "Microsoft AND Redmond" and "Apple AND Cupertino"; together with Random Queries, these retrieve a document sample with likely negative and likely positive examples.
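A possible sketch of this sampling step, assuming a search interface that returns a ranked list of documents; the helper name, the vocabulary argument, and the cutoffs are illustrative assumptions.

    import random

    def get_training_sample(seed_tuples, search_engine, vocabulary,
                            num_random=10, docs_per_query=20):
        """Conjunctions of a seed tuple's attributes (e.g., "Microsoft AND Redmond")
        tend to retrieve likely-positive documents; a few random single-word
        queries retrieve likely-negative ones."""
        queries = [" AND ".join(attrs) for attrs in seed_tuples]
        queries += random.sample(list(vocabulary), num_random)
        sample = []
        for q in queries:
            sample.extend(search_engine.search(q)[:docs_per_query])
        return sample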
Slide 11: Labeling the Training Document Sample

Use the information extraction system as an oracle to label the sample documents as true positives or true negatives; documents yielding tuples such as (Microsoft, Redmond), (Apple, Cupertino), and (IBM, Armonk) are positive examples.
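A sketch of this labeling step, assuming the same extraction_system.extract(doc) interface as before: documents that yield at least one tuple are treated as positive examples, the rest as negative.

    def label_sample(sample_docs, extraction_system):
        """Use the information extraction system as an oracle over the sample."""
        positives, negatives = [], []
        for doc in sample_docs:
            if extraction_system.extract(doc):   # produced at least one tuple
                positives.append(doc)
            else:
                negatives.append(doc)
        return positives, negatives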
Slide 12: Training Classifiers to Recognize Useful Documents

Document features: words, drawn from fragments such as "is based in near city", "spokesperson reported news earnings release", "products made used exported far", "past old homerun sponsored event".

Classifier Training produces three models:
- Ripper: rules, e.g., "based AND near -> Useful"
- SVM: term weights, e.g., based: 3, spokesperson: 2, sponsored: -1
- Okapi (IR)
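The slide names three learners (Ripper, SVM, Okapi). As one illustration, here is a sketch of the SVM variant over bag-of-words features using scikit-learn, which is our choice of toolkit and not necessarily what QXtract used.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    def train_word_classifier(positive_docs, negative_docs):
        """Learn per-word weights separating useful from useless documents."""
        docs = positive_docs + negative_docs
        labels = [1] * len(positive_docs) + [0] * len(negative_docs)
        vectorizer = CountVectorizer(binary=True)    # document features = words
        X = vectorizer.fit_transform(docs)
        classifier = LinearSVC()
        classifier.fit(X, labels)
        # Per-word weights, e.g., positive for "based" and "spokesperson",
        # negative for "sponsored", roughly as on the slide.
        weights = dict(zip(vectorizer.get_feature_names_out(), classifier.coef_[0]))
        return classifier, weights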
Slide 13: Generating Queries from Classifiers

Query Generation derives queries from each classifier's model:
- Ripper ("based AND near -> Useful"): query "based AND near"
- SVM (based: 3, spokesperson: 2, sponsored: -1): queries "based", "spokesperson"
- Okapi (IR): queries "spokesperson", "earnings"
- QCombined: "based AND near", "spokesperson", "based"
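One simple way to turn such a model into keyword queries is to take the highest-weighted words and combine them into short conjunctions. The sketch below is a generic simplification: the actual query-construction rules differ per classifier (rule conjunctions for Ripper, top-weighted or top-ranked terms for the SVM and Okapi).

    def queries_from_weights(weights, max_terms=4, per_query=2):
        """Build short keyword queries from the top-weighted classifier terms."""
        top_terms = [term for term, _ in sorted(weights.items(),
                                                key=lambda kv: kv[1],
                                                reverse=True)[:max_terms]]
        return [" AND ".join(top_terms[i:i + per_query])
                for i in range(0, len(top_terms), per_query)]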
Slide 14: Architecture of our QXtract System

User-Provided Seed Tuples ((Microsoft, Redmond), (Apple, Cupertino))
  -> Query Generation -> Queries -> Promising Documents
  -> Information Extraction
  -> Extracted Relation: (Microsoft, Redmond), (Apple, Cupertino), (Exxon, Irving), (IBM, Armonk), (Intel, Santa Clara)
Slide 15: Experimental Evaluation: Data

- Training Set
  - 1996 New York Times archive of 137,000 newspaper articles
  - Used to tune QXtract parameters
- Test Set
  - 1995 New York Times archive of 135,000 newspaper articles
Slide 16: Final Configuration of QXtract, from Training
Slide 17: Experimental Evaluation: Information Extraction Systems and Associated Relations

- DIPRE [Brin 1998]
  - Headquarters(Organization, Location)
- Snowball [Agichtein and Gravano 2000]
  - Headquarters(Organization, Location)
- Proteus [Grishman et al. 2002]
  - DiseaseOutbreaks(DiseaseName, Location, Country, Date, ...)
Slide 18: Experimental Evaluation: Seed Tuples

Headquarters:
Organization | Location
Microsoft    | Redmond
Exxon        | Irving
Boeing       | Seattle
IBM          | Armonk
Intel        | Santa Clara

DiseaseOutbreaks:
DiseaseName      | Location
Malaria          | Ethiopia
Typhus           | Bergen-Belsen
Flu              | The Midwest
Mad Cow Disease  | The U.K.
Pneumonia        | The U.S.
Slide 19: Experimental Evaluation: Metrics

- Gold-standard relation R_all, obtained by running the information extraction system over every document in the database D_all
- Recall: fraction of R_all captured in the approximation extracted from the retrieved documents
- Precision: fraction of the retrieved documents that are useful (i.e., produced tuples)
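A small sketch of how these two metrics could be computed, assuming R_all is available as a set of tuples and the same extraction interface as in the earlier sketches.

    def evaluate(retrieved_docs, extraction_system, r_all):
        """Recall: fraction of the gold-standard relation R_all recovered from the
        retrieved documents. Precision: fraction of retrieved documents that are
        useful, i.e., produce at least one tuple."""
        extracted, useful = set(), 0
        for doc in retrieved_docs:
            tuples = extraction_system.extract(doc)
            if tuples:
                useful += 1
            extracted.update(tuples)
        recall = len(extracted & set(r_all)) / len(r_all)
        precision = useful / len(retrieved_docs)
        return recall, precision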
Slide 20: Experimental Evaluation: Relation Statistics

Relation / Extraction System   | |D_all|  | Useful (%) | |R_all|
Headquarters / Snowball        | 135,000  | 23         | 24,536
Headquarters / DIPRE           | 135,000  | 22         | 20,952
DiseaseOutbreaks / Proteus     | 135,000  | 4          | 8,859
Slide 21: Alternative Query Generation Strategies

- QXtract, with the final configuration from training
- Tuples: keep deriving queries from extracted tuples
  - Problem: disconnected databases
- Patterns: derive queries from the extraction patterns of the information extraction system
  - <ORGANIZATION>, based in <LOCATION>  ->  "based in"
  - Problems: pattern features are often not suitable for querying, or not visible from a black-box extraction system
- Manual: construct queries manually (MUC)
  - Obtained for Proteus from its developers
  - Not available for DIPRE and Snowball
- Plus a simple additional baseline: retrieve a random document sample of appropriate size
Slide 22: Recall and Precision: Headquarters Relation, Snowball Extraction System
(precision and recall plots)

Slide 23: Recall and Precision: Headquarters Relation, DIPRE Extraction System
(precision and recall plots)
Slide 24: Extraction Efficiency and Recall: DiseaseOutbreaks Relation, Proteus Extraction System

60% of the relation is extracted from just 10% of the documents of the 135,000-article newspaper database.
Slide 25: Snowball/Headquarters Queries

Slide 26: DIPRE/Headquarters Queries

Slide 27: Proteus/DiseaseOutbreaks Queries
Slide 28: Current Work: Characterizing Databases for an Extraction Task

Decision flow:
- Sparse? no -> Scan
- Sparse? yes -> query-based strategies (QXtract, Tuples):
  - Connected? yes -> Tuples
  - Connected? no  -> QXtract
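Read as code, the decision flow above amounts to something like the sketch below; how the sparseness and connectivity of a database are estimated is the subject of the ongoing work and is not shown.

    def choose_strategy(is_sparse, is_connected):
        """Pick an extraction strategy from two database properties."""
        if not is_sparse:
            return "Scan"       # many documents are useful: just process them all
        if is_connected:
            return "Tuples"     # extracted tuples lead to further useful documents
        return "QXtract"        # learned queries reach disconnected useful documents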
Slide 29: Related Work

- Information Extraction: focus on the quality of extracted relations (MUC); most relevant sub-task: text filtering
  - Filters derived from extraction patterns, or consisting of words (manually created or from supervised learning)
  - Grishman et al.'s manual pattern-based filters for disease outbreaks
  - Related to the Manual and Patterns strategies in our experiments
  - Focus not on querying using a simple search interface
- Information Retrieval: focus on relevant documents for queries
  - In our scenario, relevance is determined by the extraction task and the associated information extraction system
- Automatic Query Generation: several efforts for different tasks
  - Minority language corpora construction [Ghani et al. 2001]
  - Topic-specific document search (e.g., [Cohen and Singer 1996])
Slide 30: Contributions: An Unsupervised Query-Based Technique for Efficient Information Extraction

- Adapts to an arbitrary underlying information extraction system and document database
- Can work over non-crawlable Hidden-Web databases
- Minimal user input required
  - A handful of example tuples
- Can trade off relation completeness and extraction efficiency
- Particularly interesting in conjunction with unsupervised/bootstrapping-based information extraction systems (e.g., DIPRE, Snowball)
Slide 31: Questions?

Slide 32: Overflow Slides
Slide 33: Related Work (II)

- Focused Crawling (e.g., [Chakrabarti et al. 2002]): uses link and page classification to crawl pages on a topic
- Hidden-Web Crawling [Raghavan and Garcia-Molina 2001]: retrieves pages from non-crawlable Hidden-Web databases
  - Needs a rich query interface, with distinguishable attributes
  - Related to the Tuples strategy, but tuples are derived from pull-down menus, etc., from search interfaces as found
  - Our goal: retrieve as few documents as possible from one database to extract the relation
- Question-Answering Systems
Slide 34: Related Work (III)

- [Mitchell, Riloff, et al. 1998] use linguistic phrases derived from information extraction patterns as features for text categorization
  - Related to the Patterns strategy; requires document parsing, so it can't directly generate simple queries
- [Gaizauskas and Robertson 1997] use 9 manually generated keywords to search for documents relevant to a MUC extraction task
Slide 35: Recall and Precision: DiseaseOutbreaks Relation, Proteus Extraction System
(recall and precision plots)
Slide 36: Running Times
Slide 37: Extracting Relations from Text: Snowball [ACM DL 2000]

- Exploit redundancy on the web to focus on easy instances
- Require only minimal training (a handful of seed tuples)

Bootstrapping loop: Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table (and repeat)
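A very rough sketch of this bootstrapping loop; the pattern representation, matching, and confidence estimation of the actual Snowball system are omitted, and find_occurrences, tag_entities, learn_patterns, and match_patterns are assumed helpers rather than Snowball's real functions.

    def snowball(seed_tuples, corpus, max_iterations=5):
        """Bootstrap an extracted table from a handful of seed tuples."""
        table = set(seed_tuples)
        for _ in range(max_iterations):
            occurrences = find_occurrences(table, corpus)   # contexts mentioning known tuples
            tagged = tag_entities(occurrences)              # named-entity tagging
            patterns = learn_patterns(tagged)               # generate extraction patterns
            new_tuples = match_patterns(patterns, corpus)   # generate new candidate tuples (a set)
            if new_tuples <= table:                         # nothing new: stop
                break
            table |= new_tuples                             # augment the table
        return table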