Querying Text Databases for Efficient Information Extraction - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Querying Text Databases for Efficient Information Extraction


1
Querying Text Databases for Efficient
Information Extraction
  • Eugene Agichtein, Luis Gravano
  • Columbia University

2
Extracting Structured Information Buried in
Text Documents
Example text:

Microsoft's central headquarters in Redmond is home to almost every product group and division.

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for "thinking a little too different."

Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.

Extracted tuples:

  Organization      Location
  Microsoft         Redmond
  Apple Computer    Cupertino
  Nike              Portland
3
Information Extraction Applications
  • Over a corporation's customer report or email complaint database: enabling sophisticated querying and analysis
  • Over biomedical literature: identifying drug/condition interactions
  • Over newspaper archives: tracking disease outbreaks, terrorist attacks; intelligence

Significant progress over the last decade [MUC]
4
Information Extraction Example: Organizations' Headquarters

Input Documents -> Named-Entity Tagging -> Pattern Matching -> Output Tuples
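For concreteness, a minimal sketch of the pattern-matching step (not the authors' implementation), assuming a named-entity tagger has already marked organizations and locations inline; the tag names and the example pattern are illustrative:

```python
import re

# Hypothetical tagged sentence produced by a named-entity tagger.
TAGGED = ("<ORGANIZATION>Microsoft</ORGANIZATION>'s central headquarters in "
          "<LOCATION>Redmond</LOCATION> is home to almost every product group.")

# One illustrative pattern: "<ORGANIZATION> ... headquarters in <LOCATION>".
PATTERN = re.compile(
    r"<ORGANIZATION>(?P<org>.+?)</ORGANIZATION>.{0,40}?headquarters in\s+"
    r"<LOCATION>(?P<loc>.+?)</LOCATION>")

def extract_headquarters(tagged_text):
    """Return (organization, location) tuples matched by the pattern."""
    return [(m.group("org"), m.group("loc"))
            for m in PATTERN.finditer(tagged_text)]

print(extract_headquarters(TAGGED))  # [('Microsoft', 'Redmond')]
```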
5
Goal: Extract All Tuples of a Relation from a Document Database

Information Extraction System
Extracted Tuples
  • One approach: feed every document to the information extraction system
  • Problem: efficiency!

6
Information Extraction is Expensive
  • Efficiency is a problem even after training the information extraction system
  • Example: NYU's Proteus extraction system takes around 9 seconds per document
  • Over 15 days to process 135,000 news articles
  • Filtering before further processing a document might help
  • Can't afford to scan the web to process each page!
  • Hidden-Web databases don't allow crawling
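As a back-of-the-envelope check on that figure (my arithmetic, not from the slides):

```latex
135{,}000~\text{documents} \times 9~\text{s/document} \approx 1.2 \times 10^{6}~\text{s} \approx 14~\text{days}
```

of continuous processing, roughly in line with the quoted "over 15 days" once per-document overheads push the average above 9 seconds.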

7
Information Extraction Without Processing All
Documents
  • Observation: Often only a small fraction of the database is relevant for an extraction task
  • Our approach: Exploit the database's search engine to retrieve and process only promising documents

8
Architecture of our QXtract System
User-Provided Seed Tuples: (Microsoft, Redmond), (Apple, Cupertino)
Query Generation -> Queries -> Promising Documents -> Information Extraction -> Extracted Relation

Key problem: Learn queries to retrieve promising documents

Extracted Relation: (Microsoft, Redmond), (Apple, Cupertino), (Exxon, Irving), (IBM, Armonk), (Intel, Santa Clara)
9
Generating Queries to Retrieve Promising Documents
User-Provided Seed Tuples
  1. Get a document sample with likely negative and likely positive examples (Seed Sampling).
  2. Label the sample documents using the information extraction system as an oracle (Information Extraction).
  3. Train classifiers to recognize useful documents (Classifier Training).
  4. Generate queries from the classifier models/rules (Query Generation).
Queries
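A minimal end-to-end sketch of these four steps, assuming a hypothetical `search(query, k)` interface to the database's search engine and an `extract(doc)` wrapper around the information extraction system; the helpers `get_sample`, `train_classifiers`, and `generate_queries` are likewise hypothetical and are sketched after the next few slides:

```python
def qxtract_query_generation(seed_tuples, search, extract,
                             max_sample=1000, num_queries=50):
    """Sketch of QXtract-style query generation (not the original code).

    seed_tuples: e.g. [("Microsoft", "Redmond"), ("Apple", "Cupertino")]
    search(query, k): returns up to k documents from the database search engine
    extract(doc):     returns the tuples the IE system finds in doc
    """
    # 1. Seed sampling: likely-positive and likely-negative documents.
    sample = get_sample(seed_tuples, search, max_sample)

    # 2. Oracle labeling: a document is useful if the IE system
    #    extracts at least one tuple from it.
    labeled = [(doc, len(extract(doc)) > 0) for doc in sample]

    # 3. Train classifiers over word features of the labeled documents.
    models = train_classifiers(labeled)

    # 4. Turn the classifier models/rules into search queries.
    return generate_queries(models, num_queries)
```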
10
Getting a Training Document Sample
Get a document sample with likely negative and likely positive examples:
  • Seed-tuple queries built from the user-provided seed tuples, e.g. "Microsoft AND Redmond", "Apple AND Cupertino" (likely positive)
  • Random queries (likely negative)
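A sketch of this seed-sampling step (the `get_sample` helper assumed in the skeleton above); the random word list and the sample split are illustrative choices, not the slides':

```python
import random

def get_sample(seed_tuples, search, max_sample,
               random_words=("weather", "report", "yesterday", "sports")):
    """Likely-positive docs from seed-tuple queries plus likely-negative
    docs from random single-word queries."""
    per_query = max(1, max_sample // (2 * max(len(seed_tuples), 1)))

    positives = []
    for t in seed_tuples:
        positives += search(" AND ".join(t), per_query)  # e.g. "Apple AND Cupertino"

    negatives = []
    for _ in range(max(len(seed_tuples), 1)):
        negatives += search(random.choice(random_words), per_query)

    return positives + negatives
```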
11
Labeling the Training Document Sample
Information Extraction System

Use the information extraction system as an oracle to label examples as true positive and true negative.

Example extracted tuples: (Microsoft, Redmond), (Apple, Cupertino), (IBM, Armonk)
12
Training Classifiers to Recognize Useful
Documents

Document features: words, e.g. "is", "based", "in", "near", "city", "spokesperson", "reported", "news", "earnings", "release", "products", "made", "used", "exported", "far", "past", "old", "homerun", "sponsored", "event"

Classifier Training produces, for example:
  • Ripper (rule learner):  based AND near => Useful
  • SVM (per-word weights):  based 3, spokesperson 2, sponsored -1
  • Okapi (IR-style term weighting)
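The slide trains three learners (Ripper, an SVM, and Okapi-style term weighting). As a stand-in, here is a minimal sketch that trains only a linear SVM over bag-of-words features with scikit-learn, returned in a one-element list so it fits the skeleton shown earlier; the feature choices are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def train_classifiers(labeled_docs):
    """labeled_docs: list of (document_text, is_useful) pairs from the oracle step."""
    texts = [text for text, _ in labeled_docs]
    labels = [int(useful) for _, useful in labeled_docs]

    # Word features: a plain bag of words stands in for the real feature set.
    vectorizer = CountVectorizer(stop_words="english", binary=True)
    features = vectorizer.fit_transform(texts)

    # A linear model assigns one weight per word, so indicative words
    # (e.g. "based", "near", "spokesperson") end up with large positive weights.
    model = LinearSVC().fit(features, labels)

    return [(vectorizer, model)]
```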
13
Generating Queries from Classifiers
From each trained model, generate queries:
  • The Ripper rule  based AND near => Useful  becomes the query:  based AND near
  • The SVM weights (based 3, spokesperson 2, sponsored -1) and the Okapi term weights yield single-term queries such as:  based, spokesperson, earnings
  • QCombined combines the queries from the different models:  based AND near, spokesperson, based
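A sketch of this step (the `generate_queries` helper assumed earlier): take the highest-weight terms from each linear model as single-term queries and interleave the per-model query lists into one combined list, in the spirit of QCombined; the exact query-construction rules here are my simplification:

```python
import numpy as np

def generate_queries(models, num_queries, terms_per_model=10):
    """models: list of (vectorizer, linear_model) pairs from train_classifiers()."""
    per_model = []
    for vectorizer, model in models:
        vocab = vectorizer.get_feature_names_out()
        weights = model.coef_.ravel()
        # Words with the largest positive weights best indicate useful documents.
        top = np.argsort(weights)[::-1][:terms_per_model]
        per_model.append([str(vocab[i]) for i in top])

    # Round-robin over the models so the combined list mixes all of them.
    combined, i = [], 0
    while len(combined) < num_queries and any(per_model):
        queries = per_model[i % len(per_model)]
        if queries:
            combined.append(queries.pop(0))
        i += 1
    return combined
```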
14
Architecture of our QXtract System
User-Provided Seed Tuples: (Microsoft, Redmond), (Apple, Cupertino)
Query Generation -> Queries -> Promising Documents -> Information Extraction -> Extracted Relation

Extracted Relation: (Microsoft, Redmond), (Apple, Cupertino), (Exxon, Irving), (IBM, Armonk), (Intel, Santa Clara)
15
Experimental Evaluation: Data
  • Training set: 1996 New York Times archive of 137,000 newspaper articles, used to tune QXtract parameters
  • Test set: 1995 New York Times archive of 135,000 newspaper articles

16
Final Configuration of QXtract, from Training
17
Experimental Evaluation: Information Extraction Systems and Associated Relations
  • DIPRE [Brin 1998]
    • Headquarters(Organization, Location)
  • Snowball [Agichtein and Gravano 2000]
    • Headquarters(Organization, Location)
  • Proteus [Grishman et al. 2002]
    • DiseaseOutbreaks(DiseaseName, Location, Country, Date, ...)

18
Experimental Evaluation: Seed Tuples

Headquarters:
  Organization    Location
  Microsoft       Redmond
  Exxon           Irving
  Boeing          Seattle
  IBM             Armonk
  Intel           Santa Clara

DiseaseOutbreaks:
  DiseaseName        Location
  Malaria            Ethiopia
  Typhus             Bergen-Belsen
  Flu                The Midwest
  Mad Cow Disease    The U.K.
  Pneumonia          The U.S.
19
Experimental Evaluation: Metrics
  • Gold standard relation R_all, obtained by running the information extraction system over every document in the database D_all
  • Recall: fraction of R_all captured in the approximation extracted from the retrieved documents
  • Precision: fraction of retrieved documents that are useful (i.e., that produced tuples)
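In symbols (my notation, not the slides'): with R_retr the relation extracted from the retrieved documents and D_retr the set of retrieved documents,

```latex
\text{Recall} \;=\; \frac{|R_{\mathrm{retr}} \cap R_{\mathrm{all}}|}{|R_{\mathrm{all}}|}
\qquad\qquad
\text{Precision} \;=\; \frac{|\{\, d \in D_{\mathrm{retr}} : d \text{ yields at least one tuple} \,\}|}{|D_{\mathrm{retr}}|}
```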

20
Experimental Evaluation: Relation Statistics

  Relation          Extraction System   |D_all|    Useful (%)   |R_all|
  Headquarters      Snowball            135,000    23           24,536
  Headquarters      DIPRE               135,000    22           20,952
  DiseaseOutbreaks  Proteus             135,000    4            8,859
21
Alternative Query Generation Strategies
  • QXtract, with the final configuration from training
  • Tuples: keep deriving queries from extracted tuples (sketched after this list)
    • Problem: disconnected databases
  • Patterns: derive queries from the extraction patterns of the information extraction system
    • <ORGANIZATION>, based in <LOCATION>  =>  "based in"
    • Problems: pattern features are often not suitable for querying, or not visible from a black-box extraction system
  • Manual: construct queries manually [MUC]
    • Obtained for Proteus from its developers
    • Not available for DIPRE and Snowball
  • Plus a simple additional baseline: retrieve a random document sample of the appropriate size
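A sketch of the Tuples baseline mentioned above, reusing the hypothetical `search` and `extract` interfaces from the earlier sketches; when no unqueried tuples remain the loop stalls, which is the "disconnected databases" problem noted in the list:

```python
def tuples_strategy(seed_tuples, search, extract, max_docs, per_query=100):
    """Keep querying with the attribute values of tuples extracted so far."""
    relation = set(seed_tuples)
    queried, retrieved = set(), []

    while len(retrieved) < max_docs:
        # Pick a tuple whose attributes have not been used as a query yet.
        pending = [t for t in relation if t not in queried]
        if not pending:
            break  # no new tuples to query with: the "disconnected" case
        t = pending[0]
        queried.add(t)

        for doc in search(" AND ".join(t), per_query):
            retrieved.append(doc)
            relation.update(extract(doc))
            if len(retrieved) >= max_docs:
                break
    return relation, retrieved
```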

22
Recall and Precision: Headquarters Relation, Snowball Extraction System
[Plot: recall vs. precision of the query generation strategies]
23
Recall and Precision: Headquarters Relation, DIPRE Extraction System
[Plot: recall vs. precision of the query generation strategies]
24
Extraction Efficiency and Recall: DiseaseOutbreaks Relation, Proteus Extraction System

60% of the relation extracted from just 10% of the documents of the 135,000 newspaper article database
25
Snowball/Headquarters Queries
26
DIPRE/Headquarters Queries
27
Proteus/DiseaseOutbreaks Queries
28
Current Work: Characterizing Databases for an Extraction Task

Decision diagram:
  • Sparse? No -> Scan
  • Sparse? Yes -> query-based strategies (QXtract, Tuples):
    • Connected? Yes -> Tuples
    • Connected? No -> QXtract
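Read as a decision rule (this is my reading of the diagram, so treat the branch assignments as an assumption), a tiny sketch:

```python
def choose_strategy(sparse: bool, connected: bool) -> str:
    """Strategy choice as read from the slide's decision diagram (an assumption)."""
    if not sparse:
        return "Scan"  # most documents are useful: scanning the database pays off
    # Sparse database: use a query-based strategy.
    return "Tuples" if connected else "QXtract"
```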
29
Related Work
  • Information Extraction: focus on quality of extracted relations [MUC]; most relevant sub-task: text filtering
    • Filters derived from extraction patterns, or consisting of words (manually created or from supervised learning)
    • Grishman et al.'s manual pattern-based filters for disease outbreaks
    • Related to the Manual and Patterns strategies in our experiments
    • Focus not on querying using a simple search interface
  • Information Retrieval: focus on relevant documents for queries
    • In our scenario, relevance is determined by the extraction task and the associated information extraction system
  • Automatic Query Generation: several efforts for different tasks
    • Minority language corpora construction [Ghani et al. 2001]
    • Topic-specific document search (e.g., [Cohen and Singer 1996])

30
Contributions: An Unsupervised Query-Based Technique for Efficient Information Extraction
  • Adapts to an arbitrary underlying information extraction system and document database
  • Can work over non-crawlable Hidden-Web databases
  • Minimal user input required: a handful of example tuples
  • Can trade off relation completeness and extraction efficiency
  • Particularly interesting in conjunction with unsupervised/bootstrapping-based information extraction systems (e.g., DIPRE, Snowball)

31
Questions?
32
Overflow Slides
33
Related Work (II)
  • Focused Crawling (e.g., [Chakrabarti et al. 2002]): uses link and page classification to crawl pages on a topic
  • Hidden-Web Crawling [Raghavan and Garcia-Molina 2001]: retrieves pages from non-crawlable Hidden-Web databases
    • Needs a rich query interface, with distinguishable attributes
    • Related to the Tuples strategy, but tuples derived from pull-down menus, etc., of the search interfaces as found
    • Our goal: retrieve as few documents as possible from one database to extract a relation
  • Question-Answering Systems

34
Related Work (III)
  • [Mitchell, Riloff, et al. 1998] use linguistic phrases derived from information extraction patterns as features for text categorization
    • Related to the Patterns strategy; requires document parsing, so it can't directly generate simple queries
  • [Gaizauskas and Robertson 1997] use 9 manually generated keywords to search for documents relevant to a MUC extraction task

35
Recall and Precision: DiseaseOutbreaks Relation, Proteus Extraction System
[Plot: recall vs. precision of the query generation strategies]
36
Running Times
37
Extracting Relations from Text: Snowball [ACM DL 2000]
  • Exploit redundancy on the web to focus on easy instances
  • Require only minimal training (a handful of seed tuples)

Bootstrapping loop: Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table (and repeat)
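A compressed sketch of that bootstrapping loop as a higher-order function; `generate_patterns` and `match_patterns` are caller-supplied stand-ins, not Snowball's actual components:

```python
def snowball(seed_tuples, documents, generate_patterns, match_patterns,
             num_iterations=5):
    """Sketch of Snowball-style bootstrapping.

    generate_patterns(tuples, documents): builds extraction patterns from the
        tagged contexts in which the current tuples occur.
    match_patterns(patterns, documents):  applies the patterns and returns
        candidate tuples.
    """
    table = set(seed_tuples)
    for _ in range(num_iterations):
        patterns = generate_patterns(table, documents)
        new_tuples = set(match_patterns(patterns, documents))
        if new_tuples <= table:
            break  # no new tuples: bootstrapping has converged
        table |= new_tuples  # augment the table and iterate
    return table
```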