Title: Surfacing Information in Large Text Collections
1Surfacing Information in Large Text Collections
- Eugene Agichtein
- Microsoft Research
2Example: Angina treatments
Structured databases (e.g., drug info, WHO drug
adverse effects DB, etc.)
Medical reference and literature
Web search results
3Research Goal
- Seamless, intuitive, efficient, and robust access to knowledge in unstructured sources
- Some approaches
- Retrieve the relevant documents or passages
- Question answering
- Construct domain-specific verticals (MedLine)
- Extract entities and relationships
- Network of relationships: Semantic Web
4Semantic Relationships Buried in Unstructured
Text
RecommendedTreatment
A number of well-designed and -executed
large-scale clinical trials have now shown that
treatment with statins reduces recurrent
myocardial infarction, reduces strokes, and
lessens the need for revascularization or
hospitalization for unstable angina pectoris
- Web, newsgroups, web logs
- Text databases (PubMed, CiteSeer, etc.)
- Newspaper Archives
- Corporate mergers, succession, location
- Terrorist attacks
5What Structured Representation Can Do for You
Structured Relation
- allow precise and efficient querying
- allow returning answers instead of documents
- support powerful query constructs
- allow data integration with (structured) RDBMS
- provide useful content for Semantic Web
6Challenges in Information Extraction
- Portability
- Reduce effort to tune for new domains and tasks
- MUC systems: experts would take 8-12 weeks to tune
- Scalability, Efficiency, Access
- Enable information extraction over large collections
- 1 sec/document × 5 billion docs = 158 CPU years
- Approach: learn from data (Bootstrapping)
- Snowball: Partially Supervised Information Extraction
- Querying Large Text Databases for Efficient Information Extraction
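The scalability figure on this slide can be checked with a quick back-of-the-envelope calculation. The sketch below assumes strictly sequential processing on a single CPU and a 365-day year; the function name `cpu_years` is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def cpu_years(num_docs: int, seconds_per_doc: float) -> float:
    """Total sequential processing time, expressed in CPU years."""
    return num_docs * seconds_per_doc / SECONDS_PER_YEAR


# 1 second per document over 5 billion documents, as quoted on the slide.
estimate = cpu_years(num_docs=5_000_000_000, seconds_per_doc=1.0)
print(f"{estimate:.1f} CPU years")  # prints "158.5 CPU years"
```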
7The Snowball System Overview
Snowball
8Snowball: Getting User Input
ACM DL 2000
- User input
- a handful of example instances
- integrity constraints on the relation, e.g., Organization is a key, Age > 0, etc.
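A minimal sketch of how a key constraint could be used to screen candidate tuples against the seed instances. The seed values and the helper name `violates_key` are illustrative assumptions, not Snowball's actual data or API.

```python
from typing import Dict, Tuple

# Hypothetical seed instances for a CompanyHeadquarters-style relation,
# stored as Organization -> Location because Organization is the key.
seed_tuples: Dict[str, str] = {"Microsoft": "Redmond", "Exxon": "Irving"}


def violates_key(candidate: Tuple[str, str], known: Dict[str, str]) -> bool:
    """True if the candidate gives a known organization a different location."""
    organization, location = candidate
    return organization in known and known[organization] != location


candidates = [("Microsoft", "Redmond"), ("Microsoft", "Seattle"), ("IBM", "Armonk")]
for candidate in candidates:
    status = "conflict" if violates_key(candidate, seed_tuples) else "consistent"
    print(candidate, status)
```

Only tuples whose key is already known can conflict; a tuple about an unseen organization is neither confirmed nor rejected by the constraint.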
9Evaluating Patterns and Tuples: Expectation Maximization
- EM-Spy Algorithm
- Hide labels for some seed tuples
- Iterate EM algorithm to convergence on tuple/pattern confidence values
- Set threshold t such that (t > 90% of spy tuples)
- Re-initialize Snowball using new seed tuples
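The thresholding step can be sketched as follows. This covers only the choice of t, not the EM iterations, and it rests on one reading of the slide: t is the largest confidence value that still lets 90% of the hidden ("spy") seed tuples through. The function name and the sample confidences are hypothetical.

```python
import math
from typing import List


def spy_threshold(spy_confidences: List[float], keep_fraction: float = 0.9) -> float:
    """Largest threshold t such that at least keep_fraction of spies score >= t."""
    if not spy_confidences:
        raise ValueError("need at least one spy tuple")
    ranked = sorted(spy_confidences, reverse=True)
    # Number of spies that must pass; the small epsilon guards against
    # floating-point products such as 9.000000000000002.
    must_pass = max(1, math.ceil(keep_fraction * len(ranked) - 1e-9))
    return ranked[must_pass - 1]


# Hypothetical confidences assigned to ten spy tuples after EM converged.
spies = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.10]
t = spy_threshold(spies)

# Candidate tuples at or above t would become the new seed set.
candidates = {"tuple_a": 0.92, "tuple_b": 0.58, "tuple_c": 0.30}
new_seeds = sorted(name for name, conf in candidates.items() if conf >= t)
print(t, new_seeds)
```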
10Adapting Snowball for New Relations
- Large parameter space
- Initial seed tuples (randomly chosen, multiple runs)
- Acceptor features: words, stems, n-grams, phrases, punctuation, POS
- Feature selection techniques: OR, NB, Freq, support, combinations
- Feature weights: TF-IDF, TF, TFNB, NB
- Pattern evaluation strategies: NN, Constraint violation, EM, EM-Spy
- Automatically estimate parameter values
- Estimate operating parameters based on occurrences of seed tuples
- Run cross-validation on hold-out sets of seed tuples for optimal performance
- Seed occurrences that do not have close neighbors are discarded
11Example Task 1: DiseaseOutbreaks
SDM 2006
Proteus: 0.409, Snowball: 0.415
12Example Task 2: Bioinformatics
ISMB 2003
APO-1, also known as DR6
MEK4, also called SEK1
- 100,000 gene and protein synonyms extracted from 50,000 journal articles
- Approximately 40% of confirmed synonyms not previously listed in curated authoritative reference (SWISSPROT)
13Snowball Used in Various Domains
- News: NYT, WSJ, AP (DL00, SDM06)
- CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature: PDRHealth, Micromedex (Ph.D. Thesis)
- AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature: GeneWays corpus (ISMB03)
- Gene and Protein Synonyms
14Limits of Bootstrapping for Extraction
CIKM 2005
- Task is easy when context term distributions diverge from background
- Quantify as relative entropy (Kullback-Leibler divergence)
- After calibration, metric predicts whether bootstrapping is likely to work
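A minimal sketch of the relative-entropy measurement, assuming add-one smoothing over a shared vocabulary and base-2 logarithms. The token lists are toy data, and the calibration step mentioned on the slide is not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def term_distribution(tokens: List[str], vocabulary: List[str],
                      alpha: float = 1.0) -> Dict[str, float]:
    """Smoothed term probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[term] for term in vocabulary) + alpha * len(vocabulary)
    return {term: (counts[term] + alpha) / total for term in vocabulary}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """D(p || q) in bits; q must be non-zero wherever p is non-zero."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


# Toy data: terms seen around seed tuples vs. terms in the whole collection.
context = "outbreak of ebola reported outbreak of cholera reported".split()
background = "the market of the city reported gains of the year".split()
vocabulary = sorted(set(context) | set(background))

p_context = term_distribution(context, vocabulary)
p_background = term_distribution(background, vocabulary)
print(f"{kl_divergence(p_context, p_background):.3f} bits")
```

Smoothing keeps the background probability non-zero for every term, so the divergence stays finite even when a context term never occurs in the background sample.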
15Extracting All Relation Instances From a Text
Database
Information Extraction System
Structured Relation
- Brute force approach: feed all docs to information extraction system
- Only a tiny fraction of documents are often useful
- Many databases are not crawlable
- Often a search interface is available, with existing keyword index
- How to identify useful documents?
16Accessing Text DBs via Search Engines
Information Extraction System
Search Engine
- Search engines impose limitations
- Limit on documents retrieved per query
- Support simple keywords and phrases
- Ignore stopwords (e.g., a, is)
Structured Relation
17Text-Centric Task I: Information Extraction
- Information extraction applications extract
structured relations from unstructured text
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire, is finding itself hard
pressed to cope with the crisis
Disease Outbreaks in The New York Times
Information Extraction System (e.g., NYU's
Proteus)
Information Extraction tutorial yesterday by
AnHai Doan, Raghu Ramakrishnan, Shivakumar
Vaithyanathan
18Executing a Text-Centric Task
Text Database
Extraction System
- Retrieve documents from database
- Two major execution paradigms
- Scan-based: Retrieve and process documents sequentially
- Index-based: Query database (e.g., "case fatality rate"), retrieve and process documents in results
- Similar to the relational world: underlying data distribution dictates what is best
- Unlike the relational world:
- Indexes are only approximate: index is on keywords, not on tokens of interest
- Choice of execution plan affects output completeness (not only speed)
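The two execution paradigms can be contrasted on a toy collection. Everything below is an assumption made for illustration: the four documents, the regular expression standing in for an extraction system, and the single keyword query. The point is that the index-based plan processes fewer documents but can miss tuples.

```python
import re
from typing import Dict, List, Set, Tuple

# Toy text database (hypothetical documents).
docs: Dict[str, str] = {
    "d1": "Ebola outbreak reported in Zaire",
    "d2": "Stock markets rallied on Tuesday",
    "d3": "Cholera epidemic spreads in Peru",
    "d4": "Local team wins championship",
}

# Stand-in for an extraction system: finds (disease, location) pairs.
PATTERN = re.compile(r"(\w+) (?:outbreak|epidemic) .*in (\w+)")


def extract(text: str) -> Set[Tuple[str, str]]:
    return set(PATTERN.findall(text))


def build_index(collection: Dict[str, str]) -> Dict[str, Set[str]]:
    """Keyword index: lowercase token -> ids of documents containing it."""
    index: Dict[str, Set[str]] = {}
    for doc_id, text in collection.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


def scan(collection: Dict[str, str]) -> Tuple[Set[Tuple[str, str]], int]:
    """Process every document; returns (tuples, documents processed)."""
    tuples: Set[Tuple[str, str]] = set()
    for text in collection.values():
        tuples |= extract(text)
    return tuples, len(collection)


def index_based(collection: Dict[str, str], index: Dict[str, Set[str]],
                queries: List[str]) -> Tuple[Set[Tuple[str, str]], int]:
    """Process only documents matching the keyword queries."""
    retrieved: Set[str] = set()
    for query in queries:
        retrieved |= index.get(query, set())
    tuples: Set[Tuple[str, str]] = set()
    for doc_id in retrieved:
        tuples |= extract(collection[doc_id])
    return tuples, len(retrieved)


scan_tuples, scan_cost = scan(docs)
idx_tuples, idx_cost = index_based(docs, build_index(docs), ["outbreak"])
print(scan_cost, sorted(scan_tuples))
print(idx_cost, sorted(idx_tuples))
```

Here the query "outbreak" never retrieves the document that says "epidemic", so the cheaper plan returns an incomplete relation.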
19QXtract: Querying Text Databases for Robust
Scalable Information EXtraction
User-Provided Seed Tuples
Query Generation
Queries
Promising Documents
Information Extraction System
Problem: Learn keyword queries to retrieve
promising documents
Extracted Relation
20Learning Queries to Retrieve Promising Documents
User-Provided Seed Tuples
- Get document sample with likely negative and likely positive examples.
- Label sample documents using information extraction system as oracle.
- Train classifiers to recognize useful documents.
- Generate queries from classifier model/rules.
Seed Sampling
Information Extraction System
Classifier Training
Query Generation
Queries
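A rough sketch of the last two steps of this pipeline. Instead of the classifiers named on the slide, it swaps in a much simpler stand-in: terms are ranked by a smoothed log-odds score between useful and useless sample documents, and the top terms become keyword queries. The sample documents, their labels, and the function name are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def generate_queries(labeled_docs: List[Tuple[str, bool]], k: int = 2) -> List[str]:
    """Return the k terms most indicative of useful documents."""
    useful_df: Counter = Counter()   # document frequency among useful docs
    useless_df: Counter = Counter()  # document frequency among useless docs
    n_useful = n_useless = 0
    for text, is_useful in labeled_docs:
        terms = set(text.lower().split())
        if is_useful:
            useful_df.update(terms)
            n_useful += 1
        else:
            useless_df.update(terms)
            n_useless += 1

    def log_odds(term: str) -> float:
        p = (useful_df[term] + 0.5) / (n_useful + 1.0)
        q = (useless_df[term] + 0.5) / (n_useless + 1.0)
        return math.log(p / q)

    vocabulary = set(useful_df) | set(useless_df)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(vocabulary, key=lambda t: (-log_odds(t), t))[:k]


# Sample labeled by the extraction system acting as oracle (toy data):
# True means the system extracted at least one tuple from the document.
sample = [
    ("ebola outbreak in zaire", True),
    ("cholera outbreak in peru", True),
    ("markets rallied in asia", False),
    ("team wins in overtime", False),
]
print(generate_queries(sample, k=1))  # prints ['outbreak']
```

A term such as "in", which is equally frequent in both classes, scores zero and is never chosen as a query.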
21SIGMOD 2003 Demonstration
22Querying Graph
- The querying graph is a bipartite graph, containing tokens and documents
- Each token (transformed to a keyword query) retrieves documents
- Documents contain tokens
[Figure: bipartite querying graph linking tokens t1 to t5 and documents d1 to d5]
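A small sketch of the querying graph as two adjacency maps. The edges are invented for illustration (the figure's actual edges are not recoverable here), and document-id order stands in for the search engine's ranking when the result limit applies.

```python
from typing import Dict, List, Set

# Hypothetical "documents contain tokens" edges.
doc_tokens: Dict[str, Set[str]] = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t3", "t4"},
    "d4": {"t4"},
    "d5": {"t5"},
}


def build_token_to_docs(contents: Dict[str, Set[str]],
                        max_results: int) -> Dict[str, List[str]]:
    """'Token retrieves documents' edges, capped at max_results per query."""
    token_to_docs: Dict[str, List[str]] = {}
    for doc_id in sorted(contents):  # stand-in for the engine's ranking
        for token in contents[doc_id]:
            hits = token_to_docs.setdefault(token, [])
            if len(hits) < max_results:
                hits.append(doc_id)
    return token_to_docs


def one_hop(token: str, token_to_docs: Dict[str, List[str]],
            contents: Dict[str, Set[str]]) -> Set[str]:
    """Tokens discovered by querying with `token` and reading the results."""
    reached: Set[str] = set()
    for doc_id in token_to_docs.get(token, []):
        reached |= contents[doc_id]
    reached.discard(token)
    return reached


graph_2 = build_token_to_docs(doc_tokens, max_results=2)
graph_1 = build_token_to_docs(doc_tokens, max_results=1)
print(sorted(one_hop("t2", graph_2, doc_tokens)))
print(sorted(one_hop("t2", graph_1, doc_tokens)))
```

Lowering the result limit removes token-to-document edges, which in turn shrinks the set of tokens that each query can discover.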
23Sizes of Connected Components
How many tuples are in largest Core + Out?
[Figure: reachability graph components In, Core (strongly connected), and Out, with node t0]
- Conjecture
- Degree distribution in reachability graphs follows power-law.
- Then, reachability graph has at most one giant component.
- Define Reachability as fraction of tuples in largest Core + Out
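The reachability measure can be sketched on a small directed graph. The code finds the largest strongly connected component (Core), the nodes reachable from it (Out), and the fraction of all nodes in Core + Out. It uses a simple all-pairs search that is adequate only for toy inputs; the example graph is hypothetical.

```python
from collections import deque
from typing import Dict, Set, Tuple


def reachable_from(start: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """All nodes reachable from start (including start), via BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def core_and_out(graph: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str], float]:
    """Return (Core, Out, reachability) for a directed reachability graph."""
    nodes = set(graph) | {n for successors in graph.values() for n in successors}
    if not nodes:
        raise ValueError("graph has no nodes")
    reach = {n: reachable_from(n, graph) for n in nodes}
    core: Set[str] = set()
    for n in sorted(nodes):
        # Nodes mutually reachable with n form its strongly connected component.
        component = {m for m in reach[n] if n in reach[m]}
        if len(component) > len(core):
            core = component
    out = set().union(*(reach[n] for n in core)) - core
    return core, out, len(core | out) / len(nodes)


# Toy graph: t0 leads into the cycle t1 <-> t2, which leads to t3; t4 is isolated.
toy_graph = {"t0": {"t1"}, "t1": {"t2"}, "t2": {"t1", "t3"}, "t3": set(), "t4": set()}
core, out, reachability = core_and_out(toy_graph)
print(sorted(core), sorted(out), reachability)  # prints ['t1', 't2'] ['t3'] 0.6
```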
24NYT Reachability Graph: Outdegree Distribution
Matches the power-law distribution
[Figure: outdegree distribution plots for MaxResults=10 and MaxResults=50]
25NYT Component Size Distribution
[Figure: component size distributions, reachable vs. not reachable, for MaxResults=10 and MaxResults=50]
CG / T = 0.297
CG / T = 0.620
26Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
27Estimate Cost of Retrieval Methods
SIGMOD 2006
- Alternatives
- Scan, Filtered Scan, Tuples, QXtract
- General cost model for text-centric tasks
- Information extraction, summary construction, etc.
- Estimate the expected cost of each access method
- Parametric model describing all retrieval steps
- Extended analysis to arbitrary degree distributions
- Parameter estimates can be piggybacked at runtime
- Cost estimates can be provided to a query optimizer for nearly optimal execution
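How an optimizer might consume such cost estimates can be sketched with a deliberately simple linear model. This is not the cost model from the SIGMOD 2006 work: the formula, the unit costs, and every document count below are hypothetical, and the plans are assumed to reach the same target recall.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanEstimate:
    """Estimated work for one access method (all counts hypothetical)."""
    name: str
    docs_retrieved: int
    docs_filtered: int
    docs_extracted: int
    queries_issued: int


def estimated_cost(plan: PlanEstimate, t_retrieve: float, t_filter: float,
                   t_extract: float, t_query: float) -> float:
    """Linear cost in seconds: per-step counts times per-step unit costs."""
    return (plan.docs_retrieved * t_retrieve
            + plan.docs_filtered * t_filter
            + plan.docs_extracted * t_extract
            + plan.queries_issued * t_query)


def cheapest(plans: List[PlanEstimate], **unit_costs: float) -> PlanEstimate:
    """Pick the plan with the lowest estimated cost."""
    return min(plans, key=lambda plan: estimated_cost(plan, **unit_costs))


plans = [
    PlanEstimate("Scan", 100_000, 0, 100_000, 0),
    PlanEstimate("Filtered Scan", 100_000, 100_000, 5_000, 0),
    PlanEstimate("Query-based", 8_000, 0, 8_000, 100),
]
unit_costs = dict(t_retrieve=0.01, t_filter=0.001, t_extract=1.0, t_query=0.5)
best = cheapest(plans, **unit_costs)
print(best.name)
```

With different unit costs or counts, a different plan wins, which is why the estimates need to be refreshed from statistics gathered at runtime.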
28Optimized Execution of Text-Centric Tasks
Scan
Filtered Scan
Tuples
29Current Research Agenda
- Seamless, intuitive, and robust access to knowledge in biological and medical sources
- Some research problems
- Robust query processing over unstructured data
- Intelligently interpreting user information needs
- Text mining for bio- and medical informatics
- Model implicit network structures
- Entity graphs in Wikipedia
- Protein-Protein interaction networks
- Semantic maps of MedLine
30Deriving Actionable Knowledge from Unstructured
(text) Data
- Extract actionable rules from medical text (Medline, patient reports, ...)
- Joint project (early stages) with medical school, GT
- Epidemiology surveillance (w/ SPH)
- Query processing over unstructured data
- Tune extraction for query workload
- Index structures to support effective extraction
- Queries over extracted and native tables
31Text Mining for Bioinformatics
- Impossible to keep up with literature, experimental notes
- Automatically update ontologies, indexes
- Automate tedious work of post-wetlab search
- Identify (and assign text label) DNA structures
32Mining Text and Sequence Data
PSB 2004
ROC50 scores for each class and method