Title: Modeling Query-Based Access to Text Databases
1. Modeling Query-Based Access to Text Databases
- Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
- Computer Science Department
- Columbia University
2. Extracting Structured Information Buried in Text Documents
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire, is finding itself hard
pressed to cope with the crisis
Information Extraction System (e.g., NYU's Proteus)
3. Extracting All Tuples of a Relation from a Text Database
- Naïve approach: feed every document to the information extraction system. At 7 secs./document, Proteus takes over 8 days for 100K documents.
- Only a tiny fraction of documents contains tuples → processing every document is inefficient.
- Many databases are not crawlable (scannable), but are available only via a search engine.
Search engines can help with both efficiency and accessibility.
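The 8-day figure follows directly from the two numbers on the slide. A quick check, where the constant and function names are illustrative only:

```python
SECONDS_PER_DOCUMENT = 7        # Proteus processing time quoted on the slide
NUM_DOCUMENTS = 100_000         # database size quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60


def naive_extraction_days(num_documents, seconds_per_document):
    """Days needed to run the extraction system over every document."""
    return num_documents * seconds_per_document / SECONDS_PER_DAY


days = naive_extraction_days(NUM_DOCUMENTS, SECONDS_PER_DOCUMENT)
print(f"{days:.1f} days")
```

700,000 seconds is a little over 8.1 days of sequential processing.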
4. A Query-Based Strategy for Information Extraction [Agichtein and Gravano, ICDE 2003]
0. Start with some seed tuples (e.g., <May 1995, Ebola, Zaire>)
1. While seed has an unprocessed tuple t:
2.   Retrieve up to MaxResults documents using a query derived from t
3.   Extract new tuples te from these documents
4.   Augment seed with te
[Figure: seed set containing tuples t0, t1, t2]
Potential problem: the strategy may run out of tuples (and queries) → incomplete relation!
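The loop in steps 0 to 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `TOY_DB`, `search`, and `extract` are hypothetical stand-ins for a real text database, search engine, and extraction system, and tuples are reduced to plain strings.

```python
from collections import deque

# Hypothetical toy database: document id -> tuples an extractor would find in it.
TOY_DB = {
    "d1": {"t0", "t1"},
    "d2": {"t1", "t2"},
    "d3": {"t3"},  # shares no tuple with d1 or d2
}


def search(tuple_query, max_results):
    """Stand-in search engine: documents mentioning the tuple, capped at max_results."""
    hits = sorted(doc for doc, tuples in TOY_DB.items() if tuple_query in tuples)
    return hits[:max_results]


def extract(doc_id):
    """Stand-in information extraction system."""
    return TOY_DB[doc_id]


def iterative_extraction(seed, max_results):
    relation = set(seed)              # step 0: start with the seed tuples
    unprocessed = deque(seed)
    while unprocessed:                # step 1: while seed has an unprocessed tuple t
        t = unprocessed.popleft()
        for doc_id in search(t, max_results):   # step 2: query derived from t
            for te in extract(doc_id):          # step 3: extract new tuples
                if te not in relation:          # step 4: augment seed
                    relation.add(te)
                    unprocessed.append(te)
    return relation


found = iterative_extraction(["t0"], max_results=10)
all_tuples = set().union(*TOY_DB.values())
print(sorted(found), "missed:", sorted(all_tuples - found))
```

Starting from `t0`, the loop never issues a query that retrieves `d3`, so `t3` is missed: the queue empties before the relation is complete.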
5. Iterative Methods Sometimes (but Not Always) Succeed
[Figure: two runs, each starting from a seed; one ends in SUCCESS, the other in FAIL]
Can we predict if a query-based strategy will
succeed?
6. Model: Querying Graph
- Tokens: tuple attributes
- e.g., <May 1995, Ebola, Zaire>
- Each token (as a query) retrieves documents
- Documents contain tokens
[Figure: querying graph linking tokens t1 to t5 with documents d1 to d5]
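A querying graph can be held as two adjacency maps, one per edge direction. The sketch below assumes a toy search engine that returns the documents containing the query token, sorted by id and capped at `max_results`. Both that retrieval rule and the document contents are illustrative assumptions.

```python
def build_querying_graph(documents, max_results):
    """Return (retrieves, contains) for a toy collection.

    documents: dict mapping document id -> set of tokens in that document.
    retrieves: token -> list of documents the token retrieves when used as a query.
    contains:  document -> set of tokens the document contains.
    """
    contains = {doc_id: set(tokens) for doc_id, tokens in documents.items()}
    all_tokens = sorted(set().union(*contains.values()))
    retrieves = {}
    for token in all_tokens:
        matching = sorted(d for d, toks in contains.items() if token in toks)
        retrieves[token] = matching[:max_results]   # search engines cap result lists
    return retrieves, contains


# Hypothetical documents, not the ones drawn on the slide.
docs = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3"}}
retrieves, contains = build_querying_graph(docs, max_results=1)
print(retrieves)
```

With `max_results=1`, `t2` retrieves only `d1` even though `d2` also contains it, so the two edge directions are not symmetric.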
7. Model: Reachability Graph
[Figure: querying graph over tokens t1 to t5 and documents d1 to d5, next to the reachability graph over the tokens derived from it]
t1 retrieves document d1, which contains t2
t2, t3, and t4 are reachable from t1
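The reachability graph can be derived from a querying graph by connecting each token to every other token found in the documents it retrieves. The sketch below uses hypothetical edges, chosen only to be consistent with the two statements above. The actual edges of the slide's figure are not recoverable from the text.

```python
from collections import deque

# Hypothetical querying graph (assumed edges, for illustration only).
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t3", "t4"},
    "d4": {"t4"},
    "d5": {"t5"},
}


def build_reachability_graph(retrieves, contains):
    """Edge a -> b whenever a retrieves some document that contains b (a != b)."""
    edges = {token: set() for token in retrieves}
    for token, docs in retrieves.items():
        for doc in docs:
            for other in contains.get(doc, ()):
                if other != token:
                    edges[token].add(other)
                    edges.setdefault(other, set())
    return edges


def reachable_tokens(edges, start):
    """All tokens reachable from start by following edges (start itself excluded)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


edges = build_reachability_graph(RETRIEVES, CONTAINS)
print(sorted(reachable_tokens(edges, "t1")))
```

In this toy graph, `t5` is reachable only from itself, so no run seeded at `t1` would ever discover it.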
8. Model: Connected Components
[Figure: reachability graph over tokens t1 to t4, partitioned into Core, In, and Out]
- Out: tokens not in Core, but reachable from Core
- In: tokens not in Core, but from which Core is reachable
9. Components of Reachability Graph
[Figure: components of the reachability graph: Core (strongly connected), In, and Out, with token t0 marked]
How many tokens are in the largest Core + Out?
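For a small graph, the Core / In / Out split and the size of Core + Out can be computed directly from per-node reachability sets. The sketch below uses a hypothetical six-token graph and a quadratic-time approach that is fine for illustration but not for large graphs, where a linear-time strongly-connected-components algorithm would be the usual choice.

```python
def reachable_from(graph, start):
    """Set of nodes reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def core_in_out(graph):
    """Return (core, in_set, out_set) around the largest strongly connected component."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    if not nodes:
        return set(), set(), set()
    reach = {u: reachable_from(graph, u) for u in nodes}
    core = set()
    for u in sorted(nodes):
        # u and v are strongly connected when each reaches the other.
        scc = {v for v in reach[u] if u in reach[v]}
        if len(scc) > len(core):
            core = scc
    anchor = next(iter(core))
    out_set = reach[anchor] - core                                # reachable from Core
    in_set = {u for u in nodes if anchor in reach[u]} - core      # can reach Core
    return core, in_set, out_set


# Hypothetical reachability graph: token -> tokens it leads to.
GRAPH = {
    "in1": ["c1"],
    "c1": ["c2"],
    "c2": ["c1", "out1"],
    "out1": [],
    "x": ["y"],   # small component disconnected from the Core
}
core, in_set, out_set = core_in_out(GRAPH)
print(len(core | out_set), "tokens in Core + Out")
```

Here Core + Out holds 3 of the 6 tokens. A run that starts inside In or Core can collect exactly those 3.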
10. Model: Power-Law Graphs
- Conjecture: the degree distribution in the reachability graph follows a power law
- #(nodes with degree k) = $O(k^{-\beta})$
- (i.e., many nodes with small degree, a few nodes with large degree)
- Power-law random graphs are expected to have at most one giant connected component (Core + In + Out). Other connected components are small.
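One rough way to check such a conjecture on observed degrees is to fit a straight line to the degree histogram on log-log axes; the negated slope is an estimate of β. This is a common quick diagnostic rather than a rigorous test, and the slides do not say that the authors used it. The function names and the synthetic histogram are illustrative.

```python
import math
from collections import Counter


def degree_histogram(degrees):
    """Map each positive degree k to the number of nodes with that degree."""
    return Counter(k for k in degrees if k > 0)


def fit_power_law_exponent(histogram):
    """Least-squares slope of log(count) against log(k); returns the estimated beta."""
    points = [(math.log(k), math.log(n)) for k, n in histogram.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx   # count ~ k^(-beta), so beta is the negated slope


# Synthetic histogram built so that count(k) = 1024 * k^(-2).
synthetic = {1: 1024, 2: 256, 4: 64, 8: 16}
beta = fit_power_law_exponent(synthetic)
print(round(beta, 3))
```

On real data the tail of the histogram is noisy, so the fitted value should be read as indicative only.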
11. Model: Reachability
[Figure: Core (strongly connected), In, and Out components, with token t0 marked]
- Reachability: fraction of tokens in the largest Core + Out
- (The power law allows us to ignore small components)
12. Estimating Reachability
- In a power-law random graph G, a giant component C_G emerges if the average outdegree d > 1
- Graph theory results predict the relative size of C_G [Chung and Lu, Annals of Combinatorics, 2002]
Estimate reachability as the relative size of C_G, which reduces to estimating the average outdegree of the reachability graph.
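The Chung and Lu formula for power-law graphs is not reproduced in these slides. As a stand-in that shows the same threshold behavior, the sketch below uses the classical Erdős–Rényi result: in a random graph with mean degree d, the giant-component fraction s satisfies s = 1 - exp(-d s). This is a different, simpler random-graph model, used here only to illustrate why d > 1 matters. The function name is illustrative.

```python
import math


def er_giant_component_fraction(d, iterations=1000):
    """Solve s = 1 - exp(-d * s) by fixed-point iteration (Erdos-Renyi model only).

    This is NOT the power-law formula cited on the slide; it is a simpler
    analogue with the same d > 1 threshold.
    """
    s = 1.0
    for _ in range(iterations):
        s = 1.0 - math.exp(-d * s)
    return s


for d in (0.5, 1.5, 3.0):
    print(d, round(er_giant_component_fraction(d), 3))
```

Below the threshold the fraction collapses to zero. Above it the fraction grows with d, which is why an estimate of the average outdegree is enough to predict the component's relative size.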
13. Estimating Reachability Using Sampling (estimate average outdegree)
- Choose S random seed tokens
- Query the database for the seed tokens
- Extract tokens to compute the reachability graph edges for the seed tokens
- Estimate d as the average outdegree of the seed tokens
- Estimate reachability
[Figure: sampled querying graph over tokens t1 to t5 and documents d1 to d5; in the example, d = 1.5]
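The sampling steps can be sketched as below. The querying graph is hypothetical and was built so that its true average outdegree is 1.5. It is not the graph drawn on the slide, and `RETRIEVES`, `CONTAINS`, and the function names are assumptions made for illustration.

```python
import random

# Hypothetical querying graph: what each token retrieves, what each document contains.
RETRIEVES = {"t1": ["d1", "d2"], "t2": ["d2"], "t3": ["d3"], "t4": ["d3"]}
CONTAINS = {
    "d1": {"t1", "t2"},
    "d2": {"t1", "t2", "t3"},
    "d3": {"t3", "t4"},
}


def outdegree(token):
    """Number of distinct other tokens found in the documents retrieved by token."""
    neighbours = set()
    for doc in RETRIEVES.get(token, []):    # query the database for the seed token
        neighbours |= CONTAINS.get(doc, set())   # extract tokens from the results
    neighbours.discard(token)
    return len(neighbours)


def estimate_average_outdegree(tokens, sample_size, rng):
    """Average outdegree over sample_size tokens drawn without replacement."""
    sample = rng.sample(sorted(tokens), sample_size)   # choose S random seed tokens
    return sum(outdegree(t) for t in sample) / sample_size


d_hat = estimate_average_outdegree(RETRIEVES, sample_size=2, rng=random.Random(0))
d_exact = estimate_average_outdegree(RETRIEVES, sample_size=len(RETRIEVES), rng=random.Random(0))
print(d_hat, d_exact, "giant component expected:", d_exact > 1)
```

Only the sampled tokens are queried, so the cost is S queries plus extraction over their results, rather than a pass over the whole database.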
14. Experimental Results: Verifying the Power-Law Conjecture
Task 1: NYT DiseaseOutbreaks (Date, Disease, Location), New York Times, 1995; |T| = 8,859, |D| = 137,000
The degree distribution follows a power law.
15. Experimental Results: Estimating Reachability by Sampling
- Approximate reachability is estimated with S = 50 tokens
- The estimated reachability correctly predicts the performance of the query-based information extraction strategy
- If the estimated reachability is too low, we can switch to a different strategy early
16. Future Work
- What if we have only limited access to the database?
- Limit on number of queries
- Limit on number of documents retrieved
- Not modelled by the reachability graph, but can be modelled using properties of the querying graph
[Figure: querying graph linking tokens t1 to t5 with documents d1 to d5]
17. Summary
- Presented graph model for query-based algorithms
- for Information Extraction
- for Constructing Database Content Summaries
- Showed that querying and reachability graphs can be used to analyze such algorithms
- Presented a single reachability metric to predict the success of iterative query-based algorithms
- Presented and verified the conjecture that reachability graphs for these algorithms follow a power law
- Presented efficient techniques for estimating reachability by exploiting properties of power-law random graphs