Modeling Query-Based Access to Text Databases

Slides: 18

Provided by: euge86

1
Modeling Query-Based Access to Text Databases

Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
Computer Science Department
Columbia University

2
Extracting Structured Information Buried in Text Documents
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire, is finding itself hard
pressed to cope with the crisis
Information Extraction System (e.g., NYU's Proteus)
3
Extracting All Tuples of a Relation from a Text Database

Naïve approach: feed every document to the information extraction system. At 7 seconds per document, Proteus takes over 8 days for 100K documents.
Only a tiny fraction of documents contains tuples, so processing every document is inefficient.
Many databases are not crawlable (scannable), but are available only via a search engine.

Search engines can help with both efficiency and accessibility.
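The "over 8 days" figure follows directly from the two numbers on the slide. A quick check of the arithmetic (the variable names are illustrative only):

```python
SECONDS_PER_DOCUMENT = 7          # extraction time per document, from the slide
NUM_DOCUMENTS = 100_000           # "100K documents"
SECONDS_PER_DAY = 24 * 60 * 60

total_seconds = SECONDS_PER_DOCUMENT * NUM_DOCUMENTS
total_days = total_seconds / SECONDS_PER_DAY
print(f"{total_seconds} s = {total_days:.1f} days")
```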
4
A Query-Based Strategy for Information Extraction [Agichtein and Gravano, ICDE 2003]
0  Start with some seed tuples (e.g., <May 1995, Ebola, Zaire>)

1  While seed has an unprocessed tuple t
2      Retrieve up to MaxResults documents using a query derived from t
3      Extract new tuples te from these documents
4      Augment seed with te

[Figure: seed containing tuples t0, t1, t2]
Potential problem: may run out of tuples (and queries), leaving an incomplete relation!
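The loop in steps 0 to 4 can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' system: `search` and `extract` stand in for the search engine and the information extraction system, the toy database `TOY_DB` and its placeholder tuples are invented for the example, and "a query derived from t" is assumed to match any document sharing an attribute value with t.

```python
from collections import deque

def iterative_extraction(seed_tuples, search, extract, max_results=10):
    """Grow a relation from seed tuples, following steps 0-4 on the slide."""
    relation = set(seed_tuples)               # step 0: start with the seed tuples
    unprocessed = deque(relation)
    seen_docs = set()
    while unprocessed:                        # step 1: an unprocessed tuple t remains
        t = unprocessed.popleft()
        for doc in search(t)[:max_results]:   # step 2: at most MaxResults documents
            if doc in seen_docs:
                continue                      # do not re-run extraction on a document
            seen_docs.add(doc)
            for new_t in extract(doc):        # step 3: extract tuples from the document
                if new_t not in relation:
                    relation.add(new_t)       # step 4: augment the seed
                    unprocessed.append(new_t)
    return relation

# Toy stand-in for a text database (hypothetical): document id -> tuples it contains.
TOY_DB = {
    "d1": [("May 1995", "Ebola", "Zaire"), ("date-2", "disease-B", "Zaire")],
    "d2": [("date-2", "disease-B", "Zaire"), ("date-3", "disease-B", "place-C")],
    "d3": [("date-4", "disease-D", "place-D")],
}

def toy_search(t):
    """Assumed query semantics: documents sharing any attribute value with t."""
    values = set(t)
    return sorted(doc for doc, tuples in TOY_DB.items()
                  if any(values & set(other) for other in tuples))

def toy_extract(doc_id):
    """Assumed extractor: simply returns the tuples stored for the document."""
    return TOY_DB[doc_id]

seed = [("May 1995", "Ebola", "Zaire")]
result = iterative_extraction(seed, toy_search, toy_extract)
print(len(result), "of 4 tuples found")
```

In this toy run the tuple stored only in `d3` is never found, because no query derived from a reachable tuple retrieves `d3`. That is the "run out of tuples" problem in miniature.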
5
Iterative Methods Sometimes (but not Always) Succeed
[Figure: two runs starting from seed tuples, one labeled SUCCESS and one labeled FAIL]
Can we predict if a query-based strategy will succeed?
6
Model: Querying Graph

Tokens: tuple attributes, e.g., <May 1995, Ebola, Zaire>
Each token (as a query) retrieves documents
Documents contain tokens

[Figure: querying graph linking tokens t1-t5 and documents d1-d5]
7
Model: Reachability Graph
[Figure: querying graph over tokens t1-t5 and documents d1-d5, next to the reachability graph over tokens t1-t5 derived from it]
t1 retrieves document d1, which contains t2
t2, t3, and t4 are reachable from t1
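One way to derive a reachability graph from a querying graph is sketched below. The edge lists are an assumption: the figure's exact edges did not survive in this transcript, so they are chosen only to reproduce the two statements on the slide (t1 retrieves d1, which contains t2; t2, t3, and t4 are reachable from t1). The function names are illustrative.

```python
from collections import deque

def reachability_graph(retrieves, contains):
    """Add an edge t -> t2 whenever t (as a query) retrieves a document containing t2."""
    edges = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            for t2 in contains.get(d, ()):
                if t2 != t:
                    edges[t].add(t2)
                    edges.setdefault(t2, set())
    return edges

def reachable_from(edges, start):
    """Other tokens reachable from `start` by following edges (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen

# Hypothetical querying graph: token -> documents retrieved, document -> tokens contained.
retrieves = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
contains = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d3": ["t3", "t4"],
            "d4": ["t4"], "d5": ["t5"]}

graph = reachability_graph(retrieves, contains)
print(sorted(reachable_from(graph, "t1")))
```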
8
Model: Connected Components
[Figure: reachability graph over tokens t1-t4]
Out: tokens not in Core, but reachable from Core
In: tokens not in Core, but from which Core is reachable
9
Components of Reachability Graph
[Figure: reachability graph split into several components, each with Core (strongly connected), In, and Out regions; seed token t0 marked]
How many tokens are in the largest Core + Out?
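The Core / In / Out split and the "largest Core + Out" question can be made concrete on a small graph. The sketch below uses a simple, quadratic-time approach (a strongly connected component is the intersection of forward and backward reachable sets), which is fine for illustration but is an assumption, not the method used in the talk. The example graph and function names are hypothetical.

```python
def closure(edges, start):
    """All nodes reachable from `start`, including `start` itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def core_in_out(edges):
    """Return (core, in_set, out_set) for the largest strongly connected component."""
    nodes = set(edges) | {v for targets in edges.values() for v in targets}
    if not nodes:
        return set(), set(), set()
    reverse = {n: set() for n in nodes}
    for u, targets in edges.items():
        for v in targets:
            reverse[v].add(u)
    core, assigned = set(), set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        # Strongly connected component of n: reachable from n AND able to reach n.
        scc = closure(edges, n) & closure(reverse, n)
        assigned |= scc
        if len(scc) > len(core):
            core = scc
    anchor = next(iter(core))
    out_set = closure(edges, anchor) - core    # reachable from Core, not in Core
    in_set = closure(reverse, anchor) - core   # can reach Core, not in Core
    return core, in_set, out_set

# Hypothetical reachability graph: a -> b <-> c -> d, plus an isolated token e.
edges = {"a": {"b"}, "b": {"c"}, "c": {"b", "d"}, "d": set(), "e": set()}
core, in_set, out_set = core_in_out(edges)
fraction_core_out = (len(core) + len(out_set)) / 5   # 5 tokens in this toy graph
print(sorted(core), sorted(in_set), sorted(out_set), fraction_core_out)
```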
10
Model: Power-law Graphs

Conjecture: the degree distribution in the reachability graph follows a power law
Number of nodes with degree k = $O(k^{-\beta})$
(i.e., many nodes with small degree, a few nodes with large degree)
Power-law random graphs are expected to have at most one giant connected component (Core + In + Out). Other connected components are small.
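A rough way to check a degree sequence against the power-law conjecture is to fit a straight line to the degree histogram on log-log axes; the negated slope is an estimate of β. This least-squares fit is a common but crude estimator and is used here only as an assumption for illustration; the talk does not say how the fit was done. The synthetic degree sequence is constructed so that the count of nodes with degree k is proportional to k^-2.

```python
import math
from collections import Counter

def fit_power_law_exponent(degrees):
    """Least-squares slope of log(count) against log(degree), negated to give beta."""
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic degrees (hypothetical): about 10000 * k**-2 nodes of each degree k = 1..10.
degrees = []
for k in range(1, 11):
    degrees.extend([k] * round(10000 * k ** -2))

beta = fit_power_law_exponent(degrees)
print(f"estimated beta = {beta:.2f}")
```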

11
Model: Reachability
[Figure: Core (strongly connected), In, and Out regions, with seed token t0]

Reachability: fraction of tokens in the largest Core + Out
(The power law allows us to ignore small components)

12
Estimating Reachability

In a power-law random graph G, a giant component C_G emerges if the average outdegree d > 1
Graph theory results predict the relative size of C_G [Chung and Lu, Annals of Combinatorics, 2002]
Estimate reachability as the relative size of C_G, which reduces to estimating the average outdegree of the reachability graph
13
Estimating Reachability Using Sampling (estimate average outdegree)

Choose S random seed tokens
Query the database for the seed tokens
Extract tokens to compute the reachability graph edges for the seed tokens
Estimate d as the average outdegree of the seed tokens
Estimate reachability

[Figure: sampled seed tokens, the documents they retrieve (d1-d5), and the tokens extracted from them (t1-t5); in the example, d = 1.5]
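The sampling steps above, up to the estimate of d, can be sketched as follows. The names `query` and `extract_tokens` are assumed stand-ins for the search interface and the extraction system, and the toy graph is invented so that the average outdegree comes out at 1.5, matching the value in the figure. The final step, turning d into a reachability estimate, relies on the Chung and Lu results and is not reproduced here.

```python
import random

def estimate_average_outdegree(all_tokens, query, extract_tokens, sample_size, rng=None):
    """Estimate d as the mean outdegree of a random sample of seed tokens."""
    rng = rng or random.Random()
    population = sorted(all_tokens)
    seeds = rng.sample(population, min(sample_size, len(population)))
    if not seeds:
        raise ValueError("no tokens to sample")
    total = 0
    for t in seeds:
        neighbours = set()
        for doc in query(t):                    # query the database with the seed token
            neighbours.update(extract_tokens(doc))
        neighbours.discard(t)                   # a token is not its own neighbour
        total += len(neighbours)                # outdegree of t in the reachability graph
    return total / len(seeds)

# Hypothetical toy database: token -> documents retrieved, document -> tokens contained.
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"]}
CONTAINS = {"d1": ["t1", "t2", "t3"], "d2": ["t2", "t3"],
            "d3": ["t3", "t4", "t1"], "d4": ["t4", "t2"]}

d = estimate_average_outdegree(
    all_tokens=RETRIEVES,
    query=lambda t: RETRIEVES.get(t, []),
    extract_tokens=lambda doc: CONTAINS.get(doc, []),
    sample_size=4,                              # S; here the sample covers every token
    rng=random.Random(0),
)
print("estimated d =", d, "| giant component expected:", d > 1)
```

With a real database S would be much smaller than the number of tokens, so the estimate would vary with the sample drawn.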
14
Experimental Results: Verifying the Power-law Conjecture
Task 1: NYT DiseaseOutbreaks (Date, Disease, Location), New York Times, 1995; T = 8,859, D = 137,000
Follows the power-law distribution
15
Experimental Results: Estimating Reachability by Sampling

Approximate reachability is estimated with S = 50 tokens
The reachability correctly predicts the performance of the query-based information extraction strategy
If the estimated reachability is too low, we can switch to a different strategy early

16
Future Work

What if we have only limited access to the database?
Limit on number of queries
Limit on number of documents retrieved
Not modelled by the reachability graph, but can be modelled using properties of the querying graph

[Figure: querying graph over tokens t1-t5 and documents d1-d5]
17
Summary

Presented a graph model for query-based algorithms:
    for Information Extraction
    for Constructing Database Content Summaries
Showed that querying and reachability graphs can be used to analyze such algorithms
Presented a single reachability metric to predict the success of iterative query-based algorithms
Presented and verified the conjecture that reachability graphs for these algorithms follow the power law
Presented efficient techniques for estimating reachability by exploiting properties of power-law random graphs