Modeling Query-Based Access to Text Databases


1
Modeling Query-Based Access to Text Databases
  • Eugene Agichtein, Panagiotis Ipeirotis, Luis
    Gravano
  • Computer Science Department
  • Columbia University

2
Extracting Structured Information Buried in
Text Documents
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire, is finding itself hard
pressed to cope with the crisis
Information Extraction System (e.g., NYU's
Proteus)
3
Extracting All Tuples of a Relation from a Text
Database
  • Naïve approach: feed every document to the
    information extraction system. At 7
    secs./document, Proteus takes over 8 days for
    100K documents
  • Only a tiny fraction of documents contains tuples
    → Processing every document is inefficient
  • Many databases are not crawlable (scannable), but
    available only via a search engine.

Search engines can help: efficiency and
accessibility
4
A Query-Based Strategy for Information
Extraction [Agichtein and Gravano, ICDE 2003]
0. Start with some seed tuples (e.g., <May
1995, Ebola, Zaire>)
  • 1. While seed has an unprocessed tuple t:
  • 2. Retrieve up to MaxResults documents
    using a query derived from t
  • 3. Extract new tuples te from these
    documents
  • 4. Augment seed with te

[Figure: the seed, holding tuples t0, t1, t2]
Potential problem: may run out of tuples (and
queries) → incomplete relation!
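The loop in steps 0-4 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: `TOY_DOCS`, `toy_search`, and `toy_extract` are hypothetical stand-ins for the text database, its search interface, and the information extraction system, and tuples are shortened to (disease, location) pairs.

```python
from collections import deque

# Hypothetical toy "database": each document is represented directly by the
# tuples an extraction system would find in it (an assumption for illustration).
TOY_DOCS = {
    "d1": [("Ebola", "Zaire"), ("Cholera", "Zaire")],
    "d2": [("Cholera", "Peru")],
    "d3": [("Malaria", "Kenya")],
}


def toy_search(t, max_results):
    """Return ids of up to max_results documents mentioning any attribute value of t."""
    hits = [doc_id for doc_id, tuples in sorted(TOY_DOCS.items())
            if any(value in tup for tup in tuples for value in t)]
    return hits[:max_results]


def toy_extract(doc_id):
    """Stand-in for the information extraction system."""
    return TOY_DOCS[doc_id]


def iterative_extraction(seed, search, extract, max_results=10):
    relation = set(seed)          # step 0: start from the seed tuples
    unprocessed = deque(seed)
    processed_docs = set()
    while unprocessed:            # step 1: some tuple is still unprocessed
        t = unprocessed.popleft()
        for doc_id in search(t, max_results):    # step 2: query derived from t
            if doc_id in processed_docs:
                continue
            processed_docs.add(doc_id)
            for new_t in extract(doc_id):        # step 3: extract tuples
                if new_t not in relation:        # step 4: augment the seed
                    relation.add(new_t)
                    unprocessed.append(new_t)
    return relation


result = iterative_extraction([("Ebola", "Zaire")], toy_search, toy_extract)
```

In this toy database the tuple in d3 shares no attribute value with anything reachable from the seed, so the loop runs out of queries before finding it, which is the "incomplete relation" problem noted above.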
5
Iterative Methods Sometimes (but not Always)
Succeed
[Figure: two runs starting from a seed, one labeled SUCCESS! and one labeled FAIL]
Can we predict if a query-based strategy will
succeed?
6
Model: Querying Graph
[Figure: querying graph with token nodes t1-t5 and document nodes d1-d5]
  • Tokens: tuple attributes
  • <May 1995, Ebola, Zaire>
  • Each token (as a query) retrieves documents
  • Documents contain tokens

7
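Model: Reachability Graph
[Figure: querying graph (tokens t1-t5, documents d1-d5) next to the reachability graph over the tokens]
t1 retrieves document d1, which contains t2, so
the reachability graph has an edge from t1 to t2
t2, t3, and t4 are reachable from t1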
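A minimal sketch of how a reachability graph could be derived from a querying graph follows. The edge lists are hypothetical: the slide figure did not survive extraction, so `RETRIEVES` and `CONTAINS` were chosen only to agree with the two statements above (t1 retrieves d1, which contains t2; t2, t3, and t4 are reachable from t1).

```python
# Hypothetical querying graph (assumed for illustration, not the slide's figure).
RETRIEVES = {            # token -> documents it retrieves when used as a query
    "t1": ["d1"],
    "t2": ["d2", "d3"],
    "t3": ["d3"],
    "t4": ["d4"],
    "t5": ["d5"],
}
CONTAINS = {             # document -> tokens it contains
    "d1": ["t1", "t2"],
    "d2": ["t2", "t3"],
    "d3": ["t3", "t4"],
    "d4": ["t4"],
    "d5": ["t4", "t5"],
}


def reachability_graph(retrieves, contains):
    """Add an edge t -> u whenever t retrieves a document that contains u."""
    graph = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            for u in contains.get(d, ()):
                if u != t:                      # ignore self-loops
                    graph[t].add(u)
                    graph.setdefault(u, set())
    return graph


def reachable(graph, start):
    """Tokens reachable from start by following one or more edges."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen


G = reachability_graph(RETRIEVES, CONTAINS)
from_t1 = reachable(G, "t1")
```

Here `from_t1` holds t2, t3, and t4, while t5 stays unreachable from t1 because no document retrieved along the way contains it.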
8
Model: Connected Components
[Figure: reachability graph over tokens t1-t4]
Out: tokens not in Core, but reachable from Core
In: tokens not in Core, but from which Core is
reachable
9
Components of Reachability Graph
[Figure: reachability graph split into components, each with a Core (strongly connected), an In, and an Out part; token t0 is marked]
How many tokens are in the largest Core + Out?
10
Model: Power-law Graphs
  • Conjecture: the degree distribution in the
    reachability graph follows a power law
  • (number of nodes with degree k) = O(k^{-β})
  • (i.e., many nodes with small degree, a few nodes
    with large degree)
  • Power-law random graphs are expected to have at
    most one giant connected component
    (Core + In + Out). Other connected components are
    small.

11
Model: Reachability
[Figure: Core (strongly connected), In, and Out, with token t0 marked]
  • Reachability:
  • Fraction of tokens in the largest Core + Out
  • (The power law allows us to ignore small components)
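The reachability metric can be computed exactly on a small graph, which may help make the definition concrete. The sketch below is an assumption-laden illustration: it takes "largest" to mean the largest strongly connected component, uses a simple quadratic-time method rather than an efficient SCC algorithm, and runs on a hypothetical six-node graph `EXAMPLE`.

```python
def _reach(graph, starts):
    """All nodes reachable from any node in starts (starts included)."""
    seen = set(starts)
    stack = list(starts)
    while stack:
        n = stack.pop()
        for m in graph.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen


def all_nodes(graph):
    return set(graph) | {m for outs in graph.values() for m in outs}


def core_in_out(graph):
    """Return (Core, In, Out), with Core taken as the largest strongly connected component."""
    nodes = all_nodes(graph)
    reverse = {n: set() for n in nodes}
    for n, outs in graph.items():
        for m in outs:
            reverse[m].add(n)
    core = set()
    unassigned = set(nodes)
    while unassigned:
        n = next(iter(unassigned))
        # Nodes that n can reach and that can reach n form n's component.
        scc = _reach(graph, [n]) & _reach(reverse, [n])
        unassigned -= scc
        if len(scc) > len(core):
            core = scc
    out_part = _reach(graph, core) - core
    in_part = _reach(reverse, core) - core
    return core, in_part, out_part


def reachability(graph):
    """Fraction of all tokens that lie in Core + Out."""
    core, _in_part, out_part = core_in_out(graph)
    return (len(core) + len(out_part)) / len(all_nodes(graph))


# Hypothetical example: a-b-c form a cycle, d hangs off it, e leads into it,
# and f is isolated.
EXAMPLE = {"a": {"b"}, "b": {"c"}, "c": {"a", "d"},
           "d": set(), "e": {"a"}, "f": set()}
core, in_part, out_part = core_in_out(EXAMPLE)
r = reachability(EXAMPLE)
```

For `EXAMPLE`, Core is {a, b, c}, In is {e}, Out is {d}, and the reachability is 4/6.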

12
Estimating Reachability
  • In a power-law random graph G, a giant component
    CG emerges if the average outdegree d > 1
  • Graph theory results predict the relative size of CG

[Chung and Lu, Annals of Combinatorics, 2002]
Estimate reachability as relative size of CG,
which reduces to estimating average outdegree of
reachability graph
13
Estimating Reachability Using Sampling (estimate
average outdegree)
  • Choose S random seed tokens
  • Query the database for each seed token
  • Extract tokens to compute the reachability graph
    edges for the seed tokens
  • Estimate d as the average outdegree of the seed tokens
  • Estimate reachability

[Figure: sampled part of the querying graph over tokens t1-t5 and documents d1-d5; in this example, d = 1.5]
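The sampling steps above could be sketched as follows. This is a hedged illustration only: `RETRIEVES` and `CONTAINS` form a hypothetical toy querying graph, and the function and parameter names (`estimate_average_outdegree`, `retrieve`, `tokens_in`) are invented for the example. The final step, turning d into a reachability estimate via the giant-component results, is not shown because the slides do not give the formula.

```python
import random

# Hypothetical toy querying graph (assumed for illustration).
RETRIEVES = {"t1": ["d1"], "t2": ["d2", "d3"], "t3": ["d3"],
             "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d3": ["t3", "t4"],
            "d4": ["t4"], "d5": ["t4", "t5"]}


def estimate_average_outdegree(tokens, retrieve, tokens_in, sample_size, rng=None):
    """Estimate d by querying only a random sample of seed tokens."""
    rng = rng or random.Random()
    pool = sorted(tokens)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    total = 0
    for t in sample:
        neighbours = set()
        for doc in retrieve(t):               # query the database for the seed token
            neighbours.update(tokens_in(doc))  # tokens extracted from each result
        neighbours.discard(t)                 # a token is not its own neighbour
        total += len(neighbours)              # outdegree of t in the reachability graph
    return total / len(sample)


d_full = estimate_average_outdegree(
    RETRIEVES, RETRIEVES.__getitem__, CONTAINS.__getitem__,
    sample_size=5, rng=random.Random(0))
d_small = estimate_average_outdegree(
    RETRIEVES, RETRIEVES.__getitem__, CONTAINS.__getitem__,
    sample_size=2, rng=random.Random(0))
giant_component_expected = d_full > 1
```

With all five tokens sampled, the toy graph has outdegrees 1, 2, 1, 0, 1, so `d_full` is 1.0 and the d > 1 condition for a giant component is not met. A smaller sample gives a cheaper but noisier estimate.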
14
Experimental Results: Verifying the Power-law
Conjecture
Task 1: NYT DiseaseOutbreaks (Date, Disease,
Location), New York Times, 1995; T = 8,859,
D = 137,000
Follows the power-law distribution
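One rough way to check a power-law conjecture on a degree sequence is to fit a straight line to the degree histogram on log-log axes, since a count proportional to k^{-β} gives log(count) = c − β·log(k). The sketch below uses plain least squares for that fit. It is not the procedure used in the talk, and the synthetic `degrees` list is made up so that the expected exponent is known.

```python
import math
from collections import Counter


def fit_power_law_exponent(degrees):
    """Least-squares slope of log(count) vs. log(degree); returns the estimated β.

    Needs at least two distinct positive degrees, otherwise the slope is undefined.
    """
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx   # slope is -β


# Synthetic degree sequence with count(k) = 1000 / k^2 for k in {1, 2, 5, 10}.
degrees = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
beta = fit_power_law_exponent(degrees)
```

On real data the fit is noisy in the tail, where few nodes have large degree, so a log-log regression like this is best treated as a sanity check.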
15
Experimental Results: Estimating Reachability by
Sampling
  • Approximate reachability is estimated with S = 50
    tokens
  • The reachability correctly predicts the performance
    of the query-based information extraction strategy
  • If the estimated reachability is too low, we can
    switch to a different strategy early


16
Future Work
  • What if we have only limited access to the
    database?
  • Limit on number of queries
  • Limit on number of documents retrieved
  • Not modelled by the reachability graph, but can be
    modelled using properties of the querying graph

[Figure: querying graph over tokens t1-t5 and documents d1-d5]
17
Summary
  • Presented graph model for query-based algorithms
  • for Information Extraction
  • for Constructing Database Content Summaries
  • Showed that querying and reachability graphs can
    be used to analyze such algorithms
  • Presented single reachability metric to predict
    success of iterative query-based algorithms
  • Presented and verified conjecture that
    reachability graphs for these algorithms follow
    the power law
  • Presented efficient techniques for estimating
    reachability by exploiting properties of
    power-law random graphs