Title: Efficiently gathering information on the Internet using AI
1Efficiently gathering information on the Internet using AI and DB techniques: The EMERAC Project
Subbarao Kambhampati, Arizona State University, Tempe, USA
rakaposhi.eas.asu.edu/yochan.html
2Motivation
- A lot of data on the WWW is actually generated by databases
- 80% of the web is hidden: it is dynamically generated in response to queries from users
- It would be useful to be able to do database-style querying of the web.
3Information Gathering
[Figure: the user queries the Gatherer, which reaches the sources (db, <html>, cgi) through wrappers]
4Meta-search engines
- Text-based
- No access to hidden web
5Comparison shoppers: Junglee, Netbot, DealPilot.Com
- Call every source
- Collate results
6- CORA allows more sophisticated queries
- All papers that cite Rao, but are not written by Rao
- Neither CORA nor DBLP is complete
- CORA tends to be more complete for online papers and AI papers
- DBLP sticks to published papers, and is more complete in DB coverage
- Both sources provide Englishified-BIBTEX citations
7Data representation
- Global as View (GAV)
- The global (mediated) schema is written as a view on the sources (databases)
- Simpler query processing (no reformulation needed)
- Less modular (the schema changes when new sources are added)
8Data Representation (2)
- Local as View (LAV)
- Modular
- New sources can be added without changing the global schema
- Needs more sophisticated query processing
- User query needs to be reformulated into source calls
- Compiling LAV into GAV
9Data location
- Warehousing vs. virtual (on-line) sources
- Warehousing avoids the problems of the net
- Data may get stale.
- There may be too much data.
- You may not be allowed to shift the data over
- The virtual source method accesses the sources on demand
- Has to handle internet problems such as bursty traffic, setup delays etc.
- Hybrid
- Selective materialization, caching etc.
10Tricky issues
- Sources are not really databases!
- Legacy systems
- Limited access patterns
- (Can't ask a white-pages source for the list of all numbers)
- Limited local processing power
- Typically only selections (on certain attributes) are supported
- Sources are autonomous
- Unregulated data overlap
- Lack of full statistics on the sources
11EMERAC Query Planning System
- Build query plan using source inversion [Duschka (with Genesereth, Levy) 97]
- Optimization steps
- Logical optimizations: redundancy removal
- Execution optimizations: source call ordering
- Execute query plan
12Desirable Properties of information gathering plans
- Source-complete: no other plan returns more information using the available sources
- Different from the traditional query equivalence requirement
- Source-minimal: a plan from which no information source can be removed without changing the answer the plan returns
- Access-cost minimal: a plan which minimizes the number of separate accesses to individual sources
- Bandwidth-minimal: a plan that, when executed, transfers the smallest amount of data over the network yet is still source-complete
13Ensuring properties of optimal information gathering plans
- Build query plan: source completeness
- Logical optimizations: source-minimality
- Execution optimizations: access cost and bandwidth minimality
- Execute query plan
14Modeling Information Gathering in EMERAC
- Information sources
- relational
- answer select queries (possibly a restricted set of query patterns)
- autonomous
- World model
- relational
- Query on the world model
- Reformulate the query as calls on information sources. Optimize. Execute.
- Local as View model
15Source Access Limitations
- Sources can have a variety of access limitations
- Form-interfaced databases may require certain attributes to be bound
- Whitepages may require the name of the person
- To get the numbers of a set of n people, we will have to access the source n times
- and may be unable to handle bindings of other attributes
- A Whitepages database may not take the address of a person as a bound attribute
- To get the number of John Doe, who lives on Lemon St, we will have to get the numbers of all John Does, and locally filter out the ones not living on Lemon Street
- Wrapped web-pages cannot select over any attributes
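The white-pages limitations above can be made concrete with a small Python sketch. Everything in it is hypothetical and for illustration only: the `WhitePages` class, its `lookup` method, and the directory rows are assumptions, not part of EMERAC.

```python
# Hypothetical white-pages contents: (name, street, phone).
_DIRECTORY = [
    ("John Doe", "Lemon St", "555-0101"),
    ("John Doe", "Apple Ave", "555-0102"),
    ("Jane Roe", "Lemon St", "555-0103"),
]


class WhitePages:
    """Toy source: the name MUST be bound, the street CANNOT be bound."""

    def __init__(self):
        self.calls = 0  # number of separate accesses made so far

    def lookup(self, name):
        if not name:
            raise ValueError("the name attribute must be bound")
        self.calls += 1
        return [row for row in _DIRECTORY if row[0] == name]


def numbers_for(source, names):
    """Getting the numbers of n people costs n separate source accesses."""
    return {name: [phone for (_, _, phone) in source.lookup(name)] for name in names}


def number_at(source, name, street):
    """The street cannot be passed to the source, so filter locally."""
    return [phone for (_, s, phone) in source.lookup(name) if s == street]
```

Calling `numbers_for` with two names makes two accesses, and `number_at` fetches every John Doe before discarding the ones on other streets.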
16Representing Source Access Limitations
- Use annotations on the attributes of the source relation
- The $ annotation identifies attributes that must be bound
- The % annotation identifies un-selectable attributes
- S($X,%Y,Z)
- A form-interfaced web-page that requires bindings for X and is able to do selections only on Z.
- The $ and % annotations help identify feasible binding patterns for sources
- S^b-- are feasible; S^f-- are infeasible
- S^bbf must be modeled as S^bff filtered locally with binding on Y
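A minimal Python sketch of how such annotations could be turned into feasible binding patterns. The function names and the encoding (a pattern is a string of 'b'/'f', annotations are sets of attribute positions) are assumptions made for illustration, not EMERAC's actual representation.

```python
from itertools import product


def feasible_patterns(arity, must_bind, unselectable):
    """Binding patterns the source accepts directly.

    must_bind: positions that must be bound in every call.
    unselectable: positions the source cannot select on.
    """
    patterns = []
    for bits in product("bf", repeat=arity):
        ok = all(bits[i] == "b" for i in must_bind) and all(
            bits[i] == "f" for i in unselectable
        )
        if ok:
            patterns.append("".join(bits))
    return patterns


def model_call(pattern, must_bind, unselectable):
    """Map a desired pattern to (pattern sent to the source, positions
    filtered locally), or None if a must-bind attribute is left free."""
    if any(pattern[i] == "f" for i in must_bind):
        return None
    local = [i for i in sorted(unselectable) if pattern[i] == "b"]
    sent = "".join("f" if i in unselectable else c for i, c in enumerate(pattern))
    return sent, local
```

For the source S(X,Y,Z) above (X must be bound, Y un-selectable), `model_call("bbf", {0}, {1})` gives `("bff", [1])`: the call is sent as bff and the binding on Y is applied locally.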
17Modeling Sources
- Sources are related to the world model by describing them as views over the world model
- Source descriptions are restricted to conjunctive queries (SPJ)
movie-hut(X, Y, Z) -> title-time(X, Y), title-actor(X, Z)
house-of-movies(X, Y, Z) -> title-time(X, Y), title-actor(X, Z)   (required binding on X)
query(X, Y) :- title-time(X, Y)
18Computing source-complete plans
- Invert the source descriptions
- Plans for individual world relations
- Concatenate the query and the source inversion rules
- A datalog program which when executed will return all accessible tuples
movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
title-time(X, Y) :- movie-hut(X, Y)
house-of-movies(X, Y) -> title-time(X, Y), title-actor(X, Z)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
19Building Source Complete Plans
[Duschka, Genesereth 97]
Query:
query(X, Y) :- title-time(X, Y)
Source descriptions:
movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
house-of-movies(X, Y) -> title-time(X, Y), title-actor(X, Z)
Source Inversion Rules:
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
Binding restrictions lead to recursion in the plan
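The recursion through dom can be illustrated with a small bottom-up evaluation in Python. This is only a sketch under stated assumptions: both sources are taken to be two-attribute relations, the source contents are invented, the title-actor rules are left out, and `call_house_of_movies` stands in for a source that requires a binding on its first attribute.

```python
# Hypothetical source contents: (title, time) pairs.
MOVIE_HUT = {("Annie Hall", "7pm"), ("Sleeper", "9pm")}
HOUSE_OF_MOVIES = {("Annie Hall", "8pm"), ("Sleeper", "9pm"), ("Zelig", "6pm")}


def call_house_of_movies(x):
    """Source with a required binding: it only answers for a given first attribute."""
    return {(a, b) for (a, b) in HOUSE_OF_MOVIES if a == x}


def run_plan():
    dom, title_time, calls = set(), set(), 0
    # Non-recursive rules: movie-hut has no binding restriction.
    for x, y in MOVIE_HUT:
        title_time.add((x, y))
        dom.update((x, y))
    # Recursive rules: every constant in dom is tried as a binding, and the
    # answers may add new constants to dom, so iterate to a fixpoint.
    tried = set()
    while dom - tried:
        x = next(iter(dom - tried))
        tried.add(x)
        calls += 1
        for a, b in call_house_of_movies(x):
            title_time.add((a, b))
            dom.add(b)
    return title_time, calls
```

With this data the Zelig tuple is never retrieved, because no known constant can be fed to house-of-movies to reach it: the plan returns all accessible tuples, not all tuples.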
20Complexity of finding maximally-contained plans
(Certain answers)
- Source inversion approach has poly-time complexity for the case considered in EMERAC
- Complexity doesn't depend on the query
- Can handle recursive queries just as easily
- Complexity does change if the sources are not conjunctive queries
- Sources as unions of conjunctive queries (NP-hard)
- Sources as recursive queries (Undecidable)
- Comparison predicates
- Complexity also changes based on Open vs. Closed world assumption
21Practical Problems with Plans derived from source inversion rules
- Every source that is remotely relevant to the query is made part of the plan
- Many of these sources may be overlapping
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
- If both movie-hut and house-of-movies have the same information
- both sources are not necessary
- the recursion is not necessary
22Optimization challenges in EMERAC
Traditional
- Each relation is exported in toto by a single database
- All sources are assumed to be fully relational
Information Gathering
- Multiple sources export partial and overlapping portions of a relation
- Need to minimize plans to remove redundancy
- Sources are rarely fully relational
- Only limited types of queries allowed
- Wrapped web-pages
- Form-interfaced databases
- Certain forms of join computation may be precluded
- Need to model query capabilities
23Minimizing information gathering plans
- Model source overlaps
- Use LCW (local closed world) statements
- Rewrite the source-complete plan
- Greedily remove rules from the plan using uniform equivalence and LCW statements (to make the plan source-minimal)
- Uniform containment checks [Sagiv, 88]
- Use heuristics to guide removal, and pull out recursion first
24LCW Statements
- View: movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
- LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
- LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z), Z = "Allen"
- To check if one rule, r1, with information source predicates contains another rule, r2, see if r1[s ↦ s ∨ l] contains r2[s ↦ s ∧ v], where v is the view definition and l the LCW statement of source s
- Inter-source subsumption relations (mirror sources) can also be handled
- [Etzioni et al 97, Duschka 97]
25Testing for Uniform Containment
- Does the program
p(X, Y) :- q(X, Y)
q(X, Y) :- r(X, Y)
uniformly contain the rule
p(W, X) :- r(W, X) ?
- Assert r(W, X) and try to derive p(W, X) using bottom-up evaluation
- Exponential complexity...
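The check on this slide can be sketched in Python for function-free datalog. The representation is an assumption made for this example: an atom is a tuple `(predicate, arg, ...)`, a rule is `(head, body_atoms)`, and variables are strings starting with an uppercase letter. The rule's variables are frozen into fresh constants, its body is asserted as facts, and a naive bottom-up evaluation tests whether the frozen head is derived.

```python
def is_var(term):
    return isinstance(term, str) and term[:1].isupper()


def match(atom, fact, subst):
    """Extend subst so that atom equals fact; return None if impossible."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    s = dict(subst)
    for a, f in zip(atom[1:], fact[1:]):
        if is_var(a):
            if s.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return s


def bottom_up(program, facts):
    """Naive fixpoint evaluation of a function-free datalog program."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in program:
            substs = [{}]
            for atom in body:
                substs = [
                    s2
                    for s in substs
                    for f in facts
                    if (s2 := match(atom, f, s)) is not None
                ]
            for s in substs:
                new = (head[0],) + tuple(s.get(t, t) for t in head[1:])
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts


def uniformly_contains(program, rule):
    """Freeze the rule's variables, assert its body, try to derive its head."""
    head, body = rule

    def freeze(atom):
        return (atom[0],) + tuple(
            "c_" + t.lower() if is_var(t) else t for t in atom[1:]
        )

    return freeze(head) in bottom_up(program, {freeze(a) for a in body})
```

On the slide's example the answer is yes: asserting r(c_w, c_x) derives q(c_w, c_x) and then p(c_w, c_x). Function terms such as f1(X, Y) are not handled by this sketch.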
26Greedily Minimizing Information Gathering Plans
- Remove non-recursive IDB predicates
- Sort the rules so those with dom predicates come before those without dom predicates
- For each rule r do
- let r be a rule of P that has not yet been considered
- let P' be the program obtained by deleting rule r from P
- if P'[s ↦ s ∨ l] uniformly contains r[s ↦ s ∧ v] then
- replace P with P'. Prune unreachable rules.
- Source costs can be used
- Uniform containment check is exponential in the worst case
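The greedy loop above can be sketched in Python with the containment test passed in as a function. The rule encoding (dicts with a `body` list of predicate names) and `toy_oracle` are hypothetical: the oracle compares invented answer sets and is only a stand-in for the real uniform-containment check with LCW substitutions. Pruning of unreachable rules is omitted.

```python
def greedy_minimize(rules, uniformly_contains):
    """Drop every rule that the rest of the plan already covers."""
    # Rules mentioning dom are considered first, so recursion is pulled out early.
    plan = sorted(rules, key=lambda r: "dom" not in r["body"])
    for rule in list(plan):
        rest = [r for r in plan if r is not rule]
        if uniformly_contains(rest, rule):
            plan = rest  # the rule is redundant
    return plan


def toy_oracle(rest, rule):
    """Stand-in containment test over hypothetical per-rule answer sets."""
    covered = set().union(*(r["answers"] for r in rest))
    return rule["answers"] <= covered
```

With two mirror sources that yield the same answers, the rule that goes through dom is considered first and removed, leaving the non-recursive rule in the plan.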
27Minimization example
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
28LCW vs. Naïve Artificial Sources
29EMERAC
- Build query plan: source completeness
- Logical optimizations: source-minimality
- Execution optimizations: access cost and bandwidth minimality
- Execute query plan
30Optimization challenges in EMERAC
Traditional
- Each relation is exported in toto by a single database
- All sources are assumed to be fully relational
Information Gathering
- Multiple sources export partial and overlapping portions of a relation
- Need to minimize plans to remove redundancy
- Sources are rarely fully relational
- Only limited types of queries allowed
- Wrapped web-pages
- Form-interfaced databases
- Certain forms of join computation may be precluded
- Need to model query capabilities
31Optimization challenges in EMERAC (continued)
Traditional
- Tuple-transfer costs are assumed to dominate the query-execution costs
- Use of Bound-is-easier assumption
- Assume availability of full source-statistics
- Selectivity indices, histograms etc.
Information Gathering
- Access cost (source latencies) tends to equal or dominate the transfer cost
- Need to consider number of source calls
- Need for considering bushy joins (instead of just left-linear join trees)
- Full statistics are rarely available about internet sources
- Sources are decentralized and autonomous
- Difficult to do systematic optimization
32Issues in ordering source calls
- Execution cost is a function of both access cost and the tuple-transfer cost (ignoring local processing costs)
- Tension between access costs and traffic costs
- E.g. execute S1(W,X) ⋈ S2(X,Y) where the query binds W
- Tuple-transfer cost reduction motivates calling sources with the least general binding patterns possible
- Bound-is-easier (S1 first, and then feed X bindings to S2)
- Access cost reduction motivates calling sources with the most general binding patterns possible
- Feeding X bindings for S2 will generate many separate accesses, increasing the access cost
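The tension can be seen with a rough cost model in Python. The linear cost formula and all parameters are assumptions for illustration: S1 returns `k` tuples (taken to carry `k` distinct X values), S2 returns `m` tuples per bound X value, and `n` tuples when called once with no bindings.

```python
def execution_cost(calls, tuples, access_cost, transfer_cost):
    """Total cost = per-call access cost plus per-tuple transfer cost."""
    return calls * access_cost + tuples * transfer_cost


def compare_orderings(k, m, n, access_cost, transfer_cost):
    # In both plans S1 is called once with W bound and returns k tuples.
    # Plan 1: feed each of the k X bindings to S2 (k more calls, m tuples each).
    feed_bindings = execution_cost(1 + k, k + k * m, access_cost, transfer_cost)
    # Plan 2: call S2 once with the most general pattern and join locally.
    most_general = execution_cost(2, k + n, access_cost, transfer_cost)
    return {"feed_bindings": feed_bindings, "most_general": most_general}
```

With `k=50, m=4, n=2000`, a per-call cost of 500 and a per-tuple cost of 1, feeding bindings costs 25750 against 3050 for the single general call. Lowering the per-call cost to 5 reverses the outcome (505 against 2060).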
33Our Approach: Assumptions
- Exact optimization is not worth it
- Lack of full source statistics
- NP-hardness of the optimization problem
- Join-ordering, which is a special case, is already NP-Complete
- Source access costs dominate tuple-transfer costs by default
- Reasonable given the large setup and latency costs for internet sources
34Our Approach: Overview
- A greedy approach (along the lines of bound-is-easier type procedures)
- By default, attempts to access each source with the most general feasible binding pattern
- Reasonable given the assumption that access costs dominate transfer costs
- The default is overridden if a binding pattern is known to produce too much traffic
- Binding patterns producing high traffic are stored in a table called HTBP
- Implicitly produces bushy join trees
35The HTBP Table
- The HTBP table contains, for every source S, the least general binding patterns of S which are known to produce high traffic
- A call to source S with binding pattern B is considered high-traffic producing if HTBP contains S^B' and B is either equal to or more general than B'
- E.g. Book(Author,Title,ISBN,Subj,Price,Pages)
- HTBP may contain all binding patterns that do not bind at least one of the first four attributes
- Book^ffffbb listed explicitly in HTBP
- Book^fffffb, Book^ffffbf, Book^ffffff would be considered to be implicitly in HTBP
- Advantage: HTBP should be easy to specify even if full source statistics are not available
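A minimal Python sketch of the HTBP lookup, assuming the table is a dict from source name to a list of explicitly listed patterns and that a pattern is a string of 'b'/'f'. The function names are hypothetical.

```python
def more_general_or_equal(pattern, listed):
    """pattern is at least as general as listed if it binds no attribute
    that listed leaves free."""
    return len(pattern) == len(listed) and all(
        not (p == "b" and l == "f") for p, l in zip(pattern, listed)
    )


def is_high_traffic(source, pattern, htbp):
    """True if pattern equals, or is more general than, a listed pattern."""
    return any(more_general_or_equal(pattern, listed) for listed in htbp.get(source, ()))
```

With `htbp = {"Book": ["ffffbb"]}`, the patterns fffffb, ffffbf and ffffff are all reported as high-traffic without being listed, while bfffff (which binds Author) is not.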
36The Algorithm
For each stage i from 1 to m do
    For each unchosen subgoal S
        Pick the most general feasible BP B of S w.r.t. V and FBP such that S^B is not in HTBP.
        If such a B exists (default case: reduce accesses)
            Push S^B into Ci. Mark S chosen. Add all variables of S to V.
        If no such B exists, but there is a feasible binding pattern for S (HTBP case: reduce transfer costs)
            Pick the BP B with the most bound variables. Push S^B into Pi.
    If no subgoal has been chosen at this level (Ci is empty), and there are some postponed sources (Pi is non-empty)
        Choose Sk^B in Pi whose B has the most bound variables. Push Sk^B into Ci. Add all variables of Sk to V.
Return the array C1..Cm
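The staged procedure can be sketched in Python as follows. This is an illustrative reading of the slide, not EMERAC's Java code: the data layout (subgoals as a dict from source name to variable tuple, patterns as 'b'/'f' strings, `must_bind` as positions) is assumed, and "most general" is approximated by "fewest bound attributes".

```python
from itertools import product


def is_high_traffic(name, pattern, htbp):
    # High-traffic if the pattern binds nothing that a listed pattern leaves free.
    return any(
        all(not (p == "b" and l == "f") for p, l in zip(pattern, listed))
        for listed in htbp.get(name, ())
    )


def feasible(args, bound, must_bind):
    """Feasible patterns given the currently bound variables, most general first."""
    pats = []
    for bits in product("fb", repeat=len(args)):
        can_supply = all(b == "f" or args[i] in bound for i, b in enumerate(bits))
        meets_source = all(bits[i] == "b" for i in must_bind)
        if can_supply and meets_source:
            pats.append("".join(bits))
    return sorted(pats, key=lambda p: p.count("b"))


def order_source_calls(subgoals, bound_vars, htbp, must_bind=None):
    must_bind = must_bind or {}
    bound = set(bound_vars)
    unchosen = dict(subgoals)
    stages = []
    while unchosen:
        chosen, postponed = [], []
        for name in list(unchosen):
            cands = feasible(unchosen[name], bound, must_bind.get(name, ()))
            low_traffic = [p for p in cands if not is_high_traffic(name, p, htbp)]
            if low_traffic:  # default case: reduce accesses
                chosen.append((name, low_traffic[0]))
                bound.update(unchosen.pop(name))
            elif cands:  # HTBP case: postpone with the most-bound pattern
                postponed.append((name, cands[-1]))
        if not chosen:
            if not postponed:
                raise ValueError("no subgoal has a feasible binding pattern")
            name, pattern = max(postponed, key=lambda c: c[1].count("b"))
            chosen.append((name, pattern))
            bound.update(unchosen.pop(name))
        stages.append(chosen)
    return stages
```

Each inner list of the result is one stage; calls within a stage do not feed each other, which is how bushy join trees arise. Subgoals are visited in insertion order, so the result can depend on the order in which they are supplied.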
37Example
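- Sources: DP(A:Author, T:Title, Y:Year), SM98(T:Title, U:URL)
- Query: Q(A,T,U,1998)
- Plan: Q(A,T,U,1998) :- DP(A,T,1998), SM98(T,U)
(XX marks a candidate rejected because it is in HTBP)
Case 1: HTBP = {DP^bbb, SM98^bb} (same plan as bound-is-easier)
- Step 1. V = {Y}. Cand: DP^fff XX, DP^ffb XX, SM98^ff XX. P1 = {DP^ffb, SM98^ff}. C1 = {DP^ffb}
- Step 2. V = {A,T,Y}. Cand: SM98^ff XX, SM98^bf XX. P2 = {SM98^bf}. C2 = {SM98^bf}
Case 2: HTBP = {DP^ffb}
- Step 1. V = {Y}. Cand: DP^fff XX, DP^ffb XX, SM98^ff. C1 = {SM98^ff}
- Step 2. V = {Y,U,T}. Cand: DP^fff XX, DP^ffb XX, DP^fbf, DP^fbb. C2 = {DP^fbf}
Case 3: HTBP = {}
- Step 1. V = {Y}. Cand: DP^fff, DP^ffb, SM98^ff. C1 = {SM98^ff, DP^fff}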
38Implementation
- The Emerac Information Gatherer
- written in Java
- incorporates rewriting and execution ordering techniques
- executes plans in parallel
- returns partial results during plan execution
- object oriented design makes it easy to modify
39EMERAC's Contributions
- An approach for minimizing recursive information gathering plans
- An approach for ordering source calls in information gathering plans
- Attempts at minimizing both access cost and tuple-transfer cost
- (Partial) implementation and evaluation
What next??
40More capable sources
- EMERAC assumes sources can only do selection processing. Real sources tend to provide more capabilities
- Many sources can do union queries on attributes
- E.g. CNN Stock quote tracker allows up to 8 symbols at a time
- Some support constraints
- Give me all flights with prices less than 300
- Theoretically, such sources can be modeled as supplying a (possibly infinite) number of views.
- Query optimization is harder when the capabilities are neither full nor highly limited.
41More realistic overlap statistics
- LCWs may not be available (or may not be advertised)
- Statistics on coverage and overlap may be available
- Source A and Source B have 70% overlap on tuples
- How to use them?
- Computing unions given partial information about intersections.
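For two sources the union size follows directly from inclusion-exclusion. The Python sketch below is only an illustration: it assumes the overlap figure is stated as a percentage of Source A's tuples, and the sizes used are invented. With more than two sources, pairwise overlaps alone no longer determine the union, which is the hard part the slide points at.

```python
def union_size(size_a, size_b, overlap_pct_of_a):
    """|A ∪ B| = |A| + |B| - |A ∩ B|, with the overlap given as a % of A."""
    intersection = size_a * overlap_pct_of_a / 100
    return size_a + size_b - intersection


def union_bounds(size_a, size_b):
    """Bounds on |A ∪ B| when nothing is known about the intersection."""
    return max(size_a, size_b), size_a + size_b
```

For example, with 1000 tuples in A, 800 in B and a 70% overlap measured on A, calling B after A contributes only 100 new tuples.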
42Optimizing for First n-tuples
- Traditional techniques optimize time to get all tuples.
- It is much better to optimize time to get the first n tuples.
- Little theory available on such optimization
- May be counter-intuitive from the point of view of traditional optimization
- Use of double-pipelined hash join in TUKWILA
- Cost-quality tradeoffs (not all answers are equal.)
"Curtsey while you think. It saves time." (the Queen to Alice)
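A double-pipelined (symmetric) hash join keeps a hash table per input and emits a result as soon as both matching tuples have arrived, which favors time to first tuples. The Python generator below is a sketch of the idea, not TUKWILA's implementation: interleaved arrival is simulated by reading the two inputs alternately, and the function and parameter names are assumptions.

```python
from itertools import zip_longest


def double_pipelined_hash_join(left, right, key_left, key_right):
    """Yield (left_tuple, right_tuple) pairs incrementally."""
    left_table, right_table = {}, {}
    missing = object()  # sentinel for an exhausted input
    for l, r in zip_longest(left, right, fillvalue=missing):
        if l is not missing:
            k = key_left(l)
            left_table.setdefault(k, []).append(l)   # insert into own table
            for other in right_table.get(k, []):     # probe the other table
                yield (l, other)
        if r is not missing:
            k = key_right(r)
            right_table.setdefault(k, []).append(r)
            for other in left_table.get(k, []):
                yield (other, r)
```

Because it is a generator, a consumer that only wants the first n answers can stop early without either input having been read to the end.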
43XML
- Sources may give their output in XML format
- Makes unwrapping easy
- Sources may be based on XML
- Semi-structured non-relational data
- XML query processing languages
- Labeled directed graphs
- Navigational queries, path expressions etc..
44XML
<Publication URL="ftp://db.stanford.edu/pub/papers/xml.ps" Authors="RG JM JW">
  <Title>From Semistructured Data to XML: Migrating the Lore Data Model and Query Language</Title>
  <Published>Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99)</Published>
  <Pages>25-30</Pages>
  <Location>
    <City>Philadelphia</City>
    <State>Pennsylvania</State>
  </Location>
  <Date>
    <Month>June</Month>
    <Year>1999</Year>
  </Date>
</Publication>
<Publication URL="ftp://db.stanford.edu/pub/papers/ozone.ps" Authors="TL SA JW">
  <Title>Ozone: Integrating Structured and Semistructured Data</Title>
  <Published>Technical Report</Published>
  <Institution>Stanford University Database Group</Institution>
  <Date>
    <Month>October</Month>
    <Year>1998</Year>
  </Date>
</Publication>
<Author ID="SA">S. Abiteboul</Author>
<Author ID="RG">R. Goldman</Author>
<Author ID="TL">T. Lahiri</Author>
<Author ID="JM">J. McHugh</Author>
<Author ID="JW">J. Widom</Author>
HTML
<UL>
  <LI> R. Goldman, J. McHugh, and J. Widom. <A href="ftp://db.stanford.edu/pub/papers/xml.ps">From Semistructured Data to XML: Migrating the Lore Data Model and Query Language</A>. Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99), pages 25-30, Philadelphia, Pennsylvania, June 1999.
  <LI> T. Lahiri, S. Abiteboul, and J. Widom. <A href="ftp://db.stanford.edu/pub/papers/ozone.ps">Ozone: Integrating Structured and Semistructured Data</A>. Technical Report, Stanford Database Group, October 1998.
</UL>
45Current directions
- Integrate the minimization and source-call ordering phases
- Model cost-quality tradeoffs
- Handling run-time exceptions
- unavailability of sources etc.
- Tracking time and solution quality statistics
- Improve the granularity of the HTBP table
46The EMERAC Crowd
- Eric Lambrecht
- Senthil Gnanaprakasam
- Zaiqing Nie
- Yourself??
- Sharp
- Background in AI/DB
- Good Java hacking