Efficiently gathering information on the Internet using AI - PowerPoint PPT Presentation



Slides: 47

Provided by: unkn955

Learn more at: https://rakaposhi.eas.asu.edu


Transcript and Presenter's Notes


1
Efficiently gathering information on the Internet using AI and DB techniques: The EMERAC Project
Subbarao Kambhampati, Arizona State University, Tempe, USA
rakaposhi.eas.asu.edu/yochan.html
2
Motivation

A lot of data on the WWW is actually generated by databases
80% of the web is hidden: dynamically generated in response to queries from users
It would be nifty if we could do database-style querying of the web.

3
Information Gathering
[Diagram: user, Gatherer, wrappers, and sources (db, <html>, cgi)]
4
Meta-search engines
-- Text-based
-- No access to hidden web
5
Junglee
Netbot
Comparison shoppers
-- Call every source
-- Collate results
DealPilot.Com
6

CORA allows more sophisticated queries
"All papers that cite Rao, but are not written by Rao"
Neither CORA nor DBLP is complete
CORA tends to be more complete for online papers
and AI papers
DBLP sticks to published papers, and is
more complete in DB coverage
Both sources provide Englishified-BIBTEX
citations

7
Data representation

Global as View (GAV)
The global (mediated) schema is written as a view
on the sources (databases)
Simpler query processing (no reformulation)
Less modular (the schema changes when new sources are added)

Donaji

8
Data Representation --2

Local as View (LAV)
Modular
New sources can be added without changing the
global schema
Needs more sophisticated query processing
User query needs to be reformulated into source
calls
Compiling LAV into GAV

9
Data location

Warehousing vs. Virtual (on-line) sources
Warehousing avoids the problems of the net
Data may get stale.
There may be too much data.
You may not be allowed to shift the data over
Virtual Source method accesses the sources on
demand
Has to handle internet problems such as
bursty traffic, setup delays etc.
Hybrid
Selective materialization, caching, etc.

10
Tricky issues

Sources are not really databases!
Legacy systems
Limited access patterns
(Can't ask a white-pages source for the list of all numbers)
Limited local processing power
Typically only selections (on certain attributes)
are supported
Sources are autonomous
Unregulated data overlap
Lack of full statistics on the sources

11
EMERAC Query Planning System
Build query plan using source inversion [Duschka (with Genesereth and Levy) 97]
Optimization steps
Logical optimizations: redundancy removal
Execution optimizations: source call ordering
Execute query plan
12
Desirable Properties of information gathering
plans

Source-complete: no other plan returns more information using the available sources
Different from the traditional query equivalence requirement
Source-minimal: a plan from which no information source can be removed while still returning the same answer
Access-cost minimal: a plan which reduces the number of separate accesses to individual sources
Bandwidth-minimal: a plan that, when executed, transfers the smallest amount of data over the network yet is still source-complete

13
Ensuring properties of optimal information
gathering plans
Build query plan
Logical Optimizations
Execution Optimizations
Execute query plan
Source completeness
Source-minimality
Access cost and bandwidth minimality
14
Modeling Information Gathering in EMERAC

Information sources
relational
answer select queries (possibly a restricted
set of query patterns)
autonomous
World model
relational
Query on the world model
Reformulate the query as calls on information
sources. Optimize. Execute.

Local as View model
15
Source Access Limitations

Sources can have a variety of access limitations
Form interfaced databases may require certain
attributes to be bound
Whitepages may require the name of the person
To get the numbers of a set of n people, we will
have to access the source n times
Such databases may also be unable to handle bindings of other attributes
A Whitepages database may not take the address of
a person as a bound attribute
To get the number of John Doe, who lives on Lemon
St, we will have to get the numbers of all John
Does, and locally filter the ones not living on
Lemon Street
Wrapped web-pages cannot select over any
attributes

16
Representing Source Access Limitations

Use annotations on the attributes of the source relation
One annotation identifies attributes that must be bound
Another annotation identifies unselectable attributes
S(X,Y,Z), with X annotated as must-be-bound and Y as unselectable:
a form-interfaced web page that requires a binding for X and is able to do selections only on Z
The two annotations help identify feasible binding patterns for sources
S^b-- patterns are feasible; S^f-- patterns are infeasible
S^bbf must be modeled as S^bff, filtered locally with the binding on Y
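A minimal Python sketch of how the two annotations narrow down the feasible binding patterns. The function name and the set-based encoding of the annotations are assumptions made for illustration, not EMERAC's actual representation.

```python
from itertools import product

def feasible_patterns(attrs, must_bind, unselectable):
    """Enumerate binding patterns ('b' = bound, 'f' = free) a source accepts.

    must_bind and unselectable are hypothetical encodings of the two
    annotations: sets of attribute names.
    """
    patterns = []
    for combo in product("bf", repeat=len(attrs)):
        ok = True
        for attr, flag in zip(attrs, combo):
            if attr in must_bind and flag != "b":
                ok = False  # a required binding is missing
            if attr in unselectable and flag != "f":
                ok = False  # the source cannot select on this attribute
        if ok:
            patterns.append("".join(combo))
    return patterns

# S(X,Y,Z): X must be bound, Y cannot be selected on.
print(feasible_patterns(["X", "Y", "Z"], must_bind={"X"}, unselectable={"Y"}))
# ['bfb', 'bff']
```

S^bbf is absent from the result, which is why it has to be issued as S^bff and filtered locally on Y.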

17
Modeling Sources
Sources related to world model by describing them
as views over world model --Source description
restricted to conjunctive queries (SPJ)
movie-hut(X, Y, Z) -> title-time(X, Y), title-actor(X, Z)
house-of-movies(X, Y, Z) -> title-time(X, Y), title-actor(X, Z)
query(X, Y) :- title-time(X, Y)
Required binding..
18
Computing source-complete plans

Invert the source descriptions
Plans for individual world relations
Concatenate the query and the source inversion
rules
A datalog program which when executed will
return all accessible tuples

movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
title-time(X, Y) :- movie-hut(X, Y, Z)
house-of-movies(X, Y) -> title-time(X, Y), title-actor(X, Z)
title-time(X, Y) :- dom(X), movie-hut(X, Y, Z)
19
Building Source Complete Plans
[Duschka and Genesereth 97]
query(X, Y) :- title-time(X, Y)

movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)

house-of-movies(X, Y) -> title-time(X, Y), title-actor(X, Z)
Source Inversion Rules
title-time(X, Y) :- movie-hut(X, Y, Z)
title-actor(X, Z) :- movie-hut(X, Y, Z)
dom(X) :- movie-hut(X, Y, Z)
dom(Y) :- movie-hut(X, Y, Z)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
Binding restrictions lead to recursion in the plan
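The inversion step itself can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions: rules are plain strings, the helper name `invert` and the `skolem` mapping are hypothetical, and the dom predicates needed for binding restrictions are left out.

```python
def invert(view, head_vars, body, skolem):
    """Inverse rules for a source description  view(head_vars) -> body.

    body   : list of (predicate, argument-tuple) atoms over the world model
    skolem : maps each body-only variable to a function symbol (hypothetical
             encoding; one symbol per source and variable)
    """
    head = f"{view}({', '.join(head_vars)})"
    rules = []
    for pred, args in body:
        new_args = []
        for a in args:
            if a in head_vars:
                new_args.append(a)
            else:
                # The source does not export this value: use a function term.
                new_args.append(f"{skolem[a]}({', '.join(head_vars)})")
        rules.append(f"{pred}({', '.join(new_args)}) :- {head}")
    return rules

body = [("title-time", ("X", "Y")), ("title-actor", ("X", "Z"))]
rules = invert("movie-hut", ["X", "Y"], body, {"Z": "f1"})
for r in rules:
    print(r)
# title-time(X, Y) :- movie-hut(X, Y)
# title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
```

Each world-model atom in the description yields one rule, so the number of rules grows linearly with the size of the source descriptions.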
20
Complexity of finding maximally-contained plans
(Certain answers)

Source inversion approach has poly-time
complexity for the case considered in EMERAC
Complexity doesn't depend on the query
Can handle recursive queries just as easily
Complexity does change if the sources are not
conjunctive queries
Sources as unions of conjunctive queries
(NP-hard)
Sources as recursive queries (Undecidable)
Comparison predicates
Complexity also changes based on Open vs. Closed
world assumption

21
Practical Problems with Plans derived from source
inversion rules

Every source that is remotely relevant to the
query is made part of the plan
Many of these sources may be overlapping

title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)

If both movie-hut and house-of-movies have same
information
both sources are not necessary
the recursion is not necessary

title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
22
Optimization challenges in EMERAC
Traditional
Each relation is exported in toto by a single database
All sources are assumed to be fully relational

Information Gathering
Multiple sources export partial and overlapping portions of a relation
Need to minimize plans to remove redundancy
Sources are rarely fully relational
Only limited types of queries allowed
Wrapped web-pages
Form-interfaced databases
Certain forms of join computation may be precluded
Need to model query capabilities

23
Minimizing information gathering plans

Model source overlaps
Use LCW statements
Rewrite the source-complete plan
Greedily remove rules from the plan using uniform equivalence and LCW statements (to make the plan source-minimal)
Uniform containment checks [Sagiv 88]
Use heuristics to guide removal and pull out recursion first

24
LCW Statements
View: movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
To check if one rule, r1, with information source predicates contains another rule, r2, see if r1[s ↦ s ∨ l] contains r2[s ↦ s ∧ v]
movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z), Z = "Allen"
Inter-source subsumption relations (mirror sources) can also be handled
[Etzioni et al 97, Duschka 97]
25
Testing for Uniform Containment
Does
p(X, Y) :- q(X, Y)
q(X, Y) :- r(X, Y)
uniformly contain
p(W, X) :- r(W, X) ?
Assert r(W, X) and try to derive p(W, X) using bottom-up evaluation
-- Exponential complexity...
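The test above can be sketched in Python: freeze the variables of the candidate rule into constants, assert its body as facts, and run naive bottom-up evaluation. The tuple encoding of atoms, the helper names and the `c_` prefix for frozen constants are assumptions of this sketch, and it only handles rules without function terms.

```python
def is_var(term):
    # Convention for this sketch: variables start with an upper-case letter.
    return term[0].isupper()

def match(atom, fact, subst):
    """Extend subst so that atom equals fact, or return None."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    subst = dict(subst)
    for a, f in zip(args, fargs):
        if is_var(a):
            if subst.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return subst

def bottom_up(program, facts):
    """Naive bottom-up evaluation until no new facts appear."""
    facts = set(facts)
    while True:
        new = set()
        for (hpred, hargs), body in program:
            substs = [{}]
            for atom in body:
                extended = []
                for s in substs:
                    for fact in facts:
                        s2 = match(atom, fact, s)
                        if s2 is not None:
                            extended.append(s2)
                substs = extended
            for s in substs:
                new.add((hpred, tuple(s.get(a, a) for a in hargs)))
        if new <= facts:
            return facts
        facts |= new

def uniformly_contains(program, rule):
    (hpred, hargs), body = rule
    def freeze(args):
        return tuple("c_" + a.lower() if is_var(a) else a for a in args)
    frozen_body = {(pred, freeze(args)) for pred, args in body}
    return (hpred, freeze(hargs)) in bottom_up(program, frozen_body)

P = [(("p", ("X", "Y")), [("q", ("X", "Y"))]),
     (("q", ("X", "Y")), [("r", ("X", "Y"))])]
print(uniformly_contains(P, (("p", ("W", "X")), [("r", ("W", "X"))])))  # True
```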
26
Greedily Minimizing Information Gathering Plans

Remove non-recursive IDB predicates
Sort the rules so those with dom predicates come
before those without dom predicates
for each rule r of P that has not yet been considered do
  let P' be the program obtained by deleting rule r from P
  if P'[s ↦ s ∨ l] uniformly contains r[s ↦ s ∧ v] then
    replace P with P'. Prune unreachable rules.

Source costs can be used

Uniform containment check is exponential in the
worst case
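A stripped-down sketch of the greedy loop in Python. It deliberately leaves out the LCW and view substitutions, the dom-first sorting and the pruning of unreachable rules, so it only removes rules that are redundant under plain uniform containment. The encoding and all helper names are assumptions of this sketch.

```python
def is_var(t):
    return t[0].isupper()

def match(atom, fact, s):
    if atom[0] != fact[0] or len(atom[1]) != len(fact[1]):
        return None
    s = dict(s)
    for a, f in zip(atom[1], fact[1]):
        if (s.setdefault(a, f) if is_var(a) else a) != f:
            return None
    return s

def derive(program, facts):
    """Naive bottom-up evaluation to a fixpoint."""
    facts = set(facts)
    while True:
        new = set()
        for head, body in program:
            substs = [{}]
            for atom in body:
                substs = [s2 for s in substs for f in facts
                          for s2 in [match(atom, f, s)] if s2 is not None]
            new |= {(head[0], tuple(s.get(a, a) for a in head[1])) for s in substs}
        if new <= facts:
            return facts
        facts |= new

def uniformly_contains(program, rule):
    head, body = rule
    def freeze(atom):
        return (atom[0], tuple("c_" + a.lower() if is_var(a) else a for a in atom[1]))
    return freeze(head) in derive(program, {freeze(a) for a in body})

def minimize(program):
    """Greedily drop each rule that the rest of the plan uniformly contains."""
    plan = list(program)
    for rule in list(program):
        rest = [r for r in plan if r is not rule]
        if uniformly_contains(rest, rule):
            plan = rest
    return plan

rule1 = (("q", ("X", "Y")), [("a", ("X", "Y"))])
rule2 = (("q", ("X", "Y")), [("a", ("X", "Y")), ("b", ("Y",))])
print(minimize([rule1, rule2]) == [rule1])  # True: rule2 is redundant
```

The result depends on the order in which rules are considered, which is where the sorting heuristics and source costs mentioned above come in.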
27
Minimization example
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
28
LCW vs. Naïve Artificial Sources
29
EMERAC
Build query plan
Logical Optimizations
Execution Optimizations
Execute query plan
Source completeness
Source-minimality
Access cost and bandwidth minimality
30
Optimization challenges in EMERAC
Traditional
Each relation is exported in toto by a single database
All sources are assumed to be fully relational

Information Gathering
Multiple sources export partial and overlapping portions of a relation
Need to minimize plans to remove redundancy
Sources are rarely fully relational
Only limited types of queries allowed
Wrapped web-pages
Form-interfaced databases
Certain forms of join computation may be precluded
Need to model query capabilities

31
Continued
Optimization challenges in EMERAC

Traditional
Tuple-transfer costs are assumed to dominate the query-execution costs
Use of the bound-is-easier assumption
Assume availability of full source statistics
Selectivity indices, histograms etc.

Information Gathering
Access cost: source latencies tend to equal or dominate the transfer cost
Need to consider number of source calls
Need for considering bushy joins (instead of just left-linear join trees)
Full statistics are rarely available about internet sources
Sources are decentralized and autonomous
Difficult to do systematic optimization

32
Issues in ordering source calls

Execution cost is a function of both access cost
and the tuple-transfer cost (ignoring local
processing costs)
Tension between access costs and traffic costs
E.g. execute the join of S1(W,X) and S2(X,Y), where the query binds W
Tuple-transfer cost reduction motivates calling
sources with the least general binding patterns
possible
Bound-is-easier (S1 first, and then feed X
bindings to S2)
Access cost reduction motivates calling sources
with the most general binding patterns possible
Feeding X bindings for S2 will generate many
separate accesses, increasing the access cost
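A small worked example of this tension, using a simple additive cost model in Python. All numbers (result sizes, per-call and per-tuple costs) are hypothetical and chosen only to show how the preferred plan flips.

```python
def plan_cost(accesses, tuples, access_cost, transfer_cost):
    """Execution cost = access cost + tuple-transfer cost (local processing ignored)."""
    return accesses * access_cost + tuples * transfer_cost

# Hypothetical statistics for S1(W,X) joined with S2(X,Y), with W bound.
s1_tuples = 40       # tuples returned by S1 for the given W
x_values = 40        # distinct X bindings among those tuples
s2_total = 10_000    # tuples in S2 when called with no bindings
s2_matching = 80     # S2 tuples that actually join

def compare(access_cost, transfer_cost):
    # Plan A: call S1, then feed each X binding to S2 (one access per binding).
    feed = plan_cost(1 + x_values, s1_tuples + s2_matching, access_cost, transfer_cost)
    # Plan B: call S1 and S2 once each, join locally.
    free = plan_cost(2, s1_tuples + s2_total, access_cost, transfer_cost)
    return feed, free

print(compare(500, 1))  # (20620, 11040): expensive accesses favor plan B
print(compare(50, 1))   # (2170, 10140): cheap accesses favor plan A
```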

33
Our Approach: Assumptions

Exact optimization is not worth it
Lack of full source statistics
NP-hardness of the optimization problem
Join-ordering, which is a special case, is
already NP-Complete
Source access costs dominate tuple-transfer costs
by default
Reasonable given the large setup and latency
costs for internet sources

34
Our Approach: Overview

A greedy approach (along the lines of
bound-is-easier type procedures)
By default, attempts to access each source with
the most general feasible binding pattern
Reasonable given the assumption that access costs
dominate transfer costs
The default is over-ridden if a binding pattern
is known to produce too much traffic
Binding patterns producing high traffic are
stored in a table called HTBP
Implicitly produces bushy join trees

35
The HTBP Table

The HTBP table contains, for every source S, the least general binding patterns of S which are known to produce high traffic
A call to source S with binding pattern B is considered high-traffic producing if HTBP contains S^B' and B is either equal to or more general than B'
E.g. Book(Author,Title,ISBN,Subj,Price,Pages)
HTBP may contain all binding patterns that do not bind at least one of the first four attributes
Book^ffffbb is listed explicitly in HTBP
Book^fffffb, Book^ffffbf and Book^ffffff would be considered to be implicitly in HTBP
Advantage: HTBP should be easy to specify even if full source statistics are not available
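The lookup rule can be sketched in Python as follows. The dictionary layout of the table and the function names are assumptions for illustration; patterns are strings of 'b' (bound) and 'f' (free).

```python
def as_general(b, other):
    """True if pattern b is equal to or more general than pattern other,
    i.e. every attribute bound in b is also bound in other."""
    return len(b) == len(other) and all(
        x == "f" or y == "b" for x, y in zip(b, other))

def is_high_traffic(source, b, htbp):
    """htbp maps a source name to its explicitly listed patterns."""
    return any(as_general(b, listed) for listed in htbp.get(source, ()))

htbp = {"Book": ["ffffbb"]}
for pattern in ["ffffbb", "fffffb", "ffffbf", "ffffff", "bfffff"]:
    print(pattern, is_high_traffic("Book", pattern, htbp))
# ffffbb True
# fffffb True
# ffffbf True
# ffffff True
# bfffff False
```

Only the least general high-traffic patterns need to be listed; everything more general is covered by the check.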

36
The Algorithm
For each stage i from 1 to m do
  For each unchosen subgoal S
    Pick the most general feasible BP B of S w.r.t. V and FBP such that S^B is not in HTBP
    If such a B exists: push S^B into C_i, mark S chosen, add all variables of S to V
    If no such B exists, but there is a feasible binding pattern for S:
      pick the BP B with the most bound variables and push S^B into P_i
  If no subgoal has been chosen at this level (C_i is empty), and there are some postponed sources (P_i is non-empty)
    Choose the S_k^B in P_i with the maximum number of bound variables
    Push S_k^B into C_i
    Add all variables of S_k to V
Return the array C[1..m]

Default case: reduce accesses
HTBP case: reduce transfer costs
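A Python sketch of this greedy ordering. It assumes every binding pattern is feasible (no FBP restrictions) and that bindings produced in one stage only become available in the next stage, so calls within a stage are independent. The function names and data layout are hypothetical.

```python
from itertools import product

def as_general(b, other):
    """True if pattern b is equal to or more general than pattern other."""
    return all(x == "f" or y == "b" for x, y in zip(b, other))

def high_traffic(name, bp, htbp):
    return any(as_general(bp, listed) for listed in htbp.get(name, ()))

def usable_patterns(args, bound):
    """Patterns that bind only arguments whose values are already known."""
    return ["".join(bp) for bp in product("bf", repeat=len(args))
            if all(flag == "f" or arg in bound for arg, flag in zip(args, bp))]

def order_calls(subgoals, bound, htbp):
    bound, pending, stages = set(bound), list(subgoals), []
    while pending:
        chosen, postponed = [], []
        for name, args in pending:
            cands = usable_patterns(args, bound)
            low = [bp for bp in cands if not high_traffic(name, bp, htbp)]
            if low:   # default case: most general pattern, fewest accesses
                chosen.append((name, args, min(low, key=lambda p: p.count("b"))))
            else:     # HTBP case: postpone, keeping the most bound pattern
                postponed.append((name, args, max(cands, key=lambda p: p.count("b"))))
        if not chosen:
            chosen = [max(postponed, key=lambda c: c[2].count("b"))]
        for name, args, _ in chosen:
            pending.remove((name, args))
            bound.update(args)
        stages.append([(name, bp) for name, _, bp in chosen])
    return stages

# DP(Author, Title, Year) and SM98(Title, URL); the query fixes the year.
plan = [("DP", ("A", "T", "Y")), ("SM98", ("T", "U"))]
print(order_calls(plan, {"Y"}, {"DP": ["ffb"]}))
# [[('SM98', 'ff')], [('DP', 'fbf')]]
```

Subgoals placed in the same stage do not depend on each other's bindings, which is how the procedure implicitly yields bushy join trees.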
37
Example

Sources: DP(A:Author, T:Title, Y:Year), SM98(T:Title, U:URL)
Query: Q(A,T,U,1998)
Plan: Q(A,T,U,1998) :- DP(A,T,1998), SM98(T,U)
(XX marks a rejected candidate)

HTBP = {DP^bbb, SM98^bb}
Step 1. V = {Y}. Cand: DP^fff (XX), DP^ffb (XX), SM98^ff (XX). P1 = {DP^ffb, SM98^ff}. C1 = {DP^ffb}
Step 2. V = {A,T,Y}. Cand: SM98^ff (XX), SM98^bf (XX). P2 = {SM98^bf}. C2 = {SM98^bf}

HTBP = {DP^ffb}
Step 1. V = {Y}. Cand: DP^fff (XX), DP^ffb (XX), SM98^ff. C1 = {SM98^ff}
Step 2. V = {Y,U,T}. Cand: DP^fff (XX), DP^ffb (XX), DP^fbf, DP^fbb (XX). C2 = {DP^fbf}

HTBP = {}
Step 1. V = {Y}. Cand: DP^fff, DP^ffb, SM98^ff. C1 = {SM98^ff, DP^fff}
Bound-is-easier
38
Implementation

The Emerac Information Gatherer
written in Java
incorporates rewriting and execution ordering
techniques
executes plans in parallel
returns partial results during plan execution
object oriented design makes it easy to modify

39
EMERAC's Contributions

An approach for minimizing recursive information
gathering plans
An approach for ordering source calls in
information gathering plans
Attempts at minimizing both access cost and
tuple-transfer cost
(Partial) implementation and evaluation

What next??
40
More capable sources

EMERAC assumes sources can only do selection
processing. Real sources tend to provide more
capabilities
Many sources can do union queries on attributes
E.g. CNN Stock quote tracker allows up to 8 symbols at a time
Some support constraints
"Give me all flights with prices less than 300"
Theoretically, such sources can be modeled as
supplying a (possibly infinite) number of views.
Query optimization is harder when the
capabilities are neither full nor highly limited..

41
More realistic overlap statistics

LCWs may not be available (or may not be
advertised)
Statistics on coverage and overlap may be
available
Source A and Source B have 70% overlap on tuples
How to use them?
Computing unions given partial information about
intersections..
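One way such statistics could be used is sketched below in Python: inclusion-exclusion for two sources, and simple bounds when only pairwise overlaps are known for more sources. The source sizes are hypothetical, and reading "70% overlap" as a fraction of the smaller source is an assumption.

```python
def union_size(size_a, size_b, overlap):
    """|A ∪ B| by inclusion-exclusion, given |A ∩ B|."""
    return size_a + size_b - overlap

def union_bounds(sizes, pairwise_overlaps):
    """Bounds on the union size when only pairwise intersections are known.

    Lower bound: the larger of the biggest source and the Bonferroni bound
    (sum of sizes minus sum of pairwise overlaps). Upper bound: sum of sizes.
    """
    lower = max(max(sizes), sum(sizes) - sum(pairwise_overlaps))
    return lower, sum(sizes)

size_a, size_b = 1000, 400          # hypothetical source sizes
overlap = round(0.70 * size_b)      # 70% of B's tuples also appear in A
total = union_size(size_a, size_b, overlap)
print(total, total - size_a)        # 1120 120: calling B adds only 120 new tuples

print(union_bounds([1000, 400, 300], [280, 100, 50]))  # (1270, 1700)
```

The small marginal gain from B is the kind of figure a planner could weigh against B's access cost.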

42
Optimizing for First n-tuples

Traditional techniques optimize time to get all
tuples.
It is much better to optimize the time to get the first n tuples.
Little theory available on such optimization
May be counter-intuitive from the point of view
of traditional optimization
Use of double-pipe-lined hash join in TUKWILA
Cost-quality tradeoffs (not all answers are
equal..)

"Curtsey while you're thinking what to say. It saves time."
(Queen to Alice)
43
XML .

Sources may give their output in XML format
Makes unwrapping easy
Sources may be based on XML
Semi-structured non-relational data
XML query processing languages
Labeled directed graphs
Navigational queries, path expressions etc..

44
XML
<Publication URL="ftp://db.stanford.edu/pub/papers/xml.ps" Authors="RG JM JW">
  <Title>From Semistructured Data to XML: Migrating the Lore Data Model and Query Language</Title>
  <Published>Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99)</Published>
  <Pages>25-30</Pages>
  <Location> <City>Philadelphia</City> <State>Pennsylvania</State> </Location>
  <Date> <Month>June</Month> <Year>1999</Year> </Date>
</Publication>
<Publication URL="ftp://db.stanford.edu/pub/papers/ozone.ps" Authors="TL SA JW">
  <Title>Ozone: Integrating Structured and Semistructured Data</Title>
  <Published>Technical Report</Published>
  <Institution>Stanford University Database Group</Institution>
  <Date> <Month>October</Month> <Year>1998</Year> </Date>
</Publication>
<Author ID="SA">S. Abiteboul</Author>
<Author ID="RG">R. Goldman</Author>
<Author ID="TL">T. Lahiri</Author>
<Author ID="JM">J. McHugh</Author>
<Author ID="JW">J. Widom</Author>

HTML
<UL>
<LI> R. Goldman, J. McHugh, and J. Widom. <A href="ftp://db.stanford.edu/pub/papers/xml.ps">From Semistructured Data to XML: Migrating the Lore Data Model and Query Language</A>. Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99), pages 25-30, Philadelphia, Pennsylvania, June 1999.
<LI> T. Lahiri, S. Abiteboul, and J. Widom. <A href="ftp://db.stanford.edu/pub/papers/ozone.ps">Ozone: Integrating Structured and Semistructured Data</A>. Technical Report, Stanford Database Group, October 1998.
</UL>
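To illustrate why XML output makes unwrapping easy, here is a short Python sketch that turns a trimmed version of the XML above into relational tuples with the standard library parser. The enclosing Bib root element and the (title, author, year) tuple layout are assumptions added for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical <Bib> root added so the fragment parses as one document.
doc = """<Bib>
<Publication URL="ftp://db.stanford.edu/pub/papers/xml.ps" Authors="RG JM JW">
  <Title>From Semistructured Data to XML: Migrating the Lore Data Model and Query Language</Title>
  <Date><Month>June</Month><Year>1999</Year></Date>
</Publication>
<Author ID="RG">R. Goldman</Author>
<Author ID="JM">J. McHugh</Author>
<Author ID="JW">J. Widom</Author>
</Bib>"""

root = ET.fromstring(doc)

# Author IDs resolve to names through the ID attribute.
names = {a.get("ID"): a.text for a in root.iter("Author")}

# One (title, author, year) tuple per publication and author.
rows = []
for pub in root.iter("Publication"):
    title = pub.findtext("Title")
    year = int(pub.findtext("Date/Year"))
    for author_id in pub.get("Authors").split():
        rows.append((title, names[author_id], year))

for row in rows:
    print(row[1], row[2])
# R. Goldman 1999
# J. McHugh 1999
# J. Widom 1999
```

The same tuples would have to be recovered from the HTML version by pattern matching on punctuation and link text, which is the job a hand-written wrapper does.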
45
Current directions