Title: Information Integration
1. Information/Data Integration
- Combining information from multiple autonomous sources
2. The end-game: 3 options
- Have an in-class final exam
- 5/8 2:30pm is the designated time
- Have a take-home exam
- Make the final homework into a take-home
- ...and have a mandatory discussion class on 5/8 2:30pm
Also, note the change in demo schedule.
3. Today's Agenda
- Discuss SemTag/Seeker
- Start the lecture on Information Integration
4. Information Integration
- Combining information from multiple autonomous information sources
- And answering queries using the combined information
- Many variations depending on:
- The type of information sources (text? data? combination?)
- Data vs. information integration
- Horizontal vs. vertical integration
- The level of eagerness of the integration
- Ad hoc vs. pre-mediated integration
- Pre-mediation itself can be warehouse vs. online approaches
- Generality of the solution
- Mashup vs. mediated
5. Information Integration as making the database repository of the web: The IR View
6. The DB View: Integration as Mediation over Autonomous Databases
7. Who is dying to have it? (Applications)
Skeptic's corner
- WWW
- Comparison shopping
- Portals integrating data from multiple sources
- B2B, electronic marketplaces
- Science and culture
- Medical genetics: integrating genomic data
- Astrophysics: monitoring events in the sky
- Culture: uniform access to all cultural databases produced by countries in Europe, provinces in Canada
- Enterprise data integration
- An average company has 49 different databases and spends 35% of its IT dollars on integration efforts
8. Isn't the web mostly text?
Skeptic's corner
- The invisible web is mostly structured
- Most web servers have back-end database servers
- They dynamically convert ("wrap") the structured data into readable English
- <India, New Delhi> => "The capital of India is New Delhi."
- So, if we can unwrap the text, we have structured data!
- (Un)wrappers, learning wrappers, etc.
- Note also that such dynamic pages cannot be crawled...
- The (coming) semi-structured web
- Most pages are at least semi-structured
- The XML standard is expected to ease the presentation/on-the-wire transfer of such pages. (BUT..)
- The services
- Travel services, mapping services
- The sensors
- Stock quotes, current temperatures, ticket prices
9. Is it like Expedia/Travelocity/Orbitz?
Skeptic's corner
- Surprisingly, NO!
- The online travel sites don't quite need to do data integration; they just use SABRE
- SABRE was started in the 60s as a joint project between American Airlines and IBM
- It is the de facto database for most of the airline industry (who voluntarily enter their data into it)
- There are very few airlines that buck the SABRE trend; Southwest Airlines is one (which is why many online sites don't bother with Southwest)
- So, online travel sites really are talking to a single database (not multiple data sources)
- To be sure, online travel sources do have to solve a hard problem: finding an optimal fare (even at a given time) is basically computationally intractable (not to mention the issues brought in by fluctuating fares). So don't be so hard on yourself
- Check out http://www.maa.org/devlin/devlin_09_02.html
10. Why isn't this just search engines?
Skeptic's corner
- Search engines do text-based retrieval of URLs
- Works reasonably well for single-document texts, or for finding sites based on single-document text
- Cannot integrate information from multiple documents
- Cannot do effective query relaxation or generalization
- Cannot link documents and databases
- The aim of information integration is to support query processing over structured and semi-structured sources as well as services.
11. Are we talking comparison-shopping agents?
- Certainly closer to the aims of these
- But:
- Wider focus
- Consider a larger range of databases
- Consider services
- Implies more challenges
- Warehousing may not work
- Manual source characterization/integration won't scale up
12. (No transcript)
13. 4/26
- Information Integration 2
- Focus on Data Integration
14. Information Integration (taxonomy diagram)
- Text Integration: Collection Selection, Soft-Joins
- Data Integration: Data Aggregation (vertical integration), Data Linking (horizontal integration)
15. Different Integration Scenarios
- Collection Selection
- All sources export text documents
- E.g. meta-crawler etc.
- Challenges: similarity definition; relevance handling
- Data Aggregation (Vertical)
- All sources export (parts of) a single relation
- No need for joins etc.
- Could be warehouse or virtual
- E.g. BibFinder, Junglee, Employeds etc.
- Challenges: schema mapping; data overlap
- Data Linking (Horizontal)
- Joins over multiple relations stored in multiple DBs
- E.g. soft joins in WHIRL
- Ted Kennedy episode
- Challenges: record linkage over text fields (object mapping); query reformulation
- All together (vertical + horizontal)
- Many interesting research issues
- ...but few actual fielded systems
16. Collection Selection
17. Collection Selection / Meta-Search: Introduction
- Metasearch Engine
- A system that provides unified access to multiple existing search engines.
- Metasearch Engine Components
- Database Selector
- Identifies potentially useful databases for each user query
- Document Selector
- Identifies potentially useful documents returned from selected databases
- Query Dispatcher and Result Merger
- Ranks the selected documents
18. Collection Selection
19. Evaluating Collection Selection
- Let c1..cj be the collections that are chosen to be accessed for the query Q. Let d1..dk be the top documents returned from these collections.
- We compare these results to the results that would have been returned from a central union database.
- Ground truth: the ranking of documents that the retrieval technique (say vector space or Jaccard similarity) would have retrieved from a central union database that is the union of all the collections.
- Compare the precision of the documents returned by accessing only the selected collections against this ground truth.
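A minimal sketch of this evaluation, assuming hashable document ids and an abstract `rank(query, docs)` scorer standing in for the chosen retrieval technique (all names here are hypothetical):

```python
# Illustrative sketch: precision of collection selection against the
# central-union ground truth. `rank` is an assumed scorer (vector space,
# Jaccard, ...) returning documents in decreasing relevance order.
def precision_vs_union(query, collections, selected, rank, k=10):
    """Fraction of the central-union top-k recovered by querying only
    the selected collections."""
    union_db = [d for c in collections for d in c]
    ground_truth = set(rank(query, union_db)[:k])      # top-k from the central union DB
    answered = set(rank(query, [d for c in selected for d in c])[:k])
    return len(answered & ground_truth) / k
```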
20. Existing Approaches in Database Selection
- Rough representative approaches
- Use a few keywords or sentences to describe the contents of the database
- Often manually generated
- Statistical representative approaches
- Contain statistical information for each term in the databases (such as the document frequency and average weight of the term)
- Scalability issues due to the large amount of statistics for each DB
- Learning-based approaches
- Use past retrieval experience to determine the usefulness of a database for future queries
- Training queries / real user queries
- The obtained experience against a database is saved as the representative of the database
21. General Scheme & Challenges
- Get a representative of each database
- The representative is a sample of files from the database
- Challenge: get an unbiased sample when you can only access the database through queries
- Compare the query to the representatives to judge the relevance of a database
- Coarse approach: convert the representative files into a single file (super-document). Take the (vector) similarity between the query and the super-document of a database to judge that database's relevance (see the sketch after this list)
- Finer approach: keep the representative as a mini-database. Union the mini-databases to get a central mini-database. Apply the query to the central mini-database and find the top-k answers from it. Decide the relevance of each database based on which of the answers came from which database's representative
- You can use an estimate of the size of the database too
- What about overlap between collections? (See the ROSCO paper)
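A minimal sketch of the coarse approach, using plain bag-of-words cosine similarity (no tf-idf weighting); `representatives` is an assumed name mapping each database to its list of sampled documents:

```python
# Minimal sketch of the "coarse" approach: one super-document per database,
# ranked by cosine similarity to the query.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_databases(query, representatives):
    qv = Counter(query.lower().split())
    scores = {db: cosine(qv, Counter(" ".join(docs).lower().split()))  # super-document
              for db, docs in representatives.items()}
    return sorted(scores, key=scores.get, reverse=True)   # most relevant DB first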
22. Uniform Probing for Content Summary Construction
- Automatic extraction of document frequency statistics from uncooperative databases [Callan and Connell, TOIS 2001; Callan et al., SIGMOD 1999]
- Main ideas (sketched in code below):
- Pick a word and send it as a query to database D
- RandomSampling-OtherResource (RS-Ord): from a dictionary
- RandomSampling-LearnedResource (RS-Lrd): from retrieved documents
- Retrieve the top-k documents returned
- If the number of retrieved documents exceeds a threshold T, stop; otherwise restart at the beginning
- k=4, T=300
- Compute the sample document frequency for each word that appeared in a retrieved document
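A hedged sketch of this sampling loop; `search(word, k)` is an assumed interface to the remote database returning (doc_id, text) pairs:

```python
# Query-based sampling sketch: probe with one-word queries, keep the top-k
# documents per probe, stop after T distinct documents, then compute sample
# document frequencies. Seeding the probe pool from a dictionary is RS-Ord;
# growing it from retrieved documents is RS-Lrd.
import random
from collections import Counter

def sample_content_summary(search, dictionary, k=4, T=300):
    sampled = {}                                  # doc_id -> text
    probe_pool = list(dictionary)                 # RS-Ord seed words
    while len(sampled) < T and probe_pool:
        word = probe_pool.pop(random.randrange(len(probe_pool)))
        for doc_id, text in search(word, k):      # top-k documents for this probe
            if doc_id not in sampled:
                sampled[doc_id] = text
                probe_pool.extend(set(text.split()))   # RS-Lrd: new probe words
    df = Counter()
    for text in sampled.values():
        df.update(set(text.split()))              # sample document frequency
    return df
```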
23. CORI Net Approach (Representative as a Super-Document)
- Representative statistics:
- The document frequency for each term and each database
- The database frequency for each term
- Main ideas:
- Visualize the representative of a database as a super-document, and the set of all representatives as a database of super-documents
- Document frequency becomes term frequency in the super-document, and database frequency becomes document frequency in the super-database
- Ranking scores can be computed using a similarity function such as the Cosine function
24. ReDDE Approach (Representative as a Mini-Collection)
- Use the representatives as mini-collections
- Construct a union representative that is the union of the mini-collections (such that each document keeps information on which collection it was sampled from)
- Send the query first to the union collection; get the top-k ranked results
- See which of the results in the top-k came from which mini-collection. The collections are ranked in terms of how much their mini-collections contributed to the top-k answers of the query.
- Scale the number of returned results by the expected size of the actual collection (see the sketch below)
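A rough sketch of this idea; `rank` is an assumed scorer over (text, collection) pairs and `collection_sizes` holds (estimated) true collection sizes:

```python
# ReDDE sketch: pool the mini-collections (each sampled document tagged with
# its source collection), take the query's top-k over the pool, and credit
# each collection in proportion to its documents there, scaled up by
# (estimated collection size / sample size).
from collections import defaultdict

def redde_rank(query, samples, collection_sizes, rank, k=20):
    pooled = [(doc, c) for c, docs in samples.items() for doc in docs]
    top = rank(query, pooled)[:k]                 # top-k over the union of samples
    score = defaultdict(float)
    for doc, c in top:
        score[c] += collection_sizes[c] / len(samples[c])   # scale sample -> collection
    return sorted(score, key=score.get, reverse=True)
```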
25. Data Integration
26. Models for Integration
- Modified from Alon Halevy's slides
27. Solutions for Small-Scale Integration
- Mostly ad hoc programming: create a special solution for every case; pay consultants a lot of money.
- Data warehousing: load all the data periodically into a warehouse.
- 6-18 months lead time
- Separates operational DBMS from decision-support DBMS (not only a solution to data integration)
- Performance is good; data may not be fresh
- Need to clean, scrub your data
Junglee did this, for employment classifieds.
28. Data Warehouse Architecture
(Diagram, bottom to top:) data sources -> data extraction programs -> data cleaning/scrubbing -> relational database (warehouse) -> OLAP / decision support / data cubes / data mining <- user queries
Junglee did this, for employment classifieds.
29. The Virtual Integration Architecture
- Leave the data in the sources.
- When a query comes in:
- Determine the relevant sources to the query
- Break down the query into sub-queries for the sources
- Get the answers from the sources, and combine them appropriately
- Data is fresh. Approach is scalable.
- Issues:
- Relating sources & mediator
- Reformulating the query
- Efficient planning & execution
Systems: Garlic (IBM), Hermes (UMD), Tsimmis/InfoMaster (Stanford), DISCO (INRIA), Information Manifold (AT&T), SIMS/Ariadne (USC), Emerac/Havasu (ASU)
30. Virtual Integration Architecture
Sources can be relational, hierarchical (IMS), structured files, web sites.
31. Research Projects
- Garlic (IBM)
- Information Manifold (AT&T)
- Tsimmis, InfoMaster (Stanford)
- The Internet Softbot/Razor/Tukwila (UW)
- Hermes (Maryland)
- DISCO (INRIA, France)
- SIMS/Ariadne (USC/ISI)
- Emerac/Havasu (ASU)
- BibFinder (ASU)
32. Desiderata for Relating Source-Mediator Schemas
- Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.
- Easy addition: make it easy to add new data sources.
- Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.
- Nonlossy: be able to handle all queries that can be answered by directly accessing the sources.
33. 11/9
- Google Mashups: what type of integration are they?
- Mturk.com: collaborative computing, the capitalist way
34. Why isn't this just distributed databases?
Skeptic's corner
(Databases vs. Distributed Databases vs. ...)
- No common schema
- Sources with heterogeneous schemas (and ontologies)
- Semi-structured sources
- Legacy sources
- Not relationally complete
- Variety of access/process limitations
- Autonomous sources
- No central administration
- Uncontrolled source content overlap
- Unpredictable run-time behavior
- Makes query execution hard
- Predominantly read-only
- Could be a blessing: less worry about transaction management
- (although the push now is to also support transactions on the web)
35. Reformulation Problem
- Given:
- A query Q posed over the mediated schema
- Descriptions of the data sources
- Find:
- A query Q' over the data source relations, such that
- Q' provides only correct answers to Q, and
- Q' provides all possible answers to Q given the sources.
36. Approaches for Relating Source & Mediator Schemas
(The differences are minor for data aggregation.)
- Global-as-view (GAV): express the mediated-schema relations as a set of views over the data source relations
- Local-as-view (LAV): express the source relations as views over the mediated schema
- Can be combined?
Let's compare them in a movie-database integration scenario...
37. Example Scenario
- A mediator for movie databases
- Want to provide information about movies and where they are playing
38. Global-as-View
- Mediated schema:
- Movie(title, dir, year, genre)
- Schedule(cinema, title, time)
- Create View Movie AS
-   select * from S1                              [S1(title, dir, year, genre)]
-   union
-   select * from S2                              [S2(title, dir, year, genre)]
-   union
-   select S3.title, S3.dir, S4.year, S4.genre
-   from S3, S4                                   [S3(title, dir), S4(title, year, genre)]
-   where S3.title = S4.title
Express mediator-schema relations as views over source relations.
39. Global-as-View (contd.)
(Same view definitions as the previous slide.)
Mediator-schema relations are virtual views on source relations.
40. Global-as-View: Example 2
- Mediated schema:
- Movie(title, dir, year, genre)
- Schedule(cinema, title, time)
- Create View Movie AS
-   select title, dir, year, NULL from S1          [S1(title, dir, year)]
-   union
-   select title, dir, NULL, genre from S2         [S2(title, dir, genre)]
Express mediator-schema relations as views over source relations.
Null values pad the attributes a source does not export.
41. Global-as-View: Example 2 (contd.)
- Mediated schema:
- Movie(title, dir, year, genre)
- Schedule(cinema, title, time)
- Source S4: S4(cinema, genre)
Express mediator-schema relations as views over source relations:
  Create View Movie AS select NULL, NULL, NULL, genre from S4
  Create View Schedule AS select cinema, NULL, NULL from S4
But what if we want to find which cinemas are playing comedies?
Lossy mediation!
42. Local-as-View: Example 1
- Mediated schema:
- Movie(title, dir, year, genre)
- Schedule(cinema, title, time)
Express source-schema relations as views over mediator relations.
43. Local-as-View: Example 1 (contd.)
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time)
Express source-schema relations as views over mediator relations:
- S1(title, dir, year, genre)
- S3(title, dir)
- S5(title, dir, year), year > 1960
- S4(cinema, genre)
44. Local-as-View: Example 2
- Mediated schema:
- Movie(title, dir, year, genre)
- Schedule(cinema, title, time)
- Source S4: S4(cinema, genre)
Express source-schema relations as views over mediator relations:
  Create Source S4 AS select cinema, genre from Movie m, Schedule s where m.title = s.title
Now if we want to find which cinemas are playing comedies, there is hope!
45. GAV vs. LAV
- Mediated schema:
- Movie(title, dir, year, genre)
- Schedule(cinema, title, time)
- Source S4: S4(cinema, genre)
(GAV loses the Movie-Schedule connection here: lossy mediation. LAV keeps it.)
46. GAV vs. LAV
GAV:
- Not modular
- Addition of new sources changes the mediated schema
- Can be awkward to write the mediated schema without loss of information
- Query reformulation is easy
- Reduces to view unfolding (polynomial)
- Can build hierarchies of mediated schemas
- Best when: few, stable data sources, well known to the mediator (e.g. corporate integration)
- Garlic, TSIMMIS, HERMES
LAV:
- Modular: adding new sources is easy
- Very flexible: the power of the entire query language is available to describe sources
- Reformulation is hard
- Involves answering queries using only views (can be intractable; see below)
- Best when: many, relatively unknown data sources; possibility of addition/deletion of sources
- Information Manifold, InfoMaster, Emerac, Havasu
47. Reformulation in LAV
- Given a set of views V1,...,Vn and a query Q, can we answer Q using only the answers to V1,...,Vn?
- Note the connection to materialized (pre-computed) views: ONLY the materialized views should be used
- Approaches:
- Bucket algorithm [Levy, 96]
- Inverse rules algorithm [Duschka, 99]
- Hybrid versions:
- SV-Bucket [2001], MiniCon [2001]
48. Reformulation in LAV: The Issues
Query: find all the years in which Zhang Yimou released movies.
  select year from Movie M where M.dir = 'yimou'
  Q(Y) :- Movie(T,D,Y,G), D = 'yimou'     (not executable: Movie is a virtual mediated relation)
Which is the better plan? What are we looking for?
- Equivalence? Containment? Maximal containment? Smallest plan?
Candidate plans:
  (1) Q(Y) :- S1(T,D,Y,G), D = 'yimou'
  (2) Q(Y) :- S1(T,D,Y,G), D = 'yimou'
      Q(Y) :- S5(T,D,Y), D = 'yimou'
49. Maximal Containment
- A query plan should be sound and complete
- Soundness: the plan should be contained in the query (i.e., the tuples satisfying the plan are a subset of the tuples satisfying the query)
- Completeness?
- Traditional DBs aim for equivalence:
- The query contains the plan AND the plan contains the query
- Impossible here, given only the sources we have
- We want all the query tuples that can be extracted given the sources we have
- Maximal containment: no other sound query plan strictly contains this plan
P contains Q if P ⊇ Q (checking this is exponential even for conjunctive (SPJ) query plans)
50. The World of Containment
- Consider Q1(..) :- B1(..) and Q2(..) :- B2(..)
- Q1 ⊆ Q2 (Q1 is contained in Q2) if the answer to Q1 is a subset of the answer to Q2
- Basically holds if B1(x) implies B2(x)
- Given a query Q and a query plan Q1:
- Q1 is a sound query plan if Q1 is contained in Q
- Q1 is a complete query plan if Q is contained in Q1
- Q1 is a maximally contained query plan if there is no sound query plan Q2 for Q such that Q1 is strictly contained in Q2
51. Computing Containment Checks
- Consider Q1(..) :- B1(..) and Q2(..) :- B2(..)
- Q1 ⊆ Q2 (contained in) if the answer to Q1 is a subset of the answer to Q2
- Basically holds if B1(x) implies B2(x)
- (but entailment is undecidable in general; boo hoo)
- Containment can be checked through containment mappings if the queries are conjunctive (select/project/join queries, without constraints). ONLY EXPONENTIAL!!! Aren't you relieved?
- m is a containment mapping from Vars(Q2) to Vars(Q1) if:
- m maps every subgoal in the body of Q2 to a subgoal in the body of Q1
- m maps the head of Q2 to the head of Q1
- E.g. Q1(x,y) :- R(x), S(y), T(x,y) and Q2(u,v) :- R(u), S(v)
- Containment mapping: u -> x, v -> y; hence Q1 ⊆ Q2 (see the sketch below)
52. Reformulation Algorithms
Example: Q(..) :- V1(..), V2(..), where V1 can be answered by sources S11, S12, S00 and V2 by S21, S22, S00.
- Bucket algorithm [Levy]
- One bucket per subgoal: {S11, S12, S00} for V1; {S21, S22, S00} for V2
- Cartesian product of the buckets
- Followed by a containment check
- Inverse rules [Duschka]
- Plan fragments for the mediator relations:
- V1() :- S11(); V1() :- S12(); V1() :- S00(); V2() :- S21(); V2() :- S22(); V2() :- S00()
(P1 contains P2 if P2 ⊆ P1)
53. Inverse Rules with Skolemization
Query: Q(C,G) :- Movie(T,D,Y,G), Schedule(C,T,Ti)
Inverse rules for the sources (a ?-variable denotes a Skolem term over the source tuple's attributes, e.g. f_S5(T,D,Y); the exact Skolem-function names are illustrative):
- Movie(T,D,Y,G) :- S1(T,D,Y,G)
- Movie(T,D,?y,?g) :- S3(T,D)
- Movie(T,D,Y,?g) :- S5(T,D,Y), Y > 1960
- Movie(?t,?d,?y,G) :- S4(C,G)
- Schedule(C,?t,?ti) :- S4(C,G)
The last two rules share the same Skolem term ?t = f_S4(G,C), so tuples from S4 can still join Movie with Schedule.
54. Take-Home vs. In-Class (the "Tastes Great"/"Less Filling" debate of CSE 494)
Take-home:
- More time consuming
- To set, to take and to grade
- Caveat: people tend to over-estimate the time taken to do the take-home, since they don't factor out the preparation time that is interleaved with the exam time
- ...but may be more realistic
- Probably will be given on Th 6th and will be due by 11th evening (+/-)
In-class:
- Less time consuming
- To take (sort of like removing a bandage...)
- ...and definitely to grade
- Scheduled to be on Tuesday, May 11th, 2:40-4:30
55. Interactive Review on 5/4
- A large part (probably most?) of the class on 5/4 will be devoted to an interactive semester review
- Everyone should come prepared with at least 3 topics/ideas that they got out of the class
- You will be called on in random order, and you should have enough things to say to not repeat what others before you said
- What you say will then be immortalized on the class homepage
- See the old "acquired wisdom" page on the class page
56. Bucket Algorithm: Populating Buckets
- For each subgoal in the query, place relevant views in the subgoal's bucket
- Inputs:
- Q(x) :- r1(x,y), r2(y,x)
- V1(a) :- r1(a,b)
- V2(d) :- r2(c,d)
- V3(f) :- r1(f,g), r2(g,f)
Buckets: r1(x,y) -> {V1, V3};  r2(y,x) -> {V2, V3}
57. Combining Buckets
- For every combination in the Cartesian product of the buckets, check containment in the query Q(x) :- r1(x,y), r2(y,x)
- Candidate rewritings:
- Q1(x) :- V1(x), V2(x)   ✗ (fails the containment check)
- Q2(x) :- V1(x), V3(x)   ✗
- Q3(x) :- V3(x), V2(x)   ✗
- Q4(x) :- V3(x), V3(x)   ✓ (the surviving, minimal rewriting)
The bucket algorithm will check all possible combinations (a sketch of the bucket phase follows below).
Buckets: r1(x,y) -> {V1, V3};  r2(y,x) -> {V2, V3}
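An illustrative sketch of the first phase on this very example; the relevance test here is crude (predicate match only; the real algorithm also checks that head variables line up), and each candidate combination still needs the containment check:

```python
# Bucket algorithm sketch: one bucket of candidate views per query subgoal,
# then the Cartesian product of the buckets as candidate rewritings.
from itertools import product

def populate_buckets(query_body, views):
    """query_body: list of (pred, args); views: name -> (head, body)."""
    buckets = {}
    for pred, args in query_body:
        buckets[(pred, args)] = [name for name, (_, body) in views.items()
                                 if any(p == pred for p, _ in body)]
    return buckets

def candidate_rewritings(buckets):
    # one view per bucket; each combination must then pass a containment check
    return list(product(*buckets.values()))

views = {
    "V1": (("a",), [("r1", ("a", "b"))]),
    "V2": (("d",), [("r2", ("c", "d"))]),
    "V3": (("f",), [("r1", ("f", "g")), ("r2", ("g", "f"))]),
}
query = [("r1", ("x", "y")), ("r2", ("y", "x"))]   # Q(x) :- r1(x,y), r2(y,x)
buckets = populate_buckets(query, views)
print(buckets)                        # r1(x,y): [V1, V3]; r2(y,x): [V2, V3]
print(candidate_rewritings(buckets))  # (V1,V2), (V1,V3), (V3,V2), (V3,V3)
# (V1,V2) fails containment; (V1,V3) and (V3,V2) are contained but redundant;
# (V3,V3) is the minimal rewriting the slide keeps.
```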
58. Complexity of Finding Maximally Contained Plans in LAV
- Complexity does change if the sources are not conjunctive queries:
- Sources as unions of conjunctive queries (NP-hard)
- Disjunctive descriptions
- Sources as recursive queries (undecidable)
- Comparison predicates
- Complexity is less dependent on the query:
- Recursion is okay, but inequality constraints lead to NP-hardness
- Complexity also changes based on the open- vs. closed-world assumption [Abiteboul & Duschka, 98]
"Big Two" example: the advertised description is Toyota ∪ Olds, while the true source contents are all cars.
You can reduce the complexity by taking a conjunctive query that is an upper bound, but this just pushes the complexity to the minimization phase.
59. Is XML Standardization a Magical Solution for Integration?
- If all web sources standardize on the XML format:
- Source access (wrapper generation) issues become easier to manage
- BUT all other problems remain:
- Still need to relate source (XML) schemas to the mediator (XML) schema
- Still need to reason about source overlap, source access limitations, etc.
- Still need to manage execution in the presence of source/network uncertainties
60. Query Optimization Challenges
- Deciding what to optimize
- Getting the statistics on sources
- Doing the optimization
We will first talk about reformulation-level challenges.
61. Practical Issues Complicating Reformulation
- Sources may have access limitations
- Access restrictions can lead to recursive rewritings, even when the queries are non-recursive!
- Sources may have overlap
- Non-minimal rewritings may result when overlap information is ignored
62. Source Limitations
- Sources are not really fully relational databases
- Legacy systems
- Limited access patterns
- (Can't ask a white-pages source for the list of all numbers)
- Limited local processing power
- Typically only selections (on certain attributes) are supported
- Access limitations are modeled in terms of the allowed ("feasible") binding patterns with which the source can be accessed
- E.g. S(X,Y,Z) with feasible patterns (f,f,b) or (b,b,f), where b = bound, f = free (see the sketch below)
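A hedged sketch of checking a call against such binding patterns; the encoding and function name are illustrative:

```python
# Access limitations as feasible binding patterns: in a pattern, 'b' means
# the caller must supply that attribute's value and 'f' means it is free.
# A call is feasible if some pattern has all of its 'b' positions bound.
def feasible(call_bound, patterns):
    """call_bound[i] == True iff attribute i is bound in the call."""
    return any(all(bound for p, bound in zip(pat, call_bound) if p == 'b')
               for pat in patterns)

patterns = ["ffb", "bbf"]                          # S(X,Y,Z) from this slide
print(feasible((False, False, True), patterns))    # True: Z bound matches "ffb"
print(feasible((False, False, False), patterns))   # False: a full scan is not allowed
```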
63. Access Restrictions & Recursive Reformulations
- Create Source S1 as select * from Cites given paper1
- Create Source S2 as select paper from ASU-Papers
- Create Source S3 as select paper from AwardPapers given paper
- Query: select * from AwardPapers
Source and inverse rules (superscripts mark binding patterns):
  S1^bf(p1,p2) :- Cites(p1,p2)
  S2(p) :- Asp(p)
  S3^b(p) :- Awp(p)
  Q(p) :- Awp(p);  Awp(p) :- S3^b(p);  Asp(p) :- S2(p);  Cites(p1,p2) :- S1^bf(p1,p2)
The binding restrictions force recursive domain rules that generate feasible bindings:
  Dom(p) :- S2(p)
  Dom(p) :- Dom(p1), S1^bf(p1,p)
[Kwok & Weld, 96; Duschka & Levy, 97]
64. Managing Source Overlap
- Often, sources on the Internet have overlapping contents
- The overlap is not centrally managed (unlike data replication in DDBMS)
- Reasoning about overlap is important for plan optimality
- We cannot possibly call all potentially relevant sources!
- Qn: How do we characterize, get and exploit source overlap?
- Qualitative approaches (LCW statements)
- Quantitative approaches (coverage/overlap statistics)
65. Local Completeness Information
- If sources are incomplete, we need to look at each one of them.
- Often, sources are locally complete.
- Movie(title, director, year): complete for years after 1960, or for American directors.
- Question: given a set of local completeness statements, is a query Q' a complete answer to Q?
Problems:
1. Sources may not be interested in giving these! (Need to learn them; hard to learn!)
2. Even if sources are willing to give them, there may not be any big enough LCWs. Saying "I definitely have the car with vehicle ID XXX" is useless.
(Diagram: advertised description vs. true source contents; guarantees = LCW / inter-source comparisons.)
66. Quantitative Ways of Modeling Inter-Source Overlap
- Coverage & overlap statistics [Koller et al., 97]
- S1 has 80% of the movies made after 1960, while S2 has 60% of the movies
- S1 has 98% of the movies stored in S2
- Computing cardinalities of unions given intersections (see the sketch below)
Who gives these statistics?
- Third party
- Probing
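A minimal sketch of the union computation via inclusion-exclusion, using the slide's numbers read as fractions of the relation:

```python
# P(S1 U S2 | R) = P(S1|R) + P(S2|R) - P(S1 ^ S2 | R)
def union_coverage(p_s1, p_s2, p_both):
    return p_s1 + p_s2 - p_both

p_s1, p_s2 = 0.80, 0.60        # S1 has 80% of the movies, S2 has 60%
p_both = 0.98 * p_s2           # "S1 has 98% of the movies stored in S2"
print(union_coverage(p_s1, p_s2, p_both))   # ~0.81: calling S2 after S1 adds little
```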
67. Extremes of Laziness in Data Integration
- Fully query-time II (blue sky for now)
- Get a query from the user on the mediator schema
- Go discover relevant data sources
- Figure out their schemas
- Map the schemas onto the mediator schema
- Reformulate the user query into data source queries
- Optimize and execute the queries
- Return the answers
- Fully pre-fixed II
- Decide on the only query you want to support
- Write a (Java)script that supports the query by accessing specific (pre-determined) sources, piping results (through known APIs) to specific other sources
- Examples include Google Maps mashups
(The most interesting action is in between.)
E.g., we may start with known sources and their known schemas, do hand-mapping, and support automated reformulation and optimization.
68. (No transcript)
69. Overview
- User queries refer to the mediated schema.
- Data is stored in the sources in a local schema.
- Content descriptions provide the semantic mappings between the different schemas.
- The mediator uses the descriptions to translate user queries into queries on the sources.
(DWIM: "Do What I Mean")
70. Dimensions to Consider
- How many sources are we accessing?
- How autonomous are they?
- Can we get meta-data about sources?
- Is the data structured?
- Discussion about soft joins: see the next slide
- Supporting just queries, or also updates?
- Requirements: accuracy, completeness, performance, handling inconsistencies
- Closed-world assumption vs. open world?
- See a later slide
71. Soft Joins: WHIRL [Cohen]
- We can extend the notion of joins to "similarity joins", where similarity is measured in terms of vector similarity over the text attributes. The join tuples are then output in ranked form, with rank proportional to the similarity (see the sketch below)
- Neat idea, but it does have some implementation difficulties
- Most tuples in the cross-product will have non-zero similarities. So we need query processing that will somehow produce just the highly ranked tuples
- Other similarity/distance metrics may also be used
- E.g. edit distance
72. (No transcript)
73. Local Completeness Information (repeat of slide 65)
74. Overview
- User queries refer to the mediated schema.
- Data is stored in the sources in a local schema.
- Content descriptions provide the semantic mappings between the different schemas.
- The mediator uses the descriptions to translate user queries into queries on the sources.
(The mediated schema is a template for the stored data.)
75. Source Descriptions
- Contain all meta-information about the sources:
- Logical source contents (books, new cars)
- Source capabilities (can answer SQL queries)
- Source completeness (has all books)
- Physical properties of source and network
- Statistics about the data (like in an RDBMS)
- Source reliability
- Mirror sources
- Update frequency
76. Source Access
(We just did this!)
- How do we get the tuples?
- Many sources give unstructured output
- Some are inherently unstructured, while others "englishify" their database-style output
- Need to (un)wrap the output from the sources to get tuples
- Wrapper building / information extraction
- Can be done manually/semi-manually
77. Source Fusion / Query Planning
- Accepts the user query and generates a plan for accessing sources to answer the query
- Needs to handle tradeoffs between cost and coverage
- Needs to handle source access limitations
- Needs to reason about source quality/reputation
78. Monitoring / Execution
- Takes the query plan and executes it on the sources
- Needs to handle source latency
- Needs to handle transient/short-term network outages
- Needs to handle source access limitations
- May need to re-schedule or re-plan
79. Integrator vs. DBMS (Reprise)
- No common schema
- Sources with heterogeneous schemas
- Semi-structured sources
- Legacy sources
- Not relationally complete
- Variety of access/process limitations
- Autonomous sources
- No central administration
- Uncontrolled source content overlap
- Lack of source statistics
- Tradeoffs between query plan cost, coverage, quality, etc.
- Multi-objective cost models
- Unpredictable run-time behavior
- Makes query execution hard
- Presence of services
- Need to compose services
80. Deductive Databases
- Relations are viewed as predicates.
- Interrelations between relations are expressed as datalog rules
- (Horn clauses, without function symbols)
- Queries correspond to datalog programs
- Conjunctive queries are datalog programs with a single non-recursive rule; they correspond to SPJ queries in SQL
- Emprelated(Name,Dname) :- Empdep(Name,Dname)
- Emprelated(Name,Dname) :- Empdep(Name,D1), Emprelated(D1,Dname)
(Empdep is an EDB predicate, stored in the database; Emprelated is an IDB predicate, defined by the rules.)
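A small sketch of bottom-up (naive fixpoint) evaluation for this recursive program; the EDB facts are made up for illustration:

```python
# Naive fixpoint evaluation of the Emprelated rules: iterate the rules over
# the EDB facts until no new IDB tuples are derivable.
def emprelated(empdep):
    # rule 1: Emprelated(N, D) :- Empdep(N, D)
    rel = set(empdep)
    while True:
        # rule 2: Emprelated(N, D2) :- Empdep(N, D1), Emprelated(D1, D2)
        new = {(n, d2) for (n, d1) in empdep for (d1b, d2) in rel if d1 == d1b}
        if new <= rel:
            return rel          # fixpoint: no new facts derivable
        rel |= new

empdep = {("ann", "bob"), ("bob", "carl")}    # hypothetical Empdep EDB facts
print(emprelated(empdep))                     # derives ("ann", "carl") as well
```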
81. BibFinder Case Study
82. What to Optimize
- Traditional DB optimizers compare candidate plans purely in terms of the time they take to produce all answers to a query.
- In integration scenarios, the optimization is multi-objective:
- Total time of execution
- Cost to first few tuples
- Often, users are happier with plans that give the first tuples faster
- Coverage of the plan
- Full coverage is no longer an iron-clad requirement
- Too many relevant sources, uncontrolled overlap between the sources
- Can't call them all!
- (Robustness, access premiums)
83. Source Statistics Needed
- The size of the source relation and attributes
- The length and cardinality of the attributes
- The cardinality of the source relation
- The feasible access patterns for the source
- The network bandwidth and latency between the source and the integration system
- Coverage of source S for a relation R, denoted P(S|R)
- Overlap between sources: P(S1 ∩ ... ∩ Sk | R)
(These are defined over the extension of R, which is spread across the sources.)
84. Getting the Statistics
- Since the sources are autonomous, the mediator needs to actively gather the relevant statistics
- Learning bandwidth and latency statistics
- [Gruser et al., 2000] use neural networks to learn the response-time patterns of web sources
- Can learn the variation of response times across the days of the week and across the hours of the day
- Learning coverages and overlaps
- [Nie et al., 2002] use itemset mining techniques to learn compact statistics about the spread of the mediator-schema relations across the accessible sources
- Can trade quality of the statistics for reduced space consumption
85. Using LCW Rules to Minimize Plans
- Basic idea:
- If reformulation of Q leads to a union of conjunctive plans P1 ∪ P2 ∪ ... ∪ Pk,
- then, if P1 is complete for Q (under the given LCW information), we can minimize the reformulation by pruning P2...Pk:
- P1 plus the LCW statements entail that P1 contains P1 ∪ P2 ∪ ... ∪ Pk
- For recursive plans (obtained when the sources have access restrictions):
- We are allowed to remove a rule r from a plan P if the complete version of r is already contained in P - r
[Duschka, AAAI-97; Emerac: Lambrecht & Kambhampati, 99]
86. Example
(Minimization is harder in the presence of source limitations.)
Source descriptions: S1(T,D,Y) :- M(T,D,Y); S2(T,Th,C,H) :- Sh(T,Th,C,H); S3(T,Th,C,H) :- Sh(T,Th,C,H)
LCW guarantee: S2(T,Th,C,H) :- Sh(T,Th,C,H), C = 'Seattle'
Query: Q(T,D) :- M(T,D,Y), Sh(T,Th,C,H), C = 'Seattle'
Plan 1: Q(T,D) :- S1(T,D,Y), S2(T,Th,C,H), C = 'Seattle'
Plan 2: Q(T,D) :- S1(T,D,Y), S3(T,Th,C,H), C = 'Seattle'
Since S2's LCW covers all the Seattle showings, Plan 1 already contains Plan 2, and the union can be pruned to Plan 1.
[Levy, 96; Duschka, 97; Lambrecht & Kambhampati, 99]
87. Example (contd.)
(Minimization is harder in the presence of source limitations.)
Source descriptions: S1(T,D,Y) :- M(T,D,Y); S2(T,Th,C,H) :- Sh(T,Th,C,H)
LCW guarantee: S1(T,D,Y) :- M(T,D,Y), Y > 1960
Query: Q(T,D) :- M(T,D,Y), Sh(T,Th,C,H), C = 'Seattle'
Plan: Q(T,D) :- S1(T,D,Y), S2(T,Th,C,H), C = 'Seattle'
The part of the plan that is complete: Q(T,D) :- M(T,D,Y), Sh(T,Th,C,H), C = 'Seattle', Y > 1960
[Levy, 96; Duschka, 97; Lambrecht & Kambhampati, 99]
88. Approaches for Handling Multiple Objectives
- Do staged optimization
- Information Manifold: optimize for coverage, then for cost
- Do joint optimization
- Generate all the non-dominated solutions (Pareto set)
- Combine the objectives into a single metric
- e.g. Havasu/Multi-R:
- Cost increases additively
- Coverage decreases multiplicatively
- utility(p) = w * log(coverage(p)) - (1-w) * cost(p)
- The logarithm makes coverage additive [Candan, 01] (see the sketch below)
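A tiny sketch of the combined metric; the plans and numbers are made up, and sliding w from 0 to 1 moves the optimizer from pure cost-minimization toward pure coverage-maximization:

```python
# utility(p) = w * log(coverage(p)) - (1-w) * cost(p)
from math import log

def utility(coverage, cost, w=0.5):
    return w * log(coverage) - (1 - w) * cost

plans = {"cheap": (0.55, 1.0), "thorough": (0.95, 1.5)}  # (coverage, cost)
for w in (0.2, 0.5, 0.8):
    best = max(plans, key=lambda p: utility(*plans[p], w))
    print(f"w={w}: pick {best}")    # low w picks "cheap", higher w "thorough"
```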
89. Learning Coverage/Overlap Statistics
Challenge: it is impractical to learn and store all the statistics for every query.
- StatMiner: a threshold-based hierarchical association rule mining approach
- Learns statistics with respect to query classes rather than specific queries
- Defines query classes in terms of attribute-value hierarchies
- Discovers frequent query classes and limits statistics to them
- Maps a user's query into its closest ancestor class, and uses the statistics of the mapped class to estimate the statistics of the query
- Handles the efficiency/accuracy tradeoff by adjusting the thresholds
[Havasu: Nie et al., 2002]
90. Alternatives to Statistics-Based Approaches
- Avoid cost-based optimization and depend on qualitative, easier-to-get statistics
- Instead of coverage/overlap statistics, use:
- LCW characterizations
- Mirror-source characterizations
- E.g. RAZOR, INFORMATION MANIFOLD, EMERAC
- Instead of tuple-transfer statistics, use:
- The "bound is easier" assumption
- High-traffic binding-pattern information
- E.g. INFORMATION MANIFOLD, EMERAC, Yerneni et al.
91. Staged Optimization of Cost & Coverage
- Rank and choose the top-N plans by coverage [Doan et al., ICDE 02]
- Then find a physical plan for each selected plan
Problem: cost and coverage are interrelated...
92. Joint Optimization of Cost & Coverage
- Havasu/Multi-R [Nie et al., 2001]
- Search in the space of parallel plans
- Each subplan in the parallel plan contains (a subset of) the sources relevant for a subgoal
- Dynamic programming is used to search among the subgoal orders
- A greedy approach is used to create a subplan for a particular subgoal
- Keep adding sources until the utility (defined in terms of cost and coverage) starts to worsen (see the sketch below)
- Capable of generating plans for a variety of cost/coverage tradeoffs
(Plot: the plans generated as the relative weight of coverage increases.)
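A sketch of the greedy subplan step for one subgoal. For simplicity it assumes source coverages are independent (so adding source s contributes cov(s) * (1 - coverage_so_far)); Havasu uses learned overlap statistics instead:

```python
# Greedy subplan construction: keep adding the source that most improves
# utility(p) = w*log(coverage) - (1-w)*cost; stop when utility would worsen.
from math import log

def greedy_subplan(sources, w=0.5):
    """sources: name -> (coverage, cost). Returns chosen sources in order."""
    def utility(cov, c):
        return w * log(cov) - (1 - w) * c if cov > 0 else float("-inf")

    chosen, coverage, cost = [], 0.0, 0.0
    best_u = utility(coverage, cost)
    while True:
        candidates = [(utility(coverage + (1 - coverage) * cv, cost + c), s)
                      for s, (cv, c) in sources.items() if s not in chosen]
        if not candidates:
            return chosen
        u, s = max(candidates)
        if u <= best_u:
            return chosen                  # adding more sources only hurts
        best_u = u
        coverage += (1 - coverage) * sources[s][0]
        cost += sources[s][1]
        chosen.append(s)

print(greedy_subplan({"S1": (0.8, 1.0), "S2": (0.6, 1.0), "S3": (0.1, 2.0)}))
# picks S1 and then stops: S2/S3 add little residual coverage for their cost
```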
93. Techniques for Optimizing Response Time for First Tuples
- Staged approach: generate plans based on other objectives, and post-process them to improve their response time for the first k tuples
- The typical idea is to replace asymmetric operators with symmetric ones
- e.g. replace nested-loop join with symmetric hash join (sketched below)
- e.g. Telegraph, Tukwila, Niagara
- Problem: access limitations between sources may disallow symmetric operations
- Solution: use a joint-optimization approach (e.g. Havasu) and consider the cost of first tuples as a component of plan utility
- [Viglas & Naughton, 2002] describe approaches for characterizing the rate of answer delivery offered by various query plans
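A minimal sketch of a symmetric hash join, the classic non-blocking replacement for nested-loop join mentioned above; the stream encoding is made up for illustration:

```python
# Symmetric hash join: each arriving tuple is inserted into its own side's
# hash table and immediately probed against the other side's table, so join
# results stream out as soon as both halves of a match have arrived.
from collections import defaultdict

def symmetric_hash_join(stream, key_left, key_right):
    """stream yields ('L', tuple) or ('R', tuple) in arrival order."""
    left, right = defaultdict(list), defaultdict(list)
    for side, tup in stream:
        if side == 'L':
            k = key_left(tup)
            left[k].append(tup)
            for other in right[k]:         # probe the opposite table
                yield (tup, other)
        else:
            k = key_right(tup)
            right[k].append(tup)
            for other in left[k]:
                yield (other, tup)

stream = [('L', ("m1", "Yimou")), ('R', ("m1", "7pm")),
          ('L', ("m2", "Lee")), ('R', ("m1", "9pm"))]
print(list(symmetric_hash_join(stream, lambda t: t[0], lambda t: t[0])))
# the first joined tuple appears after only two input tuples have arrived
```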
94. Integrating Services
- Sources can be services rather than data repositories
- E.g. Amazon as a composite service for book buying
- The separating line is somewhat thin
- Handling services:
- Description (API / input-output spec)
- WSDL
- Composition
- Planning, in general
- Execution
- Data-flow architectures
- See next part
95. What about XML/XQuery/RDF?
The impact of X-standards on integration.
96. Impact of XML on Integration
- If and when all sources accept XQueries and exchange data in XML format, then:
- The mediator can accept user queries in XQuery
- Access sources using XQuery
- Get data back in XML format
- Merge results and send them to the user in XML format
- How about now?
- Sources can use XML adapters (middleware)
97. Services
- Sources can be services rather than data repositories
- The separating line is thin
- Handling services:
- Description (API / input-output spec)
- WSDL
- Composition
- Planning, in general
- Execution
- Data-flow architectures
[SWORD, WWW-2002]