Title: Data integration
1Data integration
- Most slides are borrowed from Li Chen
2Motivation
Biblio sever
Legacy database
Plain text files
Support seamless access to autonomous and
heterogeneous information sources.
3Applications
Lowest price of the DVD The Matrix?
Buyer 1
Supplier 1
Buyer 2
Supplier 2
Integrator
Supplier M
Buyer M
4Mediation architecture
Mediator
Wrapper
Wrapper
Wrapper
Source 1
Source 2
Source n
TSIMMIS (Stanford), Garlic (IBM), Infomaster
(Stanford), Disco (INRIA), Information Manifold
(ATT), Hermes(UMD), Tukwila (UW), InfoSleuth
(MCC),
5Challenges
- Sources are heterogeneous
- Different data models relational,
object-oriented, XML, - Different schemas and representations. E.g.,
- Keanu Reeves or Reeves, Keanu or Reeves,
K. etc. - Describe source contents
- Use source data to answer queries
- Sources have limited query capabilities
- Data quality
6Outline
- Basics theories of conjunctive queries
- Global-as-view (GAV) approach to data
integration - Local-as-view (LAV) approach to data integration
7Conjunctive Queries (CQs) in Datalog
- Most common form of query equivalent to
select-project-join (SPJ) queries - Useful for data integration
- Form q(X) - p1(X1), p2(X2),, pn(Xn).
- Head q(X) represents the query answers
- Body p1(X1), p2(X2),, pn(Xn) represents the
query conditions - Each pi(Xi) is called a subgoal
- Shared variables represent join conditions
- Constants represent Attributeconst selection
conditions - A relation can appear in multiple predicates
(subgoals)
8Conjunctive Queries example
- Schema
- student(name, courseNum), course(number,
Instructor) - SQL
- SELECT name
- FROM student, course
- WHERE student.courseNumcourse.number
AND instructorLi - Equal to
- ans(SN) - student(SN, CN), course(CN,Li).
- Predicates student and course correspond to
relations names - Two subgoals student(SN, CN) and course(CN,Li)
- Variables SN, CN. Constant Li
- Shared variable, CN, corresponds to
student.courseNumcourse.number - Variable SN in the head the answer to the query
9Answer to a CQ
- For a CQ Q on database D, the answer Q(D) is a
set of heads of Q if we - Substitute constants for variables in the body of
Q in all possible ways - Require all subgoals to be true
- Example ans(SN) - student(SN, CN),
course(CN,Li). - Tuples are also called EDB (external database)
facts student(Jack, 184), student(Tom,215), ,
course(184,Li), course(215,Li), - Answer Jack SN?Jack,CN?184
- Answer Tom SN?Tom,CN?215
- Answer Jack SN?Jack,CN?215 (duplicate
eliminated)
Course
Student
10Query containment
- For two queries Q1 and Q2, we say Q1 is contained
in Q2, denoted Q1?Q2, if any database D, we have
Q1(D) ?Q2(D). - We say Q1 and Q2 are equivalent, denoted Q1?Q2,
if Q1(D) ?Q2(D) and Q1(D) ? Q2(D). - Example
- Q1 ans(SN) - student(SN, CN), course(CN, Li).
- Q2 ans(SN) - student(SN, CN), course(CN, INS).
- We have Q1(D) ? Q2(D).
11Another example
- Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y).
- Q2 p(X,Y) - r(X,W), b(W,W), r(W,Y).
- We have Q2 ?Q1
- Proof
- For any DB D, suppose p(x,y) is in Q2(D). Then
there is a w such that r(x,w), b(w,w), and r(w,y)
are in D. - For Q1, consider the substitution X? x, W? w, Z?
w, Y? y. - Thus the head of Q1 becomes p(x,y), meaning that
p(x,y) is also in Q1(D). - In general, how to test containment of CQs?
- Containment mappings
12Containment mappings
- A containment mapping from Q2 to Q1 Map
variables of Q2 to variables of Q1, such that - Head of Q2 becomes head of Q1
- Each subgoal of Q2 becomes some subgoal of Q1.
- It is not necessary that every subgoal of Q1 is
the target of some subgoal of Q2. - Q1 ?Q2 iff there is a containment mapping from
Q2 to Q1. - Note that the containment mapping is opposite the
containment --- it goes from the larger
(containing CQ) to the smaller (contained CQ). - Example
- Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y).
- Q2 p(X,Y) - r(X,W), b(W,W), r(W,Y).
- Containment mapping from Q1 to Q2 X ? X, Y ? Y,
W ? W, Z ? W - No containment mapping from Q2 to Q1
- For b(W,W) in Q2, its only possible target in Q1
is b(W,Z) - However, we cannot have a mapping W?W and W?Z,
since each variable cannot be mapped to two
different variables
13Extending CQs
- CQs with built-in predicates
- We can add more conditions to variables in a CQ.
- Example
- student(name, GPA, courseNum),
course(number,instructor,year) - Q1(SN) - student(SN, G, CN), course(CN, Li),
Ggt3.5. - Q2(SN) - student(SN, G, CN), course(CN, Li),
Ggt3.5, Y lt 2002. - Q2(SN) ? Q1(SN).
- Datalog queries
- a (possibly infinite) set of CQs with (possibly)
recursion - Example r(Parent, Child)
- Query finding all ancestors of Tom
- ancestor(P,C) - r(P, C).
- ancestor(P,C) - ancestor(P,X), r(X, C).
- result(P) - ancestor(P, tom).
14Outline
- Basics theories of conjunctive queries
- Global-as-view (GAV) approach to data
integration - Local-as-view (LAV) approach to data integration
15GAV approach to data integration
- Readings
- Jeffrey Ullman, Information Integration Using
Logical Views, ICDT 1997. - Ramana Yerneni, Chen Li, Hector Garcia-Molina,
and Jeffrey Ullman, Computing Capabilities of
Mediators, SIGMOD 1999.
16Global-as-view Approach
med(Dealer,City,Make,Year) R1 R2
Mediator
R1(Dealer,City)
R2(Dealer, Make, Year)
- Mediator exports views defined on source
relations - med(Dealer,City,Make,Year) R1 R2
- A query is posted on mediator views
- SELECT FROM med
- WHERE Year 2001
- ans(D,C,M, 2001) - med(D,C,M,2001).
- Mediator expands query to source queries
- SELECT FROM R1, R2
- WHERE Year 2001
-
- ans(D,C,M,2001) - R1(D,C), R2(D,M,
2001).
17GAV Approach
- Project TSIMMIS at Stanford
- Advantages
- User queries are easy to define
- Query transformation generation is
straightforward - Disadvantages
- Not all source information is exported
- Not easily scalable every time a new source is
added, mediator views need to be changed. - Research issues
- Efficient query execution?
- Deal with limited source capabilities?
18Limited source capabilities
- Complete scans of relations not possible
- Reasons
- Legacy databases or structured files limited
interfaces - Security/Privacy
- Performance concerns
- Example 1 legacy databases with restrictive
interfaces
title
author
Given an author, return the books.
Ullman
DBMS
Knuth
TeX
19Another example Web search forms
www.imdb.com
20Problems
- How to describe source restrictions?
- How to compute mediator restrictions from
sources? - How to answer queries efficiently given these
restrictions? - How to compute as many answers as possible to a
query?
21Computing mediator restrictions
- Motivation do not want users to be frustrated by
submitting a query that cannot be answerable by
the mediator - Example
- Source 1 book(author?, title, price)
- Capability bff
- I.e., we must provide an author, and can get
title and price info - Source 2 review(title?, reviewer, rate)
- Capability bff
- I.e., we must provide a book title, and can get
other info - Mediator view
- MedView(A?,T,P,RV,RT) - book(A,T,P),review(T,RV,R
T). - Query on the mediator view
- Ans(RT) - MedView(A, db, P, RV, RT).
- I.e., find the review rates of DB books
- But the mediator cannot answer this query, since
we do not know the authors. - We want to tell the user beforehand what queries
can be answered
22Outline
- Basics theories of conjunctive queries
- Global-as-view (GAV) approach to data
integration - Local-as-view (LAV) approach to data
integration.
23Local-as-view (LAV) approach
Mediator
sources
- There are global predicates, e.g., car,
person, book, etc. - They can been seen as mediator views
- The content of each source is described using
these global predicates - A query to the mediator is also defined on the
global predicates - The mediator finds a way to answer the query
using the source contents
24Example
Mediator
S1(Dealer,City)
S2(Dealer,Make,Year)
- Global predicates Loc(Dealer,City),Sell(Dealer,Ma
ke,Year) - Source content defined on global predicates
- S1(Dealer,City) - Loc(Dealer, City).
- S2(Dealer,Make,Year) - Sell(Dealer, Make,
Year). - In general, each definition could be more
complicated, rather than direct copies. - Queries defined on global predicates.
- Q ans(D,M,Y) - Loc(D, windsor), Sell(D, M,
Y). - Users do not know source views.
- The mediator decides how to use source views to
answer queries. - Answering queries using views
- ans(D, M, Y) - S1(D,windsor), S2(D,M,Y).
25Answering queries using views
Mediator
Query
V1(Dealer,City)- Loc(Dealer, City). V2(Dealer,Mak
e,Year)-Sell(Dealer, Make, Year). V3(D,C,M,Y) -
Loc(D,C),Sell(D,M,Y). V4(D,C,M,Y) -
Loc(D,C),Sell(D,M,Y), Ylt1970.
- Source views can be complicated SPJs, arithmetic
comparisons, - Not easy to decide how to answer a query using
source views - Query ans(D,M) - Loc(D,windsor'),
Sell(D,M,Y). - Rewriting
- ans(D, M) - V3(D,windsor, M,Y).
- ans(D, M) - V1(D,windsor), V2(D,M,Y).
-
- Equivalent rewriting compute the same answer
as the query - A rewriting can join multiple source views
26Arithmetic comparisons
Mediator
V(D,C,M,Y)- Loc(D,C),Sell(D,M,Y),Ylt1970
- Comparisons can make the problem even trickier
- Query ans(D,M) - Loc(D,windsor'),
Sell(D,M,Y). - Rewriting ans(D,M) - V(D,windsor, M,Y).
- Contained rewriting only retrieve cars before
1970. - Query ans(D,M)- Loc(D, windsor'), Sell(D,M,Y),
Y lt 1960 - Rewriting ans(D,M) - V(D,windsor, M, Y), Y lt
1960
27Rewritings
- Given a query Q and a set of views V
- A conjunctive query P is called a rewriting of
Q using V if P only uses views in V, and P
computes a partial answer of Q. That is Pexp ?Q.
A rewriting is also called a contained
rewriting (CR). - A conjunctive query P is called an equivalent
rewriting (ER) of Q using V if P only uses views
in V, and P computes the exact answer of Q. That
is Pexp ? Q.
28Bucket algorithm
- Query
- q(x)-car(x), sell(x, d), loc(d, windsor).
- Views
- v1(x) - car(x).
- v2(x) - car(x), sell(x, d).
- v3(x,d) - sell(x, d), loc(d, windsor).
- v4(x) - sell(x, d), loc(d, windsor).
q(x)-v1(x), v2(x), v3(x,d). q(x)-v1(x),
v3(x,d). q(x)-v1(x), v4(x). q(x)-v2(x),
v3(x,d). q(x)-v2(x), v4(x).
29Projects using the LAV approach
- Projects Information Manifold, Infomaster,
Tukwila, - Advantages
- Scalable new sources easy to add without
modifying the mediator views - All we need to do is to define the new source
using the existing mediator views (predicates) - Disadvantages
- Hard to decide how to answer a query using views
- Reading Alon Halevy, Answering Queries Using
Views A Survey.