Title: Searching and Integrating Information on the Web
1Searching and Integrating Information on the Web
- Seminar 2 Data Integration
- Professor Chen Li
- UC Irvine
2Motivation
Biblio sever
Legacy database
Plain text files
Support seamless access to autonomous and
heterogeneous information sources.
3Applications
Lowest price of the DVD The Matrix?
Buyer 1
Supplier 1
Buyer 2
Supplier 2
Integrator
Supplier M
Buyer M
4Mediation architecture
Mediator
Wrapper
Wrapper
Wrapper
Source 1
Source 2
Source n
TSIMMIS (Stanford), Garlic (IBM), Infomaster
(Stanford), Disco (INRIA), Information Manifold
(ATT), Hermes(UMD), Tukwila (UW), InfoSleuth
(MCC),
5Challenges
- Sources are heterogeneous
- Different data models relational,
object-oriented, XML, - Different schemas and representations
- Keanu Reeves or Reeves, Keanu or Reeves,
K. etc. - Describe source contents
- Use source data to answer queries
- Sources have limited query capabilities
- Data quality
6Outline
- Basics theories of conjunctive queries
- Global-as-view (GAV) approach to data
integration - Local-as-view (LAV) approach to data integration
7Basics conjunctive queries
- Reading Ashok K. Chandra and Philip M. Merlin,
Optimal implementation of conjunctive queries in
relational data bases, STOC, 77-90, 1977. - Fundamental for data integration
- Source content description
- Query description
- Plan formulation
8Conjunctive Queries (CQs)
- Most common form of query equivalent to
select-project-join (SPJ) queries - Useful for data integration
- Form q(X) - p1(X1),p2(X2),,pn(Xn)
- Head q(X) represents the query answers
- Body p1(X1),p2(X2),,pn(Xn) represents the query
conditions - Each pi(Xi) is called a subgoal
- Shared variables represent join conditions
- Constants represent Attributeconst selection
conditions - A relation can appear in multiple predicates
(subgoals)
9Conjunctive Queries example
- student(name,courseNum), course(number,instructor)
- SELECT name
- FROM student, course
- WHERE student.courseNumcourse.number
AND instructorLi - Equal to
- ans(SN) - student(SN, CN), course(CN,Li)
- Predicates student and course correspond to
relations names - Two subgoals student(SN, CN) and course(CN,Li)
- Variables SN, CN. Constant Li
- Shared variable, CN, corresponds to
student.courseNumcourse.number - Variable SN in the head the answer to the query
10Answer to a CQ
- For a CQ Q on database D, the answer Q(D) is set
of heads of Q if we - Substitute constants for variables in the body of
Q in all possible ways - Require all subgoals to be true
- Example ans(SN) - student(SN, CN),
course(CN,Li) - Tuples are also called EDB (external database)
facts student(Jack, 184), student(Tom,215), ,
course(184,Li), course(215,Li), - Answer Jack SN?Jack,CN?184
- Answer Tom SN?Tom,CN?215
- Answer Jack SN?Jack,CN?215 (duplicate
eliminated)
Course
Student
11Query containment
- For two queries Q1 and Q2, we say Q1 is contained
in Q2, denoted Q1?Q2, if any database D, we have
Q1(D) ?Q2(D). - We say Q1 and Q2 are equivalent, denoted Q1?Q2,
if Q1(D) ?Q2(D) and Q1(D) ? Q2(D). - Example
- Q1 ans(SN) - student(SN, CN), course(CN,Li)
- Q2 ans(SN) - student(SN, CN), course(CN,INS)
- We have Q1(D) ? Q2(D).
12Another example
- Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y)
- Q2 p(X,Y) - r(X,W), b(W,W), r(W,Y)
- We have Q2 ?Q1
- Proof
- For any DB D, suppose p(x,y) is in Q2(D). Then
there is a w such that r(x,w), b(w,w), and r(w,y)
are in D. - For Q1, consider the substitution X? x, W? w, Z?
w, Y? y. - Thus the head of Q1 becomes p(x,y), meaning that
p(x,y) is also in Q1(D). - In general, how to test containment of CQs?
- Containment mappings
- Canonical databases
13Containment mappings
- Mapping from variables of CQ Q2 to variables of
CQ Q1, such that - Head of Q2 becomes head of Q1
- Each subgoal of Q2 becomes some subgoal of Q2
- It is not necessary that every subgoal of Q1 is
the target of some subgoal of Q2. - Example
- Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y)
- Q2 p(X,Y) - r(X,W), b(W,W), r(W,Y)
- Containment mapping from Q1 to Q2 X ? X, Y ? Y,
W ? W, Z ? W - No containment mapping from Q2 to Q1
- For b(W,W) in Q2, its only possible target in Q1
is b(W,Z) - However, we cannot have a mapping W?W and W?Z,
since each variable cannot be mapped to two
different variables
14Example of containment mappings
- Example C1 p(X) - a(X,Y), a(Y,Z), a(Z,W)
- C2 p(X) - a(X,Y), a(Y,X)
- Containment mapping from C1 to C2 X ? X, Y ? Y,
Z ? X, W ? Y - No containment mapping from C2 to C1. Proof
- For the two heads, the mapping must have X ? X
- For a(X,Y) in C2, its target in C1 can only be
a(X,Y) (since X?X). Thus Y?Y. - However, for a(Y,X) in C2, its target, which must
be a(Y,X), does not exist in C1.
15Theorem of Containment Mappings
- Theorem Q1 ?Q2 iff there is a containment
mapping from Q2 to Q1. - Notice the direction is the opposite
- Proof (If)
- Suppose ? is a containment mapping from Q2 to Q1
- For any DB D, let tuple t is in Q1(D)
- t is produced by a substitution ? on the
variables of Q1 that makes all Q1s subgoals
facts in D. - Therefore, ? ? ? is a substitution for variables
of Q2 that produces t - Thus each t in Q1(D) must be in Q2(D)
Q1 p(X) - G1, G2, Gk
?
Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y) Q2 p(X,Y)
- r(X,W), b(W,W), r(W,Y)
?
Q2 p(X) - H1, H2, Hj
16Proof (only if)
- Key idea frozen CQ
- Use a unique constant to replace a variable
- Frozen Q is a DB consisting of all the subgoals
of Q, with the chosen constants substituted for
variables - This DB is called a canonical database of the
query. - Example
- Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y)
- Frozen Q1 X replaced by constant x0, W by
constant w0, Z by z0, Y by y0 - Result DB with r(x0, w0), b(w0, z0), r(z0, y0)
17Proof (only if) -- cont
- Let Q1 ?Q2. Let D be the frozen Q1. Let ? be the
substitution from those constants to the
variables in Q1. - Since we chose a unique constant for each
variable, this substitution exists. - Since Q1 ?Q2 the frozen head of Q1 must be in
Q2(D). Thus there is a substitution ? from Q2 to
D. - We can show that ? ? ? is a containment mapping
from Q2 to Q1 - The head of Q2 is mapped to the head of Q1.
- Each subgoal in Q2 is mapped to a subgoal in Q2.
Q1 p(X) - G1, G2, Gk
?
Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y) Q2 p(X,Y)
- r(X,W), b(W,W), r(W,Y)
?
Q2 p(X) - H1, H2, Hj
18Testing query containment
- To test Q1 ?Q2.
- Get a canonical DB D of Q1.
- Compute Q2(D)
- If Q2(D) contains the frozen head of Q1, then Q1
?Q2. otherwise not. - Testing containment between CQs is NP-complete.
- Some polynomial-time algorithms exist in special
cases.
19Extending CQs
- CQs with built-in predicates
- We can add more conditions to variables in a CQ.
- Example
- student(name, GPA, courseNum),
course(number,instructor,year) - ans(SN) - student(SN, G, CN), course(CN,Li),
Ggt3.5 - ans(SN) - student(SN, G, CN), course(CN,Li,
Y), Ggt3.5, Y lt 2002 - More results on CQs with built-in predicates
- Datalog queries
- a (possibly infinite) set of CQs with (possibly)
recursion - Example r(Parent, Child)
- Query finding all ancestors of Tom
- ancestor(P,C) - r(P, C)
- ancestor(P,C) - ancestor(P,X), r(X, C)
- result(P) - ancestor(P, tom)
20Further Reading
- Jeff Ullman, Principles of Database and
Knowledge Systems, Computer Science Press, 1988,
Volume 2.
21Outline
- Basics theories of conjunctive queries
- Global-as-view (GAV) approach to data
integration - Local-as-view (LAV) approach to data integration
22GAV approach to data integration
- Readings
- Jeffrey Ullman, Information Integration Using
Logical Views, ICDT 1997. - Ramana Yerneni, Chen Li, Hector Garcia-Molina,
and Jeffrey Ullman, Computing Capabilities of
Mediators, SIGMOD 1999.
23Global-as-view Approach
med(Dealer,City,Make,Year) R S
Mediator
R1(Dealer,City)
R2(Dealer, Make, Year)
- Mediator exports views defined on source
relations - med(Dealer,City,Make,Year) R1 R2
- A query is posted on mediator views
- SELECT FROM med
- WHERE Year 2001 ans(D,C,M) -
med(D,C,M,2001) - Mediator expands query to source queries
- SELECT FROM R1, R2
- WHERE Year 2001 ans(D,C,M,Y) - R1(D,C),
R2(D,M,2001)
24GAV Approach (cont)
- Project TSIMMIS at Stanford
- Advantages
- User queries easy to define
- Plan generation is straightforward
- Disadvantages
- Not all source information is exported
- What if users want to get dealers that may not
the city information? - Those dealers are not visible.
- Not easily scalable every time a new source is
added, mediator views need to be changed - Research issues
- Efficient query execution?
- Deal with limited source capabilities?
25Limited source capabilities
- Complete scans of relations not possible
- Reasons
- Legacy databases or structured files limited
interfaces - Security/Privacy
- Performance concerns
- Example 1 legacy databases with restrictive
interfaces
title
author
Given an author, return the books.
Ullman
DBMS
Knuth
TeX
26Another example Web search forms
www.imdb.com
27Problems
- How to describe source restrictions?
- How to compute mediator restrictions from
sources? - How to answer queries efficiently given these
restrictions? - How to compute as many answers as possible to a
query?
28Describe source capabilities using attribute
adornments. f free b bound u
unspecified cS chosen from a list S of
constants, e.g., state oS optional if
chosen, must be from a list S of constants A
search form is represented as multiple
templates (Title, Author, ISBN, Format,
Subject) b f u u
u ? 1 f b u u
u ? 1 u u u o
o ? 2 u u b u
u ? 3
1
2
3
29Computing mediator restrictions
- Motivation do not want users to be frustrated by
submitting a query that cannot be answerable by
the mediator - Example
- Source 1 book(author, title, price)
- Capability bff
- I.e., we must provide a title, and can get author
and price info - Source 2 review(title, reviewer, rate)
- Capability bff
- I.e., we must provide a book title, and can get
other info - Mediator view
- MedView(A,T,P,RV,RT) - book(A,T,P),review(T,RV,RT
) - Query on the mediator view
- Ans(RT) - MedView(A, db, P, RV, RT).
- I.e., find the review rates of DB books
- But the mediator cannot answer this query, since
we do not know the authors. - We want to tell the user beforehand what queries
can be answered
30Solutions Compute mediator capabilities
- Need algorithms that do the following
- Given
- Source relations with restrictions.
- Mediator views defined on source relations
- Union
- Join
- Selection
- Projection
- Main idea of the algorithms
- compute restrictions on mediator views
- minimize number of view templates
31Union views
- Assumption
- MedView - V1?V2
- We want to get all tuples from two sources that
satisfy a query condition - No mediator post-processing power
- Table to compute view adornments
- E.g., f, os3 ? os3
- cs2, os3 ? cs2?s3
- Invalid combination b,u ? -
V2
V1
32Union views with postprocessing
- Mediator can postprocess results from a source,
and check if the results satisfy certain
conditions - Thus some entries are more relaxing
- Essentially o can be treated as f, and u
can be treated as f - E.g., f, os3 ? f instead of os3
- cs2, os3 ? cs2 instead of cs2?s3
- b,u ? b instead of invalid combination
V2
V1
33Join views with passing bindings
- Assumption
- MedView - V1 JOIN V2
- The mediator can pass bindings from V1 to V2
- So the join order matters
V2
V1
34Other views
- Union
- Join
- Selection
- Projection
- Multiple views
35Concise template description
- Some adornments subsume other adornments
- E.g. f subsumes b, since every query
supported by b is also supported by f - Adornment graph subsumption relationships
- Use the graph to compress templates
experiments shrank 26 ? 8 templates
f
n1
n2
Adornment n1 is at least as restrictive as
adornment n2
b
o
n1
n2
Adornment n1 is at least as restrictive as
adornment n2, if the constant set of n1 is a
subset of that of n2
u
c
Adornment graph
36Outline
- Basics theories of conjunctive queries
- Global-as-view (GAV) approach to data
integration - Local-as-view (LAV) approach to data integration
37Local-as-view (LAV) approach
Mediator
sources
- There are global predicates, e.g., car,
person, book, etc. - They can been seen as mediator views
- The content of each source is described using
these global predicates - A query to the mediator is also defined on the
global predicates - The mediator finds a way to answer the query
using the source contents
38Example
Mediator
S1(Dealer,City)
S2(Dealer,Make,Year)
- Global predicates Loc(Dealer,City),Sell(Dealer,Ma
ke,Year) - Source content defined on global predicates
- S1(Dealer,City) - Loc(Dealer,City)
- S2(Dear,Make,Year) - Sell(Dear,Make,Year)
- In general, each definition could be more
complicated, rather than direct copies. - Queries defined on global predicates.
- Q ans(D,M,Y) - Loc(D,irvine), Sell(D,M,Y)
- Users do not know source views.
- The mediator decides how to use source views to
answer queries. - Answering queries using views
- ans(D,M,Y) - S1(D,irvine), S2(D,M,Y)
39Another LAV Example
- Mediator predicates car(C), sell(Car, Dealer),
loc(dealer, city) - Views
- v1(x) - car(x)
- v2(x) - car(x), sell(x, d)
- v3(x,d) - sell(x, d), loc(d, la)
- v4(x) - sell(x, d), loc(d, la)
- Query q(x) - car(x), sell(x, d), loc(d, la)
40Open-world assumption (OWA) and Close-world
assumption (CWA)
W1(Make, Dealer) - car(Make, Dealer) W2(Make,
Dealer) - car(Make, Dealer)
- W1 and W2 have some car tuples.
- E.g. W1 and W2 are from two different web sites.
- W1 and W2 have all car tuples.
- E.g. W1 and W2 are computed from the same car
table in a database.
41Projects using the LAV approach
- Projects Information Manifold, Infomaster,
Tukwila, - Advantages
- Scalable new sources easy to add without
modifying the mediator views - All we need to do is to define the new source
using the existing mediator views (predicates) - Disadvantages
- Hard to decide how to answer a query using views
42Reading
- Alon Halevy, Answering Queries Using Views A
Survey.
43Answering queries using views
Mediator
Query
V(D,C,M,Y) - Loc(D,C),Sell(D,M,Y)
- Source views can be complicated SPJs, arithmetic
comparisons, - Not easy to decide how to answer a query using
source views - Query ans(D,M) - Loc(D,'irvine'),
Sell(D,M,Y). - Rewriting ans(D,M) - V(D,irvine, M,Y)
- Equivalent rewriting compute the same answer
as the query - A rewriting can join multiple source views
- This problem exists in many other applications
- data warehousing
- web caching
- query optimizations
44Arithmetic comparisons
Mediator
V(D,C,M,Y)- Loc(D,C),Sell(D,M,Y),Ylt1970
- Comparisons can make the problem even trickier
- Query ans(D,M) - Loc(D,'irvine'),
Sell(D,M,Y). - Rewriting ans(D,M) - V(D,irvine, M,Y)
- Contained rewriting only retrieve cars before
1970. - Query ans(D,M) - Loc(D, 'irvine'), Sell(D,M,Y),
Y lt 1960 - Rewriting ans(D,M) - V(D,irvine,M,Y), Y lt 1960
45Dropping attributes in views
Mediator
Drop Year in the view V(D,C,M)-
Loc(D,C),Sell(D,M,Y),Ylt1970
- A variable in a CQ is called
- distinguished if it appears in the querys
head - nondistinguished otherwise
- The problem becomes even harder when we have
nondistinguished variables. - Query ans(D,M) - Loc(D,'irvine'), Sell(D,M,Y),
Ylt1960 - No rewriting! Since we do not have Year
information. - Query ans(D,M) - Loc(D,'irvine'), Sell(D,M,Y),
Ylt1980 - Contained rewriting ans(D,M) - V(D, irvine,
M)
46Problems
Query
Source views
- How to answer a query using views?
- We will focus on the case where both the query
and views are simply conjunctive.
47Query Expansion
- For each query P on views, we can expand P using
the view definitions, and get a new query,
denoted as Pexp, on the base tables. - Pexp can be considered to be the real meaning
of the query. - Example
- View V(D,C,M) - Loc(D,C), Sell(D,M,Y)
- A query P using V ans(D,M) - V(D,la,M)
- Expansion ans(D,M) - Loc(D,la), Sell(D,M,Y)
Query P ans() - v1(), v2(), , vk()
Expansion Pexp ans()- p1,1(),,p1,i1(),,
pk,1(),,pk,ik()
48Rewritings
- Given a query Q and a set of views V
- A conjunctive query P is called a rewriting of
Q using V if P only uses views in V, and P
computes a partial answer of Q. That is Pexp ?Q.
A rewriting is also called a contained
rewriting (CR). - A conjunctive query P is called an equivalent
rewriting (ER) of Q using V if P only uses views
in V, and P computes the exact answer of Q. That
is Pexp ? Q. - A query P is called a maximally-contained
rewriting of Q using V if P is a union of CRs of
Q using V, and for any CR P1of Q, the answer to P
contains the answer to query P1, that is, P1exp ?
Pexp. - See earlier slides for examples
- Notice that all these definitions depend on the
language of the rewriting considered. Here we
consider conjunctive queries.
49Focus MiniCon algorithm
- MiniCon Algorithm Rachel Pottinger and Alon
Levy, A scalable algorithm for answering queries
using views, VLDB 2000. - See also The Shared-variable-bucket algorithm by
Prasenjit Mitra "An Algorithm for Answering
Queries Efficiently Using Views" in Proceedings
of the Australasian Database Conference, Jan
2001. - Formulation
- Input a conjunctive query Q and a set V of
conjunctive views - Output an maximally-contained rewriting (MCR) of
Q using V - Main idea
- For each query subgoal and for each view
- Check if the view can be used to answer the
query subgoal, and if so, in what form - Some shared variables are treated carefully
- Combine views to answer all query subgoals
- Reduced to a set-cover problem
50Example
- Query q(x) - car(x), sell(x, d), loc(d, la)
- Views
- v1(x) - car(x)
- v2(x) - car(x), sell(x, d)
- v3(x,d) - sell(x, d), loc(d, la)
- v4(x) - sell(x, d), loc(d, la)
51MCDs (enhanced Buckets)
- For query subgoal car(x), its MCD includes all
views that can answer this subgoal - v1(x), v2(x)
- MCD of query subgoal sell(x,d)
- v3(x,d) only
- but not v2(x)! Because
- Variable d is nondistinguished, i.e., it is not
exported. - Variable d is shared by another query subgoal,
loc(d,la). If we were to use v2(x) to answer
query subgoal sell(x,d), we cannot get the dealer
info to join with the other view to answer
loc(d,la). - MCD of query subgoal loc(d,la)
- v3(x,d)
52Multi-subgoal MCD
- MCD of query subgoals sell(x,d),loc(d,la)
- v4(x)
- If v4(x) is used to answer query subgoal
sell(x,d), then the query subgoal loc(d,la)
must be answered using v4(x) as well. - The reason is that d is shared by two query
subgoals, and the corresponding variable in v4(x)
is not exported.
53General rules
- For a query subgoal G and a view subgoal H in
view W, the MiniCon algorithm considers a mapping
from G to H - In this mapping, a query variable X is mapped to
a view variable A - Four possible cases
- Case 1 X is dist., A is dist.. OK.
- A is exported, so can join with other views.
- Case 2 X is nondist., A is dist.. OK.
- Same as above
- Case 3 X is dist., A is nondist.. NOT OK.
- X needs to be in the answer, but A is not
exported. - Case 4 X is nondist., A is nondist..
- Then all the query subgoals using X must be able
to be mapped to other subgoals in view W. - Reason since A is not exported in W, its
impossible for W to join with other views to
answer conditions involving X. - I.e., either NONE or ALL.
54Combine MCDs to cover query subgoals
- Problem
- q(x) - car(x), sell(x,d), loc (d,la")
- v1(x) - car(x)
- v2(x) - car(x), sell(x,d)
- v3(x,d) - sell(x,d), loc(d,la")
- v4(x) - sell(x,d), loc (d,la")
- MCDs
- car(x) v1(x), v2(x)
- sell(x,d) v3(x,d)
- loc(d,"ca") v3(x,d)
- sell(x,d),loc(d,la") v4(x)
- Contained rewritings - using MCDs to cover all
query subgoals, without overlap - P1 q(x) - v1(x), v3(x,d), v3(x,d)
- P2 q(x) - v2(x), v3(x,d), v3(x,d)
- P3 q(x) - v1(x), v4(x)
- P4 q(x) - v2(x), v4(x)
- MCR union of these four contained rewritings.
55Related references
Query
Source views
- Other algorithms on AQUV
- Bucket, Inverse-rule
- Generating efficient equivalent rewritings of
queries using views - CoreCover algorithm Afrati, Li, Ullman,
SIGMOD01 - Handling arithmetic comparisons and dropped
attributes - Afrati, Li, Mitra, PODS02
- Afrati, Li, Mitra, EDBT04