Title: LocalasView Data Integration
1Local-as-ViewData Integration
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Database Information Systems
- October 16, 2008
2Administrivia
- Next week
- Please read Section 3, 6, 7 in the Deshpande et
al. survey - Midterm assignment will be handed out Thursday,
with a one week clock
3Recall Kinds of Schema Mappings
- Global As View (GAV)
- ?x M(x) ? ?y1y2 R1(x1y1) ? R2(x2y2) ?
c(x1y1x2y2) where x ? x1x2xn - Local As View (LAV)
- ?x,y,z MR(x,y) ? MS(y,z) ? c(xyz) ? ?w
R1(x1y1z1w) where x1y1z1 ? xyz - Global-Local As View (GLAV), aka Tuple-Generating
Dependencies (TGDs) - ?xyz MR(x,y) ? MS(y,z) ? c1(xyz) ? ?y1y2 R1(w1y1)
? R2(w2y2) ? c2(xyzy1y2) ? where w1w2 ? xyz - Query is of the form Q(x) - MR(x1,y1),
MS(x2,y2),
4Global-As-View Mappings
- Very easy to implement doesnt require any new
logic on the part of a regular DBMS engine - For instance, Starburst QGM rewrites would work
- But some drawbacks primarily that
- We dont have a mechanism to describe when a
source contains only a subset of the data in the
mediated schema - e.g., All books from this source are of type
textbook - The mediated schema often needs to change as we
add sources it is somewhat brittle because
its defined in terms of sources
5An Alternate ApproachLocal-As-View
- When you integrate something, you have some
conceptual model of the integrated domain - Define that as a basic frame of reference,
everything else as a view over it - Local as View using mappings that are
conjunctive queries - May have overlapping/incomplete sources
- Define each source as the subset of a query over
the mediated schema the open world assumption - We can use selection or join predicates to
specify that a source contains a range of values - ComputerBooks() ? Books(Title, , Subj), Subj
Computers
6The Local-as-View Model
- The basic model is the following
- Local sources are views over the mediated
schema - Sources have the data mediated schema is
virtual - Sources may not have all the data from the domain
open-world assumption - The system must use the sources (views) to answer
queries over the mediated schema
7Answering Queries Using Views
- Assumption conjunctive queries, set semantics
- Suppose we have a mediated schema show(ID,
title, year, genre), rating(ID, stars, source) - A conjunctive query might be q(t) - show(i,
t, y, g), rating(i, 5, s) - Recall intuitions about this class of queries
- Adding a conjunct to a query (e.g., t 1997)
removes answers from the result but never adds
any - Any conjunctive query with at least the same
constraints conjuncts will give valid answers
8Why This Class of Mappings Queries?
- Abiteboul Duschka showed the data complexity of
answering queries using views with OWA
- Note that the common inflationary semantics
version of Datalog must terminate in PTIME, even
with recursion
9Query Answering
- Suppose we have the query
- q(t) - show(i, t, y, g), rating(i, 5, s)
- and sources
- 5star(i) - show(i, t, y, g), rating(i, 5, s)
- TVguide(t,y,g,r) - show(i, t, y, g), rating(i,
r, TVGuide) - movieInfo(i,t,y,g) - show(i, t, y, g)
- critics(i,r, s) - rating(i, r, s)
- goodMovies(t,y) - show(i, t, y, drama),
rating(i, 5, s), y 1997 - We want to compose the query with the source
mappings but theyre in the wrong direction!
10Query Answering
- Suppose we have the query
- q(t) - show(i, t, y, g), rating(i, 5, s)
- and sources
- 5star(i) ? show(i, t, y, g), rating(i, 5, s)
- TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide) - movieInfo(i,t,y,g) ? show(i, t, y, g)
- critics(i,r, s) ? rating(i, r, s)
- goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997 - We want to compose the query with the source
mappings but theyre in the wrong direction!
11Inverse Rules
- We can take every mapping definition and invert
it, though sometimes we may have insufficient
information - If
- 5star(i) ? show(i, t, y, g), rating(i, 5, s)
- then it is also the case that
- show(i,??? ,??? ,??? ,???) ? 5star(i)
- and we can write this as a datalog rule
- show(i,-,- ,-,-) - 5star(i)
- But how to handle the absence of attributes?
- We know that there must be AT LEAST one instance
of ??? for each attribute for each show ID - So we might simply insert a NULL and define that
NULL means unknown (as opposed to missing)
12But NULLs Lose Information
- Suppose we take these rules and ask for
- q(t) - show(i, t, y, g), rating(i, 5, s)
- If we look at the rule
- goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997 - By inspection, q(t) ? goodMovies(t,y)
- But if apply our inversion procedure, we get
- show(i, t, y, g) ? goodMovies(t,y), i NULL, g
drama, y 1997 - rating(i, r, s) ? goodMovies(t,y), i NULL, r
5, s NULL - We need a special NULL so we can figure out
which IDs and ratings match up
13The Solution Skolem Functions
- Skolem functions
- Conceptual perfect hash functions
- Each function returns a unique, deterministic
value for each combination of input values - Every function returns a non-overlapping set of
values (Skolem function F will never return a
value that matches any of Skolem function Gs
values) - Skolem functions wont ever be part of the answer
set or the computation it doesnt produce real
values - Theyre just a way of logically generating
special NULLs
14Query Answering Using Inverse Rules
- Invert all rules using the procedures described
- Take the query and the possible rule expansions
and execute them in a Datalog interpreter - Really the same process as GAV unfolding!
- In the previous query, we expand with all
combinations of expansions of show and of rating
every possible way of combining and
cross-correlating info from different sources - Then discard unsatisfiable rewritings via
unification, i.e., substituting in constants from
the query for variables in the view - Finally, execute the union of all satisfiable
rewritings
15Pros Cons of Inverse Rules
- Works even with recursive queries, binding
patterns, FDs on schemas - Generally, they take view definitions, split
them, and re-join them to produce answers - Not very efficient
- No treatment of lt, gt predicates
- Can we do better? i.e., be more efficient?
16The Bucket Algorithm
- Given a query Q with relations and predicates
- Create a bucket for each subgoal in Q
- Iterate over each view (source mapping)
- If source includes buckets subgoal
- Create mapping between qs vars and the views
var at the same position - If satisfiable with substitutions, add to bucket
- Do cross-product of buckets, see if result is
contained (exptime, but queries are probably
relatively small) - For each result, do a containment check to make
sure the rewriting is contained within the query
17Lets Try a Bucket Example
- Query
- q(t) - show(i, t, y, g), rating(i, 5, s)
- Sources
- 5star(i) ? show(i, t, y, g), rating(i, 5, s)
- TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide) - movieInfo(i,t,y,g) ? show(i, t, y, g)
- critics(i,r,s) ? rating(i, r, s)
- goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997 - good98(t,y) ? show(i, t, y, drama), rating(i,
5, s), y 1998
18Populating the Buckets
19Evaluation
20Example of Containment Testing
- Suppose we have two queriesq1(t) - show(i, t,
y, g), rating(i, 5, s) , y 1997 q2(t,y) -
show(i, t, y, drama), rating(i, 5, s) - Intuitively, q1 must contain the same or fewer
answers vs. q2 - It has all of the same conditions, except one
extra conjunction (i.e., its more restricted) - Theres no union or any other way it can add more
data - We can say that q2 contains q1 because this holds
for any instance of our mediated schema
21Checking Containment via Canonical Databases
- To test for q1 µ q2
- Create a canonical DB that contains a tuple for
each subgoal in q1 - Execute q2 over it
- If q2 returns a tuple that matches the head of
q1, then q1 µ q2 - (This is an NP-complete algorithm in the size of
the query. Testing for full first-order logic
queries is undecidable!!!) - Lets see this for our example
22Example Canonical DB
- q1(t) - show(i, t, 1997, g), rating(i, 5, s)
- q2(t,y) - show(i, t, y, drama), rating(i, 5,
s)
rating
show
Need to get tuple lttgt in executing q2 over this
database
What if q2 didnt ask for g drama?
23Buckets, Rev. 2 The MiniCon Algorithm
- A much smarter bucket algorithm
- In many cases, we dont need to perform the
cross-product of all items in all buckets - Eliminates the need for the containment check
- This and the Chase Backchase strategy of
Tannen et al are the two methods most used in
virtual data integration today - The chase procedure (though not the backchase)
can be achieved using a Datalog program that
looks much like the inverse rules!
24Minicon Descriptions (MCDs)
- Basically, a modification to the bucket approach
- head homomorphism defines what variables must
be equated by the query - Variable-substituted version of the subgoals
- Mapping of variable names
- Info about whats covered
- Property 1
- If a variable occurs in the head of a query, then
there must be a corresponding variable in the
head of the MCD view - If a variable participates in a join predicate in
the query, then it must be in the head of the view
25MCD Construction
- For each subgoal of the query
- For each subgoal of each view
- Choose the least restrictive head homomorphism to
match the subgoal of the query - If we can find a way of mapping the variables,
then add MCD for each possible maximal
extension of the mapping that satisfies Property 1
26MCDs for Our Example
- 5star(i) ? show(i, t, y, g), rating(i, 5, s)
- TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide) - movieInfo(i,t,y,g) ? show(i, t, y, g)
- critics(i,r,s) ? rating(i, r, s)
- goodMovies(t,y) ? show(i, t, 1997, drama),
rating(i, 5, s) - good98(t,y) ? show(i, t, 1998, drama),
rating(i, 5, s)
q(t) - show(i, t, y, g), rating(i, r, s), r 5
27Combining MCDs
- Now look for ways of combining pairwise disjoint
subsets of the goals - Greatly reduces the number of candidates!
- Also proven to be correct without the use of a
containment check - Variations need to be made for
- Constants in general (I sneaked those in)
- Semi-interval predicates (x lt c)
- Note that full-blown inequality predicates are
co-NP-hard in the size of the data, so they dont
work
28MiniCon and LAV Summary
- The state-of-the-art for AQUV in the relational
world of data integration - Its been extended to support conjunctive
XQuery as well - Scales to large numbers of views, which we need
in LAV data integration - Chase Backchase by Tannen et al.
- A procedure that has very close connections to
inverse rules - Slightly more general in some ways but
- Typically produces equivalent rewritings, not
maximally contained ones - Not always polynomial in the size of the data
(though for some useful settings it is!!!)