LocalasView Data Integration - PowerPoint PPT Presentation

1 / 28

About This Presentation

Title:

LocalasView Data Integration

Description:

... as the subset of a query over the mediated schema the 'open world assumption' ... Sources may not have all the data from the domain 'open-world assumption' ... – PowerPoint PPT presentation

Number of Views:79

Avg rating:3.0/5.0

Slides: 29

Provided by: zack4

Learn more at: https://www.seas.upenn.edu

Category:

more less

Transcript and Presenter's Notes

Title: LocalasView Data Integration

1
Local-as-ViewData Integration

Zachary G. Ives
University of Pennsylvania
CIS 650 Database Information Systems
October 16, 2008

2
Administrivia

Next week
Please read Section 3, 6, 7 in the Deshpande et
al. survey
Midterm assignment will be handed out Thursday,
with a one week clock

3
Recall Kinds of Schema Mappings

Global As View (GAV)
?x M(x) ? ?y1y2 R1(x1y1) ? R2(x2y2) ?
c(x1y1x2y2) where x ? x1x2xn
Local As View (LAV)
?x,y,z MR(x,y) ? MS(y,z) ? c(xyz) ? ?w
R1(x1y1z1w) where x1y1z1 ? xyz
Global-Local As View (GLAV), aka Tuple-Generating
Dependencies (TGDs)
?xyz MR(x,y) ? MS(y,z) ? c1(xyz) ? ?y1y2 R1(w1y1)
? R2(w2y2) ? c2(xyzy1y2) ? where w1w2 ? xyz
Query is of the form Q(x) - MR(x1,y1),
MS(x2,y2),

4
Global-As-View Mappings

Very easy to implement doesnt require any new
logic on the part of a regular DBMS engine
For instance, Starburst QGM rewrites would work
But some drawbacks primarily that
We dont have a mechanism to describe when a
source contains only a subset of the data in the
mediated schema
e.g., All books from this source are of type
textbook
The mediated schema often needs to change as we
add sources it is somewhat brittle because
its defined in terms of sources

5
An Alternate ApproachLocal-As-View

When you integrate something, you have some
conceptual model of the integrated domain
Define that as a basic frame of reference,
everything else as a view over it
Local as View using mappings that are
conjunctive queries
May have overlapping/incomplete sources
Define each source as the subset of a query over
the mediated schema the open world assumption
We can use selection or join predicates to
specify that a source contains a range of values
ComputerBooks() ? Books(Title, , Subj), Subj
Computers

6
The Local-as-View Model

The basic model is the following
Local sources are views over the mediated
schema
Sources have the data mediated schema is
virtual
Sources may not have all the data from the domain
open-world assumption
The system must use the sources (views) to answer
queries over the mediated schema

7
Answering Queries Using Views

Assumption conjunctive queries, set semantics
Suppose we have a mediated schema show(ID,
title, year, genre), rating(ID, stars, source)
A conjunctive query might be q(t) - show(i,
t, y, g), rating(i, 5, s)
Recall intuitions about this class of queries
Adding a conjunct to a query (e.g., t 1997)
removes answers from the result but never adds
any
Any conjunctive query with at least the same
constraints conjuncts will give valid answers

8
Why This Class of Mappings Queries?

Abiteboul Duschka showed the data complexity of
answering queries using views with OWA

Note that the common inflationary semantics
version of Datalog must terminate in PTIME, even
with recursion

9
Query Answering

Suppose we have the query
q(t) - show(i, t, y, g), rating(i, 5, s)
and sources
5star(i) - show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) - show(i, t, y, g), rating(i,
r, TVGuide)
movieInfo(i,t,y,g) - show(i, t, y, g)
critics(i,r, s) - rating(i, r, s)
goodMovies(t,y) - show(i, t, y, drama),
rating(i, 5, s), y 1997
We want to compose the query with the source
mappings but theyre in the wrong direction!

10
Query Answering

Suppose we have the query
q(t) - show(i, t, y, g), rating(i, 5, s)
and sources
5star(i) ? show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide)
movieInfo(i,t,y,g) ? show(i, t, y, g)
critics(i,r, s) ? rating(i, r, s)
goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997
We want to compose the query with the source
mappings but theyre in the wrong direction!

11
Inverse Rules

We can take every mapping definition and invert
it, though sometimes we may have insufficient
information
If
5star(i) ? show(i, t, y, g), rating(i, 5, s)
then it is also the case that
show(i,??? ,??? ,??? ,???) ? 5star(i)
and we can write this as a datalog rule
show(i,-,- ,-,-) - 5star(i)
But how to handle the absence of attributes?
We know that there must be AT LEAST one instance
of ??? for each attribute for each show ID
So we might simply insert a NULL and define that
NULL means unknown (as opposed to missing)

12
But NULLs Lose Information

Suppose we take these rules and ask for
q(t) - show(i, t, y, g), rating(i, 5, s)
If we look at the rule
goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997
By inspection, q(t) ? goodMovies(t,y)
But if apply our inversion procedure, we get
show(i, t, y, g) ? goodMovies(t,y), i NULL, g
drama, y 1997
rating(i, r, s) ? goodMovies(t,y), i NULL, r
5, s NULL
We need a special NULL so we can figure out
which IDs and ratings match up

13
The Solution Skolem Functions

Skolem functions
Conceptual perfect hash functions
Each function returns a unique, deterministic
value for each combination of input values
Every function returns a non-overlapping set of
values (Skolem function F will never return a
value that matches any of Skolem function Gs
values)
Skolem functions wont ever be part of the answer
set or the computation it doesnt produce real
values
Theyre just a way of logically generating
special NULLs

14
Query Answering Using Inverse Rules

Invert all rules using the procedures described
Take the query and the possible rule expansions
and execute them in a Datalog interpreter
Really the same process as GAV unfolding!
In the previous query, we expand with all
combinations of expansions of show and of rating
every possible way of combining and
cross-correlating info from different sources
Then discard unsatisfiable rewritings via
unification, i.e., substituting in constants from
the query for variables in the view
Finally, execute the union of all satisfiable
rewritings

15
Pros Cons of Inverse Rules

Works even with recursive queries, binding
patterns, FDs on schemas
Generally, they take view definitions, split
them, and re-join them to produce answers
Not very efficient
No treatment of lt, gt predicates
Can we do better? i.e., be more efficient?

16
The Bucket Algorithm

Given a query Q with relations and predicates
Create a bucket for each subgoal in Q
Iterate over each view (source mapping)
If source includes buckets subgoal
Create mapping between qs vars and the views
var at the same position
If satisfiable with substitutions, add to bucket
Do cross-product of buckets, see if result is
contained (exptime, but queries are probably
relatively small)
For each result, do a containment check to make
sure the rewriting is contained within the query

17
Lets Try a Bucket Example

Query
q(t) - show(i, t, y, g), rating(i, 5, s)
Sources
5star(i) ? show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide)
movieInfo(i,t,y,g) ? show(i, t, y, g)
critics(i,r,s) ? rating(i, r, s)
goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997
good98(t,y) ? show(i, t, y, drama), rating(i,
5, s), y 1998

18
Populating the Buckets
19
Evaluation

On the board

20
Example of Containment Testing

Suppose we have two queriesq1(t) - show(i, t,
y, g), rating(i, 5, s) , y 1997 q2(t,y) -
show(i, t, y, drama), rating(i, 5, s)
Intuitively, q1 must contain the same or fewer
answers vs. q2
It has all of the same conditions, except one
extra conjunction (i.e., its more restricted)
Theres no union or any other way it can add more
data
We can say that q2 contains q1 because this holds
for any instance of our mediated schema

21
Checking Containment via Canonical Databases

To test for q1 µ q2
Create a canonical DB that contains a tuple for
each subgoal in q1
Execute q2 over it
If q2 returns a tuple that matches the head of
q1, then q1 µ q2
(This is an NP-complete algorithm in the size of
the query. Testing for full first-order logic
queries is undecidable!!!)
Lets see this for our example

22
Example Canonical DB

q1(t) - show(i, t, 1997, g), rating(i, 5, s)
q2(t,y) - show(i, t, y, drama), rating(i, 5,
s)

rating
show
Need to get tuple lttgt in executing q2 over this
database
What if q2 didnt ask for g drama?
23
Buckets, Rev. 2 The MiniCon Algorithm

A much smarter bucket algorithm
In many cases, we dont need to perform the
cross-product of all items in all buckets
Eliminates the need for the containment check
This and the Chase Backchase strategy of
Tannen et al are the two methods most used in
virtual data integration today
The chase procedure (though not the backchase)
can be achieved using a Datalog program that
looks much like the inverse rules!

24
Minicon Descriptions (MCDs)

Basically, a modification to the bucket approach
head homomorphism defines what variables must
be equated by the query
Variable-substituted version of the subgoals
Mapping of variable names
Info about whats covered
Property 1
If a variable occurs in the head of a query, then
there must be a corresponding variable in the
head of the MCD view
If a variable participates in a join predicate in
the query, then it must be in the head of the view

25
MCD Construction

For each subgoal of the query
For each subgoal of each view
Choose the least restrictive head homomorphism to
match the subgoal of the query
If we can find a way of mapping the variables,
then add MCD for each possible maximal
extension of the mapping that satisfies Property 1

26
MCDs for Our Example

5star(i) ? show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide)
movieInfo(i,t,y,g) ? show(i, t, y, g)
critics(i,r,s) ? rating(i, r, s)
goodMovies(t,y) ? show(i, t, 1997, drama),
rating(i, 5, s)
good98(t,y) ? show(i, t, 1998, drama),
rating(i, 5, s)

q(t) - show(i, t, y, g), rating(i, r, s), r 5
27
Combining MCDs

Now look for ways of combining pairwise disjoint
subsets of the goals
Greatly reduces the number of candidates!
Also proven to be correct without the use of a
containment check
Variations need to be made for
Constants in general (I sneaked those in)
Semi-interval predicates (x lt c)
Note that full-blown inequality predicates are
co-NP-hard in the size of the data, so they dont
work

28
MiniCon and LAV Summary

The state-of-the-art for AQUV in the relational
world of data integration
Its been extended to support conjunctive
XQuery as well
Scales to large numbers of views, which we need
in LAV data integration
Chase Backchase by Tannen et al.
A procedure that has very close connections to
inverse rules
Slightly more general in some ways but
Typically produces equivalent rewritings, not
maximally contained ones
Not always polynomial in the size of the data
(though for some useful settings it is!!!)