Data%20Exchange:%20Semantics%20and%20Query%20Answering - PowerPoint PPT Presentation

About This Presentation
Title:

Data%20Exchange:%20Semantics%20and%20Query%20Answering

Description:

Ronald Fagin -- IBM Almaden Research Center. Phokion G. Kolaitis -- UC Santa Cruz ... How to restructure data from a source schema to a target schema, according to a ... – PowerPoint PPT presentation

Number of Views:93
Avg rating:3.0/5.0
Slides: 34
Provided by: diaUni
Category:

less

Transcript and Presenter's Notes

Title: Data%20Exchange:%20Semantics%20and%20Query%20Answering


1
Data Exchange Semantics and Query Answering
  • Ronald Fagin -- IBM Almaden Research
    Center
  • Phokion G. Kolaitis -- UC Santa Cruz
  • Renee J. Miller -- University of Toronto
  • Lucian Popa -- IBM Almaden Research
    Center

IBM Almaden - November 12, 2002
(To appear in ICDT 2003)
2
Motivation and Overview
  • Data exchange problem
  • How to restructure data from a source schema to a
    target schema, according to a given specification
  • Main motivation for this work
  • Understanding of fundamental issues that lie
    underneath data exchange systems such as EXPRESS
    and Clio
  • Main challenge
  • Inherent under-specification
  • Specification (as we shall see) must be simple
    and intuitive, but
  • There are many ways in which the restructuring
    can be performed !
  • Question did we make the right choice in the
    design of Clio ?
  • Our approach (only relational case, so far)
  • Define and study universal solutions
  • Show this is the best way of performing data
    exchange
  • Study computational aspects
  • Study what happens after data exchange query
    answering

3
The Data Exchange Problem
Source schema S
Target schema T
?t
?st
I
J
  • Assume a data exchange setting
  • source schema S,
  • target schema T with a set ?t dependencies (see
    next)
  • set ?st source-to-target dependencies (see next)
  • The data exchange problem is the following
  • Input
  • source instance I
  • Output
  • target instance J such that ltI, Jgt ? ?st and
    J ? ?t
  • (call such J a solution for I )

4
Source-to-target Dependencies
  • For most practical purposes, ?st contains
  • source-to-target tuple-generating dependencies
    (tgds)
  • ?S(x) ? ?y ?T(x, y)
  • e.g.
  • DeptEmp(did, mgr_name, eid)
  • ? ?M. Dept (did, M, mgr_name) ? Emp (eid, did)
  • (Move data from source table DeptEmp into two
    target tables, Dept and Emp. The existential
    variable M is an unspecified manager id)

Dept did mgr_id mgr_name
DeptEmp did mgr_name eid
Emp eid did
5
Target Dependencies
  • The second, equally important, part of the
    specification, are the target dependencies ?t
  • tgds ?T(x) ? ?y ?T(x, y)
  • e.g.
  • Dept (did, mgr_id, mgr_name)
  • ? ?D. Emp (mgr_id, D)
  • (A foreign key constraint in the target)
  • equality generating dependencies (egds)
  • ?T(x) ? (x1x2)
  • e.g.
  • Emp (e, d1) ? Emp (e, d2) ? (d1 d2)
  • (A target key constraint)

Dept did mgr_id mgr_name
Emp eid did
6
Questions (To be Answered Next)
  • When more than one solution exists, how do we
    choose a best one ?
  • How do we compute a best solution ?
  • Is there always a solution ? Is there always a
    best solution ?
  • How does query answering on the chosen solution
    behave ?

7
Universal Solutions Best Solutions
8
Existence of Multiple Solutions
source
target
X0 , Y0 , Z0 represent unknown values (or
nulls)
P
A B C
T
Q
A B C
A B C
R
A B C
  • There may be many solutions for the target
    instance (J, J1, J2, etc.)
  • However, J seems to be more general
  • there exist homomorphisms h1 J ? J1 and h2 J ?
    J2 (see definition next)
  • but none from J1 or J2 to J
  • intuitively, J1 and J2 have extra information

9
Homomorphisms
  • As we have seen, the values of a target instance
    can be either
  • constants (i.e. values coming from the source
    instance), or
  • nulls (unknown values)
  • Definition. Assume J1 and J2 are such target
    instances. A homomorphism h J1 -gt J2 is a
    mapping from values of J1 to values of J2 such
    that
  • h(c) c, for constants c
  • (nulls of J1 can be mapped to
    any values of J2)
  • for every tuple lta1, , angt in relation T of
    instance J1
  • lt h(a1), h(an) gt must be a tuple in
    relation T of instance J2
  • Example

10
Universal Solution
  • Definition. Assume a data exchange setting (S, T,
    ?st, ?t). Given source instance I, a universal
    solution for I is a target instance J such that
  • (1) J is a solution for I
  • (2) for every solution J for I,
  • there exists homomorphism h J ? J
  • For the previous example, J is a universal
    solution. J1 and J2 are not.
  • Among all solutions, universal solutions are
    special
  • They contain no more and no less than the amount
    of information given by the specification

11
  • Fact
  • Uniqueness up to homomorphic equivalence
  • If J1 and J2 are universal for I then there are
    homomorphisms between J1 and J2 in both
    directions
  • Representation of the space of solutions
  • Sol(I1) Sol(I2) iff J1 and J2 are
    homomorphically equivalent
  • We adopt the universal solution as the notion of
    best solution.
  • Later we will see another justification for
    universal solutions in terms of query answering.

12
  • When do universal solutions exist ?
  • How do we compute a universal solution ?

13
Computing Universal Solutions
14
Chase
  • We canonically generate a universal solution by
    using the chase
  • Given source instance I, start with an empty
    target instance J
  • Generate tuples in J by applying the
    dependencies in ?st and ?t.
  • Example

Dept did mgr_id mgr_name
DeptEmp did mgr_name eid
Emp eid did
15
  • This process is repeatedly applied
  • for all the source tuples and for the generated
    tuples,
  • as long as there are dependencies that are not
    yet satisfied
  • The chase may be infinite (cyclic ?t )
  • or it may fail (e.g. target key constraints
    that are not possible to satisfy for the given
    source data)
  • (details in the paper)
  • However, if the chase successfully terminates,
    the resulting target instance is a solution.

16
Canonical Generation of Universal Solutions
  • Theorem. Assume a data exchange setting (S, T,
    ?st, ?t). Given source instance I
  • If the chase is finite and successful then its
    result is a universal solution.
  • If the chase fails then there is no solution.
  • Thus, the chase is a procedure for computing
    universal solutions, provided that
  • Solutions exist, and
  • The chase is finite
  • We call universal solutions computed by the chase
    canonical universal solutions

When can we guarantee that the chase is finite ?
17
Weakly Acyclic Sets of Dependencies
  • Some cyclic sets of dependencies may cause
    infinite chases
  • In such case no universal solution may exist, and
    the semantics of the data exchange is undefined
  • Still there are cyclic sets of dependencies that
    behave well and are quite useful
  • Weakly acyclic sets of dependencies (defined in
    the paper)
  • Cover many practical cases of target constraints
  • Allow for restricted cyclicity
  • The chase is guaranteed to be finite

18
Polynomial-Time Chase
Theorem. Let ? be a weakly acyclic set of
dependencies. For every instance K, the chase of
K with ? can be computed in polynomial time.
  • Corollary. Assume a data exchange setting (S, T,
    ?st, ?t) such that ?t is a weakly acyclic set of
    dependencies.
  • For every source instance I, the existence of a
    solution can be checked in polynomial time
  • For every source instance I, if a solution
    exists then a universal solution can be produced
    in polynomial time.

19
  • Next what happens after data exchange ?
  • In particular, how is subsequent query answering
    affected by our choice of a solution (universal
    solution) ?

20
Query Answering
21
Query Evaluation on a Solution
q
Target schema T
Source schema S
?t
?st
I
J
  • Assume a fixed data exchange setting with a
    source instance I. Suppose that a System 1
    chooses a solution J for data exchange.
  • A query q can now be asked against the target.
  • The evaluation of q, in System 1, is q(J).
  • However, a System 2 materializing a different
    solution J may give a different evaluation
    q(J).
  • Different choices of J (for the same I) imply
    possibly different query evaluations.
  • Is there a notion of the right set of answers
    to q with respect to I ?

22
Certain Answers
  • We will use a notion that has been around in the
    context of data integration and incomplete
    databases, where queries are asked against a set
    of possible databases.
  • Definition. Given I and q, a tuple t is a certain
    answer if
  • t ? q(J), for every solution J
  • Notation certain(q, I) the set of all
    certain answers
  • Thus, t is certain if it is in the answer of q on
    every solution.
  • The certain answers provide well-defined
    semantics to query answering because they are
    independent of the choice of a solution.

23
  • Can we compute the certain answers based just on
    our chosen (universal) solution ?

24
Positive Queries
  • Proposition. Assume a data exchange setting (S,
    T, ?st, ?t) and a source instance I.
  • Let q be a positive query. If J is a universal
    solution, then certain(q, I) q(J)? .
  • Let J be a solution such that for every positive
    query q we have that certain(q, I) q(J)? . Then
    J is a universal solution.
  • Note In the above
  • Positive query means union of SPJ queries
  • q(J)? means evaluate q on J and then throw away
    tuples that contain nulls)
  • Thus, the certain answers of positive queries can
    be computed by evaluating them on any universal
    solution.
  • Moreover, this property characterizes universal
    solutions.

25
Conjunctive Queries with Inequalities
  • The situation changes when negation is involved
    (even in the very simple form of ?).
  • Example
  • It can be verified that
  • lta0,a0gt ? q(J)?, but lta0,a0gt ? q(J2) (thus, not
    a certain answer).
  • Hence certain(q, I) ? q(J)?
  • The universal solution gives extra answers

26
  • For conjunctive queries with inequalities, we
    have seen that simple query evaluation on a
    universal solution is not enough for computing
    the certain answers.
  • Question Can we find a different SQL query q
    such that when evaluated on a universal solution
    gives the set of certain answers of q ?
  • There are examples for which such query q
    exists.
  • However, we show next that the answer is no, in
    general.

27
Complexity Two or More Inequalities
Theorem. Computing the certain answers of unions
of conjunctive queries with at most two
inequalities per term is coNP-hard, even in a
restricted data exchange setting (LAV).
  • AD98 proved a similar result for the case of
    conjunctive queries with six or more
    inequalities.
  • The coNP-hardness implies
  • the certain answers cannot be computed by
    evaluating the query q (or any other SQL query
    q) on a polynomial-time generated universal
    solution (unless P NP).

28
Complexity One Inequality
Theorem. Assume a data exchange setting (S, T,
?st, ?t) such that ?t is a weakly acyclic set of
dependencies. Let q be a union of conjunctive
queries with at most one inequality per term. Let
I be a source instance and let J be an arbitrary
universal solution for I. Then there exists a
polynomial-time algorithm with input J that
computes certain(q, I).
  • Thus, computing the certain answers for such
    queries is a tractable problem.
  • Moreover, this computation can take place on any
    universal solution.
  • The universal solution has all the information
    needed to compute the certain answers.
  • We show next that the problem of computing the
    certain answers, even for this tractable case,
    cannot be solved by means of SQL query
    evaluation.

29
First-Order Inexpressibility
Theorem. There exists a data exchange setting and
a boolean conjunctive query q with one
inequality, for which there is no first-order
query q over the canonical universal solution
such that certain(q, I) q(J) .
  • This is a strong inexpressibility result that
    shows that in data exchange we cannot use the
    notion of certain answers for answering queries
    with inequalities.
  • (In practice, instead of going for certain
    answers, we should just use query evaluation on
    the universal solution)
  • The proof uses an original combination of finite
    model theory techniques and the chase.

30
Conclusions
  • Universal solutions are a good candidate for
    using in data exchange
  • Clio produces such universal solution (in the
    relational case)
  • All universal solutions are equally good for
    answering positive queries.
  • Simple query evaluation has the same semantics as
    that of the certain answers.
  • For queries with inequalities, different
    universal solutions may give different query
    evaluations which may yet be different from the
    certain answers.
  • There is no hope to find the certain answers by
    means of SQL query evaluation on a universal
    solution

31
Future Work
  • Among all universal solutions, is there a
    universal solution that approximates, in a best
    way, the certain answers ? If yes, can this be
    computed efficiently ?
  • Extension to semantics and query answering for
    data exchange in the nested (XML) case.

32
The End
33
Relationship to Clio
  • Source-to-target tgds are the same formalism that
    Clio uses internally (for the relational case)
  • ?st is generated by Clio in the semantic
    translation phase VLDB02 from correspondences.
  • User input
  • Then Clio generates, based on ?st , a set of
    queries in the data translation phase VLDB02
  • These queries compute a solution
  • Is Clios solution a good one ? (Since other
    solutions are also possible)
  • Here we try to understand, formally, the concept
    of good solutions
  • ?t is more general than the target constraints
    that Clio can currently handle.
Write a Comment
User Comments (0)
About PowerShow.com