Title: Computing Full Disjunctions
1Computing Full Disjunctions
- Yaron Kanza
- Yehoshua Sagiv
- The Selim and Rachel Benin School of Engineering and Computer Science
- The Hebrew University of Jerusalem
2Overview of the Talk
- OR-semantics and weak semantics for querying incomplete data
- Complexity of query evaluation
- Full disjunctions as a special case of weak semantics
- Generalizing full disjunctions: the join constraints are not restricted to equality constraints
- Lower bounds for some related problems
3Querying Incomplete Data Requires a Special Semantics
- Usually, answers to a query are complete assignments of database objects (or values) to the query variables
- Consequently, partial information is lost
- For example, dangling tuples are lost when joining several relations
- The purpose of outerjoins and full disjunctions is to solve this problem: answers can be partial assignments (to some of the variables)
4Querying Incomplete Semistructured Data
- In semistructured data, incompleteness of data is prevalent
- OR-semantics and weak semantics were introduced so that queries over semistructured data would return maximal answers rather than complete answers [Kanza, Nutt & Sagiv 1999]
5In the Semistructured Data Model
- Both data and queries are labeled rooted directed graphs
- Query nodes are variables
- Database nodes are objects
- Matchings are assignments of database objects to query variables, such that
- The database root is assigned to the query root, and
- Labels are preserved
6A Semistructured Database About Movies
[Figure: a rooted, edge-labeled graph with object nodes 1-11. Edge labels: movie, actor, title, name, year, language, date of birth, director, acted in. Atomic values: Zelig, Antz, Woody Allen, 1/12/1935, 1983, 1998, English.]
7A Query
[Figure: the query graph, rooted at v1, with variables v1, v2, v3, w1, w2, w3, w4 and edge labels actor, movie, name, title, director, language, date of birth, acted in.]
Under complete semantics, the query returns
actor-movie pairs, such that the actor played in
the movie and was also the director of the movie
8A Complete Matching of the Query Variables to Database Objects
[Figure: the movie database shown next to the query graph, with a complete matching of the query variables to database objects.]
9Constraints on Complete Matchings
- The root constraint is satisfied if the query root is mapped to the database root
- A query edge is an edge constraint
- A query edge with a label l is satisfied if it is mapped to a database edge with the same label l
[Figure: the query root mapped to the database root.]
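The two constraints above can be checked mechanically. The following is a minimal sketch, assuming graphs are stored as sets of (source, label, target) triples and a matching is a dictionary from query variables to object ids. The toy graph, the edge directions and the name `is_complete_matching` are illustrative assumptions, not the exact database or notation of the slides.

```python
# Hypothetical toy data: edges are (source, label, target) triples.
DB_ROOT = 1
DB_EDGES = {
    (1, "actor", 2),
    (1, "movie", 3),
    (2, "acted in", 3),
    (3, "director", 2),
}

QUERY_ROOT = "v1"
QUERY_EDGES = [
    ("v1", "actor", "v2"),
    ("v1", "movie", "v3"),
    ("v2", "acted in", "v3"),
    ("v3", "director", "v2"),
]


def is_complete_matching(matching, query_edges, query_root, db_edges, db_root):
    """Return True if the root constraint and every edge constraint hold."""
    # Root constraint: the query root is mapped to the database root.
    if matching.get(query_root) != db_root:
        return False
    # Edge constraints: each query edge maps to a database edge with the
    # same label. An unmapped endpoint (None) can never match an edge.
    return all(
        (matching.get(u), label, matching.get(v)) in db_edges
        for (u, label, v) in query_edges
    )
```

With this toy data, the matching `{"v1": 1, "v2": 2, "v3": 3}` satisfies all constraints; removing the director edge from the database makes it fail.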
10Suppose that Node 6 is Missing
[Figure: the movie database with node 6 removed.]
11An Incomplete Matching
[Figure: the movie database without node 6, shown next to the query graph, with an incomplete matching of the query variables.]
This matching is maximal
12The Reachability Constraint on Partial Matchings
- A query node v that is mapped to a database
object o satisfies the reachability constraint if
there is a path from the query root to v, such
that all edge constraints along this path are
satisfied
13Weak Satisfaction of Edge Constraints
- An edge constraint is weakly satisfied if either
- It is satisfied (as defined earlier), or
- One (or more) of its nodes is mapped to a null value
14Weak Matchings
- A partial matching is a weak matching if
- The root constraint is satisfied
- The reachability constraint is satisfied by every query node that is mapped to a database node
- Every edge constraint is weakly satisfied
15A Weak Matching
[Figure: the movie database without node 6, shown next to the query graph, with a weak matching of the query variables.]
16A Movie Database
[Figure: the movie database with the director edge removed.]
Consider the case where the director edge is missing
17An Incomplete Matching that is Not a Weak Matching
[Figure: the movie database without the director edge, shown next to the query graph, with an incomplete matching that is not a weak matching.]
18OR Matchings
- A partial matching is an OR matching if
- The root constraint is satisfied
- The reachability constraint is satisfied by every query node that is mapped to a database node
Unlike in a weak matching, an edge constraint in an OR matching does not have to be weakly satisfied
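The difference between the two definitions can be seen in a small sketch. It assumes the same representation as before (edge triples, a dictionary for the matching, `None` for null); the function names and the toy database, in which the director edge is missing, are illustrative assumptions rather than the exact graphs of the slides.

```python
def edge_satisfied(edge, matching, db_edges):
    """An edge constraint is satisfied if it maps to an equally labeled edge."""
    u, label, v = edge
    return (matching.get(u), label, matching.get(v)) in db_edges


def reachable_nodes(matching, query_edges, query_root, db_edges):
    """Query nodes reachable from the root along satisfied edge constraints."""
    reached = {query_root}
    changed = True
    while changed:
        changed = False
        for edge in query_edges:
            u, _, v = edge
            if u in reached and v not in reached and edge_satisfied(edge, matching, db_edges):
                reached.add(v)
                changed = True
    return reached


def is_or_matching(matching, query_edges, query_root, db_edges, db_root):
    """Root constraint plus reachability of every non-null query node."""
    if matching.get(query_root) != db_root:
        return False
    reached = reachable_nodes(matching, query_edges, query_root, db_edges)
    return all(v in reached for v, obj in matching.items() if obj is not None)


def is_weak_matching(matching, query_edges, query_root, db_edges, db_root):
    """An OR matching in which every edge constraint is weakly satisfied."""
    if not is_or_matching(matching, query_edges, query_root, db_edges, db_root):
        return False
    for edge in query_edges:
        u, _, v = edge
        both_mapped = matching.get(u) is not None and matching.get(v) is not None
        # With a null endpoint the edge is weakly satisfied by definition.
        if both_mapped and not edge_satisfied(edge, matching, db_edges):
            return False
    return True


# Hypothetical toy database in which the director edge is missing.
DB_NO_DIRECTOR = {(1, "actor", 2), (1, "movie", 3), (2, "acted in", 3)}
QUERY = [
    ("v1", "actor", "v2"),
    ("v1", "movie", "v3"),
    ("v2", "acted in", "v3"),
    ("v3", "director", "v2"),
]
```

On this data, mapping both v2 and v3 gives an OR matching that is not a weak matching, because the director constraint is violated while both endpoints are non-null. Setting v3 to null yields a weak matching.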
19Maximal Matchings
- Matchings can be represented as tuples (where numbers are object ids)
- A matching t1 subsumes a matching t2 if t1 can be obtained from t2 by replacing some nulls in t2 with non-null values
- A matching is maximal if no other matching subsumes it
- A query result consists only of maximal matchings
t1 = (1, 5, 2, null)
t2 = (1, null, 2, null)
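Subsumption and the removal of subsumed matchings can be sketched as follows, assuming matchings are Python tuples of equal length with `None` standing for null. The names `subsumes` and `maximal_matchings` are illustrative. This is the quadratic filtering step of a naïve evaluation, not the algorithm of the talk.

```python
def subsumes(t1, t2):
    """True if t1 is obtained from t2 by replacing some nulls with values."""
    return (
        len(t1) == len(t2)
        and t1 != t2
        and all(b is None or a == b for a, b in zip(t1, t2))
    )


def maximal_matchings(matchings):
    """Keep only the matchings that no other matching subsumes."""
    unique = list(dict.fromkeys(matchings))  # drop duplicates, keep order
    return [t for t in unique if not any(subsumes(o, t) for o in unique)]


t1 = (1, 5, 2, None)
t2 = (1, None, 2, None)
```

Here t1 subsumes t2, so only t1 survives the filtering.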
20More Examples
21The Movie Database Before the Removals
[Figure: the full movie database, as shown on slide 6.]
22A Complete Matching
In the result, the actor must be both an actor in the movie and the director of the movie
[Figure: the movie database next to the query graph, with a complete matching of the query variables.]
It is also a maximal weak matching
It is also a maximal OR-matching
23A Second Maximal Weak Matching
In the result, if the actor and the movie are assigned non-null values, then the actor must be both an actor in the movie and the director of the movie
[Figure: the movie database next to the query graph, with a second maximal weak matching of the query variables.]
24A Maximal OR-Matching
In the result, the actor either played in the movie, directed the movie, or is not related at all to the movie
[Figure: the movie database next to the query graph, with a maximal OR-matching of the query variables.]
25Complexity of Evaluating Maximal Weak Matchings and Maximal OR Matchings
26Data Complexity
- Under data complexity, the time complexity is a function of the size of the database only
27Two Alternatives for Query Evaluation
- A naïve algorithm computes all matchings and then removes subsumed matchings
- A better algorithm avoids computing all matchings; ideally, it only computes maximal matchings
- Under data complexity, both algorithms are polynomial time
28Input-Output Complexity
- Under input-output complexity, the time complexity is a function of
- the size of the query,
- the size of the database, and
- the size of the result
29A Naïve Algorithm vs. A Better Algorithm
- Under I-O complexity, a naïve algorithm is exponential
- Is there a better algorithm with a polynomial-time I-O complexity?
- The answer is positive for DAG queries [Kanza, Nutt & Sagiv 1999]
30Cyclic Queries
- Theorem: For a query Q and a database D, the set of all maximal weak matchings can be computed in O(q^3 d m^2) time, where q is the size of the query, d is the size of the database and m is the size of the result
- Computing all maximal OR matchings has the same complexity
31Full Disjunctions
What is the full disjunction of a set of
relations?
How are full disjunctions related to queries with
incomplete answers?
32The Full Disjunction of the Given Relations
[Tables: the relations Movies, Actors, Acted-in and Actors-that-Directed, followed by their full disjunction.]
33The Full Disjunction of the Given Relations
[Tables: the Movies relation and the full disjunction.]
The full disjunction does not include subsumed tuples
34The Full Disjunction of the Given Relations
[Tables: the relations Movies, Actors, Acted-in and Actors-that-Directed, followed by their full disjunction.]
The full disjunction does not include tuples that are based on Cartesian product rather than join
35In the Full Disjunction of a Given Set of Relations
Every tuple of the input is a part of at least
one tuple of the output
Tuples are joined as in a natural join, padded
with null values
The result includes only maximal connected
portions
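The three properties above suggest a brute-force definition that can be run on very small inputs. The sketch below assumes each relation is a list of dictionaries (attribute name to value) without nulls, takes one tuple from each relation in a subset, and keeps the combinations that are connected and pairwise join consistent. It is exponential in the number of relations and only illustrates the definition; it is not the polynomial-time algorithm of the talk. The function names and the toy data are assumptions.

```python
from itertools import combinations, product


def join_consistent(t1, t2):
    """Two tuples agree on every attribute they share."""
    return all(t1[a] == t2[a] for a in t1.keys() & t2.keys())


def connected(tuples):
    """The tuples form one component when linked by shared attributes."""
    seen, pending = {0}, [0]
    while pending:
        i = pending.pop()
        for j in range(len(tuples)):
            if j not in seen and tuples[i].keys() & tuples[j].keys():
                seen.add(j)
                pending.append(j)
    return len(seen) == len(tuples)


def subsumes(t1, t2):
    return t1 != t2 and all(b is None or a == b for a, b in zip(t1, t2))


def full_disjunction(relations):
    """Return (attribute list, maximal null-padded joined tuples)."""
    attrs = sorted({a for rel in relations for t in rel for a in t})
    candidates = set()
    for k in range(1, len(relations) + 1):
        for subset in combinations(relations, k):
            for combo in product(*subset):  # one tuple per chosen relation
                if connected(combo) and all(
                    join_consistent(s, t) for s, t in combinations(combo, 2)
                ):
                    merged = {}
                    for t in combo:
                        merged.update(t)
                    # Pad the attributes that no chosen tuple supplies.
                    candidates.add(tuple(merged.get(a) for a in attrs))
    rows = [t for t in candidates if not any(subsumes(o, t) for o in candidates)]
    return attrs, rows


# Hypothetical toy relations.
movies = [{"m_id": 1, "title": "Zelig"}, {"m_id": 2, "title": "Antz"}]
actors = [{"a_id": 10, "name": "Woody Allen"}]
acted_in = [{"a_id": 10, "m_id": 1}]
attrs, rows = full_disjunction([movies, actors, acted_in])
```

For this input the result has two tuples: one joining the actor, the acted-in tuple and the movie with m_id 1, and one containing only the dangling movie with m_id 2, padded with nulls. No tuple combines Movies and Actors directly, since that would be a Cartesian product.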
36Motivation for Full Disjunctions
- Full disjunctions were proposed by Galindo-Legaria as an alternative to outerjoins [SIGMOD 94]
- Rajaraman and Ullman suggested using full disjunctions for information integration [PODS 96]
37Computing Full Disjunctions for γ-acyclic Relation Schemas
- Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic
- Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ-acyclic
38Weak Semantics Generalizes Full Disjunctions
- Relations can be converted into a semistructured database
- The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics
39Example: Creating The Database
[Figure: the tuples of Movies, Actors and Acted-in drawn as graph nodes under a common root.]
A node is created for each tuple
Edges are added between connected tuples, in both directions
A root is added, and edges are added from the root to every node
We use colors instead of labels
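The construction steps can be sketched in a few lines. Two points are assumptions made for illustration, since the slide does not spell them out: two tuples are treated as connected when they come from different relations, share at least one attribute and agree on all shared attributes; and each edge is labeled with the relation name of its target node, standing in for the colors. The name `build_database` and the toy relations are hypothetical.

```python
def build_database(relations):
    """relations: dict mapping a relation name to a list of tuples (dicts).

    Returns (nodes, edges), where a node is (relation name, row index) and
    an edge is (source, label, target).
    """
    nodes = [(name, i) for name, rows in relations.items() for i in range(len(rows))]
    edges = set()
    # A root is added, with an edge from the root to every tuple node.
    for node in nodes:
        edges.add(("root", node[0], node))
    # Iterating over ordered pairs adds each connection in both directions.
    for n1, i1 in nodes:
        for n2, i2 in nodes:
            if n1 == n2:
                continue
            t1, t2 = relations[n1][i1], relations[n2][i2]
            shared = t1.keys() & t2.keys()
            if shared and all(t1[a] == t2[a] for a in shared):
                edges.add(((n1, i1), n2, (n2, i2)))
    return nodes, edges


# Hypothetical toy relations.
toy = {
    "Movies": [{"m_id": 1, "title": "Zelig"}, {"m_id": 2, "title": "Antz"}],
    "Actors": [{"a_id": 10, "name": "Woody Allen"}],
    "Acted-in": [{"a_id": 10, "m_id": 1}],
}
nodes, edges = build_database(toy)
```

In this toy input the movie with m_id 2 is linked only to the root, because no Acted-in tuple refers to it.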
40Example: Creating The Queries
[Figure: query graphs over the schemas Movies, Actors and Acted-in, each with a root r.]
A node is created for each relation schema
Edges are added between connected schemas, in both directions
The number of queries is equal to the number of schemas
In each query, the root is connected to a different schema
41Example: Queries are Evaluated under Weak Semantics
[Figure, animated over slides 41-46: graphs with root r and nodes Movies, Actors and Acted-in, illustrating the evaluation of the queries under weak semantics.]
47The Algorithm Computes Full Disjunctions in Polynomial Time Under Input-Output Complexity
Theorem: The full disjunction of relations r1, ..., rn can be computed in O(n^5 s^2 f^2) time, where n is the number of relations, s is the total size of all the relations and f is the size of the result
48Generalizing Full Disjunctions
- In a full disjunction, tuples are joined according to equality constraints, as in a natural join (or equi-join)
- We can generalize full disjunctions to support constraints that are not merely equality among attributes
49Example
Movies (m-id, title, year, language, location)
Actors (a-id, name, date-of-birth)
Acted-in (a-id, m-id, role)
Actors-that-Directed (a-id, m-id)
Historical-Events (name, date, description)
Historical-Sites (Country, State, City, Site)
50The General Idea
- A set of constraints specifies how tuples should be joined
- The queries and the database are constructed according to the given constraints
- A pair of nodes is connected by an edge when it satisfies the corresponding constraint
- Queries are evaluated w.r.t. the database under weak semantics
51Another Way of Generalizing Full Disjunctions: Use OR-Semantics
- Generate the queries and the database as before, but evaluate the queries under OR-semantics (rather than weak semantics)
- This relaxes the requirement that every pair of tuples should be join consistent
- Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent
52Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
The Full Disjunction
53Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
The Full Disjunction under OR-Semantics
54Two Related Problems
The Projection Problem: Computing the projection of the full disjunction on a given set of attributes
The Restriction Problem: Computing only those tuples of the full disjunction that are non-null on a given set of attributes
The projection problem and the restriction problem cannot be solved in polynomial time (under input-output complexity) unless P = NP
55Conclusion
- Cyclic queries can be evaluated in polynomial time (in the size of the query, the database and the result) under either OR-semantics or weak semantics
- A reduction of full-disjunction evaluation to query evaluation under weak semantics is described
- Using the reduction, full disjunctions can be computed in polynomial time (in the size of the relation schemas, the relations and the result)
56Conclusion (continued)
- Full disjunctions can be generalized in two ways
- By using OR-semantics instead of weak semantics
- By joining tuples according to general constraints
- Generalized full disjunctions can be useful in the context of data integration from heterogeneous sources
- The projection problem and the restriction problem have polynomial-time algorithms (under input-output complexity) when the relations have γ-acyclic schemas, but not in the general case
57Thank You
Questions?