Title: Full Disjunctions: Polynomial-Delay Iterators in Action
1Full Disjunctions: Polynomial-Delay Iterators in Action
VLDB 2006, Seoul, Korea
2Computing Full Disjunctions
- The full disjunction is a relational operator that maximally combines data from several relations
  - It extends the natural join by allowing incompleteness
  - It extends the binary outerjoin to many relations
- This paper presents algorithms and optimizations for computing full disjunctions
  - Theoretically, full disjunctions are more tractable than previously known
  - Practically, a significant improvement over the state of the art: an iterator-like evaluation
3Contents
- Full Disjunctions
- Complexity
- Contributions
- Algorithms
- Algorithm NLOJ for Tree-Structured Schemes
- Algorithm PDelayFD for General Schemes
- Algorithm BiComNLOJ - Main Algorithm
- Experimental Results
- Conclusion
5The Natural Join Operator
[Figure: the relations Climates, Accommodations, and Sites, and their natural join Climates ⋈ Accommodations ⋈ Sites]
8The Natural Join Misses Information
A looser notion of join is needed, one that enables joining tuples from only some of the tables
[Figure: the relations Climates, Accommodations, and Sites, and their natural join Climates ⋈ Accommodations ⋈ Sites]
Bahamas is not in Sites, so the natural join misses it
Mount Logan is not in a city, hence it is missed as well
9The Natural Join Operator
A tuple of the join corresponds to a set of tuples from the source relations
[Figure: the relations Climates, Accommodations, and Sites, and their natural join Climates ⋈ Accommodations ⋈ Sites]
The set is:
- Join consistent
- Connected: no Cartesian product
- Complete: one tuple from each relation
10Join-Consistent Sets of Tuples
A set T of tuples is join-consistent if every two tuples of T are join-consistent
Two tuples t1 and t2 are join-consistent if for every common attribute A:
1. t1[A] and t2[A] are non-null
2. t1[A] = t2[A]
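The two conditions above can be sketched in a few lines of Python. This is only an illustration: tuples are modelled as dictionaries from attribute names to values, None stands for null, and the sample tuples are made up rather than taken from the slides.

```python
from itertools import combinations

def join_consistent(t1, t2):
    """True if, on every common attribute, both values are non-null and equal."""
    common = t1.keys() & t2.keys()
    # t1[a] non-null and equal to t2[a] implies t2[a] is non-null as well.
    return all(t1[a] is not None and t1[a] == t2[a] for a in common)

def set_join_consistent(tuples):
    """A set of tuples is join-consistent if every two of its tuples are."""
    return all(join_consistent(a, b) for a, b in combinations(tuples, 2))

# Hypothetical sample tuples (None plays the role of null).
climate = {"Country": "Canada", "Climate": "diverse"}
hotel = {"Country": "Canada", "City": "Toronto", "Hotel": "Plaza"}
site = {"Country": "Canada", "City": None, "Site": "Mount Logan"}

print(join_consistent(climate, hotel))             # agree on Country
print(join_consistent(hotel, site))                # City is null in `site`
print(set_join_consistent([climate, site]))
print(set_join_consistent([climate, hotel, site]))
```

Note that a null in a common attribute makes two tuples inconsistent, which is why the hotel and the site without a city cannot be combined.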
11Connected Sets of Tuples
A set of tuples is connected if its join graph is
connected
The join graph of a set T of tuples:
- The nodes are the tuples of T
- An edge between every two tuples with a common attribute
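A minimal sketch of the connectivity test, assuming tuples are dictionaries keyed by attribute name. Following the definition above, an edge depends only on sharing an attribute, not on the values; treating the empty set as connected is a convention chosen here, and the sample tuples are hypothetical.

```python
def share_attribute(t1, t2):
    """Edge of the join graph: the two tuples have a common attribute."""
    return bool(t1.keys() & t2.keys())

def is_connected(tuples):
    """Depth-first search over the join graph of the given tuples."""
    tuples = list(tuples)
    if not tuples:
        return True  # convention chosen for this sketch
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(tuples)):
            if j not in seen and share_attribute(tuples[i], tuples[j]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)

# Hypothetical tuples: climate and rating are linked only through hotel.
climate = {"Country": "Canada", "Climate": "diverse"}
hotel = {"Country": "Canada", "Hotel": "Plaza"}
rating = {"Hotel": "Plaza", "Stars": 4}

print(is_connected([climate, hotel, rating]))  # a path climate - hotel - rating
print(is_connected([climate, rating]))         # no common attribute
```

Joining a disconnected set would amount to a Cartesian product, which is what the connectivity requirement rules out.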
12Natural Join (w/o Cartesian Product)
Each tuple of the result corresponds to a set T of tuples from the source relations, where T is join consistent, connected, and complete (one tuple from each relation)
13Full Disjunction (Galindo-Legaria 1994)
Each tuple of the result corresponds to a set T of tuples from the source relations
1. T is join consistent
2. T is connected
3. T is maximal: no proper superset of T is join consistent and connected
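To make the definition concrete, here is a brute-force sketch that enumerates every subset of tuples, keeps the maximal join-consistent and connected (JCC) ones, and pads their joins with nulls. It is exponential and only meant for tiny inputs; it is not one of the algorithms of the talk. The database contents are hypothetical, loosely modelled on the tourism example.

```python
from itertools import combinations

def join_consistent(t1, t2):
    return all(t1[a] is not None and t1[a] == t2[a] for a in t1.keys() & t2.keys())

def is_jcc(tuples):
    """Join consistent and connected (JCC); `tuples` is a non-empty list."""
    if not all(join_consistent(a, b) for a, b in combinations(tuples, 2)):
        return False
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(tuples)):
            if j not in seen and tuples[i].keys() & tuples[j].keys():
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)

def full_disjunction(db):
    """db: dict mapping relation names to lists of dict tuples (None = null)."""
    tuples = [t for rows in db.values() for t in rows]
    ids = range(len(tuples))
    jcc_sets = [frozenset(c)
                for k in range(1, len(tuples) + 1)
                for c in combinations(ids, k)
                if is_jcc([tuples[i] for i in c])]
    # Keep only the sets that are not properly contained in another JCC set.
    maximal = [s for s in jcc_sets if not any(s < other for other in jcc_sets)]
    attrs = sorted({a for t in tuples for a in t})
    result = []
    for s in maximal:
        row = dict.fromkeys(attrs)      # every attribute starts as null
        for i in s:
            row.update(tuples[i])       # members agree on shared attributes
        result.append(row)
    return result

# Hypothetical data, not the tables shown on the slides.
db = {
    "Climates": [{"Country": "Canada", "Climate": "diverse"},
                 {"Country": "Bahamas", "Climate": "tropical"}],
    "Accommodations": [{"Country": "Canada", "City": "Toronto", "Hotel": "Plaza"},
                       {"Country": "Bahamas", "City": "Nassau", "Hotel": "Hilton"}],
    "Sites": [{"Country": "Canada", "City": "Toronto", "Site": "CN Tower"},
              {"Country": "Canada", "City": None, "Site": "Mount Logan"}],
}
fd = full_disjunction(db)
for row in fd:
    print(row)
```

On this data the natural join yields a single tuple (Canada, Toronto), while the full disjunction also keeps the Bahamas hotel and the site without a city, each padded with nulls.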
14An Example of a Full Disjunction
[Figure: the relations Climates, Accommodations, and Sites (the input R) and the resulting full disjunction FD(R)]
20Padding Joined Tuple Sets with Nulls
21The Outerjoin Operator
The outerjoin of two relations R1 and R2
22Example of an Outerjoin
[Figure: the outerjoin of Climates and Accommodations]
23Combining Relations using Outerjoins
The outerjoin operator is not associative: for more than two relations, the result depends on the order in which the outerjoins are applied
In general, outerjoins cannot maximally combine relations (no matter what order is used)
Outerjoin is not suitable for combining more than two relations!
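The dependence on the order can be checked on a tiny example. The sketch below implements a natural full outerjoin over lists of dictionaries (None = null, schemes inferred from the tuples) and applies it to three hypothetical one-tuple relations with a cyclic scheme; the relation names and values are made up for illustration.

```python
def scheme(rel):
    """Set of attributes used by the tuples of a relation."""
    return set().union(*(t.keys() for t in rel))

def outerjoin(r1, r2):
    """Natural full outerjoin; nulls never match on common attributes."""
    common = scheme(r1) & scheme(r2)
    attrs = scheme(r1) | scheme(r2)
    pad = lambda t: {a: t.get(a) for a in attrs}   # missing attributes become None
    out, matched_right = [], set()
    for t1 in r1:
        matched = False
        for j, t2 in enumerate(r2):
            if all(t1[a] is not None and t1[a] == t2[a] for a in common):
                out.append(pad({**t1, **t2}))
                matched = True
                matched_right.add(j)
        if not matched:
            out.append(pad(t1))                    # dangling left tuple
    out.extend(pad(t2) for j, t2 in enumerate(r2) if j not in matched_right)
    return out

def as_set(rel):
    """Order-insensitive view of a relation, for comparison."""
    return {frozenset(t.items()) for t in rel}

# Hypothetical relations R1(A,B), R2(B,C), R3(A,C).
R1 = [{"A": 1, "B": 2}]
R2 = [{"B": 2, "C": 3}]
R3 = [{"A": 1, "C": 4}]

left = outerjoin(outerjoin(R1, R2), R3)    # (R1 oj R2) oj R3
right = outerjoin(R1, outerjoin(R2, R3))   # R1 oj (R2 oj R3)
print(left)
print(right)
print(as_set(left) == as_set(right))
```

In this example neither order combines the R1 tuple with the R3 tuple, although the two agree on A, which illustrates why outerjoins alone may fail to combine relations maximally.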
25Efficiency of Evaluation
The full-disjunction operator (like other operators, such as the Cartesian product or the natural join) can generate a number of tuples that is exponential in the input size
Hence, polynomial running time is not a suitable yardstick
The usual notion: polynomial time in the combined size of the input and the output
26History of Algorithms for Full Disjunctions
n: number of relations
N: number of tuples in the DB
F: number of tuples in the FD
This paper: linear dependence on F
F is typically very large: it can be exponential in the size of the database
27Polynomial Delay
One way to obtain an evaluation with a running time linear in the output is to devise an algorithm that acts as an iterator with an efficient next() operator, that is, an enumeration algorithm that runs with polynomial delay
An enumeration algorithm runs with polynomial delay if the time between every two successive answers is polynomial in the size of the input
28Other Benefits of Polynomial Delay
- Incremental evaluation
  - First tuples are generated quickly
  - Full disjunctions are large, yet the user need not wait for the whole result to be generated
  - Suitable for Web applications, where users expect to get the first few pages quickly
  - In addition, the user can decide anytime that enough information has been shown
- Enables parallel query processing
  - While one processor generates the FD tuples, other processors apply further processing
30Main Contributions
Substantial improvement over the state of the art is proved theoretically and experimentally
1. First algorithm for computing full disjunctions with polynomial delay
2. First algorithm for computing full disjunctions in time linear in the output
3. A general optimization technique for computing full disjunctions: division into biconnected components
32Our Algorithms
Algorithm NLOJ: tree schemes
Algorithm PDelayFD: general schemes
Division into biconnected components: optimization
Algorithm BiComNLOJ: main algorithm, general schemes
34Tree Schemes
Scheme graphs w/o cycles
In the scheme graph, the relation schemes are the
nodes and there is an edge between every two
schemes with one or more common attributes
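A small sketch of the scheme graph and of a tree test, assuming each relation scheme is given as a set of attribute names. "Tree" is read here as connected and acyclic, which is an assumption of this sketch; the relation names and attributes are hypothetical.

```python
def scheme_graph(schemes):
    """schemes: dict name -> set of attributes. Returns an adjacency dict."""
    return {r: {s for s in schemes if s != r and schemes[r] & schemes[s]}
            for r in schemes}

def is_tree_scheme(schemes):
    """Connected and acyclic: reachable from any node, with |E| = |V| - 1."""
    graph = scheme_graph(schemes)
    if not graph:
        return True
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        for s in graph[stack.pop()]:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    edges = sum(len(neighbours) for neighbours in graph.values()) // 2
    return len(seen) == len(graph) and edges == len(graph) - 1

# Hypothetical schemes: a chain is a tree, a triangle is not.
chain = {"R1": {"A", "B"}, "R2": {"B", "C"}, "R3": {"C", "D"}}
triangle = {"R1": {"A", "B"}, "R2": {"B", "C"}, "R3": {"A", "C"}}

print(is_tree_scheme(chain))
print(is_tree_scheme(triangle))
```

An attribute shared by three relations already creates a triangle in the scheme graph, so a scheme where every relation has a Country attribute would not be a tree scheme.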
35Left-Deep Sequence of Outerjoins
R: a set of relations with a tree scheme
R1,…,Rn: a connected-prefix order of R
Proposition
FD(R) = (…((R1 ⟗ R2) ⟗ R3) ⟗ …) ⟗ Rn
1. Compute a connected-prefix order of R
2. Apply outerjoins in a left-deep order
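The two steps can be sketched as follows. The sketch assumes that a connected-prefix order is one in which every prefix R1,…,Ri is connected in the scheme graph (a breadth-first order has this property when the scheme graph is connected), and it evaluates the outerjoins eagerly, so it does not address delay. The data and the helper names are hypothetical.

```python
from collections import deque
from functools import reduce

def scheme(rel):
    return set().union(*(t.keys() for t in rel))

def outerjoin(r1, r2):
    """Natural full outerjoin over lists of dicts (None = null)."""
    common, attrs = scheme(r1) & scheme(r2), scheme(r1) | scheme(r2)
    pad = lambda t: {a: t.get(a) for a in attrs}
    out, matched_right = [], set()
    for t1 in r1:
        hits = [j for j, t2 in enumerate(r2)
                if all(t1[a] is not None and t1[a] == t2[a] for a in common)]
        out.extend(pad({**t1, **r2[j]}) for j in hits)
        matched_right.update(hits)
        if not hits:
            out.append(pad(t1))
    out.extend(pad(t2) for j, t2 in enumerate(r2) if j not in matched_right)
    return out

def connected_prefix_order(schemes):
    """Breadth-first order: each relation shares an attribute with an earlier one."""
    names = list(schemes)
    order, seen = [names[0]], {names[0]}
    queue = deque(order)
    while queue:
        r = queue.popleft()
        for s in names:
            if s not in seen and schemes[r] & schemes[s]:
                seen.add(s)
                order.append(s)
                queue.append(s)
    return order

# Hypothetical chain scheme R1(A,B) - R2(B,C) - R3(C,D), listed out of order.
db = {
    "R1": [{"A": 1, "B": 2}, {"A": 5, "B": 6}],
    "R3": [{"C": 3, "D": 4}, {"C": 7, "D": 8}],
    "R2": [{"B": 2, "C": 3}],
}
schemes = {name: scheme(rows) for name, rows in db.items()}
order = connected_prefix_order(schemes)
fd = reduce(outerjoin, (db[name] for name in order))   # left-deep outerjoins
print(order)
for row in fd:
    print(row)
```

Joining R1 with R3 first would be a Cartesian product, since they share no attribute; the connected-prefix order avoids that by placing R2 between them.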
36Connected-Prefix Order of Relations
[Figure: a tree-structured scheme graph over the relations R1,…,R7, shown with a connected-prefix order of its relations]
37Achieving Polynomial Delay
1. Compute a connected-prefix order of R
2. Apply outerjoins in a left-deep order
[Figure: the left-deep sequence of outerjoins]
Problem: exponential delay
Solution: use iterators
38Iterators
To obtain polynomial delay, we use iterators
- Operate on top of an enumeration algorithm
- Implement next() by controlling the execution
[Figure: an iterator that wraps an enumeration algorithm and exposes next()]
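Python generators give a compact way to picture this idea: the enumeration algorithm is written as ordinary loops, and each call to next() resumes it only until the next answer is ready. The sketch below streams a left outerjoin (a simplification, not the NLOJ algorithm of the paper) and records how far the input scan has progressed; all names and data are hypothetical.

```python
def left_outerjoin_stream(left, right):
    """Lazily yield the left outerjoin of a tuple stream with a stored relation."""
    right_attrs = set().union(*(t.keys() for t in right))
    for t1 in left:
        matched = False
        for t2 in right:
            common = t1.keys() & t2.keys()
            if all(t1[a] is not None and t1[a] == t2[a] for a in common):
                matched = True
                yield {**t1, **t2}
        if not matched:
            # Pad the attributes of `right` with nulls; t1 overrides shared ones.
            yield {**{a: None for a in right_attrs}, **t1}

scanned = []

def scan(rel):
    """Input iterator that logs each tuple as it is read."""
    for t in rel:
        scanned.append(t)
        yield t

climates = [{"Country": "Canada", "Climate": "diverse"},
            {"Country": "Bahamas", "Climate": "tropical"},
            {"Country": "UK", "Climate": "temperate"}]
hotels = [{"Country": "Canada", "Hotel": "Plaza"}]

it = left_outerjoin_stream(scan(climates), hotels)
first = next(it)                    # runs the loops only up to the first answer
scanned_after_first = len(scanned)  # only one input tuple has been read so far
second = next(it)
print(first, scanned_after_first, second)
```

Stacking such generators, one per outerjoin, gives a pipeline in which the consumer pulls one result at a time instead of waiting for a whole intermediate relation.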
39Using Iterators for Outerjoins
[Figure: iterators applied to the left-deep sequence of outerjoins]
40Outerjoins are not Always Applicable
It is not always possible to formulate a full disjunction as a left-deep sequence of outerjoins
Rajaraman and Ullman (PODS 96): some full disjunctions cannot be formulated as expressions of outerjoins (i.e., even with arbitrary placement of parentheses)
42About the Algorithm
- Unlike NLOJ, the next algorithm, PDelayFD, is applicable to all schemes (and not just trees)
- Algorithm PDelayFD has a polynomial delay, but the delay is larger than that of NLOJ
- Nevertheless, PDelayFD by itself is a significant improvement over the state of the art
43Shifting a Maximal JCC Tuple Set T
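t-shifting T, where T is a maximal JCC (join-consistent and connected) set and t is a tuple:
1. Add t to T
2. Extract the maximal JCC subset containing t
3. Extend to a maximal JCC set
[Figure: a maximal JCC set T, a tuple t, and the resulting t-shift of T]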
44Algorithm PDelayFD
PDelayFD(R) computes FD(R) with polynomial delay
1. Generate a maximal JCC set T0
2. Insert T0 into Q
Repeat until Q is empty:
1. Move some T from Q to C
2. Print the join of T, padded with nulls
3. Insert into Q a t-shift of T for all tuples t in the database (validate that the t-shift is not already in Q or C)
[Figure: the collections Q and C, and the output]
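The loop above can be sketched as a Python generator. This is a simplified reading of the slide, not the implementation from the paper: tuples are dictionaries in one flat list, tuple sets are frozensets of list indices, the extension step is a plain greedy fixpoint, and a single set `seen` stands in for the membership test on Q and C. The data is hypothetical.

```python
from collections import deque

def consistent(t1, t2):
    return all(t1[a] is not None and t1[a] == t2[a] for a in t1.keys() & t2.keys())

def adjacent(t1, t2):
    return bool(t1.keys() & t2.keys())

def extend(db, jcc):
    """Greedily add tuples until the JCC set is maximal."""
    jcc, changed = set(jcc), True
    while changed:
        changed = False
        for j in range(len(db)):
            if (j not in jcc and all(consistent(db[j], db[i]) for i in jcc)
                    and any(adjacent(db[j], db[i]) for i in jcc)):
                jcc.add(j)
                changed = True
    return frozenset(jcc)

def t_shift(db, T, t):
    # Steps 1-2: add t, keep what is consistent with t and connected to it.
    keep = {i for i in T if consistent(db[i], db[t])} | {t}
    seen, stack = {t}, [t]
    while stack:
        i = stack.pop()
        for j in keep:
            if j not in seen and adjacent(db[i], db[j]):
                seen.add(j)
                stack.append(j)
    return extend(db, seen)          # step 3: extend to a maximal JCC set

def pdelay_fd(db):
    """Yield each maximal JCC set once, as a frozenset of indices into db."""
    if not db:
        return
    start = extend(db, {0})
    queue, seen = deque([start]), {start}   # `seen` covers both Q and C
    while queue:
        T = queue.popleft()
        yield T
        for t in range(len(db)):
            shifted = t_shift(db, T, t)
            if shifted not in seen:
                seen.add(shifted)
                queue.append(shifted)

def join_padded(db, T):
    row = dict.fromkeys(sorted({a for t in db for a in t}))
    for i in T:
        row.update(db[i])
    return row

# Hypothetical database, flattened into one list of tuples.
db = [{"Country": "Canada", "Climate": "diverse"},
      {"Country": "Bahamas", "Climate": "tropical"},
      {"Country": "Canada", "City": "Toronto", "Hotel": "Plaza"},
      {"Country": "Bahamas", "City": "Nassau", "Hotel": "Hilton"},
      {"Country": "Canada", "City": "Toronto", "Site": "CN Tower"},
      {"Country": "Canada", "City": None, "Site": "Mount Logan"}]
sets = list(pdelay_fd(db))
for T in sets:
    print(join_padded(db, T))
```

Between two consecutive answers the loop performs one t-shift per database tuple, and each t-shift takes polynomial time, which is where the polynomial delay of this sketch comes from.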
46NLOJ vs. PDelayFD
?
PDelayFD
NLOJ
- Shorter delays
- Less space
- Simpler to implement
Our approach: divide and conquer
47Biconnected Components
[Figure: a scheme graph over the relations R1,…,R8, with its biconnected components marked]
Biconnected component: a maximal subset B of relations, s.t. the scheme graph has two (or more) disjoint paths between every two relations of B
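For reference, biconnected components of a scheme graph can be found with the classical depth-first search that keeps a stack of edges. The sketch below is a textbook-style version, not code from the paper; it follows the common convention that a bridge between two relations forms a two-node component, and the graph is a hypothetical example rather than the one in the figure.

```python
def biconnected_components(graph):
    """graph: dict node -> set of neighbours (undirected, no self-loops)."""
    index, low = {}, {}
    edge_stack, components = [], []

    def dfs(u, parent):
        index[u] = low[u] = len(index)
        for v in graph[u]:
            if v not in index:                      # tree edge
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= index[u]:              # u separates v's subtree
                    component = set()
                    while True:
                        a, b = edge_stack.pop()
                        component |= {a, b}
                        if (a, b) == (u, v):
                            break
                    components.append(component)
            elif v != parent and index[v] < index[u]:   # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], index[v])

    for node in graph:
        if node not in index:
            if not graph[node]:
                index[node] = low[node] = len(index)
                components.append({node})           # isolated relation
            else:
                dfs(node, None)
    return components

# Hypothetical scheme graph: a triangle R1-R2-R3 followed by a path R3-R4-R5.
graph = {
    "R1": {"R2", "R3"},
    "R2": {"R1", "R3"},
    "R3": {"R1", "R2", "R4"},
    "R4": {"R3", "R5"},
    "R5": {"R4"},
}
components = biconnected_components(graph)
print(components)
```

In this example R3 and R4 are the relations shared between components, so they are the points at which the per-component results would have to be glued together.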
48Left-Deep Sequence of Outerjoins
R: a set of relations
Theorem
There exists an (efficiently computable) order B1,…,Bk of the biconnected components of R, s.t.
FD(R) = (…(FD(B1) ⟗ FD(B2)) ⟗ …) ⟗ FD(Bk)
Optimized Algorithm
1. Compute the biconnected components of R
2. Compute the full disjunction of each component
3. Apply outerjoins in a suitable order
49BiComNLOJ: a Naïve Attempt
1. Divide R into biconnected components B1,…,Bk in a suitable order
2. Compute FD(B1),…,FD(Bk) using PDelayFD
3. Using NLOJ, compute (…(FD(B1) ⟗ FD(B2)) ⟗ …) ⟗ FD(Bk)
Each FD(Bi) can be exponential in the input: non-polynomial delay!
Solution
50Retaining Polynomial Delay: 1st Problem
For simplification, assume only two components
[Figure: a scheme graph over R1,…,R8 divided into two biconnected components, B1 and B2]
- After generating a tuple t of FD(B1), we need to generate all tuples of FD(B2) that can join t
- Non-polynomial delay if all of FD(B2) is computed for finding these tuples!
- Solution: PDelayFD can be modified so that it generates only those tuples of FD(B2) that can join t
Details in the proceedings
51Retaining Polynomial Delay: 2nd Problem
For simplification, assume only two components
[Figure: a scheme graph over R1,…,R8 divided into two biconnected components, B1 and B2]
- The last step is to generate all tuples of FD(B2) that cannot be joined with tuples of FD(B1)
- However, this task is by itself NP-hard!
- Solution: when generating all tuples of FD(B2) that can be joined with some tuple of FD(B1), we collect enough information for generating the remaining tuples of FD(B2)
Details in the proceedings
53Experimental Setting
Algorithms: PDelayFD, BiComNLOJ (main), IncrementalFD (CS05, state of the art)
PostgreSQL (open source)
HW: Pentium 4, 1.6 GHz, 512 MB RAM
- Synthetic data (randomly generated)
- Fixed schemes
54State-of-Art vs. Main Algorithm
[Figure: average delay (msec) vs. number of tuples in each relation, for IncrementalFD (state of the art, CS05) and BiComNLOJ (our main algorithm)]
BiComNLOJ is a substantial improvement over the state of the art
55Division into Biconnected Components
[Figure: average delay (msec) vs. number of tuples in each relation, for PDelayFD (no division into biconnected components) and BiComNLOJ (our main algorithm)]
Division reduces delays (the amount depends on the scheme)
56Behavior of Delay
Measure the delay before each generated tuple
[Figure: delay (msec) vs. tuple number, for IncrementalFD (state of the art, CS05) and BiComNLOJ (our main algorithm)]
While IncrementalFD has a slowdown, the delay of BiComNLOJ remains almost constant
58Summary
Full Disjunction: an associative extension of the outerjoin operator to an arbitrary number of relations
3 algorithms for computing FD:
NLOJ (Nested-Loop Outerjoin): tree-structured schemes
PDelayFD (Polynomial-Delay Full Disjunction): general schemes
BiComNLOJ (combines the first two and deploys division into biconnected components): general schemes
59Contributions
- Substantial improvement of evaluation time over the state of the art
  - Proved theoretically and experimentally
- Full disjunctions can be computed with polynomial delay and in time linear in the output size
- Optimization techniques for computing FDs
- Implementation within PostgreSQL (ongoing)
- Incorporating our algorithms into an SQL optimizer
  - E.g., some operators can be pushed through the FD
  - Not discussed here, appears in the proceedings
60Thank you.
Questions?