Greedy Algo. for Selecting a Join Order - PowerPoint PPT Presentation

About This Presentation

Title:

Greedy Algo. for Selecting a Join Order

Description:

Greedy Algo. for Selecting a Join Order The – PowerPoint PPT presentation

Number of Views:123

Avg rating:3.0/5.0

Slides: 13

Provided by: Compute258

Category:

more less

Transcript and Presenter's Notes

Title: Greedy Algo. for Selecting a Join Order

1
Greedy Algo. for Selecting a Join Order

The "greediness" is based on the idea that we
want to keep the intermediate relations as small
as possible at each level of the tree.
BASIS Start with the pair of relations whose
estimated join size is smallest. The join of
these relations becomes the current tree.
INDUCTION Find, among all those relations not
yet included in the current tree, the relation
that, when joined with the current tree, yields
the relation of smallest estimated size.
The new current tree has the old current tree
as its left argument and the selected relation
as its right argument.

2
Example

The basis step is to find the pair of relations
that have the smallest join.
This honor goes to the join T ?? U, with a cost
of 1000. Thus, T ?? U is the "current tree."
We now consider whether to join R or S into the
tree next.
We compare the sizes of (T ?? U) ?? R and (T ??
U)??S.
The latter, with a size of 2000 is better than
the former, with a size of 10,000. Thus, we pick
as the new current tree (T ?? U)??S.
Now there is no choice we must join R at the
last step, leaving us with a total cost of 3000,
the sum of the sizes of the two intermediate
relations.

3
Pipelining Versus Materialization

Materialization The naive way to execute a query
plan is to order the operations appropriately
An operation is not performed until the
argument(s) below it have been performed, and
Store the result of each operation on disk until
it is needed by another operation. This strategy
is
Pipelining A more subtle, and generally more
efficient, way to execute a query plan is to
interleave the execution of several operations.
The tuples produced by one operation are passed
directly to the operation that uses it, without
ever storing the intermediate tuples on disk.
Typically is implemented by a network of
iterators.

4
Pipelining Example (I)

Let us consider physical query plans for the
expression
(R(w,x) ?? S(x,y)) ?? U(y,z)
Assumptions
R occupies 5000 blocks S and U each occupy
10,000 blocks.
The intermediate result R ?? S occupies k blocks
for some k.
We can estimate k, based on the number of
x-values in R and S and the size of (w,x,y)
tuples compared to the (w,x) tuples of R and the
(x,y) tuples of S.
However, we want to see what happens as k varies,
so we leave this constant open.
Both joins will be implemented as hash-joins,
either one-pass or two-pass, depending on k.
There are 101 buffers available.

5
Example (II)
6
Example (III)

First, consider the join R ?? S. Neither relation
fits in main memory, so we need a two-pass
hash-join.
If the smaller relation R is partitioned into the
maximum-possible 100 buckets on the first pass,
then each bucket for R occupies 50 blocks.
If R's buckets have 50 blocks, then the second
pass of the hash-join R ?? S uses 51 buffers,
leaving 50 buffers to use for the join of the
result of R ?? S with U.
Now,suppose that k ? 49 that is, the result of R
?? S occupies at most 49 blocks.
Then we can pipeline the result of R ?? S into 49
buffers, organize them for lookup as a hash
table, and we have one buffer left to read each
block of T in turn.
We may thus execute the second join as a one-pass
join.
The total number of disk I/O's is
45,000 to perform the two-pass hash join of R and
S.
10,000 to read U in the one-pass hash-join of (R
?? S) ?? U.
The total is 55,000 disk I/O's.

7
Example (IV)

Now, suppose k gt 49, but k lt 5000. We can still
pipeline the result R ?? S, but we need to use
another strategy, in which this relation is
joined with U in a 50-bucket, two-pass hash-join.
Before we start on R ?? S, we hash U into 50
buckets of 200 blocks.
Next, we perform a two-pass hash join of R and S
using 51 buckets as before, but as each tuple of
R ?? S is generated, we place it in one of the 50
remaining buffers that is used to help form the
50 buckets for the join of R??S with U.
Finally, we join R ?? S with U bucket by bucket.
Since k lt 5000, the buckets of R ?? S will be of
size at most 100 blocks, so this join is
feasible.
The fact that buckets of U are of size 200
blocks is not a problem.
We are using buckets of R ?? S as the build
relation and buckets of U as the probe relation
in the one-pass joins of buckets.
The number of disk I/O's for this pipelined join
is
a) 20,000 to read U and write its tuples into
buckets.
b) 45,000 to perform the two-pass hash-join R ??
S.
c) k to write out the buckets of R ?? S.
d) k 10,000 to read the buckets of R ?? S and
U in the final join.
The total cost is thus 75,000 2k.

8
Example (V)

Last, let us consider what happens when k gt 5000.
Now, we cannot perform a two-pass join in the 50
buffers available if the result of R ?? S is
pipelined. So,
Compute R ?? S using a two-pass hash join and
materialize the result on disk.
Join R ?? S with U, also using a two-pass
hash-join.
Note that since B(U) 10,000, we can perform a
two-pass hash-join using 100 buckets regardless
of how large k is. Technically, U should appear
as the left argument of its join if we decide to
make U the build relation for the hash join.
Number of disk I/O's for this plan is
45,000 for the two-pass join of R and S.
k to store R ?? S on disk.
30,000 3k for the two-pass hash-join of U with
R ?? S.
Total cost is thus 75,000 4k

9
Example (VI)
10
Notation for Physical Plans
It can also be SortScan(R,L) if a sort based join
is preferred.
11
Notation for Physical Plans
12
Notation for Physical Plans

Write a Comment

User Comments (0)