Database Systems (?????)

About This Presentation

Title:

Database Systems (?????)

Description:

This is an over-simplification, but good enough for now. ... Nested block is optimized independently (sometimes just evaluated once) ... – PowerPoint PPT presentation

Number of Views:24

Avg rating:3.0/5.0

Slides: 34

Provided by: haohu1

Category:

more less

Transcript and Presenter's Notes

Title: Database Systems (?????)

1
Database Systems(?????)

December 13, 2004
Chapter 15
By Hao-hua Chu (???)

2
Announcement

Assignment 9 is due next Thursday.
Read chapter 16 for next lecture.

3
Cool Ubicomp ProjectHyperdragging (Sony, 1999)

Not enough working surface on your computer
screen .
No shared working surface when collaborating with
other people .

4
A Typical Query Optimizer

Chapter 15

5
How does a query optimizer work in general?

Decompose a SQL query (can have nested queries)
into query blocks (without nested queries).
Translate each query block into relational
algebra expressions.
Optimize the relational algebra expression, one
query block at a time
Enumerate a subset of possible evaluation plans
(also call explore the plan space)
Estimate the cost (disk I/Os) of each explored
plan using system catalogs and statistics
Pick the plan with the least estimated cost

6
Decompose a Query into Query Blocks (Example)
SELECT S.sid, MIN(R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sidR.sid AND
R.bidB.bid AND B.colorred AND
S.rating (SELECT MAX(S2.rating)
FROM Sailors
S2) GROUP BY S.sid HAVING COUNT() gt 1
SELECT S.sid, MIN(R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sidR.sid AND
R.bidB.bid AND B.colorred
AND S.rating Reference to the
nested block GROUP BY S.sid HAVING COUNT() gt 1
SELECT MAX(S2.rating) FROM Sailors S2
Nested block
Outer block
7
Optimizing A Query Block

Query blocks are optimized one block at a time.
Nested blocks are usually treated as calls to a
subroutine, made once per tuple in the outer
block.
This is an over-simplification, but good enough
for now.
To estimate I/O cost, the optimizer estimates the
size of (intermediate) results.
System catalogs about the lengths of (projected)
fields
Statistics about referenced relationships (file
size tuples)
Available access methods (indexes selection
conditions), for each relationship in from clause

8
Translate Query Block into Relational Algebra
Expr (1)
SELECT S.sid, MIN(R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sidR.sid AND
R.bidB.bid AND B.colorred
AND S.rating Reference to the
nested block GROUP BY S.sid HAVING COUNT() gt 1

pS.sid,MIN(R.day) (
HAVING COUNT()gt2
GROUP BY S.sid (
s S.sidR.sid AND R.bidB.bid
AND B.colorred
AND S.ratingvalue_from_nested_block (
Sailors?Reserves ?Boats))))

Assume that GROUP BY and HAVING are also
operators.

9
Translate Query Block (2)
pS.sid,MIN(R.day) ( HAVING COUNT()gt2 GROUP
BY S.sid ( s S.sidR.sid AND R.bidB.bid
AND B.colorred AND
S.ratingvalue_from_nested_block (
Sailors?Reserves ?Boats))))

Try to simplify query block further into sp?
expression.
Why do this? Easier to find equivalent sp?
expressions (alternative plans), and they may
have cheaper costs.
How about GROUP BY and HAVING?
They are carried out after the result of sp?.
Add attributes specified in GROUP BY and HAVING
into projection list.

pS.sid,MIN(R.day) ( HAVING COUNT()gt2 GROUP
BY S.sid (sp? expression ) pS.sid,R.day ( s
S.sidR.sid AND R.bidB.bid
AND B.colorred AND S.rating
value_from_nested_block ( Sailors ? Reserves ?
Boats))))))
10
Relational Algebra Tree

Represent a plan, which is a relational algebra
(RA) expression, as a RA tree.

sname
sbid100 rating gt 5
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
sidsid
Sailors
Reserves
11
Estimate Cost of a Plan

For each enumerated plan, estimate its cost
Each node in the tree involves a relational
operator. We must estimate the cost of
evaluating a relational operator.
Size of inputs (pages), table statistics
(selection conditions), available indexes, and
chosen algorithms for evaluating operators
(Chapter 14).
For each node in the tree, we need to estimate
the size of the results and whether the results
are sorted or not.
Since the results are inputs to the upper node,
they are used to estimate the cost of the upper
node (operator).

12
Notes on Query Optimizer

Cost estimation is only an approximation.
Consider the cost of Disk I/Os.
Plan Space
Too large -gt too many possible plans to enumerate
and too expensive to estimate the costs for all
of them, must be restricted.
Consider only the space of left-deep plans. Why?
Left-deep plans allow output of each operator to
be pipelined into the next (parent) operator
without physically storing it in a temporary
relation.
Avoid cartesian products.

13
Left Deep Join Trees

Restrict the plan search space to only left-deep
join trees.
As the number of joins increases, the number of
alternative plans grows rapidly we need to
restrict the search space.
Left-deep trees allow us to generate all fully
pipelined plans (if we choose so).
Intermediate results not written to temporary
files.
Not all left-deep trees are fully pipelined,
depending on the choice of join algorithm (e.g.,
Sort-Merge join).

?
?
?
D
?
?
?
C
B
A
D
B
A
C
14
Estimating Result Sizes
SELECT attribute list FROM relation
list WHERE term_1 AND term_2 AND term_3 AND
term_n

How to estimate the size of result by an operator
on given inputs?
Use information from system catalogs and
statistics.
For each term, find tuple reduction factors (
expected input tuples / expected qualified
tuples).

Column value 1 / NKeys(I), if column is index(I), or 10.
Column1 Column2 1 / MAX (NKeys(I1), NKeys(I2)), or 1 / NKeys(I), or 10.
Column gt value (High(I) value) / (High(I) Low(I)), or lt50.
Column in (list of values) (reduction factor for column value) values
Very Rough Estimation of Reduction Factors
15
Improved Statistics Histograms

The rough estimation assumes uniform
distributions of values.
What if that assumption is not true? For more
accurate estimations, use histograms.
Histogram is a data structure to approximate data
distribution.
Term children with (age gt 3) result size 2
tuples.

7
5
2
1
1
ages
1
2
3
4
5
16
Relational Algebra Equivalences

Allow us to choose different join orders and to
push selections and projections ahead of joins.
Selections (cascading of selections)
Break a selection condition into many smaller
selections.
Combine several selections into one selection.
s c1 AND c2 AND cn (R) s c1 (s c2 ( s cn
(R)) ))
Selection (commutative)
Test conditions in either order.
s c1 (s c2 (R)) s c2 (s c1 (R))

17
Relational Algebra Equivalences (Projections,
Joins, and Cross-Products)

Projections (cascading projections)
Successively eliminating columns is same as
eliminating all but the columns of final
projection.
pa1.(R) pa1.( pa2.((pan R)) ..)), where a1 is
a set of attributes of relation R, and ai is a
subset of ai1
psid.(R) psid.( psid, bid (Reserves))
Joins and Cross-Products (commutative)
Freedom to choose inner or outer relations
R?S S?R
Joins and Cross-Products (associative)
Join pairs of relations in any order
R?(S ?T) (R?S) ?T

18
Relational Algebra Equivalences Involving Two or
More Operators (1)

Commute a selection with a projection
when the selection condition c involves only
attributes retained by the projection a.
p a (s c (R)) s c (p a (R))
p sid (s sid10 (R)) s sid10 (p sid (R))
p sid (s bid10 (R)) ? s bid10 (p sid (R))
Combine a selection with a cross-product to form
a join.
Push a selection into a cross-product (join)
When the selection condition c involves only
attributes of one of the arguments to the
cross-product (join).
s c (R ? S ) s c (R) ? S
s R.bid10 (R ? S ) s R.bid10 (R) ? S
s S.rating10 (R ? S ) ? s S.rating10 (R) ? S

19
Relational Algebra Equivalences Involving Two or
More Operators (2)

Push a selection into a cross-product (join)
Replace a selection with cascading selections,
and commute selections.
c1 involves attributes of both R S (c2 attrs of
R, c3 attrs of S).
s c (R ? S ) s c1 c2 c3 (R ? S )
s c1 (s c2 (s c3 (R ? S )))
s c1 (s c2 (R) ? s c3 (S ))
s sid10, rnameJane snamePaul (R ? S )
s sid10 (s snameJane (R) ? s rnamePaul
(S ))
Push a projection into a cross-product (join)
when subsets (a1,a2) of projection attribute a
involves only attributes of one of the arguments
to the cross-product (join).
Same as to push a selection with cross-product
(join)
p a (R ? S ) p a1 (R) ? p a2 (S)
p R.sid, S.sname (R ? S ) p R.sid (R) ? p
S.sname (S)

20
Relational Algebra Equivalences Involving Two or
More Operators (3)

Push a projection into a join
When a1 is subset of a in R, a2 is a subset of a
in S, and c is in a.
p a (R ?c S ) p a1 (R) ?c p a2 (S)
p R.sid, S.sname, S.sid (R ? R.sidS.sid S )
(p R.sid (R) ) ? R.sidS.sid (p S.sname,S.sid
(S) )
Push a projection into a join if
a1 is subset of R that appear in a and c, and a2
is a subset of S that appear in a and c
p a (R c S ) p a ( p a1 (R) c p a2
(S))
p R.sid, S.sname (R R.sidS.sid S )
p R.sid,S.sname ( p R.sid,R.sid (R)
R.sidS.sid p S.sname,S.sid (S))

21
Enumeration of Alternative Plans

There are two main cases
Single-relation plans (SELECT FROM Sailors S
)
Multiple-relation plans (SELECT FROM Sailors
S, Reserves R )
For queries over a single relation, queries
consist of a combination of selections,
projections, and aggregate ops (no joins).
The general strategy has two parts
For selections, consider each available access
path (file scan / index) and pick the one with
the least estimated cost.
Projections and aggregations are carried out
together with selections.
For example, if an index is used for a selection,
projection is done for each retrieved tuple, and
the resulting tuples are pipelined into the
aggregate computation.

22
Single-Relation Queries

Example
For each rating greater than 5, print the rating
and the number of 20-year old sailors with that
rating, provided that there are at last two such
sailors with different names

Plan without indexes
Scan Sailors, apply selections and projections.
Write out tuples.
Sort tuples based on S.rating (GROUP BY).
Apply HAVING on the fly at the last sorting step.

23
Single-Relation Queries Plans Utilizing an Index

Single-Index Access Path
If several indexes match selection condition(s),
pick the best access path. Apply projections and
non-primary selections. Sort by grouping
attributes. Do aggregations.
Multiple-Index Access Path
If several indexes match selection condition(s),
use them to retrieve sets of RIDs. Take
intersection of RID sets. Sort resulting RID by
page ID. Retrieve all tuples on the same page,
while applying projections, selections. Sort by
grouping attributes. Do aggregations.
Sorted Index Access Path
If GROUP BY attributes matches a tree index, use
the index to retrieve tuples in order. Then
apply selections, projections, and aggregation
operations.
Index-Only Access Path
If all attributes mentioned in (SELECT, WHERE,
GROUP BY, HAVING) are in the index, can do an
index-only scan.

24
Example Using Single-Index Access Path

B tree index on rating, hash index on age, and
B tree index on ltrating, sname, agegt

Retrieve Sailors tuple (age20) using hash index
on age. (most selective path)
For retrieved tuple, apply (rating gt5 ) and
projection out attributes (rating and sname).
Write results to temp table.
Sort temp table on rating. At the last sorting
step, apply HAVING and final projection.

25
Queries Over Multiple Relations

The general strategy has three parts
Consider and enumerate only left-deep join trees.
Why?
Restrict the search space.
Left-deep trees allow us to generate all fully
pipelined plans.
Consider selections and projections as early as
possible (Push selections and projections into
lower joins)
Estimate the cost for each left-deep plan. Pick
the one with the lowest cost.

26
Enumeration of Left-Deep Plans

Left-deep plans differ from each other in
The order of relations
The access method for each relation
The join algorithm for each join operation
We have discussed how to enumerate different
access methods and estimate their costs for one
relation.
Different join algorithms (e.g., nested loop
join, sort-merge join, hash join, ) and their
cost analysis are discussed in Chapter 14.

27
Plan Enumeration Algorithm

Enumerated using N passes (if N relations
joined)
Pass 1 Find best 1-relation plan for each
relation.
Push selection terms projection attributes to
that 1-relation (using equivalences)
Consider all different access paths, and pick the
one with lowest cost. This is same algorithm as
1-relation query.
Pass 2 Find best way to join result of each
1-relation plan (as outer) to another relation.
(All 2-relation plans.)
Again push selection terms projection
attributes into the inner relation using
equivalences.
Try to pipeline the selected/projected tuples
from the inner relation into join operation with
outer relation. (Sometimes cannot, e.g.,
sort-merge join)
Pass N Find best way to join result of a
(N-1)-relation plan (as outer) to the Nth
relation. (All N-relation plans.)

28
Plan Enumeration Algorithm Illustration
Pass 1 Same as Single-Relation Query
A
B
C
pipelined into join
consider alt join algs
push sel/proj
Pass 2
outer
inner
Pass 3
29
Enumeration of Plans (Contd.)

For each subset of relations, retain only
Cheapest plan overall, plus
Cheapest plan for each interesting order of the
tuples.
ORDER BY, GROUP BY, aggregates etc. handled as a
final step. If tuples going into them are not
sorted, apply additional sorting.
Consider left-deep trees, this approach is still
exponential in the of relations.
N relations -gt O( N! ) plans

30
Example
Sailors B tree index on rating Hash index
on sid Reserves B tree index on bid

Pass1
Sailors B tree matches ratinggt5, and is
probably cheapest. However, if this selection is
expected to retrieve a lot of tuples, and index
is unclustered, file scan may be cheaper.
Still, B tree plan kept (because tuples are in
rating order).
Reserves B tree on bid matches bid500
cheapest.

Pass 2
We consider each plan retained from Pass 1 as the
outer, and consider how to join it with the
(only) other relation.
Reserves as outer (better) hash index can be
used to get Sailors tuples that match sid outer
tuples sid value (hash join)
Sailors as outer (worse) B tree index can be
used to get Reserves tuples that match bid. Then
pipeline tuples into join, or write them out for
sort-merge join.

31
Nested Subqueries

Nested block is optimized independently
(sometimes just evaluated once).
In general, nested queries are dealt with using
some form of nested loops evaluation.
Main query as the outer loop, the subquery as the
inner loop
This is necessary for correlated queries below (S
is used both in the main query and the
subquery.).

SELECT S.sname FROM Sailors S WHERE S.rating
(SELECT MAX(S2.ratings) FROM
Sailors S2)
SELECT S.sname FROM Sailors S WHERE S.sid IN
(SELECT R.sid FROM Reserves R WHERE
R.bid103)
SELECT S.sname FROM Sailors S WHERE EXISTS
(SELECT FROM Reserves R WHERE
R.bid103 AND S.sidR.rid)
32
Summary

Query optimization is an important task in a
relational DBMS.
Must understand optimization in order to
understand the performance impact of a given
database design (relations, indexes) on a
workload (set of queries).
Two parts to optimizing a query
Consider a set of alternative plans.
Must prune search space typically, left-deep
plans only.
Must estimate cost of each plan that is
considered.
Must estimate size of result and cost for each
plan node.
Key issues Statistics, indexes, choice of
operator algorithms.

33
Summary (Contd.)

Single-relation queries
All access paths considered, cheapest is chosen.
Multiple-relation queries
All single-relation plans are first enumerated.
Selections/projections considered as early as
possible.
Next, for each 1-relation plan, all different
ways of joining another relation (as inner) are
considered.
Next, for each 2-relation plan that is
retained, all ways of joining another relation
(as inner) are considered, etc.
At each level, for each subset of relations, only
best plan for each interesting order of tuples is
retained.

Write a Comment

User Comments (0)

About PowerShow.com

Database Systems (?????) - PowerPoint PPT Presentation

Database Systems (?????)

This is an over-simplification, but good enough for now. ... Nested block is optimized independently (sometimes just evaluated once) ... – PowerPoint PPT presentation