Title: Database Systems (?????)
1Database Systems(?????)
- December 13, 2004
- Chapter 15
- By Hao-hua Chu (???)
2Announcement
- Assignment 9 is due next Thursday.
- Read chapter 16 for next lecture.
3Cool Ubicomp ProjectHyperdragging (Sony, 1999)
- Not enough working surface on your computer
screen . - No shared working surface when collaborating with
other people .
4A Typical Query Optimizer
5How does a query optimizer work in general?
- Decompose a SQL query (can have nested queries)
into query blocks (without nested queries). - Translate each query block into relational
algebra expressions. - Optimize the relational algebra expression, one
query block at a time - Enumerate a subset of possible evaluation plans
(also call explore the plan space) - Estimate the cost (disk I/Os) of each explored
plan using system catalogs and statistics - Pick the plan with the least estimated cost
6Decompose a Query into Query Blocks (Example)
SELECT S.sid, MIN(R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sidR.sid AND
R.bidB.bid AND B.colorred AND
S.rating (SELECT MAX(S2.rating)
FROM Sailors
S2) GROUP BY S.sid HAVING COUNT() gt 1
SELECT S.sid, MIN(R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sidR.sid AND
R.bidB.bid AND B.colorred
AND S.rating Reference to the
nested block GROUP BY S.sid HAVING COUNT() gt 1
SELECT MAX(S2.rating) FROM Sailors S2
Nested block
Outer block
7Optimizing A Query Block
- Query blocks are optimized one block at a time.
- Nested blocks are usually treated as calls to a
subroutine, made once per tuple in the outer
block. - This is an over-simplification, but good enough
for now. - To estimate I/O cost, the optimizer estimates the
size of (intermediate) results. - System catalogs about the lengths of (projected)
fields - Statistics about referenced relationships (file
size tuples) - Available access methods (indexes selection
conditions), for each relationship in from clause
8Translate Query Block into Relational Algebra
Expr (1)
SELECT S.sid, MIN(R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sidR.sid AND
R.bidB.bid AND B.colorred
AND S.rating Reference to the
nested block GROUP BY S.sid HAVING COUNT() gt 1
- pS.sid,MIN(R.day) (
- HAVING COUNT()gt2
- GROUP BY S.sid (
- s S.sidR.sid AND R.bidB.bid
- AND B.colorred
- AND S.ratingvalue_from_nested_block (
- Sailors?Reserves ?Boats))))
- Assume that GROUP BY and HAVING are also
operators.
9Translate Query Block (2)
pS.sid,MIN(R.day) ( HAVING COUNT()gt2 GROUP
BY S.sid ( s S.sidR.sid AND R.bidB.bid
AND B.colorred AND
S.ratingvalue_from_nested_block (
Sailors?Reserves ?Boats))))
- Try to simplify query block further into sp?
expression. - Why do this? Easier to find equivalent sp?
expressions (alternative plans), and they may
have cheaper costs. - How about GROUP BY and HAVING?
- They are carried out after the result of sp?.
- Add attributes specified in GROUP BY and HAVING
into projection list.
pS.sid,MIN(R.day) ( HAVING COUNT()gt2 GROUP
BY S.sid (sp? expression ) pS.sid,R.day ( s
S.sidR.sid AND R.bidB.bid
AND B.colorred AND S.rating
value_from_nested_block ( Sailors ? Reserves ?
Boats))))))
10Relational Algebra Tree
- Represent a plan, which is a relational algebra
(RA) expression, as a RA tree.
sname
sbid100 rating gt 5
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
sidsid
Sailors
Reserves
11Estimate Cost of a Plan
- For each enumerated plan, estimate its cost
- Each node in the tree involves a relational
operator. We must estimate the cost of
evaluating a relational operator. - Size of inputs (pages), table statistics
(selection conditions), available indexes, and
chosen algorithms for evaluating operators
(Chapter 14). - For each node in the tree, we need to estimate
the size of the results and whether the results
are sorted or not. - Since the results are inputs to the upper node,
they are used to estimate the cost of the upper
node (operator).
12Notes on Query Optimizer
- Cost estimation is only an approximation.
- Consider the cost of Disk I/Os.
- Plan Space
- Too large -gt too many possible plans to enumerate
and too expensive to estimate the costs for all
of them, must be restricted. - Consider only the space of left-deep plans. Why?
- Left-deep plans allow output of each operator to
be pipelined into the next (parent) operator
without physically storing it in a temporary
relation. - Avoid cartesian products.
13Left Deep Join Trees
- Restrict the plan search space to only left-deep
join trees. - As the number of joins increases, the number of
alternative plans grows rapidly we need to
restrict the search space. - Left-deep trees allow us to generate all fully
pipelined plans (if we choose so). - Intermediate results not written to temporary
files. - Not all left-deep trees are fully pipelined,
depending on the choice of join algorithm (e.g.,
Sort-Merge join).
?
?
?
D
?
?
?
C
B
A
D
B
A
C
14Estimating Result Sizes
SELECT attribute list FROM relation
list WHERE term_1 AND term_2 AND term_3 AND
term_n
- How to estimate the size of result by an operator
on given inputs? - Use information from system catalogs and
statistics. - For each term, find tuple reduction factors (
expected input tuples / expected qualified
tuples).
Column value 1 / NKeys(I), if column is index(I), or 10.
Column1 Column2 1 / MAX (NKeys(I1), NKeys(I2)), or 1 / NKeys(I), or 10.
Column gt value (High(I) value) / (High(I) Low(I)), or lt50.
Column in (list of values) (reduction factor for column value) values
Very Rough Estimation of Reduction Factors
15Improved Statistics Histograms
- The rough estimation assumes uniform
distributions of values. - What if that assumption is not true? For more
accurate estimations, use histograms. - Histogram is a data structure to approximate data
distribution. - Term children with (age gt 3) result size 2
tuples.
7
5
2
1
1
ages
1
2
3
4
5
16Relational Algebra Equivalences
- Allow us to choose different join orders and to
push selections and projections ahead of joins. - Selections (cascading of selections)
- Break a selection condition into many smaller
selections. - Combine several selections into one selection.
- s c1 AND c2 AND cn (R) s c1 (s c2 ( s cn
(R)) )) - Selection (commutative)
- Test conditions in either order.
- s c1 (s c2 (R)) s c2 (s c1 (R))
17Relational Algebra Equivalences (Projections,
Joins, and Cross-Products)
- Projections (cascading projections)
- Successively eliminating columns is same as
eliminating all but the columns of final
projection. - pa1.(R) pa1.( pa2.((pan R)) ..)), where a1 is
a set of attributes of relation R, and ai is a
subset of ai1 - psid.(R) psid.( psid, bid (Reserves))
- Joins and Cross-Products (commutative)
- Freedom to choose inner or outer relations
- R?S S?R
- Joins and Cross-Products (associative)
- Join pairs of relations in any order
- R?(S ?T) (R?S) ?T
18Relational Algebra Equivalences Involving Two or
More Operators (1)
- Commute a selection with a projection
- when the selection condition c involves only
attributes retained by the projection a. - p a (s c (R)) s c (p a (R))
- p sid (s sid10 (R)) s sid10 (p sid (R))
- p sid (s bid10 (R)) ? s bid10 (p sid (R))
- Combine a selection with a cross-product to form
a join. - Push a selection into a cross-product (join)
- When the selection condition c involves only
attributes of one of the arguments to the
cross-product (join). - s c (R ? S ) s c (R) ? S
- s R.bid10 (R ? S ) s R.bid10 (R) ? S
- s S.rating10 (R ? S ) ? s S.rating10 (R) ? S
19Relational Algebra Equivalences Involving Two or
More Operators (2)
- Push a selection into a cross-product (join)
- Replace a selection with cascading selections,
and commute selections. - c1 involves attributes of both R S (c2 attrs of
R, c3 attrs of S). - s c (R ? S ) s c1 c2 c3 (R ? S )
- s c1 (s c2 (s c3 (R ? S )))
- s c1 (s c2 (R) ? s c3 (S ))
- s sid10, rnameJane snamePaul (R ? S )
- s sid10 (s snameJane (R) ? s rnamePaul
(S )) - Push a projection into a cross-product (join)
- when subsets (a1,a2) of projection attribute a
involves only attributes of one of the arguments
to the cross-product (join). - Same as to push a selection with cross-product
(join) - p a (R ? S ) p a1 (R) ? p a2 (S)
- p R.sid, S.sname (R ? S ) p R.sid (R) ? p
S.sname (S)
20Relational Algebra Equivalences Involving Two or
More Operators (3)
- Push a projection into a join
- When a1 is subset of a in R, a2 is a subset of a
in S, and c is in a. - p a (R ?c S ) p a1 (R) ?c p a2 (S)
- p R.sid, S.sname, S.sid (R ? R.sidS.sid S )
- (p R.sid (R) ) ? R.sidS.sid (p S.sname,S.sid
(S) ) - Push a projection into a join if
- a1 is subset of R that appear in a and c, and a2
is a subset of S that appear in a and c - p a (R c S ) p a ( p a1 (R) c p a2
(S)) - p R.sid, S.sname (R R.sidS.sid S )
- p R.sid,S.sname ( p R.sid,R.sid (R)
R.sidS.sid p S.sname,S.sid (S))
21Enumeration of Alternative Plans
- There are two main cases
- Single-relation plans (SELECT FROM Sailors S
) - Multiple-relation plans (SELECT FROM Sailors
S, Reserves R ) - For queries over a single relation, queries
consist of a combination of selections,
projections, and aggregate ops (no joins). - The general strategy has two parts
- For selections, consider each available access
path (file scan / index) and pick the one with
the least estimated cost. - Projections and aggregations are carried out
together with selections. - For example, if an index is used for a selection,
projection is done for each retrieved tuple, and
the resulting tuples are pipelined into the
aggregate computation.
22Single-Relation Queries
- Example
- For each rating greater than 5, print the rating
and the number of 20-year old sailors with that
rating, provided that there are at last two such
sailors with different names
- Plan without indexes
- Scan Sailors, apply selections and projections.
- Write out tuples.
- Sort tuples based on S.rating (GROUP BY).
- Apply HAVING on the fly at the last sorting step.
23Single-Relation Queries Plans Utilizing an Index
- Single-Index Access Path
- If several indexes match selection condition(s),
pick the best access path. Apply projections and
non-primary selections. Sort by grouping
attributes. Do aggregations. - Multiple-Index Access Path
- If several indexes match selection condition(s),
use them to retrieve sets of RIDs. Take
intersection of RID sets. Sort resulting RID by
page ID. Retrieve all tuples on the same page,
while applying projections, selections. Sort by
grouping attributes. Do aggregations. - Sorted Index Access Path
- If GROUP BY attributes matches a tree index, use
the index to retrieve tuples in order. Then
apply selections, projections, and aggregation
operations. - Index-Only Access Path
- If all attributes mentioned in (SELECT, WHERE,
GROUP BY, HAVING) are in the index, can do an
index-only scan.
24Example Using Single-Index Access Path
- B tree index on rating, hash index on age, and
B tree index on ltrating, sname, agegt
- Retrieve Sailors tuple (age20) using hash index
on age. (most selective path) - For retrieved tuple, apply (rating gt5 ) and
projection out attributes (rating and sname). - Write results to temp table.
- Sort temp table on rating. At the last sorting
step, apply HAVING and final projection.
25Queries Over Multiple Relations
- The general strategy has three parts
- Consider and enumerate only left-deep join trees.
Why? - Restrict the search space.
- Left-deep trees allow us to generate all fully
pipelined plans. - Consider selections and projections as early as
possible (Push selections and projections into
lower joins) - Estimate the cost for each left-deep plan. Pick
the one with the lowest cost.
26Enumeration of Left-Deep Plans
- Left-deep plans differ from each other in
- The order of relations
- The access method for each relation
- The join algorithm for each join operation
- We have discussed how to enumerate different
access methods and estimate their costs for one
relation. - Different join algorithms (e.g., nested loop
join, sort-merge join, hash join, ) and their
cost analysis are discussed in Chapter 14.
27Plan Enumeration Algorithm
- Enumerated using N passes (if N relations
joined) - Pass 1 Find best 1-relation plan for each
relation. - Push selection terms projection attributes to
that 1-relation (using equivalences) - Consider all different access paths, and pick the
one with lowest cost. This is same algorithm as
1-relation query. - Pass 2 Find best way to join result of each
1-relation plan (as outer) to another relation.
(All 2-relation plans.) - Again push selection terms projection
attributes into the inner relation using
equivalences. - Try to pipeline the selected/projected tuples
from the inner relation into join operation with
outer relation. (Sometimes cannot, e.g.,
sort-merge join) - Pass N Find best way to join result of a
(N-1)-relation plan (as outer) to the Nth
relation. (All N-relation plans.)
28Plan Enumeration Algorithm Illustration
Pass 1 Same as Single-Relation Query
A
B
C
pipelined into join
consider alt join algs
push sel/proj
Pass 2
outer
inner
Pass 3
29Enumeration of Plans (Contd.)
- For each subset of relations, retain only
- Cheapest plan overall, plus
- Cheapest plan for each interesting order of the
tuples. - ORDER BY, GROUP BY, aggregates etc. handled as a
final step. If tuples going into them are not
sorted, apply additional sorting. - Consider left-deep trees, this approach is still
exponential in the of relations. - N relations -gt O( N! ) plans
30Example
Sailors B tree index on rating Hash index
on sid Reserves B tree index on bid
- Pass1
- Sailors B tree matches ratinggt5, and is
probably cheapest. However, if this selection is
expected to retrieve a lot of tuples, and index
is unclustered, file scan may be cheaper. - Still, B tree plan kept (because tuples are in
rating order). - Reserves B tree on bid matches bid500
cheapest.
- Pass 2
- We consider each plan retained from Pass 1 as the
outer, and consider how to join it with the
(only) other relation. - Reserves as outer (better) hash index can be
used to get Sailors tuples that match sid outer
tuples sid value (hash join) - Sailors as outer (worse) B tree index can be
used to get Reserves tuples that match bid. Then
pipeline tuples into join, or write them out for
sort-merge join.
31Nested Subqueries
- Nested block is optimized independently
(sometimes just evaluated once). - In general, nested queries are dealt with using
some form of nested loops evaluation. - Main query as the outer loop, the subquery as the
inner loop - This is necessary for correlated queries below (S
is used both in the main query and the
subquery.).
SELECT S.sname FROM Sailors S WHERE S.rating
(SELECT MAX(S2.ratings) FROM
Sailors S2)
SELECT S.sname FROM Sailors S WHERE S.sid IN
(SELECT R.sid FROM Reserves R WHERE
R.bid103)
SELECT S.sname FROM Sailors S WHERE EXISTS
(SELECT FROM Reserves R WHERE
R.bid103 AND S.sidR.rid)
32Summary
- Query optimization is an important task in a
relational DBMS. - Must understand optimization in order to
understand the performance impact of a given
database design (relations, indexes) on a
workload (set of queries). - Two parts to optimizing a query
- Consider a set of alternative plans.
- Must prune search space typically, left-deep
plans only. - Must estimate cost of each plan that is
considered. - Must estimate size of result and cost for each
plan node. - Key issues Statistics, indexes, choice of
operator algorithms.
33Summary (Contd.)
- Single-relation queries
- All access paths considered, cheapest is chosen.
- Multiple-relation queries
- All single-relation plans are first enumerated.
- Selections/projections considered as early as
possible. - Next, for each 1-relation plan, all different
ways of joining another relation (as inner) are
considered. - Next, for each 2-relation plan that is
retained, all ways of joining another relation
(as inner) are considered, etc. - At each level, for each subset of relations, only
best plan for each interesting order of tuples is
retained.