Database Systems (?????) - PowerPoint PPT Presentation

1 / 33
About This Presentation
Title:

Database Systems (?????)

Description:

This is an over-simplification, but good enough for now. ... Nested block is optimized independently (sometimes just evaluated once) ... – PowerPoint PPT presentation

Number of Views:23
Avg rating:3.0/5.0
Slides: 34
Provided by: haohu1
Category:

less

Transcript and Presenter's Notes

Title: Database Systems (?????)


1
Database Systems(?????)
  • December 13, 2004
  • Chapter 15
  • By Hao-hua Chu (???)

2
Announcement
  • Assignment 9 is due next Thursday.
  • Read chapter 16 for next lecture.

3
Cool Ubicomp ProjectHyperdragging (Sony, 1999)
  • Not enough working surface on your computer
    screen .
  • No shared working surface when collaborating with
    other people .

4
A Typical Query Optimizer
  • Chapter 15

5
How does a query optimizer work in general?
  • Decompose a SQL query (can have nested queries)
    into query blocks (without nested queries).
  • Translate each query block into relational
    algebra expressions.
  • Optimize the relational algebra expression, one
    query block at a time
  • Enumerate a subset of possible evaluation plans
    (also call explore the plan space)
  • Estimate the cost (disk I/Os) of each explored
    plan using system catalogs and statistics
  • Pick the plan with the least estimated cost

6
Decompose a Query into Query Blocks (Example)
SELECT S.sid, MIN(R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sidR.sid AND
R.bidB.bid AND B.colorred AND
S.rating (SELECT MAX(S2.rating)
FROM Sailors
S2) GROUP BY S.sid HAVING COUNT() gt 1
SELECT S.sid, MIN(R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sidR.sid AND
R.bidB.bid AND B.colorred
AND S.rating Reference to the
nested block GROUP BY S.sid HAVING COUNT() gt 1
SELECT MAX(S2.rating) FROM Sailors S2
Nested block
Outer block
7
Optimizing A Query Block
  • Query blocks are optimized one block at a time.
  • Nested blocks are usually treated as calls to a
    subroutine, made once per tuple in the outer
    block.
  • This is an over-simplification, but good enough
    for now.
  • To estimate I/O cost, the optimizer estimates the
    size of (intermediate) results.
  • System catalogs about the lengths of (projected)
    fields
  • Statistics about referenced relationships (file
    size tuples)
  • Available access methods (indexes selection
    conditions), for each relationship in from clause

8
Translate Query Block into Relational Algebra
Expr (1)
SELECT S.sid, MIN(R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sidR.sid AND
R.bidB.bid AND B.colorred
AND S.rating Reference to the
nested block GROUP BY S.sid HAVING COUNT() gt 1
  • pS.sid,MIN(R.day) (
  • HAVING COUNT()gt2
  • GROUP BY S.sid (
  • s S.sidR.sid AND R.bidB.bid
  • AND B.colorred
  • AND S.ratingvalue_from_nested_block (
  • Sailors?Reserves ?Boats))))
  • Assume that GROUP BY and HAVING are also
    operators.

9
Translate Query Block (2)
pS.sid,MIN(R.day) ( HAVING COUNT()gt2 GROUP
BY S.sid ( s S.sidR.sid AND R.bidB.bid
AND B.colorred AND
S.ratingvalue_from_nested_block (
Sailors?Reserves ?Boats))))
  • Try to simplify query block further into sp?
    expression.
  • Why do this? Easier to find equivalent sp?
    expressions (alternative plans), and they may
    have cheaper costs.
  • How about GROUP BY and HAVING?
  • They are carried out after the result of sp?.
  • Add attributes specified in GROUP BY and HAVING
    into projection list.

pS.sid,MIN(R.day) ( HAVING COUNT()gt2 GROUP
BY S.sid (sp? expression ) pS.sid,R.day ( s
S.sidR.sid AND R.bidB.bid
AND B.colorred AND S.rating
value_from_nested_block ( Sailors ? Reserves ?
Boats))))))
10
Relational Algebra Tree
  • Represent a plan, which is a relational algebra
    (RA) expression, as a RA tree.

sname
sbid100 rating gt 5
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
sidsid
Sailors
Reserves
11
Estimate Cost of a Plan
  • For each enumerated plan, estimate its cost
  • Each node in the tree involves a relational
    operator. We must estimate the cost of
    evaluating a relational operator.
  • Size of inputs (pages), table statistics
    (selection conditions), available indexes, and
    chosen algorithms for evaluating operators
    (Chapter 14).
  • For each node in the tree, we need to estimate
    the size of the results and whether the results
    are sorted or not.
  • Since the results are inputs to the upper node,
    they are used to estimate the cost of the upper
    node (operator).

12
Notes on Query Optimizer
  • Cost estimation is only an approximation.
  • Consider the cost of Disk I/Os.
  • Plan Space
  • Too large -gt too many possible plans to enumerate
    and too expensive to estimate the costs for all
    of them, must be restricted.
  • Consider only the space of left-deep plans. Why?
  • Left-deep plans allow output of each operator to
    be pipelined into the next (parent) operator
    without physically storing it in a temporary
    relation.
  • Avoid cartesian products.

13
Left Deep Join Trees
  • Restrict the plan search space to only left-deep
    join trees.
  • As the number of joins increases, the number of
    alternative plans grows rapidly we need to
    restrict the search space.
  • Left-deep trees allow us to generate all fully
    pipelined plans (if we choose so).
  • Intermediate results not written to temporary
    files.
  • Not all left-deep trees are fully pipelined,
    depending on the choice of join algorithm (e.g.,
    Sort-Merge join).

?
?
?
D
?
?
?
C
B
A
D
B
A
C
14
Estimating Result Sizes
SELECT attribute list FROM relation
list WHERE term_1 AND term_2 AND term_3 AND
term_n
  • How to estimate the size of result by an operator
    on given inputs?
  • Use information from system catalogs and
    statistics.
  • For each term, find tuple reduction factors (
    expected input tuples / expected qualified
    tuples).

Column value 1 / NKeys(I), if column is index(I), or 10.
Column1 Column2 1 / MAX (NKeys(I1), NKeys(I2)), or 1 / NKeys(I), or 10.
Column gt value (High(I) value) / (High(I) Low(I)), or lt50.
Column in (list of values) (reduction factor for column value) values
Very Rough Estimation of Reduction Factors
15
Improved Statistics Histograms
  • The rough estimation assumes uniform
    distributions of values.
  • What if that assumption is not true? For more
    accurate estimations, use histograms.
  • Histogram is a data structure to approximate data
    distribution.
  • Term children with (age gt 3) result size 2
    tuples.

7
5
2
1
1
ages
1
2
3
4
5
16
Relational Algebra Equivalences
  • Allow us to choose different join orders and to
    push selections and projections ahead of joins.
  • Selections (cascading of selections)
  • Break a selection condition into many smaller
    selections.
  • Combine several selections into one selection.
  • s c1 AND c2 AND cn (R) s c1 (s c2 ( s cn
    (R)) ))
  • Selection (commutative)
  • Test conditions in either order.
  • s c1 (s c2 (R)) s c2 (s c1 (R))

17
Relational Algebra Equivalences (Projections,
Joins, and Cross-Products)
  • Projections (cascading projections)
  • Successively eliminating columns is same as
    eliminating all but the columns of final
    projection.
  • pa1.(R) pa1.( pa2.((pan R)) ..)), where a1 is
    a set of attributes of relation R, and ai is a
    subset of ai1
  • psid.(R) psid.( psid, bid (Reserves))
  • Joins and Cross-Products (commutative)
  • Freedom to choose inner or outer relations
  • R?S S?R
  • Joins and Cross-Products (associative)
  • Join pairs of relations in any order
  • R?(S ?T) (R?S) ?T

18
Relational Algebra Equivalences Involving Two or
More Operators (1)
  • Commute a selection with a projection
  • when the selection condition c involves only
    attributes retained by the projection a.
  • p a (s c (R)) s c (p a (R))
  • p sid (s sid10 (R)) s sid10 (p sid (R))
  • p sid (s bid10 (R)) ? s bid10 (p sid (R))
  • Combine a selection with a cross-product to form
    a join.
  • Push a selection into a cross-product (join)
  • When the selection condition c involves only
    attributes of one of the arguments to the
    cross-product (join).
  • s c (R ? S ) s c (R) ? S
  • s R.bid10 (R ? S ) s R.bid10 (R) ? S
  • s S.rating10 (R ? S ) ? s S.rating10 (R) ? S

19
Relational Algebra Equivalences Involving Two or
More Operators (2)
  • Push a selection into a cross-product (join)
  • Replace a selection with cascading selections,
    and commute selections.
  • c1 involves attributes of both R S (c2 attrs of
    R, c3 attrs of S).
  • s c (R ? S ) s c1 c2 c3 (R ? S )
  • s c1 (s c2 (s c3 (R ? S )))
  • s c1 (s c2 (R) ? s c3 (S ))
  • s sid10, rnameJane snamePaul (R ? S )
  • s sid10 (s snameJane (R) ? s rnamePaul
    (S ))
  • Push a projection into a cross-product (join)
  • when subsets (a1,a2) of projection attribute a
    involves only attributes of one of the arguments
    to the cross-product (join).
  • Same as to push a selection with cross-product
    (join)
  • p a (R ? S ) p a1 (R) ? p a2 (S)
  • p R.sid, S.sname (R ? S ) p R.sid (R) ? p
    S.sname (S)

20
Relational Algebra Equivalences Involving Two or
More Operators (3)
  • Push a projection into a join
  • When a1 is subset of a in R, a2 is a subset of a
    in S, and c is in a.
  • p a (R ?c S ) p a1 (R) ?c p a2 (S)
  • p R.sid, S.sname, S.sid (R ? R.sidS.sid S )
  • (p R.sid (R) ) ? R.sidS.sid (p S.sname,S.sid
    (S) )
  • Push a projection into a join if
  • a1 is subset of R that appear in a and c, and a2
    is a subset of S that appear in a and c
  • p a (R c S ) p a ( p a1 (R) c p a2
    (S))
  • p R.sid, S.sname (R R.sidS.sid S )
  • p R.sid,S.sname ( p R.sid,R.sid (R)
    R.sidS.sid p S.sname,S.sid (S))

21
Enumeration of Alternative Plans
  • There are two main cases
  • Single-relation plans (SELECT FROM Sailors S
    )
  • Multiple-relation plans (SELECT FROM Sailors
    S, Reserves R )
  • For queries over a single relation, queries
    consist of a combination of selections,
    projections, and aggregate ops (no joins).
  • The general strategy has two parts
  • For selections, consider each available access
    path (file scan / index) and pick the one with
    the least estimated cost.
  • Projections and aggregations are carried out
    together with selections.
  • For example, if an index is used for a selection,
    projection is done for each retrieved tuple, and
    the resulting tuples are pipelined into the
    aggregate computation.

22
Single-Relation Queries
  • Example
  • For each rating greater than 5, print the rating
    and the number of 20-year old sailors with that
    rating, provided that there are at last two such
    sailors with different names
  • Plan without indexes
  • Scan Sailors, apply selections and projections.
  • Write out tuples.
  • Sort tuples based on S.rating (GROUP BY).
  • Apply HAVING on the fly at the last sorting step.

23
Single-Relation Queries Plans Utilizing an Index
  • Single-Index Access Path
  • If several indexes match selection condition(s),
    pick the best access path. Apply projections and
    non-primary selections. Sort by grouping
    attributes. Do aggregations.
  • Multiple-Index Access Path
  • If several indexes match selection condition(s),
    use them to retrieve sets of RIDs. Take
    intersection of RID sets. Sort resulting RID by
    page ID. Retrieve all tuples on the same page,
    while applying projections, selections. Sort by
    grouping attributes. Do aggregations.
  • Sorted Index Access Path
  • If GROUP BY attributes matches a tree index, use
    the index to retrieve tuples in order. Then
    apply selections, projections, and aggregation
    operations.
  • Index-Only Access Path
  • If all attributes mentioned in (SELECT, WHERE,
    GROUP BY, HAVING) are in the index, can do an
    index-only scan.

24
Example Using Single-Index Access Path
  • B tree index on rating, hash index on age, and
    B tree index on ltrating, sname, agegt
  • Retrieve Sailors tuple (age20) using hash index
    on age. (most selective path)
  • For retrieved tuple, apply (rating gt5 ) and
    projection out attributes (rating and sname).
  • Write results to temp table.
  • Sort temp table on rating. At the last sorting
    step, apply HAVING and final projection.

25
Queries Over Multiple Relations
  • The general strategy has three parts
  • Consider and enumerate only left-deep join trees.
    Why?
  • Restrict the search space.
  • Left-deep trees allow us to generate all fully
    pipelined plans.
  • Consider selections and projections as early as
    possible (Push selections and projections into
    lower joins)
  • Estimate the cost for each left-deep plan. Pick
    the one with the lowest cost.

26
Enumeration of Left-Deep Plans
  • Left-deep plans differ from each other in
  • The order of relations
  • The access method for each relation
  • The join algorithm for each join operation
  • We have discussed how to enumerate different
    access methods and estimate their costs for one
    relation.
  • Different join algorithms (e.g., nested loop
    join, sort-merge join, hash join, ) and their
    cost analysis are discussed in Chapter 14.

27
Plan Enumeration Algorithm
  • Enumerated using N passes (if N relations
    joined)
  • Pass 1 Find best 1-relation plan for each
    relation.
  • Push selection terms projection attributes to
    that 1-relation (using equivalences)
  • Consider all different access paths, and pick the
    one with lowest cost. This is same algorithm as
    1-relation query.
  • Pass 2 Find best way to join result of each
    1-relation plan (as outer) to another relation.
    (All 2-relation plans.)
  • Again push selection terms projection
    attributes into the inner relation using
    equivalences.
  • Try to pipeline the selected/projected tuples
    from the inner relation into join operation with
    outer relation. (Sometimes cannot, e.g.,
    sort-merge join)
  • Pass N Find best way to join result of a
    (N-1)-relation plan (as outer) to the Nth
    relation. (All N-relation plans.)

28
Plan Enumeration Algorithm Illustration
Pass 1 Same as Single-Relation Query
A
B
C
pipelined into join
consider alt join algs
push sel/proj
Pass 2
outer
inner
Pass 3
29
Enumeration of Plans (Contd.)
  • For each subset of relations, retain only
  • Cheapest plan overall, plus
  • Cheapest plan for each interesting order of the
    tuples.
  • ORDER BY, GROUP BY, aggregates etc. handled as a
    final step. If tuples going into them are not
    sorted, apply additional sorting.
  • Consider left-deep trees, this approach is still
    exponential in the of relations.
  • N relations -gt O( N! ) plans

30
Example
Sailors B tree index on rating Hash index
on sid Reserves B tree index on bid
  • Pass1
  • Sailors B tree matches ratinggt5, and is
    probably cheapest. However, if this selection is
    expected to retrieve a lot of tuples, and index
    is unclustered, file scan may be cheaper.
  • Still, B tree plan kept (because tuples are in
    rating order).
  • Reserves B tree on bid matches bid500
    cheapest.
  • Pass 2
  • We consider each plan retained from Pass 1 as the
    outer, and consider how to join it with the
    (only) other relation.
  • Reserves as outer (better) hash index can be
    used to get Sailors tuples that match sid outer
    tuples sid value (hash join)
  • Sailors as outer (worse) B tree index can be
    used to get Reserves tuples that match bid. Then
    pipeline tuples into join, or write them out for
    sort-merge join.

31
Nested Subqueries
  • Nested block is optimized independently
    (sometimes just evaluated once).
  • In general, nested queries are dealt with using
    some form of nested loops evaluation.
  • Main query as the outer loop, the subquery as the
    inner loop
  • This is necessary for correlated queries below (S
    is used both in the main query and the
    subquery.).

SELECT S.sname FROM Sailors S WHERE S.rating
(SELECT MAX(S2.ratings) FROM
Sailors S2)
SELECT S.sname FROM Sailors S WHERE S.sid IN
(SELECT R.sid FROM Reserves R WHERE
R.bid103)
SELECT S.sname FROM Sailors S WHERE EXISTS
(SELECT FROM Reserves R WHERE
R.bid103 AND S.sidR.rid)
32
Summary
  • Query optimization is an important task in a
    relational DBMS.
  • Must understand optimization in order to
    understand the performance impact of a given
    database design (relations, indexes) on a
    workload (set of queries).
  • Two parts to optimizing a query
  • Consider a set of alternative plans.
  • Must prune search space typically, left-deep
    plans only.
  • Must estimate cost of each plan that is
    considered.
  • Must estimate size of result and cost for each
    plan node.
  • Key issues Statistics, indexes, choice of
    operator algorithms.

33
Summary (Contd.)
  • Single-relation queries
  • All access paths considered, cheapest is chosen.
  • Multiple-relation queries
  • All single-relation plans are first enumerated.
  • Selections/projections considered as early as
    possible.
  • Next, for each 1-relation plan, all different
    ways of joining another relation (as inner) are
    considered.
  • Next, for each 2-relation plan that is
    retained, all ways of joining another relation
    (as inner) are considered, etc.
  • At each level, for each subset of relations, only
    best plan for each interesting order of tuples is
    retained.
Write a Comment
User Comments (0)
About PowerShow.com