Query Optimization - PowerPoint PPT Presentation

1 / 28
About This Presentation
Title:

Query Optimization

Description:

Query Optimization R&G, Chapter 15 Lecture 17 – PowerPoint PPT presentation

Number of Views:89
Avg rating:3.0/5.0
Slides: 29
Provided by: ebe90
Category:

less

Transcript and Presenter's Notes

Title: Query Optimization


1
Query Optimization
  • RG, Chapter 15
  • Lecture 17

2
Administrivia
  • Homework 3 available from class website
  • Due date Tuesday, March 20 by end of class
    period
  • Homework 4 available today
  • Implement nested loops and hash join operators
    for (new!) minibase
  • Due date April 10 (after Spring Break)
  • Midterm 2 is 3/22, 2 weeks from today
  • In class, covers lectures 10-17
  • Review will be held Tuesday 3/20 7-9 pm 306 Soda
    Hall
  • Internships at Google this summer
  • See http//www.postgresql.org/developer/summerofco
    de
  • Booth at the UCB TechExpo this Thursday
  • http//csba.berkeley.edu/tech_expo.html
  • Contact josh_at_postgresql.org

3
Review
We are here
  • Query plans are a tree of operators that compute
    the result of a query
  • Optimization is the process of picking the best
    plan
  • Execution is the process of executing the plan

4
Query Plans turning text into tuples
Query Result
Query
Shift
Name
SELECT A.aname, max(F.feedingshift) FROM Animals
A, Feeding F WHERE A.aid F.aid AND (A.species
'Big Cat' or A.species 'Bear') GROUP BY
A.aname HAVING COUNT() gt 1
Aslan 3
Bageera 3
Elsa 3
Shere Khan 3
Tigger 2
Operators
Query Plan
10 2 1 100 3
10 3 2 100 3
20 3 2 100 3
20 2 3 100 3
30 3 2 100 3

40 100 Aslan Big Cat
50 300 Baloo Bear
60 100 Bageera Big Cat
70 100 Shere Khan Big Cat
90 100 Dumbo Elephant

5
Query Optimization steps
SELECT A.aname, max(F.feedingshift) FROM Animals
A, Feeding F WHERE A.aid F.aid AND (A.species
'Big Cat' or A.species 'Bear') GROUP BY
A.aname HAVING COUNT() gt 1
Query parser
Block 3
  • Parse query from text to intermediate model
  • Traverse intermediate model and produce
    alternate query plans
  • Query plan relational algebra tree
  • Plan cost cumulative cost of tree
  • Consider equivalent but alternative relational
    algebra tress
  • Optimizer keeps track of cost and properties of
    plans
  • Pick the cheapest plan

Block 2
Block 1
Query optimizer
Cost 200
Cost 500
Cost 150
To execution engine
6
Optimized query is 124x cheaper than the original!
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
500,500 IOs
4010 IOs
7
System R-Style Optimization
  • Impact
  • Most widely used currently works well for lt 10
    joins.
  • Cost estimation
  • Plan cost cumulative cost of plan tree
  • Cost function(I/O, CPU) costs
  • Very inexact, but works ok in practice.
  • Statistics, maintained in system catalogs, used
    to estimate cost of operations and result sizes.
  • Plan Space Too large, must be pruned.
  • Many plans share common, overpriced subtrees
  • ignore them all!
  • Only left-deep plans are considered.
  • May cause optimizer to miss good plans
  • Avoid Cartesian products.

8
Query Optimization steps
Block 3
Block 2
Block 1
  • Parse query from text to intermediate model
  • -gt A list of query blocks
  • For each query block, pick the best plan
  • Convert block to tree of relational algebra
    operators
  • Build alternative plans bottom up
  • Leaf nodes represent access paths to relations
  • Consider reordering relational algebra tree
  • Push selections and projections down
  • Consider left-deep join plans
  • Avoid cross products
  • Consider all possible join methods
  • Prune expensive plans with the same properties

Rel 1 access path
Rel 2 access path
X
BNL Cost 200 no sorted order
BNL Cost 150 no sorted order
X
X
HJ Cost 125 no sorted order
HJ Cost 125 no sorted order
X
SM Cost 300 Sorted by bid
SM Cost 300 Sorted by bid
X
9
Schema for Examples
Sailors (sid integer, sname string, rating
integer, age real) Reserves (sid integer, bid
integer, day dates, rname string)
  • Reserves
  • Each tuple is 40 bytes long, 100 tuples per
    page, 1000 pages. 100 distinct bids.
  • Sailors
  • Each tuple is 50 bytes long, 80 tuples per page,
    500 pages. 10 ratings, 40,000 sids.

10
Translating SQL to Relational Algebra

SELECT S.sid, MIN (R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sid R.sid AND
R.bid B.bid AND B.color red AND S.rating gt
5 GROUP BY S.sid HAVING COUNT () gt 2
For each sailor with a rating gt 5 that has
reserved at least 2 red boats, find the sailor id
and the earliest date on which the sailor has a
reservation for a red boat.
11
Translating SQL to Relational Algebra
SELECT S.sid, MIN (R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sid R.sid AND
R.bid B.bid AND B.color red AND S.rating gt
5 GROUP BY S.sid HAVING COUNT () gt 2

HAVING COUNT()gt2
GROUP BY S.Sid
sB.color red
S.rating gt 5
V
Sailors Reserves Boats
12
Relational Algebra Equivalences
  • Allow us to choose different join orders and to
    push selections and projections ahead of joins.
  • Selections can be cascaded
  • ?c1??cn(R) ? ?c1((?cn(R)))

HAVING COUNT()gt2
SELECT S.sid, MIN (R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sid R.sid AND
R.bid B.bid AND B.color red AND S.rating gt
5 GROUP BY S.sid HAVING COUNT () gt 2
GROUP BY S.Sid
sB.color red
Reserves Boats
  • Can apply these predicates separately
  • Can push S.rating gt 5 down to Sailors

sS.rating gt 5
Sailors
13
Relational Algebra Equivalences
  • Selections can be commuted
  • ?c1(?c2(R)) ? ?c2(?c1(R))

SELECT S.sid, MIN (R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sid R.sid AND
R.bid B.bid AND B.color red AND S.rating gt
5 GROUP BY S.sid HAVING COUNT () gt 2
HAVING COUNT()gt2
GROUP BY S.Sid
sS.Rating gt 5
Can apply these predicates in different order
Reserves Sailors
sB.color red
Boats
14
Relational Algebra Equivalences
  • Projections can be cascaded
  • ?a1(R) ? ?a1((?a1, , an(R)))

HAVING COUNT()gt2
SELECT S.sid, MIN (R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sid R.sid AND
R.bid B.bid AND B.color red AND S.rating gt
5 GROUP BY S.sid HAVING COUNT () gt 2
GROUP BY S.Sid
sB.color red
Can project S.sid to reduce size of tuples
Reserves Boats
sS.rating gt 5
Sailors
15
Relational Algebra Equivalences
SELECT S.sid, MIN (R.day) FROM Sailors S,
Reserves R, Boats B WHERE S.sid R.sid AND
R.bid B.bid AND B.color red AND S.rating gt
5 GROUP BY S.sid HAVING COUNT () gt 2
HAVING COUNT()gt2
GROUP BY S.Sid
  • Eager projection
  • Can cascade and push some projections thru
    selection
  • Can cascade and push some projections below one
    side of a join
  • Rule of thumb can project anything not needed
    downstream

sB.color red
sS.rating gt 5
Boats
Sailors
Reserves
16
Relational Algebra Equivalences
  • Cartesian Product and Joins
  • R ? (S ? T) ? (R ? S) ? T (associative)
  • R ? S ? S ? R (commutative)
  • This means we can do joins in any order.
  • Butbeware of cartesian product!
  • Only consider R X S if there is a join predicate
    between R
  • Select S.sid, R.bid, B.bid
  • From Sailors S, Reserves R, Boats B
  • Where S.sid R.sid and R.bid B.bid
  • What about this query?
  • Select S.sid, B.bid
  • From Sailors S, Reserves R, Boats B
  • Where S.sid R.sid

No need to consider plans with S X B
User asked for Cartesian product so you have to
compute it!
17
Cost Estimation
  • For each plan considered, must estimate total
    cost
  • Must estimate cost of each operation in plan
    tree.
  • Plan cost cumulative cost of the operators
  • Operator cost is computed from
  • Input cardinalities.
  • Cost of each operator
  • e.g. (sequential scan, index scan, joins, etc.)
  • Must estimate size of result for each operation
    in tree!
  • It will contribute to the input cardinality for
    the next operator up the tree!
  • Computed from cardinality of input relations
    selecitivity of the predicates.
  • For selections and joins, assume predicates are
    independent.
  • Q Is cost the same as estimated run time?

18
Statistics and Catalogs
  • Optimizer needs statistics about relations and
    their indexes to compute cardinality and
    selectivity estimates.
  • System Catalogs contain metadata about tables and
    relations
  • Just another set of relations you can query them
    like any other table!
  • For optimizer, they typically contain
  • tuples (cardinality) and pages (NPages) per
    reln.
  • distinct key values (NKeys) for each index.
  • low/high key values (Low/High) for each index.
  • Index height (IHeight) for each tree index.
  • index pages (INPages) for each index.
  • Statistics must be kept up to date
  • Updating whenever data changes is too expensive
    lots of approximation anyway, so slight
    inconsistency ok.
  • Updated periodically
  • E.g. DB2 runstats on table feeding
  • Statistics out of date -gt optimizer choosing very
    bad plans!

19
The optimizer relies on good statistics
Table Stats Name NumPages Sailors
1 Reserves 1
T1 Create table Sailors() T2 Create table
Reserves() T3 Runstats on table Sailors T4
Runstats on table Reserves T5 Insert 100 pages
of Sailors into Sailors T6 Runstats on
Sailors T6 Insert 1000 pages of Reservations
into Reserves T7 Run Query
X 100
SELECT S.sname, R.bid FROM Sailors S, Reserves
R WHERE S.sid R.sid
Optimizer would incorrectly pick this plan!
  • Assume 7 pages of memory and only BNL
  • BNLJ SailorsXReserves
  • 100 100/5x1 120

ReservesXSailors 1 1/5x100 101
20
Size Estimation and Reduction Factors
SELECT attribute list FROM relation list WHERE
pred1 AND ... AND predk
  • Consider a query block
  • Maximum Cardinality product of the
    cardinalities of relations in the FROM clause.
  • Reduction factor (RF) associated with each
    predicate reflects the impact of the predicate in
    reducing result size.
  • Result cardinality Max tuples product of
    all RFs.
  • RF usually called selectivity
  • only RG seem to call it Reduction Factor

21
Result Size Estimation
  • Result cardinality
    Max tuples
    product of all RFs.
  • Term colvalue (given index I on col, or
    knowledge about column values)
  • RF 1/NKeys(I)
  • Term col1col2 (This is handy for joins too)
  • RF 1/MAX(NKeys(I1), NKeys(I2))
  • Term colgtvalue
  • RF (High(I)-value)/(High(I)-Low(I))
  • (Implicit assumptions values are uniformly
    distributed and predicates are independent!)
  • Question What if the optimizer has no statistics?

22
No statistics is a common case
  • RF 1/NKeys(I) for a column with an index
  • What about a non-index column,
  • e.g. R.day 01/31/2007?
  • The original System R paper suggested 1/10 for
    this case, and many systems today still use that!
  • Here is what POSTGRES does
  • / default selectivity estimate for pattern-match
    operators such as LIKE /
  • define DEFAULT_MATCH_SEL 0.005
  • / default number of distinct values in a table
    /
  • define DEFAULT_NUM_DISTINCT 200
  • / default selectivity estimate for boolean and
    null test nodes /
  • define DEFAULT_UNK_SEL 0.005
  • define DEFAULT_NOT_UNK_SEL (1.0 -
    DEFAULT_UNK_SEL)
  • / default selectivity estimate for equalities
    such as "A b" /
  • define DEFAULT_EQ_SEL 0.005
  • / default selectivity estimate for inequalities
    such as "A lt b" /
  • define DEFAULT_INEQ_SEL 0.3333333333333333
  • / default selectivity estimate for range
    inequalities "A gt b AND A lt c" /
  • define DEFAULT_RANGE_INEQ_SEL 0.005

23
What if data is skewed?
  • For better estimation, use a histogram
  • Captures tuples in ranges of values
  • -gt doesnt assume a uniform distribution of values

equiwidth
equidepth
24
Estimating Join Cardinality
  • Q Given a join of R and S, what is the range of
    possible result sizes (in of tuples)?
  • Suppose the join is on a key for R and S
  • e.g. Students(sid, sname, did), Dorm(did,d.addr)
  • Select S.sid, D.address
  • From Students S, Dorms D
  • Where S.did D.did
  • What is the cardinality?

A student can only live in at most 1 dorm -gt each
S tuple can match with at most 1 D tuple -gt
cardinality (S join D) cardinality of S
25
Estimating Join Cardinality
  • General case join on A (A is key for
    neither)
  • estimate each tuple r of R generates uniform
    number of matches in S and each tuple s of S
    generates uniform number of matches in R, e.g.
  • RF min(NTuples(R) NTuples(S)/NKeys(A,S)
  • NTuples(S)
    NTuples(R)/NKeys(A,R))
  • Sailors 100 tuples, 75 unique names -gt 1.3
    tuples for each sailor name
  • Boats 20 tuples, 10 unique names -gt 2 tuples for
    each boat name
  • e.g. SELECT S.name, B.name
  • FROM Sailors S, Boats B
  • WHERE S.name B.name

RF 10020/10 200 RF 20100/75 26.6
26
Plan Enumeration
  • A heuristic decision in System Ronly left-deep
    join trees are considered.
  • As the number of joins increases, the number of
    alternative plans grows rapidly we need to
    restrict the search space.
  • Left-deep trees allow us to generate all fully
    pipelined plans.
  • Intermediate results not written to temporary
    files.
  • Not all left-deep trees are fully pipelined
    (e.g., SM join).

27
Enumeration of Left-Deep Plans
  • Left-deep plans differ
  • Order of relations
  • Access method for each relation,
  • Join method for each join.
  • Enumerated using N passes (if N relations
    joined)
  • Pass 1 Find best 1-relation plan for each
    relation.
  • Pass 2 Find best way to join result of each
    1-relation plan (as outer) to another relation.
    (All 2-relation plans.)
  • Pass N Find best way to join result of a
    (N-1)-relation plan (as outer) to the Nth
    relation. (All N-relation plans.)
  • For each subset of relations, retain only
  • Cheapest plan overall, plus
  • Cheapest plan for each interesting order of the
    tuples.

28
A Note on Interesting Orders
  • An intermediate result has an interesting order
    if it is sorted by any of
  • ORDER BY attributes
  • GROUP BY attributes
  • Join attributes of yet-to-be-added (downstream)
    joins

29
Enumeration of Plans (Contd.)
  • Avoid Cartesian products!
  • An N-1 way plan is not combined with an
    additional relation unless there is a join
    condition between them, or all predicates in
    WHERE have been used up.
  • i.e., consider a Cartesian product only if the
    user specified one!
  • ORDER BY, GROUP BY, aggregates
  • handled as a final step, using either an
    interestingly ordered plan or an additonal
    sort/hash operator.
  • In spite of pruning plan space, System R approach
    is still exponential in the of tables.
Write a Comment
User Comments (0)
About PowerShow.com