Title: Database Techniek Query Optimization
1Database TechniekQuery Optimization(chapter
14)
2Lecture 3
- Query Rewriting
- Equivalence Rules
- Query Optimization
- Dynamic Programming (System R/DB2)
- Heuristics (Ingres/Postgres)
- De-correlation of nested queries
- Result Size Estimation
- Practicum Assignment 2
3Lecture 3
- Query Rewriting
- Equivalence Rules
- Query Optimization
- Dynamic Programming (System R/DB2)
- Heuristics (Ingres/Postgres)
- De-correlation of nested queries
- Result Size Estimation
- Practicum Assignment 2
4Transformation of Relational Expressions
- Two relational algebra expressions are said to be
equivalent if on every legal database instance
the two expressions generate the same set of
tuples - Note order of tuples is irrelevant
- In SQL, inputs and outputs are bags (multi-sets)
of tuples - Two expressions in the bag version of the
relational algebra are said to be equivalent if
on every legal database instance the two
expressions generate the same bag of tuples - An equivalence rule says that expressions of two
forms are equivalent - Can replace expression of first form by second,
or vice versa
5Equivalence Rules
- 1. Conjunctive selection operations can be
deconstructed into a sequence of individual
selections. - 2. Selection operations are commutative.
- 3. Only the last in a sequence of projection
operations is needed, the others can be
omitted. - Selections can be combined with Cartesian
products and theta joins. - ??(E1 X E2) E1 ? E2
- ??1(E1 ?2 E2) E1 ?1??2 E2
6Algebraic Rewritings for Selection
s
cond1
s
cond2
R
s
s
cond1 AND cond2
cond2
R
s
cond1
R
?
s
cond1 OR cond2
s
s
cond2
cond1
R
R
7Equivalence Rules (Cont.)
- 5. Theta-join operations (and natural joins) are
commutative. E1 ? E2 E2 ? E1 - 6. Natural join operations are associative
- (E1 E2) E3 E1 (E2 E3)
8Equivalence Rules for Joins
commutative
associative
9Equivalence Rules (Cont.)
- For pushing down selections into a (theta) join
we have the following cases - (push 1) When all the attributes in ?0 involve
only the attributes of one of the expressions
(E1) being joined. ??0?E1 ?
E2) (??0(E1)) ? E2 - (split) When ? 1 involves only the attributes of
E1 and ?2 involves only the attributes of E2. - ??1??? ?E1 ? E2)
(??1(E1)) ? (??? (E2)) - (impossible) When ? involves both attributes of
E1 and E2 (it is a join condition)
10Pushing Selection thru Cartesian Product and Join
s
?
cond
?
s
cond
R
R
S
The right direction requires that cond refers to
S attributes only
S
s
cond
s
cond
R
R
S
S
11Projection Decomposition
pXYpXpY
totalX.taxY.price
p
p
total tax price
p
p
p
pXYpXpY
p
p
?
?
?
p
p
pXp?
pYp?
X
Y
X
Y
X.tax
Y.price
12More Equivalence Rules
- 8. The projections operation distributes over the
theta join operation as follows - (a) if ? involves only attributes from L1 ?
L2 - (b) Consider a join E1 ? E2.
- Let L1 and L2 be sets of attributes from E1 and
E2, respectively. - Let L3 be attributes of E1 that are involved in
join condition ?, but are not in L1 ? L2, and - Let L4 be attributes of E2 that are involved in
join condition ?, but are not in L1 ? L2.
13Join Ordering Example
- For all relations r1, r2, and r3,
- (r1 r2) r3 r1 (r2 r3 )
- If r2 r3 is quite large and r1 r2 is
small, we choose - (r1 r2) r3
- so that we compute and store a smaller temporary
relation.
14Join Ordering Example (Cont.)
- Consider the expression
- ?customer-name ((?branch-city Brooklyn
(branch))
account depositor) - Could compute account depositor first, and
join result with ?branch-city Brooklyn
(branch)but account depositor is likely to be
a large relation. - Since it is more likely that only a small
fraction of the banks customers have accounts in
branches located in Brooklyn, it is better to
compute - ?branch-city Brooklyn (branch) account
- first.
15Lecture 3
- Query Rewriting
- Equivalence Rules
- Query Optimization
- Dynamic Programming (System R/DB2)
- Heuristics (Ingres/Postgres)
- De-correlation of nested queries
- Result Size Estimation
16Lecture 3
- Query Rewriting
- Equivalence Rules
- Query Optimization
- Dynamic Programming (System R/DB2)
- Heuristics (Ingres/Postgres)
- De-correlation of nested queries
- Result Size Estimation
17The role of Query Optimization
SQL
parsing, normalization
logical algebra
physical query optimization
logical query optimization
physical algebra
query execution
18The role of Query Optimization
Compare different relational algebra plan ? on
result size (Practicum 2A)
SQL
parsing, normalization
logical algebra
physical query optimization
logical query optimization
physical algebra
query execution
19The role of Query Optimization
SQL
parsing, normalization
logical algebra
physical query optimization
logical query optimization
physical algebra
- Compare different execution algorithms
- on true cost
- (IO, CPU, cache)
query execution
20Enumeration of Equivalent Expressions
- Query optimizers use equivalence rules to
systematically generate expressions equivalent to
the given expression - repeated until no more expressions can be found
- for each expression found so far, use all
applicable equivalence rules, and add newly
generated expressions to the set of expressions
found so far - The above approach is very expensive in space and
time - Time and space requirements are reduced by not
generating all expressions
21Finding A Good Join Order
- Consider finding the best join-order for r1 r2
. . . rn. - There are (2(n 1))!/(n 1)! different join
orders for above expression. With n 7, the
number is 665280, with n 10, the number is
greater than 176 billion! - No need to generate all the join orders. Using
dynamic programming, the least-cost join order
for any subset of r1, r2, . . . rn is computed
only once and stored for future use.
22Dynamic Programming in Optimization
- To find best join tree for a set of n relations
- To find best plan for a set S of n relations,
consider all possible plans of the form S1
(S S1) where S1 is any non-empty subset of S. - Recursively compute costs for joining subsets of
S to find the cost of each plan. Choose the
cheapest of the 2n 1 alternatives. - When plan for any subset is computed, store it
and reuse it when it is required again, instead
of recomputing it - Dynamic programming
23Join Order Optimization Algorithm
- procedure findbestplan(S)if (bestplanS.cost ?
?) return bestplanS// else bestplanS has
not been computed earlier, compute it nowfor
each non-empty subset S1 of S such that S1 ?
S P1 findbestplan(S1) P2 findbestplan(S -
S1) A best algorithm for joining results of P1
and P2 cost P1.cost P2.cost cost of A if
cost lt bestplanS.cost bestplanS.cost
cost bestplanS.plan execute P1.plan
execute P2.plan join results of P1 and
P2 using Areturn bestplanS
24Left Deep Join Trees
- In left-deep join trees, the right-hand-side
input for each join is a relation, not the result
of an intermediate join.
25Cost of Optimization
- With dynamic programming time complexity of
optimization with bushy trees is O(3n). - With n 10, this number is 59000 instead of 176
billion! - Space complexity is O(2n)
- To find best left-deep join tree for a set of n
relations - Consider n alternatives with one relation as
right-hand side input and the other relations as
left-hand side input. - Using (recursively computed and stored)
least-cost join order for each alternative on
left-hand-side, choose the cheapest of the n
alternatives. - If only left-deep trees are considered, time
complexity of finding best join order is O(n 2n) - Space complexity remains at O(2n)
- Cost-based optimization is expensive, but
worthwhile for queries on large datasets (typical
queries have small n, generally lt 10)
26Physical Query Optimization
- Minimizes absolute cost
- Minimize I/Os
- Minimize CPU, cache miss cost (main memory DBMS)
- Must consider the interaction of evaluation
techniques when choosing evaluation plans
choosing the cheapest algorithm for each
operation independently may not yield best
overall algorithm. E.g. - merge-join may be costlier than hash-join, but
may provide a sorted output which reduces the
cost for an outer level aggregation. - nested-loop join may provide opportunity for
pipelining
27Physical Optimization Interesting Orders
- Consider the expression (r1 r2 r3) r4
r5 - An interesting sort order is a particular sort
order of tuples that could be useful for a later
operation. - Generating the result of r1 r2 r3 sorted on
the attributes common with r4 or r5 may be
useful, but generating it sorted on the
attributes common only r1 and r2 is not useful. - Using merge-join to compute r1 r2 r3 may be
costlier, but may provide an output sorted in an
interesting order. - Not sufficient to find the best join order for
each subset of the set of n given relations must
find the best join order for each subset, for
each interesting sort order - Simple extension of earlier dynamic programming
algorithms - Usually, number of interesting orders is quite
small and doesnt affect time/space complexity
significantly
28Heuristic Optimization
- Cost-based optimization is expensive, even with
dynamic programming. - Systems may use heuristics to reduce the number
of choices that must be made in a cost-based
fashion. - Heuristic optimization transforms the query-tree
by using a set of rules that typically (but not
in all cases) improve execution performance - Perform selection early (reduces the number of
tuples) - Perform projection early (reduces the number of
attributes) - Perform most restrictive selection and join
operations before other similar operations. - Some systems use only heuristics, others combine
heuristics with partial cost-based optimization.
29Steps in Typical Heuristic Optimization
- 1. Deconstruct conjunctive selections into a
sequence of single selection operations (Equiv.
rule 1.). - 2. Move selection operations down the query tree
for the earliest possible execution (Equiv. rules
2, 7a, 7b, 11). - 3. Execute first those selection and join
operations that will produce the smallest
relations (Equiv. rule 6). - 4. Replace Cartesian product operations that are
followed by a selection condition by join
operations (Equiv. rule 4a). - 5. Deconstruct and move as far down the tree as
possible lists of projection attributes, creating
new projections where needed (Equiv. rules 3, 8a,
8b, 12). - 6. Identify those subtrees whose operations can
be pipelined, and execute them using pipelining).
30Heuristic Join Order the Wong-Youssefi algorithm
(INGRES)
Sample TPC-H Schema Nation(NationKey,
NName) Customer(CustKey, CName,
NationKey) Order(OrderKey, CustKey,
Status) Lineitem(OrderKey, PartKey,
Quantity) Product(SuppKey, PartKey,
PName) Supplier(SuppKey, SName)
Find the names of suppliers that sell a product
that appears in a line item of an order made by a
customer who is in Canada
SELECT SName FROM Nation, Customer, Order,
LineItem, Product, Supplier WHERE
Nation.NationKey Cuctomer.NationKey AND
Customer.CustKey Order.CustKey AND
Order.OrderKeyLineItem.OrderKey AND
LineItem.PartKey Product.Partkey AND
Product.Suppkey Supplier.SuppKey AND NName
Canada
31Challenges with Large Natural Join Expressions
- For simplicity, assume that in the query
- All joins are natural
- whenever two tables of the FROM clause have
common - attributes we join on them
- Consider Right-Index only
pSName
RI
RI
One possible order
RI
RI
RI
Index
sNNameCanada
Nation
Customer
Order
LineItem
Product
Supplier
32Wong-Yussefi algorithm assumptions and objectives
- Assumption 1 (weak) Indexes on all join
attributes (keys and foreign keys) - Assumption 2 (strong) At least one selection
creates a small relation - A join with a small relation results in a small
relation - Objective Create sequence of index-based joins
such that all intermediate results are small
33Hypergraphs
Customer
CName CustKey
Nation
NationKey NName
LineItem
Quantity PartKey
Order
Status OrderKey
Supplier
SName
SuppKey PName
Product
- relation hyperedges
- two hyperedges for same relation are possible
- each node is an attribute
- can extend for non-natural equality joins by
merging nodes
34Small Relations/Hypergraph Reduction
Nation is small because it has the equality
selection NName Canada
Customer
CName CustKey
Nation
NationKey NName
NationKey NName
LineItem
Quantity PartKey
Order
Status OrderKey
Supplier
SName
SuppKey PName
Product
Index
Pick a small relation (and its conditions) to
start the plan
sNNameCanada
Nation
35Customer
(1) Remove small relation (hypergraph reduction)
and color as small any relation that joins with
the removed small relation
CName CustKey
Nation
NationKey NName
NationKey NName
LineItem
Quantity PartKey
Order
Status OrderKey
Supplier
SName
SuppKey PName
Product
RI
(2) Pick a small relation (and its conditions if
any) and join it with the small relation that has
been reduced
Index
sNNameCanada
Customer
Nation
36After a bunch of steps
pSName
RI
RI
RI
RI
RI
Index
sNNameCanada
Nation
Customer
Order
LineItem
Product
Supplier
37Some Query Optimizers
- The System R/Starburst dynamic programming on
left-deep join orders. Also uses heuristics to
push selections and projections down the query
tree. - DB2, SQLserver are cost-based optimizers
- SQLserver is transformation based, also uses
dynamic programming. - MySQL optimizer is heuristics-based (rather weak)
- Heuristic optimization used in some versions of
Oracle - Repeatedly pick best relation to join next
- Starting from each of n starting points. Pick
best among these.
38Lecture 3
- Query Rewriting
- Equivalence Rules
- Query Optimization
- Dynamic Programming (System R/DB2)
- Heuristics (Ingres/Postgres)
- De-correlation of nested queries
- Result Size Estimation
- Practicum Assignment 2
39Lecture 3
- Query Rewriting
- Equivalence Rules
- Query Optimization
- Dynamic Programming (System R/DB2)
- Heuristics (Ingres/Postgres)
- De-correlation of nested queries
- Result Size Estimation
- Practicum Assignment 2
40Optimizing Nested Subqueries
- SQL conceptually treats nested subqueries in the
where clause as functions that take parameters
and return a single value or set of values - Parameters are variables from outer level query
that are used in the nested subquery such
variables are called correlation variables - E.g.select customer-namefrom borrowerwhere
exists (select from
depositor where
depositor.customer-name
borrower.customer-name) - Conceptually, nested subquery is executed once
for each tuple in the cross-product generated by
the outer level from clause - Such evaluation is called correlated evaluation
- Note other conditions in where clause may be
used to compute a join (instead of a
cross-product) before executing the nested
subquery
41Optimizing Nested Subqueries (Cont.)
- Correlated evaluation may be quite inefficient
since - a large number of calls may be made to the nested
query - there may be unnecessary random I/O as a result
- SQL optimizers attempt to transform nested
subqueries to joins where possible, enabling use
of efficient join techniques - E.g. earlier nested query can be rewritten as
select customer-namefrom borrower,
depositorwhere depositor.customer-name
borrower.customer-name - Note above query doesnt correctly deal with
duplicates, can be modified to do so as we will
see - In general, it is not possible/straightforward to
move the entire nested subquery from clause into
the outer level query from clause - A temporary relation is created instead, and used
in body of outer level query
42Optimizing Nested Subqueries (Cont.)
- In general, SQL queries of the form below can be
rewritten as shown - Rewrite select from L1
where P1 and exists (select
from L2 where P2) - To create table t1 as
select distinct V from L2
where P21 select
from L1, t1 where P1
and P22 - P21 contains predicates in P2 that do not involve
any correlation variables - P22 reintroduces predicates involving
correlation variables, with relations renamed
appropriately - V contains all attributes used in predicates with
correlation variables
43Optimizing Nested Subqueries (Cont.)
- In our example, the original nested query would
be transformed to create table t1 as
select distinct customer-name from
depositor select customer-name from
borrower, t1 where t1.customer-name
borrower.customer-name - The process of replacing a nested query by a
query with a join (possibly with a temporary
relation) is called decorrelation. - Decorrelation is more complicated when
- the nested subquery uses aggregation, or
- when the result of the nested subquery is used
to test for equality, or - when the condition linking the nested subquery to
the other query is not exists, - and so on.
44Practicum Assignment 2A
- Get the XML metadata description for TPC-H
- xslt script for plotting histograms
- Take our solution for your second query
(assignment 1) - For each operator in the tree give
- Selectivity
- Intermediate Result size
- Short description how you computed this
- Explanation how to compute histograms on all
result columns - Sum all intermediate result sizes into total
query cost - DEADLINE march 31
45The Big Picture
- 1. Parsing and translation
- 2. Optimization
- 3. Evaluation
46The Big Picture
- 1. Parsing and translation
- 2. Optimization
- 3. Evaluation
47Optimization
- Query Optimization Amongst all equivalent
evaluation plans choose the one with lowest cost.
- Cost is estimated using statistical information
from the database catalog - e.g. number of tuples in each relation, size of
tuples, etc. - In this lecture we study logical cost estimation
- introduction to histograms
- estimating the amount of tuples in the result
with perfect and equi-height histograms - propagation of histograms into result columns
- How to compute result size from width and tuples
48Cost Estimation
- Physical cost estimation
- predict I/O blocks, seeks, cache misses, RAM
consumption, - Depends in the execution algorithm
- In this lecture we study logical cost estimation
- the plan with smallest intermediate result tends
to be best - need estimations for intermediate result sizes
- Histogram-based estimation (practicum, assignment
2) - estimating the amount of tuples in the result
with perfect and equi-height histograms - propagation of histograms into result columns
- compute result size as tuple-width tuples
49Selectivities
- select
- expr X(col,const) X in lt, lt, , gt, gt
- expr expr expr expr
- sop(R) R' / R.
- join
- only 1-n / n-1 foreign key joins
- s join(R1,R2) R' / (R1R2).
- aggr
- s(A(Rg1a)) A(Rg1a) / R distinct(R.g1)
/ R. - s(A(Rg1,g2a)) (distinct(R.g1)
distinct(R.g2)) / R. - project, order
- s project(R) s order(R) 1
- topn
- s topN(R) min(N,R) / R min(N/R,1).
50Result Size
- tuples_max selectivity columns
- We disregard differences in column-width
- project
- columns projectlist
- tuples_max R
- aggr
- columns groupbys aggrs
- tuples_max min(R, g1 .. gn)
- join
- columns child1 child2
- tuples_max R1 R2
- other
- columns stays equal wrt child
- tuples_max R
51Selectivity estimation
- We can estimate the selectivities using
- domain constraints
- min/max statistics
- histograms
52Histograms
- Buckets
- B ltmin, max, total, distinctgt
- Leave min out (Bi.min Bi-1.max)
53Different Kinds of Histograms
- Perfect
- Equi-width
- Equi-height
- In the practicum we use
- Perfect histograms, when distinct(R.a) lt 25
- Equi-height histograms of 10 buckets, otherwise
- Not perfectly even-height disjunct value ranges
between buckets - (i.e. frequent value is not split over
even-height buckets. It may create a
bigger-than-height bucket)
54Perfect Histograms Equi-Selection
- s(R.aC) Bk.total (1/R)
- in case there is a k with Bk.max C
- s(R.aC) 0
- otherwise
s(R.ad)
total
a
c
f
d
55Perfect Histograms Equi-Selection
- s(R.aC) Bk.total (1/R)
- in case there is a k with Bk.max C
- s(R.aC) 0
- otherwise
s(R.ad)
total
a
c
d
f
56Perfect Histograms Range-Selection
- s(R.altC) sum(Bi.total) (1/R),
- for all 1 lt i lt k with B(k-1).max lt C lt Bk.max
s(R.altd)
total
a
c
d
f
57Perfect Histograms Range-Selection
- s(R.altC) sum(Bi.total) (1/R),
- for all 1 lt i lt k with B(k-1).max lt C lt Bk.max
s(R.altd)
total
a
c
d
f
58Equi-Height Histograms Equi-Selection
- s(R.aC) avg_freq(Bk) (1/R)
- in case there is a k with B(k-1).max lt C lt
Bk.max - avg_freq(Bk) Bk.total / Bk.distinct
- s(R.aC) 0
- otherwise
s(R.ac)
total
a
d
f
e
59Equi-Height Histograms Equi-Selection
- s(R.aC) avg_freq(Bk) (1/R)
- in case there is a k with B(k-1).max lt C lt
Bk.max - avg_freq(Bk) Bk.total / Bk.distinct
- s(R.aC) 0
- otherwise
s(R.ac)
total
a
d
f
e
60Equi-Height Histograms Range-Selection
- s(R.altC) ( sum(Bi.total) freq_lt(Bk,C) )
(1/R), - for all 1 lt i lt k with B(k-1).max lt C lt Bk.max.
s(R.altc)
total
a
d
f
e
61Equi-Height Histograms Range-Selection
- s(R.altC) ( sum(Bi.total) freq_lt(Bk,C) )
(1/R), - for all 1 lt i lt k with B(k-1).max lt C lt Bk.max.
s(R.altc)
total
a
d
f
e
62Select with And and Or
- Assume no correlation between attributes
- s(?a and ?c) s(?a) s(?c)
- s(?a or ?c) s(?a) (1-s(?a)) s(?c)
- Note must normalize ?a , ?c into non-overlapping
conditions
63Foreign-key Join Selectivity/Hitrate Estimation
- Foreign-key constraint
- R1 matches at most once with R2
- each order matches on average with 7 lineitems
?hitrate 7 - But what if R2 (e.g. order) is an intermediate
result? - R2 may have multiple key occurrences due to a
previous join - R2 may have less key occurrences (missing keys)
due to a select (or join). - Simple Approach (practicum)
- Hitrate R2/R2
64Aggr(R,g1..gn,..)
- Can only predict groupby columns and size
- Expected result size
- min(R,distinct(g1) . distinct(gn))
65Histogram Propagation
- order histogram stays identical
- project histogram stays identical
- Expression (e.g. l_taxl_price) not required for
the practicum - possible to use cartesian product on histograms,
followed by expression evaluation and
re-bucketizing. - topn not required for the practicum
- Use last bucket (and backwards) to take highest N
distinct values and their frequencies - aggr not required for the practicum
- Groupbys distinct is multiplication of
distincts, freq1 - Aggregates only possible for global aggregates
(no groupbys) - fk-join multiply totals by join hitrate
- distinct min(distinct,total) ? this is a
simplicifation! - Select multiply totals by selectivity
- distinct min(distinct,total)
- Select (selection attribute)
- Get totals/distincts from subset of buckets
66Practicum Assignment 2
- Get the XML metadata description for TPC-H
- ps/pfd histograms also available
- Take our solution for your second query
(assignment 1) - For each operator in the tree give
- Selectivity
- Intermediate Result size
- Short description how you computed this
- Explanation how to compute histograms on all
result columns - Sum all intermediate result sizes into total
query cost - DEADLINE march 31