Database Techniek Query Optimization

About This Presentation

Title:

Database Techniek Query Optimization

Description:

Dynamic Programming (System R/DB2) Heuristics (Ingres/Postgres) ... With dynamic programming time complexity of optimization with bushy trees is O(3n) ... – PowerPoint PPT presentation

Number of Views:293

Avg rating:3.0/5.0

Slides: 66

Provided by: homepa3

Category:

more less

Transcript and Presenter's Notes

Title: Database Techniek Query Optimization

1
Database TechniekQuery Optimization(chapter
14)
2
Lecture 3

Query Rewriting
Equivalence Rules
Query Optimization
Dynamic Programming (System R/DB2)
Heuristics (Ingres/Postgres)
De-correlation of nested queries
Result Size Estimation
Practicum Assignment 2

3
Lecture 3

Query Rewriting
Equivalence Rules
Query Optimization
Dynamic Programming (System R/DB2)
Heuristics (Ingres/Postgres)
De-correlation of nested queries
Result Size Estimation
Practicum Assignment 2

4
Transformation of Relational Expressions

Two relational algebra expressions are said to be
equivalent if on every legal database instance
the two expressions generate the same set of
tuples
Note order of tuples is irrelevant
In SQL, inputs and outputs are bags (multi-sets)
of tuples
Two expressions in the bag version of the
relational algebra are said to be equivalent if
on every legal database instance the two
expressions generate the same bag of tuples
An equivalence rule says that expressions of two
forms are equivalent
Can replace expression of first form by second,
or vice versa

5
Equivalence Rules

1. Conjunctive selection operations can be
deconstructed into a sequence of individual
selections.
2. Selection operations are commutative.
3. Only the last in a sequence of projection
operations is needed, the others can be
omitted.
Selections can be combined with Cartesian
products and theta joins.
??(E1 X E2) E1 ? E2
??1(E1 ?2 E2) E1 ?1??2 E2

6
Algebraic Rewritings for Selection
s
cond1
s
cond2
R
s
s
cond1 AND cond2
cond2
R
s
cond1
R
?
s
cond1 OR cond2
s
s
cond2
cond1
R
R
7
Equivalence Rules (Cont.)

5. Theta-join operations (and natural joins) are
commutative. E1 ? E2 E2 ? E1
6. Natural join operations are associative
(E1 E2) E3 E1 (E2 E3)

8
Equivalence Rules for Joins
commutative
associative
9
Equivalence Rules (Cont.)

For pushing down selections into a (theta) join
we have the following cases
(push 1) When all the attributes in ?0 involve
only the attributes of one of the expressions
(E1) being joined. ??0?E1 ?
E2) (??0(E1)) ? E2
(split) When ? 1 involves only the attributes of
E1 and ?2 involves only the attributes of E2.
??1??? ?E1 ? E2)
(??1(E1)) ? (??? (E2))
(impossible) When ? involves both attributes of
E1 and E2 (it is a join condition)

10
Pushing Selection thru Cartesian Product and Join
s
?
cond
?
s
cond
R
R
S
The right direction requires that cond refers to
S attributes only
S
s
cond
s
cond
R
R
S
S
11
Projection Decomposition
pXYpXpY
totalX.taxY.price
p
p
total tax price
p
p
p
pXYpXpY
p
p
?
?
?
p
p
pXp?
pYp?
X
Y
X
Y
X.tax
Y.price
12
More Equivalence Rules

8. The projections operation distributes over the
theta join operation as follows
(a) if ? involves only attributes from L1 ?
L2
(b) Consider a join E1 ? E2.
Let L1 and L2 be sets of attributes from E1 and
E2, respectively.
Let L3 be attributes of E1 that are involved in
join condition ?, but are not in L1 ? L2, and
Let L4 be attributes of E2 that are involved in
join condition ?, but are not in L1 ? L2.

13
Join Ordering Example

For all relations r1, r2, and r3,
(r1 r2) r3 r1 (r2 r3 )
If r2 r3 is quite large and r1 r2 is
small, we choose
(r1 r2) r3
so that we compute and store a smaller temporary
relation.

14
Join Ordering Example (Cont.)

Consider the expression
?customer-name ((?branch-city Brooklyn
(branch))
account depositor)
Could compute account depositor first, and
join result with ?branch-city Brooklyn
(branch)but account depositor is likely to be
a large relation.
Since it is more likely that only a small
fraction of the banks customers have accounts in
branches located in Brooklyn, it is better to
compute
?branch-city Brooklyn (branch) account
first.

15
Lecture 3

Query Rewriting
Equivalence Rules
Query Optimization
Dynamic Programming (System R/DB2)
Heuristics (Ingres/Postgres)
De-correlation of nested queries
Result Size Estimation

16
Lecture 3

Query Rewriting
Equivalence Rules
Query Optimization
Dynamic Programming (System R/DB2)
Heuristics (Ingres/Postgres)
De-correlation of nested queries
Result Size Estimation

17
The role of Query Optimization
SQL
parsing, normalization
logical algebra
physical query optimization
logical query optimization
physical algebra
query execution
18
The role of Query Optimization
Compare different relational algebra plan ? on
result size (Practicum 2A)
SQL
parsing, normalization
logical algebra
physical query optimization
logical query optimization
physical algebra
query execution
19
The role of Query Optimization
SQL
parsing, normalization
logical algebra
physical query optimization
logical query optimization
physical algebra

Compare different execution algorithms
on true cost
(IO, CPU, cache)

query execution
20
Enumeration of Equivalent Expressions

Query optimizers use equivalence rules to
systematically generate expressions equivalent to
the given expression
repeated until no more expressions can be found
for each expression found so far, use all
applicable equivalence rules, and add newly
generated expressions to the set of expressions
found so far
The above approach is very expensive in space and
time
Time and space requirements are reduced by not
generating all expressions

21
Finding A Good Join Order

Consider finding the best join-order for r1 r2
. . . rn.
There are (2(n 1))!/(n 1)! different join
orders for above expression. With n 7, the
number is 665280, with n 10, the number is
greater than 176 billion!
No need to generate all the join orders. Using
dynamic programming, the least-cost join order
for any subset of r1, r2, . . . rn is computed
only once and stored for future use.

22
Dynamic Programming in Optimization

To find best join tree for a set of n relations
To find best plan for a set S of n relations,
consider all possible plans of the form S1
(S S1) where S1 is any non-empty subset of S.
Recursively compute costs for joining subsets of
S to find the cost of each plan. Choose the
cheapest of the 2n 1 alternatives.
When plan for any subset is computed, store it
and reuse it when it is required again, instead
of recomputing it
Dynamic programming

23
Join Order Optimization Algorithm

procedure findbestplan(S)if (bestplanS.cost ?
?) return bestplanS// else bestplanS has
not been computed earlier, compute it nowfor
each non-empty subset S1 of S such that S1 ?
S P1 findbestplan(S1) P2 findbestplan(S -
S1) A best algorithm for joining results of P1
and P2 cost P1.cost P2.cost cost of A if
cost lt bestplanS.cost bestplanS.cost
cost bestplanS.plan execute P1.plan
execute P2.plan join results of P1 and
P2 using Areturn bestplanS

24
Left Deep Join Trees

In left-deep join trees, the right-hand-side
input for each join is a relation, not the result
of an intermediate join.

25
Cost of Optimization

With dynamic programming time complexity of
optimization with bushy trees is O(3n).
With n 10, this number is 59000 instead of 176
billion!
Space complexity is O(2n)
To find best left-deep join tree for a set of n
relations
Consider n alternatives with one relation as
right-hand side input and the other relations as
left-hand side input.
Using (recursively computed and stored)
least-cost join order for each alternative on
left-hand-side, choose the cheapest of the n
alternatives.
If only left-deep trees are considered, time
complexity of finding best join order is O(n 2n)
Space complexity remains at O(2n)
Cost-based optimization is expensive, but
worthwhile for queries on large datasets (typical
queries have small n, generally lt 10)

26
Physical Query Optimization

Minimizes absolute cost
Minimize I/Os
Minimize CPU, cache miss cost (main memory DBMS)
Must consider the interaction of evaluation
techniques when choosing evaluation plans
choosing the cheapest algorithm for each
operation independently may not yield best
overall algorithm. E.g.
merge-join may be costlier than hash-join, but
may provide a sorted output which reduces the
cost for an outer level aggregation.
nested-loop join may provide opportunity for
pipelining

27
Physical Optimization Interesting Orders

Consider the expression (r1 r2 r3) r4
r5
An interesting sort order is a particular sort
order of tuples that could be useful for a later
operation.
Generating the result of r1 r2 r3 sorted on
the attributes common with r4 or r5 may be
useful, but generating it sorted on the
attributes common only r1 and r2 is not useful.
Using merge-join to compute r1 r2 r3 may be
costlier, but may provide an output sorted in an
interesting order.
Not sufficient to find the best join order for
each subset of the set of n given relations must
find the best join order for each subset, for
each interesting sort order
Simple extension of earlier dynamic programming
algorithms
Usually, number of interesting orders is quite
small and doesnt affect time/space complexity
significantly

28
Heuristic Optimization

Cost-based optimization is expensive, even with
dynamic programming.
Systems may use heuristics to reduce the number
of choices that must be made in a cost-based
fashion.
Heuristic optimization transforms the query-tree
by using a set of rules that typically (but not
in all cases) improve execution performance
Perform selection early (reduces the number of
tuples)
Perform projection early (reduces the number of
attributes)
Perform most restrictive selection and join
operations before other similar operations.
Some systems use only heuristics, others combine
heuristics with partial cost-based optimization.

29
Steps in Typical Heuristic Optimization

1. Deconstruct conjunctive selections into a
sequence of single selection operations (Equiv.
rule 1.).
2. Move selection operations down the query tree
for the earliest possible execution (Equiv. rules
2, 7a, 7b, 11).
3. Execute first those selection and join
operations that will produce the smallest
relations (Equiv. rule 6).
4. Replace Cartesian product operations that are
followed by a selection condition by join
operations (Equiv. rule 4a).
5. Deconstruct and move as far down the tree as
possible lists of projection attributes, creating
new projections where needed (Equiv. rules 3, 8a,
8b, 12).
6. Identify those subtrees whose operations can
be pipelined, and execute them using pipelining).

30
Heuristic Join Order the Wong-Youssefi algorithm
(INGRES)
Sample TPC-H Schema Nation(NationKey,
NName) Customer(CustKey, CName,
NationKey) Order(OrderKey, CustKey,
Status) Lineitem(OrderKey, PartKey,
Quantity) Product(SuppKey, PartKey,
PName) Supplier(SuppKey, SName)
Find the names of suppliers that sell a product
that appears in a line item of an order made by a
customer who is in Canada
SELECT SName FROM Nation, Customer, Order,
LineItem, Product, Supplier WHERE
Nation.NationKey Cuctomer.NationKey AND
Customer.CustKey Order.CustKey AND
Order.OrderKeyLineItem.OrderKey AND
LineItem.PartKey Product.Partkey AND
Product.Suppkey Supplier.SuppKey AND NName
Canada
31
Challenges with Large Natural Join Expressions

For simplicity, assume that in the query
All joins are natural
whenever two tables of the FROM clause have
common
attributes we join on them
Consider Right-Index only

pSName
RI
RI
One possible order
RI
RI
RI
Index
sNNameCanada
Nation
Customer
Order
LineItem
Product
Supplier
32
Wong-Yussefi algorithm assumptions and objectives

Assumption 1 (weak) Indexes on all join
attributes (keys and foreign keys)
Assumption 2 (strong) At least one selection
creates a small relation
A join with a small relation results in a small
relation
Objective Create sequence of index-based joins
such that all intermediate results are small

33
Hypergraphs
Customer
CName CustKey
Nation
NationKey NName
LineItem
Quantity PartKey
Order
Status OrderKey
Supplier
SName
SuppKey PName
Product

relation hyperedges
two hyperedges for same relation are possible
each node is an attribute
can extend for non-natural equality joins by
merging nodes

34
Small Relations/Hypergraph Reduction
Nation is small because it has the equality
selection NName Canada
Customer
CName CustKey
Nation
NationKey NName
NationKey NName
LineItem
Quantity PartKey
Order
Status OrderKey
Supplier
SName
SuppKey PName
Product
Index
Pick a small relation (and its conditions) to
start the plan
sNNameCanada
Nation
35
Customer
(1) Remove small relation (hypergraph reduction)
and color as small any relation that joins with
the removed small relation
CName CustKey
Nation
NationKey NName
NationKey NName
LineItem
Quantity PartKey
Order
Status OrderKey
Supplier
SName
SuppKey PName
Product
RI
(2) Pick a small relation (and its conditions if
any) and join it with the small relation that has
been reduced
Index
sNNameCanada
Customer
Nation
36
After a bunch of steps
pSName
RI
RI
RI
RI
RI
Index
sNNameCanada
Nation
Customer
Order
LineItem
Product
Supplier
37
Some Query Optimizers

The System R/Starburst dynamic programming on
left-deep join orders. Also uses heuristics to
push selections and projections down the query
tree.
DB2, SQLserver are cost-based optimizers
SQLserver is transformation based, also uses
dynamic programming.
MySQL optimizer is heuristics-based (rather weak)
Heuristic optimization used in some versions of
Oracle
Repeatedly pick best relation to join next
Starting from each of n starting points. Pick
best among these.

38
Lecture 3

Query Rewriting
Equivalence Rules
Query Optimization
Dynamic Programming (System R/DB2)
Heuristics (Ingres/Postgres)
De-correlation of nested queries
Result Size Estimation
Practicum Assignment 2

39
Lecture 3

Query Rewriting
Equivalence Rules
Query Optimization
Dynamic Programming (System R/DB2)
Heuristics (Ingres/Postgres)
De-correlation of nested queries
Result Size Estimation
Practicum Assignment 2

40
Optimizing Nested Subqueries

SQL conceptually treats nested subqueries in the
where clause as functions that take parameters
and return a single value or set of values
Parameters are variables from outer level query
that are used in the nested subquery such
variables are called correlation variables
E.g.select customer-namefrom borrowerwhere
exists (select from
depositor where
depositor.customer-name

borrower.customer-name)
Conceptually, nested subquery is executed once
for each tuple in the cross-product generated by
the outer level from clause
Such evaluation is called correlated evaluation
Note other conditions in where clause may be
used to compute a join (instead of a
cross-product) before executing the nested
subquery

41
Optimizing Nested Subqueries (Cont.)

Correlated evaluation may be quite inefficient
since
a large number of calls may be made to the nested
query
there may be unnecessary random I/O as a result
SQL optimizers attempt to transform nested
subqueries to joins where possible, enabling use
of efficient join techniques
E.g. earlier nested query can be rewritten as
select customer-namefrom borrower,
depositorwhere depositor.customer-name
borrower.customer-name
Note above query doesnt correctly deal with
duplicates, can be modified to do so as we will
see
In general, it is not possible/straightforward to
move the entire nested subquery from clause into
the outer level query from clause
A temporary relation is created instead, and used
in body of outer level query

42
Optimizing Nested Subqueries (Cont.)

In general, SQL queries of the form below can be
rewritten as shown
Rewrite select from L1
where P1 and exists (select
from L2 where P2)
To create table t1 as
select distinct V from L2
where P21 select
from L1, t1 where P1
and P22
P21 contains predicates in P2 that do not involve
any correlation variables
P22 reintroduces predicates involving
correlation variables, with relations renamed
appropriately
V contains all attributes used in predicates with
correlation variables

43
Optimizing Nested Subqueries (Cont.)

In our example, the original nested query would
be transformed to create table t1 as
select distinct customer-name from
depositor select customer-name from
borrower, t1 where t1.customer-name
borrower.customer-name
The process of replacing a nested query by a
query with a join (possibly with a temporary
relation) is called decorrelation.
Decorrelation is more complicated when
the nested subquery uses aggregation, or
when the result of the nested subquery is used
to test for equality, or
when the condition linking the nested subquery to
the other query is not exists,
and so on.

44
Practicum Assignment 2A

Get the XML metadata description for TPC-H
xslt script for plotting histograms
Take our solution for your second query
(assignment 1)
For each operator in the tree give
Selectivity
Intermediate Result size
Short description how you computed this
Explanation how to compute histograms on all
result columns
Sum all intermediate result sizes into total
query cost
DEADLINE march 31

45
The Big Picture

1. Parsing and translation
2. Optimization
3. Evaluation

46
The Big Picture

1. Parsing and translation
2. Optimization
3. Evaluation

47
Optimization

Query Optimization Amongst all equivalent
evaluation plans choose the one with lowest cost.
Cost is estimated using statistical information
from the database catalog
e.g. number of tuples in each relation, size of
tuples, etc.
In this lecture we study logical cost estimation
introduction to histograms
estimating the amount of tuples in the result
with perfect and equi-height histograms
propagation of histograms into result columns
How to compute result size from width and tuples

48
Cost Estimation

Physical cost estimation
predict I/O blocks, seeks, cache misses, RAM
consumption,
Depends in the execution algorithm
In this lecture we study logical cost estimation
the plan with smallest intermediate result tends
to be best
need estimations for intermediate result sizes
Histogram-based estimation (practicum, assignment
2)
estimating the amount of tuples in the result
with perfect and equi-height histograms
propagation of histograms into result columns
compute result size as tuple-width tuples

49
Selectivities

select
expr X(col,const) X in lt, lt, , gt, gt
expr expr expr expr
sop(R) R' / R.
join
only 1-n / n-1 foreign key joins
s join(R1,R2) R' / (R1R2).
aggr
s(A(Rg1a)) A(Rg1a) / R distinct(R.g1)
/ R.
s(A(Rg1,g2a)) (distinct(R.g1)
distinct(R.g2)) / R.
project, order
s project(R) s order(R) 1
topn
s topN(R) min(N,R) / R min(N/R,1).

50
Result Size

tuples_max selectivity columns
We disregard differences in column-width
project
columns projectlist
tuples_max R
aggr
columns groupbys aggrs
tuples_max min(R, g1 .. gn)
join
columns child1 child2
tuples_max R1 R2
other
columns stays equal wrt child
tuples_max R

51
Selectivity estimation

We can estimate the selectivities using
domain constraints
min/max statistics
histograms

52
Histograms

Buckets
B ltmin, max, total, distinctgt
Leave min out (Bi.min Bi-1.max)

53
Different Kinds of Histograms

Perfect
Equi-width
Equi-height
In the practicum we use
Perfect histograms, when distinct(R.a) lt 25
Equi-height histograms of 10 buckets, otherwise
Not perfectly even-height disjunct value ranges
between buckets
(i.e. frequent value is not split over
even-height buckets. It may create a
bigger-than-height bucket)

54
Perfect Histograms Equi-Selection

s(R.aC) Bk.total (1/R)
in case there is a k with Bk.max C
s(R.aC) 0
otherwise

s(R.ad)
total
a
c
f
d
55
Perfect Histograms Equi-Selection

s(R.aC) Bk.total (1/R)
in case there is a k with Bk.max C
s(R.aC) 0
otherwise

s(R.ad)
total
a
c
d
f
56
Perfect Histograms Range-Selection

s(R.altC) sum(Bi.total) (1/R),
for all 1 lt i lt k with B(k-1).max lt C lt Bk.max

s(R.altd)
total
a
c
d
f
57
Perfect Histograms Range-Selection

s(R.altC) sum(Bi.total) (1/R),
for all 1 lt i lt k with B(k-1).max lt C lt Bk.max

s(R.altd)
total
a
c
d
f
58
Equi-Height Histograms Equi-Selection

s(R.aC) avg_freq(Bk) (1/R)
in case there is a k with B(k-1).max lt C lt
Bk.max
avg_freq(Bk) Bk.total / Bk.distinct
s(R.aC) 0
otherwise

s(R.ac)
total
a
d
f
e
59
Equi-Height Histograms Equi-Selection

s(R.aC) avg_freq(Bk) (1/R)
in case there is a k with B(k-1).max lt C lt
Bk.max
avg_freq(Bk) Bk.total / Bk.distinct
s(R.aC) 0
otherwise

s(R.ac)
total
a
d
f
e
60
Equi-Height Histograms Range-Selection

s(R.altC) ( sum(Bi.total) freq_lt(Bk,C) )
(1/R),
for all 1 lt i lt k with B(k-1).max lt C lt Bk.max.

s(R.altc)
total
a
d
f
e
61
Equi-Height Histograms Range-Selection

s(R.altC) ( sum(Bi.total) freq_lt(Bk,C) )
(1/R),
for all 1 lt i lt k with B(k-1).max lt C lt Bk.max.

s(R.altc)
total
a
d
f
e
62
Select with And and Or

Assume no correlation between attributes
s(?a and ?c) s(?a) s(?c)
s(?a or ?c) s(?a) (1-s(?a)) s(?c)
Note must normalize ?a , ?c into non-overlapping
conditions

63
Foreign-key Join Selectivity/Hitrate Estimation

Foreign-key constraint
R1 matches at most once with R2
each order matches on average with 7 lineitems
?hitrate 7
But what if R2 (e.g. order) is an intermediate
result?
R2 may have multiple key occurrences due to a
previous join
R2 may have less key occurrences (missing keys)
due to a select (or join).
Simple Approach (practicum)
Hitrate R2/R2

64
Aggr(R,g1..gn,..)

Can only predict groupby columns and size
Expected result size
min(R,distinct(g1) . distinct(gn))

65
Histogram Propagation

order histogram stays identical
project histogram stays identical
Expression (e.g. l_taxl_price) not required for
the practicum
possible to use cartesian product on histograms,
followed by expression evaluation and
re-bucketizing.
topn not required for the practicum
Use last bucket (and backwards) to take highest N
distinct values and their frequencies
aggr not required for the practicum
Groupbys distinct is multiplication of
distincts, freq1
Aggregates only possible for global aggregates
(no groupbys)
fk-join multiply totals by join hitrate
distinct min(distinct,total) ? this is a
simplicifation!
Select multiply totals by selectivity
distinct min(distinct,total)
Select (selection attribute)
Get totals/distincts from subset of buckets