Title: Query Execution (end) and Query Optimization
1 Query Execution (end) and Query Optimization
2 Query Execution
3 Outline
- Logical/physical operators
- Cost parameters and sorting
- One-pass algorithms
- Nested-loop joins
- Two-pass algorithms
- Index-based algorithms
- Query optimization
4 Query Execution
(Architecture diagram) The user/application issues a query or update to the query compiler, which produces a query execution plan for the execution engine. The execution engine issues record and index requests to the index/record manager, which sends page commands to the buffer manager, which reads and writes pages through the storage manager to storage.
5 Logical vs. Physical Operators
- Logical operators
  - what they do
  - e.g., union, selection, projection, join, grouping
- Physical operators
  - how they do it
  - e.g., nested loop join, sort-merge join, hash join, index join
6 Query Execution Plans
SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer = Q.name AND Q.city = 'urbana' AND Q.phone > '5430000'
- Query plan:
  - logical tree
  - implementation choice at every node
  - scheduling of operations
(Example physical plan, read bottom-up: Person (table scan) ⋈_{buyer=name} Purchase (index scan) using simple nested loops, then σ_{city='urbana' AND phone>'5430000'}, then π_{buyer}.)
Some operators are from relational algebra, and
others (e.g., scan, group) are not.
7 How Do We Combine Operations?
- The iterator model. Each operator is implemented by 3 functions (see the sketch below):
  - Open: sets up the data structures and performs initializations
  - GetNext: returns the next tuple of the result
  - Close: ends the operation and cleans up the data structures
- Enables pipelining!
- Contrast with the data-driven (materialization) model.
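A minimal Python sketch of the Open/GetNext/Close iterator interface, assuming in-memory lists of tuples stand in for stored relations; the class names and toy relation are illustrative, not from the slides.

```python
# Sketch of the iterator model (names and data are hypothetical).

class Scan:
    """Leaf operator: streams the tuples of a stored relation."""
    def __init__(self, tuples):
        self.tuples = tuples

    def open(self):
        self.pos = 0                      # Open: initialize state

    def get_next(self):
        if self.pos >= len(self.tuples):
            return None                   # end of stream
        t = self.tuples[self.pos]
        self.pos += 1
        return t                          # GetNext: one tuple at a time

    def close(self):
        self.pos = 0                      # Close: clean up


class Select:
    """Selection: pulls from its child and keeps tuples matching the predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def open(self):
        self.child.open()

    def get_next(self):
        while True:
            t = self.child.get_next()
            if t is None or self.predicate(t):
                return t                  # pipelined: no intermediate materialization

    def close(self):
        self.child.close()


# Usage: σ_{price>100}(Product) over a toy relation.
plan = Select(Scan([("gizmo", 50), ("gadget", 150)]), lambda t: t[1] > 100)
plan.open()
while (t := plan.get_next()) is not None:
    print(t)
plan.close()
```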
8 Cost Parameters
- Cost parameters
  - M = number of blocks that fit in main memory
  - B(R) = number of blocks holding R
  - T(R) = number of tuples in R
  - V(R,a) = number of distinct values of the attribute a
- Estimating the cost
  - Important in optimization (next lecture)
  - Compute I/O cost only
  - We compute the cost to read the tables
  - We don't compute the cost to write the result (because of pipelining)
9 Reminder: Sorting
- Two-pass multi-way merge sort (sketched below)
- Step 1
  - Read M blocks at a time, sort, write
  - Result: sorted runs of length M on disk
- Step 2
  - Merge M−1 runs at a time, write to disk
  - Result: runs of length M(M−1) ≈ M²
- Cost: 3B(R); assumption: B(R) ≤ M²
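A minimal Python sketch of two-pass multi-way merge sort, assuming the relation is given as a list of blocks (each block a list of tuples) and that M blocks fit in memory; the function name and toy data are hypothetical.

```python
import heapq

def two_pass_merge_sort(blocks, M, key=lambda t: t):
    # Pass 1: read M blocks at a time, sort in memory, "write" a sorted run.
    runs = []
    for i in range(0, len(blocks), M):
        chunk = [t for b in blocks[i:i + M] for t in b]
        runs.append(sorted(chunk, key=key))

    # Pass 2: merge the runs; with B(R) <= M^2 there are at most M-1 runs,
    # so a single merge pass (one input buffer per run) suffices.
    assert len(runs) <= M - 1, "assumption B(R) <= M^2 violated"
    return list(heapq.merge(*runs, key=key))

# Usage with hypothetical blocks of 2 tuples each and M = 3.
blocks = [[5, 1], [9, 3], [7, 2], [8, 4]]
print(two_pass_merge_sort(blocks, M=3))   # [1, 2, 3, 4, 5, 7, 8, 9]
```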
10 Scanning Tables
- The table is clustered (i.e., its blocks consist only of records from this table)
  - Table scan: if we know where the blocks are
  - Index scan: if we have a sparse index to find the blocks
- The table is unclustered (e.g., its records are placed on blocks shared with other tables)
  - May need one read for each record
11 Cost of the Scan Operator
- Clustered relation
  - Table scan: B(R); to sort: 3B(R)
  - Index scan: B(R); to sort: B(R) or 3B(R)
- Unclustered relation
  - T(R); to sort: T(R) + 2B(R)
12 One-Pass Algorithms
13 One-Pass Algorithms
- Selection σ(R), projection π(R)
  - Both are tuple-at-a-time algorithms
  - Cost: B(R)
(Diagram: input buffer → unary operator → output buffer)
14 One-Pass Algorithms
- Duplicate elimination δ(R) (sketched below)
  - Need to keep a dictionary in memory
    - balanced search tree
    - hash table
    - etc.
  - Cost: B(R)
  - Assumption: B(δ(R)) ≤ M
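A minimal Python sketch of one-pass duplicate elimination, assuming the distinct tuples (B(δ(R)) blocks) fit in memory; the hash set plays the role of the in-memory dictionary, and the data is hypothetical.

```python
def one_pass_dedup(tuples):
    seen = set()                 # in-memory dictionary (here a hash set)
    for t in tuples:
        if t not in seen:        # first occurrence: remember it and emit it
            seen.add(t)
            yield t

# Usage with a hypothetical relation.
print(list(one_pass_dedup([1, 2, 1, 3, 2])))   # [1, 2, 3]
```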
15 One-Pass Algorithms
- Grouping γ_{city, sum(price)}(R) (sketched below)
  - Need to keep a dictionary in memory
  - Also store the running sum(price) for each city
  - Cost: B(R)
  - Assumption: the number of cities fits in memory
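A minimal Python sketch of one-pass grouping for γ_{city, sum(price)}, assuming one dictionary entry per city fits in memory and that tuples are (city, price) pairs; names and data are hypothetical.

```python
from collections import defaultdict

def one_pass_group_sum(tuples):
    sums = defaultdict(int)      # in-memory dictionary keyed by the grouping attribute
    for city, price in tuples:
        sums[city] += price      # maintain the running sum(price) per city
    return dict(sums)

# Usage with hypothetical tuples.
print(one_pass_group_sum([("Urbana", 10), ("Chicago", 5), ("Urbana", 7)]))
# {'Urbana': 17, 'Chicago': 5}
```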
16 One-Pass Algorithms
- Binary operations: R ∩ S, R ∪ S, R − S
  - Assumption: min(B(R), B(S)) ≤ M
  - Scan one table first, then the next, eliminating duplicates
  - Cost: B(R) + B(S)
17 Nested-Loop Joins
18 Nested Loop Joins
- Tuple-based nested loop join R ⋈ S
  - for each tuple r in R do
    - for each tuple s in S do
      - if r and s join then output (r,s)
  - Cost: T(R)·T(S), sometimes T(R)·B(S)
19 Nested Loop Joins
- Block-based nested loop join (sketched below)
  - for each chunk of (M−1) blocks bs of S do
    - for each block br of R do
      - for each tuple s in bs do
        - for each tuple r in br do
          - if r and s join then output (r,s)
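A minimal Python sketch of the block-based nested loop join, assuming each relation is a list of blocks (lists of tuples), that tuples join on equality of their first attribute, and that M−1 blocks of the outer relation plus one block of the inner fit in memory; all names and data are hypothetical.

```python
def block_nested_loop_join(S_blocks, R_blocks, M):
    for i in range(0, len(S_blocks), M - 1):
        # Load M-1 blocks of the (smaller) outer relation S and hash them in memory.
        chunk = {}
        for block in S_blocks[i:i + M - 1]:
            for s in block:
                chunk.setdefault(s[0], []).append(s)
        # Scan R once per chunk, one block at a time, probing the in-memory table.
        for block in R_blocks:
            for r in block:
                for s in chunk.get(r[0], []):
                    yield (r, s)

# Usage with hypothetical tiny relations and M = 3.
S = [[("a", 1), ("b", 2)], [("c", 3)]]
R = [[("a", 10)], [("b", 20), ("d", 40)]]
print(list(block_nested_loop_join(S, R, M=3)))
# [(('a', 10), ('a', 1)), (('b', 20), ('b', 2))]
```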
20Nested Loop Joins
R S
Join Result
Hash table for block of S (k lt B-1 pages)
. . .
. . .
Input buffer for R
Output buffer
21 Nested Loop Joins
- Block-based nested loop join
- Cost (worked example below)
  - Read S once: cost B(S)
  - The outer loop runs B(S)/(M−1) times, and each time we need to read R: cost B(S)·B(R)/(M−1)
  - Total cost: B(S) + B(S)·B(R)/(M−1)
- Notice: it is better to iterate over the smaller relation first
- R ⋈ S: R = outer relation, S = inner relation
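A hypothetical worked example (numbers chosen for illustration only): with B(R) = 1,000, B(S) = 100, and M = 101, iterating over the smaller relation S in chunks of M−1 = 100 blocks means the outer loop runs B(S)/(M−1) = 1 time, for a total of B(S) + B(S)·B(R)/(M−1) = 100 + 1,000 = 1,100 I/Os. Iterating over R instead would cost B(R) + B(R)·B(S)/(M−1) = 1,000 + 10·100 = 2,000 I/Os, which is why the smaller relation should drive the outer loop.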
22 Two-Pass Algorithms
23 Two-Pass Algorithms Based on Sorting
- Duplicate elimination δ(R)
- Simple idea: sort first, then eliminate duplicates
- Step 1: sort runs of size M, write to disk
  - Cost: 2B(R)
- Step 2: merge M−1 runs, but include each tuple only once
  - Cost: B(R)
  - Some complications...
- Total cost: 3B(R); assumption: B(R) ≤ M²
24 Two-Pass Algorithms Based on Sorting
- Grouping γ_{city, sum(price)}(R)
- Same as before: sort, then compute the sum(price) for each group
  - As before, compute sum(price) during the merge phase
- Total cost: 3B(R)
- Assumption: B(R) ≤ M²
25 Two-Pass Algorithms Based on Sorting
- Binary operations: R ∩ S, R ∪ S, R − S
- Idea: sort R, sort S, then do the right thing
- A closer look:
  - Step 1: split R into runs of size M, then split S into runs of size M. Cost: 2B(R) + 2B(S)
  - Step 2: merge M/2 runs from R, merge M/2 runs from S, and output a tuple on a case-by-case basis
- Total cost: 3B(R) + 3B(S)
- Assumption: B(R) + B(S) ≤ M²
26 Two-Pass Algorithms Based on Sorting
- Join R ⋈ S (merge step sketched below)
- Start by sorting both R and S on the join attribute
  - Cost: 4B(R) + 4B(S) (because we need to write to disk)
- Read both relations in sorted order, match tuples
  - Cost: B(R) + B(S)
- Difficulty: many tuples in R may match many in S
  - If at least one set of matching tuples fits in M, we are OK
  - Otherwise we need a nested loop, at higher cost
- Total cost: 5B(R) + 5B(S)
- Assumption: B(R) ≤ M², B(S) ≤ M²
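A minimal Python sketch of the merge step of sort-merge join, assuming R and S are already sorted lists of tuples joined on their first attribute, and that each group of matching tuples fits in memory; names and data are hypothetical.

```python
def merge_join(R, S):
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            # Collect all S tuples with this join value, then pair them with
            # every R tuple carrying the same value.
            key = R[i][0]
            j_start = j
            while j < len(S) and S[j][0] == key:
                j += 1
            while i < len(R) and R[i][0] == key:
                for s in S[j_start:j]:
                    yield (R[i], s)
                i += 1

# Usage with hypothetical sorted relations.
R = [("a", 1), ("b", 2), ("b", 3)]
S = [("b", 20), ("c", 30)]
print(list(merge_join(R, S)))   # [(('b', 2), ('b', 20)), (('b', 3), ('b', 20))]
```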
27 Two-Pass Algorithms Based on Sorting
- Join R ⋈ S
- If the number of tuples in R matching those in S is small (or vice versa), we can compute the join during the merge phase
- Total cost: 3B(R) + 3B(S)
- Assumption: B(R) + B(S) ≤ M²
28 Two-Pass Algorithms Based on Hashing
- Idea: partition a relation R into buckets on disk
- Each bucket has size approximately B(R)/M
- Does each bucket fit in main memory?
  - Yes, if B(R)/M ≤ M, i.e., B(R) ≤ M²
29 Hash-Based Algorithm for δ
- Recall: δ(R) = duplicate elimination
- Step 1. Partition R into buckets
- Step 2. Apply δ to each bucket (it can be read into main memory)
- Cost: 3B(R)
- Assumption: B(R) ≤ M²
30 Hash-Based Algorithm for γ
- Recall: γ(R) = grouping and aggregation
- Step 1. Partition R into buckets
- Step 2. Apply γ to each bucket (it can be read into main memory)
- Cost: 3B(R)
- Assumption: B(R) ≤ M²
31 Hash-Based Join
- R ⋈ S
- Recall the main-memory hash-based join:
  - Scan S, build buckets in main memory
  - Then scan R and join
32 Partitioned Hash Join
- R ⋈ S
- Step 1
  - Hash S into M buckets
  - Send all buckets to disk
- Step 2
  - Hash R into M buckets
  - Send all buckets to disk
- Step 3
  - Join every pair of corresponding buckets
33 Partitioned Hash Join
- Partition both relations using a hash function h: R tuples in partition i will only match S tuples in partition i.
- Read in a partition of R and hash it using h2 (≠ h!). Scan the matching partition of S and search for matches.
34 Partitioned Hash Join
- Cost: 3B(R) + 3B(S) (sketch below)
- Assumption: min(B(R), B(S)) ≤ M²
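A minimal Python sketch of the partitioned (Grace) hash join, assuming relations are lists of tuples joined on their first attribute; the in-memory bucket lists stand in for the partitions that a real system would write to and read back from disk, and all names are hypothetical.

```python
def partitioned_hash_join(R, S, num_buckets):
    h = lambda t: hash(t[0]) % num_buckets

    # Pass 1: partition both relations into buckets ("write to disk").
    R_parts = [[] for _ in range(num_buckets)]
    S_parts = [[] for _ in range(num_buckets)]
    for r in R:
        R_parts[h(r)].append(r)
    for s in S:
        S_parts[h(s)].append(s)

    # Pass 2: join each pair of corresponding buckets with an in-memory hash
    # table, using a second hash function (here, Python's dict hashing).
    for Ri, Si in zip(R_parts, S_parts):
        table = {}
        for s in Si:
            table.setdefault(s[0], []).append(s)
        for r in Ri:
            for s in table.get(r[0], []):
                yield (r, s)

# Usage with hypothetical relations and 4 buckets.
R = [("a", 1), ("b", 2)]
S = [("b", 20), ("c", 30)]
print(list(partitioned_hash_join(R, S, num_buckets=4)))   # [(('b', 2), ('b', 20))]
```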
35 Hybrid Hash Join Algorithm
- When we have more memory: B(S) ≪ M²
- Partition S into k buckets
  - But keep the first bucket S1 in memory; the other k−1 buckets go to disk
- Partition R into k buckets
  - The first bucket R1 is joined immediately with S1
  - The other k−1 buckets go to disk
- Finally, join the k−1 remaining pairs of buckets:
  - (R2,S2), (R3,S3), ..., (Rk,Sk)
36 Hybrid Hash Join Algorithm
- How big should we choose k?
- The average bucket size for S is B(S)/k
- We need to fit B(S)/k + (k−1) blocks in memory:
  - B(S)/k + (k−1) ≤ M
  - k slightly larger than B(S)/M
37 Hybrid Hash Join Algorithm
- How many I/Os? (worked example below)
- Recall the cost of partitioned hash join:
  - 3B(R) + 3B(S)
- Now we save 2 disk operations (one write, one read) for one bucket of each relation
- Recall there are k buckets
- Hence we save 2/k·(B(R) + B(S))
- Cost: (3 − 2/k)(B(R) + B(S)) ≈ (3 − 2M/B(S))(B(R) + B(S))
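A hypothetical worked example (illustrative numbers only): with B(S) = 1,000, B(R) = 10,000, and M = 210, we can pick k = 5 (slightly larger than B(S)/M ≈ 4.8), which needs B(S)/k + (k−1) = 200 + 4 = 204 ≤ M blocks of memory. The saving is 2/k = 40% of one write pass plus one read pass, so the cost is roughly (3 − 2/5)·(10,000 + 1,000) = 28,600 I/Os, versus 33,000 for plain partitioned hash join.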
38 Index-Based Algorithms
- In a clustered index, all tuples with the same value of the key are clustered on as few blocks as possible
39 Index-Based Selection
- Selection on equality: σ_{a=v}(R)
- Clustered index on a: cost B(R)/V(R,a)
- Unclustered index on a: cost T(R)/V(R,a)
40 Index-Based Selection
- Example: B(R) = 2,000, T(R) = 100,000, V(R,a) = 20; compute the cost of σ_{a=v}(R)
- Cost of table scan:
  - If R is clustered: B(R) = 2,000 I/Os
  - If R is unclustered: T(R) = 100,000 I/Os
- Cost of index-based selection:
  - If the index is clustered: B(R)/V(R,a) = 100
  - If the index is unclustered: T(R)/V(R,a) = 5,000
- Notice: when V(R,a) is small, an unclustered index is useless
41 Index-Based Join
- R ⋈ S (sketched below)
- Assume S has an index on the join attribute a
- Iterate over R; for each tuple, fetch the corresponding tuple(s) from S
- Assume R is clustered. Cost:
  - If the index is clustered: B(R) + T(R)·B(S)/V(S,a)
  - If the index is unclustered: B(R) + T(R)·T(S)/V(S,a)
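A minimal Python sketch of index-based (index nested loop) join, where a hypothetical in-memory dictionary stands in for the index on S's join attribute; tuples join on their first attribute, and all names and data are illustrative.

```python
from collections import defaultdict

def build_index(S):
    index = defaultdict(list)
    for s in S:
        index[s[0]].append(s)             # stand-in for an index on S's join attribute
    return index

def index_join(R, S_index):
    for r in R:                           # one scan of R ...
        for s in S_index.get(r[0], []):   # ... plus index lookups into S
            yield (r, s)

# Usage with hypothetical relations.
S_index = build_index([("a", 10), ("b", 20)])
print(list(index_join([("a", 1), ("c", 3)], S_index)))   # [(('a', 1), ('a', 10))]
```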
42 Index-Based Join
- Assume both R and S have a sorted index (B+ tree) on the join attribute
- Then perform a merge join (called a zig-zag join)
- Cost: B(R) + B(S)
43 Optimization
- Chapter 16
- At the heart of the database engine
- Step 1: convert the SQL query to some logical plan
- Step 2: find a better logical plan, and find an associated physical plan
44 Converting from SQL to Logical Plans
Select a1, ..., an From R1, ..., Rk Where C
  becomes  π_{a1,...,an}(σ_C(R1 ⋈ R2 ⋈ ... ⋈ Rk))

Select a1, ..., an From R1, ..., Rk Where C Group by b1, ..., bl
  becomes  π_{a1,...,an}(γ_{b1,...,bl, aggs}(σ_C(R1 ⋈ R2 ⋈ ... ⋈ Rk)))
45 Converting Nested Queries
Select distinct product.name
From product
Where product.maker in (Select company.name
                        From company
                        Where company.city = 'Urbana')

becomes

Select distinct product.name
From product, company
Where product.maker = company.name AND company.city = 'Urbana'
46 Converting Nested Queries
Select distinct x.name, x.maker
From product x
Where x.color = 'blue'
  AND x.price > ALL (Select y.price
                     From product y
                     Where x.maker = y.maker
                       AND y.color = 'blue')

How do we convert this one to a logical plan?
47 Converting Nested Queries
Let's compute the complement first:

Select distinct x.name, x.maker
From product x
Where x.color = 'blue'
  AND x.price <= SOME (Select y.price
                       From product y
                       Where x.maker = y.maker
                         AND y.color = 'blue')
48 Converting Nested Queries
This one becomes an SFW query:

Select distinct x.name, x.maker
From product x, product y
Where x.color = 'blue' AND x.maker = y.maker
  AND y.color = 'blue' AND x.price <= y.price

This returns exactly the products we DON'T want, so...
49 Converting Nested Queries
(Select x.name, x.maker
 From product x
 Where x.color = 'blue')
EXCEPT
(Select x.name, x.maker
 From product x, product y
 Where x.color = 'blue' AND x.maker = y.maker
   AND y.color = 'blue' AND x.price <= y.price)
50 Optimizing the Logical Query Plan
- Now we have one logical plan
- Algebraic laws
  - the foundation for every optimization
- Two approaches to optimization
  - Heuristic: apply laws that seem to result in cheaper plans
  - Cost-based: estimate the size and cost of intermediate results, and search systematically for the best plan
51 The Three Components of an Optimizer
- We need three things in an optimizer
- Algebraic laws
- An optimization algorithm
- A cost estimator
52 Algebraic Laws
- Commutative and associative laws
  - R ∪ S = S ∪ R,  R ∪ (S ∪ T) = (R ∪ S) ∪ T
  - R ∩ S = S ∩ R,  R ∩ (S ∩ T) = (R ∩ S) ∩ T
  - R ⋈ S = S ⋈ R,  R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
- Distributive laws
  - R ⋈ (S ∪ T) = (R ⋈ S) ∪ (R ⋈ T)
53 Algebraic Laws
- Laws involving selection
  - σ_{C AND C'}(R) = σ_C(σ_{C'}(R)) = σ_C(R) ∩ σ_{C'}(R)
  - σ_{C OR C'}(R) = σ_C(R) ∪ σ_{C'}(R)
  - σ_C(R ⋈ S) = σ_C(R) ⋈ S
    - when C involves only attributes of R
  - σ_C(R − S) = σ_C(R) − S
  - σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)
  - σ_C(R ∩ S) = σ_C(R) ∩ S
54 Algebraic Laws
- Example: R(A, B, C, D), S(E, F, G)
  - σ_{F=3}(R ⋈_{D=E} S) = ?
  - σ_{A=5 AND G=9}(R ⋈_{D=E} S) = ?
55 Algebraic Laws
- Laws involving projections
  - π_M(R ⋈ S) = π_N(π_P(R) ⋈ π_Q(S))
    - where N, P, Q are appropriate subsets of the attributes of M
  - π_M(π_N(R)) = π_{M∩N}(R)
- Example: R(A,B,C,D), S(E,F,G)
  - π_{A,B,G}(R ⋈_{D=E} S) = π_?(π_?(R) ⋈ π_?(S))
56 Algebraic Laws
- Laws involving grouping and aggregation
  - δ(γ_{A, agg(B)}(R)) = γ_{A, agg(B)}(R)
  - γ_{A, agg(B)}(δ(R)) = γ_{A, agg(B)}(R) if agg is duplicate insensitive
    - Which of the following are duplicate insensitive? sum, count, avg, min, max
  - γ_{A, agg(D)}(R(A,B) ⋈_{B=C} S(C,D)) = γ_{A, agg(D)}(R(A,B) ⋈_{B=C} (γ_{C, agg(D)}(S(C,D))))
    - Why is this true?
    - Why would we do it?
57 Heuristic-Based Optimizations
- Query rewriting based on algebraic laws
- Results in better queries most of the time
- Heuristic number 1
  - Push selections down
- Heuristic number 2
  - Sometimes push selections up, then down
58 Predicate Pushdown
(Two equivalent plans: before pushdown, π_{pname}(σ_{price>100 AND city='Urbana'}(Product ⋈_{maker=name} Company)); after pushdown, π_{pname}(σ_{price>100}(Product) ⋈_{maker=name} σ_{city='Urbana'}(Company)).)
The earlier we process selections, the fewer tuples we need to manipulate higher up in the tree (but this may cause us to lose an important ordering of the tuples, if we use indexes).
59 Predicate Pushdown
Select y.name, Max(x.price)
From product x, company y
Where x.maker = y.name
GroupBy y.name
Having Max(x.price) > 100

Select y.name, Max(x.price)
From product x, company y
Where x.maker = y.name and x.price > 100
GroupBy y.name
Having Max(x.price) > 100

- For each company, find the maximal price of its products.
- Advantage: the size of the join will be smaller.
- Requires transformation rules specific to the grouping/aggregation operators.
- Won't work if we replace Max by Min.
60 Pushing Predicates Up
Bargain view V1: categories with some price < 20, and the cheapest price

Select V2.name, V2.price
From V1, V2
Where V1.category = V2.category and V1.p = V2.price

Create View V1 AS
  Select x.category, Min(x.price) AS p
  From product x
  Where x.price < 20
  GroupBy x.category

Create View V2 AS
  Select y.name, x.category, x.price
  From product x, company y
  Where x.maker = y.name
61 Query Rewrite: Pushing Predicates Up
Bargain view V1: categories with some price < 20, and the cheapest price

Select V2.name, V2.price
From V1, V2
Where V1.category = V2.category and V1.p = V2.price
  AND V1.p < 20

Create View V1 AS
  Select x.category, Min(x.price) AS p
  From product x
  Where x.price < 20
  GroupBy x.category

Create View V2 AS
  Select y.name, x.category, x.price
  From product x, company y
  Where x.maker = y.name
62 Query Rewrite: Pushing Predicates Up
Bargain view V1: categories with some price < 20, and the cheapest price

Select V2.name, V2.price
From V1, V2
Where V1.category = V2.category and V1.p = V2.price
  AND V1.p < 20

Create View V1 AS
  Select x.category, Min(x.price) AS p
  From product x
  Where x.price < 20
  GroupBy x.category

Create View V2 AS
  Select y.name, x.category, x.price
  From product x, company y
  Where x.maker = y.name AND x.price < 20

- The predicate pulled up in the previous step is now pushed down into V2 (since V1.p = V2.price = x.price), so V2 scans far fewer products.
63 Cost-Based Optimizations
- Main idea: apply algebraic laws until the estimated cost is minimal
- Practically: start from partial plans and introduce operators one by one
  - We will see this in a few slides
- Problem: there are too many ways to apply the laws, hence too many (partial) plans