Title: ICS 214A: Database Management Systems Winter 2004
1ICS 214A Database Management Systems Winter 2004
- Lecture 14 Histograms and Query Optimization
- Professor Chen Li
2Histograms Motivation
- Accurate estimation is critical in query
optimization - Those estimation formulas covered so far might
not be accurate enough - Instead, we keep distribution of R.A (v1,f1), ,
(vn, fn). - Called histogram.
- Idea
- Partition the domain of R.A into buckets
- Store a small summary of the distribution within
each bucket - Number of the buckets is the knob that controls
the resolution
3Equi-width Histogram
- Divide the domain into B buckets of equal width
- E.g., partition GPAs into buckets
- Store the bucket boundaries and the sum of
frequencies of the values with each bucket
4Construction and maintenance
- Construction
- Use one pass over R to construct an accurate
equi-width histogram - Keep a running count for each bucket
- If scanning is not acceptable, use sampling
- Construct a histogram on Rsample, and scale the
frequencies by R/Rsample - Maintenance
- Incremental maintenance for each update on R,
increment/decrement the corresponding bucket
frequencies - Periodical re-computation because distribution
changes slowly
5Using an equi-width histogram
- Q sA5(R)
- 5 is in bucket 5,8 (with 19 tuples)
- Assume uniform distribution within the bucket
- Thus Q ? 19/4 ? 5.
- Actual value is 1
- Q sAgt7 A lt 16(R)
- 7,16 covers 9,12 (27 tuples) and 13,16 (13
tuples) - 7,16 partially covers 5,8 (19 tuples)
- Thus Q ? 19/2 27 13 ? 50
- Actual value 52.
6Equi-height Histogram
- Divide the domain into B buckets with roughly the
same number of tuples in each bucket - Store this number and the bucket boundaries
- Intuition high frequencies are more important
than low frequencies
7Construction and maintenance
- Construction
- Sort all R.A values, and then take equally spaced
slights - Example 1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12
14 16 - Sampling also works
- Maintenance
- Incremental maintenance
- Merge adjacent buckets with small counts
- Split any bucket with a large count
- Select the median value to split
- Need a sample of the values within this bucket to
work well - Periodic re-computation also works.
8Using an equi-height histogram
- Q sA5(R)
- 5 is in bucket 1,7 (with 16 tuples)
- Assume uniform distribution within the bucket
- Thus Q ? 16/7 ? 2. (actual value 1)
- Q sAgt7 A lt 16(R)
- 7,16 covers 8,9, 10,11,12,16 (all with
tuples) - 7,16 partially covers 1,7 (16 tuples)
- Thus Q ? 16/7 16 16 16 ? 50
- Actually Q 52.
9Histogram tricks
- Store the number of distinct values in each
bucket - To get rid of the effects of the values with 0
frequency - These values tend to cause underestimation
- Compressed histogram
- Store (vi, fi) pairs explicitly if fi is high
- For other values, use an equi-width or
equi-height histogram
10More Histograms
- V-optimal histogram
- Avoid putting very different frequencies into the
same bucket - Partition in a way to minimize ?iVARi, where VARi
is the frequency variance within bucket i - MaxDiff Histogram
- Define area to be the product of the frequency of
a value and its spread (the difference between
this value and the next value with non-zero
frequency) - Insert bucket boundaries where two adjacent areas
differ by large amounts - More in Poosala et al., SIGMOD 1996
11Did not cover
- Multidimensional histograms
- Using histograms to estimate other operators
(e.g., joins) - Other mechanisms for selectivity estimation
- Wavelets
- Sampling
- Probabilistic models of data
- Much recent research on selectivity estimation in
the context of - Query processing
- OLAP
12Next query optimization
13Query Optimization
SQL query
parse
parse tree
convert
answer
logical query plan
execute
apply laws
statistics
Pi
improved l.q.p
pick best
estimate result sizes
(P1,C1),(P2,C2)...
l.q.p. sizes
estimate costs
consider physical plans
P1,P2,..
14Query Optimization
- Convert a logical query plan into an optimal
physical plan. - Step 1 use rewrite laws to improve logical plan.
- Step 2 use cost-based enumeration to choose a
good physical plan corresponding to logical plan. - Determine order of join
- Specify algorithm for each operator
- Materialization vs. pipelining
- Issues
- 1. Estimation cost of operators
- cost model
- input size
- 2. Enumeration algorithm
15Example SQL query
- SELECT title
- FROM StarsIn
- WHERE starName IN (
- SELECT name
- FROM MovieStar
- WHERE birthdate LIKE 1960
- )
- (Find the movies with stars born in 1960)
16Parse Tree
ltQuerygt
ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltTuplegt IN ltQuerygt
title StarsIn
ltAttributegt ( ltQuerygt )
starName ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltAttributegt LIKE ltPatterngt
name MovieStar
birthDate 1960
17Logical Query Plan techniques to covert a parse
tree to a logical plan
?title
?starNamename
?
StarsIn ?name
?birthdate LIKE 1960
MovieStar
18Improved Logical Query Plan
?title
Push selection down.
starNamename
In general how?
StarsIn ?name
?birthdate LIKE 1960
MovieStar
19Relational algebra optimization
- Transformation rules (preserve equivalence)
- What are good transformations?
20Rules Natural joins cross products union
- Carry attribute names in results, so order is not
important - Can also write as trees, e.g.
-
- T R
R S S T
21Rules Natural joins cross products union
R S S R (R S) T R (S T)
- R ? S S ? R
- (R ? S) ? T R ? (S ? T)
- R ? S S ? R
- R ? (S ? T) (R ? S) ? T
Commutative law
Associative Law
22Rules Selects
sp1 sp2 (R) sp2 sp1 (R) sp1 (R) ?
sp2 (R)
- Example R(dept, id, salary), S(id, city, tel)
- P1 salary gt 50K
- P2 dept toy