Title: Fast Incremental Maintenance of Approximate histograms :
1Fast Incremental Maintenance of Approximate
histograms
- Phillip B. Gibbons (Intel Research Pittsburgh)
- Yossi Matias (Tel Aviv University)
- Viswanath Poosala (Bell Laboratories)
- Presented by
- Amrita Tamrakar
- CSE 6392
- 09-feb-2006
2Introduction
- What is a histogram?
- Issues in Histogram maintenance
- Novel concept of Backing sample
- Types of approximate histograms
- Incremental maintenance of approx histograms
- Challenges and solutions
- Conclusion
3What is a histogram?
Frequency
- maintained to approximate the distribution of
data in the attributes - constructed by partitioning the data into
mutually disjoint subsets - Frequency as y axis and the data intervals as x
axis - Oracle, DB2, SQLserver, Sybase, Informix
Data value interval
http//www.shodor.org/interactivate/activities/his
togram/
4History of Histogram
Equi-width histogram
Compressed histogram
Learn more on Histogram
5Issues on Histogram Maintenance
- precomputed on underlying data
- Stored in main memory , less overhead
- What about the maintenance ??
- Database is modified
- Query is changed(?)
- Outdated histogram
- Does periodic updates solve the problem?
- Recomputing from the scratch
- Poor estimation during the in-between period
- Whats the solution ?
6The solution to outdated histograms
- Maintain Approximate histogram in presence of
database updates - Split and merge technique for quick adjustment
- Backing sample stored in secondary mm
7Backing Sample
- Only row id and the necessary attributes
- At any time, backing sample random sample
- No entire table scan
- Records in Consecutive disk blocks
Histogram
Backing sample (100KB)
2 KB
Main memory
Relation (20GB)
8How to maintain a backing sample?
- During insertions
- Reservoir sampling technique
- Obtain sample of data from a single scan without
a priori knowledge of no of tuples. - Length of random skip chosen such that each tuple
is likely to be in the reservoir.
1
First n
2
MaintainBackingSample
Skip random no of record and replace
n
n1
9How to maintain a backing sample?
- During modification
- Modify if tuple present in sample
- During Deletion
- Remove from the sample
- If sample size decrease below lower bound L, then
recompute from disk.
10Maintain approximate Histograms Different
Classes of Histograms
- Equidepth histograms
- No. of tuples in each bucket is same
- Contiguous ranges of attribute values
Frequency of occurrence
Data value
11Different Classes of Histograms
- Compressed (V,F) histogram
- N highest frequencies stored in singleton buckets
- For other values, use equi-depth histogram
- Both histograms needs to store for each bucket
- The largest value in the bucket B.maxval
- The Count B.count
- Approximate histograms are calculated from the
random sample of the Relation - How to maintain these histograms?
12Fast Incremental maintenance of approximate
equi-depth histograms
- During Insertion
- Maintain a threshold (T) upper bound
- If no of tuples lt T, insertion will increment the
bucket count. - Else recompute the histogram
- Split and merge algorithm
- Reduce the no. of recomputations from the sample
- When bucket count reaches T, instead of
recomputing split the bucket in half. - But maintain the number of bucket as fixed by
merging two buckets whose total countltT
13Split n merge algorithm
Insert threshold
14To handle modify and delete
- Deletion can lower the bucket count
- Maintain a Tl as lower threshold
- Merge if below threshold
- Split bucket with largest count
Delete threshold
15Fast Incremental maintenance of approximate
compressed histograms
- Values with high frequencies can span more than
one bucket replace by single bucket with single
count singleton buckets - Construct compressed histogram on the sample and
scale it by N/k factor. - During insertions
- If the count doesnt exceed threshold, add to the
bucket, else update bucket boundaries
16Challenges to maintain compressed histograms
- New values may lead to data skew, which may lead
to new singleton buckets - Values may not belong to singleton buckets if
tuples increase in equi-depth buckets - Number of equi-depth buckets needs adjustment
- No. of tuples in equi-depth buckets needs
adjustment
17Solutions to the challenges
- Large number of same value will cause an
equi-depth bucket to split but the adjacent
boundaries will have same value, hint create
singleton bucket for that value - allow singleton buckets with small counts to be
merged back into equi-depth buckets. - Split and merge technique to control imbalance
between equi-depth buckets and their tuples
without recomputation
18To handle deletion and modification
- Deletion can decrease number of tuples in a
bucket relative to another bucket , making a
singleton bucket - can drop a bucket count to the lower threshold
TL. - What to do?
- Merge the pair with smallest combined count and
split the bucket with largest count - Else recompute from backing sample
19Conclusion
- Backing sample
- Incremental maintenance of equi-depth and
compressed histograms - Split and merge technique to reduce access to
backing sample
20Use of histograms in Commercial database