Title: Endbiased Samples for Join Cardinality Estimation
1End-biased Samples for Join Cardinality Estimation
- Cristian Estan, Jeffrey F. Naughton
- Computer Sciences Department
- University of Wisconsin-Madison
2Problem description
- Estimating join size
- Not restricted to key-foreign key joins
- Based on summaries of the two tables computed
separately - Two main contributions of this paper
- Proposing a new type of summaries based on a
special type of sampling - Extensive experimental comparison of many types
of summaries
3We can get more accurate estimates!
- AGMS99 showed that on certain data sets
- All summaries give inaccurate estimates
- Estimates based on random sampling are within
constant factor of bound - We show that
- On other data sets, our estimates significantly
more accurate than those with random sampling - No known summaries give estimates more accurate
than all others for every data set
4Overview
- End-biased samples
- Theoretical comparison against other
sampling-based methods - Experimental comparison against sketches and
histograms
5Building the end-biased samples
- If frequency of every value known for both tables
? exact join size - We keep a sample of this data
- Sampling probability proportional to frequency
DLT01 - Sampling decisions correlated by using a shared
hash function F90,DG00,EKMV04
Frequency of values of join attribute in table A
Frequency of values of join attribute in table B
(c,10)
(d,1)
(g,1)
(g,1)
(m,2)
(m,1)
(s,5)
(r,7)
(t,1)
(z,1)
6Estimating join size
- Let av be the frequency of value v in table A, bv
in B and pv the probability that v is selected
into both samples - Sum contribution of values in both samples (av
bv/pv) to estimate join size - If av Ta and bvTb , pv 1
- If av Ta and bvltTb , pv bv/Tb
- If av ltTa and bvTb , pv av/Ta
- If av ltTa and bvltTb , pv min(av/Ta,bv/Tb)
7Why correlate the samples?
- Example tables with 1000 values appearing once,
50 values common to both tables - We sample with probability 1/10
- Sample size 100 for each table
- Comparison
- pv
- Common values sampled
- Join size estimate
8Comparison of sampling methods
9Overview
- End-biased samples
- Theoretical comparison against other
sampling-based methods - Experimental comparison against sketches and
histograms
10Experimental methodology
- Randomly generated tables with 1,000,000 tuples
- Explored multiple configurations
- Varied the peakedness of the distribution
- Varied memory budget from 204 to 659,456 words
- Varied the amount of correlation between tables
- Uncorrelated tables generated independently
- Positively correlated frequent values likely
same in both tables - Negatively correlated unlikely frequent values
same in the two tables - 1,000 runs for each configuration
11Summaries compared
- End-biased samples
- End-biased equi-depth histograms PC84
- Sketches AGMS99,DGGR02,GGR04
- Concise samples GM98
- Counting samples GM98
12Comparison with histograms
13Comparison with sketches
14Memory comparison
15Qualitative comparison
16Conclusions
- End-biased samples and sketches are the best
summaries for the join size estimation problem
addressed in this paper - End-biased samples are compelling if
- Selections on the join attribute are required
- Summaries must be very concise
- The frequencies of join attributes in the two
tables are strongly correlated
17Questions?
Thank you!
Scripts and results for experiments available at
http//www.cs.wisc.edu/estan/ebs.tar.gz
18Estimating the join size
19Related work sampling methods
- GM98 concise samples, counting samples
- DLT01 smart sampling
- F90,EKMV04 using a hash function to select
values used as summary of data
20Related work join size estimation
- Histograms
- Multidimensional histograms
- GG02,GK04 Wavelets
- AGMS99,DGGR02,GGR04 Sketches
21Variance of join size estimate
- No slide, point to the paper.