Title: Overcoming Limitations of Sampling for Aggregation Queries
1Overcoming Limitations of Sampling for
Aggregation Queries
- Surajit Chaudhuri, Microsoft Research
- Gautam Das, Microsoft Research
- Mayur Datar, Stanford University
- Rajiv Motwani, Stanford University
- Vivek Narasayya, Microsoft Research
Presented by Daniel Kuang CSE_at_UTA
2Outline
- Introduction
- Limitations of uniform random sampling
- Handling Data skew
- Handling low selectivity small groups
- Implementation result
- Conclusion
Daniel Kuang CSE_at_UTA
3Introduction
- Approximate aggregation query processing using
sampling - Uniform sampling performs poorly when
- Distribution of aggregated attribute is skewed
- low selectivity
- Solution proposed by the paper
- Outlier indexing
- Weighted sampling based on workload information
- A combined approach significantly reduces the
approximation error.
Daniel Kuang CSE_at_UTA
4Limitations of uniform random sampling an
example
- Relation R with 10,000 tuples aggregate on column
C - 99 tuples have value 1 - 9900
- 1 tuples have value 1000 100, 000
- 1 URS of R Sample size 100
- sum result sum of sample x scale factor(100)
- All of 100 tuples have value 1 in sample
- estimated result 100x100 10, 000
- 2 or more tuples have value 1,000 in sample
- estimated result gt 209, 800
- Reasonable result only when we get exactly one
tuple of value 1000 per sample probability 0.37 - With probability of 0.63, we would get a large
error in the estimate.
Actual sum 109, 900
Daniel Kuang CSE_at_UTA
5Limitations of uniform random sampling skewed
data
- Contribution of the outliers to the error in
estimating sum via uniform sampling - Relation R size N having values y1, y2, yN.
Let U be a uniform random sample of yis of size
n. Then - Actual sum
- Estimated sum
- Standard error
- Where S is the standard deviation of the values
in the relation defined as - If there are outliers in the data then S could be
very large. For a given error bound we need to
increase the sample size n.
yi
Ye (N/n)
Daniel Kuang CSE_at_UTA
6Limitations of uniform random sampling low
selectivity
- Selection query partitions relation into 2
sub-relations - Tuples that satisfy the condition relevant
sub-relation - Tuples that do not satisfy the condition
- Number of tuples sampled from relevant
sub-relation is proportional to its size. - If this relevant sample size is small due to low
selectivity of the query, it may lead to larger
error - Success of uniform sampling depends on the size
of the relevant sub-relation
Daniel Kuang CSE_at_UTA
7Handling data skew outlier indexes
- Idea of outlier indexing identify outlier
tuples and store them separately - Example selection query (Q) with sum aggregate
- Partition to two sub-relations Ro and RNO.
- Apply the query to the sub-relation of outliers.
- Apply the query to the sub-relation of RNO.
- Combine the above two to get overall estimate
result of query
Daniel Kuang CSE_at_UTA
8Handling data skew outlier indexes an example
- Aggregate Query Q - Select sum (sales) from
lineitem - Preprocessing
- 1. Determine outliers lineitem_outlier view
- 2. Sample non-outliers sample table
lineitem_samp - Query processing
- 1. Aggregate outliers sum(sales) from
lineitem_outlier view - 2. Aggregate non-outliers apply the query to
sample T and extrapolate sum(sales) for
lineitem_samp scale factor - 3. Combine aggregates approximate result from
RNO exact result from RO approximate result
for R
Daniel Kuang CSE_at_UTA
9Handling data skew outlier indexes selection
of outliers
- Query error is solely due to error in estimating
the aggregate from non-outliers - Additional overhead of maintaining and accessing
the outlier index - An optimal sub-relation Ro that leads to the
minimum possible sampling error
Daniel Kuang CSE_at_UTA
10Handling data skew outlier indexes selection
of outliers
- Theorem 2
- Consider a multiset R y1, y2,yN in sorted
order. Ro C R such that - Ro t
- S(R\ Ro) min RC R ,R
tS(R\R) - There exists some 0 t t such that Ro yi1
i t Uyi(N t 1- t) i N - States that the subset that minimizes the
standard deviation over the remaining set
consists of the leftmost t elements and the
right most t- t elements from the multiset R
when elements are arranged in sorted order.
Daniel Kuang CSE_at_UTA
11Handling data skew outlier indexes selection
of outliers
Alogirithm Outlier-Index(R,C, t) Let the values
in column C be sorted in relation R For i 1 to
t1, compute E(i) S(yi, yi1,yN-
ti-1) Let i be the value of i where E(i) is
minimum. Outlier-index is the tuples that
correspond to the set of values yj1 j t
Uyj(N t 1- t) j N where t i - 1
Daniel Kuang CSE_at_UTA
12Handling data skew outlier indexes storage
allocation
Given sufficient space to store m tuples, how do
we allocate storage between samples and
outlier-index in order to minimize error? S(t) is
the standard deviation in non-outliers for an
optimal outlier-index of t. If we allocate space
m such that t tuples in the outlier and m- t in
the sample. Then error will be proportional to
S(t)/v(m- t). Finding the value of t for which
S(t)/v(m- t) will be minimum.
Daniel Kuang CSE_at_UTA
13Use of outlier-indexing extension to other
aggregates
- Avg aggregate same as sum aggregate
- Count aggregate outlier-indexing not beneficial
since there is no variance among the data values - Rank order dependent aggregates such as min,
max, and median outlier not useful
Daniel Kuang CSE_at_UTA
14Handling low selectivity and small groups
- Use weighted sampling with the help of workload
information instead of uniform sampling - More samples from the subsets that are small in
size but are more important
Daniel Kuang CSE_at_UTA
15Handling low selectivity and small groups
Exploit workload info
- Workload collection collect workload
information consisting of representative queries
from query profilers and query logs - Trace query patterns parsed information e.g.
the set of selection conditions - Trace tuple usage usage of specific tuples like
frequency of access to each tuple - Weighted sampling perform sampling by taking
into account weights of the tuples
Daniel Kuang CSE_at_UTA
16Handling low selectivity and small groups
Weighted Sampling
- Let the weight of tuple ti be wi
- Normalized weight be wi wi/ Swi
- probability of acceptance of this tuple in the
sample pi n.wi - For each tuple included in the sample, store
corresponding pi. - Each aggregate computed over this tuple gets
multiplied by the inverse of pi - works well if
- Access pattern is local i.e. most queries access
a small part of relation - Workload is a good representative of actual
queries
Daniel Kuang CSE_at_UTA
17Implementation Experimental Result
- Parameters
- (1) skew of the data (z) -- 1, 1.5, 2, 2.5, and 3
- (2) sampling fraction (f) -- from 1 to 100
- (3) storage for the outlier-index - 1, 5, I0,
and 20
- Comparisons
- (1) Uniform sampling
- (2) Weighted sampling
- (3) weighted sampling outlier-indexing
Daniel Kuang CSE_at_UTA
18Experimental Result - Varying the data skew
Daniel Kuang CSE_at_UTA
19Experimental Result - Varying the sampling
fraction
Daniel Kuang CSE_at_UTA
20Experimental Result - Varying the selectivity of
queries
Daniel Kuang CSE_at_UTA
21Conclusion and Open Problems
- Skew can lead to large errors - outlier-indexing
significantly reduce the error. - Problem of low selectivity of queries Weighted
sampling based on workload information.
Daniel Kuang CSE_at_UTA
22Daniel Kuang CSE_at_UTA