Overcoming Limitations of Sampling for Aggregation Queries - PowerPoint PPT Presentation

About This Presentation

Title:

Overcoming Limitations of Sampling for Aggregation Queries

Description:

Distribution of aggregated attribute is skewed. low selectivity. Solution proposed by the paper ... Avg aggregate: same as sum aggregate ... – PowerPoint PPT presentation

Number of Views:160

Avg rating:3.0/5.0

Slides: 23

Provided by: Ran54

Learn more at: https://crystal.uta.edu

Category:

more less

Transcript and Presenter's Notes

Title: Overcoming Limitations of Sampling for Aggregation Queries

1
Overcoming Limitations of Sampling for
Aggregation Queries

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Mayur Datar, Stanford University
Rajiv Motwani, Stanford University
Vivek Narasayya, Microsoft Research

Presented by Daniel Kuang CSE_at_UTA
2
Outline

Introduction
Limitations of uniform random sampling
Handling Data skew
Handling low selectivity small groups
Implementation result
Conclusion

Daniel Kuang CSE_at_UTA
3
Introduction

Approximate aggregation query processing using
sampling
Uniform sampling performs poorly when
Distribution of aggregated attribute is skewed
low selectivity
Solution proposed by the paper
Outlier indexing
Weighted sampling based on workload information
A combined approach significantly reduces the
approximation error.

Daniel Kuang CSE_at_UTA
4
Limitations of uniform random sampling an
example

Relation R with 10,000 tuples aggregate on column
C
99 tuples have value 1 - 9900
1 tuples have value 1000 100, 000
1 URS of R Sample size 100
sum result sum of sample x scale factor(100)
All of 100 tuples have value 1 in sample
estimated result 100x100 10, 000
2 or more tuples have value 1,000 in sample
estimated result gt 209, 800
Reasonable result only when we get exactly one
tuple of value 1000 per sample probability 0.37
With probability of 0.63, we would get a large
error in the estimate.

Actual sum 109, 900
Daniel Kuang CSE_at_UTA
5
Limitations of uniform random sampling skewed
data

Contribution of the outliers to the error in
estimating sum via uniform sampling
Relation R size N having values y1, y2, yN.
Let U be a uniform random sample of yis of size
n. Then
Actual sum
Estimated sum
Standard error
Where S is the standard deviation of the values
in the relation defined as
If there are outliers in the data then S could be
very large. For a given error bound we need to
increase the sample size n.

yi
Ye (N/n)
Daniel Kuang CSE_at_UTA
6
Limitations of uniform random sampling low
selectivity

Selection query partitions relation into 2
sub-relations
Tuples that satisfy the condition relevant
sub-relation
Tuples that do not satisfy the condition
Number of tuples sampled from relevant
sub-relation is proportional to its size.
If this relevant sample size is small due to low
selectivity of the query, it may lead to larger
error
Success of uniform sampling depends on the size
of the relevant sub-relation

Daniel Kuang CSE_at_UTA
7
Handling data skew outlier indexes

Idea of outlier indexing identify outlier
tuples and store them separately
Example selection query (Q) with sum aggregate
Partition to two sub-relations Ro and RNO.
Apply the query to the sub-relation of outliers.
Apply the query to the sub-relation of RNO.
Combine the above two to get overall estimate
result of query

Daniel Kuang CSE_at_UTA
8
Handling data skew outlier indexes an example

Aggregate Query Q - Select sum (sales) from
lineitem
Preprocessing
1. Determine outliers lineitem_outlier view
2. Sample non-outliers sample table
lineitem_samp
Query processing
1. Aggregate outliers sum(sales) from
lineitem_outlier view
2. Aggregate non-outliers apply the query to
sample T and extrapolate sum(sales) for
lineitem_samp scale factor
3. Combine aggregates approximate result from
RNO exact result from RO approximate result
for R

Daniel Kuang CSE_at_UTA
9
Handling data skew outlier indexes selection
of outliers

Query error is solely due to error in estimating
the aggregate from non-outliers
Additional overhead of maintaining and accessing
the outlier index
An optimal sub-relation Ro that leads to the
minimum possible sampling error

Daniel Kuang CSE_at_UTA
10
Handling data skew outlier indexes selection
of outliers

Theorem 2
Consider a multiset R y1, y2,yN in sorted
order. Ro C R such that
Ro t
S(R\ Ro) min RC R ,R
tS(R\R)
There exists some 0 t t such that Ro yi1
i t Uyi(N t 1- t) i N
States that the subset that minimizes the
standard deviation over the remaining set
consists of the leftmost t elements and the
right most t- t elements from the multiset R
when elements are arranged in sorted order.

Daniel Kuang CSE_at_UTA
11
Handling data skew outlier indexes selection
of outliers
Alogirithm Outlier-Index(R,C, t) Let the values
in column C be sorted in relation R For i 1 to
t1, compute E(i) S(yi, yi1,yN-
ti-1) Let i be the value of i where E(i) is
minimum. Outlier-index is the tuples that
correspond to the set of values yj1 j t
Uyj(N t 1- t) j N where t i - 1
Daniel Kuang CSE_at_UTA
12
Handling data skew outlier indexes storage
allocation
Given sufficient space to store m tuples, how do
we allocate storage between samples and
outlier-index in order to minimize error? S(t) is
the standard deviation in non-outliers for an
optimal outlier-index of t. If we allocate space
m such that t tuples in the outlier and m- t in
the sample. Then error will be proportional to
S(t)/v(m- t). Finding the value of t for which
S(t)/v(m- t) will be minimum.
Daniel Kuang CSE_at_UTA
13
Use of outlier-indexing extension to other
aggregates

Avg aggregate same as sum aggregate
Count aggregate outlier-indexing not beneficial
since there is no variance among the data values
Rank order dependent aggregates such as min,
max, and median outlier not useful

Daniel Kuang CSE_at_UTA
14
Handling low selectivity and small groups

Use weighted sampling with the help of workload
information instead of uniform sampling
More samples from the subsets that are small in
size but are more important

Daniel Kuang CSE_at_UTA
15
Handling low selectivity and small groups
Exploit workload info

Workload collection collect workload
information consisting of representative queries
from query profilers and query logs
Trace query patterns parsed information e.g.
the set of selection conditions
Trace tuple usage usage of specific tuples like
frequency of access to each tuple
Weighted sampling perform sampling by taking
into account weights of the tuples

Daniel Kuang CSE_at_UTA
16
Handling low selectivity and small groups
Weighted Sampling

Let the weight of tuple ti be wi
Normalized weight be wi wi/ Swi
probability of acceptance of this tuple in the
sample pi n.wi
For each tuple included in the sample, store
corresponding pi.
Each aggregate computed over this tuple gets
multiplied by the inverse of pi
works well if
Access pattern is local i.e. most queries access
a small part of relation
Workload is a good representative of actual
queries

Daniel Kuang CSE_at_UTA
17
Implementation Experimental Result

Parameters
(1) skew of the data (z) -- 1, 1.5, 2, 2.5, and 3
(2) sampling fraction (f) -- from 1 to 100
(3) storage for the outlier-index - 1, 5, I0,
and 20

Comparisons
(1) Uniform sampling
(2) Weighted sampling
(3) weighted sampling outlier-indexing

Daniel Kuang CSE_at_UTA
18
Experimental Result - Varying the data skew
Daniel Kuang CSE_at_UTA
19
Experimental Result - Varying the sampling
fraction
Daniel Kuang CSE_at_UTA
20
Experimental Result - Varying the selectivity of
queries
Daniel Kuang CSE_at_UTA
21
Conclusion and Open Problems

Skew can lead to large errors - outlier-indexing
significantly reduce the error.
Problem of low selectivity of queries Weighted
sampling based on workload information.

Daniel Kuang CSE_at_UTA
22