Overcoming Limitations of Sampling for Aggregation Queries - PowerPoint PPT Presentation

About This Presentation
Title:

Overcoming Limitations of Sampling for Aggregation Queries

Description:

Distribution of aggregated attribute is skewed. low selectivity. Solution proposed by the paper ... Avg aggregate: same as sum aggregate ... – PowerPoint PPT presentation

Number of Views:160
Avg rating:3.0/5.0
Slides: 23
Provided by: Ran54
Learn more at: https://crystal.uta.edu
Category:

less

Transcript and Presenter's Notes

Title: Overcoming Limitations of Sampling for Aggregation Queries


1
Overcoming Limitations of Sampling for
Aggregation Queries
  • Surajit Chaudhuri, Microsoft Research
  • Gautam Das, Microsoft Research
  • Mayur Datar, Stanford University
  • Rajiv Motwani, Stanford University
  • Vivek Narasayya, Microsoft Research

Presented by Daniel Kuang CSE_at_UTA
2
Outline
  • Introduction
  • Limitations of uniform random sampling
  • Handling Data skew
  • Handling low selectivity small groups
  • Implementation result
  • Conclusion

Daniel Kuang CSE_at_UTA
3
Introduction
  • Approximate aggregation query processing using
    sampling
  • Uniform sampling performs poorly when
  • Distribution of aggregated attribute is skewed
  • low selectivity
  • Solution proposed by the paper
  • Outlier indexing
  • Weighted sampling based on workload information
  • A combined approach significantly reduces the
    approximation error.

Daniel Kuang CSE_at_UTA
4
Limitations of uniform random sampling an
example
  • Relation R with 10,000 tuples aggregate on column
    C
  • 99 tuples have value 1 - 9900
  • 1 tuples have value 1000 100, 000
  • 1 URS of R Sample size 100
  • sum result sum of sample x scale factor(100)
  • All of 100 tuples have value 1 in sample
  • estimated result 100x100 10, 000
  • 2 or more tuples have value 1,000 in sample
  • estimated result gt 209, 800
  • Reasonable result only when we get exactly one
    tuple of value 1000 per sample probability 0.37
  • With probability of 0.63, we would get a large
    error in the estimate.

Actual sum 109, 900
Daniel Kuang CSE_at_UTA
5
Limitations of uniform random sampling skewed
data
  • Contribution of the outliers to the error in
    estimating sum via uniform sampling
  • Relation R size N having values y1, y2, yN.
    Let U be a uniform random sample of yis of size
    n. Then
  • Actual sum
  • Estimated sum
  • Standard error
  • Where S is the standard deviation of the values
    in the relation defined as
  • If there are outliers in the data then S could be
    very large. For a given error bound we need to
    increase the sample size n.

yi
Ye (N/n)
Daniel Kuang CSE_at_UTA
6
Limitations of uniform random sampling low
selectivity
  • Selection query partitions relation into 2
    sub-relations
  • Tuples that satisfy the condition relevant
    sub-relation
  • Tuples that do not satisfy the condition
  • Number of tuples sampled from relevant
    sub-relation is proportional to its size.
  • If this relevant sample size is small due to low
    selectivity of the query, it may lead to larger
    error
  • Success of uniform sampling depends on the size
    of the relevant sub-relation

Daniel Kuang CSE_at_UTA
7
Handling data skew outlier indexes
  • Idea of outlier indexing identify outlier
    tuples and store them separately
  • Example selection query (Q) with sum aggregate
  • Partition to two sub-relations Ro and RNO.
  • Apply the query to the sub-relation of outliers.
  • Apply the query to the sub-relation of RNO.
  • Combine the above two to get overall estimate
    result of query

Daniel Kuang CSE_at_UTA
8
Handling data skew outlier indexes an example
  • Aggregate Query Q - Select sum (sales) from
    lineitem
  • Preprocessing
  • 1. Determine outliers lineitem_outlier view
  • 2. Sample non-outliers sample table
    lineitem_samp
  • Query processing
  • 1. Aggregate outliers sum(sales) from
    lineitem_outlier view
  • 2. Aggregate non-outliers apply the query to
    sample T and extrapolate sum(sales) for
    lineitem_samp scale factor
  • 3. Combine aggregates approximate result from
    RNO exact result from RO approximate result
    for R

Daniel Kuang CSE_at_UTA
9
Handling data skew outlier indexes selection
of outliers
  • Query error is solely due to error in estimating
    the aggregate from non-outliers
  • Additional overhead of maintaining and accessing
    the outlier index
  • An optimal sub-relation Ro that leads to the
    minimum possible sampling error

Daniel Kuang CSE_at_UTA
10
Handling data skew outlier indexes selection
of outliers
  • Theorem 2
  • Consider a multiset R y1, y2,yN in sorted
    order. Ro C R such that
  • Ro t
  • S(R\ Ro) min RC R ,R
    tS(R\R)
  • There exists some 0 t t such that Ro yi1
    i t Uyi(N t 1- t) i N
  • States that the subset that minimizes the
    standard deviation over the remaining set
    consists of the leftmost t elements and the
    right most t- t elements from the multiset R
    when elements are arranged in sorted order.

Daniel Kuang CSE_at_UTA
11
Handling data skew outlier indexes selection
of outliers
Alogirithm Outlier-Index(R,C, t) Let the values
in column C be sorted in relation R For i 1 to
t1, compute E(i) S(yi, yi1,yN-
ti-1) Let i be the value of i where E(i) is
minimum. Outlier-index is the tuples that
correspond to the set of values yj1 j t
Uyj(N t 1- t) j N where t i - 1
Daniel Kuang CSE_at_UTA
12
Handling data skew outlier indexes storage
allocation
Given sufficient space to store m tuples, how do
we allocate storage between samples and
outlier-index in order to minimize error? S(t) is
the standard deviation in non-outliers for an
optimal outlier-index of t. If we allocate space
m such that t tuples in the outlier and m- t in
the sample. Then error will be proportional to
S(t)/v(m- t). Finding the value of t for which
S(t)/v(m- t) will be minimum.
Daniel Kuang CSE_at_UTA
13
Use of outlier-indexing extension to other
aggregates
  • Avg aggregate same as sum aggregate
  • Count aggregate outlier-indexing not beneficial
    since there is no variance among the data values
  • Rank order dependent aggregates such as min,
    max, and median outlier not useful

Daniel Kuang CSE_at_UTA
14
Handling low selectivity and small groups
  • Use weighted sampling with the help of workload
    information instead of uniform sampling
  • More samples from the subsets that are small in
    size but are more important

Daniel Kuang CSE_at_UTA
15
Handling low selectivity and small groups
Exploit workload info
  • Workload collection collect workload
    information consisting of representative queries
    from query profilers and query logs
  • Trace query patterns parsed information e.g.
    the set of selection conditions
  • Trace tuple usage usage of specific tuples like
    frequency of access to each tuple
  • Weighted sampling perform sampling by taking
    into account weights of the tuples

Daniel Kuang CSE_at_UTA
16
Handling low selectivity and small groups
Weighted Sampling
  • Let the weight of tuple ti be wi
  • Normalized weight be wi wi/ Swi
  • probability of acceptance of this tuple in the
    sample pi n.wi
  • For each tuple included in the sample, store
    corresponding pi.
  • Each aggregate computed over this tuple gets
    multiplied by the inverse of pi
  • works well if
  • Access pattern is local i.e. most queries access
    a small part of relation
  • Workload is a good representative of actual
    queries

Daniel Kuang CSE_at_UTA
17
Implementation Experimental Result
  • Parameters
  • (1) skew of the data (z) -- 1, 1.5, 2, 2.5, and 3
  • (2) sampling fraction (f) -- from 1 to 100
  • (3) storage for the outlier-index - 1, 5, I0,
    and 20
  • Comparisons
  • (1) Uniform sampling
  • (2) Weighted sampling
  • (3) weighted sampling outlier-indexing

Daniel Kuang CSE_at_UTA
18
Experimental Result - Varying the data skew
Daniel Kuang CSE_at_UTA
19
Experimental Result - Varying the sampling
fraction
Daniel Kuang CSE_at_UTA
20
Experimental Result - Varying the selectivity of
queries
Daniel Kuang CSE_at_UTA
21
Conclusion and Open Problems
  • Skew can lead to large errors - outlier-indexing
    significantly reduce the error.
  • Problem of low selectivity of queries Weighted
    sampling based on workload information.

Daniel Kuang CSE_at_UTA
22
  • Questions ?

Daniel Kuang CSE_at_UTA
Write a Comment
User Comments (0)
About PowerShow.com