Overcoming Limitations of Sampling for Aggregation Queries

About This Presentation

Title:

Overcoming Limitations of Sampling for Aggregation Queries

Description:

Overcoming Limitations of Sampling for Aggregation ... Weighted sampling based on workload information ... Unbiased estimator. Actual sum. Standard error ... – PowerPoint PPT presentation

Number of Views:50

Avg rating:3.0/5.0

Slides: 30

Provided by: Ran54

Learn more at: https://crystal.uta.edu

Category:

more less

Transcript and Presenter's Notes

Title: Overcoming Limitations of Sampling for Aggregation Queries

1
Overcoming Limitations of Sampling for
Aggregation Queries

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Mayur Datar, Stanford University
Rajiv Motwani, Stanford University
Vivek Narasayya, Microsoft Research

Presented by Ranjan Dash CSE_at_UTA
2
Outline

Abstract
Introduction
Related Work
Study of Limitations
Handling Data skew
Handling low selectivity small groups
Implementation and experimental result
Conclusion and future work

3
Abstract

Approximate aggregation query processing using
sampling
Uniform sampling performs poorly when -
Distribution of aggregated attribute is skewed
Queries with low selectivity
Solution proposed by the paper -
Outlier indexing
Weighted sampling based on workload information
A Combined approach significantly reduces the
approximation error.

4
Introduction

Approximate query processing used extensively by
OLAP and Data mining tools.
Use of uniform random sampling is error prone
because of -
Presence of data skew presence of outlier
values that are significantly different from the
rest in terms of their contribution to the
aggregate
Low selectivity Very few or no tuples may
satisfy the query predicate. Extrapolating from
such a small set of tuples led to error.
Two techniques to overcome these limitations of
URS.
Isolate values in the data that could contribute
heavily to the error in sampling
Outlier-indexing
Exploit the workload information for queries with
low selectivity.
Single table queries involving selection and
group by queries with sum aggregate can be
extended to other aggregation functions and
foreign key joins.
Combination of outlier-indexing and weighted
sampling based on workload info results in
significant error reduction.

5
Related Work

Related research in approximately answering
aggregation queries
Use of online sampling technique for data with
little or no variability 12
Use of pre-computed samples or synopsis in
answering aggregation queries 2
Techniques for fast incremental maintenance of
summery statistics to provide approximate query
answer 9
Proposal of a weighted sampling scheme 1
Use of weighted sampling to continuously tuning
representative sample of data
Concept of detecting and removing deviants in
time series data 14

6
Limitations of uniform random sampling an
example

Adverse impact of data skew on aggregation
queries Relation R with 10,000 tuples
aggregate on column C
99(9900) tuples 1 9900
1(100) tuples 1000 100, 000
1 URS of R Sample size 100
result sum of sample x scale factor(100)
Quite likely sample would not include any tuple
of value 1000 estimated result 100x100 10,
000
2 or more tuples of value 1000 gets included in
sample estimated result gt 209, 800
Reasonable result only when we get exactly one
tuple of value 1000 per sample This event has
probability 0.37.
With probability of 0.63, we would get a large
error in the estimate.

109, 900
Estimated result gtgt true sum 109, 900
7
Limitations of uniform random sampling skewed
data

Contribution of outliers to the error in
estimating sum via uniform sampling -
Relation R of size N having values y1, y2,
yN. Let U be a uniform random sample of yis
of size n. Then
Unbiased estimator
Actual sum
Standard error
Where S is the standard deviation of the values
in the relation defined as
If there are outliers in the data then S could be
very large. For a given error bound we need to
increase the sample size n.

Theorem 1
Ye (N/n)
yi
8
Limitations of uniform random sampling low
selectivity

Selection query partitions relation into 2
sub-relation
Tuples that satisfy the condition relevant
sub-relation
Tuples that do not satisfy the condition
Number of tuples sampled from relevant
sub-relation is proportional to its size.
If this relevant sample size is small due to low
selectivity of the query, it may lead to larger
error
Group by queries partition the relation into many
sub-relations
Success of uniform sampling depends on the size
of the relevant sub-relation

9
Handling data skew outlier indexes

Large variance in aggregate column due to
presence of outliers or deviants lead to large
error.
Natural idea is to deal with it separately
outlier indexing
Example selection query (Q) with sum aggregate
Separate the outliers into a separate
sub-relation (RO)
Apply the query to this sub-relation of outliers
and get the true result of this query.
Pick a uniform random sample from the non-outlier
part of the relation (RNO) and estimate the
approximate true result using theorem 1.
Combine the above two to get overall estimate of
querys true result.

10
Handling data skew outlier indexes an example

Aggregate Query Q - Select sum (sales) from
lineitem
Preprocessing
1. Partition the relation into RO (outliers) and
RNO (non-outliers)
2. Ro is lineitem_outlier a view
3. select a uniform random sample T from relation
RNO and materialize it in a table called
lineitem_samp
Query processing
1. Aggregate outliers sum(sales) from
lineitem_outlier
2. Aggregate non-outliers apply the query to
sample T and extrapolate sum(sales) for
lineitem_samp x scale factor
3. Combine aggregates approximate result from
RNO exact result for RO approximate result
for R

11
Handling data skew outlier indexes selection
of outliers

Query error is solely due to error in estimating
the aggregate from non-outliers
Additional overhead of maintaining and accessing
the outlier index Ro contains at most ttuples
Paper proposes an algorithm based on a theorem to
decide the optimal size of outlier-index
An optimal outlier-index Ro(R,C,t) is defined as
a sub-relation Ro c R such that
Ro t
E(R\ Ro) min RC R ,R tE(R\R)
Defines outlier index as an optimal sub-relation
Ro that leads to the minimum possible sampling
error, subject to the constraint that Ro has at
most t tuples in R

Definition 1
12
Handling data skew outlier indexes selection
of outliers

Theorem 2
Consider a multiset R y1, y2,yN in sorted
order. Ro C R such that
Ro t
S(R\ Ro) min RC R ,R tS(R\R)
There exists some 0 t t such that Ro yi1
i t Uyi(N t 1- t) i N
States that the subset that minimizes the
standard deviation over the remaining set
consists of the leftmost t elements and the
right most t- t elements from the multiset R
when elements are arranged in sorted order.

13
Handling data skew outlier indexes selection
of outliers
Alogirithm Outlier-Index(R,C, t) Let the values
in col C be sorted. For i 1 to t1, compute
E(i) S(yi, yi1,yN- ti-1) Let i be the
value of i where E(i) is minimum. Outlier-index
is the tuples that correspond to the set of
values yj1 j t Uyj(N t 1- t) j N
where t i - 1
14
Handling data skew outlier indexes storage
allocation
Given sufficient space to store m tuples, how do
we allocate storage between samples and
outlier-index inorder to minimize error? S(t) is
the SD in non-outliers for an optimal
outlier-index of t. If we allocate space in m
such that t tuples in the outlier and m- t in the
sample. Then error will be proportional to
S(t)/v(m- t). Then the optimal allocation will be
the value of t for which S(t)/v(m- t) will be
minimum.
15
Use of outlier-indexing extension to other
aggregates

Count aggregate outlier-indexing not beneficial
as there is no variance among the data values
Avg aggregate same as sum aggregate
Extensible to class of aggregates that satisfy
certain algebraic property
Real valued function aggregate like
sum(pricequantity)
Rank order dependent aggregates like min, max
etc outlier not useful
Extension to foreign-key join
Outlier index and samples are computed over the
relation having the aggregation column and then
joined with other relations at query processing
time.

16
Handling low selectivity and small groups

Problem of low selectivity queries and small
groups in group-by queries.
Use weighted sampling instead of uniform sampling
with the help of workload information
Sample more from the subsets that are small in
size but are important having high usage
Usage of a database is typically characterized by
considerable locality in the access pattern i.e
queries against the database access certain parts
more than others. Tune the sample to a
representative workload
This technique uses pre-computed samples

17
Handling low selectivity and small groups
Exploit workload info

Workload collection collect workload consisting
of representative queries from query profilers
and query logs
Trace query patterns set of selection condition
Trace tuple usage usage of specific tuples like
frequency of access to each tuple
tuple ti has weight wi if the tuple ti is
required to answer wi of the queries in the
workload.
Weighted sampling perform sampling taking into
account weights of the tuples

18
Handling low selectivity and small groups
Exploit workload info an example

Use of weighted sample to answer aggregation
query
Let the weight of tuple ti be wi
Normalized weight be wi wi/ Swi
probability of acceptance of this tuple in the
sample pi n.wi where n is the sample size
How to answer aggregation queries approximately?
For each tuple included in the sample, store the
corresponding pi.
Each aggregate computed over this tuple gets
multiplied by the inverse of pi (multiplication
factor).
This works well if
Access pattern is local i.e. most queries access
a small part of relation
Workload is representative of actual queries

19
Implementation

Outlier-indexing
Table lineitem, Aggregation column
I_extendedprice
CREATE VIEW I-extendedprice-otl-idx AS SELECT
from lineitem
WHERE (I-extendedprice 5 54819.46) OR
(I-extendedprice 2 71442.88)
predicate is determined using the algorithm and
storage allocated for the outlier-index.
Module that automatically rewrites a, query to
use the outlier-index and the sample rather than
the fact table.
Uniform and weighted sampling.
Modify the execution tree generated by the SQL
Server optimizer
For uniform sampling accepts tuples with the
specified probability (i.e., the sampling
fraction) and stores the accepted tuples in a
table.
For weighted sampling, the probability of
accepting a tuple is proportional to the weight
associated with the tuple.
Calculated exact weights for each tuple for a
given workload and store in an additional column
in fact table (lineitem) .
Converted a SELECT query in the workload into the
corresponding UPDATE statement that increments
the weight of all the tuples satisfying the
selection predicates.
Automatically substitute the sample table for the
fact table of an incoming query.

20
Implementation Experimental Result

Goal is to compare performance of uniform
sampling, weighted sampling and weighted sampling
outlier-indexing
Platform. Microsoft SQL Server 2000. Dell
Precision 610 system with a Pentium I11 Xeon 450
Mhz processor with 128 MB RAM and an external
23GB hard drive.
Databases. TPC-R benchmark that supports data
generation from a uniform distribution.
Modified the TPC-R data generation program to
generate data with varying degree of skew using
varying Zipfian parameter (z)
The ratio of the maximum value to the minimum
value of the aggregation column varied between 76
and 106.
Workloads. generated using a random query
generation program.
The program generates queries with
foreign key joins between tables,
aggregations on the fact table (lineitem)
grouping
selection.
The sum aggregation function is used.

21
Implementation Experimental Result

Parameters
(1) skew of the data (z) over 1, 1.5, 2, 2.5,
and 3
(2) the sampling fraction (f) - from 1 to loo
(3) the storage for the outlier-index - 1, 5,
I0, and 20.
Error metric
For each query compute the relative error by
dividing the difference between the approximate
estimate for the aggregate (sum) and the accurate
value of the aggregate by the latter.
For queries with a group-by clause Consider a
query that has k groups in the answer obtained
from actually executing the query. Build a k
dimensional vector where the ith dimension
contains the relative error in the aggregate
expression for that ith group. The error is the
mean of all the points in the vector. Thus, the
average relative error over all groups is
reported. The error metric for a workload is
average error over all queries in the workload.

22
Implementation Experimental Result

Uniform sampling
weighted sampling (WSAMP)
weighted sampling outlier-indexing
(WSAMPOTLIDX).

23
Experimental Result - Varying the data skew

Vary the data skew, while keeping the sampling
fraction constant (5).
Weighted sampling does consistently better than
uniform sampling across all data skews
Weighted sampling outlier-indexing performs
significantly better than weighted sampling alone.

24
Experimental Result - Varying the sampling
fraction

varied the sampling fraction with fixed data skew
(z2).
weighted sampling gives lower error than uniform
sampling across all sampling fractions.
greatest benefit occurs at low sampling fractions
(e.g., 1).
use of outlier-indexing in addition to weighted
sampling improves accuracy significantly.

25
Experimental Result - Varying the selectivity of
queries

vary selectivity of queries between 1 and 100
with fixed data skew (z2) and the sampling
fraction (fl).
weighted sampling better than uniform sampling at
very low selectivity.
when the workload references large portions of
the data (e.g., at 100 selectivity) uniform and
weighted sampling are not significantly
different.
Weighted sampling outlier-indexing performs
well across different selectivities.
Its relative improvement over weighted sampling
is smaller at lower selectivities.

26
Conclusion and Open Problems

Explored problems encountered when using uniform
sampling as a means for approximate query
answering.
Observations
Skew in the aggregation attribute can lead to
large errors - proposed outlier-indexing to
improve the accuracy. Demonstrated that this
technique improves accuracy significantly
Problem of low selectivity of queries - Outlined
approaches based on workload information.
Combination of outlier-indexing and weighted
sampling based on workload information has proved
to be a significant step forward.
Future works
Investigating the problem of building a single
outlier-index for different aggregates and
aggregate expressions.
Tuning the selection of the outlier-index using
the workload information is another interesting
issue.
Investigating extensions of this techniques to a
wider class of join queries.

27
References

l Acharya S., Gibbons P., and Poosala V.
Congressional samples for approximate answering
of group-by queries. In Proceedings of the ACM
SIGMOD Conference, 487-498, 2000.
2 Acharya S., Gibbons P., Poosala V., and
Ramaswamy S. Join synopses for approximate query
answering. In Proceedings of the ACM SIGMOD
Conference, 1999.
3 Bamett V. and Lewis T. Outliers in
Statistical Data. John Wiley, 3rd edition, 1994.
4 Chatfield C. The Analysis of Time Series.
Chapman and Hall, 1984.
5 Chaudhuri S., Motwani R., and Narasayya V. On
random sampling over joins. In Proceedings of the
ACM SIGMOD Conference, 1998.
6 Chaudhuri S. and Narasayya V. Program for
TPC-D data generation with skew.
ftp.research.microsoft.comlpub/users/viveknar/tpcd
skew.
7 Cochran W. G. Sampling Techniques. John Wiley
Sons, New York, third edition, 1977.
8 Ganti V., Lee M. L., and Ramakrishnan R.
ICICLES selftuning samples for approximate query
answering. In Proceedings of 26th lntemational
Conference Very Large Data
Bases, 2000.
9 Gibbons P., and Matias Y. New sampling-based
summary statistics for improving approximate
query answers. In Proceedings of the ACM SIGMOD
Conference, 33 1-342, 1998.
IO Haas P. and Hellerstein J. Ripple joins for
online aggregation. In Proceedings of the ACM
SIGMOD Conference, 287- 298, 1999.
11 Hawkins D. Identification of Outliers.
Chapman and Hall, london, 1980.

28
References

I2 Hellerstein J., Haas P., and Wang H. Online
aggregation. In Proceedings of the ACM SIGMOD
Conference, 1997.
I3 Hou W., Ozsoyoglu G., and Dogdu E.
Error-constrained COUNT query evaluation in
relational databases. In Proceedings
of the ACM SIGMOD Conference, 218-281, 1991.
I4 Jagdish H., Koudas N. and Muthukrishnan S.
Mining deviants in times series database. In
Proceedings of 25th International Conference Very
Large Data Bases, 102-1 13, 1999.
15 Knorr E. and Ng R. Algorithms for mining
distance-based outliers in large datasets. In
Proceedings of 24th International Conference Very
Large Data Bases, 392-403, 1998.
I6 Ramaswamy S., Rastogi R., and Shim K.
Efficient algorithms for mining outliers from
large data sets. In Proceedings of the ACM SIGMOD
Conference, 427-438, 2000.
I7 Manku G., Rajagopalan S., and Lindsay B.
Random sampling techniques for space efficient
online computation of order statistics of large
datasets. In Proceedings of the ACM SIGMOD
Conference, 251-262, 1999.
18 Olken E Random sampling from databases -
bibliography. http//pueblo.lbl.gov/
olken/mendel/sampling/bibliography.html.
19 Seshadri P. and Swami A. Generalized partial
indexes. In Proceedings of the I Ith
International Conference on Data Engineering,
420427, 1995.
20 Zipf G. E. Human Behavior and the Principle
of Least Effort. Addison-Wesley Press, Inc, 1949.