Title: Overcoming Limitations of Sampling for Aggregation Queries
1Overcoming Limitations of Sampling for
Aggregation Queries
- Surajit Chaudhuri, Microsoft Research
- Gautam Das, Microsoft Research
- Mayur Datar, Stanford University
- Rajiv Motwani, Stanford University
- Vivek Narasayya, Microsoft Research
Presented by Ranjan Dash CSE_at_UTA
2Outline
- Abstract
- Introduction
- Related Work
- Study of Limitations
- Handling Data skew
- Handling low selectivity small groups
- Implementation and experimental result
- Conclusion and future work
3Abstract
- Approximate aggregation query processing using
sampling - Uniform sampling performs poorly when -
- Distribution of aggregated attribute is skewed
- Queries with low selectivity
- Solution proposed by the paper -
- Outlier indexing
- Weighted sampling based on workload information
- A Combined approach significantly reduces the
approximation error.
4Introduction
- Approximate query processing used extensively by
OLAP and Data mining tools. - Use of uniform random sampling is error prone
because of - - Presence of data skew presence of outlier
values that are significantly different from the
rest in terms of their contribution to the
aggregate - Low selectivity Very few or no tuples may
satisfy the query predicate. Extrapolating from
such a small set of tuples led to error. - Two techniques to overcome these limitations of
URS. - Isolate values in the data that could contribute
heavily to the error in sampling
Outlier-indexing - Exploit the workload information for queries with
low selectivity. - Single table queries involving selection and
group by queries with sum aggregate can be
extended to other aggregation functions and
foreign key joins. - Combination of outlier-indexing and weighted
sampling based on workload info results in
significant error reduction.
5Related Work
- Related research in approximately answering
aggregation queries - Use of online sampling technique for data with
little or no variability 12 - Use of pre-computed samples or synopsis in
answering aggregation queries 2 - Techniques for fast incremental maintenance of
summery statistics to provide approximate query
answer 9 - Proposal of a weighted sampling scheme 1
- Use of weighted sampling to continuously tuning
representative sample of data - Concept of detecting and removing deviants in
time series data 14
6Limitations of uniform random sampling an
example
- Adverse impact of data skew on aggregation
queries Relation R with 10,000 tuples
aggregate on column C - 99(9900) tuples 1 9900
- 1(100) tuples 1000 100, 000
- 1 URS of R Sample size 100
- result sum of sample x scale factor(100)
- Quite likely sample would not include any tuple
of value 1000 estimated result 100x100 10,
000 - 2 or more tuples of value 1000 gets included in
sample estimated result gt 209, 800 - Reasonable result only when we get exactly one
tuple of value 1000 per sample This event has
probability 0.37. - With probability of 0.63, we would get a large
error in the estimate.
109, 900
Estimated result gtgt true sum 109, 900
7Limitations of uniform random sampling skewed
data
- Contribution of outliers to the error in
estimating sum via uniform sampling - - Relation R of size N having values y1, y2,
yN. Let U be a uniform random sample of yis
of size n. Then - Unbiased estimator
- Actual sum
- Standard error
- Where S is the standard deviation of the values
in the relation defined as - If there are outliers in the data then S could be
very large. For a given error bound we need to
increase the sample size n.
Theorem 1
Ye (N/n)
yi
8Limitations of uniform random sampling low
selectivity
- Selection query partitions relation into 2
sub-relation - Tuples that satisfy the condition relevant
sub-relation - Tuples that do not satisfy the condition
- Number of tuples sampled from relevant
sub-relation is proportional to its size. - If this relevant sample size is small due to low
selectivity of the query, it may lead to larger
error - Group by queries partition the relation into many
sub-relations - Success of uniform sampling depends on the size
of the relevant sub-relation
9Handling data skew outlier indexes
- Large variance in aggregate column due to
presence of outliers or deviants lead to large
error. - Natural idea is to deal with it separately
outlier indexing - Example selection query (Q) with sum aggregate
- Separate the outliers into a separate
sub-relation (RO) - Apply the query to this sub-relation of outliers
and get the true result of this query. - Pick a uniform random sample from the non-outlier
part of the relation (RNO) and estimate the
approximate true result using theorem 1. - Combine the above two to get overall estimate of
querys true result.
10Handling data skew outlier indexes an example
- Aggregate Query Q - Select sum (sales) from
lineitem - Preprocessing
- 1. Partition the relation into RO (outliers) and
RNO (non-outliers) - 2. Ro is lineitem_outlier a view
- 3. select a uniform random sample T from relation
RNO and materialize it in a table called
lineitem_samp - Query processing
- 1. Aggregate outliers sum(sales) from
lineitem_outlier - 2. Aggregate non-outliers apply the query to
sample T and extrapolate sum(sales) for
lineitem_samp x scale factor - 3. Combine aggregates approximate result from
RNO exact result for RO approximate result
for R
11Handling data skew outlier indexes selection
of outliers
- Query error is solely due to error in estimating
the aggregate from non-outliers - Additional overhead of maintaining and accessing
the outlier index Ro contains at most ttuples - Paper proposes an algorithm based on a theorem to
decide the optimal size of outlier-index - An optimal outlier-index Ro(R,C,t) is defined as
a sub-relation Ro c R such that - Ro t
- E(R\ Ro) min RC R ,R tE(R\R)
- Defines outlier index as an optimal sub-relation
Ro that leads to the minimum possible sampling
error, subject to the constraint that Ro has at
most t tuples in R
Definition 1
12Handling data skew outlier indexes selection
of outliers
- Theorem 2
- Consider a multiset R y1, y2,yN in sorted
order. Ro C R such that - Ro t
- S(R\ Ro) min RC R ,R tS(R\R)
- There exists some 0 t t such that Ro yi1
i t Uyi(N t 1- t) i N - States that the subset that minimizes the
standard deviation over the remaining set
consists of the leftmost t elements and the
right most t- t elements from the multiset R
when elements are arranged in sorted order.
13Handling data skew outlier indexes selection
of outliers
Alogirithm Outlier-Index(R,C, t) Let the values
in col C be sorted. For i 1 to t1, compute
E(i) S(yi, yi1,yN- ti-1) Let i be the
value of i where E(i) is minimum. Outlier-index
is the tuples that correspond to the set of
values yj1 j t Uyj(N t 1- t) j N
where t i - 1
14Handling data skew outlier indexes storage
allocation
Given sufficient space to store m tuples, how do
we allocate storage between samples and
outlier-index inorder to minimize error? S(t) is
the SD in non-outliers for an optimal
outlier-index of t. If we allocate space in m
such that t tuples in the outlier and m- t in the
sample. Then error will be proportional to
S(t)/v(m- t). Then the optimal allocation will be
the value of t for which S(t)/v(m- t) will be
minimum.
15Use of outlier-indexing extension to other
aggregates
- Count aggregate outlier-indexing not beneficial
as there is no variance among the data values - Avg aggregate same as sum aggregate
- Extensible to class of aggregates that satisfy
certain algebraic property - Real valued function aggregate like
sum(pricequantity) - Rank order dependent aggregates like min, max
etc outlier not useful - Extension to foreign-key join
- Outlier index and samples are computed over the
relation having the aggregation column and then
joined with other relations at query processing
time.
16Handling low selectivity and small groups
- Problem of low selectivity queries and small
groups in group-by queries. - Use weighted sampling instead of uniform sampling
with the help of workload information - Sample more from the subsets that are small in
size but are important having high usage - Usage of a database is typically characterized by
considerable locality in the access pattern i.e
queries against the database access certain parts
more than others. Tune the sample to a
representative workload - This technique uses pre-computed samples
17Handling low selectivity and small groups
Exploit workload info
- Workload collection collect workload consisting
of representative queries from query profilers
and query logs - Trace query patterns set of selection condition
- Trace tuple usage usage of specific tuples like
frequency of access to each tuple - tuple ti has weight wi if the tuple ti is
required to answer wi of the queries in the
workload. - Weighted sampling perform sampling taking into
account weights of the tuples
18Handling low selectivity and small groups
Exploit workload info an example
- Use of weighted sample to answer aggregation
query - Let the weight of tuple ti be wi
- Normalized weight be wi wi/ Swi
- probability of acceptance of this tuple in the
sample pi n.wi where n is the sample size - How to answer aggregation queries approximately?
- For each tuple included in the sample, store the
corresponding pi. - Each aggregate computed over this tuple gets
multiplied by the inverse of pi (multiplication
factor). - This works well if
- Access pattern is local i.e. most queries access
a small part of relation - Workload is representative of actual queries
19Implementation
- Outlier-indexing
- Table lineitem, Aggregation column
I_extendedprice - CREATE VIEW I-extendedprice-otl-idx AS SELECT
from lineitem - WHERE (I-extendedprice 5 54819.46) OR
(I-extendedprice 2 71442.88) - predicate is determined using the algorithm and
storage allocated for the outlier-index. - Module that automatically rewrites a, query to
use the outlier-index and the sample rather than
the fact table. - Uniform and weighted sampling.
- Modify the execution tree generated by the SQL
Server optimizer - For uniform sampling accepts tuples with the
specified probability (i.e., the sampling
fraction) and stores the accepted tuples in a
table. - For weighted sampling, the probability of
accepting a tuple is proportional to the weight
associated with the tuple. - Calculated exact weights for each tuple for a
given workload and store in an additional column
in fact table (lineitem) . - Converted a SELECT query in the workload into the
corresponding UPDATE statement that increments
the weight of all the tuples satisfying the
selection predicates. - Automatically substitute the sample table for the
fact table of an incoming query.
20Implementation Experimental Result
- Goal is to compare performance of uniform
sampling, weighted sampling and weighted sampling
outlier-indexing - Platform. Microsoft SQL Server 2000. Dell
Precision 610 system with a Pentium I11 Xeon 450
Mhz processor with 128 MB RAM and an external
23GB hard drive. - Databases. TPC-R benchmark that supports data
generation from a uniform distribution. - Modified the TPC-R data generation program to
generate data with varying degree of skew using
varying Zipfian parameter (z) - The ratio of the maximum value to the minimum
value of the aggregation column varied between 76
and 106. - Workloads. generated using a random query
generation program. - The program generates queries with
- foreign key joins between tables,
- aggregations on the fact table (lineitem)
- grouping
- selection.
- The sum aggregation function is used.
21Implementation Experimental Result
- Parameters
- (1) skew of the data (z) over 1, 1.5, 2, 2.5,
and 3 - (2) the sampling fraction (f) - from 1 to loo
- (3) the storage for the outlier-index - 1, 5,
I0, and 20. - Error metric
- For each query compute the relative error by
dividing the difference between the approximate
estimate for the aggregate (sum) and the accurate
value of the aggregate by the latter. - For queries with a group-by clause Consider a
query that has k groups in the answer obtained
from actually executing the query. Build a k
dimensional vector where the ith dimension
contains the relative error in the aggregate
expression for that ith group. The error is the
mean of all the points in the vector. Thus, the
average relative error over all groups is
reported. The error metric for a workload is
average error over all queries in the workload.
22Implementation Experimental Result
- Uniform sampling
- weighted sampling (WSAMP)
- weighted sampling outlier-indexing
(WSAMPOTLIDX).
23Experimental Result - Varying the data skew
- Vary the data skew, while keeping the sampling
fraction constant (5). - Weighted sampling does consistently better than
uniform sampling across all data skews - Weighted sampling outlier-indexing performs
significantly better than weighted sampling alone.
24Experimental Result - Varying the sampling
fraction
- varied the sampling fraction with fixed data skew
(z2). - weighted sampling gives lower error than uniform
sampling across all sampling fractions. - greatest benefit occurs at low sampling fractions
(e.g., 1). - use of outlier-indexing in addition to weighted
sampling improves accuracy significantly.
25Experimental Result - Varying the selectivity of
queries
- vary selectivity of queries between 1 and 100
with fixed data skew (z2) and the sampling
fraction (fl). - weighted sampling better than uniform sampling at
very low selectivity. - when the workload references large portions of
the data (e.g., at 100 selectivity) uniform and
weighted sampling are not significantly
different. - Weighted sampling outlier-indexing performs
well across different selectivities. - Its relative improvement over weighted sampling
is smaller at lower selectivities.
26Conclusion and Open Problems
- Explored problems encountered when using uniform
sampling as a means for approximate query
answering. - Observations
- Skew in the aggregation attribute can lead to
large errors - proposed outlier-indexing to
improve the accuracy. Demonstrated that this
technique improves accuracy significantly - Problem of low selectivity of queries - Outlined
approaches based on workload information.
Combination of outlier-indexing and weighted
sampling based on workload information has proved
to be a significant step forward. - Future works
- Investigating the problem of building a single
outlier-index for different aggregates and
aggregate expressions. - Tuning the selection of the outlier-index using
the workload information is another interesting
issue. - Investigating extensions of this techniques to a
wider class of join queries.
27References
- l Acharya S., Gibbons P., and Poosala V.
Congressional samples for approximate answering
of group-by queries. In Proceedings of the ACM
SIGMOD Conference, 487-498, 2000. - 2 Acharya S., Gibbons P., Poosala V., and
Ramaswamy S. Join synopses for approximate query
answering. In Proceedings of the ACM SIGMOD
Conference, 1999. - 3 Bamett V. and Lewis T. Outliers in
Statistical Data. John Wiley, 3rd edition, 1994. - 4 Chatfield C. The Analysis of Time Series.
Chapman and Hall, 1984. - 5 Chaudhuri S., Motwani R., and Narasayya V. On
random sampling over joins. In Proceedings of the
ACM SIGMOD Conference, 1998. - 6 Chaudhuri S. and Narasayya V. Program for
TPC-D data generation with skew. - ftp.research.microsoft.comlpub/users/viveknar/tpcd
skew. - 7 Cochran W. G. Sampling Techniques. John Wiley
Sons, New York, third edition, 1977. - 8 Ganti V., Lee M. L., and Ramakrishnan R.
ICICLES selftuning samples for approximate query
answering. In Proceedings of 26th lntemational
Conference Very Large Data - Bases, 2000.
- 9 Gibbons P., and Matias Y. New sampling-based
summary statistics for improving approximate
query answers. In Proceedings of the ACM SIGMOD
Conference, 33 1-342, 1998. - IO Haas P. and Hellerstein J. Ripple joins for
online aggregation. In Proceedings of the ACM
SIGMOD Conference, 287- 298, 1999. - 11 Hawkins D. Identification of Outliers.
Chapman and Hall, london, 1980.
28References
- I2 Hellerstein J., Haas P., and Wang H. Online
aggregation. In Proceedings of the ACM SIGMOD
Conference, 1997. - I3 Hou W., Ozsoyoglu G., and Dogdu E.
Error-constrained COUNT query evaluation in
relational databases. In Proceedings - of the ACM SIGMOD Conference, 218-281, 1991.
- I4 Jagdish H., Koudas N. and Muthukrishnan S.
Mining deviants in times series database. In
Proceedings of 25th International Conference Very
Large Data Bases, 102-1 13, 1999. - 15 Knorr E. and Ng R. Algorithms for mining
distance-based outliers in large datasets. In
Proceedings of 24th International Conference Very
Large Data Bases, 392-403, 1998. - I6 Ramaswamy S., Rastogi R., and Shim K.
Efficient algorithms for mining outliers from
large data sets. In Proceedings of the ACM SIGMOD
Conference, 427-438, 2000. - I7 Manku G., Rajagopalan S., and Lindsay B.
Random sampling techniques for space efficient
online computation of order statistics of large
datasets. In Proceedings of the ACM SIGMOD
Conference, 251-262, 1999. - 18 Olken E Random sampling from databases -
bibliography. http//pueblo.lbl.gov/
olken/mendel/sampling/bibliography.html. - 19 Seshadri P. and Swami A. Generalized partial
indexes. In Proceedings of the I Ith
International Conference on Data Engineering,
420427, 1995. - 20 Zipf G. E. Human Behavior and the Principle
of Least Effort. Addison-Wesley Press, Inc, 1949.
29