Overcoming Limitations of Sampling for Aggregation Queries - PowerPoint PPT Presentation

About This Presentation
Title:

Overcoming Limitations of Sampling for Aggregation Queries

Description:

Overcoming Limitations of Sampling for Aggregation ... Weighted sampling based on workload information ... Unbiased estimator. Actual sum. Standard error ... – PowerPoint PPT presentation

Number of Views:50
Avg rating:3.0/5.0
Slides: 30
Provided by: Ran54
Learn more at: https://crystal.uta.edu
Category:

less

Transcript and Presenter's Notes

Title: Overcoming Limitations of Sampling for Aggregation Queries


1
Overcoming Limitations of Sampling for
Aggregation Queries
  • Surajit Chaudhuri, Microsoft Research
  • Gautam Das, Microsoft Research
  • Mayur Datar, Stanford University
  • Rajiv Motwani, Stanford University
  • Vivek Narasayya, Microsoft Research

Presented by Ranjan Dash CSE_at_UTA
2
Outline
  • Abstract
  • Introduction
  • Related Work
  • Study of Limitations
  • Handling Data skew
  • Handling low selectivity small groups
  • Implementation and experimental result
  • Conclusion and future work

3
Abstract
  • Approximate aggregation query processing using
    sampling
  • Uniform sampling performs poorly when -
  • Distribution of aggregated attribute is skewed
  • Queries with low selectivity
  • Solution proposed by the paper -
  • Outlier indexing
  • Weighted sampling based on workload information
  • A Combined approach significantly reduces the
    approximation error.

4
Introduction
  • Approximate query processing used extensively by
    OLAP and Data mining tools.
  • Use of uniform random sampling is error prone
    because of -
  • Presence of data skew presence of outlier
    values that are significantly different from the
    rest in terms of their contribution to the
    aggregate
  • Low selectivity Very few or no tuples may
    satisfy the query predicate. Extrapolating from
    such a small set of tuples led to error.
  • Two techniques to overcome these limitations of
    URS.
  • Isolate values in the data that could contribute
    heavily to the error in sampling
    Outlier-indexing
  • Exploit the workload information for queries with
    low selectivity.
  • Single table queries involving selection and
    group by queries with sum aggregate can be
    extended to other aggregation functions and
    foreign key joins.
  • Combination of outlier-indexing and weighted
    sampling based on workload info results in
    significant error reduction.

5
Related Work
  • Related research in approximately answering
    aggregation queries
  • Use of online sampling technique for data with
    little or no variability 12
  • Use of pre-computed samples or synopsis in
    answering aggregation queries 2
  • Techniques for fast incremental maintenance of
    summery statistics to provide approximate query
    answer 9
  • Proposal of a weighted sampling scheme 1
  • Use of weighted sampling to continuously tuning
    representative sample of data
  • Concept of detecting and removing deviants in
    time series data 14

6
Limitations of uniform random sampling an
example
  • Adverse impact of data skew on aggregation
    queries Relation R with 10,000 tuples
    aggregate on column C
  • 99(9900) tuples 1 9900
  • 1(100) tuples 1000 100, 000
  • 1 URS of R Sample size 100
  • result sum of sample x scale factor(100)
  • Quite likely sample would not include any tuple
    of value 1000 estimated result 100x100 10,
    000
  • 2 or more tuples of value 1000 gets included in
    sample estimated result gt 209, 800
  • Reasonable result only when we get exactly one
    tuple of value 1000 per sample This event has
    probability 0.37.
  • With probability of 0.63, we would get a large
    error in the estimate.

109, 900
Estimated result gtgt true sum 109, 900
7
Limitations of uniform random sampling skewed
data
  • Contribution of outliers to the error in
    estimating sum via uniform sampling -
  • Relation R of size N having values y1, y2,
    yN. Let U be a uniform random sample of yis
    of size n. Then
  • Unbiased estimator
  • Actual sum
  • Standard error
  • Where S is the standard deviation of the values
    in the relation defined as
  • If there are outliers in the data then S could be
    very large. For a given error bound we need to
    increase the sample size n.

Theorem 1
Ye (N/n)
yi
8
Limitations of uniform random sampling low
selectivity
  • Selection query partitions relation into 2
    sub-relation
  • Tuples that satisfy the condition relevant
    sub-relation
  • Tuples that do not satisfy the condition
  • Number of tuples sampled from relevant
    sub-relation is proportional to its size.
  • If this relevant sample size is small due to low
    selectivity of the query, it may lead to larger
    error
  • Group by queries partition the relation into many
    sub-relations
  • Success of uniform sampling depends on the size
    of the relevant sub-relation

9
Handling data skew outlier indexes
  • Large variance in aggregate column due to
    presence of outliers or deviants lead to large
    error.
  • Natural idea is to deal with it separately
    outlier indexing
  • Example selection query (Q) with sum aggregate
  • Separate the outliers into a separate
    sub-relation (RO)
  • Apply the query to this sub-relation of outliers
    and get the true result of this query.
  • Pick a uniform random sample from the non-outlier
    part of the relation (RNO) and estimate the
    approximate true result using theorem 1.
  • Combine the above two to get overall estimate of
    querys true result.

10
Handling data skew outlier indexes an example
  • Aggregate Query Q - Select sum (sales) from
    lineitem
  • Preprocessing
  • 1. Partition the relation into RO (outliers) and
    RNO (non-outliers)
  • 2. Ro is lineitem_outlier a view
  • 3. select a uniform random sample T from relation
    RNO and materialize it in a table called
    lineitem_samp
  • Query processing
  • 1. Aggregate outliers sum(sales) from
    lineitem_outlier
  • 2. Aggregate non-outliers apply the query to
    sample T and extrapolate sum(sales) for
    lineitem_samp x scale factor
  • 3. Combine aggregates approximate result from
    RNO exact result for RO approximate result
    for R

11
Handling data skew outlier indexes selection
of outliers
  • Query error is solely due to error in estimating
    the aggregate from non-outliers
  • Additional overhead of maintaining and accessing
    the outlier index Ro contains at most ttuples
  • Paper proposes an algorithm based on a theorem to
    decide the optimal size of outlier-index
  • An optimal outlier-index Ro(R,C,t) is defined as
    a sub-relation Ro c R such that
  • Ro t
  • E(R\ Ro) min RC R ,R tE(R\R)
  • Defines outlier index as an optimal sub-relation
    Ro that leads to the minimum possible sampling
    error, subject to the constraint that Ro has at
    most t tuples in R

Definition 1
12
Handling data skew outlier indexes selection
of outliers
  • Theorem 2
  • Consider a multiset R y1, y2,yN in sorted
    order. Ro C R such that
  • Ro t
  • S(R\ Ro) min RC R ,R tS(R\R)
  • There exists some 0 t t such that Ro yi1
    i t Uyi(N t 1- t) i N
  • States that the subset that minimizes the
    standard deviation over the remaining set
    consists of the leftmost t elements and the
    right most t- t elements from the multiset R
    when elements are arranged in sorted order.

13
Handling data skew outlier indexes selection
of outliers
Alogirithm Outlier-Index(R,C, t) Let the values
in col C be sorted. For i 1 to t1, compute
E(i) S(yi, yi1,yN- ti-1) Let i be the
value of i where E(i) is minimum. Outlier-index
is the tuples that correspond to the set of
values yj1 j t Uyj(N t 1- t) j N
where t i - 1
14
Handling data skew outlier indexes storage
allocation
Given sufficient space to store m tuples, how do
we allocate storage between samples and
outlier-index inorder to minimize error? S(t) is
the SD in non-outliers for an optimal
outlier-index of t. If we allocate space in m
such that t tuples in the outlier and m- t in the
sample. Then error will be proportional to
S(t)/v(m- t). Then the optimal allocation will be
the value of t for which S(t)/v(m- t) will be
minimum.
15
Use of outlier-indexing extension to other
aggregates
  • Count aggregate outlier-indexing not beneficial
    as there is no variance among the data values
  • Avg aggregate same as sum aggregate
  • Extensible to class of aggregates that satisfy
    certain algebraic property
  • Real valued function aggregate like
    sum(pricequantity)
  • Rank order dependent aggregates like min, max
    etc outlier not useful
  • Extension to foreign-key join
  • Outlier index and samples are computed over the
    relation having the aggregation column and then
    joined with other relations at query processing
    time.

16
Handling low selectivity and small groups
  • Problem of low selectivity queries and small
    groups in group-by queries.
  • Use weighted sampling instead of uniform sampling
    with the help of workload information
  • Sample more from the subsets that are small in
    size but are important having high usage
  • Usage of a database is typically characterized by
    considerable locality in the access pattern i.e
    queries against the database access certain parts
    more than others. Tune the sample to a
    representative workload
  • This technique uses pre-computed samples

17
Handling low selectivity and small groups
Exploit workload info
  • Workload collection collect workload consisting
    of representative queries from query profilers
    and query logs
  • Trace query patterns set of selection condition
  • Trace tuple usage usage of specific tuples like
    frequency of access to each tuple
  • tuple ti has weight wi if the tuple ti is
    required to answer wi of the queries in the
    workload.
  • Weighted sampling perform sampling taking into
    account weights of the tuples

18
Handling low selectivity and small groups
Exploit workload info an example
  • Use of weighted sample to answer aggregation
    query
  • Let the weight of tuple ti be wi
  • Normalized weight be wi wi/ Swi
  • probability of acceptance of this tuple in the
    sample pi n.wi where n is the sample size
  • How to answer aggregation queries approximately?
  • For each tuple included in the sample, store the
    corresponding pi.
  • Each aggregate computed over this tuple gets
    multiplied by the inverse of pi (multiplication
    factor).
  • This works well if
  • Access pattern is local i.e. most queries access
    a small part of relation
  • Workload is representative of actual queries

19
Implementation
  • Outlier-indexing
  • Table lineitem, Aggregation column
    I_extendedprice
  • CREATE VIEW I-extendedprice-otl-idx AS SELECT
    from lineitem
  • WHERE (I-extendedprice 5 54819.46) OR
    (I-extendedprice 2 71442.88)
  • predicate is determined using the algorithm and
    storage allocated for the outlier-index.
  • Module that automatically rewrites a, query to
    use the outlier-index and the sample rather than
    the fact table.
  • Uniform and weighted sampling.
  • Modify the execution tree generated by the SQL
    Server optimizer
  • For uniform sampling accepts tuples with the
    specified probability (i.e., the sampling
    fraction) and stores the accepted tuples in a
    table.
  • For weighted sampling, the probability of
    accepting a tuple is proportional to the weight
    associated with the tuple.
  • Calculated exact weights for each tuple for a
    given workload and store in an additional column
    in fact table (lineitem) .
  • Converted a SELECT query in the workload into the
    corresponding UPDATE statement that increments
    the weight of all the tuples satisfying the
    selection predicates.
  • Automatically substitute the sample table for the
    fact table of an incoming query.

20
Implementation Experimental Result
  • Goal is to compare performance of uniform
    sampling, weighted sampling and weighted sampling
    outlier-indexing
  • Platform. Microsoft SQL Server 2000. Dell
    Precision 610 system with a Pentium I11 Xeon 450
    Mhz processor with 128 MB RAM and an external
    23GB hard drive.
  • Databases. TPC-R benchmark that supports data
    generation from a uniform distribution.
  • Modified the TPC-R data generation program to
    generate data with varying degree of skew using
    varying Zipfian parameter (z)
  • The ratio of the maximum value to the minimum
    value of the aggregation column varied between 76
    and 106.
  • Workloads. generated using a random query
    generation program.
  • The program generates queries with
  • foreign key joins between tables,
  • aggregations on the fact table (lineitem)
  • grouping
  • selection.
  • The sum aggregation function is used.

21
Implementation Experimental Result
  • Parameters
  • (1) skew of the data (z) over 1, 1.5, 2, 2.5,
    and 3
  • (2) the sampling fraction (f) - from 1 to loo
  • (3) the storage for the outlier-index - 1, 5,
    I0, and 20.
  • Error metric
  • For each query compute the relative error by
    dividing the difference between the approximate
    estimate for the aggregate (sum) and the accurate
    value of the aggregate by the latter.
  • For queries with a group-by clause Consider a
    query that has k groups in the answer obtained
    from actually executing the query. Build a k
    dimensional vector where the ith dimension
    contains the relative error in the aggregate
    expression for that ith group. The error is the
    mean of all the points in the vector. Thus, the
    average relative error over all groups is
    reported. The error metric for a workload is
    average error over all queries in the workload.

22
Implementation Experimental Result
  • Uniform sampling
  • weighted sampling (WSAMP)
  • weighted sampling outlier-indexing
    (WSAMPOTLIDX).

23
Experimental Result - Varying the data skew
  • Vary the data skew, while keeping the sampling
    fraction constant (5).
  • Weighted sampling does consistently better than
    uniform sampling across all data skews
  • Weighted sampling outlier-indexing performs
    significantly better than weighted sampling alone.

24
Experimental Result - Varying the sampling
fraction
  • varied the sampling fraction with fixed data skew
    (z2).
  • weighted sampling gives lower error than uniform
    sampling across all sampling fractions.
  • greatest benefit occurs at low sampling fractions
    (e.g., 1).
  • use of outlier-indexing in addition to weighted
    sampling improves accuracy significantly.

25
Experimental Result - Varying the selectivity of
queries
  • vary selectivity of queries between 1 and 100
    with fixed data skew (z2) and the sampling
    fraction (fl).
  • weighted sampling better than uniform sampling at
    very low selectivity.
  • when the workload references large portions of
    the data (e.g., at 100 selectivity) uniform and
    weighted sampling are not significantly
    different.
  • Weighted sampling outlier-indexing performs
    well across different selectivities.
  • Its relative improvement over weighted sampling
    is smaller at lower selectivities.

26
Conclusion and Open Problems
  • Explored problems encountered when using uniform
    sampling as a means for approximate query
    answering.
  • Observations
  • Skew in the aggregation attribute can lead to
    large errors - proposed outlier-indexing to
    improve the accuracy. Demonstrated that this
    technique improves accuracy significantly
  • Problem of low selectivity of queries - Outlined
    approaches based on workload information.
    Combination of outlier-indexing and weighted
    sampling based on workload information has proved
    to be a significant step forward.
  • Future works
  • Investigating the problem of building a single
    outlier-index for different aggregates and
    aggregate expressions.
  • Tuning the selection of the outlier-index using
    the workload information is another interesting
    issue.
  • Investigating extensions of this techniques to a
    wider class of join queries.

27
References
  • l Acharya S., Gibbons P., and Poosala V.
    Congressional samples for approximate answering
    of group-by queries. In Proceedings of the ACM
    SIGMOD Conference, 487-498, 2000.
  • 2 Acharya S., Gibbons P., Poosala V., and
    Ramaswamy S. Join synopses for approximate query
    answering. In Proceedings of the ACM SIGMOD
    Conference, 1999.
  • 3 Bamett V. and Lewis T. Outliers in
    Statistical Data. John Wiley, 3rd edition, 1994.
  • 4 Chatfield C. The Analysis of Time Series.
    Chapman and Hall, 1984.
  • 5 Chaudhuri S., Motwani R., and Narasayya V. On
    random sampling over joins. In Proceedings of the
    ACM SIGMOD Conference, 1998.
  • 6 Chaudhuri S. and Narasayya V. Program for
    TPC-D data generation with skew.
  • ftp.research.microsoft.comlpub/users/viveknar/tpcd
    skew.
  • 7 Cochran W. G. Sampling Techniques. John Wiley
    Sons, New York, third edition, 1977.
  • 8 Ganti V., Lee M. L., and Ramakrishnan R.
    ICICLES selftuning samples for approximate query
    answering. In Proceedings of 26th lntemational
    Conference Very Large Data
  • Bases, 2000.
  • 9 Gibbons P., and Matias Y. New sampling-based
    summary statistics for improving approximate
    query answers. In Proceedings of the ACM SIGMOD
    Conference, 33 1-342, 1998.
  • IO Haas P. and Hellerstein J. Ripple joins for
    online aggregation. In Proceedings of the ACM
    SIGMOD Conference, 287- 298, 1999.
  • 11 Hawkins D. Identification of Outliers.
    Chapman and Hall, london, 1980.

28
References
  • I2 Hellerstein J., Haas P., and Wang H. Online
    aggregation. In Proceedings of the ACM SIGMOD
    Conference, 1997.
  • I3 Hou W., Ozsoyoglu G., and Dogdu E.
    Error-constrained COUNT query evaluation in
    relational databases. In Proceedings
  • of the ACM SIGMOD Conference, 218-281, 1991.
  • I4 Jagdish H., Koudas N. and Muthukrishnan S.
    Mining deviants in times series database. In
    Proceedings of 25th International Conference Very
    Large Data Bases, 102-1 13, 1999.
  • 15 Knorr E. and Ng R. Algorithms for mining
    distance-based outliers in large datasets. In
    Proceedings of 24th International Conference Very
    Large Data Bases, 392-403, 1998.
  • I6 Ramaswamy S., Rastogi R., and Shim K.
    Efficient algorithms for mining outliers from
    large data sets. In Proceedings of the ACM SIGMOD
    Conference, 427-438, 2000.
  • I7 Manku G., Rajagopalan S., and Lindsay B.
    Random sampling techniques for space efficient
    online computation of order statistics of large
    datasets. In Proceedings of the ACM SIGMOD
    Conference, 251-262, 1999.
  • 18 Olken E Random sampling from databases -
    bibliography. http//pueblo.lbl.gov/
    olken/mendel/sampling/bibliography.html.
  • 19 Seshadri P. and Swami A. Generalized partial
    indexes. In Proceedings of the I Ith
    International Conference on Data Engineering,
    420427, 1995.
  • 20 Zipf G. E. Human Behavior and the Principle
    of Least Effort. Addison-Wesley Press, Inc, 1949.

29
  • Questions?
Write a Comment
User Comments (0)
About PowerShow.com