Reading Report 6 Yin Chen 5 Mar 2004 - PowerPoint PPT Presentation

1 / 12
About This Presentation
Title:

Reading Report 6 Yin Chen 5 Mar 2004

Description:

For a given relation R and workload W, consider partitioning the records in R ... i.e. Consider a relation R (with aggregate column C) containing nine records ... – PowerPoint PPT presentation

Number of Views:15
Avg rating:3.0/5.0
Slides: 13
Provided by: yin
Category:
Tags: chen | mar | reading | relation | report | yin

less

Transcript and Presenter's Notes

Title: Reading Report 6 Yin Chen 5 Mar 2004


1
Reading Report 6Yin Chen 5 Mar 2004
  • Reference
  • A Robust, Optimization-Based Approach for
    Approximate Answering of Aggregate Queries,
    Surajit Chaudhuri, Gautam das, Vivek Narasayya,
    Microsoft Research, One Microsoft Way.
  • http//portal.acm.org/citation.cfm?id375694dlAC
    Mcollportal

2
Problem Over View
  • Decision support applications such as On Line
    Analytical Processing (OLAP) and data mining for
    analyzing LARGE databases have become popular.
  • But expensive and resource intensive
  • This work uses precomputed samples of the data
    instead of the complete data to answer the
    queries, to give approximate answers efficiently.
  • 3 drawbacks of previous works
  • Lack of formulation thus difficult to evaluate
    theoretically
  • Do NOT deal with uncertainty
  • Ignore the variance in the data distribution of
    the aggregated column(s).

3
Related work
  • Some works based on randomized techniques, assume
    a fixed workload, and do NOT cope with
    uncertainty.
  • Each record is tagged with a frequency.
  • An expected number of k records are selected in
    the sample, where the probability of selecting a
    record t with frequency ft is k(ft/Sufu))
  • Records that are accessed more frequently have a
    greater chance of being included inside the
    sample.
  • BUT has poor quality. i.e. Consider a set of
    queries, let a few queries reference large
    partitions and most queries reference very small
    partitions. By the weighted sampling scheme most
    record will come from the large partitions. Thus,
    with high probability, there will be no records
    selected from many of the small partitions,
    causing large error.
  • An IMPROVMENT collected outliers of the data
    (i.e. the records that contribute to high
    variance) into a separate index, while the
    remaining data is sampled using a weighted
    sampling technique. Queries are answered by
    running them against both the outlier index and
    the weighted sample, and an estimated answer is
    composed out of both results.
  • Some other works use on-the-fly sampling, but can
    be expensive.

4
Architecture
  • Workload
  • A workload W is specified as a set of pairs of
    queries and their corresponding weights i.e., W
    ltQ1,w1gt, ltQq,wqgt
  • Weight wi indicates the importance of query Qi in
    the workload.
  • Without loss of generality, assume the weights
    are normalized, i.e., Siwi 1
  • Architecture
  • Inputs a database and a workload W
  • 2 components
  • An offline component for selecting a sample
  • An online component that
  • Rewrites an incoming query to use the sample to
    answer the query approximately.
  • Reports the answer with an estimate of the error
    in the answer.
  • ScaleFactor
  • Each record in the sample contains an additional
    column, ScaleFactor.
  • The value of the aggregated column of each record
    in the sample is first scaled up by multiplying
    with the ScaleFactor, and then aggregated.

5
Architecture (Cont.)
  • Error metrics used to determine the quality of
    an approximate answer to an aggregation query.
  • Suppose the correct answer for a query Q is y
    while the approximate answer is y
  • Relative error E(Q) y - y / y
  • Squared error SE(Q) (y - y / y)²
  • Suppose the correct answer for the ith group is
    yi while the approximate answer is yi
  • Squared error in answering a GROUP BY query Q
    SE(Q) (1/g) Si ((yi yi)/ yi)²
  • Given a probability distribution of queries pw
  • Mean squared error for the distribution MSE(pw)
    SQ pw (Q)SE(Q), ( pw (Q) is the probability of
    query Q)
  • Root mean squared error RMSE(pw) square root
    of MSE(pw)
  • Other error metrics
  • L1 metrics the mean error over all queries in
    the workload
  • L8 metrics the maximum error over all queries

6
The Special Case of a Fixed Workload
  • Overview
  • Here provide a solution for the special case of a
    fixed workload, i.e., when the incoming queries
    are identical to the given workload.
  • Use an effective deterministic scheme rater than
    the conventional randomization scheme.
  • Problem Formulation
  • Problem FIXEDSAMP
  • Input R, W, k
  • Output A sample of k records (with
    appropriate additional columns) such that MSE(W)
    is minimized
  • MSE(W) MSE(Pw) (Mean squared error ), where a
    query Q has a probability of occurrence of 1 is Q
    ?W and 0 otherwise.
  • Fundamental Regions
  • For a given relation R and workload W, consider
    partitioning the records in R into a minimum
    number of regions F R1, R2, , Rr such that
    for any region Rj, each query in W selects either
    all records in Rj or none. These regions are the
    fundamental regions of R induced by W.
  • i.e. Consider a relation R (with aggregate column
    C) containing nine records (with C values 10, 20,
    , 90). Let W consist of two queries, Q1 (which
    selects records with C values between 10 and 50)
    and Q2 (which selects records with C values
    between 40 and 70). These two queries induce a
    partition of R into four fundamental regions, R1,
    R1, R1, R4.
  • In general the total number of fundamental
    regions r depends
  • on R and W and is upper-bounded by min(2W,
    n), where n
  • is the number of records in R.

7
The Special Case of a Fixed Workload (Cont.)
  • Solutions for FIXEDSAMP
  • A deterministic algorithm called FIXED
  • Step1 (Identify Fundamental Regions) Let r be
    the number of fundamental regions.
  • Case A r k (the selected sample can answer
    queries WITHOUT any errors)
  • Step 2A (Pick Sample Records) Pick exactly one
    record from each fundamental region.
  • Step 3A (Assign Values to Additional Columns)
    The idea is that each sample record can be used
    to summarize all records from the corresponding
    fundamental region, WITHOUT incurring any error.
  • For a workload consisting of ONLY COUNT queries,
    store the count of the number of records in that
    fundamental region in a SINGLE additional column
    in the sample records (called RegionCount).
  • For a workload consisting of ONLY SUM queries,
    store the sum of the values in the aggregate
    column for records in that fundamental region in
    the SINGLE additional column in the sample
    (called AggSum).
  • For a workload contains a MIX of COUNT and SUM
    queries, need BOTH the RegionCount and AggSum
    column, which can also answer AVG queries.
  • Case B r gt k (select sample that try to
    minimize the errors in queries)
  • Step 2B (Pick Sample Records)
  • Sort all r regions by their importance, select
    the TOP k, and then pick up one record from each
    of the selected regions.
  • The importance of region Rj is defined as fj
    nj², where fj is the sum of the weights of all
    queries in W that select the region, nj is the
    number of records in the region. fj measures the
    weights of queries that are affected by Rj while
    nj² measures the effect on the (squared) error by
    not including Rj.
  • Step 3B (Assign Values to Additional Columns)
  • Note that the extra column values of a sample are
    NOT required to characterize the corresponding
    fundamental region all we care is that they
    contain appropriate values so that the error for
    the workload is minimized.
  • To assign values to the RegionCount and AggSum
    columns of the k selected sample records, express
    MSE(W) as a quadratic function of 2k unknowns
    RC1,,RCk and AS1,, ASk, and partially
    differentiating with each variable and setting
    each result to zero. This gives rise to 2k
    simultaneous (sparse) linear equations, which can
    be solved by using an iterative technique.

8
The Non-Special Case with a Lifted Workload
  • Overview
  • Here provide a solution for the non-special case
    which incoming query is similar but NOT
    identical to queries.
  • The problem was focus on the SINGLE-TABLE
    selection queries with aggregation containing
    either the SUM or COUNT aggregate. A workload W
    consists of exactly ONE query Q on relation R.
  • Problem Formulation
  • Problem SAMP
  • Input R, Pw (a probability distribution
    function specified by W), and k
  • Output A sample of k records, (with the
    appropriate additional column(s)) such that the
    MSE(Pw) is minimized.
  • lifted workload
  • For a given W, define a lifted workload pw, i.e.,
    a probability distribution of incoming queries.
    Intuitively, for any query Q (not necessarily in
    W), pw(Q) should be related to the amount of
    similarity (dissimilarity) of Q to the workload
    high if Q is similar to queries in the workload,
    and low otherwise.
  • We say that two queries Q and Q are similar if
    the records selected by Q and Q have significant
    overlap.
  • The objective is to define the distribution PQ
  • Since for the purposed of lifting, only concern
    the set of records selected by a query and NOT
    the query itself. Thus instead of mapping queries
    to probabilities, PQ maps subsets of R to
    probabilities.

9
The Non-Special Case with a Lifted Workload
  • lifted workload (Cont.)
  • Assume two parameters d (½ d 1) and ? (0 ?
    ½). These parameters define the degree to which
    the workload influences the query distribution.
    For any given record inside (resp. outside) RQ,
    the parameter d (resp. ?) represents the
    probability that an incoming query will select
    this record.
  • For all R?R, PQ(R) is the probability of
    occurrence of any query that selects exactly the
    set of records R.
  • RQ is the records selected by Q.
  • n1, n2, n3, and n4 are the counts of records in
    the regions.
  • When n2 or n4 are large (i.e., the overlap is
    large), PQ(R) is high (i.e. queries that
    select RQ are likely to occur).
  • When n1 or n3 are large (i.e., the overlap is
    small), PQ(R) is low (i.e. queries that select
    RQ are unlikely to occur).
  • Setting the parameters d and ?
  • d ? 1 and ? ? 0 implies that incoming
    queries are identical to workload queries
  • d ? 1 and ? ? ½ implies that incoming queries
    are supersets to workload queries
  • d ? ½ and ? ? 0 implies that incoming queries
    are subsets to workload queries
  • d ? ½ and ? ? ½ implies that incoming queries
    are unrestricted

10
The Non-Special Case with a Lifted Workload
  • Stratified sampling
  • Stratified sampling is a well-known
    generalization of uniform sampling where a
    population is partitioned into multiple strata
    and samples are selected uniformly from each
    stratum, with important strata contributing
    relatively more samples.
  • Define population of a query Q (denoted by POPQ)
    on a relation R as a set of size R that
    contains the value of the aggregated column that
    is selected by Q, or 0 if the record is not
    selected.
  • A stratified sampling scheme partitions R into r
    strata containing n1,,nr records (where Snj
    n), with k1,,kr records uniformly sampled from
    each stratum (where Skj k).
  • The scheme also associates a ScaleFactor with
    each record in the sample. Queries are answered
    by execution them on the sample instead of R. For
    a COUNT query, the ScaleFactor entries of the
    selected records are summed, while for a SUM(y)
    query the expression yScaleFactor is summed. If
    also wish to return an error guarantee with each
    query, then instead of ScaleFactor, we have to
    keep track of each nj and kj individually for
    each stratum.
  • Solution for STRAT
  • 3 steps
  • Step one stratification step, determine
  • (a) How many strata r to partition relation R
    into, and
  • (b) The records from R that belong to each
    stratum.
  • At the end of step one, we have r strata
    R1,,Rr containing n1,,nr records such that Snj
    n.
  • Step two allocation step, determine how to
    divide k (the number of records available for the
    sample) into integers k1,,kr across the r strata
    such that Skj k.
  • Step three sampling step, uniformly samples kj
    records from stratum Rj to form the final sample
    of k records.

11
The Non-Special Case with a Lifted Workload
  • Solution for STRAT (Cont.)
  • Solution for COUNT Aggregate
  • Stratification Step
  • Lemma 1 Consider a relation R with n
    records and a workload W of COUNT queries. In the
    limit when n tends to infinity, the fundamental
    regions F R1, R2, , Rr represent an optimal
    stratification.
  • Allocation Step
  • (1) Express MSE(pw) as a weighted sum of the MSE
    of each query in the workload
  • Lemma 2 MSE(pw) Si wi MSE(pQ)
  • (2) For any Q?W, express MSE(pQ) as a function
    of the kjs
  • Lemma 3 For a COUNT query Q in W, let
    ApproxMSE(pQ)
  • Then
  • Since have an (approximate) formula for
    MSE(pQ), we can exppress MSE(pw) as a function
    of the variables kj
  • Corollary 1 MSE(pw) Sj(aj/ kj), where
    each aj is a function of n1,,nr, d, and ?
  • aj captures the importance of a region it
    is positively correlated with nj as well as the
    frequency of queries in the workload that access
    Rj
  • (3) Minimize MSE(pw)
  • Lemma 4 Sj(aj/ kj) is minimized subject to
    Sjkj k if kj k(sqrt(aj) / Sisqrt(ai))
  • Lemma 4 provides a closed-form and
    computationally inexpensive solution to the
    allocation problem since aj depends only on d, ?
    and the number of tuples in each fundamental
    region.

12
The Non-Special Case with a Lifted Workload
  • Solution for STRAT (Cont.)
  • Solution for SUM Aggregate
  • Stratification Step
  • Since each stratum may have LARGE internal
    variance in the values of the aggregate column,
    CANNOT use the same stratification as in the
    COUNT case, i.e., strata fundamental regions.
  • Divide fundamental regions with large variance
    into a set of finer regions, each of which has
    significantly lower internal variance, and treat
    these finer regions as the strata. Within a new
    stratum the aggregate column values of records
    are CLOSE to one another.
  • Borrow from statistics literature an
    approximation of the optimal Neymann Allocation
    technique for minimizing variance, use it to
    divide each fundamental region into h finer
    regions, thus generating a total of hr, which
    become the strata. (h was set to 6).
  • Allocation Step
  • (1) Similar to COUNT, it is expressed as an
    optimization problem with hr unknowns k1,,
    khr.
  • (2) Different from COUNT, the specific values of
    the aggregate column in each region influence
    MSE(pQ). Let yj (Yj) be the AVERAG (sum) of the
    aggregate column values of all records in region
    Rj. Since the variance within each region is
    SMALL (due to stratification), we can assume that
    each value within the region can be approximated
    as yj. Thus to express MSE(pQ) as a function of
    the kjs for a SUM query Q in W
  • As with COUNT, MSE(pW) for SUM is
    functionally of the form Sj(aj/ kj), and aj
    depends on the same parameters n1, nhr , d, and
    ? (see Corollary 1),
  • (3)The same for the minimization step as in Lemma
    4.
Write a Comment
User Comments (0)
About PowerShow.com