1
Robust Estimation With Sampling and Approximate
Pre-Aggregation
  • Author: Christopher Jermaine
  • Presented by: Bill Eberle

2
Overview
  • Issues and Questions
  • Approximate Pre-Aggregation (APA)
  • Maximum Likelihood Estimation
  • Performing the Estimation
  • Experiments
  • Observations

3
Issues
  • Existing techniques assume numerical or discretized attributes
  • Many data warehouse (DSS) attributes are categorical (e.g., country = 'France')
  • Cannot handle too many dimensions
  • Stratified sampling (used to handle variance in the values over which aggregation is performed):
  • Assumes the workload is known
  • Targeting particular queries means other queries suffer

4
Questions
  • Can we decrease the effect of random sampling
    variance on categorical (or mixed) data?
  • Can we still require only one pass through the
    data?
  • Can we increase the accuracy of the estimation
    for all aggregate queries?
  • Can we do this without prior knowledge of the
    query workload?

5
Approximate Pre-Aggregation
  • Uses true random sampling that is not biased towards any query or workload.
  • The sample is combined with a small set of statistics about the data, gathered at the same time the sample is drawn.

6
Approximate Pre-Aggregation (example)
  • Using a sample of 50% of the data to answer the query
  • select SUM(complaints)
  • from sample
  • where prof = 'Smith'
  • we get 36, and since the sample constitutes 50% of the database tuples, we estimate that 72 students complained about Professor Smith (as sketched below).
  • The actual number is 121, a relative error of 40.5%!
  • The issue is the variance in the number of students who complained each semester.
  • APA uses some additional information to decrease this error.
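A minimal sketch of that naive scale-up estimate (the numbers mirror the example above):

    sample_sum = 36          # SUM(complaints) over the 50% sample where prof = 'Smith'
    sampling_fraction = 0.5

    estimate = sample_sum / sampling_fraction            # 72.0
    actual = 121
    relative_error = abs(estimate - actual) / actual     # about 0.405, i.e. 40.5%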

7
Approximate Pre-Aggregation (cont.)
  • Suppose we also know that the total number of complaints in the entire table is 148, and the total number of database tuples is 16.
  • Use the selection predicate (i.e., the where clause) to divide the data space into 2^n quadrants.
  • In this example, n = 1 (prof = 'Smith'), so there will be two quadrants: prof = 'Smith' and prof <> 'Smith'.
  • For each of the quadrants, estimate the probability density function (pdf).
  • Use the additional information to create constraints on the distributions.
  • In this example, we know that the number of complaints is 148 and the number of tuples is 16, which gives us a mean of 148/16 = 9.25.
  • We then need to resolve any errors in the mean.
  • In this example, the mean of the prof = 'Smith' quadrant is 4.5 and the mean of the prof <> 'Smith' quadrant is 1.25, for a total mean of 5.75, significantly different from 9.25.
  • Use Maximum Likelihood Estimation (MLE) to re-estimate the mean.
  • In this example, the mean number of complaints for Professor Smith becomes 8.52 (instead of 4.5), which gives an estimated total of 136.3 (8.52 × 16), much closer to the actual 121 than the original estimate of 72.
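The arithmetic of this example in a small sketch (the value 8.52 is taken as given here; it is the output of the constrained MLE described on the following slides):

    total_complaints = 148
    total_tuples = 16
    known_mean = total_complaints / total_tuples       # 9.25, a constraint APA can exploit

    sample_mean_smith = 4.5     # quadrant mean for prof = 'Smith', from the sample
    sample_mean_other = 1.25    # quadrant mean for prof <> 'Smith', from the sample

    mle_mean_smith = 8.52                              # re-estimated under the constraint
    apa_estimate = mle_mean_smith * total_tuples       # about 136.3; actual is 121, naive was 72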

8
Maximum Likelihood Estimation
  • Let x be an observable outcome from a given
    experiment.
  • In our example, x might be the fact that our
    random sample predicts that the value of our
    relational aggregate SUM query is 72.
  • Let the pdf with respect to observing outcome x be f(x | Θ), where Θ denotes the hidden parameters that we wish to estimate (they describe the model we want to discover).

9
Maximum Likelihood Estimation (cont.)
  • In order to fit the hidden model parameters to the observation (or observations), we maximize the likelihood that our particular model produced the data, equivalently the log-likelihood.
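In standard MLE notation, this maximization is:

    \hat{\Theta} = \arg\max_{\Theta} \; \log f(x_1, x_2, \ldots, x_n \mid \Theta)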

10
Maximum Likelihood Estimation (in APA)
  • The basic idea of APA is simply to find the best (most likely) explanation for the sample that does not violate any of the known facts about the database.
  • We need to pose the problem of approximate aggregation over categorical data as an MLE problem.
  • Three specific components of maximum likelihood need to be described in the context of APA:
  • The experimental outcomes x1, x2, ..., xn
  • The model parameters Θ
  • The pdf f
  • Once these three components have been defined, we have transformed the problem of estimating aggregate functions over categorical data into an MLE problem, and we can begin the task of developing a method to solve it.

11
Maximum Likelihood Estimation (outcomes)
  • First, we describe how we obtain the outcomes x1, x2, ..., xn needed to pose APA as an MLE problem. In APA, those outcomes are a set of predictions made by our sample.
  • Suppose we have the following aggregate query
  • select SUM(salary)
  • from Employee
  • where sex = 'M' and dept = 'accounting' and job_type = 'supervisor'
  • We can number each of the clauses in the relational selection predicate from 1 to m. In this case:
  • b1 = (sex = 'M'), b2 = (dept = 'accounting') and b3 = (job_type = 'supervisor')
  • We also consider the negation of each of the clauses.
  • Conceptually, the result is a data cube (which we will call 2^m). Each conjunction of the boolean conditions or their negations corresponds to a single cell in this multidimensional data cube, as enumerated in the sketch below.
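A minimal Python sketch (the column names and clause functions are illustrative, not from the paper) of enumerating the 2^m cells of this cube:

    from itertools import product

    # The m = 3 clauses b1, b2, b3 from the query above, as predicates over a tuple.
    clauses = [
        lambda t: t["sex"] == "M",
        lambda t: t["dept"] == "accounting",
        lambda t: t["job_type"] == "supervisor",
    ]

    # Each cell of the 2^m cube is one True/False assignment to the m clauses.
    cells = list(product([True, False], repeat=len(clauses)))

    def in_cell(tup, cell):
        # A tuple belongs to a cell if it satisfies each clause, or its negation,
        # exactly as the cell's True/False pattern requires.
        return all(clause(tup) == keep for clause, keep in zip(clauses, cell))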

12
Maximum Likelihood Estimation (outcomes, cont.)
  • The outcomes x1, x2, ..., x_{2^m} for the MLE in APA are the results of the aggregate function in question with respect to each of the relational selection predicates in 2^m, estimated using our sample.
  • Examples
  • We know that our sample-based estimate for SUM(salary) over b1 ∧ b2 ∧ b3 is 1.5M. b1 ∧ b2 ∧ b3 is the first entry in 2^m, thus 1.5M is used as the value of x1.
  • We know that our sample-based estimate for SUM(salary) over the fourth entry in 2^m (a conjunction in which some of the clauses are negated) is 1.1M, thus x4 is 1.1M.
  • In this way, the random sample is used to estimate the value for each cell in the cube, and these values become the outcomes x1, x2, ..., x_{2^m}.
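Continuing the sketch above (cells and in_cell as defined there), the per-cell outcomes could be estimated by scaling each cell's sample sum by the inverse sampling fraction, e.g. by 2 for a 50% sample:

    def cell_outcomes(sample_rows, cells, in_cell, sampling_fraction, agg_col="salary"):
        # For each cell of the 2^m cube, sum the aggregate column over the sample
        # tuples that fall in the cell, then scale up to estimate the full-table value.
        outcomes = []
        for cell in cells:
            cell_sum = sum(row[agg_col] for row in sample_rows if in_cell(row, cell))
            outcomes.append(cell_sum / sampling_fraction)
        return outcomes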

13
Maximum Likelihood Estimation (parameters)
  • Second, we need to describe the set of model parameters that we will attempt to estimate.
  • In APA, these parameters are defined to be APA's guess as to the real value of the aggregate function in question, with respect to each of the cells in the multidimensional data cube.
  • If xi is the value of cell i predicted by the sample, then the model parameter θi is the APA maximum likelihood estimate of the correct value for the aggregate function applied to cell i.
  • Example
  • Assume we know the fact
  • (SUM(salary) where job_type <> 'supervisor') = 2.3M
  • The relational selection predicate in this fact is (b1 ∧ b2 ∧ ¬b3) ∨ (b1 ∧ ¬b2 ∧ ¬b3) ∨ (¬b1 ∧ b2 ∧ ¬b3) ∨ (¬b1 ∧ ¬b2 ∧ ¬b3). This is a disjunction of the third, fourth, seventh and eighth predicates present in 2^m.
  • Thus, this fact is equivalent to the constraint that θ3 + θ4 + θ7 + θ8 = 2.3M.
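Such a fact becomes one linear equality constraint on θ. A minimal sketch (using 0-based indices, so cells 3, 4, 7, 8 become positions 2, 3, 6, 7):

    import numpy as np

    num_cells = 8                     # 2^m cells for m = 3 clauses
    A_row = np.zeros(num_cells)
    A_row[[2, 3, 6, 7]] = 1.0         # the cells where job_type <> 'supervisor'
    b_val = 2.3e6                     # the known SUM(salary) over those cells
    # The constraint on the model parameters is: A_row @ theta == b_val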

14
Maximum Likelihood Estimation (pdf)
  • Finally, we need to define the probability density function f, which will give us the likelihood that we would see the experimental observations x1, x2, ..., x_{2^m}, given model parameters θ1, θ2, ..., θ_{2^m}.
  • Example: assume we attempt to estimate the value for an aggregate of the form
  • select AGG(expression)
  • from TABLE
  • where (predicate)
  • Assume we have a sample of size n from a database of size |db|, and the estimated value of the query based on the sample is z (Hellerstein and Haas).
  • To make a long story short, the pdf is then defined in terms of these quantities.
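The exact density is developed in the paper; as a rough sketch of its general shape, assuming each cell's sample-based estimate is approximately normal around the true value (in the spirit of the Haas and Hellerstein analysis cited above):

    f(x_1, \ldots, x_{2^m} \mid \theta_1, \ldots, \theta_{2^m})
      \approx \prod_{i=1}^{2^m} \frac{1}{\sqrt{2\pi}\,\sigma_i}
        \exp\!\left( -\frac{(x_i - \theta_i)^2}{2\sigma_i^2} \right)

Here σi² stands for the variance of the sampling estimator for cell i; the paper's actual formulation may differ in detail.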

15
Performing the Estimation
  • There are many ways to attempt to solve a maximum likelihood problem.
  • Approximation is usually necessary because of the inherent intractability of discovering the most likely model in the general case.
  • The best-known method is the Expectation Maximization (EM) algorithm.
  • EM begins with an initial guess at the solution and then repeatedly refines the guess until it reaches a locally optimal solution.
  • However, EM simply seeks to maximize the log-likelihood value, while APA needs to maximize it subject to the constraints on the model parameters.
  • Instead, APA uses a quadratic programming formulation.

16
Performing the Estimation (quadratic programming)
  • Quadratic programming is an extension of linear programming in which the objective function to maximize may contain products of two variables, not just linear terms.
  • A key advantage is that many efficient algorithms have been developed for solving problems posed as quadratic programs.
  • The ability of quadratic programming to incorporate constraints makes it ideal for APA.
  • The constraints for this quadratic programming formulation are simply linear sums of the values θ1, θ2, ..., θ_{2^m}, such as the known-fact constraint above (see the sketch below).
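A minimal Python sketch of a constrained quadratic program in this spirit. The least-squares objective here (keep θ close to the sample outcomes x) is an illustrative stand-in; in the full formulation the deviations would be weighted according to the likelihood sketched earlier:

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative sample-based outcomes x1..x8 for the 2^3 cells (in dollars).
    x = np.array([1.5e6, 0.9e6, 0.4e6, 1.1e6, 0.7e6, 0.5e6, 0.3e6, 0.2e6])

    # One known fact: theta3 + theta4 + theta7 + theta8 = 2.3M (0-based indices 2, 3, 6, 7).
    A = np.zeros((1, 8))
    A[0, [2, 3, 6, 7]] = 1.0
    b = np.array([2.3e6])

    def objective(theta):
        # Quadratic objective: stay as close as possible to the sample's predictions.
        return np.sum((theta - x) ** 2)

    constraints = [{"type": "eq", "fun": lambda theta: A @ theta - b}]
    result = minimize(objective, x0=x, constraints=constraints, method="SLSQP")
    theta_hat = result.x    # constrained estimates for the eight cells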

17
Experiments
  • Several approximation options were evaluated over eight real, high-dimensional data sets:
  • Random sampling
  • Stratified sampling
  • APA0: store and use all 0-dimensional facts as constraints in the quadratic programming
  • APA1: same as APA0, plus all 1-dimensional facts
  • APA2: same as APA1, plus all 2-dimensional facts
  • APA3: same as APA2, plus all 3-dimensional facts
  • Wavelets
  • Results shown are for AVG aggregation

18
Observations
  • Wavelets are unsuitable in this domain.
  • Random sampling is better for COUNT queries.
  • The additional accuracy of APA2 and APA3 is probably not worth the overhead; they are also impractical for numerical data, since that would require joint distributions of numerical and categorical attributes (difficult).
  • APA0 and APA1 can easily be extended to handle numerical attributes.
  • It should be possible to extend APA to work across foreign key joins, using a technique for sampling from the results of joins (e.g., join synopses).
  • The paper largely sidesteps issues associated with computational efficiency.