Title: Robust Estimation With Sampling and Approximate PreAggregation
1Robust Estimation With Sampling and Approximate
Pre-Aggregation
- Author Christopher Jermaine
- Presented by Bill Eberle
2Overview
- Issues and Questions
- Approximate Pre-Aggregation (APA)
- Maximum Likelihood Estimation
- Performing the Estimation
- Experiments
- Observations
3Issues
- Assumes numerical or discretized attributes
- Many data warehouse (for DSS) attributes are
categorical (ex. country France) - Can not handle too many dimensions
- Stratified sampling (used to handle variance in
values over which aggregation is performed) - Assumes workload is known
- Targeting particular queries means other queries
suffer
4Questions
- Can we decrease the effect of random sampling
variance on categorical (or mixed) data? - Can we still require only one pass through the
data? - Can we increase the accuracy of the estimation
for all aggregate queries? - Can we do this without prior knowledge of the
query workload?
5Approximate Pre-Aggregation
- Uses a true random sampling that is not biased
towards any query or workload. - Sample is combined with small set of statistics
about the data that is gathered at same time that
sampling is performed.
6Approximate Pre-Aggregation (example)
- Using a sample of 50 of the data (unshaded) to
answer the query - select SUM(cmplaints)
- from sample
- where prof Smith
- we get 36, and since the sample constitutes 50
of the database tuples, we - estimate that 72 students complained about
Professor Smith. - The actual number is 121 a relative error of
40.5! - The issue is in the variance of students who
complained each semester. - APA uses some additional information to decrease
this error.
7Approximate Pre-Aggregation (cont.)
- Suppose we also know that the total number of
complaints in the entire table is 148, and the
total number of database tuples is 16. - Use the selection predicate (i.e. the where
clause), and divide the data space into 2n
quadrants. - In this example, n1 (profSmith), so there
will be two quadrants profSmith and
profltgtSmith. - For each of the quadrants, estimate the
probability density function (pdf). - Use additional information to create constraints
on the distributions. - In this example, we know that the number of
complaints is 148 and the number of tuples is 16,
which gives us a mean of 148/16 9.25 - Then need to resolve any errors in the mean.
- In this example, the mean of the profSmith
quadrant is 4.5 and the mean of the
profltgtSmith quadrant is 1.25, for a total
mean of 5.75 significantly different from 9.25. - Use the Maximum Likelihood Estimation (MLE) to
re-estimate the mean. - In this example, the mean of complaints for
Professor Smith now becomes 8.52 (instead of
4.5), which gives us an estimated total of 136.3
(8.52 16) much closer to the actual 121 than
72.
8Maximum Likelihood Estimation
- Let x be an observable outcome from a given
experiment. - In our example, x might be the fact that our
random sample predicts that the value of our
relational aggregate SUM query is 72. - Let the pdf with respect to the observing outcome
x be the following function, where the parameters
are the hidden parameters that we wish to
estimate (they describe the model we want to
discover)
9Maximum Likelihood Estimation (cont.)
- In order to fit the hidden model parameters to
the observation (or observations) we maximize the
likelihood that our particular model produced the
data (loglikelihood)
10Maximum Likelihood Estimation (in APA)
- Basic idea of APA is simply to find the best (or
most likely) explanation for the sample which
does not violate any of the known facts about the
database. - Need to pose problem of approximate aggregation
over categorical data as a MLE problem. - Three specific components of maximum likelihood
that we need to describe in the context of APA - The experimental outcomes x1,x2,,xn
- The model parameters
- The pdf
- Once we these three components have been defined,
we have transformed the problem of estimation of
aggregate functions over categorical data into a
MLE problem, and we can begin the task of
developing a method to solve the problem.
11Maximum Likelihood Estimation (outcomes)
- First, we describe how we obtain the outcomes
x1,x2,,xn to postulate APA as a MLE problem. In
APA, those outcomes are a set of predictions made
by our sample. - Suppose we have the following aggregate query
- select SUM(salary)
- from Employee
- where sexM and deptaccounting and
- job_typesupervisor
- We can number each of the clauses in the
relational selection predicate from 1 to m. In
this case - b1(sexmale), b2(deptaccounting) and
b3(job_typesupervisor) - Also the negation of each of the clauses
- Conceptually, the result is a data cube (which we
will call 2m). Each of the boolean conditions
corresponds to a single cell in the
multidimensional data cube.
12Maximum Likelihood Estimation (outcomes, cont.)
- The outcomes x1,x2,,x2m for the MLE in APA are
the results of the aggregate function in question
with respect to each of the relational selection
predicates in 2m, estimated using our sample. - Examples
- We know that our sample-based estimate for
SUM(salary) over b1 b2 b3 is 1.5M. b1 b2
b3 is the first entry in 2m. This, 1.5M is used
as the value of x1. - We know that our sample-based estimate for
SUM(salary) over b1 b2 b3 is 1.1M. This
is the fourth entry in 2m, thus x4 is 1.1M. - In this way, the random sample is used to
estimate the value for each cell in the cube, and
these values become the outcomes x1,x2,,x2m.
13Maximum Likelihood Estimation (parameters)
- Second, we need to describe the set of model
parameters which we will attempt to estimate. - In APA, these parameters are defined to be the
APA guess as to the real value of the aggregate
function in question, with respect to each of the
cells in the multidimensional data cube. - If xi is the value of cell i predicted by the
sample, then the model parameter 0i is the APA
maximum likelihood estimate for the correct value
for the aggregate function applied to cell i. - Example
- Assume we know the fact
- (SUM(salary) where job_type!supervisor)
2.3M - The relational selection predicate in this fact
is (b1b2b3) v (b1b2b3) v (b1b2b3) v
(b1b2b3). This is a disjunction of the
third, fourth, seventh and eight predicates
present in 2m. - Thus, this fact is equivalent to the constraint
that 03040708 2.3M.
14Maximum Likelihood Estimation (pdf)
- Finally, we need to define the probability
density function f, which will give us the
likelihood that we would see the experimental
observations x1,x2,,x2m, given model parameters
01,02,,02m. - Example Assume we attempt to estimate the value
for an aggregate of the form - select AGG(expression)
- from TABLE
- where (predicate)
- Assume we have a sample size of n from a
database of size db, and the estimated value of
the query based on the sample is z (Hellerstein
and Haas) -
- To make a long story short, the pdf is defined as
follows -
15Performing the Estimation
- Many ways to attempt to estimate the solution to
a maximum likelihood problem. - Usually necessary to approximate or estimate
because of the inherent intractability of
discovering the most likely model in the general
case. - Best known is the Expectation Maximization
algorithm (EM). - Begins with an initial guess at the solution and
then repeatedly refines the guess until it
reaches a locally optimal solution - EM simply seeks to maximize the loglikelihood
value, while APA needs to maximize within the
constraints of the model parameters. - Instead, APA uses quadratic programming
formulation.
16Performing the Estimation (quadratic programming)
- Quadratic programming is an extension of linear
programming with the generalization that the
objective function to maximize may contain
products of two variables, and not simply
scalars. - A key advantage of this is that many algorithms
have been developed that efficiently solve
problems posed as quadratic equations. - The ability of quadratic programming to
incorporate constraints makes it ideal for APA. - The constraints for this quadratic programming
formulation are simply the linear sums of the
values 01,02,,02B.
17Experiments
- Six different approximate options, over eight
real, high-dimensional data sets, were performed - Random sampling
- Stratified sampling
- APA0 store and use all 0-dimensional facts as
constraints in the quadratic programming. - APA1 same as APA0, as well as 1-dimensional
facts. - APA2 same as APA1, as well as 2-dimensional
facts. - APA3 same as APA2, as well as 3-dimensional
facts. - Wavelets
- Results from AVG aggregation
18Observations
- Wavelets are unsuitable in this domain.
- Random sampling better for COUNT queries.
- Additional accuracy of APA2 and APA3 probably not
worth the overhead also are impractical for
numerical data as that would require joint
distributions of numerical and categorical
attributes (difficult). - APA0 and APA1 can easily be extended to handle
numerical attributes. - Should be possible to easily extend APA to work
across foreign key joins, using a technique for
sampling from the results of joins (ex. join
synopses). - Largely sidestepped issues associated with
computational efficiency.