Title: APPROXIMATE QUERY PROCESSING IN DATABASES
2. Introduction
[Diagram: a user submits an SQL query to a Decision Support System (DSS) and waits for an exact answer, leading to long response times.]
- In recent years, advances in data collection and management technologies have led to a proliferation of very large databases.
- Decision support applications such as Online Analytical Processing (OLAP) and data mining tools are gaining popularity.
- Executing such applications on large volumes of data can be resource intensive.
- For very large (multi-gigabyte) databases, processing even a single query involves accessing enormous amounts of data, leading to very long running times.
3. What is Approximate Query Processing (AQP)?
- Keeping query processing time small is very important in data mining and decision support systems (DSS).
- In aggregate queries (involving SUM, COUNT, AVG, etc.), precision to the last decimal is not needed.
  - e.g., what is the average sale of a product across all the states of the US?
- Exactness of the result is not very important: the result can be given as an approximate answer with some error guarantee.
  - e.g., average salary = 59,000 +/- 500 (with 95% confidence) in 10 seconds, vs. 59,152.25 in 10 minutes.
- Approximate Query Processing (AQP) techniques sacrifice accuracy to improve running time.
4. Different AQP Techniques
- All AQP techniques involve building some kind of synopsis (summary) of the database.
- AQP techniques differ in the way the synopsis is built:
  - Sampling: the most preferred and practically used technique.
    - Randomized sampling: sampling uniformly at random.
    - Deterministic sampling: selecting the best sample that minimizes the total error in the approximate answers.
  - Histograms: creating histograms that represent the database.
  - Wavelets: still a research area!
- We concentrate on the various sampling techniques.
5. Basic Sampling-Based AQP Architecture
- Pre-processing phase (offline): the sample is built.
- Run-time phase (online): queries are rewritten to run against the sample. The result is then scaled to give the approximate answer.
6. Limitations of Uniform Sampling for Aggregation Queries
- We discuss three problems that adversely affect the accuracy of sampling-based estimation:
  - the presence of skewed data in aggregate values,
  - the effect of low selectivity in selection queries, and
  - the presence of small groups in group-by queries.
- Each of these problems is discussed in detail below.
7. Effect of Data Skew
- A skewed database is characterized by the existence of certain tuples that deviate from the rest in terms of their contribution to the aggregate value.
- These tuples are called outliers. They are the values that contribute heavily to the error in uniform sampling.
- Outliers in the data result in large variance:
  - large variance => large standard deviation, and
  - large standard deviation => large error in results.
8. Effect of Data Skew
- Example: consider a relation R containing
  - two columns <ProductId, Revenue>, and
  - four records: <1, 10>, <2, 10>, <3, 10>, <4, 1000>,
  - and the aggregate query Q: SELECT SUM(Revenue) FROM R.
- We use a sample S of two records from R to answer the query. The query is answered by
  - running it against S, and
  - scaling the result by a factor of two (since we are using a 50% sample).
- Consider a sample S1 = {<1, 10>, <3, 10>}. The estimated answer for Q using S1 is 40, which is a severe underestimate of the actual answer (1030).
- Now consider another sample S2 = {<1, 10>, <4, 1000>}. The estimated answer for Q using S2 is 2020, which is a significant overestimate.
- Thus, large variance in the aggregate column can lead to large relative errors.
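The slide's arithmetic can be checked directly. This is a minimal sketch (Python chosen for illustration) of the example relation and the two scaled sample estimates:

```python
# The example relation R from the slide: ProductId -> Revenue.
# The skewed Revenue column makes scaled sample sums swing wildly.
R = {1: 10, 2: 10, 3: 10, 4: 1000}

def estimate_sum(sample_ids, scale):
    """Scale the sample's SUM(Revenue) up by the inverse sampling rate."""
    return scale * sum(R[i] for i in sample_ids)

actual = sum(R.values())            # 1030
s1 = estimate_sum([1, 3], scale=2)  # 40   -> severe underestimate
s2 = estimate_sum([1, 4], scale=2)  # 2020 -> significant overestimate
print(actual, s1, s2)
```

Whether the single outlier (revenue 1000) lands in the sample or not dominates the estimate, which is exactly the variance problem the slide describes.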
9. Handling Data Skew: Outlier Indexes
- The main idea is to deal with outliers separately, and sample from the rest of the relation.
- Partition the relation R into two sub-relations:
  - RO (outliers), and
  - RNO (non-outliers).
- The query Q can now be considered as the UNION of two sub-queries:
  - Q applied to RO, and
  - Q applied to RNO.
10. Handling Data Skew: Outlier Indexes
- Preprocessing steps:
  1. Determine outliers: specify the sub-relation RO of the relation R deemed to be the set of outliers.
  2. Sample non-outliers: select a uniform random sample T of the relation RNO.
- Query processing steps:
  1. Aggregate outliers: apply the query to the outliers in RO, accessed via the outlier index.
  2. Aggregate non-outliers: apply the query to the sample T and extrapolate to obtain an estimate of the query result for RNO.
  3. Combine aggregates: combine the approximate result for RNO with the exact result for RO to obtain an approximate result for R.
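A minimal sketch of these steps, reusing the <ProductId, Revenue> relation from the earlier example. The outlier threshold (revenues above 100) is an assumption for illustration:

```python
import random

# Outlier-index sketch: aggregate outliers exactly, sample the rest.
R = [(1, 10), (2, 10), (3, 10), (4, 1000)]   # (ProductId, Revenue)
THRESHOLD = 100                               # assumed outlier cutoff

R_O  = [t for t in R if t[1] > THRESHOLD]     # outliers, aggregated exactly
R_NO = [t for t in R if t[1] <= THRESHOLD]    # non-outliers, sampled

def approx_sum(sample_size, rng=random):
    T = rng.sample(R_NO, sample_size)                    # uniform sample of RNO
    scale = len(R_NO) / sample_size                      # inverse sampling rate
    est_non_outliers = scale * sum(rev for _, rev in T)  # extrapolated estimate
    exact_outliers   = sum(rev for _, rev in R_O)        # exact aggregate over RO
    return est_non_outliers + exact_outliers             # combined result

print(approx_sum(2))
```

Since all non-outlier revenues are equal here, every sample returns the exact answer 1030: removing the outlier from the sampled portion eliminates the variance.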
11. Outlier Indexes: Other Issues
- Since database content changes over time, the outlier indexes and samples must be refreshed appropriately: pre-computed samples can become stale with use.
- An alternative is to do the sampling completely online, i.e., make it a part of query processing.
- For the COUNT aggregate, outlier indexing is not beneficial, since there is no variance among the data values.
- Also, outlier indexing is not useful for aggregates that depend on the rank order of tuples rather than their actual values (e.g., aggregates such as MIN, MAX, and MEDIAN).
12. Effect of Low Selectivity
- Most queries involve selection conditions and/or group-bys.
- If the selectivity of a query is low (i.e., it selects few records), it adversely impacts the accuracy of sampling-based estimation.
- A selection query partitions the relation into two sub-relations:
  - tuples that satisfy the condition (the relevant sub-relation), and
  - tuples that do not.
- If we sample uniformly from the relation, the number of tuples sampled from the relevant sub-relation will be proportional to its size. If this relevant sample size is small due to the low selectivity of the query, it may lead to large error.
13. Effect of Small Groups
- The effect of small groups is the same as the effect of low selectivity.
- Group-by queries also partition the relation into numerous sub-relations (tuples that belong to specific groups).
- Thus, for uniform sampling to perform well, the relevant sub-relation should be large, which is not the case in general.
14. Using Workload Information: A Solution to Low Selectivity and Small Groups
- One approach to this problem is weighted sampling (instead of uniform sampling), exploiting workload information while drawing the sample.
- The essential idea behind the weighted sampling scheme is to sample more from subsets of the data that are small in size but important, i.e., have high usage. This results in low error.
- This approach is based on the fact that the usage of a database is typically characterized by considerable locality in the access pattern, i.e., queries against the database access certain parts of the data more than others.
15. Using Workload Information: A Solution to Low Selectivity and Small Groups
- The use of workload information involves the following steps:
  1. Workload collection: obtain a workload consisting of representative queries against the database. Modern database systems provide tools to log the queries posed against the server (e.g., the Profiler component of Microsoft SQL Server).
  2. Trace query patterns: the workload can be analyzed to obtain parsed information, e.g., the set of selection conditions that are posed.
  3. Trace tuple usage: the execution of the workload reveals additional information on the usage of specific tuples, e.g., the frequency of access to each tuple, or the number of queries in the workload for which it passes the selection condition.
  4. Weighted sampling: perform sampling by taking into account the weights of tuples (from step 3).
16. Weighted Sampling
- Assume a tuple ti has weight wi if tuple ti is required to answer wi of the queries in the workload.
- The normalized weight w'i is defined as w'i = wi / Σj wj.
- A tuple is accepted into the sample with probability pi = n * w'i, where n is the sample size.
- Thus, the probability with which each tuple is accepted into the sample varies from tuple to tuple.
- The inverse of this probability, 1/pi, is the multiplication factor associated with the tuple when answering a query: each aggregate computed over this tuple gets multiplied by this multiplication factor.
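The acceptance probabilities and multiplication factors above can be sketched as follows. The workload weights and tuple values here are made up for illustration; scaling each accepted tuple by 1/pi gives a Horvitz-Thompson-style estimator that is unbiased in expectation:

```python
import random

# Weighted sampling sketch: heavily used tuples get a higher inclusion
# probability p_i = n * w'_i; each included tuple is scaled by 1/p_i.
tuples = [("a", 50), ("b", 30), ("c", 40), ("d", 60)]  # (id, value) -- illustrative
usage  = [9, 1, 1, 1]                                  # workload hit counts w_i
n      = 2                                             # target sample size

total_w = sum(usage)
norm_w  = [w / total_w for w in usage]                 # normalized weights w'_i
p       = [min(1.0, n * w) for w in norm_w]            # inclusion probabilities p_i

def weighted_estimate_sum(rng=random):
    est = 0.0
    for (tid, value), pi in zip(tuples, p):
        if rng.random() < pi:                          # accept with probability p_i
            est += value / pi                          # scale by factor 1/p_i
    return est

random.seed(0)
print(weighted_estimate_sum())
```

Note that the frequently used tuple "a" is always included (its pi is capped at 1), so queries that touch it are answered with no sampling error at all.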
17. Problem with Group-By Queries
- Decision support queries routinely segment the data into groups.
- For example, a group-by query on the U.S. census database could be used to determine the per capita income per state. However, there can be a huge discrepancy in the sizes of different groups; e.g., the state of California has nearly 70 times the population of Wyoming.
- As a result, a uniform random sample of the relation will contain disproportionately fewer tuples from the smaller groups, which leads to poor accuracy for answers on those groups, because accuracy is highly dependent on the number of sample tuples that belong to a group.
- For a uniform sample, the standard error is inversely proportional to sqrt(n), where n is the uniform random sample size.
18. Solution: Congressional Samples
- Congressional samples are a hybrid union of uniform and biased samples.
- The strategy adopted is to divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
- Consider the US Congress, which is a hybrid of the House and the Senate:
  - The House has representatives from each state in proportion to its population.
  - The Senate has an equal number of representatives from each state.
- We now apply the House and Senate scenario to representing different groups:
  - House sample: uniform random sampling, which represents each group in proportion to its size.
  - Senate sample: sample an equal number of tuples from each group.
19. Congressional Samples
- We define a strategy S1 as follows: divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
- Congressional approach: in this approach we consider the entire set of possible group-by queries over a relation R.
- Let G be the grouping on all grouping attributes. G partitions the relation R according to the cross-product of all the grouping attributes; this is the finest possible partitioning for group-bys on R. Any group h under any other grouping T ⊆ G is the union of one or more groups g from the finest partitioning.
- Constructing Congress:
  1. Apply S1 to each grouping T ⊆ G.
  2. Let m_T be the number of non-empty groups under the grouping T.
  3. By S1, each of the non-empty groups in T should get a uniform random sample of X/m_T tuples from the group.
20. Congressional Samples
- Constructing Congress (continued):
  4. Thus, for each subgroup g (in the finest partitioning) of a group h in T, the expected space allocated to g is S(g,T) = (X / m_T) * (n_g / n_h), where n_g and n_h are the number of tuples in g and h, respectively.
  5. Then, for each group g, take the maximum over all T of S(g,T) as the sample size for g, and scale it down to limit the total space used to X. The final formula is SampleSize(g) = X * max_T S(g,T) / Σ_g' max_T S(g',T).
  6. For each group g in the finest partitioning, select a uniform random sample of size SampleSize(g).
- Thus we have a stratified, biased sample in which each group at the finest partitioning is its own stratum. Congress essentially guarantees that both large and small groups in all groupings will have a reasonable number of samples.
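The max-over-groupings allocation in steps 4-5 can be sketched on a toy relation with two grouping columns. The data, column choice, and total space X are assumptions for illustration:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Congress allocation sketch: for every grouping T, compute the expected
# space S(g,T) for each finest-partitioning group g, take the max over T,
# then rescale so the allocations sum to the available space X.
rows = ([("CA", "y1")] * 70 + [("WY", "y1")] * 1 +
        [("CA", "y2")] * 20 + [("WY", "y2")] * 9)
X = 10                                        # total sample space (tuples)
cols = (0, 1)                                 # grouping columns
finest = Counter(rows)                        # finest groups: cross-product

def allocate(X):
    best = defaultdict(float)                 # max over T of S(g,T)
    groupings = [t for k in range(len(cols) + 1) for t in combinations(cols, k)]
    for T in groupings:                       # T = () is the plain aggregate query
        n_h = Counter()                       # sizes of groups h under T
        for g, n_g in finest.items():
            n_h[tuple(g[c] for c in T)] += n_g
        m_T = len(n_h)                        # non-empty groups under T
        for g, n_g in finest.items():
            s_gT = (X / m_T) * (n_g / n_h[tuple(g[c] for c in T)])
            best[g] = max(best[g], s_gT)
    total = sum(best.values())
    return {g: X * s / total for g, s in best.items()}  # scale down to fit X

sizes = allocate(X)
print(sizes)
```

Under a plain uniform (House-only) allocation the tiny ("WY", "y1") group would get 0.1 of the 10 sample slots; the Congress allocation gives it well over a full slot.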
21. Rewriting for Biased Samples
- Query rewriting involves two key steps:
  a) scaling up the aggregate expressions, and
  b) deriving error bounds on the estimate.
- For each tuple, its scale factor (ScaleFactor) is the inverse sampling rate for its stratum.
- All the sample tuples belonging to a group have the same ScaleFactor. Thus, the key step in scaling is to efficiently associate each tuple with its corresponding ScaleFactor.
- There are two approaches to doing this:
  a) store the ScaleFactor (SF) with each tuple in the sample relation, or
  b) use a separate table to store the ScaleFactors for the groups.
- Each approach has its pros and cons.
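Approach (a), storing the ScaleFactor with each tuple, can be sketched as follows. The strata and sampling rates are made up for illustration:

```python
from collections import defaultdict

# Each sample tuple carries its stratum's ScaleFactor (inverse sampling
# rate); aggregates are scaled tuple by tuple. E.g. group "big" was sampled
# at 10% (SF = 10), group "small" at 100% (SF = 1).
sample = [
    ("big",   50, 10.0),   # (group, value, scale_factor)
    ("big",   70, 10.0),
    ("small",  5,  1.0),
    ("small",  8,  1.0),
]

def scaled_group_sums(sample):
    sums = defaultdict(float)
    for group, value, sf in sample:
        sums[group] += value * sf       # scale each tuple by its ScaleFactor
    return dict(sums)

print(scaled_group_sums(sample))        # {'big': 1200.0, 'small': 13.0}
```

Storing the SF inline avoids a join at query time, at the cost of repeating the value in every tuple of the stratum; the separate-table approach trades that the other way.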
22. Dynamic Sample Selection for AQP
- All previous AQP strategies use a single sample with a fixed bias and do not take advantage of extra disk space.
- Increasing the size of a sample stored on disk increases the running time of a query executing against that sample.
- Dynamic sampling gets around this problem by:
  1. creating a large sample containing a family of differently biased sub-samples during the pre-processing phase, and
  2. using only a portion of the sample to answer each query at runtime.
- Because there are many different sub-samples with different biases available to choose from at runtime, the chances increase that one of them will be a good fit for any particular query that is issued.
- And since only a small portion of the overall sample is used to answer a query, response time stays low.
23. Generic Dynamic Sample Selection Architecture
- The pre-processing phase consists of two steps:
  - The data distribution and query distribution (e.g., workload information) are examined to identify a set of biased samples to be created.
  - The samples are created and stored in the database, along with metadata that identifies the characteristics of each sample.
Figure 2. Pre-processing phase: division of the database into samples, with metadata describing each sample.
24. Generic Dynamic Sample Selection Architecture
- The runtime phase consists of two steps:
  - Rewrite the query to run against the sample tables.
  - The appropriate sample table(s) to use for a given query Q are determined by comparing Q with the metadata annotations for the samples.
Figure 3. Runtime phase.
- We need algorithms to decide:
  - which samples are to be built during pre-processing, and
  - which samples are to be used for query answering during the runtime phase.
- It should be possible to quickly determine which of the various samples to use to answer a given query.
25. Small Group Sampling: A Dynamic Sample Selection Technique
- Small group sampling targets the most common type of analysis query: aggregation queries with group-bys.
- Basic idea: uniform sampling performs well for the larger groups in a group-by query, but performs poorly for small groups, since it does not draw enough representatives from them.
- The small group sampling approach uses a combination of:
  - a uniform random sample (the overall sample) covering the large groups, and
  - one or more small group tables that contain only rows from small groups. The small group tables are not downscaled.
26. Small Group Sampling: Algorithm Description
- Pre-processing phase:
- The pre-processing algorithm takes two input parameters:
  - the base sampling rate r, which determines the size of the uniform random sample that is created (i.e., the overall sample), and
  - the small group fraction t, which determines the maximum size of each small group sample table.
- Let N = the number of rows in the database, and C = the set of columns in the database.
- The pre-processing algorithm can be implemented efficiently by making just two scans:
  1. The first scan identifies the frequently occurring values for each column and their approximate frequencies.
  2. In the second scan, the small group tables for each column in S are constructed, along with the overall sample.
27. Small Group Sampling: Algorithm Description
- Pre-processing phase algorithm:
- First pass:
  1. Initially, the set S is initialized to C.
  2. Count the number of occurrences of each distinct value in each column of the database and determine the common values for each column (this can be done using a separate hash table per column).
  3. If the number of distinct values for a column exceeds a threshold u, remove that column from S and stop maintaining its counts.
- L(C) is defined as the minimum set of values from column C whose frequencies sum to at least N(1 - t).
- Second pass:
  1. Determine the set of common values L(C) for each column C.
  2. Rows with values from the set L(C) will not be included in the small group table for C, but rows with all other values will be; there are at most Nt such rows.
  3. If column C has no small groups, remove it from S.
  4. After computing L(C) for every C in S, the algorithm creates a metadata table which contains a mapping from each column name to a unique index between 0 and |S| - 1.
28. Pre-Processing Phase: Algorithm
- The final step in pre-processing is to make a second scan of the database to construct the sample tables.
- Each row containing an uncommon value for one or more columns (i.e., a value not in the set L(C)) is added to the small group sample table for the appropriate columns.
- At the same time as the small group tables are being constructed, the pre-processing algorithm also creates the overall sample, using reservoir sampling to maintain a uniform random sample of rN tuples.
- Each row that is added to either a small group table or the overall sample is tagged with an extra bit-mask field (of length |S|) indicating the set of small group tables to which that row was added. This field is used during runtime query processing to avoid double-counting rows that are assigned to multiple sample tables.
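The two passes can be sketched on a toy one-table database. The parameters r and t and the data are assumptions for illustration; the distinct-value threshold u and the bit-mask tagging are omitted for brevity:

```python
import random
from collections import Counter

# Small group sampling pre-processing sketch: pass 1 finds each column's
# common values L(C); pass 2 builds the small group tables and a reservoir
# (uniform) overall sample of size rN.
rows = [("CA", "p1")] * 92 + [("CA", "p2")] * 4 + [("WY", "p1")] * 4
cols = ["state", "product"]
N, r, t = len(rows), 0.1, 0.05               # sampling rate r, small group fraction t

# Pass 1: per-column value frequencies.
freq = {c: Counter(row[i] for row in rows) for i, c in enumerate(cols)}

def common_values(c):
    """L(C): minimum set of values whose frequencies sum to >= N(1 - t)."""
    L, covered = set(), 0
    for v, f in freq[c].most_common():
        if covered >= N * (1 - t):
            break
        L.add(v)
        covered += f
    return L

L = {c: common_values(c) for c in cols}

# Pass 2: build small group tables and the overall (reservoir) sample.
small_tables = {c: [] for c in cols}
overall, target = [], round(r * N)
rng = random.Random(0)
for k, row in enumerate(rows):
    for i, c in enumerate(cols):
        if row[i] not in L[c]:               # uncommon value -> small group table
            small_tables[c].append(row)
    if len(overall) < target:                # reservoir sampling, Algorithm R style
        overall.append(row)
    elif rng.randrange(k + 1) < target:
        overall[rng.randrange(target)] = row

print(len(overall), {c: len(tbl) for c, tbl in small_tables.items()})
```

In this toy data, "WY" and "p2" rows fall outside L(state) and L(product) respectively, so they land in the small group tables in full (not downscaled), while the overall sample stays at rN = 10 rows.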
29. Small Group Sampling: Algorithm Description
- Runtime phase:
- When a query arrives at runtime, it is rewritten to run against the sample tables instead of the base fact table.
- Each query is executed against the overall sample, scaling the aggregate values by the inverse of the sampling rate r.
- In addition, for each column C in S that appears in the query's group-by list, the query is executed against that column's small group table. The aggregate values are not scaled when executing against the small group sample tables.
- Finally, the results from the various sample queries are aggregated together into a single approximate query answer. Since a row can be included in multiple sample tables, the rewritten queries include filters that avoid double-counting rows.
30. Small Group Sampling vs. Congressional Sampling
- Congressional sampling creates only a single sample, so that sample must necessarily be very general-purpose in nature and only loosely fits any particular query. Small group sampling uses the dynamic sample selection architecture, and thus benefits from more specialized samples that are each tuned for a narrower, more specific class of queries.
- The pre-processing time required by congressional sampling is proportional to the number of different combinations of grouping columns, which is exponential in the number of columns; this renders it impractical for typical data warehouses that have dozens or hundreds of potential grouping columns. The pre-processing time for small group sampling is linear in the number of columns in the database.
31. Problem with Joins
- The approximate query strategies discussed so far use some sort of sample (called a base sample) drawn with uniform random sampling.
- The use of base samples to estimate the output of a join of two or more relations, however, can produce a poor-quality approximation, for the following two reasons:
  1. Non-uniform result sample: in general, the join of two uniform random base samples is not a uniform random sample of the output of the join. In most cases, this non-uniformity significantly degrades the accuracy of the answer and the confidence bounds.
  2. Small join result sizes: the join of two random samples typically has very few tuples, even when the actual join selectivity is fairly high. This can lead to both inaccurate answers and very poor confidence bounds, since they critically depend on the query result size.
- Thus, it is in general impossible to produce good-quality approximate answers using samples on the base relations alone.
32. Solution: Join Synopses
- We need to pre-compute samples of join results to make quality answers possible.
- A naïve solution is to execute all possible join queries of interest and collect samples of their results. But this is not feasible, since it is too expensive to compute and maintain.
- A better strategy is to compute samples of the results of a small set of distinguished joins, which can be used to obtain random samples of all possible joins in the schema. The samples of these distinguished joins are called join synopses.
- Join synopses are a solution for queries with only foreign-key joins.
33. Join Synopses
- Lemma J.1: The sub-graph of G on the k nodes in any k-way foreign-key join must be a connected sub-graph with a single root node.
  - We denote the relation corresponding to the root node as the source relation for the k-way foreign-key join.
- Lemma J.2: There is a 1-1 correspondence between tuples in a relation r1 and tuples in any k-way foreign-key join with source relation r1.
- For each relation r, there is some maximum foreign-key join (i.e., one having the largest number of relations) with r as the source relation.
- For example, in the accompanying figure, the join rooted at node C is the maximum foreign-key join with source relation C.
34. Join Synopses
- Definition of join synopses: for each node u in G, corresponding to a relation r1:
  - let J(u) be the output of the maximum foreign-key join with source r1,
  - let S(u) be a uniform random sample of r1, and
  - the join synopsis JS(u) is the output of the maximum foreign-key join with S(u) substituted for r1 (i.e., S(u) joined with all its descendant relations).
- The join synopses of a schema consist of JS(u) for all u in G.
- Join synopses are also referred to as join samples.
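The definition above can be sketched for a two-relation foreign-key schema. The Orders/Customers tables are hypothetical; per Lemma J.2, sampling the source relation and following the foreign key yields exactly one join tuple per sampled source tuple, so the result is a uniform sample of the join:

```python
import random

# Join synopsis sketch: Orders -> Customers (orders.cust_id references
# customers.id). Sample the *source* relation, then join the sampled
# tuples with their unique foreign-key matches.
customers = {1: "Alice", 2: "Bob"}
orders = [(101, 1, 30), (102, 1, 50), (103, 2, 20), (104, 2, 40)]  # (id, cust_id, amount)

def join_synopsis(sample_size, rng):
    src_sample = rng.sample(orders, sample_size)  # uniform sample of the source
    # each sampled order joins with exactly one customer (Lemma J.2)
    return [(oid, cid, amount, customers[cid]) for oid, cid, amount in src_sample]

syn = join_synopsis(2, random.Random(0))
print(syn)  # a uniform random sample of the Orders JOIN Customers result
```

Sampling both relations independently and then joining would instead keep an order only when its customer also happened to be sampled, which is the "small join result" problem described on the previous slide.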
35. Join Synopses
- A single join synopsis can be used for a large number of distinct joins.
- Lemma J.3: From a single join synopsis for a node whose maximum foreign-key join has k relations, we can extract a uniform random sample of the output of between k and 2^(k-1) distinct foreign-key joins.
- In other words, a join synopsis of relation r can be used to provide a random sample of any join involving r and one or more of its descendants.
- Since Lemma J.2 fails to apply in general for any relation other than the source relation, the joining tuples in any relation r other than the source relation will not, in general, be a uniform random sample of r. Thus, a distinct join synopsis is needed for each node/relation.
- Since tuples in join synopses are the results of multi-way joins, a possible concern is that they will be too large because they have many columns. To reduce the columns stored for tuples in join synopses, we can eliminate redundant columns (for example, join columns) and store only the columns of interest. Small relations can be stored in their entirety, rather than as part of join synopses.
36. References
- Surajit Chaudhuri, Gautam Das, Mayur Datar, Rajeev Motwani, Vivek Narasayya. Overcoming Limitations of Sampling for Aggregation Queries. ICDE 2001.
- Surajit Chaudhuri, Gautam Das, Vivek Narasayya. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. SIGMOD Conference 2001.
- Brian Babcock, Surajit Chaudhuri, Gautam Das. Dynamic Sample Selection for Approximate Query Processing. SIGMOD Conference 2003.
- S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 275-286, June 1999.
- Gautam Das. Survey of Approximate Query Processing Techniques (Invited Tutorial). SSDBM 2003.
- P. B. Gibbons, S. Acharya, and V. Poosala. Aqua: Approximate Query Answering System.
- Yong Yao and Johannes Gehrke. Query Processing for Sensor Networks.