Title: APPROXIMATE QUERY PROCESSING IN DATABASES
2. Introduction
[Diagram: a user submits an SQL query to a Decision Support System (DSS) and waits for an exact answer, leading to long response times.]
- In recent years, advances in data collection and management technologies have led to a proliferation of very large databases.
- Decision support applications such as Online Analytical Processing (OLAP) and data mining tools are gaining popularity.
- Executing such applications on large volumes of data can be resource intensive.
- For very large (multi-gigabyte) databases, processing even a single query involves accessing enormous amounts of data, leading to very long running times.
3. What is Approximate Query Processing (AQP)?
- Keeping query processing time small is very important in data mining and decision support systems (DSS).
- In aggregate queries (involving SUM, COUNT, AVG, etc.), precision to the last decimal is not needed.
  - e.g., what is the average sale of a product across all the states of the US?
- Exactness of the result is not very important: the result can be given as an approximate answer with some error guarantee.
  - e.g., average salary = 59,000 +/- 500 (with 95% confidence) in 10 seconds, vs. 59,152.25 in 10 minutes.
- Approximate Query Processing (AQP) techniques sacrifice accuracy to improve running time.
4. Different AQP Techniques
- All AQP techniques involve building some kind of synopsis (summary) of the database.
- AQP techniques differ in the way the synopsis is built:
  - Sampling: the most preferred and practically used technique.
    - Randomized sampling: sampling uniformly at random.
    - Deterministic sampling: selecting the best sample that minimizes the total error in the approximate answers.
  - Histograms: creating histograms that represent the database.
  - Wavelets: still a research area!
- We concentrate on the various sampling techniques.
5. Basic Sampling-Based AQP Architecture
- Pre-processing phase (offline): the sample is built.
- Run-time phase (online): queries are rewritten to run against the sample. The result is then scaled to give the approximate answer.
6. Limitations of Uniform Sampling for Aggregation Queries
- We discuss three problems that adversely affect the accuracy of sampling-based estimation:
  - the presence of skewed data in aggregate values,
  - the effect of low selectivity in selection queries, and
  - the presence of small groups in group-by queries.
- Each of these problems is discussed in detail below.
7. Effect of Data Skew
- A skewed database is characterized by the existence of certain tuples that deviate from the rest in terms of their contribution to the aggregate value.
- These tuples are called outliers. They are the values that contribute heavily to the error in uniform sampling.
- Outliers in the data result in large variance:
  - large variance => large standard deviation, and
  - large standard deviation => large error in results.
8. Effect of Data Skew
- Example: consider a relation R containing
  - two columns <ProductId, Revenue>, and
  - four records: <1, 10>, <2, 10>, <3, 10>, <4, 1000>,
  - and the aggregate query Q: SELECT SUM(Revenue) FROM R.
- We use a sample S of two records from R to answer the query. The query is answered by
  - running it against S, and
  - scaling the result by a factor of two (since we are using a 50% sample).
- Consider a sample S1 = {<1, 10>, <3, 10>}. The estimated answer for Q using S1 is 40, which is a severe underestimate of the actual answer (1030).
- Now consider another sample S2 = {<1, 10>, <4, 1000>}. The estimated answer for Q using S2 is 2020, which is a significant overestimate.
- Thus, large variance in the aggregate column can lead to large relative errors.
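The slide's arithmetic can be checked directly. This is a minimal sketch (Python chosen for illustration) of the example relation and the two scaled sample estimates:

```python
# The example relation R from the slide: ProductId -> Revenue.
# The skewed Revenue column makes scaled sample sums swing wildly.
R = {1: 10, 2: 10, 3: 10, 4: 1000}

def estimate_sum(sample_ids, scale):
    """Scale the sample's SUM(Revenue) up by the inverse sampling rate."""
    return scale * sum(R[i] for i in sample_ids)

actual = sum(R.values())            # 1030
s1 = estimate_sum([1, 3], scale=2)  # 40   -> severe underestimate
s2 = estimate_sum([1, 4], scale=2)  # 2020 -> significant overestimate
print(actual, s1, s2)
```

Whether the single outlier (revenue 1000) lands in the sample or not dominates the estimate, which is exactly the variance problem the slide describes.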
9. Handling Data Skew: Outlier Indexes
- The main idea is to deal with outliers separately, and sample from the rest of the relation.
- Partition the relation R into two sub-relations:
  - RO (outliers), and
  - RNO (non-outliers).
- The query Q can now be considered as the UNION of two sub-queries:
  - Q applied to RO, and
  - Q applied to RNO.
10. Handling Data Skew: Outlier Indexes
- Preprocessing steps:
  1. Determine outliers: specify the sub-relation RO of the relation R deemed to be the set of outliers.
  2. Sample non-outliers: select a uniform random sample T of the relation RNO.
- Query processing steps:
  1. Aggregate outliers: apply the query to the outliers in RO, accessed via the outlier index.
  2. Aggregate non-outliers: apply the query to the sample T and extrapolate to obtain an estimate of the query result for RNO.
  3. Combine aggregates: combine the approximate result for RNO with the exact result for RO to obtain an approximate result for R.
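A minimal sketch of these steps, reusing the <ProductId, Revenue> relation from the earlier example. The outlier threshold (revenues above 100) is an assumption for illustration:

```python
import random

# Outlier-index sketch: aggregate outliers exactly, sample the rest.
R = [(1, 10), (2, 10), (3, 10), (4, 1000)]   # (ProductId, Revenue)
THRESHOLD = 100                               # assumed outlier cutoff

R_O  = [t for t in R if t[1] > THRESHOLD]     # outliers, aggregated exactly
R_NO = [t for t in R if t[1] <= THRESHOLD]    # non-outliers, sampled

def approx_sum(sample_size, rng=random):
    T = rng.sample(R_NO, sample_size)                    # uniform sample of RNO
    scale = len(R_NO) / sample_size                      # inverse sampling rate
    est_non_outliers = scale * sum(rev for _, rev in T)  # extrapolated estimate
    exact_outliers   = sum(rev for _, rev in R_O)        # exact aggregate over RO
    return est_non_outliers + exact_outliers             # combined result

print(approx_sum(2))
```

Since all non-outlier revenues are equal here, every sample returns the exact answer 1030: removing the outlier from the sampled portion eliminates the variance.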
11. Outlier Indexes: Other Issues
- Since database content changes over time, the outlier indexes and samples must be refreshed appropriately: pre-computed samples can become stale with use.
- An alternative is to do the sampling completely online, i.e., make it a part of query processing.
- For the COUNT aggregate, outlier indexing is not beneficial, since there is no variance among the data values.
- Also, outlier indexing is not useful for aggregates that depend on the rank order of tuples rather than their actual values (e.g., aggregates such as MIN, MAX, and MEDIAN).
12. Effect of Low Selectivity
- Most queries involve selection conditions and/or group-bys.
- If the selectivity of a query is low (i.e., it selects few records), it adversely impacts the accuracy of sampling-based estimation.
- A selection query partitions the relation into two sub-relations:
  - tuples that satisfy the condition (the relevant sub-relation), and
  - tuples that do not.
- If we sample uniformly from the relation, the number of tuples sampled from the relevant sub-relation will be proportional to its size. If this relevant sample size is small due to the low selectivity of the query, it may lead to large error.
13. Effect of Small Groups
- The effect of small groups is the same as the effect of low selectivity.
- Group-by queries also partition the relation into numerous sub-relations (tuples that belong to specific groups).
- Thus, for uniform sampling to perform well, the relevant sub-relation should be large, which is not the case in general.
14. Using Workload Information: A Solution to Low Selectivity and Small Groups
- One approach to this problem is weighted sampling (instead of uniform sampling), exploiting workload information while drawing the sample.
- The essential idea behind the weighted sampling scheme is to sample more from subsets of the data that are small in size but important, i.e., have high usage. This results in low error.
- This approach is based on the fact that the usage of a database is typically characterized by considerable locality in the access pattern, i.e., queries against the database access certain parts of the data more than others.
15. Using Workload Information: A Solution to Low Selectivity and Small Groups
- The use of workload information involves the following steps:
  1. Workload collection: obtain a workload consisting of representative queries against the database. Modern database systems provide tools to log the queries posed against the server (e.g., the Profiler component of Microsoft SQL Server).
  2. Trace query patterns: the workload can be analyzed to obtain parsed information, e.g., the set of selection conditions that are posed.
  3. Trace tuple usage: the execution of the workload reveals additional information on the usage of specific tuples, e.g., the frequency of access to each tuple, or the number of queries in the workload for which it passes the selection condition.
  4. Weighted sampling: perform sampling by taking into account the weights of tuples (from step 3).
16. Weighted Sampling
- Assume a tuple ti has weight wi if tuple ti is required to answer wi of the queries in the workload.
- The normalized weight w'i is defined as w'i = wi / Σj wj.
- A tuple is accepted into the sample with probability pi = n * w'i, where n is the sample size.
- Thus, the probability with which each tuple is accepted into the sample varies from tuple to tuple.
- The inverse of this probability, 1/pi, is the multiplication factor associated with the tuple when answering a query: each aggregate computed over this tuple gets multiplied by this multiplication factor.
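The acceptance probabilities and multiplication factors above can be sketched as follows. The workload weights and tuple values here are made up for illustration; scaling each accepted tuple by 1/pi gives a Horvitz-Thompson-style estimator that is unbiased in expectation:

```python
import random

# Weighted sampling sketch: heavily used tuples get a higher inclusion
# probability p_i = n * w'_i; each included tuple is scaled by 1/p_i.
tuples = [("a", 50), ("b", 30), ("c", 40), ("d", 60)]  # (id, value) -- illustrative
usage  = [9, 1, 1, 1]                                  # workload hit counts w_i
n      = 2                                             # target sample size

total_w = sum(usage)
norm_w  = [w / total_w for w in usage]                 # normalized weights w'_i
p       = [min(1.0, n * w) for w in norm_w]            # inclusion probabilities p_i

def weighted_estimate_sum(rng=random):
    est = 0.0
    for (tid, value), pi in zip(tuples, p):
        if rng.random() < pi:                          # accept with probability p_i
            est += value / pi                          # scale by factor 1/p_i
    return est

random.seed(0)
print(weighted_estimate_sum())
```

Note that the frequently used tuple "a" is always included (its pi is capped at 1), so queries that touch it are answered with no sampling error at all.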
17. Problem with Group-By Queries
- Decision support queries routinely segment the data into groups.
- For example, a group-by query on the U.S. census database could be used to determine the per capita income per state. However, there can be a huge discrepancy in the sizes of different groups; e.g., the state of California has nearly 70 times the population of Wyoming.
- As a result, a uniform random sample of the relation will contain disproportionately fewer tuples from the smaller groups, which leads to poor accuracy for answers on those groups, because accuracy is highly dependent on the number of sample tuples that belong to a group.
- For a uniform sample, the standard error is inversely proportional to sqrt(n), where n is the uniform random sample size.
18. Solution: Congressional Samples
- Congressional samples are a hybrid union of uniform and biased samples.
- The strategy adopted is to divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
- Consider the US Congress, which is a hybrid of the House and the Senate:
  - The House has representatives from each state in proportion to its population.
  - The Senate has an equal number of representatives from each state.
- We now apply the House and Senate scenario to representing different groups:
  - House sample: uniform random sampling, which represents each group in proportion to its size.
  - Senate sample: sample an equal number of tuples from each group.
19. Congressional Samples
- We define a strategy S1 as follows: divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
- Congressional approach: in this approach we consider the entire set of possible group-by queries over a relation R.
- Let G be the grouping on all grouping attributes. G partitions the relation R according to the cross-product of all the grouping attributes; this is the finest possible partitioning for group-bys on R. Any group h under any other grouping T ⊆ G is the union of one or more groups g from the finest partitioning.
- Constructing Congress:
  1. Apply S1 to each grouping T ⊆ G.
  2. Let m_T be the number of non-empty groups under the grouping T.
  3. By S1, each of the non-empty groups in T should get a uniform random sample of X/m_T tuples from the group.
20. Congressional Samples
- Constructing Congress (continued):
  4. Thus, for each subgroup g (in the finest partitioning) of a group h in T, the expected space allocated to g is S(g,T) = (X / m_T) * (n_g / n_h), where n_g and n_h are the number of tuples in g and h, respectively.
  5. Then, for each group g, take the maximum over all T of S(g,T) as the sample size for g, and scale it down to limit the total space used to X. The final formula is SampleSize(g) = X * max_T S(g,T) / Σ_g' max_T S(g',T).
  6. For each group g in the finest partitioning, select a uniform random sample of size SampleSize(g).
- Thus we have a stratified, biased sample in which each group at the finest partitioning is its own stratum. Congress essentially guarantees that both large and small groups in all groupings will have a reasonable number of samples.
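The max-over-groupings allocation in steps 4-5 can be sketched on a toy relation with two grouping columns. The data, column choice, and total space X are assumptions for illustration:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Congress allocation sketch: for every grouping T, compute the expected
# space S(g,T) for each finest-partitioning group g, take the max over T,
# then rescale so the allocations sum to the available space X.
rows = ([("CA", "y1")] * 70 + [("WY", "y1")] * 1 +
        [("CA", "y2")] * 20 + [("WY", "y2")] * 9)
X = 10                                        # total sample space (tuples)
cols = (0, 1)                                 # grouping columns
finest = Counter(rows)                        # finest groups: cross-product

def allocate(X):
    best = defaultdict(float)                 # max over T of S(g,T)
    groupings = [t for k in range(len(cols) + 1) for t in combinations(cols, k)]
    for T in groupings:                       # T = () is the plain aggregate query
        n_h = Counter()                       # sizes of groups h under T
        for g, n_g in finest.items():
            n_h[tuple(g[c] for c in T)] += n_g
        m_T = len(n_h)                        # non-empty groups under T
        for g, n_g in finest.items():
            s_gT = (X / m_T) * (n_g / n_h[tuple(g[c] for c in T)])
            best[g] = max(best[g], s_gT)
    total = sum(best.values())
    return {g: X * s / total for g, s in best.items()}  # scale down to fit X

sizes = allocate(X)
print(sizes)
```

Under a plain uniform (House-only) allocation the tiny ("WY", "y1") group would get 0.1 of the 10 sample slots; the Congress allocation gives it well over a full slot.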
21. Rewriting for Biased Samples
- Query rewriting involves two key steps:
  a) scaling up the aggregate expressions, and
  b) deriving error bounds on the estimate.
- For each tuple, its scale factor (ScaleFactor) is the inverse sampling rate for its stratum.
- All the sample tuples belonging to a group have the same ScaleFactor. Thus, the key step in scaling is to efficiently associate each tuple with its corresponding ScaleFactor.
- There are two approaches to doing this:
  a) store the ScaleFactor (SF) with each tuple in the sample relation, or
  b) use a separate table to store the ScaleFactors for the groups.
- Each approach has its pros and cons.
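Approach (a), storing the ScaleFactor with each tuple, can be sketched as follows. The strata and sampling rates are made up for illustration:

```python
from collections import defaultdict

# Each sample tuple carries its stratum's ScaleFactor (inverse sampling
# rate); aggregates are scaled tuple by tuple. E.g. group "big" was sampled
# at 10% (SF = 10), group "small" at 100% (SF = 1).
sample = [
    ("big",   50, 10.0),   # (group, value, scale_factor)
    ("big",   70, 10.0),
    ("small",  5,  1.0),
    ("small",  8,  1.0),
]

def scaled_group_sums(sample):
    sums = defaultdict(float)
    for group, value, sf in sample:
        sums[group] += value * sf       # scale each tuple by its ScaleFactor
    return dict(sums)

print(scaled_group_sums(sample))        # {'big': 1200.0, 'small': 13.0}
```

Storing the SF inline avoids a join at query time, at the cost of repeating the value in every tuple of the stratum; the separate-table approach trades that the other way.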
22. Dynamic Sample Selection for AQP
- All previous AQP strategies use a single sample with a fixed bias and do not take advantage of extra disk space.
- Increasing the size of a sample stored on disk increases the running time of a query executing against that sample.
- Dynamic sampling gets around this problem by:
  1. creating a large sample containing a family of differently biased sub-samples during the pre-processing phase, and
  2. using only a portion of the sample to answer each query at runtime.
- Because there are many different sub-samples with different biases available to choose from at runtime, the chances increase that one of them will be a good fit for any particular query that is issued.
- And since only a small portion of the overall sample is used to answer a query, response time stays low.
23. Generic Dynamic Sample Selection Architecture
- The pre-processing phase consists of two steps:
  - The data distribution and query distribution (e.g., workload information) are examined to identify a set of biased samples to be created.
  - The samples are created and stored in the database, along with metadata that identifies the characteristics of each sample.
Figure 2. Pre-processing phase: division of the database into samples, with metadata describing each sample.
24. Generic Dynamic Sample Selection Architecture
- The runtime phase consists of two steps:
  - Rewrite the query to run against the sample tables.
  - The appropriate sample table(s) to use for a given query Q are determined by comparing Q with the metadata annotations for the samples.
Figure 3. Runtime phase.
- We need algorithms to decide:
  - which samples are to be built during pre-processing, and
  - which samples are to be used for query answering during the runtime phase.
- It should be possible to quickly determine which of the various samples to use to answer a given query.
25. Small Group Sampling: A Dynamic Sample Selection Technique
- Small group sampling targets the most common type of analysis query: aggregation queries with group-bys.
- Basic idea: uniform sampling performs well for the larger groups in a group-by query, but performs poorly for small groups, since it does not draw enough representatives from them.
- The small group sampling approach uses a combination of:
  - a uniform random sample (the overall sample) covering the large groups, and
  - one or more small group tables that contain only rows from small groups. The small group tables are not downscaled.
26. Small Group Sampling: Algorithm Description
- Pre-processing phase:
- The pre-processing algorithm takes two input parameters:
  - the base sampling rate r, which determines the size of the uniform random sample that is created (i.e., the overall sample), and
  - the small group fraction t, which determines the maximum size of each small group sample table.
- Let N = the number of rows in the database, and C = the set of columns in the database.
- The pre-processing algorithm can be implemented efficiently by making just two scans:
  1. The first scan identifies the frequently occurring values for each column and their approximate frequencies.
  2. In the second scan, the small group tables for each column in S are constructed, along with the overall sample.
27. Small Group Sampling: Algorithm Description
- Pre-processing phase algorithm:
- First pass:
  1. Initially, the set S is initialized to C.
  2. Count the number of occurrences of each distinct value in each column of the database and determine the common values for each column (this can be done using a separate hash table per column).
  3. If the number of distinct values for a column exceeds a threshold u, remove that column from S and stop maintaining its counts.
- L(C) is defined as the minimum set of values from column C whose frequencies sum to at least N(1 - t).
- Second pass:
  1. Determine the set of common values L(C) for each column C.
  2. Rows with values from the set L(C) will not be included in the small group table for C, but rows with all other values will be; there are at most Nt such rows.
  3. If column C has no small groups, remove it from S.
  4. After computing L(C) for every C in S, the algorithm creates a metadata table which contains a mapping from each column name to a unique index between 0 and |S| - 1.
28. Pre-Processing Phase: Algorithm
- The final step in pre-processing is to make a second scan of the database to construct the sample tables.
- Each row containing an uncommon value for one or more columns (i.e., a value not in the set L(C)) is added to the small group sample table for the appropriate columns.
- At the same time as the small group tables are being constructed, the pre-processing algorithm also creates the overall sample, using reservoir sampling to maintain a uniform random sample of rN tuples.
- Each row that is added to either a small group table or the overall sample is tagged with an extra bit-mask field (of length |S|) indicating the set of small group tables to which that row was added. This field is used during runtime query processing to avoid double-counting rows that are assigned to multiple sample tables.
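The two passes can be sketched on a toy one-table database. The parameters r and t and the data are assumptions for illustration; the distinct-value threshold u and the bit-mask tagging are omitted for brevity:

```python
import random
from collections import Counter

# Small group sampling pre-processing sketch: pass 1 finds each column's
# common values L(C); pass 2 builds the small group tables and a reservoir
# (uniform) overall sample of size rN.
rows = [("CA", "p1")] * 92 + [("CA", "p2")] * 4 + [("WY", "p1")] * 4
cols = ["state", "product"]
N, r, t = len(rows), 0.1, 0.05               # sampling rate r, small group fraction t

# Pass 1: per-column value frequencies.
freq = {c: Counter(row[i] for row in rows) for i, c in enumerate(cols)}

def common_values(c):
    """L(C): minimum set of values whose frequencies sum to >= N(1 - t)."""
    L, covered = set(), 0
    for v, f in freq[c].most_common():
        if covered >= N * (1 - t):
            break
        L.add(v)
        covered += f
    return L

L = {c: common_values(c) for c in cols}

# Pass 2: build small group tables and the overall (reservoir) sample.
small_tables = {c: [] for c in cols}
overall, target = [], round(r * N)
rng = random.Random(0)
for k, row in enumerate(rows):
    for i, c in enumerate(cols):
        if row[i] not in L[c]:               # uncommon value -> small group table
            small_tables[c].append(row)
    if len(overall) < target:                # reservoir sampling, Algorithm R style
        overall.append(row)
    elif rng.randrange(k + 1) < target:
        overall[rng.randrange(target)] = row

print(len(overall), {c: len(tbl) for c, tbl in small_tables.items()})
```

In this toy data, "WY" and "p2" rows fall outside L(state) and L(product) respectively, so they land in the small group tables in full (not downscaled), while the overall sample stays at rN = 10 rows.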
29. Small Group Sampling: Algorithm Description
- Runtime phase:
- When a query arrives at runtime, it is rewritten to run against the sample tables instead of the base fact table.
- Each query is executed against the overall sample, scaling the aggregate values by the inverse of the sampling rate r.
- In addition, for each column C in S that appears in the query's group-by list, the query is executed against that column's small group table. The aggregate values are not scaled when executing against the small group sample tables.
- Finally, the results from the various sample queries are aggregated together into a single approximate query answer. Since a row can be included in multiple sample tables, the rewritten queries include filters that avoid double-counting rows.
30. Small Group Sampling vs. Congressional Sampling
- Congressional sampling creates only a single sample, so that sample must necessarily be very general-purpose in nature and only loosely fits any particular query. Small group sampling uses the dynamic sample selection architecture, and thus benefits from more specialized samples that are each tuned for a narrower, more specific class of queries.
- The pre-processing time required by congressional sampling is proportional to the number of different combinations of grouping columns, which is exponential in the number of columns; this renders it impractical for typical data warehouses that have dozens or hundreds of potential grouping columns. The pre-processing time for small group sampling is linear in the number of columns in the database.
31. Problem with Joins
- The approximate query strategies discussed so far use some sort of sample (called a base sample) drawn with uniform random sampling.
- The use of base samples to estimate the output of a join of two or more relations, however, can produce a poor-quality approximation, for the following two reasons:
  1. Non-uniform result sample: in general, the join of two uniform random base samples is not a uniform random sample of the output of the join. In most cases, this non-uniformity significantly degrades the accuracy of the answer and the confidence bounds.
  2. Small join result sizes: the join of two random samples typically has very few tuples, even when the actual join selectivity is fairly high. This can lead to both inaccurate answers and very poor confidence bounds, since they critically depend on the query result size.
- Thus, it is in general impossible to produce good-quality approximate answers using samples on the base relations alone.
32. Solution: Join Synopses
- We need to pre-compute samples of join results to make quality answers possible.
- A naïve solution is to execute all possible join queries of interest and collect samples of their results. But this is not feasible, since it is too expensive to compute and maintain.
- A better strategy is to compute samples of the results of a small set of distinguished joins, which can be used to obtain random samples of all possible joins in the schema. The samples of these distinguished joins are called join synopses.
- Join synopses are a solution for queries with only foreign-key joins.
33. Join Synopses
- Lemma J.1: The sub-graph of G on the k nodes in any k-way foreign-key join must be a connected sub-graph with a single root node.
  - We denote the relation corresponding to the root node as the source relation for the k-way foreign-key join.
- Lemma J.2: There is a 1-1 correspondence between tuples in a relation r1 and tuples in any k-way foreign-key join with source relation r1.
- For each relation r, there is some maximum foreign-key join (i.e., one having the largest number of relations) with r as the source relation.
- For example, in the accompanying figure, the join rooted at node C is the maximum foreign-key join with source relation C.
34. Join Synopses
- Definition of join synopses: for each node u in G, corresponding to a relation r1:
  - let J(u) be the output of the maximum foreign-key join with source r1,
  - let S(u) be a uniform random sample of r1, and
  - the join synopsis JS(u) is the output of the maximum foreign-key join with S(u) substituted for r1 (i.e., S(u) joined with all its descendant relations).
- The join synopses of a schema consist of JS(u) for all u in G.
- Join synopses are also referred to as join samples.
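The definition above can be sketched for a two-relation foreign-key schema. The Orders/Customers tables are hypothetical; per Lemma J.2, sampling the source relation and following the foreign key yields exactly one join tuple per sampled source tuple, so the result is a uniform sample of the join:

```python
import random

# Join synopsis sketch: Orders -> Customers (orders.cust_id references
# customers.id). Sample the *source* relation, then join the sampled
# tuples with their unique foreign-key matches.
customers = {1: "Alice", 2: "Bob"}
orders = [(101, 1, 30), (102, 1, 50), (103, 2, 20), (104, 2, 40)]  # (id, cust_id, amount)

def join_synopsis(sample_size, rng):
    src_sample = rng.sample(orders, sample_size)  # uniform sample of the source
    # each sampled order joins with exactly one customer (Lemma J.2)
    return [(oid, cid, amount, customers[cid]) for oid, cid, amount in src_sample]

syn = join_synopsis(2, random.Random(0))
print(syn)  # a uniform random sample of the Orders JOIN Customers result
```

Sampling both relations independently and then joining would instead keep an order only when its customer also happened to be sampled, which is the "small join result" problem described on the previous slide.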
35. Join Synopses
- A single join synopsis can be used for a large number of distinct joins.
- Lemma J.3: From a single join synopsis for a node whose maximum foreign-key join has k relations, we can extract a uniform random sample of the output of between k and 2^(k-1) distinct foreign-key joins.
- In other words, a join synopsis of relation r can be used to provide a random sample of any join involving r and one or more of its descendants.
- Since Lemma J.2 fails to apply in general for any relation other than the source relation, the joining tuples in any relation r other than the source relation will not, in general, be a uniform random sample of r. Thus, a distinct join synopsis is needed for each node/relation.
- Since tuples in join synopses are the results of multi-way joins, a possible concern is that they will be too large because they have many columns. To reduce the columns stored for tuples in join synopses, we can eliminate redundant columns (for example, join columns) and store only the columns of interest. Small relations can be stored in their entirety, rather than as part of join synopses.
36. References
- Surajit Chaudhuri, Gautam Das, Mayur Datar, Rajeev Motwani, Vivek Narasayya. Overcoming Limitations of Sampling for Aggregation Queries. ICDE 2001.
- Surajit Chaudhuri, Gautam Das, Vivek Narasayya. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. SIGMOD Conference 2001.
- Brian Babcock, Surajit Chaudhuri, Gautam Das. Dynamic Sample Selection for Approximate Query Processing. SIGMOD Conference 2003.
- S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 275-286, June 1999.
- Gautam Das. Survey of Approximate Query Processing Techniques (Invited Tutorial). SSDBM 2003.
- P. B. Gibbons, S. Acharya, and V. Poosala. Aqua: Approximate Query Answering System.
- Yong Yao and Johannes Gehrke. Query Processing for Sensor Networks.