Title: Approximate Query Processing: Techniques & Applications
1Approximate Query ProcessingTechniques
Applications
- Presented by- Shraddha Rumade
2Road Map
- Introduction
- Different Data Synopsis
- Sampling
- Outlier Indexing
- Exploiting Workload Information (weighted
sampling) - Pre-computed samples
- Applications of AQP
- Sensor Networks
- Streams
3Introduction
- Decision support applications such as On Line
Analytical Processing (OLAP) and data mining for
analyzing large databases have become popular. - A common characteristic of these applications is
that they execute aggregation queries (count,
sum, avg) on large databases, which can often be
expensive and resource intensive. - Instead of obtaining exact answers to such
queries, giving approximate answers can greatly
benefit the scalability of these applications.
4Aggregate Query Example
SELECT Department, COUNT () as numStudents FROM
StudentRecords WHERE Degree Masters GROUP
BY Department ORDER BY numStudents DESC
Exact Answer
Approximate Answer
5Data Synopsis
- Pre-computed
- Online
- Different Synopsis
- Histograms - Partition attribute value domain
into a set of buckets. Become problematic when
dealing with high-dimensional data sets (storage
overhead construction cost). Require
substantial changes to the QP engine - Wavelets Provide a mathematical tool for the
hierarchical decomposition of functions.
Applications in signal image processing. - Samples pre-computed or online samples of the
data instead of the complete data to answer the
queries approximately.
6Sampling
- Uniform Random Sampling
- Sample size fN Sampling fraction f
- Each tuple is added to the sample with
probability f. - Problems
- Data skew - A skewed database is characterized by
the presence of outlier values that are
significantly different from the rest in terms of
their contribution to the aggregate. - Low selectivity Very few or no tuples in the
sample may satisfy the query predicate. - Solutions
- Outlier indexing
- Exploit workload information
7Outlier Indexing
- Identify the tuples with outlier values and store
them in a separate sub-relation. - Apply the query to the outlier values, determine
the true result of the query on the part of the
table which only includes the outlier values - Pick a uniform random sample from the table
excluding the outlier values (non-outliers), and
estimate an approximation to the true result of
the query if applied to the non-outlier tuples - Combine two results to obtain an overall estimate
of the querys true result. - NOTE the subset that minimizes the standard
deviation over the remaining set consists of the
leftmost t' elements (for 0 ? t' ? t ) and the
rightmost t - t' elements from the multiset R,
when the elements are arranged in a sorted order.
8Outlier Indexing
- Algorithm Outlier-Index (R, C, t )
- Read the values in column C of the relation R.
Let y1, y2,. yN be the sorted order of values
appearing in C. Each value corresponds to a
tuple. - For i 1 to t 1, compute
- E(i) S( yi, yi1,. yN )
- Let i' be the value of i where E(i) takes its
minimum value. - Then the outlier-index is the tuples that
correspond to the set of values - yj 1 ? j ? t' U yj N t'1- t ? j
? N where t' i' - 1
9Exploit workload information Weighted Sampling
- Key steps
- Perform sampling by taking into account weights
of tuples. Sample more from subsets of data that
are small in size but are important, i.e., have
high usage. - A tuple ti has weight wi if the tuple ti is
required to answer wi of the queries in the
workload. - Weight of the tuple ti in the relation be wi.
- normalized weight be wi' wi / ?Nj1 wj
- This tuple is accepted in the sample with
probability pi n wi' - The inverse of this probability is the
multiplication factor associated with the tuple
used while answering the query. Each aggregate
computed over this tuple gets multiplied by this
multiplication factor. - This technique proves better than uniform
sampling and weighted - sampling alone in case of different data skew,
different sampling fraction - and varying selectivity of queries.
10Pre-computing Samples
Use of pre-computed samples of the data instead
of the complete data to answer queries.
11Pre-computing Samples for Fixed Workload
For a given relation R and workload W, consider
partitioning the records in R into a minimum
number of regions R1, R2, , Rr such that for any
region Rj, each query in W selects either all
records in Rj or none.
12Fixed Workload
- Identify Fundamental Regions r
- Case 1 r lt k (k- sample size)
- 2. Pick exactly one record from each fundamental
region. - 3. Additional column RegionCount, AggSum) in the
sample records to store the aggregate value of
records in that fundamental region. - Case 2 r gt k
- 2. Sort all r regions by their importance and
then select the top k. The importance of region
Rj is defined as fjnj2, where fj is the sum of
the weights of all queries in W that select the
region, and nj is the number of records in the
region. - 3. optimization problem- We have 2k unknowns
RC1,,RCk and AS1, .ASk. MSE(W) can be
expressed as a quadratic function of these 2k
unknowns. - minimize this function to give 2k simultaneous
(sparse) linear - equations, solve using an iterative technique.
13Applications of AQP
14AQP in Sensor Networks
- A Sensor Network is a cluster of sensor motes,
devices with measurement, communication and
computation capabilities, powered by a small
battery. - In a typical sensor network, each sensor produces
a stream of sensory observations across one or
more sensing modalities. - Need for data aggregation
- unnecessary for each sensor to report its entire
data stream in full fidelity. - in a resource constrained sensor network
environment, each message transmission is a
significant, energy-expending operation. - individual readings may be noisy or unavailable.
- In sensor networks we need a scalable and fault
tolerant querying technique - to extract useful information from the data the
sensors collect.
15Sensor Database Systems - TinyDB
- Distributed query processor for smart sensor
devices. - Implements acquisitional techniques that can
provide significant reductions in power
consumption on our sensor devices. - Query dissemination - routing tree that allows a
basestation at the root of the network to
disseminate a query and collect query results. - ACQUISITIONAL QUERY LANGUAGE
- Event-Based queries
- Lifetime-Based queries
16AQP in Sensor NetworksTAG A Tiny AGgregation
Service for Ad-Hoc Sensor Networks
- Users connect to the sensor network using a
workstation or base station directly connected to
a sensor (sink). - Aggregate queries- SQL-like language, then
distributed across the network. - Aggregate results are sent back to the
workstation over a spanning tree, with each
sensor combining its own data with results
received from its children. - Effective and energy-efficient in case of no
failures, for distributive and algebraic
aggregates such as MIN, MAX, COUNT and AVG. - A single failure results in an entire sub-tree of
values being lost. - Multi-path routing
- Works for monotonic and exemplary aggregates like
MIN and MAX. - Incorrect results for duplicate sensitive
aggregates (COUNT, AVG)
17Duplicate Insensitive sketches combined with
multi-path routing
- Counting Sketches introduced by Flajolet
Martin (FM) - For quickly estimating the number of distinct
items in a database (or stream) in one pass while
using only a small amount of space. - Distinct counting problem From a multi-set of
items M x1, x2, x3, . . . , compute n
distinct (M) . - FM sketch
- Given a multi-set M, the FM sketch of M, denoted
S(M), is a bitmap of length k. The entries of
S(M), denoted S(M)0, . . . , k -1, are
initialized to zero and are set to one using a
random binary hash function h applied to the
elements of M. Formally, - S(M) i 1 iff ?x ? M s.t. min j h (x,
j) 1 i.
18Duplicate Insensitive sketches combined with
multi-path routing
- Summation Sketches
- Given a multi-set of items M x1, x2, x3, . . .
where xi (ki , ci) and ci is a non-negative
integer, the distinct summation problem is to
calculate - n ? ci
- distinct(( ki ,ci ) ? M )
- Algorithm SUMINSERT( S,x,c)
- 1 d pick_threshold (c)
- 2 for i 0, . . . , d - 1 do
- 3 Si 1
- 4 end for
- 5 a pick_binomial (seed(x, c), c, 1/2d)
- 6 for i 1, . . . , a do
- 7 j d
- 8 while hash(x,c,i,j) 0 do
- 9 j j 1
- 10 end while
- 11 Sj 1
- 12end for
19Duplicate Insensitive sketches combined with
multi-path routing
- Approximate estimation of Duplicate Sensitive
Aggregates - Wireless communication
- ability to broadcast a single message to multiple
neighbors simultaneously. - duplicate-insensitive sketches allow a sensor to
combine all of its received sketches into a
single message to be sent. - Algorithm
- The query is distributed across the sensor
network, using some form of flooding. Each node
also computes its level (i.e. its hop distance
from the root), and notes the level values of its
immediate neighbors. - divided into a series of epochs specified by the
query. - aggregate computed once for each epoch.
- At the beginning of each epoch, each node
constructs a sketch of its local values for the
aggregate.
20Duplicate Insensitive sketches combined with
multi-path routing
- Phase 2 (cont.)
- 3. The epoch is then sub-divided into a series of
rounds, one for each level, starting with the
highest (farthest) level. - 4. In each round, the nodes at the corresponding
level broadcast their sketches, and the nodes at
the next level receive these sketches and combine
them with their sketches in progress. - 5. In the last round, the sink receives the
sketches of its neighbors, and combines them to
produce the final aggregate.
21Duplicate Insensitive sketches combined with
multi-path routing
- Analysis
- Main advantage of synchronization and rounds-
better scheduling and reduced power consumption. - Loosening the synchronization increases the
robustness of the final aggregate as paths taking
more hops are used to route around failures. - Increased robustness comes at the cost of power
consumption, since nodes broadcast and receive
more often (due to values arriving later than
expected) and increased time (and variability) to
compute the final aggregate.
22Simple gossip-based protocols(AQP in Sensor
Networks cont.)
- We have seen that distributed systems prove
efficient over centralized ones, but with
distributed systems we have instability arising
due to node and link failures. - Sensor networks often involve the deployment in
inhospitable or inaccessible areas that are
naturally under high stress (for example in
battlefields or inside larger devices). - Individual sensors may fail at any time, and the
wireless network that connects them is highly
unreliable. - Decentralized gossip-based protocols provide a
simple and scalable solution for such highly
volatile systems along with fault-tolerant
information dissemination. - Due to the large scale of the system, the values
of aggregate functions over the data in the whole
network (or a large part of it) are often more
important than individual data at nodes.
23Analysis of simple gossip-based protocols(AQP in
Sensor Networks cont.)
- In a network of n nodes, each node i holds a
value xi (or a set Mi of values). - The idea is to compute some aggregate function of
these values (such as sums, averages,etc.) in a
decentralized and fault-tolerant fashion, while
using small messages only. - In gossip-based protocols, each node contacts one
or a few nodes in each round (usually chosen at
random), and exchanges information with these
nodes. - Information spread resembles the spread of an
epidemic.
24Analysis of simple gossip-based protocols(AQP in
Sensor Networks cont.)
- The Push-Sum protocol
- For computing sums or averages of values at the
nodes of a network. - At all times t, each node i maintains a sum st,i,
initialized to s0,i xi, and a weight wt,i,
initialized to w0,i 1. At time 0, it sends the
pair (s0,i,w0,i) to itself, and in each
subsequent time step t, each node i follows the
protocol given below- - Algorithm
- 1 Let (sr , wr) be all pairs sent to i in
round t-1 - 2 Let st,i Sr sr , wt,i Sr wr
- 3 Choose a target ft(i) uniformly at random
- 4 Send the pair ( ½ st,i , ½ wt,i ) to ft(i) and
i (yourself) - 5 st,i / wt,i is the estimate of the average in
step t - The algorithm uses the basic property of mass
conservation the average of all sums st,i is
always the correct average, and the sum of all
weights wt,i is always n.
25Analysis of simple gossip-based protocols(AQP in
Sensor Networks cont.)
- Diffusion Speeds
- The diffusion speed characterizes how fast a
value originating with any one node diffuses
evenly through the network. - Local n-dimensional contribution vector vt,i at
each node. Initially a node sends a vector ei
(the vector with 1 in the i-coordinate, and 0 in
all others) to itself. In subsequent rounds, the
protocol is - Algorithm Protocol Push-Vector
- 1 Let ?r be all vectors sent to i in round
t -1 - 2 Let vt,i Sr ?r
- 3 Choose shares at,i,j for all nodes j
- 4 Send at,i,j vt,i to each j
26Analysis of simple gossip-based protocols(AQP in
Sensor Networks cont.)
- Correspondence (Push-sum) st,I vt,i x Sj
vt,i,j xj wt,i vt,i 1 Sj
vt,i,j - Diffusion speed of the communication mechanism is
characterized by the speed with which the
contribution vectors converge to multiples of
the1 vector. - T T(d, n, e) is (an upper bound on) the
diffusion speed of the mechanism defined by the
distribution on shares at,i,j if maxi ?i,t lt e
with probability at least 1- d at all times t gt
T(d, n, e). - For uniform gossip it is O( log n log 1/ e log
1/ d)
27Analysis of simple gossip-based protocols(AQP in
Sensor Networks cont.)
- Importance
- Simplicity Gossip-based protocols usually do not
require error recovery mechanisms. - The guarantees obtained from gossip are usually
probabilistic in nature they achieve high
stability under stress and disruptions. - Scale gracefully to a huge number of nodes.
- Practical considerations
- The protocols Push-Sum, Push-Random, etc. are
defined in terms of synchronous rounds, and with
a synchronized starting point. The latter is
unnecessary. - Nodes will usually want to stop processing a
query after some time, when the approximation
guarantee is good enough. -
28Approximate Query processing in Streams
- Applications for high-speed streams
- Networking- IP network management, network packet
traces - Online monitoring real-time data over streams
such as call records, sensor readings, web usage
logs, etc. - Telecommunications
- Issues to consider
- Queries over these streams need to be processed
in an online fashion to - enable real-time responses.
- traditional DBMS paradigm of set-oriented
processing of disk-resident tuples does not
apply. - data streams produced at physically distributed
locations, - Adaptive, self-regulating systems for processing
continuous monitoring queries over data streams-
bursty data streams and variable data
characteristics.
29Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
- Gracefully degrade performance when the demands
placed on the system cannot be met given
available resources, in the context of continuous
monitoring queries over data streams. - Load shedding dropping unprocessed tuples to
reduce system load. - Optimization problem
- objective function minimizing inaccuracy in
query answers - constraint system throughput must match or
exceed the data input rate. - General idea
- load shedding operators, or load shedders, at
various points in the query plan. - Each load shedder is parameterized by a sampling
rate p. The load shedder flips a coin for each
tuple that passes through it. With probability p,
the tuple is passed on to the next operator, and
with probability 1-p, the tuple is discarded. - lost tuples are compensated by scaling the
aggregate values calculated by the system to
produce unbiased approximate query answers.
30Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
- For each query qi, there is a corresponding path
in the data flow diagram from some data stream Sj
through a set of query operatorsOi1,Oi2, . . . ,
Oip to node qi. This path represents the
processing necessary to compute the answer to
query qi, and it is called the query path for
query qi. - Set of trees. The root node - data stream Sj ,
and the leaf nodes- queries that monitor stream
Sj. Let T (Sj) denote the tree of operators
rooted at stream source Sj.
31Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
- Parameters-
- Operator Oi (two parameters) selectivity si and
processing time per tuple ti. - SUM aggregate operator Oi two additional
parameters, the mean µi and standard
deviation si of values in input tuples - data stream Sj rate parameters ri
- Preliminaries
- Ui - set of operators upstream (Sj to Oi) of
OiIf some of the operators upstream of Oi are
selective, the data input rate seen by operator
Oi will be less than the data stream rate rj at
the stream source. - If load shedders are introduced upstream of Oi,
they will also reduce the effective input rate
seen by Oi. - pi- sampling rate of the load shedder introduced
immediately before operator Oi and let pi 1when
no such load shedder exists.
32Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
- Load Shedding Algorithm
- Load Equation Any load shedding policy must
select sampling rates pi to ensure - ? (tirsrc(i)pi p
sxpx) 1 - 1 i k
Ox ? Ui - L.H.S- total time required for the system to
process the tuples that arrive during - one time unit (assumption- overhead introduced by
load shedding is negligible) - Problem statement-
- Given a data flow diagram, the parameters si, ti,
µ i, s i for each operator O i, and the rate - parameters rj for each data stream Sj ,
- select load shedding sampling rates pi to
minimize the max relative error ?max max1in
?i - subject to the constraint that the load equation,
must be satisfied.
33Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
- Algorithm
- Determine the effective sampling rates for each
query that will distribute error evenly among all
queries. - Determine where in the data flow diagram load
shedding should be performed to achieve the
appropriate rates and satisfy the load equation. - Upper bound on probability that the relative
error exceeds a threshold ?i - Let X1,X2, . . .,XN be N random variables,
- Xj vj /P with probability P 0 otherwise.
- Âi -sum of random variables and let Ai
?Nj1 vj .SSi the sum ?Nj1 v2j , then - Pr Âi - Ai ? Ai 2 exp(-2 Pi 2 ?2
Ai2/SSi) - Thus, for a query qj, to ensure that the
probability that the relative error exceeds ?i is
at most d, we must guarantee 2 exp(-2P2 ?2
Ai2/SSi) d, which occurs when Pi?i Ci, where
Ci SQRT(SSi/2Ai2 log 2/d) - Thus we must guarantee that Pi Ci /?i
34Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
- Placement of Load Shedders
- no sharing of operators among queries- introduce
a load shedder with sampling rate pi Pi before
the first operator in the query path for each qi. - Shared query path- Shared segment (Suppose we
label each operator with the set of all queries
that contain the operator in their query paths.
Then the set of all operators having the same
label is a shared segment) Load shedding is only
performed at the start of shared segments.
35Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
Data flow diagram with 3 operators. Query nodes
q1 and q2 have effective sampling rates equal to
0.5 and 0.8. Imagine a solution that places load
shedders before all three operators A,B, and C
with sampling rates p1, p2, and p3 respectively.
Since p1p2 0.5 and p1p3 0.8, we know that the
ratio p2/p3 0.5/0.8 0.625 in any solution.
- Modification eliminate the load shedder before
operator C and change the sampling - rates for the other two load shedders to be p'1
p1p3 0.8 and p'2 p2/p3 0.625. - Thus p'1 p'2 p1p2 0.5 and p'1 p1p3 0.8,
but the resulting plan has lower processing - time per tuple. Effectively, we have pushed down
the savings from load shedder p3 to - before operator A, thereby reducing the effective
input rate to operator A while leaving all - other effective input rates unchanged.
36Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
- Observations
- Optimal solution - the child segment of B that
lies on the query path for qmax will not contain
a load shedder. All other child segments of B
will contain a load shedder with sampling rate
Pchild/P max - Let qmax be the query that has the highest
effective sampling rate among all queries sharing
an initial segment S. In the optimal solution, S
will contain a load shedder with sampling rate
Pmax.
37Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
- Algorithm SetSamplingRate (x,Rx)
- if x is a leaf node then
- return
- end if
- Let x1, x2, . . .xk be the children of x
- for i 1 to k do
- if Pxi lt Rx then
- Shed load with p Pxi /Rx on edge (x, xi)
- end if
- SetSamplingRate(xi, Pxi)
- end for
- Except for the first load shedder that is
introduced just after the root node, the sampling
rates for all others depend only on the ratios
between effective sampling rates - Effective sampling rate Pi Ci? where ? 1/
?max is an unknown multiplier. On each query
path, there is at most one load shedder whose
sampling rate depends on ? load equation becomes
a linear function of ?.
38Adaptivity via Load Shedding Approximate Query
processing in Streams (cont.)
Accuracy at various levels of load
Adjusting to variable stream rates
- Advantages of the algorithm
- Reduced error in the approximation of the query
- adapts to changes in system load over time, in
case of varied input stream rates
39References
- Surajit Chaudhuri, Gautam Das, Mayur Datar,
Rajeev Motwani, Vivek Narasayya Overcoming
Limitations of Sampling for Aggregation Queries.
ICDE 2001. - Surajit Chaudhuri, Gautam Das, Vivek Narasayya A
Robust, Optimization-Based Approach for
Approximate Answering of Aggregate Queries.
SIGMOD Conference 2001. - The design of an acquisitional query processor
for sensor networks Samuel Madden, Michael J.
Franklin, Joseph M. Hellerstein, Wei Hong June
2003 Â Proceedings of the 2003 ACM SIGMOD
international conference on Management of data. - Approximate aggregation techniques for sensor
databasesConsidine,J.Li,F.Kollios,G.Byers,J.
Data Engineering, 2004. Proceedings. 20th
International Conference on , 30 March-2 April
2004 - Gossip-based computation of aggregate
informationKempe,D.Dobra,A.Gehrke,J.
Foundations of Computer Science, 2003.
Proceedings. 44th Annual IEEE Symposium on
, 11-14 Oct. 2003 - Load shedding for aggregation queries over data
streamsBabcock,B.Datar,M.Motwani,R. Data
Engineering, 2004. Proceedings. 20th
International Conference on , 30 March-2 April
2004 - Samuel R. Madden, Michael J. Franklin, Joseph M.
Hellerstein, and Wei Hong. TAG a Tiny
AGgregation Service for Ad-Hoc Sensor Networks.
OSDI, December, 2002. - Gautam Das Survey of Approximate Query
Processing Techniques. (Invited Tutorial) SSDBM
2003