Title: Join%20Synopses%20for%20Approximate%20Query%20Answering
1Join Synopses for Approximate Query Answering
- Swarup Achrya
- Philip B. Gibbons
- Viswanath Poosala
- Sridhar Ramaswamy
- Presented by
- Bhushan Pachpande
2Contents
- Introduction
- Need for approximate answers
- Problem with joins
- Join synopses
- Allocation
- Maintenance of join synopses
- Experimental Evaluation
3Introduction
- This paper
- demonstrates difficulty of providing good
approximate answers to join queries - proposes join synopses as the efficient solution
for this problem - presents strategy for allocating available space
for join synopses - provides efficient algorithm for maintaining join
synopses in presence of updates to the base
relations
4Why Approximate answers ?
- Reduce overhead for large DBs and improve
response time - Reduce access to the base relation
- Example of Approximate answers
- Initial queries in the data mining which are used
to determine what the interesting queries are - Queries requesting numerical answers and full
precision of exact answer not needed e.g. total,
average - The research in this paper was conducted while
developing efficient approximate query answering
system, Aqua.
5Aqua System
- improve response time by avoiding frequent access
to original data - maintains smaller sized statistical summaries,
called synopses, on warehouse. - sits on the top of the DBMS.
Collects all synopses, uses it to answer queries
posed by user
- There key components
- Statistic Collection
- Query Rewriting
- Maintenance
parses sql input, rewrite queries for scaling
certain operators to fit for synopses
Keep synopses up to date during updating of
original data
6Problem with Joins
- Natural set of synopses for an approximate query
includes uniform random samples of each base
relations - Non-uniform result samples - For the join to be
uniform random sample, probabilities of tuples in
join samples must be equal - Small join result
Join of relations R S on attribute X
Probabilities of tuples a1 and a2 being
selected should be same as prob. of tuples a1
and b1 selected in join
R.X
S.X
a a b b
a b
a1
a2
prob. (a1,a2)(1/r)(1/r)(1/r)1/r3
b1
prob. (a1,b1)(1/r)(1/r)(1/r) (1/r)1/r4
To get uniform join samples is very difficult
uniform random sampling
7Join Synopses
- Naïve way - execute all possible join queries and
collect samples - Join synopses - samples are taken from small set
of distinguished joins - Can obtain random samples of all possible joins
in the schema - This is scheme is for foreign key joins
- Modeled database schema as a graph
- vertex - base relation
- directed edge (u to v) if u has at least one
attribute which is foreign key in v
8Join Synopses
- Key result proved - There is 1-1 correspondence
between a tuple in relation r a tuple in the
output of any foreign key join involving r
any of its descendents in the graph. - A sample Sr of a relation r can be used to
produce another relation J(Sr) called a join
synopsis of r. ( provides random samples). - Join synopses of R is simply a sample of R where
as for C it is the join of N, R and sample of C.
9Join Synopses
- For each node u in database schema G,
corresponding to a relation r1, define J(u) to be
the output of the maximum foreign key join
r1xr2x..xrk with source r1. - Let Su be a uniform random sample of r1.
- The join synopsis J(Su) is the output of
Suxr2xr3..xrk. - J(Su) is a uniform random sample of J(u) with
Su tuples. - Thus we can extract from our synopsis a uniform
random sample of the output of any k-way foreign
key join. - From 1 join synopsis for a node whose foreign key
join has k relations, we can extract a URS of the
output of between k-1 pow(2,k-1)-1 distinct
foreign key joins.
10Allocation
- Allocate space among various join synopses when
certain properties of query workload are known. - Identify heuristics for the common case when such
properties are not known. - Let S be a set of queries with selects,
aggregates, group bys foreign key joins. - For each relation Ri, find fraction Fi of queries
in S for which Ri is the source relation in a
foreign key join. - It is known that the error bounds are inversely
proportional to sqrt(n).(n- number of tuples in
join sample). - Select join synopsis sizes so as to minimize the
average relative error.
11Allocation
- The average relative error bound over the queries
is proportional to sum(fi/sqrt(ni)) - ni is selected so as to minimize the above
equation for the total memory allocated for join
synopses - For each relation Ri if si size of single join
synopses tuple then join synopses size is chosen
so as -
- sum (nisi) lt Total memory allocated
- In the absence of query work load information
heuristic strategies can be used. - EqJoin
- CubeJoin
- PropJoin
divides space equally amongst the relations
divides space proportional to their join synopses
tuple sizes
divides space proportional to cube root of their
join synopses tuple sizes
12Maintenance of Join Synopses
- Need to maintain the join synopses when base
relation is updated (insert or delete) - does not require frequent access to base relation
- If a new tuple is inserted
- Let Pu be the probability of newly arrived tuple
for relation u in random sample Su - Let uxr2xr3x.xrk be the max foreign key join
with source u. - We add T (new tuple) to Su with probability Pu.
- If T is added to Su, we add to J(Su) the tuple
Txr2xr3x.rk
13Maintenance of Join Synopses
- If T is added to Su and Su exceeds its target
size, then select uniformly at random a tuple T
to evict from Su and remove the tuple in J(Su)
corresponding to T. - On delete of a tuple T from u
- T is in Su delete the tuple from Su and remove
the tuple from J(Su) corresponding to T - If sample becomes too small due to many deletions
repopulate by scanning relation u. - This algorithm performs lookups with the base
relation with small probability Pu
14Experimental Evaluation
- Two classes of experiments
- Accuracy experiments
- Maintenance experiments
- Accuracy Experiments
- Compares accuracy of techniques based on join
synopses and based on base samples - parameters varied - query selectivity and total
space allocated to precomputed summaries (summary
size/join synopses size) - Maintenance Experiments
- Study cost of keeping join synopses up to date in
presence of insertions/deletions to the
underlying data.
15Experimental Evaluation
- Ran results on TPC-D decision support benchmark
- Query used is an aggregate that is computed on
join of Lineitem, Customer, Order, Supplier,
Nation and Region. - The query used is
Query selectivity is varied using these parameters
- region parameter is set to ASIA and selection
predicate is on o_orderdate column to the range
1/1/94, 1/1/95
16Experimental Evaluation
Accuracy Experiments
17Experimental Evaluation
Maintenance Experiments
tuples inserted in lineitem table
18Conclusion
- Provides uniform random sampling for joins in the
database having foreign key joins. - Focus on computing approximate answers to
aggregates computed on multi-way joins. - Join synopses can be maintained effectively
during updates.