Title: Progressive Approximate Aggregate Queries with a MultiResolution Tree Structure
1Progressive Approximate Aggregate Queries with a
Multi-Resolution Tree Structure
- Iosif Lazaridis, Sharad Mehrotra
- University of California, Irvine
- SIGMOD 2001, Santa Barbara
2Talk Outline
- Aggregate Queries
- Motivation for Approximate Answering
- Multi-Resolution Aggregate Tree (MRA-Tree)
- Progressive Algorithm with Error Bounds
- Experimental Evaluation
- Summary and Future Work
3Aggregate Queries
minQ 2 maxQ 7 countQ 3 sumQ 276
15 avgQ 15/3 5
S
4Evaluating Aggregate Queries
- Exact answering
- Scan all points of D checking each against Q
- Retrieve points in Q via a multi-dimensional
index on D - Both linear/index scan can be very expensive
- Approximate answering
- Many applications (selectivity estimation, data
analysis, visualization) do not require exact
answers
5Motivating Examples
How many tanks 10 miles from me?
My boss needs to see the income aggregates in 10
minutes!
6Techniques for Approximate Aggregate Queries
- Online estimation (Interactive)
- Sampling
- Offline estimation (Data Synopsis)
- Sampling, Histograms, Wavelets
- Our Technique
- Online estimator via a scan of a modified
multi-dimensional index (MRA-Tree) - Allows incremental tradeoff of accuracy for
response time, with guaranteed error bounds
7Multi-Resolution Aggregate Tree (MRA-Tree)
- An MRA-Tree can be instantiated with any of the
popular multi-dimensional index trees (R-Tree,
quadtree, Hybrid tree, etc.) - A non-leaf node contains (for each of its
subtrees) four aggregates MIN,MAX,COUNT,SUM - A leaf node contains the actual data points
- Tree operations are identical with those of the
plain (non-MRA) tree with the consideration that
aggregates must be maintained
8MRA-Tree Example
2
min max count sum
4
1
6
Non-Leaf Node
4
5
2
6
3
2
4
1
9
9
4
6
Leaf Nodes
9Progressive Algorithm Outline
- We want
- Best answer for given time
- Shortest time for given precision of the answer
- Refine an answer at will, trading time for
precision
- How we achieve it
- Do a prioritized traversal of nodes of the
MRA-tree - Maintain an estimate of the answer E(aggQ)
- Maintain a 100 interval of confidence I L,
H, such that L ? aggQ ? H
10Generic Algorithm (1)
- Two sets of nodes
- NP (partial contribution to the query)
- NC (complete contribution)
11Generic Algorithm (2)
- Initialize NP with the root
- At each iteration Remove one node N from NP and
for each Nchild of its children - discard, if Nchild disjoint with Q
- insert into NP if Q is contained or partially
overlaps with Nchild - insert into NC if Q contains Nchild (we only
need to maintain aggNC)
N
Q
12Generic Algorithm (3)
- To instantiate the algorithm for
MIN,MAX,COUNT,SUM,AVG - Error Bounds.
- Interval IL, H L ? aggQ ? H
- Traversal Policy.
- Which node from NP to explore next? Minimize I
- Estimation.
- Provide an estimate of the answer E(aggQ)
Node in NP
Node in NC
13MIN (and MAX)
Interval minNC min 4, 5 4 minNP min
3, 9 3 L min minNC, minNP 3 H minNC
4 hence, I 3, 4
9
4
5
Estimate Lower bound E(minQ) L 3
3
14COUNT (and SUM)
Interval countNC 96 15 countNP 810 18 L
countNC 15 H countNC countNP 33 hence,
I 15, 33
8
25
9
Estimate E(countQ) L 0.25?8 0.2?10 19
6
20
10
15AVG
Interval Current avgNC 55/10 5.5
B
Maximum possible (552?10) / (102)
6.25 Minimum possible (553?5) / (103)
5.38 hence, I 5.38, 6.25
A
Estimate E(avgQ) E(sumQ)/ E(countQ)
Distribution of Values 5, 5, 5, 10, 10
Traversal max countN max (maxN-avgNC),
(avgNC-minN)
min max count sum A 5 10
5 35 B 10
55
16Experiments
- Synthetic datasets 2-4D
- Real datasets 2D spatial (USGS) and 4D (UCI KDD
Forest Cover) - MRA-quadtree and MRA-Rtree indices
- We study
- MRA-tree Vs. plain tree
- MRA-tree Vs. online sampling
- Accuracy of estimation
- Scalability with database size
17MRA-Quadtree (Nodes Visited)
18MRA-Quadtree (Error Reduction)
Absolute Relative Error
19MRA-Rtree (2D, USGS) I/O Performance
DB Size
20Estimation vs. Maximum Error (4D, Forest Cover,
sel. 16 / axis)
21MRA-Rtree vs. Online SamplingEstimation Accuracy
(4D, Forest Cover)
22Database Size (3D Synthetic, exact, 10 spatial
sel.)
23Summary
- MRA-Tree is a modified multi-dimensional index
for approximate answering of aggregate queries - For exact answer
- faster than plain index
- Advantages over offline estimators
- Progressively improving answers
- Error bounds
- Advantages over sampling
- Better estimate for same I/O
- Algorithm scales gracefully with database size
24Future Work (QUASAR Project, UC Irvine)
- Scalability with high dimensionality, by using a
dedicated high-D index structure - Scalability in high update rate environments
- Approximate query processing of general SQL
queries using dedicated data structures, similar
to MRA-tree