Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure

Learn more at: https://ics.uci.edu


1
Progressive Approximate Aggregate Queries with a
Multi-Resolution Tree Structure

Iosif Lazaridis, Sharad Mehrotra
University of California, Irvine
SIGMOD 2001, Santa Barbara

2
Talk Outline

Aggregate Queries
Motivation for Approximate Answering
Multi-Resolution Aggregate Tree (MRA-Tree)
Progressive Algorithm with Error Bounds
Experimental Evaluation
Summary and Future Work

3
Aggregate Queries
minQ = 2
maxQ = 7
countQ = 3
sumQ = 2 + 7 + 6 = 15
avgQ = 15/3 = 5
4
Evaluating Aggregate Queries

Exact answering
Scan all points of D checking each against Q
Retrieve points in Q via a multi-dimensional
index on D
Both linear/index scan can be very expensive
Approximate answering
Many applications (selectivity estimation, data
analysis, visualization) do not require exact
answers
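
As a point of reference for the exact-answering option, a linear scan can be sketched as follows. This is a minimal illustration, not the authors' code: the `Point` type, the function name, and the point coordinates are assumptions made up for the example; the values 2, 7, 6 are borrowed from the aggregate example on the previous slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float
    value: float  # the attribute being aggregated


def exact_aggregates(points, x_lo, x_hi, y_lo, y_hi):
    """Linear scan: test every point of D against the query rectangle Q."""
    selected = [p.value for p in points
                if x_lo <= p.x <= x_hi and y_lo <= p.y <= y_hi]
    if not selected:
        return None  # no point falls inside Q
    total = sum(selected)
    return {
        "min": min(selected),
        "max": max(selected),
        "count": len(selected),
        "sum": total,
        "avg": total / len(selected),
    }


# Hypothetical data set D: three points inside Q and one outside.
D = [Point(1, 1, 2), Point(2, 3, 7), Point(3, 2, 6), Point(9, 9, 40)]
result = exact_aggregates(D, 0, 4, 0, 4)
print(result)
```

The cost is proportional to the size of D regardless of how small Q is, which is the expense the approximate approach tries to avoid.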

5
Motivating Examples
How many tanks are within 10 miles of me?
My boss needs to see the income aggregates in 10
minutes!
6
Techniques for Approximate Aggregate Queries

Online estimation (Interactive)
Sampling
Offline estimation (Data Synopsis)
Sampling, Histograms, Wavelets
Our Technique
Online estimator via a scan of a modified
multi-dimensional index (MRA-Tree)
Allows incremental tradeoff of accuracy for
response time, with guaranteed error bounds

7
Multi-Resolution Aggregate Tree (MRA-Tree)

An MRA-Tree can be instantiated with any of the
popular multi-dimensional index trees (R-Tree,
quadtree, Hybrid tree, etc.)
A non-leaf node contains (for each of its
subtrees) four aggregates: MIN, MAX, COUNT, SUM
A leaf node contains the actual data points
Tree operations are identical to those of the
plain (non-MRA) tree, except that the
aggregates must be maintained
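
The node layout described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Agg`, `Entry`, and `Node` are hypothetical, and the insertion path is passed in explicitly, whereas a real index (R-Tree, quadtree) would choose it from the point's coordinates.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Agg:
    """The four aggregates kept for one subtree."""
    lo: float = math.inf     # MIN
    hi: float = -math.inf    # MAX
    count: int = 0           # COUNT
    total: float = 0.0       # SUM

    def add(self, v):
        self.lo = min(self.lo, v)
        self.hi = max(self.hi, v)
        self.count += 1
        self.total += v


@dataclass
class Entry:
    """One slot of a non-leaf node: a child pointer plus that subtree's aggregates."""
    child: "Node"
    agg: Agg = field(default_factory=Agg)


@dataclass
class Node:
    entries: list = field(default_factory=list)  # non-leaf: list of Entry
    values: list = field(default_factory=list)   # leaf: the actual data values


def insert(root, path, value):
    """Insert `value` into the leaf reached by following `path` (entry indexes).

    The path is supplied by the caller here (an assumption of this sketch).
    """
    node = root
    for i in path:
        entry = node.entries[i]
        entry.agg.add(value)  # maintain the aggregates on the way down
        node = entry.child
    node.values.append(value)


# A root with two leaf children and four hypothetical values.
root = Node(entries=[Entry(Node()), Entry(Node())])
for path, v in [([0], 2), ([0], 4), ([1], 4), ([1], 5)]:
    insert(root, path, v)
print(root.entries[0].agg, root.entries[1].agg)
```

Deletions would need more care than this sketch shows, because MIN and MAX cannot be updated from the removed value alone.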

8
MRA-Tree Example
[Figure: a non-leaf node whose entries hold (min, max, count, sum) for each subtree, pointing to leaf nodes that hold the data points]
9
Progressive Algorithm Outline

We want
Best answer for given time
Shortest time for given precision of the answer
Refine an answer at will, trading time for
precision

How we achieve it
Do a prioritized traversal of nodes of the
MRA-tree
Maintain an estimate of the answer E(aggQ)
Maintain a 100% confidence interval I = [L, H],
such that L ≤ aggQ ≤ H

10
Generic Algorithm (1)

Two sets of nodes
NP (partial contribution to the query)
NC (complete contribution)

11
Generic Algorithm (2)

Initialize NP with the root
At each iteration, remove one node N from NP and,
for each child Nchild of N:
discard Nchild if it is disjoint from Q
insert Nchild into NP if Q is contained in or partially
overlaps Nchild
insert Nchild into NC if Q contains Nchild (we only
need to maintain aggNC)
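
The three-way test applied to each child can be sketched for axis-aligned rectangles as follows. The `Rect` type and the function names are assumptions made for this example; the slides do not prescribe a particular geometry representation.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x_lo: float
    y_lo: float
    x_hi: float
    y_hi: float


def disjoint(a, b):
    """True if the two rectangles share no point."""
    return (a.x_hi < b.x_lo or b.x_hi < a.x_lo or
            a.y_hi < b.y_lo or b.y_hi < a.y_lo)


def contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    return (outer.x_lo <= inner.x_lo and inner.x_hi <= outer.x_hi and
            outer.y_lo <= inner.y_lo and inner.y_hi <= outer.y_hi)


def classify(q, child):
    """Decide what happens to a child node's bounding box relative to Q."""
    if disjoint(q, child):
        return "discard"   # contributes nothing to the answer
    if contains(q, child):
        return "NC"        # contributes completely: use its stored aggregates
    return "NP"            # partial overlap, or child contains Q: explore later


q = Rect(0, 0, 10, 10)
print(classify(q, Rect(20, 20, 30, 30)))  # far away from Q
print(classify(q, Rect(2, 2, 4, 4)))      # inside Q
print(classify(q, Rect(8, 8, 12, 12)))    # straddles Q's boundary
print(classify(q, Rect(-5, -5, 15, 15)))  # encloses Q
```

Only the aggregate of the NC nodes has to be kept, so a node classified as "NC" never needs to be visited.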

12
Generic Algorithm (3)

To instantiate the algorithm for
MIN, MAX, COUNT, SUM, AVG, specify:
Error Bounds.
Interval I = [L, H], with L ≤ aggQ ≤ H
Traversal Policy.
Which node from NP to explore next? Minimize the width of I
Estimation.
Provide an estimate of the answer E(aggQ)

[Figure legend: nodes in NP, nodes in NC]
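
To make the three ingredients concrete, here is a sketch of the generic loop instantiated for COUNT on a toy one-dimensional tree. Several things are assumptions of this example rather than statements from the slides: the `Node` class, the 1-D ranges, the data, and the traversal policy of exploring the NP node with the largest count first (chosen because that node contributes most to the interval width).

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: float                 # bounding range of the subtree
    hi: float
    count: int                # stored COUNT aggregate
    children: list = field(default_factory=list)  # empty for a leaf
    points: list = field(default_factory=list)    # leaf only: coordinates


def progressive_count(root, q_lo, q_hi):
    """Yield the interval (L, H) for countQ after each refinement step."""
    count_nc = 0            # aggregate of NC; the nodes need not be kept
    count_np = root.count   # total count of the nodes currently in NP
    heap = [(-root.count, 0, root)]  # NP as a max-heap on count
    seq = 1                 # tie-breaker so Node objects are never compared
    yield count_nc, count_nc + count_np
    while heap:
        _, _, node = heapq.heappop(heap)
        count_np -= node.count
        if not node.children:  # leaf: resolve exactly from its points
            count_nc += sum(q_lo <= p <= q_hi for p in node.points)
        for child in node.children:
            if child.hi < q_lo or q_hi < child.lo:
                continue                     # disjoint: discard
            if q_lo <= child.lo and child.hi <= q_hi:
                count_nc += child.count      # complete contribution (NC)
            else:
                heapq.heappush(heap, (-child.count, seq, child))  # NP
                seq += 1
                count_np += child.count
        yield count_nc, count_nc + count_np


# Hypothetical tree over ten 1-D points.
a = Node(1, 3, 3, points=[1, 2, 3])
b = Node(5, 6, 2, points=[5, 6])
c = Node(8, 9, 2, points=[8, 9])
d = Node(12, 15, 3, points=[12, 13, 15])
root = Node(1, 15, 10, children=[Node(1, 6, 5, children=[a, b]),
                                 Node(8, 15, 5, children=[c, d])])

steps = list(progressive_count(root, 2, 9))
print(steps)  # [(0, 10), (0, 10), (2, 10), (4, 7), (6, 6)]
```

A caller can stop consuming the generator at any step and still hold an interval that is guaranteed to contain the exact count, which is the time-for-precision tradeoff the outline describes.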
13
MIN (and MAX)
Interval:
minNC = min{4, 5} = 4
minNP = min{3, 9} = 3
L = min{minNC, minNP} = 3
H = minNC = 4
hence I = [3, 4]
Estimate: the lower bound, E(minQ) = L = 3
[Figure: NC nodes with minima 4 and 5; NP nodes with minima 3 and 9]
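
The MIN interval from this slide can be reproduced with a few lines. The function name and the list-based inputs are assumptions of this sketch; it takes the stored minima of the NC nodes and of the NP nodes.

```python
import math


def min_interval(nc_mins, np_mins):
    """Return (L, H) bounding minQ from the minima stored in NC and NP nodes."""
    min_nc = min(nc_mins, default=math.inf)  # certainly attained inside Q
    min_np = min(np_mins, default=math.inf)  # may or may not fall inside Q
    low = min(min_nc, min_np)   # nothing in Q can be smaller than this
    high = min_nc               # the minimum over NC is surely within Q
    return low, high


low, high = min_interval([4, 5], [3, 9])
print(low, high)  # 3 4
estimate = low    # the slide uses the lower bound as the estimate
```

MAX is symmetric: swap `min` for `max` and `math.inf` for `-math.inf`, and the roles of the two bounds are exchanged.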
14
COUNT (and SUM)
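Interval:
countNC = 9 + 6 = 15
countNP = 8 + 10 = 18
L = countNC = 15
H = countNC + countNP = 33
hence I = [15, 33]
Estimate: E(countQ) = L + 0.25·8 + 0.2·10 = 19
[Figure: NC nodes with counts 9 and 6; NP nodes with counts 8 and 10, of which 25% and 20% overlap Q]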
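
The COUNT arithmetic on this slide can be checked with a short sketch. The function name and the `(count, overlap fraction)` pairs are assumptions of this example; the estimate scales each NP node's count by the fraction of the node that overlaps Q, as the slide's numbers suggest.

```python
def count_interval_and_estimate(nc_counts, np_nodes):
    """np_nodes is a list of (count, fraction of the node overlapping Q)."""
    low = sum(nc_counts)                             # points certainly in Q
    high = low + sum(c for c, _ in np_nodes)         # if every NP point were in Q
    estimate = low + sum(c * f for c, f in np_nodes) # scale by overlap fraction
    return low, high, estimate


low, high, estimate = count_interval_and_estimate(
    nc_counts=[9, 6],
    np_nodes=[(8, 0.25), (10, 0.2)],
)
print(low, high, estimate)  # 15 33 19.0
```

SUM follows the same pattern with stored sums in place of counts, provided the values are non-negative; otherwise the bounds need separate handling.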
15
AVG
Interval:
Current avgNC = 55/10 = 5.5
Maximum possible: (55 + 2·10) / (10 + 2) = 6.25
Minimum possible: (55 + 3·5) / (10 + 3) = 5.38
hence I = [5.38, 6.25]
Estimate: E(avgQ) = E(sumQ) / E(countQ)
Distribution of values in A: 5, 5, 5, 10, 10
Traversal: max over N of countN · max{(maxN − avgNC), (avgNC − minN)}

Node  min  max  count  sum
A     5    10   5      35
B     -    -    10     55
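
The AVG bounds on this slide can be verified by brute force. This sketch is a check of the slide's numbers, not the authors' bounding procedure: it assumes the value distribution 5, 5, 5, 10, 10 of the partially overlapping node is known and tries every subset of those values that could fall inside Q.

```python
from itertools import combinations


def avg_bounds(sum_nc, count_nc, partial_values):
    """Smallest and largest average reachable by adding any subset of
    `partial_values` to the points already known to be in Q."""
    candidates = []
    for k in range(len(partial_values) + 1):
        for subset in combinations(partial_values, k):
            candidates.append((sum_nc + sum(subset)) / (count_nc + k))
    return min(candidates), max(candidates)


low, high = avg_bounds(sum_nc=55, count_nc=10, partial_values=[5, 5, 5, 10, 10])
print(round(low, 2), round(high, 2))  # 5.38 6.25
```

The extremes are reached by adding the three 5s (lowest) or the two 10s (highest), matching the slide. Enumeration is exponential in the number of values, so it is useful only as a sanity check on small examples.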
16
Experiments

Synthetic datasets 2-4D
Real datasets 2D spatial (USGS) and 4D (UCI KDD
Forest Cover)
MRA-quadtree and MRA-Rtree indices
We study
MRA-tree Vs. plain tree
MRA-tree Vs. online sampling
Accuracy of estimation
Scalability with database size

17
MRA-Quadtree (Nodes Visited)
18
MRA-Quadtree (Error Reduction)
[Plot axis: absolute relative error]
19
MRA-Rtree (2D, USGS) I/O Performance
[Plot axis: DB size]
20
Estimation vs. Maximum Error (4D, Forest Cover,
sel. 16% / axis)
21
MRA-Rtree vs. Online Sampling: Estimation Accuracy
(4D, Forest Cover)
22
Database Size (3D Synthetic, exact, 10% spatial
sel.)
23
Summary

MRA-Tree is a modified multi-dimensional index
for approximate answering of aggregate queries
For exact answer
faster than plain index
Advantages over offline estimators
Progressively improving answers
Error bounds
Advantages over sampling
Better estimate for same I/O
Algorithm scales gracefully with database size

24
Future Work (QUASAR Project, UC Irvine)

Scalability with high dimensionality, by using a
dedicated high-D index structure
Scalability in high update rate environments
Approximate query processing of general SQL
queries using dedicated data structures, similar
to MRA-tree
