Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure

1
Progressive Approximate Aggregate Queries with a
Multi-Resolution Tree Structure
  • Iosif Lazaridis, Sharad Mehrotra
  • University of California, Irvine
  • SIGMOD 2001, Santa Barbara

2
Talk Outline
  • Aggregate Queries
  • Motivation for Approximate Answering
  • Multi-Resolution Aggregate Tree (MRA-Tree)
  • Progressive Algorithm with Error Bounds
  • Experimental Evaluation
  • Summary and Future Work

3
Aggregate Queries
  • min_Q = 2, max_Q = 7, count_Q = 3
  • sum_Q = 2 + 7 + 6 = 15
  • avg_Q = 15 / 3 = 5
4
Evaluating Aggregate Queries
  • Exact answering
    • Scan all points of D, checking each against Q
    • Retrieve points in Q via a multi-dimensional
      index on D
    • Both linear and index scans can be very expensive
  • Approximate answering
    • Many applications (selectivity estimation, data
      analysis, visualization) do not require exact
      answers
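
As an illustration of the exact linear-scan option, here is a minimal Python sketch. The `Point` class, the box-shaped query given by `lo` and `hi` corners, and the sample data are assumptions made for this example, not part of the original slides; the values inside the query are chosen to match the 2, 7, 6 example on the previous slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Point:
    coords: Tuple[float, ...]  # location in the indexed space
    value: float               # attribute being aggregated


def in_query(p: Point, lo: Tuple[float, ...], hi: Tuple[float, ...]) -> bool:
    """True if p lies inside the axis-aligned box [lo, hi] (hypothetical query shape)."""
    return all(l <= c <= h for c, l, h in zip(p.coords, lo, hi))


def exact_aggregates(data: List[Point], lo, hi) -> Optional[Dict[str, float]]:
    """Scan every point of D, keep those inside Q, and aggregate their values."""
    vals = [p.value for p in data if in_query(p, lo, hi)]
    if not vals:
        return None  # MIN, MAX and AVG are undefined on an empty result
    total = sum(vals)
    return {"min": min(vals), "max": max(vals), "count": len(vals),
            "sum": total, "avg": total / len(vals)}


# Made-up dataset: three points fall inside Q = [0, 4] x [0, 4].
D = [Point((1, 1), 2), Point((2, 3), 7), Point((3, 2), 6),
     Point((8, 8), 4), Point((9, 1), 9)]
result = exact_aggregates(D, lo=(0, 0), hi=(4, 4))
print(result)
```

The scan touches every point regardless of how selective Q is, which is the cost the approximate techniques try to avoid.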

5
Motivating Examples
  • "How many tanks are within 10 miles of me?"
  • "My boss needs to see the income aggregates in 10
    minutes!"
6
Techniques for Approximate Aggregate Queries
  • Online estimation (Interactive)
    • Sampling
  • Offline estimation (Data Synopsis)
    • Sampling, Histograms, Wavelets
  • Our Technique
    • Online estimator via a scan of a modified
      multi-dimensional index (MRA-Tree)
    • Allows incremental tradeoff of accuracy for
      response time, with guaranteed error bounds

7
Multi-Resolution Aggregate Tree (MRA-Tree)
  • An MRA-Tree can be instantiated with any of the
    popular multi-dimensional index trees (R-Tree,
    quadtree, Hybrid tree, etc.)
  • A non-leaf node contains (for each of its
    subtrees) four aggregates: MIN, MAX, COUNT, SUM
  • A leaf node contains the actual data points
  • Tree operations are identical to those of the
    plain (non-MRA) tree, except that the
    aggregates must be maintained
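
A minimal Python sketch of how the four aggregates might be maintained on insertion. The class and function names are hypothetical, bounding regions and node splitting are omitted, and for brevity each node carries the aggregates of its own subtree (equivalent to the parent storing them per subtree, as the slide describes).

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Aggregates:
    vmin: float = math.inf    # MIN of the subtree
    vmax: float = -math.inf   # MAX of the subtree
    count: int = 0            # COUNT of the subtree
    total: float = 0.0        # SUM of the subtree

    def add(self, value: float) -> None:
        self.vmin = min(self.vmin, value)
        self.vmax = max(self.vmax, value)
        self.count += 1
        self.total += value


@dataclass
class Node:
    agg: Aggregates = field(default_factory=Aggregates)
    children: List["Node"] = field(default_factory=list)  # non-leaf only
    values: List[float] = field(default_factory=list)     # leaf only


def insert(path: List[Node], value: float) -> None:
    """Insert value into the leaf at the end of path (root first),
    updating the aggregates of every node along the way."""
    for node in path:
        node.agg.add(value)
    path[-1].values.append(value)


# Made-up two-level tree with two leaves.
leaf_a, leaf_b = Node(), Node()
root = Node(children=[leaf_a, leaf_b])
insert([root, leaf_a], 4)
insert([root, leaf_a], 1)
insert([root, leaf_b], 6)
print(root.agg)
```

The choice of which leaf receives a point is left to the underlying index (R-Tree, quadtree, etc.); only the aggregate update along the insertion path is added.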

8
MRA-Tree Example
[Figure: example MRA-Tree. The non-leaf node stores (min, max, count, sum) for each subtree; the leaf nodes store the data points.]
9
Progressive Algorithm Outline
  • We want
    • Best answer for a given time
    • Shortest time for a given precision of the answer
    • Refine an answer at will, trading time for
      precision
  • How we achieve it
    • Do a prioritized traversal of the nodes of the
      MRA-tree
    • Maintain an estimate of the answer, E(agg_Q)
    • Maintain a 100% confidence interval I = [L, H],
      such that L ≤ agg_Q ≤ H

10
Generic Algorithm (1)
  • Two sets of nodes
    • NP (partial contribution to the query)
    • NC (complete contribution)

11
Generic Algorithm (2)
  • Initialize NP with the root
  • At each iteration, remove one node N from NP and,
    for each child Nchild of N:
    • discard Nchild if it is disjoint from Q
    • insert Nchild into NP if Q is contained in
      Nchild or partially overlaps it
    • insert Nchild into NC if Q contains Nchild (we
      only need to maintain agg_NC)
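
The iteration described on this slide can be sketched in Python for COUNT. This is an illustration under stated assumptions, not the authors' implementation: nodes cover 1-D ranges rather than multi-dimensional regions, the `Node` fields are hypothetical, leaves are resolved by checking their points directly, and the traversal policy (explore the NP node with the largest count first) is an assumed heuristic.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    lo: float                 # lower end of the range covered by the node
    hi: float                 # upper end of the range covered by the node
    count: int                # COUNT aggregate of the subtree
    children: List["Node"] = field(default_factory=list)
    points: List[float] = field(default_factory=list)  # leaves only


def progressive_count(root: Node, q_lo: float, q_hi: float) -> Iterator[Tuple[int, int]]:
    """Yield bounds (L, H) on COUNT over Q = [q_lo, q_hi] after each step."""
    count_nc = 0      # only the aggregate of NC is kept, not the nodes
    np_set = [root]   # NP: nodes that may contribute partially
    while True:
        yield count_nc, count_nc + sum(n.count for n in np_set)
        if not np_set:
            return    # L == H: the answer is exact
        node = max(np_set, key=lambda n: n.count)  # assumed traversal policy
        np_set.remove(node)
        if not node.children:                      # leaf: check actual points
            count_nc += sum(q_lo <= p <= q_hi for p in node.points)
            continue
        for child in node.children:
            if child.hi < q_lo or child.lo > q_hi:
                continue                           # disjoint from Q: discard
            if q_lo <= child.lo and child.hi <= q_hi:
                count_nc += child.count            # Q contains child: goes to NC
            else:
                np_set.append(child)               # partial overlap: goes to NP


# Made-up tree: two leaves under one root.
a = Node(0, 49, 4, points=[5, 20, 30, 45])
b = Node(50, 100, 3, points=[55, 70, 95])
root = Node(0, 100, 7, children=[a, b])
steps = list(progressive_count(root, 25, 100))
print(steps)
```

A caller can stop consuming the generator as soon as the interval is narrow enough, which is how accuracy is traded for response time.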

12
Generic Algorithm (3)
  • To instantiate the algorithm for
    MIN, MAX, COUNT, SUM, AVG, we need:
  • Error bounds
    • Interval I = [L, H] such that L ≤ agg_Q ≤ H
  • Traversal policy
    • Which node from NP to explore next? Aim to
      minimize the width of I
  • Estimation
    • Provide an estimate of the answer, E(agg_Q)

13
MIN (and MAX)
  • Interval
    • min_NC = min{4, 5} = 4
    • min_NP = min{3, 9} = 3
    • L = min{min_NC, min_NP} = 3
    • H = min_NC = 4
    • hence I = [3, 4]
  • Estimate: the lower bound, E(min_Q) = L = 3
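
The MIN interval above can be reproduced with a short Python sketch. The function names are hypothetical, and `max_bounds` is the symmetric MAX case implied by the slide title rather than something spelled out on the slide.

```python
import math
from typing import Iterable, Tuple


def min_bounds(nc_mins: Iterable[float], np_mins: Iterable[float]) -> Tuple[float, float]:
    """Return (L, H) for MIN given the MIN aggregates of the NC and NP nodes."""
    min_nc = min(nc_mins, default=math.inf)  # certainly inside Q
    min_np = min(np_mins, default=math.inf)  # may or may not be inside Q
    return min(min_nc, min_np), min_nc


def max_bounds(nc_maxs: Iterable[float], np_maxs: Iterable[float]) -> Tuple[float, float]:
    """Symmetric case for MAX: L is max_NC, H also allows for the NP nodes."""
    max_nc = max(nc_maxs, default=-math.inf)
    max_np = max(np_maxs, default=-math.inf)
    return max_nc, max(max_nc, max_np)


low, high = min_bounds(nc_mins=[4, 5], np_mins=[3, 9])
print(low, high)  # 3 4, matching I = [3, 4]
```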
14
COUNT (and SUM)
  • Interval
    • count_NC = 9 + 6 = 15
    • count_NP = 8 + 10 = 18
    • L = count_NC = 15
    • H = count_NC + count_NP = 33
    • hence I = [15, 33]
  • Estimate: E(count_Q) = L + 0.25·8 + 0.2·10 = 19
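
A Python sketch of the COUNT arithmetic above. The function names are hypothetical, and the weights 0.25 and 0.2 are read here as the fraction of each NP node overlapped by Q (a uniformity assumption); that reading is an interpretation of the slide's numbers, not something the slide states outright.

```python
from typing import Iterable, Tuple


def count_bounds(nc_counts: Iterable[int], np_counts: Iterable[int]) -> Tuple[int, int]:
    """L counts only the NC nodes; H additionally counts every point in NP."""
    count_nc = sum(nc_counts)
    return count_nc, count_nc + sum(np_counts)


def count_estimate(nc_counts: Iterable[int],
                   np_nodes: Iterable[Tuple[int, float]]) -> float:
    """Estimate COUNT: exact NC part plus, for each NP node, its count
    weighted by the assumed fraction of the node overlapped by Q."""
    return sum(nc_counts) + sum(c * frac for c, frac in np_nodes)


bounds = count_bounds(nc_counts=[9, 6], np_counts=[8, 10])
estimate = count_estimate(nc_counts=[9, 6], np_nodes=[(8, 0.25), (10, 0.2)])
print(bounds, estimate)
```

SUM works the same way with the SUM aggregates in place of the counts, provided the values are non-negative so that L and H remain valid bounds.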
15
AVG
  • Interval
    • Current: avg_NC = 55/10 = 5.5
    • Maximum possible: (55 + 2·10) / (10 + 2) = 6.25
    • Minimum possible: (55 + 3·5) / (10 + 3) = 5.38
    • hence I = [5.38, 6.25]
  • Estimate: E(avg_Q) = E(sum_Q) / E(count_Q)
  • Distribution of values in A: 5, 5, 5, 10, 10
  • Traversal: explore the node N maximizing
    count_N · max{(max_N - avg_NC), (avg_NC - min_N)}

Node  min  max  count  sum
A     5    10   5      35
B     -    -    10     55
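
The AVG interval above can be checked with a small Python sketch. It assumes the values inside the NP node are known (as in the slide's distribution 5, 5, 5, 10, 10), which is a simplification for illustration; the function name and the greedy formulation are not from the slides.

```python
from typing import Callable, List, Tuple


def avg_bounds(sum_nc: float, count_nc: int, np_values: List[float]) -> Tuple[float, float]:
    """Bounds on AVG when any subset of np_values may fall inside Q.

    Requires count_nc > 0. The average is pushed up by adding the largest
    values while they exceed the running average, and pushed down by adding
    the smallest values while they are below it.
    """
    def extreme(values: List[float], helps: Callable[[float, float], bool]) -> float:
        s, c = sum_nc, count_nc
        for v in values:
            if not helps(v, s / c):
                break
            s, c = s + v, c + 1
        return s / c

    high = extreme(sorted(np_values, reverse=True), lambda v, avg: v > avg)
    low = extreme(sorted(np_values), lambda v, avg: v < avg)
    return low, high


# Node B (in NC): sum 55, count 10. Node A (in NP): values 5, 5, 5, 10, 10.
low, high = avg_bounds(sum_nc=55, count_nc=10, np_values=[5, 5, 5, 10, 10])
print(round(low, 2), high)
```

The maximum is reached by including the two 10s, (55 + 20) / 12, and the minimum by including the three 5s, (55 + 15) / 13.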
16
Experiments
  • Synthetic datasets: 2-4D
  • Real datasets: 2D spatial (USGS) and 4D (UCI KDD
    Forest Cover)
  • MRA-quadtree and MRA-Rtree indices
  • We study
    • MRA-tree vs. plain tree
    • MRA-tree vs. online sampling
    • Accuracy of estimation
    • Scalability with database size

17
MRA-Quadtree (Nodes Visited)
18
MRA-Quadtree (Error Reduction)
[Figure axis label: Absolute Relative Error]
19
MRA-Rtree (2D, USGS) I/O Performance
[Figure axis label: DB Size]
20
Estimation vs. Maximum Error (4D, Forest Cover,
sel. 16% / axis)
21
MRA-Rtree vs. Online Sampling: Estimation Accuracy
(4D, Forest Cover)
22
Database Size (3D Synthetic, exact, 10% spatial
sel.)
23
Summary
  • MRA-Tree is a modified multi-dimensional index
    for approximate answering of aggregate queries
  • For exact answers
    • Faster than a plain index
  • Advantages over offline estimators
    • Progressively improving answers
    • Error bounds
  • Advantages over sampling
    • Better estimate for the same I/O
  • Algorithm scales gracefully with database size

24
Future Work (QUASAR Project, UC Irvine)
  • Scalability with high dimensionality, by using a
    dedicated high-D index structure
  • Scalability in high update rate environments
  • Approximate query processing of general SQL
    queries using dedicated data structures, similar
    to MRA-tree