Space-Efficient Online Computation of Quantile Summaries - PowerPoint PPT Presentation

Space-Efficient Online Computation of Quantile Summaries. SIGMOD '01. Michael Greenwald & Sanjeev Khanna. Presented by ellery.

Slides: 32
Provided by: emoryEdu

1
Space-Efficient Online Computation of Quantile
Summaries
  • SIGMOD '01
  • Michael Greenwald & Sanjeev Khanna
  • Presented by ellery

2
Outline
  • Introduction
  • The summary data structure
  • Operation and algorithm
  • Tree representation
  • Analysis and experimental results
  • Conclusion

3
Introduction
  • Space-efficient computation of quantile summaries
    of very large data sets in a single pass.
  • Quantile queries: given a quantile φ, return the
    value whose rank is ⌈φN⌉.

4
t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3
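
For reference, the exact answer to a quantile query on the sixteen observations on this slide can be found by sorting, which is the result a streaming summary tries to approximate in far less memory. A minimal Python sketch follows; the function name exact_quantile is illustrative, and φ is assumed to lie in (0, 1] with ranks counted from 1.

```python
import math

def exact_quantile(values, phi):
    """Return the element of rank ceil(phi * N) (1-based) in sorted order."""
    ordered = sorted(values)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

# The sixteen observations t0..t15 from the slide.
stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
median = exact_quantile(stream, 0.5)   # rank ceil(0.5 * 16) = 8
```

Sorting needs all N observations in memory, which is exactly what the single-pass setting rules out.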
5
Requirements
  • Explicit, tunable, a priori guarantees on the
    precision of the approximation.
  • As small a memory footprint as possible.
  • Online: single pass over the data.
  • Data-independent performance: guarantees should
    be unaffected by arrival order, distribution of
    values, or cardinality of observations.
  • Data-independent setup: no a priori knowledge
    required about the data set (size, range,
    distribution, order).

6
ε-approximate
  • A quantile summary for a data sequence is
    ε-approximate if, for any given rank r, it returns
    a value whose rank r' is guaranteed to be within
    the interval [r − εN, r + εN].
  • Example: a data stream with 100 elements.
  • The 0.5 quantile with ε = 0.1 returns a value v.
  • The true rank of v is within [40, 60].
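
The arithmetic of the example can be checked with a few lines of Python. This is only a sketch of the definition; the function names are illustrative and not taken from the paper.

```python
def rank_interval(r, eps, n):
    """Allowed true ranks for a query at rank r: [r - eps*n, r + eps*n]."""
    return (r - eps * n, r + eps * n)

def is_acceptable(true_rank, r, eps, n):
    """True if a returned value with this true rank satisfies the guarantee."""
    lo, hi = rank_interval(r, eps, n)
    return lo <= true_rank <= hi

# Slide example: N = 100, 0.5 quantile (rank 50), eps = 0.1.
lo, hi = rank_interval(r=50, eps=0.1, n=100)
```

Any value whose true rank falls between lo and hi is an acceptable answer, so the summary is free to discard most observations as long as one such value remains.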

7
The Summary Data Structure
  • Let rmin(v) and rmax(v) denote the lower and
    upper bounds on the rank of v.
  • Each tuple t_i = (v_i, g_i, Δ_i).

8
Example
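ε = .01, N = 1750
[Figure: three consecutive tuples of the summary]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  201: (g, Δ) = (28, 7), ranks [529, 536]
  204: (g, Δ) = (10, 1), ranks [539, 540]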
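
The numbers on this slide can be read as three consecutive tuples (192 with g = 15, Δ = 2; 201 with g = 28, Δ = 7; 204 with g = 10, Δ = 1). In the Greenwald-Khanna summary, rmin(v_i) is the running sum of g up to i and rmax(v_i) = rmin(v_i) + Δ_i. The sketch below recomputes the rank bounds under that reading; rank_offset = 486 is an assumption standing in for the earlier tuples that the slide does not show.

```python
def rank_bounds(tuples, rank_offset=0):
    """Return (v, rmin, rmax) for each (v, g, delta), tuples sorted by v.

    rmin(v_i) is the running sum of g; rmax(v_i) = rmin(v_i) + delta_i.
    rank_offset is the (assumed) sum of g over tuples left of this fragment.
    """
    bounds = []
    rmin = rank_offset
    for v, g, delta in tuples:
        rmin += g
        bounds.append((v, rmin, rmin + delta))
    return bounds

fragment = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
bounds = rank_bounds(fragment, rank_offset=486)
# bounds == [(192, 501, 503), (201, 529, 536), (204, 539, 540)]
```

The result reproduces the rank intervals printed on the slide, which is why the pairing of values with (g, Δ) above seems to be the intended one.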
9
Query
  • Sketch S is ε-approximate; that is, for each
    φ ∈ (0, 1], there is a (v_i, rmin(v_i), rmax(v_i))
    in S such that
    ⌈φN⌉ − εN ≤ rmin(v_i) ≤ rmax(v_i) ≤ ⌈φN⌉ + εN.
  • v_i is our answer for the φ-quantile.
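
A query can be sketched as a linear scan that accumulates rmin and returns the first tuple whose rank bounds both lie within εN of r = ⌈φN⌉. The code assumes the summary is a list of (v, g, Δ) tuples sorted by value and covering all N observations; the function name and the None fallback are illustrative choices.

```python
import math

def query(summary, phi, eps, n):
    """Return a value whose rank bounds lie within eps*n of ceil(phi*n).

    summary: list of (v, g, delta) sorted by v, covering all n observations.
    Returns None if no tuple qualifies (the summary is not eps-approximate).
    """
    r = math.ceil(phi * n)
    rmin = 0
    for v, g, delta in summary:
        rmin += g
        rmax = rmin + delta
        if r - rmin <= eps * n and rmax - r <= eps * n:
            return v
    return None

# Toy summary that stores every value 1..10 exactly (g = 1, delta = 0).
exact = [(v, 1, 0) for v in range(1, 11)]
answer = query(exact, phi=0.5, eps=0.1, n=10)   # any of 4, 5, 6 is acceptable
```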

10
Corollary
  • If at any time n, the summary S(n) satisfies the
    property that max_i (g_i + Δ_i) ≤ 2εn,
  • then we can answer any φ-quantile query to
    within an εn precision.
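
In the Greenwald-Khanna paper the property in question is max_i (g_i + Δ_i) ≤ 2εn. A minimal check of that invariant, assuming (v, g, Δ) tuples and using values that appear elsewhere in this deck as sample input, might look like this:

```python
def satisfies_invariant(summary, eps, n):
    """True if every tuple (v, g, delta) has g + delta <= 2 * eps * n."""
    return all(g + delta <= 2 * eps * n for _, g, delta in summary)

fragment = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
ok = satisfies_invariant(fragment, eps=0.01, n=1800)    # threshold 2*eps*n = 36
```

The largest g + Δ in the fragment is 35, so the check passes with the threshold at 36.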

11
Overview of Summary Data Structure
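φ = .29, r = ⌈φN⌉ = 522
ε = .01, N = 1800
[Figure: three consecutive tuples of the summary]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  201: (g, Δ) = (28, 7), ranks [529, 536]
  204: (g, Δ) = (10, 1), ranks [539, 540]
  • Quantile φ = .29? Compute r and choose the best v_i.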

12
Overview of Summary Data Structure
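ε = .01, N = 1800, 2εN = 36
[Figure: three consecutive tuples of the summary]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  201: (g, Δ) = (28, 7), ranks [529, 536]
  204: (g, Δ) = (10, 1), ranks [539, 540]
  • If rmax(v_{i+1}) − rmin(v_i) ≤ 2εN for all i, then
    the summary is ε-approximate.
  • Our goal: always maintain this property.
  • Tuple formulation of this rule: g_i + Δ_i ≤ 2εN.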

13
Overview of Summary Data Structure
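ε = .01, N = 1800, 2εN = 36
[Figure: three consecutive tuples of the summary]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  201: (g, Δ) = (28, 7), ranks [529, 536]
  204: (g, Δ) = (10, 1), ranks [539, 540]
  • Goal: always maintain an ε-approximate summary:
    rmax(v_{i+1}) − rmin(v_i) = g_{i+1} + Δ_{i+1} ≤ 2εN.
  • Insert new observations into the summary.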

14
Overview of Summary Data Structure
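ε = .01, N = 1800, 2εN = 36
[Figure: the new observation 197 arrives between 192 and 201]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  197: new observation, ranks [502, 536]
  201: (g, Δ) = (28, 7), ranks [529, 536]
  204: (g, Δ) = (10, 1), ranks [539, 540]
  • Goal: always maintain an ε-approximate summary:
    rmax(v_{i+1}) − rmin(v_i) = g_{i+1} + Δ_{i+1} ≤ 2εN.
  • Insert new observations into the summary.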

15
Overview of Summary Data Structure
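ε = .01, N = 1801, 2εN = 36.02
[Figure: summary after inserting 197]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  197: (g, Δ) = (1, 34), ranks [502, 536]
  201: (g, Δ) = (28, 7), ranks [530, 537]
  204: (g, Δ) = (10, 1), ranks [540, 541]
  • Goal: always maintain an ε-approximate summary:
    rmax(v_{i+1}) − rmin(v_i) = g_{i+1} + Δ_{i+1} ≤ 2εN.
  • Insert new observations into the summary.
  • Insert the new tuple before the i-th tuple:
    g_new = 1, Δ_new = g_i + Δ_i − 1.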
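
The insertion rule on this slide (g_new = 1, Δ_new = g_i + Δ_i − 1, placed before the i-th tuple) can be sketched as follows. Treating a new minimum or maximum as exactly known (Δ = 0) follows the Greenwald-Khanna paper rather than this slide, and the function name is illustrative.

```python
import bisect

def insert(summary, v):
    """Insert observation v into a sorted list of (value, g, delta) tuples.

    The new tuple goes before the first tuple whose value exceeds v and gets
    g = 1 and delta = g_i + delta_i - 1, where i is that successor.
    A new minimum or maximum gets delta = 0 (assumption taken from the paper).
    Returns the index at which the tuple was inserted.
    """
    values = [t[0] for t in summary]
    i = bisect.bisect_right(values, v)        # smallest i with v < v_i
    if i == 0 or i == len(summary):
        delta = 0
    else:
        delta = summary[i][1] + summary[i][2] - 1
    summary.insert(i, (v, 1, delta))
    return i

summary = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
position = insert(summary, 197)
# summary[1] == (197, 1, 34), matching the (1, 34) pair shown on the slide
```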

16
Overview of Summary Data Structure
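ε = .01, N = 1801, 2εN = 36.02
[Figure: summary after inserting 197]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  197: (g, Δ) = (1, 34), ranks [502, 536]
  201: (g, Δ) = (28, 7), ranks [530, 537]
  204: (g, Δ) = (10, 1), ranks [540, 541]
  • Goal: always maintain an ε-approximate summary:
    rmax(v_{i+1}) − rmin(v_i) = g_{i+1} + Δ_{i+1} ≤ 2εN.
  • Insert new observations into the summary.
  • Delete all superfluous entries.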

17
Overview of Summary Data Structure
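ε = .01, N = 1801, 2εN = 36.02
[Figure: the tuple for 197, (g, Δ) = (1, 34), is being deleted]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  201: (g, Δ) = (28, 7), ranks [530, 537]
  204: (g, Δ) = (10, 1), ranks [540, 541]
  • Goal: always maintain an ε-approximate summary:
    rmax(v_{i+1}) − rmin(v_i) = g_{i+1} + Δ_{i+1} ≤ 2εN.
  • Insert new observations into the summary.
  • Delete all superfluous entries.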

18
Overview of Summary Data Structure
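ε = .01, N = 1801, 2εN = 36.02
[Figure: summary after deleting the tuple for 197]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  201: (g, Δ) = (29, 7), ranks [530, 537]
  204: (g, Δ) = (10, 1), ranks [540, 541]
  • Goal: always maintain an ε-approximate summary:
    rmax(v_{i+1}) − rmin(v_i) = g_{i+1} + Δ_{i+1} ≤ 2εN.
  • Insert new observations into the summary.
  • Delete all superfluous entries: g_i = g_i + g_{i-1}.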

19
Overview of Summary Data Structure
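ε = .01, N = 1801, 2εN = 36.02
[Figure: summary after deleting the tuple for 197]
  192: (g, Δ) = (15, 2), ranks [501, 503]
  201: (g, Δ) = (29, 7), ranks [530, 537]
  204: (g, Δ) = (10, 1), ranks [540, 541]
  • Insert: g_new = 1, Δ_new = g_i + Δ_i − 1.
  • Delete: g_i = g_i + g_{i-1}.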
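
The delete rule folds the g of the removed tuple into its right-hand neighbour and leaves that neighbour's Δ alone. A minimal sketch, assuming (value, g, Δ) tuples in a sorted Python list and an illustrative function name:

```python
def delete(summary, i):
    """Remove tuple i, adding its g to the tuple on its right.

    The right neighbour keeps its value and delta, so its rmin and rmax
    are unchanged; only the removed value is forgotten.
    """
    _, g, _ = summary[i]
    right_v, right_g, right_delta = summary[i + 1]
    summary[i + 1] = (right_v, right_g + g, right_delta)
    del summary[i]

summary = [(192, 15, 2), (197, 1, 34), (201, 28, 7), (204, 10, 1)]
delete(summary, 1)   # drop the tuple for 197
# summary == [(192, 15, 2), (201, 29, 7), (204, 10, 1)]
```

The tuple for 201 ends up with g = 29, the value the slides show after 197 is removed.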

20
Terminology
  • Full tuple: a tuple is full if g_i + Δ_i = 2εN.
  • Full tuple pair: a pair of tuples is full if
    deleting the left-hand tuple would overfill the
    right one.
  • Capacity: number of observations that can be
    counted by g_i before the tuple becomes full
    (2εN − Δ_i).

General strategy will be to delete tuples with
small capacity and preserve tuples with large
capacity.
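
These three terms translate directly into small predicates. In the sketch below, p stands for the threshold 2εN rounded down to an integer (an assumption made to keep the arithmetic exact), and the function names are illustrative.

```python
def capacity(delta, p):
    """Largest g the tuple can hold before it is full; p = floor(2*eps*N)."""
    return p - delta

def is_full(g, delta, p):
    """A tuple is full once g + delta reaches the threshold p."""
    return g + delta >= p

def is_full_pair(left, right, p):
    """Deleting `left` adds its g to `right`; the pair is full if that overfills `right`."""
    _, g_left, _ = left
    _, g_right, delta_right = right
    return g_left + g_right + delta_right > p

p = 36  # eps = .01, N = 1800
deletable = not is_full_pair((197, 1, 34), (201, 28, 7), p)   # 1 + 28 + 7 = 36
blocked = is_full_pair((192, 15, 2), (201, 29, 7), p)         # 15 + 29 + 7 = 51
```

With these sample tuples, 197 can be merged into 201, but 192 cannot be merged into the result without overfilling it.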
21
Operations
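  • Insert(v): find the smallest i such that
    v_{i-1} ≤ v < v_i, and insert the tuple
    (v, 1, g_i + Δ_i − 1) between t_{i-1} and t_i.
  • Delete(v_i): to delete (v_i, g_i, Δ_i) from S,
    replace (v_i, g_i, Δ_i) and (v_{i+1}, g_{i+1}, Δ_{i+1})
    by the new tuple (v_{i+1}, g_i + g_{i+1}, Δ_{i+1}).
  • Compress(): from right to left, merge all
    mergeable pairs.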

22
GK Algorithm
To add the (n+1)st observation, v, to summary S(n):
[Flowchart: a yes/no test; yes → COMPRESS(), then INSERT; no → INSERT directly]
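
The loop can be sketched end to end in Python. Several details are assumptions rather than content of this slide: the test that triggers COMPRESS did not survive in the transcript, so the sketch compresses every 1/(2ε) observations as in the Greenwald-Khanna paper; COMPRESS here ignores the band conditions and merges any pair that stays within 2εn; and the class and method names are illustrative.

```python
import bisect
import math
import random

class GKSummary:
    """Simplified Greenwald-Khanna summary (band conditions omitted)."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                          # (value, g, delta), sorted by value
        self.period = max(1, int(1 / (2 * eps)))  # assumed COMPRESS interval

    def add(self, v):
        if self.n > 0 and self.n % self.period == 0:
            self.compress()
        self._insert(v)
        self.n += 1

    def _insert(self, v):
        i = bisect.bisect_right([t[0] for t in self.tuples], v)
        if i == 0 or i == len(self.tuples):
            delta = 0                             # new minimum or maximum is exact
        else:
            delta = self.tuples[i][1] + self.tuples[i][2] - 1
        self.tuples.insert(i, (v, 1, delta))

    def compress(self):
        limit = 2 * self.eps * self.n
        i = len(self.tuples) - 2
        while i >= 1:                             # never delete the minimum or maximum
            _, g, _ = self.tuples[i]
            right_v, right_g, right_delta = self.tuples[i + 1]
            if g + right_g + right_delta <= limit:
                self.tuples[i + 1] = (right_v, g + right_g, right_delta)
                del self.tuples[i]
            i -= 1

    def query(self, phi):
        """Approximate phi-quantile; requires at least one observation."""
        r = math.ceil(phi * self.n)
        slack = self.eps * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if r - rmin <= slack and rmin + delta - r <= slack:
                return v
        return self.tuples[-1][0]

rng = random.Random(0)                            # fixed seed for repeatability
data = list(range(1, 1001))
rng.shuffle(data)
summary = GKSummary(eps=0.05)
for x in data:
    summary.add(x)
median = summary.query(0.5)                       # true rank within 500 ± 50
```

Because the data are the integers 1 to 1000, each value equals its own rank, so the returned median can be checked directly against the ±εN guarantee.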
23
Tree Representation
Δ-range   Capacity   Band
0-7       8-15       3
8-11      4-7        2
12-13     2-3        1
14        1          0
ε = .001, N = 7,000, 2εN = 14
[Figure: tree of 21 tuples labeled by band: six in band 0, eight in band 1, three in band 2, four in band 3]
  • Group tuples with similar capacities into bands
  • First (least index) node to the right with higher
    capacity band becomes parent.
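
Both rules can be sketched in a few lines. The band formula below, floor(log2(2εN − Δ + 1)), is an assumption chosen because it reproduces the table on this slide; it is not necessarily the paper's exact definition. Returning None for a tuple with no higher-band tuple to its right is also an illustrative choice.

```python
def band(delta, p):
    """Band of a tuple with the given delta, where p = 2*eps*N.

    Assumption: band = floor(log2(p - delta + 1)), which matches the
    table on the slide for p = 14.
    """
    return (p - delta + 1).bit_length() - 1

def parents(bands):
    """For each node, the index of the first node to its right with a higher band.

    None means no such node exists.
    """
    result = []
    for i, b in enumerate(bands):
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result

table = [band(d, 14) for d in (0, 7, 8, 11, 12, 13, 14)]   # [3, 3, 2, 2, 1, 1, 0]
tree = parents([0, 1, 0, 2, 1, 3])                          # [1, 3, 3, 5, 5, None]
```

A consequence of the parent rule is that each node's descendants form a contiguous run of lower-band tuples immediately to its left, so whole subtrees can be deleted as a single slice.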

24
Tree Representation
Δ-range   Capacity   Band
0-7       8-15       3
8-11      4-7        2
12-13     2-3        1
14        1          0
ε = .001, N = 7,000, 2εN = 14
[Figure: tree of 21 tuples labeled by band: four in band 3, six in band 0, eight in band 1, three in band 2]
  • Group tuples with similar capacities into bands
  • First (least index) node to the right with higher
    capacity band becomes parent.

27
Operation (compress)
  • General strategy: delete tuples with small
    capacity and preserve tuples with large capacity.
  • 1) Deletion cannot leave descendants unmerged;
    it must delete entire subtrees.
  • 2) Deletion can only merge a tuple with small
    capacity into a tuple with similar or larger
    capacity.
  • 3) Deletion cannot create an over-full tuple
    (i.e., with g + Δ > ⌊2εN⌋).
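
The three rules can be combined into one right-to-left pass. The sketch below is hedged in several ways: the band formula is the same assumption used to match the band table in this deck, a tuple's subtree is taken to be the contiguous run of lower-band tuples to its left, the minimum at index 0 is never deleted (a conservative choice of this sketch), and p stands for ⌊2εN⌋.

```python
def band(delta, p):
    """Assumed band formula: floor(log2(p - delta + 1))."""
    return (p - delta + 1).bit_length() - 1

def compress(summary, p):
    """One right-to-left pass over (v, g, delta) tuples; p = floor(2*eps*N)."""
    i = len(summary) - 2
    while i >= 1:
        _, g, delta = summary[i]
        right_v, right_g, right_delta = summary[i + 1]
        # Rule 2: merge only into a tuple of similar or larger capacity.
        if band(delta, p) <= band(right_delta, p):
            # Rule 1: take the whole subtree (lower-band run to the left).
            start, g_star = i, g
            while start - 1 >= 1 and band(summary[start - 1][2], p) < band(delta, p):
                start -= 1
                g_star += summary[start][1]
            # Rule 3: the merged tuple must not be over-full.
            if g_star + right_g + right_delta <= p:
                summary[start:i + 2] = [(right_v, g_star + right_g, right_delta)]
                i = start
        i -= 1
    return summary

merged = compress([(1, 1, 0), (5, 2, 13), (7, 3, 12), (9, 4, 8), (20, 1, 0)], p=14)
# merged == [(1, 1, 0), (20, 10, 0)]
```

In the example, the tuple for 9 (band 2) and its two band-1 descendants are folded into the tuple for 20 in a single step, and the total of the g values stays at 11.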

28
Analysis
  • Theorem:
  • At any time n, the total number of tuples
    stored in S(n) is at most (11/(2ε)) log(2εn).
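
The bound proved in the Greenwald-Khanna paper is (11/(2ε)) log(2εn) tuples. The sketch below evaluates it for sample parameters; taking the logarithm base 2 is an assumption, and the function name is illustrative.

```python
import math

def gk_tuple_bound(eps, n):
    """Worst-case tuple count (11 / (2*eps)) * log2(2*eps*n)."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)

# Sample parameters (not from the slides): eps = 0.01 over a billion observations.
bound = gk_tuple_bound(eps=0.01, n=10**9)
```

The bound grows only logarithmically in n, so for long streams it is far smaller than the number of observations.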

29
Experimental Results
  • Measurement:
  • |S|
  • Observed ε (vs. desired ε): max, avg, and for 16
    representative quantiles
  • Optimal max observed ε
  • Compared 3 algorithms:
  • MRL
  • Preallocated (1/3 as many stored observations
    as MRL)
  • Adaptive: allocate a new quantile only when
    observed error is about to exceed desired ε

30
Conclusion
  • Better worst-case behavior than previous
    algorithms
  • It does not require a priori knowledge of the
    parameter N

31
Any questions?