Title: Space-Efficient Online Computation of Quantile Summaries
1Space-Efficient Online Computation of Quantile
Summaries
- SIGMOD 01
- Michael Greenwald Sanjeev Khanna
- Presented by ellery
2Outline
- Introduction
- The summary data structure
- Operation and algorithm
- Tree representation
- Analysis and experimental result
- Conclusion
3Introduction
- Space-efficient computation of quantile summaries
of very large data sets in a single pass.
- Quantile queries: given a quantile φ, return the
value whose rank is ⌈φN⌉
4t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3
5Requirements
- Explicit, tunable, a priori guarantees on the
precision of the approximation
- As small a memory footprint as possible
- Online: single pass over the data
- Data-independent performance: guarantees should
be unaffected by arrival order, distribution of
values, or cardinality of observations.
- Data-independent setup: no a priori knowledge
required about the data set (size, range,
distribution, order).
6ε-approximate
- A quantile summary for a data sequence is
ε-approximate if, for any given rank r, it returns
a value whose rank r' is guaranteed to be within
the interval [r - εN, r + εN]
- Example: a data stream with 100 elements
- The 0.5 quantile with ε = 0.1 returns a value v.
- The true rank of v is within [40, 60]
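The example above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `rank_interval` is hypothetical, and it assumes the target rank is r = ⌈φN⌉ as in the quantile-query definition.

```python
import math

def rank_interval(phi, n, eps):
    """Return (r, lo, hi): the target rank r = ceil(phi * n) and the
    interval [r - eps*n, r + eps*n] allowed for the returned value's rank."""
    r = math.ceil(phi * n)
    slack = eps * n
    return r, r - slack, r + slack

# The slide's example: 100 elements, 0.5 quantile, eps = 0.1.
r, lo, hi = rank_interval(0.5, 100, 0.1)
print(r, lo, hi)
```

Any value whose true rank falls between `lo` and `hi` is an acceptable answer for that query.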
7The Summary Data Structure
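- Let rmin(v) and rmax(v) denote the lower and
upper bounds on the rank of v
- Each tuple t_i = (v_i, g_i, Δ_i)
- g_i = rmin(v_i) - rmin(v_{i-1})
- Δ_i = rmax(v_i) - rmin(v_i)
- rmin(v_i) = Σ_{j ≤ i} g_j
- rmax(v_i) = Σ_{j ≤ i} g_j + Δ_i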
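As a rough sketch of how rank bounds are recovered from the stored tuples, the snippet below assumes the Greenwald-Khanna definitions g_i = rmin(v_i) - rmin(v_{i-1}) and Δ_i = rmax(v_i) - rmin(v_i). The names `Entry`, `rank_bounds` and `offset` are hypothetical. The three tuples are the ones shown on the deck's Example slide, and the offset of 486 is simply the rank mass implied for the tuples to their left (501 - 15).

```python
from dataclasses import dataclass

@dataclass
class Entry:
    v: float    # observed value
    g: int      # rmin(v_i) - rmin(v_{i-1})
    delta: int  # rmax(v_i) - rmin(v_i)

def rank_bounds(entries, offset=0):
    """Return [(v, rmin, rmax)] for entries sorted by value.

    offset is the sum of g over all tuples that precede the given
    window (an assumption used here to show only part of a summary)."""
    bounds = []
    rmin = offset
    for e in entries:
        rmin += e.g                       # rmin is the running sum of g
        bounds.append((e.v, rmin, rmin + e.delta))
    return bounds

window = [Entry(192, 15, 2), Entry(201, 28, 7), Entry(204, 10, 1)]
bounds = rank_bounds(window, offset=486)
for row in bounds:
    print(row)
```

Only v, g and Δ are stored; rmin and rmax are derived on demand, which is what lets an insertion shift the ranks of all later tuples without touching them.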
8Example
ε = .01, N = 1750
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
201     (28, 7)   [529, 536]
204     (10, 1)   [539, 540]
9Query
- Sketch S is ε-approximate. That is, for each φ in
(0,1], there is a (v_i,
rmin(v_i), rmax(v_i)) in S such that
⌈φN⌉ - εN ≤ rmin(v_i) and rmax(v_i) ≤ ⌈φN⌉ + εN
- v_i is our answer for the φ-quantile
10 Corollary
- If at any time n, the summary S(n) satisfies the
property that max_i (g_i + Δ_i) ≤ 2εn,
- then we can answer any φ-quantile query to
within an εn precision.
11Overview of Summary Data Structure
φ = .29
r = φN = 522
ε = .01, N = 1800
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
201     (28, 7)   [529, 536]
204     (10, 1)   [539, 540]
- Quantile φ = .29? Compute r and choose the best v_i
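The lookup on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `query` is hypothetical, the summary is given directly as (v, rmin, rmax) triples, and the acceptance test assumed is r - εN ≤ rmin(v_i) and rmax(v_i) ≤ r + εN.

```python
def query(summary, r, eps, n):
    """Return the first value whose rank bounds lie within eps*n of r.

    summary is a list of (v, rmin, rmax) triples sorted by v.
    Returns None if no triple qualifies."""
    slack = eps * n
    for v, rmin, rmax in summary:
        if r - rmin <= slack and rmax - r <= slack:
            return v
    return None

# The three tuples visible on the slide; r = 522 for phi = .29, N = 1800.
summary = [(192, 501, 503), (201, 529, 536), (204, 539, 540)]
answer = query(summary, r=522, eps=0.01, n=1800)
print(answer)
```

With εN = 18, the value 192 is rejected because its rmin of 501 is 21 ranks below r, while 201 has both bounds inside [504, 540].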
12Overview of Summary Data Structure
ε = .01, N = 1800
2εN = 36
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
201     (28, 7)   [529, 536]
204     (10, 1)   [539, 540]
- If (rmax(v_{i+1}) - rmin(v_i)) ≤ 2εN, then
ε-approximate summary.
- Our goal: always maintain this property.
- Tuple formulation of this rule: g_i + Δ_i ≤ 2εN
13Overview of Summary Data Structure
ε = .01, N = 1800
2εN = 36
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
201     (28, 7)   [529, 536]
204     (10, 1)   [539, 540]
- Goal: always maintain ε-approximate
summary: (rmax(v_{i+1}) - rmin(v_i)) = (g_{i+1} + Δ_{i+1}) ≤ 2εN
- Insert new observations into summary
14Overview of Summary Data Structure
ε = .01, N = 1800
2εN = 36
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
197     (new)     [502, 536]
201     (28, 7)   [529, 536]
204     (10, 1)   [539, 540]
- Goal: always maintain ε-approximate
summary: (rmax(v_{i+1}) - rmin(v_i)) = (g_{i+1} + Δ_{i+1}) ≤ 2εN
- Insert new observations into summary
15Overview of Summary Data Structure
ε = .01, N = 1801
2εN = 36.02
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
197     (1, 34)   [502, 536]
201     (28, 7)   [530, 537]
204     (10, 1)   [540, 541]
- Goal: always maintain ε-approximate
summary: (rmax(v_{i+1}) - rmin(v_i)) = (g_{i+1} + Δ_{i+1}) ≤ 2εN
- Insert new observations into summary
- Insert tuple before the ith tuple: g_new = 1; Δ_new = g_i + Δ_i - 1
16Overview of Summary Data Structure
ε = .01, N = 1801
2εN = 36.02
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
197     (1, 34)   [502, 536]
201     (28, 7)   [530, 537]
204     (10, 1)   [540, 541]
- Goal: always maintain ε-approximate
summary: (rmax(v_{i+1}) - rmin(v_i)) = (g_{i+1} + Δ_{i+1}) ≤ 2εN
- Insert new observations into summary
- Delete all superfluous entries.
17Overview of Summary Data Structure
ε = .01, N = 1801
2εN = 36.02
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
(197)   (1, 34)   (entry being deleted)
201     (28, 7)   [530, 537]
204     (10, 1)   [540, 541]
- Goal: always maintain ε-approximate
summary: (rmax(v_{i+1}) - rmin(v_i)) = (g_{i+1} + Δ_{i+1}) ≤ 2εN
- Insert new observations into summary
- Delete all superfluous entries.
18Overview of Summary Data Structure
ε = .01, N = 1801
2εN = 36.02
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
201     (29, 7)   [530, 537]
204     (10, 1)   [540, 541]
- Goal: always maintain ε-approximate
summary: (rmax(v_{i+1}) - rmin(v_i)) = (g_{i+1} + Δ_{i+1}) ≤ 2εN
- Insert new observations into summary
- Delete all superfluous entries: g_i = g_i + g_{i-1}
19Overview of Summary Data Structure
ε = .01, N = 1801
2εN = 36.02
value   (g, Δ)    [rmin, rmax]
192     (15, 2)   [501, 503]
201     (29, 7)   [530, 537]
204     (10, 1)   [540, 541]
- Insert: g_new = 1; Δ_new = g_i + Δ_i - 1
- Delete: g_i = g_i + g_{i-1}
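The two update rules can be replayed on the numbers from the preceding slides. The sketch below is illustrative only: tuples are plain (v, g, Δ) triples and the helper names `insert_before` and `delete` are hypothetical.

```python
def insert_before(tuples, i, v):
    """Insert v before tuple i with g_new = 1, delta_new = g_i + delta_i - 1."""
    _, g_i, d_i = tuples[i]
    tuples.insert(i, (v, 1, g_i + d_i - 1))

def delete(tuples, i):
    """Delete tuple i by adding its g to its right-hand neighbour."""
    _, g_i, _ = tuples[i]
    v_next, g_next, d_next = tuples[i + 1]
    tuples[i:i + 2] = [(v_next, g_i + g_next, d_next)]

s = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
insert_before(s, 1, 197)       # 197 arrives between 192 and 201
after_insert = list(s)
delete(s, 1)                   # 197 is superfluous and is merged into 201
print(after_insert)
print(s)
```

The insertion yields (197, 1, 34) and the deletion turns 201's tuple into (201, 29, 7), matching the slides. Neither step changes Δ of the surviving right-hand tuple, so its rank bounds stay valid.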
20Terminology
- Full tuple: a tuple is full if g_i + Δ_i = 2εN
- Full tuple pair: a pair of tuples is full if
deleting the left-hand tuple would overfill the
right one
- Capacity: number of observations that can be
counted by g_i before the tuple becomes full
(2εN - Δ_i)
General strategy will be to delete tuples with
small capacity and preserve tuples with large
capacity.
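These terms translate into small predicates. The sketch below makes two assumptions that are not spelled out on this slide: the threshold is taken as ⌊2εN⌋, and "full" is tested with ≥ so that it also works when 2εN is not an integer. All function names are hypothetical, and tuples are (v, g, Δ) triples.

```python
import math

def threshold(eps, n):
    """Largest allowed g + delta, taken here as floor(2 * eps * n)."""
    return math.floor(2 * eps * n)

def capacity(delta, eps, n):
    """Observations g can still absorb before the tuple becomes full."""
    return threshold(eps, n) - delta

def is_full(g, delta, eps, n):
    return g + delta >= threshold(eps, n)

def is_full_pair(left, right, eps, n):
    """True if deleting `left` would overfill `right`."""
    (_, g_l, _), (_, g_r, d_r) = left, right
    return g_l + g_r + d_r > threshold(eps, n)

eps, n = 0.01, 1801
print(capacity(34, eps, n), capacity(7, eps, n))
print(is_full_pair((197, 1, 34), (201, 28, 7), eps, n))
print(is_full_pair((192, 15, 2), (201, 29, 7), eps, n))
```

On the overview example, the pair (197, 201) is not full, so 197 may be deleted, whereas deleting 192 afterwards would overfill 201's tuple.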
21Operations
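- Insert(v): find the smallest i such that
v_{i-1} ≤ v < v_i, and insert the tuple
(v, 1, g_i + Δ_i - 1) before t_i
- Delete(v_i): to delete (v_i, g_i, Δ_i) from S,
replace (v_i, g_i, Δ_i) and (v_{i+1}, g_{i+1}, Δ_{i+1})
by the new tuple (v_{i+1}, g_i + g_{i+1}, Δ_{i+1})
- Compress(): from right to left, merge all
mergeable pairs.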
22GK Algorithm
To add the (n+1)st observation, v, to summary S(n):
(Flowchart: a yes/no test; on yes, COMPRESS(); then INSERT(v).)
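Putting the pieces together, here is a simplified sketch in the spirit of the algorithm, not the authors' implementation. The class name `GKSummary` and its layout are hypothetical. Assumptions:

- New tuples use the deck's rule Δ_new = g_i + Δ_i - 1, with Δ = 0 for a new minimum or maximum.
- COMPRESS runs once every ⌊1/(2ε)⌋ insertions. The flowchart's own test did not survive extraction, so this trigger is an assumption.
- COMPRESS merges any adjacent pair that stays within ⌊2εn⌋ and ignores bands, so the sketch keeps the ε guarantee but makes no claim about the paper's space bound.

```python
import math
import random

class GKSummary:
    """Simplified Greenwald-Khanna-style quantile summary (no bands)."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                          # [v, g, delta], sorted by v
        self.period = max(1, int(1 / (2 * eps)))  # assumed compress interval

    def insert(self, v):
        if self.n > 0 and self.n % self.period == 0:
            self.compress()
        i = 0
        while i < len(self.tuples) and self.tuples[i][0] <= v:
            i += 1
        if i == 0 or i == len(self.tuples):
            delta = 0                             # new min or max: rank is exact
        else:
            delta = self.tuples[i][1] + self.tuples[i][2] - 1
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

    def compress(self):
        limit = math.floor(2 * self.eps * self.n)
        # Right to left; index 0 (the minimum) is never deleted.
        for i in range(len(self.tuples) - 2, 0, -1):
            g = self.tuples[i][1]
            nxt = self.tuples[i + 1]
            if g + nxt[1] + nxt[2] <= limit:
                nxt[1] += g                       # fold g into the right neighbour
                del self.tuples[i]

    def query(self, phi):
        """Return a value whose rank is within eps*n of ceil(phi*n), phi in (0, 1]."""
        r = math.ceil(phi * self.n)
        slack = self.eps * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if r - rmin <= slack and rmin + delta - r <= slack:
                return v
        return self.tuples[-1][0]

rng = random.Random(0)
data = list(range(1, 5001))                       # the rank of value x is x
rng.shuffle(data)
gk = GKSummary(0.01)
for x in data:
    gk.insert(x)
print(len(gk.tuples), gk.query(0.5))
```

Because the input is a permutation of 1..5000, the error of an answer is just its distance from ⌈φn⌉, which makes the guarantee easy to check by hand.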
23Tree Representation
Δ-range   Capacity   Band
0-7       8-15       3
8-11      4-7        2
12-13     2-3        1
14        1          0
ε = .001, N = 7,000
2εN = 14
(Figure: tree of 21 tuples labelled by band: six in band 0, eight in band 1, three in band 2, four in band 3.)
- Group tuples with similar capacities into bands
- First (least index) node to the right with higher
capacity band becomes parent.
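The banding and parent rule can be sketched as below. Two caveats: the `band` function is fitted to the table on this slide (band = ⌊log2(2εN - Δ + 1)⌋), which is an assumption and not necessarily the paper's exact definition, and the sequence of bands passed to `parents` is a made-up example, since the order of tuples in the figure is not recoverable.

```python
def band(delta, p):
    """Band of a tuple with the given delta, where p = 2*eps*N (an integer).

    Fitted to the slide's table: capacity p - delta + 1 in [2^b, 2^(b+1))."""
    return (p - delta + 1).bit_length() - 1

def parents(bands):
    """For each tuple, the index of the first tuple to its right with a
    strictly higher band, or None if there is none (a root)."""
    result = []
    for i, b in enumerate(bands):
        parent = None
        for j in range(i + 1, len(bands)):
            if bands[j] > b:
                parent = j
                break
        result.append(parent)
    return result

table_bands = [band(d, 14) for d in (0, 7, 8, 11, 12, 13, 14)]
tree = parents([0, 1, 0, 0, 2, 1, 3])   # hypothetical band sequence
print(table_bands)
print(tree)
```

Under this rule the descendants of a tuple always form a contiguous run immediately to its left, which is what makes deleting a whole subtree a simple slice of the summary.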
24Tree Representation
Δ-range   Capacity   Band
0-7       8-15       3
8-11      4-7        2
12-13     2-3        1
14        1          0
ε = .001, N = 7,000
2εN = 14
(Figure: tree of 21 tuples labelled by band: six in band 0, eight in band 1, three in band 2, four in band 3.)
- Group tuples with similar capacities into bands
- First (least index) node to the right with higher
capacity band becomes parent.
25Tree Representation
Δ-range   Capacity   Band
0-7       8-15       3
8-11      4-7        2
12-13     2-3        1
14        1          0
ε = .001, N = 7,000
2εN = 14
- Group tuples with similar capacities into bands
- First (least index) node to the right with higher
capacity band becomes parent.
27Operation (compress)
- General strategy: delete tuples with small
capacity and preserve tuples with large capacity.
- 1) Deletion cannot leave descendants unmerged:
it must delete entire subtrees
- 2) Deletion can only merge a tuple with small
capacity into a tuple with similar or larger
capacity.
- 3) Deletion cannot create an over-full tuple
(i.e. with g + Δ > ⌊2εN⌋)
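Rules 2 and 3 can be expressed as a test on a single adjacent pair. The sketch below deliberately leaves out rule 1 (whole-subtree deletion), so it is not a full COMPRESS. The `band` helper is fitted to the Tree Representation table and is an assumption, as is the name `can_merge`. Tuples are (v, g, Δ) triples.

```python
import math

def band(delta, p):
    """Band fitted to the Tree Representation table (an assumption)."""
    return (p - delta + 1).bit_length() - 1

def can_merge(left, right, eps, n):
    """True if `left` may be merged into `right` under rules 2 and 3."""
    p = math.floor(2 * eps * n)
    (_, g_l, d_l), (_, g_r, d_r) = left, right
    similar_or_larger = band(d_l, p) <= band(d_r, p)   # rule 2
    not_overfull = g_l + g_r + d_r <= p                # rule 3
    return similar_or_larger and not_overfull

eps, n = 0.001, 7000                                   # 2*eps*n = 14
print(can_merge((5, 1, 13), (6, 1, 8), eps, n))
print(can_merge((5, 1, 14), (6, 1, 13), eps, n))
print(can_merge((5, 1, 2), (6, 1, 12), eps, n))
```

The first pair is mergeable, the second would create an over-full tuple (rule 3), and the third would merge a high-capacity tuple into a low-capacity one (rule 2).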
28Analysis
- Theorem
- At any time n, the total number of tuples
stored in S(n) is at most (11/(2ε)) log(2εn)
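To get a feel for the size of this bound, the snippet below evaluates (11/(2ε)) log(2εn) for one hypothetical setting (ε = 0.001, n = 10^9). The logarithm is taken base 2, which is an assumption about the intended base, and the function name `gk_tuple_bound` is illustrative.

```python
import math

def gk_tuple_bound(eps, n):
    """Worst-case tuple count (11 / (2*eps)) * log2(2*eps*n)."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)

bound = gk_tuple_bound(0.001, 10**9)
print(round(bound))
```

For these values the bound is a little over a hundred thousand tuples against a billion observations. The bound only becomes useful once n is large relative to 1/ε.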
29Experimental Result
- Measurement
- |S|
- Observed ε (vs. desired ε): max, avg, and for 16
representative quantiles
- Optimal max observed ε
- Compared 3 algorithms
- MRL
- Preallocated (1/3 the number of stored observations
of MRL)
- Adaptive: allocate a new quantile only when
observed error is about to exceed desired ε
30Conclusion
- Better worst-case behavior than previous
algorithms
- It does not require a priori knowledge of the
parameter N
31Any Questions?