Title: How to Summarize the Universe: Dynamic Maintenance of Quantiles
1How to Summarize the Universe: Dynamic Maintenance of Quantiles
- By
- Anna C. Gilbert
- Yannis Kotidis
- S. Muthukrishnan
- Martin J. Strauss
2Quantiles
- Median, quartiles,
- The general case
- Uses
- Statistics
- Estimating result set size
- Partitioning
3Computing static quantiles
- Blum, Floyd, Pratt, Rivest, Tarjan
- Find the ith element
- Comparison based
- Similar to QuickSort
- O(n) worst case time
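A minimal sketch of the selection idea named on this slide, using the classical median-of-medians pivot with groups of five. The function name `select` and the 0-based rank are assumptions made for illustration, not something the slides specify.

```python
def select(items, i):
    """Return the i-th smallest element (0-based) in O(n) worst-case time."""
    items = list(items)
    while True:
        if len(items) <= 5:
            return sorted(items)[i]
        # Pivot: the median of the medians of groups of five elements.
        groups = [items[k:k + 5] for k in range(0, len(items), 5)]
        medians = [sorted(g)[len(g) // 2] for g in groups]
        pivot = select(medians, len(medians) // 2)
        # Partition around the pivot, as QuickSort would, but keep one side only.
        lower = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if i < len(lower):
            items = lower
        elif i < len(lower) + len(equal):
            return pivot
        else:
            i -= len(lower) + len(equal)
            items = [x for x in items if x > pivot]
```

Note that the sketch still reads and stores all n items, which is exactly the cost the next slide objects to.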
4Problems with massive data sets
- O(n) time not good enough
- O(n) space usually not affordable
- Dynamic environment
- Cancellations are especially troublesome
- Usually recomputed periodically
- May be very inaccurate until recomputed
Some kind of approximation is the only choice !
5Common approaches
- Deterministically chosen sample
- Randomization: probability of failure
- Maintaining a backing sample
- Wavelets
- Most of the above approaches work well for the
incremental case, but deletions may cause
inaccuracy.
6GK: Greenwald-Khanna (01)
- Fill the available memory with values.
- Maintain rank ranges on the values in memory.
- When a new value is inserted, kick a value out of memory.
- Insert-only algorithm.
- Can be extended to support deletes (GK2).
- Maintain two instances: one for insertions and one for deletions.
7Maintenance of Equi-Depth Histograms (using a
backing sample)
- Gibbons, Matias, Poosala 97
- Scan the dataset and choose values for the sample using the reservoir method.
- Treat insertions as a continuous scan.
- When a deletion from the sample is necessary, rescan only if the number of items drops below a specified minimum.
- Works well for a mostly-insertions environment.
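The reservoir method mentioned above can be sketched as follows. This is the textbook version for an insert-only scan, not the authors' full histogram maintenance. The name `reservoir_sample` and the `rng` parameter are assumptions for illustration.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for n, value in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir.
            sample.append(value)
        else:
            # The n-th item enters with probability k/n and evicts a random slot.
            slot = rng.randrange(n)
            if slot < k:
                sample[slot] = value
    return sample
```

Deletions have no such simple rule, which is why the slide falls back on rescanning once the sample shrinks too far.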
8The authors' main result
- The RSS algorithm
- RSS: Random Subset Sum
- Space polylogarithmic in universe size
- Time proportional to the space used
- A priori guarantee of accuracy within a user-specified error ε, with a user-specified probability of failure δ.
9Some formalism
- The universe: U = {0, ..., |U| − 1}
- Number of tuples in the data set: |A| = N
- The data set can be thought of as an array: A[i] = number of tuples with value i
- Our goal for computing φ-quantiles: find a j_k
such that
10Some assumptions
- The universe's size is known
- Later we'll throw that assumption away
- Update = Delete + Insert
11Computing quantiles
- Let's say A[i] is known for every i.
- Easy to maintain through updates
- Summing up array items?
- Not a very good complexity
12Computing quantiles (cont.)
- We need a method of reducing summation overhead.
- We should be able to compute any sum of items in A in logarithmic time.
- The solution: keeping computed sums of intervals.
13Dyadic intervals - defined
- Atomic dyadic interval: a single point.
- I(j,k) = [k·2^(log(U)−j), (k+1)·2^(log(U)−j) − 1]
- j: resolution level
- Example (U = 8, points 0 to 7):
- Level 3: I(3,0), ..., I(3,7), one interval per point
- Level 2: I(2,0), ..., I(2,3)
- Level 1: I(1,0), I(1,1)
- Level 0: I(0,0), the whole universe
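A small sketch that enumerates the dyadic intervals of this example, assuming I(j,k) covers the points from k·2^(log(U)−j) to (k+1)·2^(log(U)−j) − 1 inclusive. The helper names `dyadic_interval` and `all_dyadic_intervals` are hypothetical.

```python
def dyadic_interval(j, k, log_u):
    """Closed range (lo, hi) covered by I(j,k) in a universe of size 2**log_u."""
    width = 1 << (log_u - j)
    return k * width, (k + 1) * width - 1

def all_dyadic_intervals(log_u):
    """Map every (j, k) to its range, for levels j = 0 .. log_u."""
    return {(j, k): dyadic_interval(j, k, log_u)
            for j in range(log_u + 1)
            for k in range(1 << j)}

intervals = all_dyadic_intervals(3)   # U = 8, as in the example
print(intervals[(1, 0)], intervals[(2, 2)], intervals[(3, 6)])
# prints: (0, 3) (4, 5) (6, 6)
```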
14Computing an arbitrary interval
- Let's say we have sums for all dyadic intervals as in the above example.
- We want to compute A[0,6].
- A[0,6] = I(1,0) + I(2,2) + I(3,6)
[Figure: the dyadic intervals over the points 0 to 7, from I(3,0), ..., I(3,7) down to I(0,0).]
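The decomposition on this slide can be sketched as a greedy walk from left to right, taking the largest aligned block that still fits. The function name `dyadic_cover` is an assumption, and so is the convention that the range is closed on both ends.

```python
def dyadic_cover(lo, hi, log_u):
    """Split the closed range [lo, hi] into disjoint dyadic intervals (j, k)."""
    pieces = []
    while lo <= hi:
        # Double the block starting at lo while it stays aligned and inside the range.
        width = 1
        while lo % (2 * width) == 0 and lo + 2 * width - 1 <= hi:
            width *= 2
        j = log_u - (width.bit_length() - 1)   # width == 2**(log_u - j)
        pieces.append((j, lo // width))
        lo += width
    return pieces

print(dyadic_cover(0, 6, 3))
# prints: [(1, 0), (2, 2), (3, 6)]
```

Each level contributes at most two pieces, so the cover has O(log(U)) intervals.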
15Dyadic intervals - observations
- log(U) + 1 resolution levels
- 2U - 1 dyadic intervals altogether
- O(U) space needed to keep them all
- O(log(U)) time needed to compute any arbitrary
interval.
16Computing quantiles (Cont.)
- We can now efficiently compute any arbitrary interval in A.
- A φ-quantile for any k can be computed thus:
- We need a j_k s.t. A[0, j_k) < kφN < A[0, j_k + 1)
- Use binary search to find it!
17But
- Keeping O(U) of data presents a real space complexity problem.
- We need a way of estimating A[i] on demand.
- And also of estimating any dyadic interval on demand.
18Introducing random sets
- Let S be a random set of values from U.
- Each value has a probability of ½ of being in S.
- Expectation of the number of items in S is ½U.
19Random subset sums
- Define A_S as the number of items in A with values in S.
- The expectation of A_S is ½|A| = ½N.
- Now consider only subsets S containing a certain value i.
20Random subset sums (cont.)
- Suppose we keep a number of random sets S, each containing random values from U, each with probability ½.
- We maintain A_S for each such set.
- Easy to maintain during updates.
- How can we now estimate A[i]?
21Random subset sums (cont.)
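- We can estimate A[i] for any i with A[i] ≈ 2A_S − |A|, where S is a random set containing i.
- Proof: given that i is in S, every other value is in S with probability ½, so the expectation of A_S is A[i] + ½(|A| − A[i]), and the expectation of 2A_S − |A| is A[i].
- The authors prove that repeating the process O(1/ε²) times yields the required accuracy.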
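A minimal sketch of the estimator, assuming the random sets are stored explicitly as membership tables and that repetitions are combined by a plain average. The class name `SubsetSums` and its methods are hypothetical. Only the rule 2·A_S − |A| over sets containing i comes from the slides.

```python
import random

class SubsetSums:
    """Counters A_S for a few explicit random subsets S of {0, ..., u-1}."""

    def __init__(self, u, num_sets, rng):
        # Each value joins each set independently with probability 1/2.
        self.member = [[rng.getrandbits(1) == 1 for _ in range(u)]
                       for _ in range(num_sets)]
        self.a_s = [0] * num_sets   # A_S for each set
        self.n = 0                  # |A|, the total number of tuples

    def update(self, value, delta=1):
        """Insert (delta > 0) or delete (delta < 0) tuples with this value."""
        self.n += delta
        for s, member in enumerate(self.member):
            if member[value]:
                self.a_s[s] += delta

    def estimate(self, value):
        """Average of 2*A_S - |A| over the sets that contain value."""
        hits = [2 * a - self.n
                for a, member in zip(self.a_s, self.member) if member[value]]
        return sum(hits) / len(hits)

sketch = SubsetSums(8, 64, random.Random(7))
sketch.update(3, 100)   # 100 tuples with value 3
sketch.update(5, 4)     # 4 tuples with value 5
print(sketch.estimate(3))
```

In this example every single estimate of A[3] is either 96 or 104, depending on whether 5 fell into the same set, so the average lands within 4 of the true count 100. Deletions simply subtract from the same counters, which is why the order of updates does not matter.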
22Random subset sums (cont.)
- We can also estimate any dyadic interval I(j,k) using the same method.
- Improvement: we can compute the sums for dyadic intervals from a certain level.
- We can now estimate any arbitrary interval in the universe.
23Space Considerations
- Keeping a set of expected size ½U is still O(U).
- We need a method of keeping a set without actually keeping it.
- The technique: instead of sets, keep random seeds of size O(log U) bits and compute whether a given i ∈ S on demand.
24Extended Hamming Code
- Used for generating the random sets.
- Provides sufficient randomness
- For example:
- U = 8
- Seed size: log U + 1 = 4
- G(seed, i) = seed × (i-th column)
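The generator matrix itself did not survive on this slide, so the sketch below assumes the standard form for the extended Hamming construction: column i is a leading 1 followed by the log U bits of i, and the product seed × column is taken over GF(2). The names `column`, `in_set` and `members` are hypothetical.

```python
def column(i, log_u):
    """Assumed i-th column: a leading 1 bit followed by the binary digits of i."""
    return (1 << log_u) | i

def in_set(seed, i, log_u):
    """G(seed, i): parity of the bitwise AND, i.e. the GF(2) inner product."""
    return bin(seed & column(i, log_u)).count("1") % 2 == 1

def members(seed, log_u):
    """Expand a (log_u + 1)-bit seed into the set S it stands for."""
    return [i for i in range(1 << log_u) if in_set(seed, i, log_u)]

print(members(0b0001, 3))   # prints: [1, 3, 5, 7]
print(members(0b1001, 3))   # prints: [0, 2, 4, 6]
```

Under this assumption a 4-bit seed replaces an 8-entry membership table, and over a uniformly random seed every value i falls in S with probability ½.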
25RSS Algorithm Summary
- To compute a dyadic interval: compute 2A_S − |A| for sets containing the given dyadic interval.
- To compute an arbitrary interval: write it as a disjoint union of dyadic intervals, estimate them and take a median over possible results (simplified).
- To compute the quantiles: use binary search and compute the intervals until found.
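The three steps above can be put together in a simplified end-to-end sketch. It is not the authors' implementation: random sets are stored explicitly instead of being derived from seeds, repetitions are combined by a plain average rather than a median, and the class name `RSSSketch`, the parameter `reps` and the method names are all assumptions.

```python
import random

class RSSSketch:
    """Simplified random-subset-sum sketch over a universe of size 2**log_u."""

    def __init__(self, log_u, reps, rng):
        self.log_u, self.reps, self.n = log_u, reps, 0
        # member[j][r][k]: is I(j,k) in the r-th random set of level j?
        self.member = [[[rng.getrandbits(1) == 1 for _ in range(1 << j)]
                        for _ in range(reps)]
                       for j in range(log_u + 1)]
        # count[j][r]: tuples whose level-j interval belongs to that set.
        self.count = [[0] * reps for _ in range(log_u + 1)]

    def update(self, value, delta=1):
        """Insert (delta > 0) or delete (delta < 0) tuples with this value."""
        self.n += delta
        for j in range(self.log_u + 1):
            k = value >> (self.log_u - j)
            for r in range(self.reps):
                if self.member[j][r][k]:
                    self.count[j][r] += delta

    def estimate_interval(self, j, k):
        """Average of 2*A_S - |A| over the level-j sets containing I(j,k)."""
        if j == 0:
            return self.n   # the whole universe: the exact total is known
        hits = [2 * self.count[j][r] - self.n
                for r in range(self.reps) if self.member[j][r][k]]
        return sum(hits) / len(hits)

    def estimate_prefix(self, end):
        """Estimated A[0, end), summed over its dyadic pieces."""
        total = 0
        for j in range(self.log_u + 1):
            shift = self.log_u - j
            if (end >> shift) & 1:
                total += self.estimate_interval(j, (end >> shift) - 1)
        return total

    def quantile(self, phi):
        """Binary search for the largest j whose estimated A[0, j) is below phi*N."""
        rank = phi * self.n
        lo, hi = 0, (1 << self.log_u) - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if self.estimate_prefix(mid) < rank:
                lo = mid
            else:
                hi = mid - 1
        return lo

rss = RSSSketch(log_u=4, reps=2000, rng=random.Random(0))
rss.update(2, 300)
rss.update(5, 400)
rss.update(11, 300)
print(rss.quantile(0.5))   # the true median of this data set is 5
```

The toy data set has wide gaps between its prefix counts, so the estimation noise is far too small to mislead the binary search. On dense data the returned value is only approximate, within the error the number of repetitions buys.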
26Algorithm Complexity Claim
- The RSS algorithm's space complexity (for t quantile queries): polylogarithmic in the universe size.
- Time complexity for inserts, deletes and computing each quantile on demand is proportional to the space used.
27Proof Outline
- Declare random variables:
- X_k = 2A[I_k] if I_k is in S, and 0 otherwise
- X = sum of all X_k's in a certain set
- Y = sum of all X's in a given interval
- Z = a number of repetitions of X
28Proof Outline (Cont.)
- In a similar fashion to previous slides, show that Y and |A| can be used to compute A[I].
- Compute the variance.
- Use Chebyshev's and then Chernoff's inequalities, together with the computed variance, to achieve the required result.
29What If U Is Unknown ?
- In practice, the universe U is not always known.
- Predict a range [0, u − 1] for U.
- Given an inserted (or updated) value i s.t. i > u − 1, add another instance of RSS with range [u, u² − 1], and so on.
- Estimating dyadic intervals can be done in a single instance of RSS.
- Increased cost: a factor of log2log(U).
30Some RSS Properties
- RSS may return as a quantile a value which is not really in the dataset.
- Order of insertions and deletions does not affect result and accuracy.
- Can be parallelized quite easily (as long as random subsets are pre-agreed).
31Experimental Results
- Experiments
- Static artificial dataset
- Dynamic artificial dataset
- Dynamic real dataset
- Participants
- Naïve_l
- RSS_l
- GK
- GK2: an improvement for GK
32Static Artificial Dataset
- U = 2^20
- Compute 15 quantiles at position (1/16)k for k = 1, 2, ..., 15.
- 3 different distributions:
- Uniform
- Zipf
- Normal[m, v]
- Algorithm used: RSS_7 (11K footprint).
33Errors for Zipf data
34Errors for Normal[U/2, U/50] Distribution
35Dynamic Artificial Dataset
- Insert N = 104,858 items from the uniform distribution D1 = Uni[1, U], U = 2^20.
- Insert αN more items from the uniform distribution D2 = Uni[U/2 − U/32, U/2 + U/32].
- Delete all values from the first insertion.
- Parameter α controls the mass of the second insertion with respect to the first.
36Dynamic Artificial Dataset Results
37Dynamic Real Dataset
- Based on true Call Detail Records (CDRs) from AT&T.
- Dataset used includes 4.42 million CDRs covering a period of 18 hours.
- Objective: find the median length of current calls.
- Probe for estimates every 10,000 records.
- Algorithm used: RSS_6 (4K footprint).
38Number of Active Phone Calls Over Time
39Error in Computation of Median Over Time
40Average Error for Last 50 Snapshots, For Deciles
41Conclusions: RSS
- Algorithm for maintaining dynamic quantiles.
- Works well (within a user-defined precision) both for insertions AND deletions.
- Polylogarithmic (in universe size) in space and time complexities.
42Thanks for listening !