How to Summarize the Universe: Dynamic Maintenance of Quantiles

1
How to Summarize the Universe: Dynamic
Maintenance of Quantiles

By
Anna C. Gilbert
Yannis Kotidis
S. Muthukrishnan
Martin J. Strauss

2
Quantiles

Median, quartiles, etc.
The general case
Uses
Statistics
Estimating result set size
Partitioning

3
Computing static quantiles

Blum, Floyd, Pratt, Rivest & Tarjan
Find the i'th element
Comparison based
Similar to QuickSort
O(n) worst case time
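
A minimal sketch of linear-time selection in the spirit of this slide, assuming the usual "median of medians" pivot rule with groups of 5. The function name `select` and the 0-based rank are choices made for this illustration, not taken from the slides.

```python
def select(items, i):
    """Return the i'th smallest element (0-based) using a median-of-medians pivot."""
    items = list(items)
    if len(items) <= 5:
        return sorted(items)[i]
    # Median of each group of 5, then (recursively) the median of those medians.
    groups = [items[k:k + 5] for k in range(0, len(items), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    pivot = select(medians, len(medians) // 2)
    # Three-way partition around the pivot, as in QuickSort.
    lower = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    upper = [x for x in items if x > pivot]
    if i < len(lower):
        return select(lower, i)
    if i < len(lower) + len(equal):
        return pivot
    return select(upper, i - len(lower) - len(equal))
```

The pivot choice is what removes QuickSort's bad cases: it guarantees that a constant fraction of the items is discarded at every step.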

4
Problems with massive data sets

O(n) time not good enough
O(n) space usually not affordable
Dynamic environment
Cancellations are especially troublesome
Usually recomputed periodically
May be very inaccurate until recomputed

Some kind of approximation is the only choice!
5
Common approaches

Deterministically chosen sample
Randomization: probability of failure
Maintaining a backing sample
Wavelets
Most of the above approaches work well for the
incremental case, but deletions may cause
inaccuracy.

6
GK: Greenwald-Khanna ('01)

Fill the available memory with values
Maintain rank ranges on values in memory.
When a new value is inserted, kick a value out of
memory.
Insert-only algorithm
Can be extended to support deletes (GK2).
Maintain two instances: one for insertions and
one for deletions.

7
Maintenance of Equi-Depth Histograms (using a
backing sample)

Gibbons, Matias, Poosala '97
Scan the dataset and choose values for the sample
using the reservoir method.
Treat insertions as a continuous scan.
When a deletion from the sample is necessary,
rescan only if the number of items drops below a
specified minimum.
Works well for a mostly-insertions environment.
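
A minimal sketch of the reservoir method mentioned above, assuming the classic single-pass variant that keeps a uniform sample of k items. The name `reservoir_sample` and the `rng` parameter are illustrative, and the backing sample's deletion and rescan logic is not shown.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for n, value in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir.
            sample.append(value)
        else:
            # Keep the n'th item with probability k/n, evicting a uniformly chosen slot.
            slot = rng.randrange(n)
            if slot < k:
                sample[slot] = value
    return sample
```

Because each arriving item is handled once, insertions can be treated as a continuation of the scan, which is the property the slide relies on.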

8
The authors' main result

The RSS algorithm
RSS: Random Subset Sum
Space polylogarithmic in universe size
Proportional time
A priori guarantee of accuracy within a user-
specified error $\varepsilon$, with a user-specified
probability of failure $\delta$.

9
Some formalism

The universe: $U = \{0, \ldots, |U| - 1\}$
Number of tuples in data set: $\|A\| = N$
Data set can be thought of as an array: $A[i]$ =
number of tuples with value $i$
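Our goal for computing $\phi$-quantiles: find a $j_k$,
for $k = 1, \ldots, 1/\phi$, such that
$(k\phi - \varepsilon)\|A\| \le \sum_{i \le j_k} A[i] \le (k\phi + \varepsilon)\|A\|$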

10
Some assumptions

The universe's size is known
Later we'll throw that assumption away
Update = Delete + Insert

11
Computing quantiles

Let's say $A[i]$ is known for every $i$.
Easy to maintain through updates
Summing up array items?
Not a very good complexity
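
A minimal sketch of this naive approach, assuming the whole array of counts fits in memory. The name `naive_quantile` and its parameters are illustrative. It scans the array and returns the first value whose running count reaches the target rank, which costs O(U) time per query.

```python
def naive_quantile(counts, k, phi):
    """counts[i] = number of tuples with value i; return the k'th phi-quantile."""
    n = sum(counts)
    target = k * phi * n
    running = 0
    for value, count in enumerate(counts):
        running += count
        # First value whose inclusive prefix sum reaches the target rank.
        if running >= target:
            return value
    return len(counts) - 1

counts = [2, 0, 3, 1, 0, 4, 0, 0]        # 10 tuples over a universe of size 8
median = naive_quantile(counts, 1, 0.5)   # value at rank 5
```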

12
Computing quantiles (cont.)

We need a method of reducing summation overhead.
We should be able to compute any sum of items in
A in logarithmic time.
The solution: keeping computed sums of intervals.

13
Dyadic intervals - defined

Atomic dyadic interval: a single point.
$I_{j,k} = [\,k \cdot 2^{\log(U)-j},\ (k+1) \cdot 2^{\log(U)-j} - 1\,]$
$j$: resolution level
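Example (U = 8):

Level 0: I(0,0) = [0,7]
Level 1: I(1,0) = [0,3], I(1,1) = [4,7]
Level 2: I(2,0) = [0,1], I(2,1) = [2,3], I(2,2) = [4,5], I(2,3) = [6,7]
Level 3: I(3,0) = [0,0], I(3,1) = [1,1], I(3,2) = [2,2], I(3,3) = [3,3],
I(3,4) = [4,4], I(3,5) = [5,5], I(3,6) = [6,6], I(3,7) = [7,7]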
14
Computing an arbitrary interval

Let's say we have sums for all dyadic intervals
as in the above example.
We want to compute $A[0,6]$.
$A[0,6] = I(1,0) + I(2,2) + I(3,6)$
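
The decomposition above can be sketched as a greedy walk from left to right, always taking the largest aligned block that still fits. This is an illustration only: the name `dyadic_cover` and the (j, k) tuple output are assumptions, and the universe size is taken to be a power of two, 2**log_u.

```python
def dyadic_cover(lo, hi, log_u):
    """Split [lo, hi] (inclusive) into disjoint dyadic intervals, as (j, k) pairs."""
    cover = []
    while lo <= hi:
        # Grow the block starting at lo while it stays aligned and inside [lo, hi].
        size = 1
        while lo % (2 * size) == 0 and lo + 2 * size - 1 <= hi:
            size *= 2
        level = log_u - (size.bit_length() - 1)   # j = log(U) - log(size)
        cover.append((level, lo // size))
        lo += size
    return cover

pieces = dyadic_cover(0, 6, 3)   # the A[0,6] example with U = 8
```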

15
Dyadic intervals - observations

log(U) + 1 resolution levels
2U - 1 dyadic intervals altogether
O(U) space needed to keep them all
O(log(U)) time needed to compute any arbitrary
interval.

16
Computing quantiles (Cont.)

We can now efficiently compute any arbitrary
interval in A.
The $\phi$-quantile for any $k$ can be computed thus:
We need a $j_k$ s.t. $A[0, j_k) < k\phi N \le A[0, j_k + 1)$
Use binary search to find it!
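
A minimal sketch of this exact (non-approximate) scheme, assuming a universe of size 2**log_u and one counter per dyadic interval. The class name `DyadicSums` and its methods are illustrative. Updates touch one counter per level, prefix sums add at most one counter per level, and the quantile is found by binary search on the prefix sums.

```python
class DyadicSums:
    """Exact sums for every dyadic interval over a universe of size 2**log_u."""

    def __init__(self, log_u):
        self.log_u = log_u
        # sums[j][k] holds the number of tuples whose value falls in I(j, k).
        self.sums = [[0] * (1 << j) for j in range(log_u + 1)]

    def update(self, value, delta):
        """Insert (delta=+1) or delete (delta=-1) one tuple: O(log U) work."""
        for j in range(self.log_u + 1):
            self.sums[j][value >> (self.log_u - j)] += delta

    def prefix(self, end):
        """Return A[0, end), the number of tuples with value < end."""
        total = 0
        for j in range(self.log_u + 1):
            width_log = self.log_u - j
            # A set bit means one whole block of this width lies just below `end`.
            if (end >> width_log) & 1:
                total += self.sums[j][(end >> width_log) - 1]
        return total

    def quantile(self, k, phi):
        """Smallest j with A[0, j+1) >= k*phi*N, found by binary search."""
        target = k * phi * self.sums[0][0]
        lo, hi = 0, (1 << self.log_u) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.prefix(mid + 1) >= target:
                hi = mid
            else:
                lo = mid + 1
        return lo

ds = DyadicSums(3)
for value, count in enumerate([2, 0, 3, 1, 0, 4, 0, 0]):
    for _ in range(count):
        ds.update(value, +1)
median = ds.quantile(1, 0.5)
```

This version still stores 2U - 1 counters, so it only illustrates the query logic, not a small-space summary.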

17
But

Keeping O(U) of data presents a real space
complexity problem.
We need a way of estimating $A[i]$ on demand.
And also of estimating any dyadic interval on
demand.

18
Introducing random sets

Let S be a random set of values from U.
Each value has a probability of ½ of being in S.
Expectation of the number of items in S is ½U.

19
Random subset sums

Define $\|A_S\|$ as the number of items in A with
values in S.
Expectation of $\|A_S\|$ is $\tfrac{1}{2}\|A\| = \tfrac{1}{2}N$.
Now consider only subsets S containing a certain
value i.

20
Random subset sums (cont.)

Suppose we keep a number of random sets S, each
containing random values from U each with
probability ½.
We maintain $\|A_S\|$ for each such set.
Easy to maintain during updates.
How can we now estimate $A[i]$?

21
Random subset sums (cont.)

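We can estimate $A[i]$ for any $i$, using sets $S$ that contain $i$, with
$A[i] \approx 2\|A_S\| - \|A\|$
Proof: given $i \in S$, every other value is in $S$ with probability
$\tfrac{1}{2}$, so $E[\|A_S\|] = A[i] + \tfrac{1}{2}(\|A\| - A[i])$ and hence
$E[2\|A_S\| - \|A\|] = A[i]$.
The authors prove that repeating the process
$O(1/\varepsilon^2)$ times yields the required accuracy.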
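
A small simulation of this estimator, as a sketch under simplifying assumptions: the random sets are drawn explicitly with fully independent coin flips and are forced to contain i, whereas the real algorithm keeps fixed sets described by short seeds. The name `rss_point_estimate` and its parameters are illustrative.

```python
import random

def rss_point_estimate(counts, i, repetitions, rng):
    """Average 2*||A_S|| - ||A|| over random sets S that are forced to contain i."""
    total = sum(counts)
    estimates = []
    for _ in range(repetitions):
        # Every value other than i joins S independently with probability 1/2.
        subset_sum = counts[i] + sum(
            c for value, c in enumerate(counts)
            if value != i and rng.random() < 0.5
        )
        estimates.append(2 * subset_sum - total)
    return sum(estimates) / repetitions

counts = [5, 1, 9, 3, 2, 8, 4, 6]
estimate = rss_point_estimate(counts, 2, 4000, random.Random(7))
```

A single repetition is very noisy; averaging many of them is what pulls the estimate towards the true count A[2] = 9.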

22
Random subset sums (cont.)

We can also estimate any dyadic interval $I_{j,k}$
using the same method.
Improvement: we can compute the sums for dyadic
intervals from a certain level.
We can now estimate any arbitrary interval in the
universe.

23
Space Considerations

Keeping a set of expected size ½U is still
O(U).
We need a method of keeping a set without
actually keeping it.
The technique: instead of sets, keep random seeds
of size O(log U) bits and compute whether a
given $i \in S$ on demand.

24
Extended Hamming Code

Used for generating the random sets.
Provides sufficient randomness
For example:
U = 8
Seed size: log(U) + 1 = 4
G(seed, i) = seed × i'th column (of the code's generator matrix, mod 2)
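
A minimal sketch of seed-based membership, assuming the generator matrix of the extended Hamming code has an all-ones row followed by rows holding the binary representation of each column index i. The bit ordering and the names `in_random_set` and `expand_set` are choices made for this illustration.

```python
def in_random_set(seed, i, log_u):
    """Return True if value i belongs to the set S described by `seed`."""
    # Column i of the generator matrix: a leading 1, then the log_u bits of i.
    column = (1 << log_u) | i
    # Inner product over GF(2) is the parity of the bitwise AND.
    return bin(seed & column).count("1") % 2 == 1

def expand_set(seed, log_u):
    """List the members of S explicitly (for inspection only)."""
    return [i for i in range(1 << log_u) if in_random_set(seed, i, log_u)]

members = expand_set(0b0101, 3)   # U = 8, 4-bit seed
```

Only the seed is stored, so each set costs log(U) + 1 bits instead of U bits, and membership is recomputed whenever a value is inserted, deleted or queried.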

25
RSS Algorithm Summary

To compute a dyadic interval:
Compute $2\|A_S\| - \|A\|$ for sets containing the
given dyadic interval.
To compute an arbitrary interval:
Write it as a disjoint union of dyadic intervals,
estimate them and take a median over possible
results (simplified).
To compute the quantiles:
Use binary search and compute the intervals until
found.

26
Algorithm Complexity Claim

The RSS algorithm's space complexity (for t
quantile queries) is polylogarithmic in the
universe size.
Time complexity for inserts, deletes and
computing each quantile on demand is proportional
to the space used.

27
Proof Outline

Declare random variables:
$X_k = 2\|A_{I_k}\|$ if $I_k$ is in $S$, and 0 otherwise
$X$: sum of all $X_k$'s in a certain set
$Y$: sum of all $X$'s in a given interval
$Z$: a number of repetitions of $X$

28
Proof Outline (Cont.)

In a similar fashion to previous slides, show
that $Y$ and $\|A\|$ can be used to compute $\|A_I\|$.
Compute the variance.
Use Chebyshev's and then Chernoff's inequalities,
together with the computed variance, to achieve
the required result.
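
A sketch of how the variance, Chebyshev and Chernoff steps fit together, shown for the simpler point estimate of $A[i]$ rather than the authors' variables X, Y and Z. It assumes the membership indicators are pairwise independent, and the constants are illustrative rather than the authors' own.

```latex
% Assumption: the indicators [j \in S] are pairwise independent.
% Condition on i \in S and write \sigma_j = 2\,[j \in S] - 1 \in \{-1, +1\}.
\begin{align*}
X &= 2\|A_S\| - \|A\| = A[i] + \sum_{j \ne i} \sigma_j A[j], \\
\mathbb{E}[X] &= A[i], \qquad
\operatorname{Var}[X] = \sum_{j \ne i} A[j]^2 \le \|A\|^2 .
\end{align*}
% Chebyshev on the average \bar{X} of m independent copies of X:
\[
\Pr\bigl[\,|\bar{X} - A[i]| \ge \varepsilon \|A\|\,\bigr]
  \le \frac{\operatorname{Var}[X]}{m\,\varepsilon^2 \|A\|^2}
  \le \frac{1}{m\,\varepsilon^2},
\]
% so m = 4/\varepsilon^2 copies give failure probability at most 1/4.
% Chernoff: the median of r such averages is off only if at least r/2 of them
% fail, which has probability e^{-\Omega(r)}; r = O(\log(1/\delta)) suffices.
```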

29
What If U Is Unknown ?

In practice, the universe U is not always known.
Predict a range $[0, u-1]$ for U.
Given an inserted (or updated) value $i$ s.t. $i > u-1$,
add another instance of RSS with range $[u, u^2-1]$,
and so on.
Estimating dyadic intervals can be done in a
single instance of RSS.
Increased cost factor log2log(U).

30
Some RSS Properties

RSS may return as a quantile a value which is not
really in the dataset.
Order of insertions and deletions does not affect
result and accuracy.
Can be parallelized quite easily (as long as
random subsets are pre-agreed).

31
Experimental Results

Experiments
Static artificial dataset
Dynamic artificial dataset
Dynamic real dataset
Participants
Naïve$_l$
RSS$_l$
GK
GK2: an improvement for GK

32
Static Artificial Dataset

$U = 2^{20}$
Compute 15 quantiles at position $(1/16)k$ for
$k = 1, 2, \ldots, 15$.
3 different distributions:
Uniform
Zipf
Normal[m, v]
Algorithm used: RSS7 (11K footprint).

33
Errors for Zipf data
34
Errors for Normal[U/2, U/50] Distribution
35
Dynamic Artificial Dataset

Insert $N = 104{,}858$ items from uniform dist.
$D_1 = \mathrm{Uni}[1, U]$, $U = 2^{20}$.
Insert $\alpha N$ more items from uniform dist.
$D_2 = \mathrm{Uni}[U/2 - U/32,\ U/2 + U/32]$.
Delete all values from the first insertion.
Parameter $\alpha$ controls the mass of the second
insertion with respect to the first.

36
Dynamic Artificial Dataset Results
37
Dynamic Real Dataset

Based on true Call Detail Records (CDRs) from
AT&T.
Dataset used includes 4.42 million CDRs covering
a period of 18 hours.
Objective: find the median length of current
calls.
Probe for estimates every 10,000 records.
Algorithm used: RSS6 (4K footprint).

38
Number of Active Phone Calls Over Time
39
Error in Computation of Median Over Time
40
Average Error for Last 50 Snapshots, For Deciles
41
Conclusions RSS