How to Summarize the Universe: Dynamic Maintenance of Quantiles

Slides: 43
Provided by: assaf8

Transcript and Presenter's Notes


1
How to Summarize the Universe: Dynamic Maintenance of Quantiles
  • By
  • Anna C. Gilbert
  • Yannis Kotidis
  • S. Muthukrishnan
  • Martin J. Strauss

2
Quantiles
  • Median, quartiles, ...
  • The general case
  • Uses
  • Statistics
  • Estimating result set size
  • Partitioning

3
Computing static quantiles
  • Blum, Floyd, Pratt, Rivest & Tarjan
  • Find the i-th element
  • Comparison based
  • Similar to QuickSort
  • O(n) worst case time
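
A minimal sketch of the comparison-based selection idea listed above, assuming the usual "median of medians" formulation with groups of five. The function name `select` and the 0-based rank are illustrative choices, not taken from the slides.

```python
def select(items, i):
    """Return the i-th smallest element (0-based) in O(n) worst-case time."""
    items = list(items)
    if not 0 <= i < len(items):
        raise IndexError("rank out of range")
    while True:
        if len(items) <= 5:
            return sorted(items)[i]
        # Median of each group of five, then the median of those medians as pivot.
        groups = [sorted(items[k:k + 5]) for k in range(0, len(items), 5)]
        medians = [g[len(g) // 2] for g in groups]
        pivot = select(medians, len(medians) // 2)
        # Partition around the pivot, similar to QuickSort.
        lower = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if i < len(lower):
            items = lower
        elif i < len(lower) + len(equal):
            return pivot
        else:
            i -= len(lower) + len(equal)
            items = [x for x in items if x > pivot]
```

The whole input has to be held in memory and re-scanned, which is what makes this approach unsuitable for massive, changing data sets.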

4
Problems with massive data sets
  • O(n) time not good enough
  • O(n) space usually not affordable
  • Dynamic environment
  • Cancellations are especially troublesome
  • Usually recomputed periodically
  • May be very inaccurate until recomputed

Some kind of approximation is the only choice!
5
Common approaches
  • Deterministically chosen sample
  • Randomization: probability of failure
  • Maintaining a backing sample
  • Wavelets
  • Most of the above approaches work well for the
    incremental case, but deletions may cause
    inaccuracy.

6
GK: Greenwald-Khanna ('01)
  • Fill the available memory with values.
  • Maintain rank ranges on the values in memory.
  • When a new value is inserted, kick a value out of
    memory.
  • Insert-only algorithm
  • Can be extended to support deletes (GK2).
  • Maintain two instances: one for insertions and
    one for deletions.

7
Maintenance of Equi-Depth Histograms (using a
backing sample)
  • Gibbons, Matias, Poosala '97
  • Scan the dataset and choose values for the sample
    using the reservoir method.
  • Treat insertions as a continuous scan.
  • When a deletion from the sample is necessary,
    rescan only if the number of items drops below a
    specified minimum.
  • Works well for a mostly-insertions environment.
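
The reservoir method mentioned above can be sketched as follows. This is the textbook single-pass version, not necessarily the exact variant used for the backing sample; `reservoir_sample` and its parameters are illustrative names.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k values from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for n, value in enumerate(stream, start=1):
        if len(sample) < k:
            sample.append(value)
        else:
            # The n-th value is kept with probability k/n and evicts a random slot.
            slot = rng.randrange(n)
            if slot < k:
                sample[slot] = value
    return sample
```

Insertions simply continue the scan. A deletion that hits a sampled value shrinks the sample, which is why a rescan is needed once it drops below the minimum size.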

8
The authors' main result
  • The RSS algorithm
  • RSS = Random Subset Sums
  • Space polylogarithmic in universe size
  • Time proportional to the space used
  • A priori guarantee of accuracy within a
    user-specified error $\epsilon$, with a
    user-specified probability of failure $\delta$.

9
Some formalism
  • The universe: $U = \{0, \ldots, |U| - 1\}$
  • Number of tuples in the data set: $\|A\| = N$
  • The data set can be thought of as an array:
    $A[i]$ = number of tuples with value $i$
  • Our goal for computing $\phi$-quantiles: find a $j_k$
    such that [formula not preserved in the transcript]

10
Some assumptions
  • The universe's size is known
  • Later we'll throw that assumption away
  • Update = Delete + Insert

11
Computing quantiles
  • Let's say $A[i]$ is known for every $i$.
  • Easy to maintain through updates
  • Summing up array items: O(U)
  • Not a very good complexity

12
Computing quantiles (cont.)
  • We need a method of reducing summation overhead.
  • We should be able to compute any sum of items in
    A in logarithmic time.
  • The solution: keep precomputed sums of intervals.

13
Dyadic intervals - defined
  • Atomic dyadic interval: a single point.
  • $I_{j,k} = [\,k \cdot 2^{\log(U)-j},\ (k+1) \cdot 2^{\log(U)-j} - 1\,]$
  • $j$: resolution level
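  • Example (U = 8):
  • Level 3: I(3,0), I(3,1), ..., I(3,7) are the single points 0, 1, ..., 7
  • Level 2: I(2,0) = [0,1], I(2,1) = [2,3], I(2,2) = [4,5], I(2,3) = [6,7]
  • Level 1: I(1,0) = [0,3], I(1,1) = [4,7]
  • Level 0: I(0,0) = [0,7]
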
14
Computing an arbitrary interval
  • Let's say we have sums for all dyadic intervals,
    as in the above example.
  • We want to compute $A[0,6]$.
  • $A[0,6] = I(1,0) + I(2,2) + I(3,6)$

15
Dyadic intervals - observations
  • log(U) + 1 resolution levels
  • 2U - 1 dyadic intervals altogether
  • O(U) space needed to keep them all
  • O(log(U)) time needed to compute any arbitrary
    interval.
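
The observations above can be illustrated with a small sketch that stores exact sums for every dyadic interval and answers an arbitrary range query from O(log(U)) of them. The helper names (`dyadic_sums`, `dyadic_cover`, `range_sum`) are hypothetical, and the universe size is assumed to be a power of two.

```python
def dyadic_sums(a):
    """Return sums[j][k] = sum of a over I(j,k); len(a) must be a power of two."""
    u = len(a)
    assert u > 0 and u & (u - 1) == 0, "universe size must be a power of two"
    levels = [list(a)]                      # finest level: single points
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[2 * k] + prev[2 * k + 1] for k in range(len(prev) // 2)])
    return levels[::-1]                     # index 0 is the coarsest level, I(0,0)


def dyadic_cover(lo, hi, log_u):
    """Split the inclusive range [lo, hi] into disjoint dyadic intervals (j, k)."""
    cover = []
    while lo <= hi:
        size = 1
        # Grow the block while it stays aligned at lo and fits inside [lo, hi].
        while lo % (2 * size) == 0 and lo + 2 * size - 1 <= hi:
            size *= 2
        j = log_u - (size.bit_length() - 1)
        cover.append((j, lo // size))
        lo += size
    return cover


def range_sum(sums, lo, hi):
    """Sum of a[lo..hi], using only the stored dyadic sums."""
    log_u = len(sums) - 1
    return sum(sums[j][k] for j, k in dyadic_cover(lo, hi, log_u))
```

For U = 8, `dyadic_cover(0, 6, 3)` yields I(1,0), I(2,2) and I(3,6), matching the decomposition of [0,6] on the previous slide.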

16
Computing quantiles (Cont.)
  • We can now efficiently compute any arbitrary
    interval in A.
  • The $k$-th $\phi$-quantile, for any $k$, can be computed thus:
  • We need a $j_k$ s.t. $A[0, j_k) < k\phi N < A[0, j_k + 1)$
  • Use binary search to find it!
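
A sketch of that binary search, assuming a caller-supplied function `prefix(j)` that returns the number of tuples with value below `j`, that is, the sum over [0, j). Both `quantile` and `prefix` are illustrative names. The sketch treats the second inequality as non-strict so that an answer always exists; that is an assumption made here, not something stated on the slide.

```python
def quantile(prefix, universe_size, target):
    """Smallest j with prefix(j) < target <= prefix(j + 1), for 0 < target <= N."""
    lo, hi = 0, universe_size - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid + 1) >= target:
            hi = mid          # the answer is mid or to its left
        else:
            lo = mid + 1      # too little mass up to mid
    return lo


# Exact prefix sums over a toy array; RSS would plug in estimated sums instead.
a = [2, 0, 5, 1, 0, 3, 4, 1]
n = sum(a)
quartiles = [quantile(lambda j: sum(a[:j]), len(a), k * n / 4) for k in (1, 2, 3)]
```

Each quantile costs O(log(U)) prefix evaluations, and each prefix is itself a sum of O(log(U)) dyadic intervals.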

17
But
  • Keeping O(U) of data presents a real space
    complexity problem.
  • We need a way of estimating $A[i]$ on demand.
  • And also of estimating any dyadic interval on
    demand.

18
Introducing random sets
  • Let S be a random set of values from U.
  • Each value has a probability of ½ of being in S.
  • Expectation of the number of items in S is ½U.

19
Random subset sums
  • Define $\|A_S\|$ as the number of items in A with
    values in S.
  • The expectation of $\|A_S\|$ is $\frac{1}{2}\|A\| = \frac{1}{2}N$.
  • Now consider only subsets S containing a certain
    value i.

20
Random subset sums (cont.)
  • Suppose we keep a number of random sets S, each
    containing random values from U each with
    probability ½.
  • We maintain $\|A_S\|$ for each such set.
  • Easy to maintain during updates.
  • How can we now estimate $A[i]$?

21
Random subset sums (cont.)
  • We can estimate $A[i]$ for any $i$, using sets $S$
    that contain $i$, with $A[i] \approx 2\|A_S\| - \|A\|$
  • Proof
  • The authors prove that repeating the process
    $O(1/\epsilon^2)$ times yields the required accuracy.
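
Why the estimator is unbiased: if $i \in S$, every other value is still in $S$ with probability ½, so the expected subset sum is $A[i] + \frac{1}{2}(\|A\| - A[i])$, and doubling it and subtracting $\|A\|$ leaves exactly $A[i]$. The sketch below checks this by brute force on a tiny universe, averaging over every subset that contains $i$. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def subset_sum(a, s):
    """Number of tuples whose value lies in the set s."""
    return sum(a[v] for v in s)


def expected_estimate(a, i):
    """Exact average of 2*||A_S|| - ||A|| over all subsets S that contain i."""
    n = sum(a)
    estimates = [
        2 * subset_sum(a, s) - n
        for size in range(len(a) + 1)
        for s in combinations(range(len(a)), size)
        if i in s
    ]
    return Fraction(sum(estimates), len(estimates))
```

A single set gives a very noisy estimate. Only the average over many independent sets concentrates around $A[i]$, which is where the repetition count comes from.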

22
Random subset sums (cont.)
  • We can also estimate any dyadic interval $I_{j,k}$
    using the same method.
  • Improvement: we can compute the sums for dyadic
    intervals from a certain level.
  • We can now estimate any arbitrary interval in the
    universe

23
Space Considerations
  • Keeping a set of expected size ½U is still
    O(U).
  • We need a method of keeping a set without
    actually keeping it
  • The technique: instead of sets, keep random seeds
    of size O(log U) bits and compute whether a
    given $i \in S$ on demand.

24
Extended Hamming Code
  • Used for generating the random sets.
  • Provides sufficient randomness
  • For example
  • U = 8
  • Seed size = log(U) + 1 = 4
  • G(seed, i) = seed × i-th column of the code's generator matrix
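
A sketch of how a short seed might define a whole set. The generator matrix is not reproduced in the transcript, so the layout used here is an assumption: an all-ones row on top of the binary representations of 0, ..., U-1 as columns, with the product taken mod 2. `column` and `in_set` are hypothetical names.

```python
def column(i, log_u):
    """Column i of the assumed matrix, packed into an integer of log_u + 1 bits."""
    return (1 << log_u) | i      # top bit: the all-ones row; low bits: binary of i


def in_set(seed, i, log_u):
    """True if value i belongs to the set defined by seed (dot product mod 2)."""
    return bin(seed & column(i, log_u)).count("1") % 2 == 1
```

With U = 8 there are 16 possible seeds. Under this layout each value lies in exactly half of the 16 sets, and any two distinct values lie together in exactly a quarter of them, so set membership is pairwise independent over a uniformly random seed.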

25
RSS Algorithm Summary
  • To compute a dyadic interval:
  • Compute $2\|A_S\| - \|A\|$ for sets containing the
    given dyadic interval.
  • To compute an arbitrary interval:
  • Write it as a disjoint union of dyadic intervals,
    estimate them and take a median over possible
    results (simplified).
  • To compute the quantiles:
  • Use binary search and compute the intervals until
    found.

26
Algorithm Complexity Claim
  • The RSS algorithm's space complexity (for t
    quantile queries): [formula not preserved in the transcript]
  • Time complexity for inserts, deletes and
    computing each quantile on demand is proportional
    to the space used.

27
Proof Outline
  • Declare random variables:
  • $X_k = 2\|A_{I_k}\|$ if $I_k$ is in $S$, and 0 otherwise
  • $X$ = sum of all $X_k$'s in a certain set
  • $Y$ = sum of all $X$'s in a given interval
  • $Z$ = a number of repetitions of $X$.

28
Proof Outline (Cont.)
  • In a similar fashion to previous slides, show
    that $Y$ and $\|A\|$ can be used to compute $\|A_I\|$.
  • Compute the variance.
  • Use Chebyshev's and then Chernoff's inequalities,
    together with the computed variance, to achieve
    the required result.

29
What If U Is Unknown ?
  • In practice, the universe U is not always known.
  • Predict a range $[0, u-1]$ for U.
  • Given an inserted (or updated) value $i$ s.t.
    $i > u-1$, add another instance of RSS with range
    $[u, u^2 - 1]$, and so on.
  • Estimating dyadic intervals can be done in a
    single instance of RSS.
  • Increased cost factor log2log(U).

30
Some RSS Properties
  • RSS may return as a quantile a value which is not
    really in the dataset.
  • Order of insertions and deletions does not affect
    result and accuracy.
  • Can be parallelized quite easily (as long as
    random subsets are pre-agreed).
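
The last two properties follow from the counters being plain sums. A minimal sketch, assuming seed-defined sets with a parity-based membership test (an illustrative choice); the class `SubsetSumSketch` and its methods are hypothetical names, and quantile estimation is left out.

```python
import random


class SubsetSumSketch:
    """One counter ||A_S|| per random set; the sets are fixed by a shared seed."""

    def __init__(self, log_u, num_sets, seed=0):
        rng = random.Random(seed)
        self.log_u = log_u
        self.seeds = [rng.getrandbits(log_u + 1) for _ in range(num_sets)]
        self.counters = [0] * num_sets
        self.total = 0                        # ||A||

    def _contains(self, s, i):
        # Assumed membership test: parity of the seed ANDed with (1, bits of i).
        return bin(s & ((1 << self.log_u) | i)).count("1") % 2 == 1

    def update(self, i, delta):
        """delta = +1 for an insertion, -1 for a deletion."""
        self.total += delta
        for idx, s in enumerate(self.seeds):
            if self._contains(s, i):
                self.counters[idx] += delta

    def merge(self, other):
        """Combine a sketch built elsewhere over the same pre-agreed sets."""
        assert self.seeds == other.seeds, "random subsets must be pre-agreed"
        self.total += other.total
        self.counters = [x + y for x, y in zip(self.counters, other.counters)]
```

Because every update only adds or subtracts 1 from some counters, any interleaving of the same insertions and deletions ends in the same state, and sketches built on separate machines can be added together.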

31
Experimental Results
  • Experiments
  • Static artificial dataset
  • Dynamic artificial dataset
  • Dynamic real dataset
  • Participants
  • Naïvel
  • RSSl
  • GK
  • GK2: an improvement for GK

32
Static Artificial Dataset
  • $U = 2^{20}$
  • Compute 15 quantiles at positions (1/16)k for
    k = 1, 2, ..., 15.
  • 3 different distributions
  • Uniform
  • Zipf
  • Normal(m, v)
  • Algorithm used: RSS7 (11K footprint).

33
Errors for Zipf data
34
Errors for Normal(U/2, U/50) Distribution
35
Dynamic Artificial Dataset
  • Insert N = 104,858 items from the uniform distribution
    $D_1 = \mathrm{Uni}[1, U]$, $U = 2^{20}$.
  • Insert $\alpha N$ more items from the uniform distribution
    $D_2 = \mathrm{Uni}[U/2 - U/32,\ U/2 + U/32]$.
  • Delete all values from the first insertion.
  • Parameter $\alpha$ controls the mass of the second
    insertion with respect to the first.

36
Dynamic Artificial Dataset Results
37
Dynamic Real Dataset
  • Based on true Call Detail Records (CDRs) from
    AT&T.
  • Dataset used includes 4.42 million CDRs covering
    a period of 18 hours.
  • Objective: find the median length of current
    calls.
  • Probe for estimates every 10,000 records.
  • Algorithm used: RSS6 (4K footprint).

38
Number of Active Phone Calls Over Time
39
Error in Computation of Median Over Time
40
Average Error for Last 50 Snapshots, For Deciles
41
Conclusions: RSS
  • Algorithm for maintaining dynamic quantiles.
  • Works well (within a user-defined precision) both
    for insertions AND deletions.
  • Polylogarithmic (in universe size) in space and
    time complexities.

42
Thanks for listening!