Maintaining Variance and k-Medians over Data Stream Windows

1
Maintaining Variance and k-Medians
over Data Stream Windows
  • Brian Babcock, Mayur Datar, Rajeev Motwani,
    Liadan O'Callaghan
  • Stanford University

2
Data Streams and Sliding Windows
  • Streaming data model
    • Useful for applications with high data volumes and timeliness requirements
    • Data processed in a single pass
    • Limited memory (sublinear in stream size)
  • Sliding window model
    • Variation of the streaming data model
    • Only recent data matters
    • Parameterized by window size N
    • Limited memory (sublinear in window size)

3
Sliding Window (SW) Model
(Figure: a bit stream …1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1, with time increasing to the right; the most recent N = 7 elements form the current window, ending at the current time.)
4
Variance and k-Medians
  • Variance: V = Σi (xi − µ)², where µ = (Σi xi)/N
  • k-median clustering:
  • Given N points (x1, …, xN) in a metric space
  • Find k points C = {c1, c2, …, ck} that minimize
    Σi d(xi, C) (the assignment distance)
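For concreteness, here is a minimal illustrative Python sketch (not from the talk) of the two quantities just defined, for 1-D points with d(x, c) = |x − c|:

```python
# Illustrative sketch: the two window statistics the talk maintains,
# computed naively over an in-memory window of 1-D points.

def variance(window):
    """V = sum_i (x_i - mu)^2, with mu = (sum_i x_i) / N."""
    n = len(window)
    mu = sum(window) / n
    return sum((x - mu) ** 2 for x in window)

def assignment_distance(window, centers):
    """k-median cost: sum_i d(x_i, C), distance to the nearest center."""
    return sum(min(abs(x - c) for c in centers) for x in window)

w = [1.0, 2.0, 3.0, 4.0]
print(variance(w))                         # 5.0 (mu = 2.5)
print(assignment_distance(w, [1.5, 3.5]))  # 2.0 (each point is 0.5 away)
```

The sliding-window algorithms below approximate exactly these values without storing the whole window.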

5
Previous Results in SW Model
  • Count of non-zero elements / sum of positive integers [DGIM02]
  • (1 ± ε) approximation
  • Space: Θ((1/ε) log N) words = Θ((1/ε) log² N) bits
  • Update time: Θ(log N) worst case, Θ(1) amortized
  • Improved to O(1) worst case by [GT02]
  • Exponential Histogram (EH) data structure
  • Generalized SW model [CS03] (previous talk)

6
Results: Variance
  • (1 ± ε) approximation
  • Space: O((1/ε²) log N) words
  • Update time: O(1) amortized, O((1/ε²) log N)
    worst case

7
Results: k-medians
  • 2^O(1/t) approximation of assignment distance (0 < t < ½)
  • Space: O((k/t⁴) N^(2t))
  • Update time: O(k) amortized, O((k²/t³) N^(2t))
    worst case
  • Query time: O((k²/t³) N^(2t))

8
Remainder of the Talk
  • Overview of Exponential Histogram
  • Where EH fails and how to fix it
  • Algorithm for Variance
  • Main ideas in k-medians algorithm
  • Open problems

9
Sliding Window Computation
  • Main difficulty: discounting expiring data
  • As each new element arrives, one old element expires
  • The value of the expiring element can't be known exactly
  • How do we update our data structure?
  • One solution: use histograms

(Figure: as the window slides, the oldest histogram bucket expires; bucket sums 3, 2, 1, 2 become 2, 1, 2.)
10
Containing the Error
  • Error comes from the last (oldest) bucket
  • Need to ensure that the contribution of the last bucket
    is not too big
  • Bad example:

(Figure: a stream bucketed with sums 4, 4, 4; after the window slides, a single bucket with sum 4 remains, so the oldest bucket dominates the estimate.)
11
Exponential Histograms
  • Exponential Histogram algorithm:
  • Initially, buckets contain 1 item each
  • Merge adjacent buckets once the sum of the later
    (more recent) buckets is large enough

(Figure: as new elements arrive, the bucket sums evolve: 4, 2, 2, 1 → 4, 2, 2, 1, 1 → 4, 2, 2, 1, 1, 1 → 4, 2, 2, 2, 1 → 4, 4, 2, 1.)
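The merge rule above can be sketched for the original counting problem of [DGIM02]. The class below is an illustrative simplification, not the authors' code: the names and the parameter k (roughly 1/ε, controlling how many buckets of each size are kept) are assumptions for this sketch.

```python
# Minimal Exponential Histogram sketch for counting 1s among the last
# `window` bits, in the spirit of [DGIM02].  Bucket sizes are powers of
# two; at most k+1 buckets of each size are kept, and the two oldest
# buckets of an overfull size are merged.

class ExpHistogram:
    def __init__(self, window, k=2):
        self.window, self.k = window, k
        self.buckets = []   # (timestamp of newest element, size), newest first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Drop buckets whose newest element has slid out of the window.
        self.buckets = [(t, s) for (t, s) in self.buckets
                        if t > self.time - self.window]
        if bit:
            self.buckets.insert(0, (self.time, 1))
            self._merge()

    def _merge(self):
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idx) <= self.k + 1:
                break
            i, j = idx[-2], idx[-1]      # the two oldest buckets of this size
            self.buckets[j] = (self.buckets[i][0], 2 * size)  # keep newer timestamp
            del self.buckets[i]
            size *= 2                    # a merge may cascade upward

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        oldest = self.buckets[-1][1]
        return total - oldest + oldest // 2  # charge half of the oldest bucket
```

The only uncertainty is the oldest bucket, which may be partially expired; charging half its size gives the (1 ± ε)-style guarantee for counts, and the next slides show why this scheme is not enough for variance.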
12
Where EH Goes Wrong
  • [DGIM02]: can estimate any function f defined
    over windows that satisfies:
  • Positive: f(X) ≥ 0
  • Polynomially bounded: f(X) ≤ poly(|X|)
  • Composable: can compute f(X ∪ Y) from f(X), f(Y),
    and little additional information
  • Weakly additive: f(X) + f(Y) ≤ f(X ∪ Y) ≤ c·(f(X) + f(Y))
    for some constant c
  • The weak additivity condition does not hold for
    variance or k-medians

13
Notation
(Figure: the current window of size N partitioned into buckets Bm, Bm−1, …, B2, B1, with Bm the oldest and B1 the most recent.)
  • Vi = variance of the ith bucket
  • ni = number of elements in the ith bucket
  • µi = mean of the ith bucket
14
Variance composition
  • Bi,j = concatenation of buckets i and j
  • ni,j = ni + nj,  µi,j = (ni·µi + nj·µj)/ni,j,
    Vi,j = Vi + Vj + (ni·nj/ni,j)·(µi − µj)²
15
Failure of Weak Additivity
(Figure: two adjacent buckets whose values are nearly constant within each bucket but far apart from each other; each bucket's variance is small, yet the variance of the combined bucket is large.)
16
Main Solution Idea
  • More careful estimation of the last bucket's
    contribution
  • Decompose the variance into two parts:
  • Internal variance: within a bucket
  • External variance: between buckets

(Figure: the total variance decomposes into the internal variance of bucket i, the internal variance of bucket j, and the external variance between them.)
17
Main Solution Idea
  • When estimating the contribution of the last bucket:
  • Internal variance is charged evenly to each point
  • External variance:
  • Pretend each point is at the average for its
    bucket
  • The variance for the bucket is small ⇒ points aren't
    too far from the average
  • Points aren't far from the average ⇒ the average
    is a good approximation for each point
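The two charging rules above can be sketched as a small estimator. Everything here is a hedged illustration: nm_active, an estimate of how many of the last bucket's points are still in the window, is taken as a parameter (the paper derives such an estimate), and the helper names are mine.

```python
# Hedged sketch of the last-bucket estimation idea: charge the internal
# variance Vm evenly per surviving point, and place each surviving point
# at the bucket mean µm (the external part), then combine with the exact
# suffix statistics using the slide-14 composition.

def combine(a, b):
    (n1, m1, v1), (n2, m2, v2) = a, b
    n = n1 + n2
    mu = (n1 * m1 + n2 * m2) / n
    return (n, mu, v1 + v2 + (n1 * n2 / n) * (m1 - m2) ** 2)

def estimate_window_variance(last_bucket, nm_active, suffix):
    """last_bucket, suffix: (n, mu, V) triples; suffix covers B(m-1)..B1."""
    nm, mum, vm = last_bucket
    # Surviving points modeled as nm_active copies of µm, each carrying
    # an even share Vm/nm of the internal variance.
    pseudo = (nm_active, mum, vm * nm_active / nm)
    return combine(pseudo, suffix)[2]
```

When no points of the last bucket have expired (nm_active = nm), the estimate is exact; the error otherwise comes only from the last bucket, which the next slide bounds.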

18
Main Idea Illustration
(Figure: points plotted over time, with the "spread" marking how far a bucket's points lie from its average.)
  • Spread is small ⇒ external variance is small
  • Spread is large ⇒ error from bucket averaging is
    insignificant

19
Variance error bound
(Figure: the current window of size N partitioned into buckets Bm, Bm−1, …, B2, B1; Bm* denotes the combined suffix bucket Bm−1 ∪ … ∪ B1, with variance Vm*.)
  • Theorem: relative error ≤ ε, provided Vm ≤ (ε²/9)·Vm*
  • Aim: maintain Vm ≤ (ε²/9)·Vm* using as few
    buckets as possible

20
Variance algorithm
  • EH algorithm for variance:
  • Initially, buckets contain 1 item each
  • Merge adjacent buckets Bi, Bi−1 whenever
    (9/ε²)·Vi,i−1 ≤ V(i−1)* (i.e., the variance of the
    merged bucket is small compared to the combined
    variance of the later, more recent buckets)

21
Invariants
  • Invariant 1: (9/ε²)·Vi ≤ Vi*
  • Ensures that the relative error is at most ε
  • Invariant 2: (9/ε²)·Vi,i−1 > V(i−1)*
  • Ensures that the number of buckets is O((1/ε²) log N)
  • Each bucket requires O(1) space
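A hedged sketch of how the merge condition might be enforced over a list of bucket summaries. The helper names are mine, and the composition used is the standard variance merge from slide 14; the real algorithm interleaves this with arrivals and expirations.

```python
# Buckets are kept newest-first as (n, mu, V) triples.  V_{i*} is the
# variance of the union of all buckets newer than position i.

def combine(a, b):
    (n1, m1, v1), (n2, m2, v2) = a, b
    n = n1 + n2
    mu = (n1 * m1 + n2 * m2) / n
    return (n, mu, v1 + v2 + (n1 * n2 / n) * (m1 - m2) ** 2)

def suffix_stats(buckets, i):
    """Combined (n, mu, V) of buckets[0..i-1], i.e. everything newer
    than buckets[i]; None if nothing is newer."""
    acc = None
    for b in buckets[:i]:
        acc = b if acc is None else combine(acc, b)
    return acc

def enforce_invariants(buckets, eps):
    """Merge adjacent buckets i, i-1 while (9/eps^2)*V_{i,i-1} <= V_{(i-1)*}."""
    changed = True
    while changed:
        changed = False
        for i in range(len(buckets) - 1, 0, -1):   # scan oldest to newest
            merged = combine(buckets[i], buckets[i - 1])
            newer = suffix_stats(buckets, i - 1)
            if newer is not None and (9 / eps ** 2) * merged[2] <= newer[2]:
                buckets[i - 1:i + 1] = [merged]
                changed = True
                break
    return buckets
```

After the loop, Invariant 2 holds by construction: every adjacent pair that could still be merged has already been merged.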

22
Update and Query time
  • Query time: O(1)
  • We maintain the n, V, and µ values for Bm and Bm*
  • Update time: O((1/ε²) log N) worst case
  • Time to check and combine buckets
  • Can be made O(1) amortized:
  • Merge buckets periodically instead of after each
    new data element

23
k-medians summary (1/2)
  • Assignment distance substitutes for variance
  • The assignment distance is obtained from an approximate
    clustering of the points in the bucket
  • Use the hierarchical clustering algorithm of [GMMO00]:
  • Original points cluster to give level-1 medians
  • Level-i medians cluster to give level-(i+1)
    medians
  • Medians are weighted by the count of assigned points
  • Each bucket maintains a collection of medians at
    various levels

24
k-medians summary (2/2)
  • Merging buckets:
  • Combine the medians from each level i
  • If they exceed N^t in number, cluster them to get
    level-(i+1) medians
  • Estimation procedure:
  • Weighted clustering of all medians from all
    buckets produces k overall medians
  • Estimating the contribution of the last bucket:
  • Pretend each point is at its closest median
  • Relies on approximate counts of the active points
    assigned to each median
  • See the paper for details!

25
Open Problems
  • Variance:
  • Close the gap between the upper and lower bounds
    ((1/ε) log N vs. (1/ε²) log N)
  • Improve the update time from O(1) amortized to O(1)
    worst case
  • k-median clustering:
  • [COP03] give a polylog(N)-space approximation algorithm
    in the streaming data model
  • Can a similar result be obtained in the sliding
    window model?

26
Conclusion
  • Algorithms to approximately maintain the variance and
    a k-median clustering in the sliding window model
  • Previous results using Exponential Histograms
    required weak additivity
  • Weak additivity is not satisfied by variance or k-median clustering
  • Adapted EHs for variance and k-medians
  • The techniques may be useful for other statistics
    that violate weak additivity