Title: Maintaining Variance and k-Medians over Data Stream Windows
1. Maintaining Variance and k-Medians over Data Stream Windows
- Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O'Callaghan (Stanford University)
2. Data Streams and Sliding Windows
- Streaming data model
  - Useful for applications with high data volumes and timeliness requirements
  - Data processed in a single pass
  - Limited memory (sublinear in stream size)
- Sliding window model
  - Variation of the streaming data model
  - Only recent data matters
  - Parameterized by window size N
  - Limited memory (sublinear in window size)
3. Sliding Window (SW) Model
[Figure: a bit stream in which time increases to the right; a sliding window of size N = 7 covers the most recent elements, ending at the current time]
4. Variance and k-Medians
- Variance: V = Σ (xi − µ)², where µ = Σ xi / N
- k-median clustering
  - Given N points x1, …, xN in a metric space
  - Find k medians C = {c1, c2, …, ck} that minimize Σ d(xi, C), the assignment distance (toy sketch of both quantities below)
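To fix notation, here is a toy Python sketch of both quantities on a small 1-D example; the function names, the data, and the fixed centers are inventions for this illustration, not part of the talk.

```python
# Illustration of the definitions above: variance as the sum of squared
# deviations from the mean, and the k-median assignment distance for a
# fixed set of centers.  The 1-D data and the centers are arbitrary.

def variance(xs):
    """V = sum_i (x_i - mu)^2, with mu = sum_i x_i / N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

def assignment_distance(points, centers):
    """Sum over points of the distance to the nearest center (1-D metric)."""
    return sum(min(abs(p - c) for c in centers) for p in points)

window = [1, 2, 2, 9, 10, 10]
print(variance(window))                      # ~97.33
print(assignment_distance(window, [2, 10]))  # 2
```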
5. Previous Results in the SW Model
- Count of non-zero elements / sum of positive integers (DGIM02)
  - (1 ± ε) approximation
  - Space: Θ((1/ε) log N) words = Θ((1/ε) log² N) bits
  - Update time: Θ(log N) worst case, Θ(1) amortized
    - Improved to O(1) worst case by GT02
  - Exponential Histogram (EH) data structure
- Generalized SW model (CS03, previous talk)
6. Results: Variance
- (1 ± ε) approximation
- Space: O((1/ε²) log N) words
- Update time: O(1) amortized, O((1/ε²) log N) worst case
7. Results: k-Medians
- 2^O(1/τ) approximation of the assignment distance (0 < τ < ½)
- Space: O((k/τ⁴) N^(2τ))
- Update time: O(k) amortized, O((k²/τ³) N^(2τ)) worst case
- Query time: O((k²/τ³) N^(2τ))
8. Remainder of the Talk
- Overview of the Exponential Histogram
- Where EH fails and how to fix it
- Algorithm for variance
- Main ideas in the k-medians algorithm
- Open problems
9. Sliding Window Computation
- Main difficulty: discounting expiring data
  - As each new element arrives, one old element expires
  - The value of the expiring element can't be known exactly
  - How do we update our data structure?
- One solution: use histograms
[Figure: a bit stream divided into buckets; when the oldest bucket (sum 3) expires, the bucket sums 3, 2, 1, 2 become 2, 1, 2]
10. Containing the Error
- The error comes from the last (oldest) bucket
- Need to ensure that the contribution of the last bucket is not too big
- Bad example:
[Figure: a bucketed bit stream with bucket sums 4, 4, 4 in which the most recent elements are all zero, so the window's count comes almost entirely from the last bucket, whose expired portion is unknown]
11. Exponential Histograms
- Exponential Histogram algorithm (sketch below)
  - Initially, buckets contain 1 item each
  - Merge adjacent buckets once the sum of the later (more recent) buckets is large enough
- Example of the bucket sums as new elements arrive and merges occur:
  4, 2, 2, 1 → 4, 2, 2, 1, 1 → 4, 2, 2, 1, 1, 1 → 4, 2, 2, 2, 1 → 4, 4, 2, 1
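For concreteness, here is a minimal Python sketch of a DGIM-style exponential histogram for counting 1s among the last N bits of a 0/1 stream. The class name and the specific threshold of at most ⌈1/ε⌉ + 1 buckets per size follow the usual presentation of DGIM02 rather than these slides, and the query charges half of the oldest bucket, which is what keeps the relative error bounded.

```python
from collections import deque
from math import ceil

class DGIMCounter:
    """Exponential histogram for counting 1s among the last N bits.

    Buckets are (timestamp of most recent 1, size) pairs with sizes that
    are powers of two; at most k + 1 buckets of each size are kept, and
    the oldest bucket's contribution is estimated as half its size."""

    def __init__(self, window_size, eps):
        self.N = window_size
        self.k = ceil(1.0 / eps)
        self.t = 0
        self.buckets = deque()  # newest bucket first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.appendleft((self.t, 1))
        # Cascade: whenever some size occurs more than k + 1 times,
        # merge the two oldest buckets of that size into one of twice the size.
        size = 1
        while True:
            idxs = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idxs) <= self.k + 1:
                break
            i, j = idxs[-2], idxs[-1]                         # two oldest of this size
            self.buckets[i] = (self.buckets[i][0], 2 * size)  # keep the newer timestamp
            del self.buckets[j]
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        oldest = self.buckets[-1][1]
        return total - oldest + oldest // 2  # charge half of the oldest bucket
```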
12. Where EH Goes Wrong
- DGIM02: can estimate any function f defined over windows that satisfies
  - Positive: f(X) ≥ 0
  - Polynomially bounded: f(X) ≤ poly(|X|)
  - Composable: can compute f(X ∪ Y) from f(X), f(Y), and little additional information
  - Weakly additive: f(X) + f(Y) ≤ f(X ∪ Y) ≤ c (f(X) + f(Y)) for some constant c
- The weak additivity condition does not hold for variance or k-medians
13. Notation
[Figure: the current window of size N partitioned into buckets Bm, Bm-1, …, B2, B1, with Bm the oldest and B1 the most recent]
- Vi: variance of the i-th bucket
- ni: number of elements in the i-th bucket
- µi: mean of the i-th bucket
14. Variance Composition
- Bi,j: concatenation of buckets i and j (sketch below)
  - ni,j = ni + nj
  - µi,j = (ni µi + nj µj) / ni,j
  - Vi,j = Vi + Vj + (ni nj / ni,j) (µi − µj)²
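A small Python sketch of these composition formulas; the function name and the toy numbers are only illustrative. Combining exact bucket summaries this way is lossless, which is why buckets can be merged without revisiting the raw data.

```python
# Combining the summary (n, mu, V) of two buckets into the summary of
# their concatenation B_{i,j}, following the formulas on this slide.

def combine(n_i, mu_i, V_i, n_j, mu_j, V_j):
    """Return (n, mu, V) for the concatenation of buckets i and j, where
    V is the sum of squared deviations from the mean."""
    n = n_i + n_j
    mu = (n_i * mu_i + n_j * mu_j) / n
    V = V_i + V_j + (n_i * n_j / n) * (mu_i - mu_j) ** 2
    return n, mu, V

# Combining {1, 3} (n=2, mu=2, V=2) and {5, 9} (n=2, mu=7, V=8) gives the
# exact statistics of {1, 3, 5, 9}: n=4, mu=4.5, V=35.
print(combine(2, 2.0, 2.0, 2, 7.0, 8.0))
```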
15. Failure of Weak Additivity
[Figure: values plotted over time; two buckets are each tightly clustered but far apart from each other, so the variance of the combined bucket is large]
- Example: the buckets {0, 0} and {10, 10} each have variance 0, yet their concatenation {0, 0, 10, 10} has variance 100, so no constant c satisfies weak additivity
16. Main Solution Idea
- More careful estimation of the last bucket's contribution
- Decompose the variance into two parts
  - Internal variance: within a bucket
  - External variance: between buckets
[Figure: the total variance split into the internal variance of bucket i, the internal variance of bucket j, and the external variance between them]
17. Main Solution Idea
- When estimating the contribution of the last bucket (sketch below)
  - Internal variance is charged evenly to each point
  - External variance: pretend each point is at the average for its bucket
    - Variance of the bucket is small ⇒ its points aren't too far from the average
    - Points aren't far from the average ⇒ the average is a good approximation for each point
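A hedged sketch of this estimation step in code. The function name and signature are inventions for illustration; the estimate of how many of the last bucket's points are still active is left as a parameter, and the paper's exact estimator differs in its details.

```python
# Sketch of the decomposition on this slide.  Inputs are the full summary
# (n_m, mu_m, V_m) of the last (oldest) bucket, the exact summary
# (n_rest, mu_rest, V_rest) of everything newer, and n_active, an
# estimate of how many of the last bucket's points are still in the
# window.  How n_active is obtained is not shown here.

def estimate_variance(n_m, mu_m, V_m, n_rest, mu_rest, V_rest, n_active):
    # Internal variance of the last bucket, charged evenly to its points:
    # the active points carry an n_active / n_m share of V_m.
    internal = (n_active / n_m) * V_m
    # External variance: pretend every active point sits at the bucket
    # mean mu_m, so the cross term uses only the two means.
    n_total = n_active + n_rest
    external = (n_active * n_rest / n_total) * (mu_m - mu_rest) ** 2
    return V_rest + internal + external
```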
18. Main Idea Illustration
[Figure: values plotted over time for two buckets, with the spread between the bucket averages marked]
- Spread is small ⇒ external variance is small
- Spread is large ⇒ error from bucket averaging is insignificant
19. Variance Error Bound
[Figure: the current window of size N partitioned into buckets Bm, Bm-1, …, B2, B1; B*m denotes the combination of all buckets newer than the oldest bucket Bm]
- Theorem: the relative error is at most ε, provided Vm ≤ (ε²/9) V*m
- Aim: maintain Vm ≤ (ε²/9) V*m using as few buckets as possible
20. Variance Algorithm
- EH algorithm for variance (sketch below)
  - Initially, buckets contain 1 item each
  - Merge adjacent buckets Bi, Bi-1 whenever (9/ε²) Vi,i-1 ≤ V*i-1 holds (i.e., the variance of the merged bucket is small compared to the combined variance of the later buckets)
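An unoptimized Python sketch of this maintenance loop, using the composition formula from slide 14. The Bucket and VarianceEH names are made up for the sketch, expiry is handled by storing each bucket's newest timestamp, and merging simply rescans all adjacent pairs after every insertion rather than following the paper's schedule that yields O(1) amortized updates.

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    ts: int     # timestamp of the bucket's most recent element
    n: int      # number of elements
    mu: float   # mean
    V: float    # sum of squared deviations from the bucket mean

def combine(a, b):
    """Summary of the concatenation of two buckets (slide 14 formulas)."""
    n = a.n + b.n
    mu = (a.n * a.mu + b.n * b.mu) / n
    V = a.V + b.V + (a.n * b.n / n) * (a.mu - b.mu) ** 2
    return Bucket(max(a.ts, b.ts), n, mu, V)

class VarianceEH:
    """Each arriving element becomes a singleton bucket; the oldest bucket
    is dropped once all of its elements have expired; adjacent buckets are
    merged while (9/eps^2) * Vi,i-1 <= V*i-1."""

    def __init__(self, window_size, eps):
        self.N = window_size
        self.eps = eps
        self.t = 0
        self.buckets = []  # oldest bucket first

    def _suffix_variance(self, start):
        """Variance of the combination of self.buckets[start:]."""
        rest = self.buckets[start:]
        if not rest:
            return 0.0
        acc = rest[0]
        for b in rest[1:]:
            acc = combine(acc, b)
        return acc.V

    def add(self, x):
        self.t += 1
        self.buckets.append(Bucket(self.t, 1, float(x), 0.0))
        # Drop the oldest bucket once even its newest element has expired.
        if self.buckets[0].ts <= self.t - self.N:
            self.buckets.pop(0)
        # Merge any adjacent pair whose merged variance is small relative to
        # the combined variance of all newer buckets (simple rescan).
        merged = True
        while merged:
            merged = False
            for i in range(len(self.buckets) - 1):
                pair = combine(self.buckets[i], self.buckets[i + 1])
                if (9 / self.eps ** 2) * pair.V <= self._suffix_variance(i + 2):
                    self.buckets[i:i + 2] = [pair]
                    merged = True
                    break
```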
21. Invariants
- Invariant 1: (9/ε²) Vi ≤ V*i
  - Ensures that the relative error is at most ε
- Invariant 2: (9/ε²) Vi,i-1 > V*i-1
  - Ensures that the number of buckets is O((1/ε²) log N)
  - Each bucket requires O(1) space
22. Update and Query Time
- Query time: O(1)
  - We maintain the n, V, and µ values for Bm and B*m (sketch below)
- Update time: O((1/ε²) log N) worst case
  - Time to check and combine buckets
  - Can be made O(1) amortized by merging buckets periodically instead of after each new data element
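A sketch of the query step for the VarianceEH structure from the sketch after slide 20. The slides keep the n, V, µ aggregates for Bm and B*m incrementally to get O(1) queries; this sketch recomputes B*m for simplicity, and it guesses the oldest bucket's active count as half the bucket, which is an assumption of this sketch, not the paper's rule.

```python
def query_variance(eh):
    """Estimate the window variance from a VarianceEH instance (earlier sketch)."""
    if not eh.buckets:
        return 0.0
    if len(eh.buckets) == 1:
        return eh.buckets[0].V
    oldest = eh.buckets[0]
    # Fold the newer buckets into exact statistics (n_r, mu_r, V_r) of B*_m.
    n_r, mu_r, V_r = eh.buckets[1].n, eh.buckets[1].mu, eh.buckets[1].V
    for b in eh.buckets[2:]:
        n_new = n_r + b.n
        V_r = V_r + b.V + (n_r * b.n / n_new) * (mu_r - b.mu) ** 2
        mu_r = (n_r * mu_r + b.n * b.mu) / n_new
        n_r = n_new
    n_active = max(1, oldest.n // 2)              # assumed count of still-active points
    internal = (n_active / oldest.n) * oldest.V   # internal variance, charged evenly
    n_total = n_active + n_r
    external = (n_active * n_r / n_total) * (oldest.mu - mu_r) ** 2
    return V_r + internal + external
```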
23. k-Medians Summary (1/2)
- The assignment distance substitutes for the variance
- The assignment distance is obtained from an approximate clustering of the points in the bucket
- Use the hierarchical clustering algorithm of GMMO00
  - Original points are clustered to give level-1 medians
  - Level-i medians are clustered to give level-(i+1) medians
  - Medians are weighted by the count of assigned points
- Each bucket maintains a collection of medians at various levels
24. k-Medians Summary (2/2)
- Merging buckets (sketch below)
  - Combine the medians from each level i
  - If they exceed N^τ in number, cluster them to get level-(i+1) medians
- Estimation procedure
  - Weighted clustering of all medians from all buckets produces the k overall medians
  - To estimate the contribution of the last bucket, pretend each point is at its closest median
  - Relies on approximate counts of the active points assigned to each median
- See the paper for details!
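A rough Python sketch of the per-bucket median hierarchy described on these two slides. The class and function names, the 1-D metric, and the greedy cluster_weighted stand-in are inventions of this sketch; the paper relies on the bicriteria clustering algorithm of GMMO00 and maintains approximate active counts for the estimation step, neither of which is reproduced here.

```python
def cluster_weighted(points, k):
    """Cluster weighted 1-D points [(value, weight), ...] into at most k
    weighted medians: greedy farthest-point choice of centers followed by
    nearest-center assignment.  A crude stand-in for GMMO00."""
    centers = [points[0][0]]
    while len(centers) < min(k, len(points)):
        far = max(points, key=lambda p: min(abs(p[0] - c) for c in centers))
        centers.append(far[0])
    weights = {c: 0 for c in centers}
    for x, w in points:
        weights[min(centers, key=lambda c: abs(x - c))] += w
    return [(c, w) for c, w in weights.items() if w > 0]

class MedianBucket:
    """Per-bucket hierarchy of weighted medians, organised by level."""

    def __init__(self, threshold, k):
        assert k <= threshold, "need k <= threshold for compression to terminate"
        self.threshold = threshold  # N^tau in the talk
        self.k = k
        self.levels = {0: []}       # level -> list of (median, weight)

    def add_point(self, x):
        self.levels[0].append((x, 1))
        self._compress()

    def merge(self, other):
        # Merging buckets: combine the medians from each level, then compress.
        for lvl, meds in other.levels.items():
            self.levels.setdefault(lvl, []).extend(meds)
        self._compress()

    def _compress(self):
        # Whenever a level holds more than `threshold` medians, cluster it
        # into at most k weighted medians one level up.
        changed = True
        while changed:
            changed = False
            for lvl in sorted(self.levels):
                if len(self.levels[lvl]) > self.threshold:
                    promoted = cluster_weighted(self.levels[lvl], self.k)
                    self.levels[lvl] = []
                    self.levels.setdefault(lvl + 1, []).extend(promoted)
                    changed = True
                    break
```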
25. Open Problems
- Variance
  - Close the gap between the upper and lower bounds ((1/ε) log N vs. (1/ε²) log N)
  - Improve the update time from O(1) amortized to O(1) worst case
- k-median clustering
  - COP03 give a polylog(N)-space approximation algorithm in the streaming data model
  - Can a similar result be obtained in the sliding window model?
26. Conclusion
- Algorithms to approximately maintain variance and k-median clustering in the sliding window model
- Previous results using Exponential Histograms required weak additivity, which is not satisfied by variance or k-median clustering
- Adapted EHs for variance and k-medians
- The techniques may be useful for other statistics that violate weak additivity