Maintaining Variance and k-Medians over Data Stream Windows

1
Maintaining Variance and k-Medians
over Data Stream Windows
  • Brian Babcock, Mayur Datar, Rajeev Motwani,
    Liadan O'Callaghan
  • Stanford University

2
Data Streams and Sliding Windows
  • Streaming data model
    • Useful for applications with high data volumes and timeliness requirements
    • Data processed in a single pass
    • Limited memory (sublinear in stream size)
  • Sliding window model
    • Variation of the streaming data model
    • Only recent data matters
    • Parameterized by window size N
    • Limited memory (sublinear in window size)

3
Sliding Window (SW) Model
(Figure: a bit stream …1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1, with time increasing to the right; the most recent N = 7 elements form the current window, ending at the current time.)
4
Variance and k-Medians
  • Variance: V = Σi (xi − µ)², where µ = (Σi xi)/N
  • k-median clustering:
  • Given N points (x1, …, xN) in a metric space
  • Find k points C = {c1, c2, …, ck} that minimize
    Σi d(xi, C) (the assignment distance)
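For concreteness, here is a minimal illustrative Python sketch (not from the talk) of the two quantities just defined, for 1-D points with d(x, c) = |x − c|:

```python
# Illustrative sketch: the two window statistics the talk maintains,
# computed naively over an in-memory window of 1-D points.

def variance(window):
    """V = sum_i (x_i - mu)^2, with mu = (sum_i x_i) / N."""
    n = len(window)
    mu = sum(window) / n
    return sum((x - mu) ** 2 for x in window)

def assignment_distance(window, centers):
    """k-median cost: sum_i d(x_i, C), distance to the nearest center."""
    return sum(min(abs(x - c) for c in centers) for x in window)

w = [1.0, 2.0, 3.0, 4.0]
print(variance(w))                         # 5.0 (mu = 2.5)
print(assignment_distance(w, [1.5, 3.5]))  # 2.0 (each point is 0.5 away)
```

The sliding-window algorithms below approximate exactly these values without storing the whole window.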

5
Previous Results in SW Model
  • Count of non-zero elements / sum of positive integers [DGIM02]
  • (1 ± ε) approximation
  • Space: Θ((1/ε) log N) words = Θ((1/ε) log² N) bits
  • Update time: Θ(log N) worst case, Θ(1) amortized
  • Improved to O(1) worst case by [GT02]
  • Exponential Histogram (EH) data structure
  • Generalized SW model [CS03] (previous talk)

6
Results: Variance
  • (1 ± ε) approximation
  • Space: O((1/ε²) log N) words
  • Update time: O(1) amortized, O((1/ε²) log N)
    worst case

7
Results: k-medians
  • 2^O(1/t) approximation of assignment distance (0 < t < ½)
  • Space: O((k/t⁴) N^(2t))
  • Update time: O(k) amortized, O((k²/t³) N^(2t))
    worst case
  • Query time: O((k²/t³) N^(2t))

8
Remainder of the Talk
  • Overview of Exponential Histogram
  • Where EH fails and how to fix it
  • Algorithm for Variance
  • Main ideas in k-medians algorithm
  • Open problems

9
Sliding Window Computation
  • Main difficulty: discounting expiring data
  • As each new element arrives, one old element expires
  • The value of the expiring element can't be known exactly
  • How do we update our data structure?
  • One solution: use histograms

(Figure: as the window slides, the oldest histogram bucket expires; bucket sums 3, 2, 1, 2 become 2, 1, 2.)
10
Containing the Error
  • Error comes from the last (oldest) bucket
  • Need to ensure that the contribution of the last bucket
    is not too big
  • Bad example:

(Figure: a stream bucketed with sums 4, 4, 4; after the window slides, a single bucket with sum 4 remains, so the oldest bucket dominates the estimate.)
11
Exponential Histograms
  • Exponential Histogram algorithm:
  • Initially, buckets contain 1 item each
  • Merge adjacent buckets once the sum of the later
    (more recent) buckets is large enough

(Figure: as new elements arrive, the bucket sums evolve: 4, 2, 2, 1 → 4, 2, 2, 1, 1 → 4, 2, 2, 1, 1, 1 → 4, 2, 2, 2, 1 → 4, 4, 2, 1.)
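The merge rule above can be sketched for the original counting problem of [DGIM02]. The class below is an illustrative simplification, not the authors' code: the names and the parameter k (roughly 1/ε, controlling how many buckets of each size are kept) are assumptions for this sketch.

```python
# Minimal Exponential Histogram sketch for counting 1s among the last
# `window` bits, in the spirit of [DGIM02].  Bucket sizes are powers of
# two; at most k+1 buckets of each size are kept, and the two oldest
# buckets of an overfull size are merged.

class ExpHistogram:
    def __init__(self, window, k=2):
        self.window, self.k = window, k
        self.buckets = []   # (timestamp of newest element, size), newest first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Drop buckets whose newest element has slid out of the window.
        self.buckets = [(t, s) for (t, s) in self.buckets
                        if t > self.time - self.window]
        if bit:
            self.buckets.insert(0, (self.time, 1))
            self._merge()

    def _merge(self):
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idx) <= self.k + 1:
                break
            i, j = idx[-2], idx[-1]      # the two oldest buckets of this size
            self.buckets[j] = (self.buckets[i][0], 2 * size)  # keep newer timestamp
            del self.buckets[i]
            size *= 2                    # a merge may cascade upward

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        oldest = self.buckets[-1][1]
        return total - oldest + oldest // 2  # charge half of the oldest bucket
```

The only uncertainty is the oldest bucket, which may be partially expired; charging half its size gives the (1 ± ε)-style guarantee for counts, and the next slides show why this scheme is not enough for variance.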
12
Where EH Goes Wrong
  • [DGIM02]: can estimate any function f defined
    over windows that satisfies:
  • Positive: f(X) ≥ 0
  • Polynomially bounded: f(X) ≤ poly(|X|)
  • Composable: can compute f(X ∪ Y) from f(X), f(Y),
    and little additional information
  • Weakly additive: f(X) + f(Y) ≤ f(X ∪ Y) ≤ c·(f(X) + f(Y))
    for some constant c
  • The weak additivity condition does not hold for
    variance or k-medians

13
Notation
(Figure: the current window of size N partitioned into buckets Bm, Bm−1, …, B2, B1, with Bm the oldest and B1 the most recent.)
  • Vi = variance of the ith bucket
  • ni = number of elements in the ith bucket
  • µi = mean of the ith bucket
14
Variance composition
  • Bi,j = concatenation of buckets i and j
  • ni,j = ni + nj,  µi,j = (ni·µi + nj·µj)/ni,j,
    Vi,j = Vi + Vj + (ni·nj/ni,j)·(µi − µj)²
15
Failure of Weak Additivity
(Figure: two adjacent buckets whose values are nearly constant within each bucket but far apart from each other; each bucket's variance is small, yet the variance of the combined bucket is large.)
16
Main Solution Idea
  • More careful estimation of the last bucket's
    contribution
  • Decompose the variance into two parts:
  • Internal variance: within a bucket
  • External variance: between buckets

(Figure: the total variance decomposes into the internal variance of bucket i, the internal variance of bucket j, and the external variance between them.)
17
Main Solution Idea
  • When estimating the contribution of the last bucket:
  • Internal variance is charged evenly to each point
  • External variance:
  • Pretend each point is at the average for its
    bucket
  • The variance for the bucket is small ⇒ points aren't
    too far from the average
  • Points aren't far from the average ⇒ the average
    is a good approximation for each point
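The two charging rules above can be sketched as a small estimator. Everything here is a hedged illustration: nm_active, an estimate of how many of the last bucket's points are still in the window, is taken as a parameter (the paper derives such an estimate), and the helper names are mine.

```python
# Hedged sketch of the last-bucket estimation idea: charge the internal
# variance Vm evenly per surviving point, and place each surviving point
# at the bucket mean µm (the external part), then combine with the exact
# suffix statistics using the slide-14 composition.

def combine(a, b):
    (n1, m1, v1), (n2, m2, v2) = a, b
    n = n1 + n2
    mu = (n1 * m1 + n2 * m2) / n
    return (n, mu, v1 + v2 + (n1 * n2 / n) * (m1 - m2) ** 2)

def estimate_window_variance(last_bucket, nm_active, suffix):
    """last_bucket, suffix: (n, mu, V) triples; suffix covers B(m-1)..B1."""
    nm, mum, vm = last_bucket
    # Surviving points modeled as nm_active copies of µm, each carrying
    # an even share Vm/nm of the internal variance.
    pseudo = (nm_active, mum, vm * nm_active / nm)
    return combine(pseudo, suffix)[2]
```

When no points of the last bucket have expired (nm_active = nm), the estimate is exact; the error otherwise comes only from the last bucket, which the next slide bounds.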

18
Main Idea Illustration
(Figure: points plotted over time, with the "spread" marking how far a bucket's points lie from its average.)
  • Spread is small ⇒ external variance is small
  • Spread is large ⇒ error from bucket averaging is
    insignificant

19
Variance error bound
(Figure: the current window of size N partitioned into buckets Bm, Bm−1, …, B2, B1; Bm* denotes the combined suffix bucket Bm−1 ∪ … ∪ B1, with variance Vm*.)
  • Theorem: relative error ≤ ε, provided Vm ≤ (ε²/9)·Vm*
  • Aim: maintain Vm ≤ (ε²/9)·Vm* using as few
    buckets as possible

20
Variance algorithm
  • EH algorithm for variance:
  • Initially, buckets contain 1 item each
  • Merge adjacent buckets Bi, Bi−1 whenever
    (9/ε²)·Vi,i−1 ≤ V(i−1)* (i.e., the variance of the
    merged bucket is small compared to the combined
    variance of the later, more recent buckets)

21
Invariants
  • Invariant 1: (9/ε²)·Vi ≤ Vi*
  • Ensures that the relative error is at most ε
  • Invariant 2: (9/ε²)·Vi,i−1 > V(i−1)*
  • Ensures that the number of buckets is O((1/ε²) log N)
  • Each bucket requires O(1) space
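A hedged sketch of how the merge condition might be enforced over a list of bucket summaries. The helper names are mine, and the composition used is the standard variance merge from slide 14; the real algorithm interleaves this with arrivals and expirations.

```python
# Buckets are kept newest-first as (n, mu, V) triples.  V_{i*} is the
# variance of the union of all buckets newer than position i.

def combine(a, b):
    (n1, m1, v1), (n2, m2, v2) = a, b
    n = n1 + n2
    mu = (n1 * m1 + n2 * m2) / n
    return (n, mu, v1 + v2 + (n1 * n2 / n) * (m1 - m2) ** 2)

def suffix_stats(buckets, i):
    """Combined (n, mu, V) of buckets[0..i-1], i.e. everything newer
    than buckets[i]; None if nothing is newer."""
    acc = None
    for b in buckets[:i]:
        acc = b if acc is None else combine(acc, b)
    return acc

def enforce_invariants(buckets, eps):
    """Merge adjacent buckets i, i-1 while (9/eps^2)*V_{i,i-1} <= V_{(i-1)*}."""
    changed = True
    while changed:
        changed = False
        for i in range(len(buckets) - 1, 0, -1):   # scan oldest to newest
            merged = combine(buckets[i], buckets[i - 1])
            newer = suffix_stats(buckets, i - 1)
            if newer is not None and (9 / eps ** 2) * merged[2] <= newer[2]:
                buckets[i - 1:i + 1] = [merged]
                changed = True
                break
    return buckets
```

After the loop, Invariant 2 holds by construction: every adjacent pair that could still be merged has already been merged.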

22
Update and Query time
  • Query time: O(1)
  • We maintain the n, V, and µ values for Bm and Bm*
  • Update time: O((1/ε²) log N) worst case
  • Time to check and combine buckets
  • Can be made O(1) amortized:
  • Merge buckets periodically instead of after each
    new data element

23
k-medians summary (1/2)
  • Assignment distance substitutes for variance
  • The assignment distance is obtained from an approximate
    clustering of the points in the bucket
  • Use the hierarchical clustering algorithm of [GMMO00]:
  • Original points cluster to give level-1 medians
  • Level-i medians cluster to give level-(i+1)
    medians
  • Medians are weighted by the count of assigned points
  • Each bucket maintains a collection of medians at
    various levels

24
k-medians summary (2/2)
  • Merging buckets:
  • Combine the medians from each level i
  • If they exceed N^t in number, cluster them to get
    level-(i+1) medians
  • Estimation procedure:
  • Weighted clustering of all medians from all
    buckets produces k overall medians
  • Estimating the contribution of the last bucket:
  • Pretend each point is at its closest median
  • Relies on approximate counts of the active points
    assigned to each median
  • See the paper for details!

25
Open Problems
  • Variance:
  • Close the gap between the upper and lower bounds
    ((1/ε) log N vs. (1/ε²) log N)
  • Improve the update time from O(1) amortized to O(1)
    worst case
  • k-median clustering:
  • [COP03] give a polylog(N)-space approximation algorithm
    in the streaming data model
  • Can a similar result be obtained in the sliding
    window model?

26
Conclusion
  • Algorithms to approximately maintain the variance and
    a k-median clustering in the sliding window model
  • Previous results using Exponential Histograms
    required weak additivity
  • Weak additivity is not satisfied by variance or k-median clustering
  • Adapted EHs for variance and k-medians
  • The techniques may be useful for other statistics
    that violate weak additivity