Title: Maintaining Variance and k-Medians over Data Stream Windows
1. Maintaining Variance and k-Medians over Data Stream Windows
- Paper by Brian Babcock, Mayur Datar, Rajeev Motwani and Liadan O'Callaghan.
- Presentation by Anat Rapoport, December 2003.
2. Characteristics of the data stream
- Data elements arrive continually
- Only the most recent N elements are used when answering queries
- Single linear scan algorithm (we get only one look at each element)
- Store only a summary of the data seen thus far
3. Introduction
- Two important and related problems:
- Variance
- k-median clustering
4. Problem 1 (Variance)
- Given a stream of numbers, maintain at every instant the variance of the last N values: $V = \sum_{i=t-N+1}^{t} (x_i - \mu)^2$
- where $\mu = \frac{1}{N}\sum_{i=t-N+1}^{t} x_i$ denotes the mean of the last N values
5. Problem 1 (Variance)
- We cannot buffer the entire sliding window in memory
- So we cannot compute the variance exactly at every instant
- We will solve this problem approximately
- We use O((1/ε²)·log(NR²)) memory and provide an estimate with relative error of at most ε
- The time required per new element is amortized O(1)
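As a baseline (my sketch, not from the paper), the exact computation the authors rule out: it buffers the whole window, so it needs O(N) memory. V here is the sum of squared deviations, as in the slides.

```python
from collections import deque

def exact_sliding_variance(stream, N):
    """Exact variance of the last N values: correct, but O(N) memory."""
    window = deque(maxlen=N)  # buffers the entire sliding window
    for x in stream:
        window.append(x)
        mu = sum(window) / len(window)        # mean of the current window
        yield sum((v - mu) ** 2 for v in window)
```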
6. Extend to k-median
- Given a multiset X of objects in a metric space M with distance function l, the k-median problem is to pick k points c1,…,ck ∈ M so as to minimize $\sum_{x \in X} l(x, C(x))$, where C(x) is the closest of c1,…,ck to x
- If C(x) = ci then x is said to be assigned to ci, and l(x, ci) is called the assignment distance of x
- The objective function is the sum of the assignment distances
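The objective translates directly into code; a minimal sketch (names mine), with `dist` standing in for the metric l:

```python
def kmedian_cost(points, centers, dist):
    """The k-median objective: each point pays its assignment distance,
    i.e. the distance l(x, C(x)) to its closest center C(x)."""
    return sum(min(dist(x, c) for c in centers) for x in points)
```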
7. Problem 2 (SWKM)
- Given a stream of points from a metric space M with distance function l, a window size N, and a parameter k, maintain at every instant t a median set c1,…,ck ∈ M minimizing $\sum_{x \in X_t} l(x, C(x))$, where Xt is the multiset of the N most recent points at time t
8. Exponential Histogram
- From last week: "Maintaining simple statistics over sliding windows"
- The exponential histogram estimates a class of aggregate functions over sliding windows
- Their result applies to any function f satisfying the following properties for all multisets X, Y
9. Where EH goes wrong
- EH can estimate any function f defined over windows which satisfies:
- Positive
- Polynomially bounded
- Composable
- Weakly additive: f(X) + f(Y) ≤ f(X ∪ Y) ≤ Cf·(f(X) + f(Y)), where Cf ≥ 1 is a constant
- The Weakly Additive condition is not valid for variance or k-medians, as the example below shows
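A tiny numeric check (my example): two buckets can each have zero variance while their union has arbitrarily large variance, so no constant Cf satisfies the weak-additivity upper bound.

```python
def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

X, Y = [0.0, 0.0], [100.0, 100.0]
print(variance(X), variance(Y))  # 0.0 0.0 -- each bucket's variance is zero
print(variance(X + Y))           # 10000.0 -- exceeds Cf*(0+0) for any Cf
```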
10. Failure of Weak Additivity
[Figure: data values plotted over time; the variance of each bucket is small, yet the variance of the whole window is large.]
11. The idea
- Summarize intervals of the data stream using composable synopses
- For efficient memory use, adjacent intervals are combined when doing so doesn't increase the error significantly
- The synopsis of the last interval in the sliding window is inaccurate: some of its points have expired
- HOWEVER
- We will find a way to estimate this interval
12. Timestamp
- Corresponds to the position of an active data element in the current window
- We do not make explicit updates
- We use a wraparound counter of log N bits
- A timestamp can be extracted by comparison with the counter value of the current arrival
13. Model
- We store the data elements in the buckets of the histogram
- Every bucket stores the synopsis structure for a contiguous set of elements
- The partition is based on arrival time
- Each bucket also has a timestamp: that of the most recent data element in it
- When the timestamp reaches N+1 we drop the bucket
14. Model
- Buckets are numbered B1,…,Bm
- B1 is the most recent
- Bm is the oldest
- t1,…,tm denote the bucket timestamps
- All buckets but Bm contain only active data elements
15. Maintaining variance over sliding windows: the algorithm
16. Details
- We would like to estimate the variance with relative error of at most ε
- Maintain for each bucket Bi, besides its timestamp ti:
- the number of elements ni
- the mean µi
- the variance Vi
17. Details
- Define another set of buckets B1*,…,Bm* that represent suffixes of the data stream
- The bucket Bm* represents all the points that arrived after the oldest non-expired bucket Bm
- The statistics for these buckets are computed temporarily, not stored
18. Data structure: exponential histogram
[Figure: a window of size N over elements x1,…,xN, ordered from oldest to most recent, partitioned by arrival time into timestamped buckets B1 (most recent), B2, …, Bm-1, Bm (oldest).]
19. Combination rule
- In the algorithm we will need to combine adjacent buckets
- Consider two buckets Bi and Bj that get combined to form a new bucket Bi,j
- The statistics for Bi,j are:
- ni,j = ni + nj
- µi,j = (ni·µi + nj·µj) / ni,j
- Vi,j = Vi + Vj + (ni·nj / ni,j)·(µi − µj)²
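A minimal sketch of the rule in Python (names mine); a bucket is a (count, mean, variance) triple, with variance stored as the sum of squared deviations:

```python
from typing import NamedTuple

class Bucket(NamedTuple):
    n: int        # number of elements
    mean: float   # mean of the elements
    var: float    # V = sum of squared deviations from the mean

def combine(bi: Bucket, bj: Bucket) -> Bucket:
    """Combination rule: exact statistics for the union of two buckets."""
    n = bi.n + bj.n
    mean = (bi.n * bi.mean + bj.n * bj.mean) / n
    var = bi.var + bj.var + (bi.n * bj.n / n) * (bi.mean - bj.mean) ** 2
    return Bucket(n, mean, var)
```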
20. Lemma 1
- The bucket combination procedure correctly computes ni,j, µi,j, Vi,j for the new bucket
- Proof:
- Note that ni,j and µi,j are correctly computed from the definitions of count and average
- Define δi = µi − µi,j and δj = µj − µi,j
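The next slide's transcript is missing; the algebra it presumably carried (my reconstruction) expands each bucket's squared deviations around the combined mean µi,j:

```latex
V_{i,j} = \sum_{x \in B_i \cup B_j} (x - \mu_{i,j})^2
        = \sum_{x \in B_i} \bigl((x - \mu_i) + \delta_i\bigr)^2
          + \sum_{x \in B_j} \bigl((x - \mu_j) + \delta_j\bigr)^2
        = V_i + V_j + n_i \delta_i^2 + n_j \delta_j^2
```

The cross terms vanish because Σx∈Bi (x − µi) = 0, and substituting δi = (nj/ni,j)·(µi − µj) and δj = −(ni/ni,j)·(µi − µj) gives ni·δi² + nj·δj² = (ni·nj/ni,j)·(µi − µj)², which is exactly the Vi,j of the combination rule.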
22. Main Solution Idea
- More careful estimation of the last bucket's contribution
- Decompose the variance into two parts:
- Internal variance: within a bucket
- External variance: between buckets
23. Estimation of the variance over the current active window
- Let B̄m refer to the non-expired portion of the bucket Bm (the set of active elements)
- The estimates for n̄m, µ̄m, V̄m are:
- n̄m(EST) = N + 1 − tm (this count is exact)
- µ̄m(EST) = µm
- V̄m(EST) = Vm / 2
- The statistics for B̄m and Bm* are sufficient for computing the variance at time t
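A sketch of the query-time estimate (my code, reusing Bucket and combine from the sketch above): plug the estimated statistics of B̄m and the exact statistics of Bm* into the combination rule.

```python
def estimate_window_variance(bm: Bucket, t_m: int, b_star: Bucket, N: int) -> float:
    """Estimate the variance of the active window of size N.

    bm     -- statistics of the oldest (partially expired) bucket Bm
    t_m    -- timestamp of Bm, i.e. its position in the current window
    b_star -- exact statistics of the suffix Bm* (all newer buckets combined)
    """
    bm_active = Bucket(n=N + 1 - t_m,   # exact count of active elements in Bm
                       mean=bm.mean,    # estimate: mean unchanged
                       var=bm.var / 2)  # estimate: half the variance
    return combine(bm_active, b_star).var
```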
24. Estimation of the variance over the current active window
- The estimate for B̄m can be found in O(1) time if we keep statistics for Bm
- The only error is due to the estimated statistics for B̄m
- Theorem: the relative error is at most ε, provided Vm ≤ (ε²/9)·Vm*
- Aim: maintain Vm ≤ (ε²/9)·Vm* using as few buckets as possible
25. Algorithm sketch
- For every new element:
- insert the new element into an existing bucket or a new bucket
- if Bm's timestamp > N, delete it
- if there are two adjacent buckets with small combined variance, combine them into one bucket
26. Algorithm 1 (insert xt)
- 1. If xt = µ1 then insert xt into B1 by incrementing n1 by 1. Otherwise, create a new bucket for xt. The new bucket becomes B1 with V1 = 0, µ1 = xt, n1 = 1; each old bucket Bi becomes Bi+1.
- 2. If Bm's timestamp > N, delete the bucket. Bucket Bm-1 becomes the new oldest bucket. Maintain the statistics of Bm-1* (instead of Bm*), which can be computed from the previously maintained statistics for Bm* and Bm-1 (the combination arithmetic also supports deleting a bucket's statistics).
27. Algorithm 1 (insert xt)
- 3. Let k = 9/ε², and let Vi,i-1 be the variance of the combination of buckets Bi and Bi-1. While there exists an index i > 2 such that k·Vi,i-1 ≤ Vi-1*, find the smallest such i and combine buckets Bi and Bi-1 according to the combination rule. The statistics for Bi* can be computed incrementally from the statistics for Bi-1* and Bi-1.
- 4. Output the estimated variance at time t according to the estimation procedure, i.e. the combined statistics of B̄m and Bm*.
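Steps 1-3 assembled into a compact sketch (my code, reusing Bucket and combine from above). For brevity it recomputes the suffix statistics on each sweep instead of maintaining them incrementally, so it does not achieve the paper's time bound:

```python
def insert(buckets, x, t, N, eps):
    """One step of Algorithm 1 (sketch). `buckets` is a list of
    (timestamp, Bucket) pairs, newest first; timestamps here are absolute
    arrival indices (the paper uses wraparound counters of log N bits)."""
    k = 9 / eps ** 2
    # Step 1: the new element opens a new bucket B1
    # (the shortcut for x == mean(B1) is omitted).
    buckets.insert(0, (t, Bucket(1, float(x), 0.0)))
    # Step 2: drop the oldest bucket once its newest element has expired.
    if t - buckets[-1][0] >= N:
        buckets.pop()
    # Step 3: while some index i > 2 has k * V(i,i-1) <= V*(i-1),
    # combine the smallest such pair.
    while True:
        star, acc = [], None      # star[j] = buckets[0..j] combined = B*_(j+2)
        for _, b in buckets:
            acc = b if acc is None else combine(acc, b)
            star.append(acc)
        for i in range(3, len(buckets) + 1):      # 1-based bucket index
            bi, bnewer = buckets[i - 1][1], buckets[i - 2][1]
            if k * combine(bi, bnewer).var <= star[i - 3].var:
                ts = buckets[i - 2][0]            # timestamp of the newer bucket
                buckets[i - 2:i] = [(ts, combine(bi, bnewer))]
                break
        else:
            return
```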
28. Invariants
- Invariant 1: for every bucket Bi, (9/ε²)·Vi ≤ Vi*
- This ensures that the relative error is at most ε
- Invariant 2: for every bucket Bi (i > 1), (9/ε²)·Vi,i-1 > Vi-1*
- This invariant ensures that the total number of buckets is small: O((1/ε²)·log(NR²))
- Each bucket requires constant space
29. Lemma 2
- The number of buckets maintained at any point in time by an algorithm that preserves Invariant 2 is O((1/ε²)·log(NR²))
- where R is an upper bound on the absolute value of the data elements
30. Proof sketch
- By the combination rule, the variance of the union of two buckets is no less than the sum of the individual variances
- For an algorithm that preserves Invariant 2, the variance of the suffix bucket Bi* therefore doubles after every O(1/ε²) buckets
- So the total number of buckets is no more than O((1/ε²)·log V), where V is the variance of the last N points. V is no more than N·R², giving O((1/ε²)·log(NR²))
31. Running time improvement
- The algorithm requires O((1/ε²)·log(NR²)) time per new element
- Most time is spent in step 3, where we sweep to combine buckets
- That time is proportional to the size of the histogram, O((1/ε²)·log(NR²))
- The trick: skip step 3 until we have seen Θ((1/ε²)·log(NR²)) new elements
- This ensures that the time of the algorithm is amortized O(1)
- It may violate Invariant 2 temporarily, but we restore it every Θ((1/ε²)·log(NR²)) data points, when we execute step 3
32. Variance algorithm summary
- O((1/ε²)·log(NR²)) time per new element (amortized O(1) with the improvement above)
- O((1/ε²)·log(NR²)) memory
- relative error of at most ε
33. Clustering on sliding windows
34. Clustering Data Streams
- Based on the k-median problem
- The data stream consists of points from a metric space
- Find k clusters in the stream such that the sum of distances from data points to their closest center is minimized
35. Clustering Data Streams
- Constant-factor approximation algorithms
- A simple two-step algorithm:
- Step 1: Divide the points into sets S1,…,SM of Nτ points each, and find O(k) centers for each Si
- -- Local clustering: assign each point in Si to its closest center
- Step 2: Let S' be the set of centers for S1,…,SM, with each center weighted by the number of points assigned to it. Cluster S' to find the k centers
- The solution cost is < 2 × the optimal solution cost
- τ < 0.5 is a parameter which trades off the space bound against the approximation factor of 2^O(1/τ)
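A sketch of the two-step scheme (my code). A simple farthest-first heuristic stands in for the paper's constant-factor k-median subroutine (it even ignores the weights), so the structure is faithful but the approximation guarantee is not:

```python
import math

def dist(p, q):
    return math.dist(p, q)  # Euclidean distance stands in for the metric l

def greedy_centers(points, k):
    """Farthest-first traversal: a crude stand-in for the paper's
    constant-factor k-median subroutine (note: it ignores weights)."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def two_step_kmedian(stream, chunk_size, k):
    """Step 1: reduce each chunk of chunk_size points to O(k) weighted centers.
    Step 2: cluster the weighted centers down to k final centers."""
    weighted = []  # (center, weight) pairs produced by step 1
    for s in range(0, len(stream), chunk_size):
        S = stream[s:s + chunk_size]
        centers = greedy_centers(S, k)
        counts = [0] * len(centers)
        for p in S:  # local clustering: each point goes to its closest center
            counts[min(range(len(centers)), key=lambda i: dist(p, centers[i]))] += 1
        weighted += list(zip(centers, counts))
    return greedy_centers([c for c, _ in weighted], k)
```

For example, `two_step_kmedian([(0, 0), (1, 1), (9, 9), (10, 10)], chunk_size=2, k=2)` reduces each pair to centers, then clusters the centers.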
36. One-pass algorithm: first phase
37. One-pass algorithm: second phase
38. Restate the algorithm
- Input: the data stream, processed in up to 1/τ levels
- Nτ data points arrive; these are the level-0 medians
- find O(k) medians among them, store each with its weight, and discard the Nτ points; these are the level-1 medians
- once Nτ level-1 medians (with their associated weights) accumulate, find O(k) medians among them; these are the level-2 medians
- and so on up the levels
39. The idea
- In general, whenever there are Nτ medians at level i, they are clustered to form level-(i+1) medians
[Diagram: data points at the leaves, level-i medians above them, level-(i+1) medians above those.]
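A sketch of this cascade (my code, reusing greedy_centers and dist from above): each level overflows into the next once it collects Nτ medians.

```python
def add_point(levels, p, cap, k):
    """levels[i] holds (median, weight) pairs at level i; cap = N**tau.
    A new point enters at level 0; any level that reaches cap medians is
    clustered into O(k) medians that are pushed one level up."""
    levels[0].append((p, 1))
    i = 0
    while len(levels[i]) >= cap:
        centers = greedy_centers([m for m, _ in levels[i]], k)
        counts = [0] * len(centers)
        for m, w in levels[i]:  # each median's weight flows to its closest center
            counts[min(range(len(centers)), key=lambda j: dist(m, centers[j]))] += w
        levels[i] = []
        if i + 1 == len(levels):
            levels.append([])
        levels[i + 1].extend(zip(centers, counts))
        i += 1
```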
40. Data structure: exponential histogram
- Each bucket consists of a collection of data points or intermediate medians
41. Point representation
- Each point x is represented by a triple (p(x), w(x), c(x)):
- p(x): identifier of x (its coordinates)
- w(x): weight of x, the number of points it represents
- c(x): cost of x, an estimate of the sum of the costs l(x, y) over all leaves y of the tree rooted at x
- w(x) = w(y1) + w(y2) + … + w(yi)
- c(x) = Σy (c(y) + w(y)·l(x, y)), over all y assigned to x
- if x is a level-0 median, w(x) = 1 and c(x) = 0
- Thus, c(x) is an overestimate of the true cost of x
42. Bucket cost function
- We maintain medians at intermediate levels
- Whenever there are Nτ medians at the same level, we cluster them into O(k) medians at the next higher level
- Each bucket can be split into 1/τ groups, where the group for level j contains the bucket's level-j medians
- Each group contains at most Nτ medians
43. Bucket cost function
- A bucket's cost function f is an estimate of the cost of clustering the points represented by the bucket
- Consider bucket Bi, and let c1,…,ck be the medians obtained by clustering the medians stored in the bucket
- Cost function for Bi: $f(B_i) = \sum_{x} \bigl( c(x) + w(x) \cdot l(x, C(x)) \bigr)$, summing over the medians x stored in Bi
- where C(x) ∈ {c1,…,ck} is the median closest to x
44. Combination
- Let Bi and Bj be two adjacent buckets that need to be combined to form Bi,j
- The groups of medians from the two buckets are merged level by level. Set the new level-0 group to be the union of the two level-0 groups
- If the merged group now holds at least Nτ medians, cluster them into O(k) medians and set the group to be empty
- The O(k) medians obtained by clustering join the union at the next level up, and so on. After at most 1/τ such unions we obtain Bi,j
- Now we compute the new bucket's cost
45. Answering a query
- Consider buckets B1,…,Bm-1; each contains at most (1/τ)·Nτ medians
- Cluster all of their medians together to produce k medians
- Cluster bucket Bm to get k additional medians
- Present the 2k medians as the answer
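A sketch of the query step (my code, with the same greedy stand-in as above; note it ignores the medians' weights, which a real k-median subroutine would use):

```python
def answer_query(buckets, k):
    """buckets: per-bucket lists of Median objects, newest bucket first.
    Cluster the medians of B1..Bm-1 into k centers, and Bm into k more."""
    recent = [x.p for bucket in buckets[:-1] for x in bucket]
    first = greedy_centers(recent, k) if recent else []
    return first + greedy_centers([x.p for x in buckets[-1]], k)  # 2k medians
```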
46. Algorithm: insert xt
- If the number of level-0 medians in B1 < k, add the point xt as a level-0 median in bucket B1; else create a new bucket B1 to contain xt and renumber the existing buckets accordingly
- If bucket Bm's timestamp > N, delete it; Bm-1 now becomes the last bucket
- Make a sweep over the buckets from most recent to least recent: while there exists an index i > 2 such that f(Bi,i-1) ≤ 2·f(Bi-1*), find the smallest such i and combine buckets Bi and Bi-1 using the combination procedure described above
47. Invariants
- Invariant 3: for every bucket Bi, f(Bi) ≤ 2·f(Bi*)
- Ensures a solution with 2k medians whose cost is within a multiplicative factor of 2^O(1/τ) of the cost of the optimal k-median solution
- Invariant 4: for every bucket Bi (i > 1), f(Bi,i-1) > 2·f(Bi-1*)
- Ensures that the number of buckets never exceeds O((1/τ)·log N)
- We assume that the cost is bounded by poly(N), hence the O((1/τ)·log N) bound in the article
48. Running time improvement
- After each element arrives we check whether Invariant 3 holds
- To reduce the running time, we can execute bucket combination only after some number of points have accumulated in bucket B1; only once it fills do we check the invariant
- We assume that the algorithm is not called after each new element. Instead, it maintains enough statistics to produce an answer when a query arrives
49. Producing exactly k clusters
- With each median, we estimate within a constant factor the number of active data points assigned to it
- We don't cluster B̄m and Bm* separately; we cluster the medians from all the buckets together. However, the weights of the medians from B̄m are adjusted so that they reflect only the active data points
50. Conclusions
- The goal of such algorithms is to maintain statistics or information over the last N entries of a stream that grows in real time
- The variance algorithm uses O((1/ε²)·log(NR²)) memory and maintains an estimate of the variance with relative error of at most ε, in amortized O(1) time per new element
- The k-median algorithm provides a 2^O(1/τ) approximation for τ < 0.5. It uses O((1/τ)·log N) memory and requires O(1) amortized time per new element
51. Questions?
- More questions/comments can be sent to anatrapo@post.tau.ac.il