Title: Maintaining Variance and k-Medians over Data Stream Windows
1. Maintaining Variance and k-Medians over Data Stream Windows
- Paper by Brian Babcock, Mayur Datar, Rajeev Motwani and Liadan O'Callaghan.
- Presentation by Anat Rapoport, December 2003.
2. Characteristics of the data stream
- Data elements arrive continually
- Only the most recent N elements are used when answering queries
- Single linear scan algorithm (we get only one look at each element)
- Store only a summary of the data seen thus far
3. Introduction
- Two important and related problems:
- Variance
- k-median clustering
4. Problem 1 (Variance)
- Given a stream of numbers, maintain at every instant the variance of the last N values: $V = \sum_{i=t-N+1}^{t} (x_i - \mu)^2$
- where $\mu = \frac{1}{N}\sum_{i=t-N+1}^{t} x_i$ denotes the mean of the last N values
5. Problem 1 (Variance)
- We cannot buffer the entire sliding window in memory
- So we cannot compute the variance exactly at every instant
- We will solve this problem approximately
- We use O((1/ε²)·log(NR²)) memory and provide an estimate with relative error of at most ε
- The time required per new element is amortized O(1)
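As a baseline (my sketch, not from the paper), the exact computation the authors rule out: it buffers the whole window, so it needs O(N) memory. V here is the sum of squared deviations, as in the slides.

```python
from collections import deque

def exact_sliding_variance(stream, N):
    """Exact variance of the last N values: correct, but O(N) memory."""
    window = deque(maxlen=N)  # buffers the entire sliding window
    for x in stream:
        window.append(x)
        mu = sum(window) / len(window)        # mean of the current window
        yield sum((v - mu) ** 2 for v in window)
```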
6. Extend to k-median
- Given a multiset X of objects in a metric space M with distance function l, the k-median problem is to pick k points c1,…,ck ∈ M so as to minimize $\sum_{x \in X} l(x, C(x))$, where C(x) is the closest of c1,…,ck to x
- If C(x) = ci then x is said to be assigned to ci, and l(x, ci) is called the assignment distance of x
- The objective function is the sum of the assignment distances
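The objective translates directly into code; a minimal sketch (names mine), with `dist` standing in for the metric l:

```python
def kmedian_cost(points, centers, dist):
    """The k-median objective: each point pays its assignment distance,
    i.e. the distance l(x, C(x)) to its closest center C(x)."""
    return sum(min(dist(x, c) for c in centers) for x in points)
```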
7. Problem 2 (SWKM)
- Given a stream of points from a metric space M with distance function l, a window size N, and a parameter k, maintain at every instant t a median set c1,…,ck ∈ M minimizing $\sum_{x \in X_t} l(x, C(x))$, where Xt is the multiset of the N most recent points at time t
8. Exponential Histogram
- From last week: "Maintaining simple statistics over sliding windows"
- The exponential histogram estimates a class of aggregate functions over sliding windows
- Their result applies to any function f satisfying the following properties for all multisets X, Y
9. Where EH goes wrong
- EH can estimate any function f defined over windows which satisfies:
- Positive
- Polynomially bounded
- Composable
- Weakly additive: f(X) + f(Y) ≤ f(X ∪ Y) ≤ Cf·(f(X) + f(Y)), where Cf ≥ 1 is a constant
- The Weakly Additive condition is not valid for variance or k-medians, as the example below shows
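A tiny numeric check (my example): two buckets can each have zero variance while their union has arbitrarily large variance, so no constant Cf satisfies the weak-additivity upper bound.

```python
def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

X, Y = [0.0, 0.0], [100.0, 100.0]
print(variance(X), variance(Y))  # 0.0 0.0 -- each bucket's variance is zero
print(variance(X + Y))           # 10000.0 -- exceeds Cf*(0+0) for any Cf
```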
10. Failure of Weak Additivity
[Figure: data values plotted over time; the variance of each bucket is small, yet the variance of the whole window is large.]
11. The idea
- Summarize intervals of the data stream using composable synopses
- For efficient memory use, adjacent intervals are combined when doing so doesn't increase the error significantly
- The synopsis of the last interval in the sliding window is inaccurate: some of its points have expired
- HOWEVER
- We will find a way to estimate this interval
12. Timestamp
- Corresponds to the position of an active data element in the current window
- We do not make explicit updates
- We use a wraparound counter of log N bits
- A timestamp can be extracted by comparison with the counter value of the current arrival
13. Model
- We store the data elements in the buckets of the histogram
- Every bucket stores the synopsis structure for a contiguous set of elements
- The partition is based on arrival time
- Each bucket also has a timestamp: that of the most recent data element in it
- When the timestamp reaches N+1 we drop the bucket
14. Model
- Buckets are numbered B1,…,Bm
- B1 is the most recent
- Bm is the oldest
- t1,…,tm denote the bucket timestamps
- All buckets but Bm contain only active data elements
15. Maintaining variance over sliding windows: the algorithm
16. Details
- We would like to estimate the variance with relative error of at most ε
- Maintain for each bucket Bi, besides its timestamp ti:
- the number of elements ni
- the mean µi
- the variance Vi
17. Details
- Define another set of buckets B1*,…,Bm* that represent suffixes of the data stream
- The bucket Bm* represents all the points that arrived after the oldest non-expired bucket Bm
- The statistics for these buckets are computed temporarily, not stored
18. Data structure: exponential histogram
[Figure: a window of size N over elements x1,…,xN, ordered from oldest to most recent, partitioned by arrival time into timestamped buckets B1 (most recent), B2, …, Bm-1, Bm (oldest).]
19. Combination rule
- In the algorithm we will need to combine adjacent buckets
- Consider two buckets Bi and Bj that get combined to form a new bucket Bi,j
- The statistics for Bi,j are:
- ni,j = ni + nj
- µi,j = (ni·µi + nj·µj) / ni,j
- Vi,j = Vi + Vj + (ni·nj / ni,j)·(µi − µj)²
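A minimal sketch of the rule in Python (names mine); a bucket is a (count, mean, variance) triple, with variance stored as the sum of squared deviations:

```python
from typing import NamedTuple

class Bucket(NamedTuple):
    n: int        # number of elements
    mean: float   # mean of the elements
    var: float    # V = sum of squared deviations from the mean

def combine(bi: Bucket, bj: Bucket) -> Bucket:
    """Combination rule: exact statistics for the union of two buckets."""
    n = bi.n + bj.n
    mean = (bi.n * bi.mean + bj.n * bj.mean) / n
    var = bi.var + bj.var + (bi.n * bj.n / n) * (bi.mean - bj.mean) ** 2
    return Bucket(n, mean, var)
```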
20. Lemma 1
- The bucket combination procedure correctly computes ni,j, µi,j, Vi,j for the new bucket
- Proof:
- Note that ni,j and µi,j are correctly computed from the definitions of count and average
- Define δi = µi − µi,j and δj = µj − µi,j
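The next slide's transcript is missing; the algebra it presumably carried (my reconstruction) expands each bucket's squared deviations around the combined mean µi,j:

```latex
V_{i,j} = \sum_{x \in B_i \cup B_j} (x - \mu_{i,j})^2
        = \sum_{x \in B_i} \bigl((x - \mu_i) + \delta_i\bigr)^2
          + \sum_{x \in B_j} \bigl((x - \mu_j) + \delta_j\bigr)^2
        = V_i + V_j + n_i \delta_i^2 + n_j \delta_j^2
```

The cross terms vanish because Σx∈Bi (x − µi) = 0, and substituting δi = (nj/ni,j)·(µi − µj) and δj = −(ni/ni,j)·(µi − µj) gives ni·δi² + nj·δj² = (ni·nj/ni,j)·(µi − µj)², which is exactly the Vi,j of the combination rule.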
22. Main Solution Idea
- More careful estimation of the last bucket's contribution
- Decompose the variance into two parts:
- Internal variance: within a bucket
- External variance: between buckets
23. Estimation of the variance over the current active window
- Let B̄m refer to the non-expired portion of the bucket Bm (the set of active elements)
- The estimates for n̄m, µ̄m, V̄m are:
- n̄m(EST) = N + 1 − tm (this count is exact)
- µ̄m(EST) = µm
- V̄m(EST) = Vm / 2
- The statistics for B̄m and Bm* are sufficient for computing the variance at time t
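A sketch of the query-time estimate (my code, reusing Bucket and combine from the sketch above): plug the estimated statistics of B̄m and the exact statistics of Bm* into the combination rule.

```python
def estimate_window_variance(bm: Bucket, t_m: int, b_star: Bucket, N: int) -> float:
    """Estimate the variance of the active window of size N.

    bm     -- statistics of the oldest (partially expired) bucket Bm
    t_m    -- timestamp of Bm, i.e. its position in the current window
    b_star -- exact statistics of the suffix Bm* (all newer buckets combined)
    """
    bm_active = Bucket(n=N + 1 - t_m,   # exact count of active elements in Bm
                       mean=bm.mean,    # estimate: mean unchanged
                       var=bm.var / 2)  # estimate: half the variance
    return combine(bm_active, b_star).var
```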
24. Estimation of the variance over the current active window
- The estimate for B̄m can be found in O(1) time if we keep statistics for Bm
- The only error is due to the estimated statistics for B̄m
- Theorem: the relative error is at most ε, provided Vm ≤ (ε²/9)·Vm*
- Aim: maintain Vm ≤ (ε²/9)·Vm* using as few buckets as possible
25. Algorithm sketch
- For every new element:
- insert the new element into an existing bucket or a new bucket
- if Bm's timestamp > N, delete it
- if there are two adjacent buckets with small combined variance, combine them into one bucket
26. Algorithm 1 (insert xt)
- 1. If xt = µ1 then insert xt into B1 by incrementing n1 by 1. Otherwise, create a new bucket for xt. The new bucket becomes B1 with V1 = 0, µ1 = xt, n1 = 1; each old bucket Bi becomes Bi+1.
- 2. If Bm's timestamp > N, delete the bucket. Bucket Bm-1 becomes the new oldest bucket. Maintain the statistics of Bm-1* (instead of Bm*), which can be computed from the previously maintained statistics for Bm* and Bm-1 (the combination arithmetic also supports deleting a bucket's statistics).
27. Algorithm 1 (insert xt)
- 3. Let k = 9/ε², and let Vi,i-1 be the variance of the combination of buckets Bi and Bi-1. While there exists an index i > 2 such that k·Vi,i-1 ≤ Vi-1*, find the smallest such i and combine buckets Bi and Bi-1 according to the combination rule. The statistics for Bi* can be computed incrementally from the statistics for Bi-1* and Bi-1.
- 4. Output the estimated variance at time t according to the estimation procedure, i.e. the combined statistics of B̄m and Bm*.
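Steps 1-3 assembled into a compact sketch (my code, reusing Bucket and combine from above). For brevity it recomputes the suffix statistics on each sweep instead of maintaining them incrementally, so it does not achieve the paper's time bound:

```python
def insert(buckets, x, t, N, eps):
    """One step of Algorithm 1 (sketch). `buckets` is a list of
    (timestamp, Bucket) pairs, newest first; timestamps here are absolute
    arrival indices (the paper uses wraparound counters of log N bits)."""
    k = 9 / eps ** 2
    # Step 1: the new element opens a new bucket B1
    # (the shortcut for x == mean(B1) is omitted).
    buckets.insert(0, (t, Bucket(1, float(x), 0.0)))
    # Step 2: drop the oldest bucket once its newest element has expired.
    if t - buckets[-1][0] >= N:
        buckets.pop()
    # Step 3: while some index i > 2 has k * V(i,i-1) <= V*(i-1),
    # combine the smallest such pair.
    while True:
        star, acc = [], None      # star[j] = buckets[0..j] combined = B*_(j+2)
        for _, b in buckets:
            acc = b if acc is None else combine(acc, b)
            star.append(acc)
        for i in range(3, len(buckets) + 1):      # 1-based bucket index
            bi, bnewer = buckets[i - 1][1], buckets[i - 2][1]
            if k * combine(bi, bnewer).var <= star[i - 3].var:
                ts = buckets[i - 2][0]            # timestamp of the newer bucket
                buckets[i - 2:i] = [(ts, combine(bi, bnewer))]
                break
        else:
            return
```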
28. Invariants
- Invariant 1: for every bucket Bi, (9/ε²)·Vi ≤ Vi*
- This ensures that the relative error is at most ε
- Invariant 2: for every bucket Bi (i > 1), (9/ε²)·Vi,i-1 > Vi-1*
- This invariant ensures that the total number of buckets is small: O((1/ε²)·log(NR²))
- Each bucket requires constant space
29. Lemma 2
- The number of buckets maintained at any point in time by an algorithm that preserves Invariant 2 is O((1/ε²)·log(NR²))
- where R is an upper bound on the absolute value of the data elements
30. Proof sketch
- By the combination rule, the variance of the union of two buckets is no less than the sum of the individual variances
- For an algorithm that preserves Invariant 2, the variance of the suffix bucket Bi* therefore doubles after every O(1/ε²) buckets
- So the total number of buckets is no more than O((1/ε²)·log V), where V is the variance of the last N points. V is no more than N·R², giving O((1/ε²)·log(NR²))
31. Running time improvement
- The algorithm requires O((1/ε²)·log(NR²)) time per new element
- Most time is spent in step 3, where we sweep to combine buckets
- That time is proportional to the size of the histogram, O((1/ε²)·log(NR²))
- The trick: skip step 3 until we have seen Θ((1/ε²)·log(NR²)) new elements
- This ensures that the time of the algorithm is amortized O(1)
- It may violate Invariant 2 temporarily, but we restore it every Θ((1/ε²)·log(NR²)) data points, when we execute step 3
32. Variance algorithm summary
- O((1/ε²)·log(NR²)) time per new element (amortized O(1) with the improvement above)
- O((1/ε²)·log(NR²)) memory
- relative error of at most ε
33. Clustering on sliding windows
34. Clustering Data Streams
- Based on the k-median problem
- The data stream consists of points from a metric space
- Find k clusters in the stream such that the sum of distances from data points to their closest center is minimized
35. Clustering Data Streams
- Constant-factor approximation algorithms
- A simple two-step algorithm:
- Step 1: Divide the points into sets S1,…,SM of Nτ points each, and find O(k) centers for each Si
- -- Local clustering: assign each point in Si to its closest center
- Step 2: Let S' be the set of centers for S1,…,SM, with each center weighted by the number of points assigned to it. Cluster S' to find the k centers
- The solution cost is < 2 × the optimal solution cost
- τ < 0.5 is a parameter which trades off the space bound against the approximation factor of 2^O(1/τ)
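A sketch of the two-step scheme (my code). A simple farthest-first heuristic stands in for the paper's constant-factor k-median subroutine (it even ignores the weights), so the structure is faithful but the approximation guarantee is not:

```python
import math

def dist(p, q):
    return math.dist(p, q)  # Euclidean distance stands in for the metric l

def greedy_centers(points, k):
    """Farthest-first traversal: a crude stand-in for the paper's
    constant-factor k-median subroutine (note: it ignores weights)."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def two_step_kmedian(stream, chunk_size, k):
    """Step 1: reduce each chunk of chunk_size points to O(k) weighted centers.
    Step 2: cluster the weighted centers down to k final centers."""
    weighted = []  # (center, weight) pairs produced by step 1
    for s in range(0, len(stream), chunk_size):
        S = stream[s:s + chunk_size]
        centers = greedy_centers(S, k)
        counts = [0] * len(centers)
        for p in S:  # local clustering: each point goes to its closest center
            counts[min(range(len(centers)), key=lambda i: dist(p, centers[i]))] += 1
        weighted += list(zip(centers, counts))
    return greedy_centers([c for c, _ in weighted], k)
```

For example, `two_step_kmedian([(0, 0), (1, 1), (9, 9), (10, 10)], chunk_size=2, k=2)` reduces each pair to centers, then clusters the centers.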
36. One-pass algorithm: first phase
37. One-pass algorithm: second phase
38. Restate the algorithm
- Input: the data stream, processed in up to 1/τ levels
- Nτ data points arrive; these are the level-0 medians
- find O(k) medians among them, store each with its weight, and discard the Nτ points; these are the level-1 medians
- once Nτ level-1 medians (with their associated weights) accumulate, find O(k) medians among them; these are the level-2 medians
- and so on up the levels
39. The idea
- In general, whenever there are Nτ medians at level i, they are clustered to form level-(i+1) medians
[Diagram: data points at the leaves, level-i medians above them, level-(i+1) medians above those.]
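A sketch of this cascade (my code, reusing greedy_centers and dist from above): each level overflows into the next once it collects Nτ medians.

```python
def add_point(levels, p, cap, k):
    """levels[i] holds (median, weight) pairs at level i; cap = N**tau.
    A new point enters at level 0; any level that reaches cap medians is
    clustered into O(k) medians that are pushed one level up."""
    levels[0].append((p, 1))
    i = 0
    while len(levels[i]) >= cap:
        centers = greedy_centers([m for m, _ in levels[i]], k)
        counts = [0] * len(centers)
        for m, w in levels[i]:  # each median's weight flows to its closest center
            counts[min(range(len(centers)), key=lambda j: dist(m, centers[j]))] += w
        levels[i] = []
        if i + 1 == len(levels):
            levels.append([])
        levels[i + 1].extend(zip(centers, counts))
        i += 1
```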
40. Data structure: exponential histogram
- Each bucket consists of a collection of data points or intermediate medians
41. Point representation
- Each point x is represented by a triple (p(x), w(x), c(x)):
- p(x): identifier of x (its coordinates)
- w(x): weight of x, the number of points it represents
- c(x): cost of x, an estimate of the sum of the costs l(x, y) over all leaves y of the tree rooted at x
- w(x) = w(y1) + w(y2) + … + w(yi)
- c(x) = Σy (c(y) + w(y)·l(x, y)), over all y assigned to x
- if x is a level-0 median, w(x) = 1 and c(x) = 0
- Thus, c(x) is an overestimate of the true cost of x
42. Bucket cost function
- We maintain medians at intermediate levels
- Whenever there are Nτ medians at the same level, we cluster them into O(k) medians at the next higher level
- Each bucket can be split into 1/τ groups, where the group for level j contains the bucket's level-j medians
- Each group contains at most Nτ medians
43. Bucket cost function
- A bucket's cost function f is an estimate of the cost of clustering the points represented by the bucket
- Consider bucket Bi, and let c1,…,ck be the medians obtained by clustering the medians stored in the bucket
- Cost function for Bi: $f(B_i) = \sum_{x} \bigl( c(x) + w(x) \cdot l(x, C(x)) \bigr)$, summing over the medians x stored in Bi
- where C(x) ∈ {c1,…,ck} is the median closest to x
44. Combination
- Let Bi and Bj be two adjacent buckets that need to be combined to form Bi,j
- The groups of medians from the two buckets are merged level by level. Set the new level-0 group to be the union of the two level-0 groups
- If the merged group now holds at least Nτ medians, cluster them into O(k) medians and set the group to be empty
- The O(k) medians obtained by clustering join the union at the next level up, and so on. After at most 1/τ such unions we obtain Bi,j
- Now we compute the new bucket's cost
45. Answering a query
- Consider buckets B1,…,Bm-1; each contains at most (1/τ)·Nτ medians
- Cluster all of their medians together to produce k medians
- Cluster bucket Bm to get k additional medians
- Present the 2k medians as the answer
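A sketch of the query step (my code, with the same greedy stand-in as above; note it ignores the medians' weights, which a real k-median subroutine would use):

```python
def answer_query(buckets, k):
    """buckets: per-bucket lists of Median objects, newest bucket first.
    Cluster the medians of B1..Bm-1 into k centers, and Bm into k more."""
    recent = [x.p for bucket in buckets[:-1] for x in bucket]
    first = greedy_centers(recent, k) if recent else []
    return first + greedy_centers([x.p for x in buckets[-1]], k)  # 2k medians
```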
46. Algorithm: insert xt
- If the number of level-0 medians in B1 < k, add the point xt as a level-0 median in bucket B1; else create a new bucket B1 to contain xt and renumber the existing buckets accordingly
- If bucket Bm's timestamp > N, delete it; Bm-1 now becomes the last bucket
- Make a sweep over the buckets from most recent to least recent: while there exists an index i > 2 such that f(Bi,i-1) ≤ 2·f(Bi-1*), find the smallest such i and combine buckets Bi and Bi-1 using the combination procedure described above
47. Invariants
- Invariant 3: for every bucket Bi, f(Bi) ≤ 2·f(Bi*)
- Ensures a solution with 2k medians whose cost is within a multiplicative factor of 2^O(1/τ) of the cost of the optimal k-median solution
- Invariant 4: for every bucket Bi (i > 1), f(Bi,i-1) > 2·f(Bi-1*)
- Ensures that the number of buckets never exceeds O((1/τ)·log N)
- We assume that the cost is bounded by poly(N), hence the O((1/τ)·log N) bound in the article
48. Running time improvement
- After each element arrives we check whether Invariant 3 holds
- To reduce the running time, we can execute bucket combination only after some number of points have accumulated in bucket B1; only once it fills do we check the invariant
- We assume that the algorithm is not called after each new element. Instead, it maintains enough statistics to produce an answer when a query arrives
49. Producing exactly k clusters
- With each median, we estimate within a constant factor the number of active data points assigned to it
- We don't cluster B̄m and Bm* separately; we cluster the medians from all the buckets together. However, the weights of the medians from B̄m are adjusted so that they reflect only the active data points
50. Conclusions
- The goal of such algorithms is to maintain statistics or information over the last N entries of a stream that grows in real time
- The variance algorithm uses O((1/ε²)·log(NR²)) memory and maintains an estimate of the variance with relative error of at most ε, in amortized O(1) time per new element
- The k-median algorithm provides a 2^O(1/τ) approximation for τ < 0.5. It uses O((1/τ)·log N) memory and requires O(1) amortized time per new element
51. Questions?
- More questions/comments can be sent to anatrapo@post.tau.ac.il