Maintaining Variance and k-Medians over Data Stream Windows

1
Maintaining Variance and k-Medians over Data
Stream Windows
  • Paper by Brian Babcock, Mayur Datar, Rajeev
    Motwani and Liadan O'Callaghan.

Presentation by Anat Rapoport, December 2003.
2
Characteristics of the data stream
  • Data elements arrive continually
  • Only the most recent N elements are used when
    answering queries
  • Single linear-scan algorithm (each element can be
    looked at only once)
  • Store only a summary of the data seen thus far.

3
Introduction
  • Two important and related problems
  • Variance
  • K-median clustering

4
Problem 1 (Variance)
  • Given a stream of numbers, maintain at every
    instant the variance of the last N values,
    $\mathrm{VAR} = \sum_{i=1}^{N} (x_i - \mu)^2$
  • where $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ denotes the mean of the
    last N values
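
For reference, the exact quantity being approximated can be computed by buffering the whole window, which is what the streaming algorithm avoids. This is a minimal baseline sketch, assuming the unnormalized definition (sum of squared deviations from the window mean); the function name is illustrative only.

```python
from collections import deque

def exact_window_variance(stream, window_size):
    """Yield the sum of squared deviations over the last `window_size` values."""
    window = deque(maxlen=window_size)  # buffers the whole window: O(N) memory
    for x in stream:
        window.append(x)
        mean = sum(window) / len(window)
        yield sum((v - mean) ** 2 for v in window)

variances = list(exact_window_variance([1.0, 2.0, 3.0, 4.0], window_size=3))
```

The last value covers the window [2, 3, 4] only, because the element 1 has expired.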

5
Problem 1 (Variance)
  • We cannot buffer the entire sliding window in
    memory
  • So we cannot compute the variance exactly at
    every instant
  • We will solve this problem approximately.
  • We use $O(\frac{1}{\epsilon^2}\log NR^2)$ memory (R bounds the absolute
    value of the data elements) and provide an
    estimate with relative error of at most $\epsilon$
  • The time required per new element is amortized
    O(1)

6
Extend to k-median
  • Given a multiset X of objects in a metric space M
    with distance function l, the k-median problem is
    to pick k points $c_1,\dots,c_k \in M$ so as to minimize
    $\sum_{x \in X} l(x, C(x))$
  • where C(x) is the closest of $c_1,\dots,c_k$ to x.
  • If $C(x) = c_i$ then x is said to be assigned to $c_i$,
    and $l(x, c_i)$ is called the assignment distance of
    x
  • The objective function is the sum of the
    assignment distances.
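
The objective can be written down directly. The sketch below assumes points on the real line with absolute difference as the distance function; `kmedian_cost` and its parameters are illustrative names, not part of the paper.

```python
def kmedian_cost(points, centers, dist):
    """Sum of assignment distances: each point pays the distance to its closest center."""
    return sum(min(dist(x, c) for c in centers) for x in points)

points = [0.0, 1.0, 2.0, 10.0, 11.0]
cost = kmedian_cost(points, centers=[1.0, 10.0], dist=lambda a, b: abs(a - b))
```

Here 0, 1 and 2 are assigned to the center 1, while 10 and 11 are assigned to the center 10.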

7
Problem 2 (SWKM)
  • Given a stream of points from a metric space M
    with distance function l, window size N, and
    parameter k, maintain at every instant t a median
    set $c_1,\dots,c_k \in M$ minimizing
    $\sum_{x \in X_t} l(x, C(x))$
  • where $X_t$ is the multiset of the N most recent
    points at time t

8
Exponential Histogram
  • From last week: "Maintaining simple statistics
    over sliding windows"
  • The exponential histogram (EH) estimates a class of
    aggregate functions over sliding windows
  • Their result applies to any function f satisfying
    a set of properties for all multisets X, Y

9
Where EH goes wrong
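  • EH can estimate any function f defined over
    windows which satisfies
  • Positive: $f(X) \ge 0$
  • Polynomially bounded: $f(X) \le \mathrm{poly}(|X|)$
  • Composable: $f(X \cup Y)$ can be computed from the synopses of X and Y
  • Weakly additive: $f(X \cup Y) \le C_f\,(f(X) + f(Y))$,
    where $C_f \ge 1$ is a constant

The weakly additive condition is not valid for
variance or k-medians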
10
Failure of Weak Additivity
[Figure: value plotted against time; the variance of each bucket is small.]
11
The idea
  • Summarize intervals of the data stream using
    composable synopses
  • For efficient memory use, adjacent intervals are
    combined when doing so does not increase the error
    significantly
  • The synopsis of the last interval in the sliding
    window is inaccurate, because some of its points have expired
  • However, we will find a way to estimate this interval

12
Timestamp
  • Corresponds to the position of an active data
    element in the current window
  • We do not make explicit updates
  • We use a wraparound counter of log N bits
  • The timestamp can be extracted by comparison with the
    counter value of the current arrival
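
One way to picture the wraparound counter is the following sketch. The modulus 2N is an assumption made here so that an expired position (N+1) is still distinguishable from an active one; the slides only say that about log N bits suffice. The function name is illustrative.

```python
def window_position(stored, current, modulus):
    """Position in the window of an element stamped `stored`, where 1 is the current arrival."""
    return (current - stored) % modulus + 1

N = 4              # window size
MODULUS = 2 * N    # assumed counter range, so positions up to 2N are distinguishable

# The element was stamped 6; the counter has since wrapped around to 1.
position = window_position(stored=6, current=1, modulus=MODULUS)
expired = position > N
```

No stored stamp is ever rewritten: positions are derived on demand from the current counter value.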

13
Model
  • We store the data elements in the buckets of the
    histogram
  • Every bucket stores the synopsis structure for a
    contiguous set of elements
  • The partition is based on arrival time
  • The bucket also has a timestamp, of the most
    recent data element in it
  • When the timestamp reaches N+1 we drop the bucket

14
Model
  • Buckets are numbered $B_1,\dots,B_m$
  • $B_1$ is the most recent
  • $B_m$ is the oldest
  • $t_1,\dots,t_m$ denote the bucket timestamps
  • All buckets but $B_m$ have only active data elements

15
Maintaining variance over sliding windows: the algorithm
16
Details
  • We would like to estimate the variance with
    relative error of at most $\epsilon$
  • Maintain for each bucket $B_i$, besides its
    timestamp $t_i$, also
  • number of elements $n_i$
  • mean $\mu_i$
  • variance $V_i$

17
Details
  • Define another set of buckets $B_{1*},\dots,B_{j*}$ that
    represent the suffixes of the data stream.
  • The bucket $B_{m*}$ represents all the points that
    arrived after the oldest non-expired bucket $B_m$
  • Apart from $B_{m*}$, the statistics for these buckets are
    computed temporarily rather than stored

18
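Data structure: exponential histogram
[Figure: the window of size N over elements $X_1, X_2, \dots, X_N$, ordered by timestamp between oldest and most recent, and partitioned into buckets $B_m$ (oldest), $B_{m-1}$, ..., $B_2$, $B_1$ (most recent), together with the suffix bucket $B_{m*}$.]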
19
Combination rule
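  • In the algorithm we will need to combine adjacent
    buckets
  • Consider two buckets $B_i$ and $B_j$ that get combined
    to form a new bucket $B_{i,j}$
  • The statistics for $B_{i,j}$ are
    $n_{i,j} = n_i + n_j$
    $\mu_{i,j} = \frac{\mu_i n_i + \mu_j n_j}{n_{i,j}}$
    $V_{i,j} = V_i + V_j + \frac{n_i n_j}{n_{i,j}} (\mu_i - \mu_j)^2$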
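
A small sketch of how such a combination can be checked numerically. It assumes the standard rule for merging counts, means and unnormalized variances (counts add, means are count-weighted, and the variance gains a term for the gap between the two means); the `Stats` class and helper names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Stats:
    n: int        # number of elements
    mean: float   # mean of the elements
    var: float    # sum of squared deviations from the mean

def summarize(values):
    """Compute the summary of a list directly from its elements."""
    mean = sum(values) / len(values)
    return Stats(len(values), mean, sum((v - mean) ** 2 for v in values))

def combine(a, b):
    """Merge two summaries without looking at the underlying elements."""
    n = a.n + b.n
    mean = (a.n * a.mean + b.n * b.mean) / n
    var = a.var + b.var + (a.n * b.n / n) * (a.mean - b.mean) ** 2
    return Stats(n, mean, var)

left, right = [1.0, 2.0, 3.0], [10.0, 20.0]
merged = combine(summarize(left), summarize(right))
direct = summarize(left + right)
```

`merged` and `direct` agree up to floating-point rounding, which is what makes the summaries composable.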

20
Lemma 1
  • The bucket combination procedure correctly
    computes $n_{i,j}$, $\mu_{i,j}$, $V_{i,j}$ for the new bucket

Proof
  • Note that $n_{i,j}$ and $\mu_{i,j}$ are correctly computed from
    the definitions of count and average
  • Define $\delta_i = \mu_i - \mu_{i,j}$ and $\delta_j = \mu_j - \mu_{i,j}$
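
The rest of the proof is not in this transcript. One standard way to finish it, taking the variance of a bucket to be the sum of squared deviations from its mean and writing $\delta_i = \mu_i - \mu_{i,j}$, $\delta_j = \mu_j - \mu_{i,j}$, is sketched below; it is not necessarily the presenter's own derivation.

```latex
\begin{align*}
V_{i,j} &= \sum_{x \in B_i} (x - \mu_{i,j})^2 + \sum_{x \in B_j} (x - \mu_{i,j})^2 \\
        &= \sum_{x \in B_i} (x - \mu_i + \delta_i)^2 + \sum_{x \in B_j} (x - \mu_j + \delta_j)^2 \\
        &= V_i + n_i \delta_i^2 + V_j + n_j \delta_j^2
           && \text{since } \textstyle\sum_{x \in B_i} (x - \mu_i) = 0 \\
        &= V_i + V_j + \frac{n_i n_j}{n_{i,j}} (\mu_i - \mu_j)^2
           && \text{since } \delta_i = \frac{n_j (\mu_i - \mu_j)}{n_{i,j}},\;
                            \delta_j = \frac{n_i (\mu_j - \mu_i)}{n_{i,j}}
\end{align*}
```

The last step uses $n_i \delta_i^2 + n_j \delta_j^2 = \frac{n_i n_j (n_i + n_j)}{n_{i,j}^2} (\mu_i - \mu_j)^2$ together with $n_{i,j} = n_i + n_j$.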

22
Main Solution Idea
  • More careful estimation of the last bucket's
    contribution
  • Decompose variance into two parts
  • Internal variance within bucket
  • External variance between buckets

23
Estimation of the variance over the current
active window
  • Let $B_{m'}$ refer to the non-expired portion of the
    bucket $B_m$ (the set of active elements)
  • The estimates for $n_{m'}$, $\mu_{m'}$, $V_{m'}$ are
  • $n_{m'}^{EST} = N + 1 - t_m$ (exact)
  • $\mu_{m'}^{EST} = \mu_m$
  • $V_{m'}^{EST} = V_m / 2$
  • The statistics for $B_{m',m*}$ are sufficient for
    computing the variance at time t.
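
The estimation step can be sketched as follows, assuming each bucket is summarized as a tuple (n, mean, V) with V the sum of squared deviations, that the suffix bucket covering everything newer than the oldest bucket is summarized the same way, and that the window is full. The estimate keeps the oldest bucket's mean, halves its variance, and sets its count to N + 1 minus its timestamp. All names are illustrative.

```python
def combine(a, b):
    """Merge two (n, mean, V) summaries; V is the sum of squared deviations."""
    n = a[0] + b[0]
    mean = (a[0] * a[1] + b[0] * b[1]) / n
    var = a[2] + b[2] + (a[0] * b[0] / n) * (a[1] - b[1]) ** 2
    return (n, mean, var)

def estimate_window_variance(oldest, t_m, suffix, window_size):
    """Estimate the window variance from the oldest bucket and the suffix bucket."""
    n_m, mean_m, var_m = oldest
    # Active part of the oldest bucket: exact count, estimated mean and variance.
    active = (window_size + 1 - t_m, mean_m, var_m / 2.0)
    return combine(active, suffix)[2]

estimate = estimate_window_variance(
    oldest=(4, 2.0, 8.0),   # summary of the oldest bucket, one element already expired
    t_m=3,                  # window position of its most recent element
    suffix=(2, 5.0, 2.0),   # summary of everything newer than the oldest bucket
    window_size=5,
)
```

With these numbers the active part is treated as 3 elements with mean 2 and variance 4, giving an estimate of about 16.8.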

24
Estimation of the variance over the current
active window
  • The estimate can be found in O(1) time if
    we keep statistics for $B_{m*}$
  • The error is due to the error in the estimated
    statistics for $B_{m'}$
  • Theorem: the relative error is at most $\epsilon$, provided
    $V_m \le \frac{\epsilon^2}{9} V_{m*}$
  • Aim: maintain $V_m \le \frac{\epsilon^2}{9} V_{m*}$ using as few
    buckets as possible

25
Algorithm sketch
  • for every new element
  • insert the new element
  • into an existing bucket or into a new bucket
  • if $B_m$'s timestamp > N, delete it
  • if there are two adjacent buckets with small
    combined variance, combine them into one bucket

26
Algorithm 1 (insert $x_t$)
  • 1. If $x_t = \mu_1$ then insert $x_t$ into $B_1$ by
    incrementing $n_1$ by 1. Otherwise, create a new
    bucket for $x_t$. The new bucket becomes $B_1$ with
    $V_1 = 0$, $\mu_1 = x_t$, $n_1 = 1$. Each old bucket $B_i$ becomes
    $B_{i+1}$.
  • 2. If $B_m$'s timestamp > N, delete the bucket.
    Bucket $B_{m-1}$ becomes the new oldest bucket.
    Maintain the statistics of $B_{(m-1)*}$ (instead of
    $B_{m*}$), which can be computed using the previously
    maintained statistics for $B_{m*}$ and $B_{m-1}$
    (the combination rule also works in reverse, for
    deleting a bucket from a union).

27
Algorithm 1 (insert $x_t$)
  • 3. Let $k = 9/\epsilon^2$ and let $V_{i,i-1}$ be the variance of the
    combination of buckets $B_i$ and $B_{i-1}$. While there
    exists an index i > 2 such that $k V_{i,i-1} \le V_{(i-1)*}$, find
    the smallest such i and combine the buckets according
    to the combination rule. The statistics for $B_{i*}$
    can be computed incrementally from the statistics
    for $B_{i-1}$ and $B_{(i-1)*}$.
  • 4. Output the estimated variance at time t
    according to the estimation procedure: $V_{m',m*}$
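
The four steps can be put together in a compact sketch. It makes several simplifying assumptions that are not in the slides: a plain arrival counter replaces the wraparound counter, suffix statistics are recomputed during each sweep instead of being maintained incrementally, and the active count of the oldest bucket is capped by its own size so that the estimate also works before the window has filled. The class and method names are illustrative.

```python
EMPTY = (0, 0.0, 0.0)  # summary (n, mean, V) of an empty set

def combine(a, b):
    """Merge two (n, mean, V) summaries; V is the sum of squared deviations."""
    n = a[0] + b[0]
    if n == 0:
        return EMPTY
    mean = (a[0] * a[1] + b[0] * b[1]) / n
    var = a[2] + b[2] + (a[0] * b[0] / n) * (a[1] - b[1]) ** 2
    return (n, mean, var)

class SlidingVariance:
    def __init__(self, window, eps):
        self.N = window
        self.k = 9.0 / eps ** 2
        self.t = 0          # arrivals so far (plain counter, an assumption of this sketch)
        self.buckets = []   # buckets[0] is B1; entries are [summary, arrival of newest element]

    def insert(self, x):
        x = float(x)
        self.t += 1
        b = self.buckets
        # Step 1: extend B1 if x equals its mean, otherwise open a new bucket.
        if b and b[0][0][1] == x:
            n, mean, var = b[0][0]
            b[0] = [(n + 1, mean, var), self.t]
        else:
            b.insert(0, [(1, x, 0.0), self.t])
        # Step 2: drop the oldest bucket once its newest element has left the window.
        while self.t - b[-1][1] + 1 > self.N:
            b.pop()
        # Step 3: merge B_i and B_(i-1), i > 2, while k * V_(i,i-1) <= V of the suffix.
        merged = True
        while merged:
            merged = False
            suffix = b[0][0]                      # everything newer than b[1]
            for j in range(2, len(b)):            # b[j] is older than b[j - 1]
                pair = combine(b[j][0], b[j - 1][0])
                if self.k * pair[2] <= suffix[2]:
                    b[j - 1:j + 1] = [[pair, b[j - 1][1]]]
                    merged = True
                    break
                suffix = combine(suffix, b[j - 1][0])
        # Step 4: report the estimate.
        return self.estimate()

    def estimate(self):
        (n_m, mean_m, var_m), arrival = self.buckets[-1]
        t_m = self.t - arrival + 1                # window position of its newest element
        n_active = min(n_m, self.N + 1 - t_m)     # cap added for a not-yet-full window
        suffix = EMPTY
        for summary, _ in self.buckets[:-1]:
            suffix = combine(suffix, summary)
        return combine((n_active, mean_m, var_m / 2.0), suffix)[2]

sv = SlidingVariance(window=10, eps=0.5)
estimates = [sv.insert(x) for x in [1, 2, 3, 4]]
```

On this tiny input no buckets are merged, so the estimates coincide with the exact window variances; merging and the resulting approximation only show up on longer streams.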

28
  • Invariant 1: For every bucket $B_i$, $\frac{9}{\epsilon^2} V_i \le V_{i*}$
  • Ensures that the relative error is at most $\epsilon$
  • Invariant 2: For every bucket $B_i$ with i > 1,
    $\frac{9}{\epsilon^2} V_{i,i-1} > V_{(i-1)*}$
  • This invariant ensures that the total number of
    buckets is small: $O(\frac{1}{\epsilon^2}\log NR^2)$
  • Each bucket requires constant space

29
Lemma 2
  • The number of buckets maintained at any point in
    time by an algorithm that preserves Invariant 2
    is
  • $O(\frac{1}{\epsilon^2}\log NR^2)$
  • where R is an upper bound on the absolute value
    of the data elements.

30
Proof sketch
  • From the combination rule, the variance of the
    union of two buckets is no less than the sum of
    the individual variances.
  • For an algorithm that preserves Invariant 2, the
    variance of the suffix bucket $B_{i*}$ doubles after
    every $O(1/\epsilon^2)$ buckets.
  • The total number of buckets is no more than
    $O(\frac{1}{\epsilon^2}\log V)$, where V is the variance of the last N
    points. V is no more than $NR^2$, which gives $O(\frac{1}{\epsilon^2}\log NR^2)$

31
Running time improvement
  • The algorithm requires $O(\frac{1}{\epsilon^2}\log NR^2)$ time per
    new element.
  • Most of the time is spent in step 3, where we make the
    sweep to combine buckets.
  • The time is proportional to the size of the
    histogram, $O(\frac{1}{\epsilon^2}\log NR^2)$.
  • The trick: skip step 3 until we have seen
    $\Theta(\frac{1}{\epsilon^2}\log NR^2)$ data points.
  • This ensures that the running time of the algorithm is
    amortized O(1).
  • This may violate Invariant 2 temporarily, but we
    restore it every $\Theta(\frac{1}{\epsilon^2}\log NR^2)$ data points, when
    we execute step 3.

32
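Variance algorithm summary
  • $O(\frac{1}{\epsilon^2}\log NR^2)$ time per new element in the worst case,
    amortized O(1) when step 3 is batched
  • $O(\frac{1}{\epsilon^2}\log NR^2)$ memory
  • relative error of at most $\epsilon$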

33
Clustering on sliding windows
34
Clustering Data Streams
  • Based on k-median problem
  • The data stream consists of points from a metric space.
  • Find k clusters in the stream such that the sum
    of distances from data points to their closest
    center is minimized

35
Clustering Data Streams
  • Constant factor approximation algorithms
  • A simple two-step algorithm
  • Step 1: For each set of $M = n^{\tau}$ points, $S_i$,
    find O(k) centers in $S_1, \dots, S_M$
  • Local clustering: assign each
    point in $S_i$ to its closest center
  • Step 2: Let S' be the centers for $S_1, \dots,
    S_M$, with each center weighted by the
    number of points assigned to it.
  • Cluster S' to find k
    centers
  • The solution cost is < 2 × the optimal solution cost
  • $\tau < 0.5$ is a parameter which trades off the space bound
    against the approximation factor of $2^{O(1/\tau)}$
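
The two steps can be illustrated on a toy input. The sketch below makes several assumptions that are not in the slides: points lie on the real line, the clustering subroutine is an exhaustive search that only considers input points as centers (feasible only for tiny inputs), and exactly k rather than O(k) centers are kept per chunk. Function names are illustrative.

```python
from itertools import combinations

def weighted_kmedian(points, weights, k):
    """Exhaustive weighted k-median over centers drawn from the input points."""
    candidates = sorted(set(points))
    best_cost, best_centers = None, None
    for centers in combinations(candidates, min(k, len(candidates))):
        cost = sum(w * min(abs(p - c) for c in centers)
                   for p, w in zip(points, weights))
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, list(centers)
    return best_centers

def two_step_kmedian(points, chunk_size, k):
    centers, weights = [], []
    # Step 1: cluster each chunk locally and count the points assigned to each center.
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        local = weighted_kmedian(chunk, [1] * len(chunk), k)
        counts = {c: 0 for c in local}
        for p in chunk:
            counts[min(local, key=lambda c: abs(p - c))] += 1
        centers.extend(local)
        weights.extend(counts[c] for c in local)
    # Step 2: cluster the weighted centers to obtain the final k centers.
    return weighted_kmedian(centers, weights, k)

stream = [0, 1, 2, 100, 101, 102, 1, 0, 2, 101, 100, 102]
result = two_step_kmedian(stream, chunk_size=6, k=2)
```

Only the weighted local centers survive step 1, so the second clustering runs on far fewer points than the original input.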

36
One-pass algorithm first phase
37
One-pass algorithm second phase
38
Restate the algorithm
  • Input: data stream
  • Level-0 medians: $N^{\tau}$ points
  • Find O(k) medians, store each with its weight, and discard the $N^{\tau}$
    points
  • Level-1 medians: $N^{\tau}$ medians with associated weights
  • Find O(k) medians
  • Level-2 medians
  • Repeat $1/\tau$ times
39
The idea
  • In general, whenever there are $N^{\tau}$ medians at
    level i they are clustered to form level-(i+1)
    medians.

[Figure: data points at the bottom, level-i medians above them, level-(i+1) medians on top.]
40
Data structure: exponential histogram
  • Each bucket consists of a collection of data
    points or intermediate medians.

41
Point representation
  • Each point is represented by a triple
  • (p(x),w(x), c(x)).
  • p(x) - identifier of x (coordinate)
  • w(x) - weight of x, the number of points it
    represents
  • c(x) - cost of x. An estimate of the sum of costs
    l(x,y) over all the leaves y of the subtree
    rooted at x.
  • $w(x) = \sum_i w(y_i) = w(y_1) + w(y_2) + \dots + w(y_i)$
  • $c(x) = \sum_y \left( c(y) + w(y) \cdot l(x,y) \right)$, for all y assigned
    to x
  • If x is a level-0 median, w(x) = 1 and
    c(x) = 0
  • Thus, c(x) is an overestimate of the true cost
    of x
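
The weight and cost bookkeeping can be sketched as follows, assuming points on the real line and the recursive rule in which a new median's weight is the sum of the weights assigned to it and its cost adds, for each assigned median, that median's own cost plus its weight times its distance to the new median. The `Median` class and the `promote` helper are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Median:
    p: float        # identifier of the point (its coordinate here)
    w: int = 1      # number of original points it represents
    c: float = 0.0  # estimated cost of the subtree it represents

def promote(center, assigned, dist):
    """Build the next-level median at `center` from the medians assigned to it."""
    w = sum(y.w for y in assigned)
    c = sum(y.c + y.w * dist(center, y.p) for y in assigned)
    return Median(center, w, c)

def dist(a, b):
    return abs(a - b)

# Level-0 medians are the data points themselves: weight 1, cost 0.
level0 = [Median(0.0), Median(2.0), Median(4.0)]
level1 = promote(2.0, level0, dist)

# A second level-1 median, then one more level of clustering.
other = Median(10.0, w=2, c=1.0)
level2 = promote(2.0, [level1, other], dist)
```

The cost charged for `other` is its own cost plus its weight times its distance to the new median, which by the triangle inequality can only overestimate the true distances of the points it stands for.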

42
Bucket cost function
  • We maintain medians at intermediate levels
  • Whenever there are $N^{\tau}$ medians at the same level
    we cluster them into O(k) medians at the next
    higher level
  • Each bucket $B_i$ can be split into $1/\tau$ groups
    $R_i^0, \dots, R_i^{1/\tau - 1}$,
  • where each $R_i^j$ contains medians at level
    j.
  • Each group contains at most $N^{\tau}$ medians

43
Bucket cost function
  • The bucket cost function is an estimate of the cost
    of clustering the points represented by the
    bucket.
  • Consider bucket $B_i$. Let $S_i$ be
    the set of medians in the bucket
  • Cost function for $B_i$:
    $f(B_i) = \sum_{x \in S_i} \left( c(x) + w(x) \cdot l(x, C(x)) \right)$
  • where $C(x) \in \{c_1, \dots, c_k\}$ is the
    median closest to x

44
Combination
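  • Let $B_i$ and $B_j$ be two adjacent buckets that need
    to be combined to form $B_{i,j}$
  • Let $R_i^0, \dots, R_i^{1/\tau - 1}$ and $R_j^0, \dots, R_j^{1/\tau - 1}$
    be the groups of medians from the two buckets.
    Set $R_{i,j}^0 = R_i^0 \cup R_j^0$
  • If $|R_{i,j}^0| > N^{\tau}$ then cluster the points from
    $R_{i,j}^0$ and set it to be empty.
  • $C_0$: the set of O(k) medians obtained by clustering
    $R_{i,j}^0$ (empty if no clustering was needed)
  • Set $R_{i,j}^1 = R_i^1 \cup R_j^1 \cup C_0$, and so on. After
    at most $1/\tau$ unions we get $B_{i,j}$
  • Now we compute the new bucket's cost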

45
Answer a query
  • Consider buckets $B_1, \dots, B_{m-1}$
  • Each contains at most $\frac{1}{\tau} N^{\tau}$ medians, and combined
    they still contain at most $\frac{1}{\tau} N^{\tau}$ medians
  • Cluster them to produce k medians
  • Cluster bucket $B_m$ to get k additional medians
  • Present the 2k medians as the answer

46
Algorithm: insert $x_t$
  • If the number of level-0 medians in $B_1$ is less than k, add
    the point $x_t$ as a level-0 median in bucket $B_1$.
    Otherwise create a new bucket $B_1$ to contain $x_t$
    and renumber the existing buckets
    accordingly.
  • If bucket $B_m$'s timestamp > N, delete it;
    now $B_{m-1}$ becomes the last bucket.
  • Make a sweep over the buckets from most recent to
    least recent: while there exists an index i > 2
    such that $f(B_{i,i-1}) \le 2 f(B_{(i-1)*})$, find the smallest
    such i and combine buckets $B_i$ and $B_{i-1}$ using the
    combination procedure described above.

47
  • Invariant 3. For every bucket $B_i$,
  • $f(B_i) \le 2 f(B_{i*})$
  • Ensures a solution with 2k medians whose cost is
    within a multiplicative factor of $2^{O(1/\tau)}$ of the
    cost of the optimal k-median solution.
  • Invariant 4. For every bucket $B_i$ (i > 1),
  • $f(B_{i,i-1}) > 2 f(B_{(i-1)*})$
  • Ensures that the number of buckets never exceeds
    $O(\frac{1}{\tau}\log N)$
  • We assume that the cost is bounded by poly(N)
  • This gives the $O(\frac{1}{\tau}\log N)$ bound stated in the article

48
Running time improvement
  • After each element arrives we check whether Invariant
    3 holds.
  • To reduce the running time, we can execute bucket
    combination only after a certain number of points
    has accumulated in bucket $B_1$; only once it fills do we
    check the invariant.
  • We assume that the algorithm is not queried after
    each new entry. Instead, it maintains enough
    statistics to produce an answer when a query
    arrives.

49
Producing exactly k clusters
  • With each median, we estimate within a constant
    factor the number of active data points that are
    assigned to it.
  • We don't cluster $B_m$ and $B_{m*}$ separately but
    cluster the medians from all the buckets
    together. However, the weights of the medians from $B_m$
    are adjusted so that they reflect only the active
    data points.

50
Conclusions
  • The goal of such algorithms is to maintain
    statistics or information for the last N
    entries of a stream that keeps growing in real time.
  • The variance algorithm uses $O(\frac{1}{\epsilon^2}\log NR^2)$ memory
    and maintains an estimate of the variance with
    relative error of at most $\epsilon$ and amortized O(1)
    time per new element
  • The k-median algorithm provides a $2^{O(1/\tau)}$
    approximation for $\tau < 0.5$. It uses $O(\frac{1}{\tau}\log N)$
    memory and requires O(1) amortized time per new
    element.

51
Questions?
  • More questions/comments can be sent to
    anatrapo_at_post.tau.ac.il