Title: Distributed Streams Algorithms for Sliding Windows
1Distributed Streams Algorithms for Sliding Windows
- Phillip B. Gibbons,
- Srikanta Tirthapura
2Abstract
- Algorithms for estimating aggregate functions over
a sliding window of the N most recent data
items in one or more streams.
3Single stream
- The first ε-approximation scheme for the number of 1s in a sliding window.
- The first ε-approximation scheme for the sum of integers in [0..R] in a sliding window.
- Both algorithms are optimal in worst-case time and space.
- Both algorithms are deterministic.
4Distributed Streams
- The first randomized (ε, δ)-approximation scheme for
the number of 1s in a sliding window over the
union of distributed streams.
5Usage
- Network Monitoring
- Data Warehousing
- Telecommunications
- Sensor Networks
6- Multiple data sources: the Distributed Streams model
- Only the most recent data is important: sliding windows
7The Goal in the algorithms
- Approximate a function F while minimizing:
- 1. The total memory
- 2. The time taken by each party to process a data item
- 3. The time to produce an estimate (query time)
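8Definition 1: An (ε, δ)-approximation scheme for a quantity X
- A randomized procedure that, given any positive ε < 1 and δ < 1, computes an estimate of X that is within a relative error of ε with probability at least 1 - δ.
- ε-approximation scheme: a deterministic procedure that, given any positive ε < 1, computes an estimate whose worst-case relative error is at most ε.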
9An Example for Basic Counting Problem
10Algorithms for Distributed Stream
- Each party observes only its own stream.
- Each party communicates with the other parties only when an estimate is requested.
- Each party sends a message to a Referee, who computes the estimate.
11The Idea
- Store a wave consisting of many random samples of the stream.
- Samples that contain only recent items are sampled at a high probability, while those containing old items are sampled at a lower probability.
12Contributions
- Introducing a data structure called waves.
- Presenting the first ε-approximation scheme for Basic Counting.
- Presenting the first ε-approximation scheme for the sum of integers in [0..R].
- Both are optimal in worst-case space, processing time and query time.
13Contributions
- Presenting the first randomized (ε, δ)-approximation scheme for the number of 1s in a sliding window over the union of distributed streams.
14Related Work
- From the paper of Datar et al
- Uses the Exponential Histogram data structure
15Exponential Histogram
- Maintain more information about recently seen items and less about old items.
- The k0 most recent 1s are assigned to individual buckets.
- The k1 next most recent 1s are assigned to buckets of size 2.
- The k2 next most recent 1s are assigned to buckets of size 4.
- And so on, until all the 1s in the last N items are assigned to some bucket.
16Exponential Histogram
- Each ki is either ⌈1/(2ε)⌉ or ⌈1/(2ε)⌉ + 1.
- The last bucket is discarded if its position no longer falls within the window.
- If the new item is a 1, it is assigned to a new bucket of size 1.
- If this makes k0 too large, then the two least recent buckets of size 1 are merged to form a bucket of size 2.
- If k1 is now too large, the two least recent buckets of size 2 are merged.
- And so on, resulting in a cascade of up to log N bucket merges in the worst case.
- The approach using waves avoids this cascading.
17The Basic Wave
- Assumption: 1/ε is an integer.
- Counters:
- 1. pos: the current length of the stream
- 2. rank: the current number of 1s in the stream
- The wave contains the positions of the recent 1s in the stream, arranged at different levels.
- For i = 0, 1, ..., l-1, level i contains the positions of the 1/ε + 1 most recent 1-bits whose 1-rank is a multiple of 2^i.
18An Example for Basic Wave
- The crest of the wave is always over the largest 1-rank.
- N = 48, 1/ε = 3, l = 5
19Estimation Steps
- Let s = max(0, pos - n + 1); we estimate the number of 1s in [s, pos].
- Let p1 be the maximum position less than s, and p2 the minimum position greater than or equal to s.
- Let r1 and r2 be the 1-ranks of p1 and p2 respectively.
- Return rank - r + 1, where r = r2 if r2 - r1 = 1, and otherwise r = (r1 + r2)/2.
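The estimation steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation. It assumes that the wave has l = ⌈log2(2εN)⌉ levels (which matches the example N = 48, 1/ε = 3, l = 5), that each level keeps 1/ε + 1 entries, and that every level starts with a sentinel entry (position 0, rank 0). The class name `BasicWave` and its methods are hypothetical.

```python
from collections import deque


class BasicWave:
    """Basic wave for counting 1s in a sliding window (illustrative sketch)."""

    def __init__(self, max_window, inv_eps):
        self.inv_eps = inv_eps  # 1/eps, assumed to be an integer
        # Assumption: l is the smallest integer with 2^l >= 2 * eps * N.
        self.num_levels = 1
        while (1 << self.num_levels) * inv_eps < 2 * max_window:
            self.num_levels += 1
        # Assumption: each level starts with a sentinel (position 0, rank 0),
        # since rank 0 is a multiple of every 2^i.
        self.levels = [deque([(0, 0)], maxlen=inv_eps + 1)
                       for _ in range(self.num_levels)]
        self.pos = 0   # current length of the stream
        self.rank = 0  # current number of 1s in the stream

    def add(self, bit):
        self.pos += 1
        if bit:
            self.rank += 1
            # Level i keeps the most recent 1-bits whose rank is a multiple of 2^i.
            for i, level in enumerate(self.levels):
                if self.rank % (1 << i) != 0:
                    break
                level.append((self.pos, self.rank))

    def estimate(self, n):
        """Estimate the number of 1s among the last n items (n <= max_window)."""
        s = max(0, self.pos - n + 1)
        if s == 0:
            return self.rank  # the window covers the whole stream
        entries = {e for level in self.levels for e in level}
        older = [e for e in entries if e[0] < s]
        newer = [e for e in entries if e[0] >= s]
        if not newer:
            return 0  # the most recent 1 is already outside the window
        _, r1 = max(older)  # rank at p1, the largest position below s
        _, r2 = min(newer)  # rank at p2, the smallest position at or above s
        r = r2 if r2 - r1 == 1 else (r1 + r2) / 2
        return self.rank - r + 1
```

The query here scans every level for clarity, so it is not constant time; the improvements listed on the later slides are what bring the query down to O(1).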
20LEMMA 1
- The procedure returns an estimate that is
within a relative error of ε of the actual number
of 1s in the window.
21Proof
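- Let j be the smallest-numbered level containing position p1.
- By returning the midpoint of the range [r1, r2], we guarantee that the absolute error is at most (r2 - r1)/2.
- There is at most a 2^j gap between r1 and its next larger 1-rank r2.
- Thus the absolute error in our estimate is at most 2^(j-1).
- Let r3 be the earliest 1-rank at level j-1.
- r3 > r1 and r3 >= r2.
- By definition, level j-1 holds 1/ε + 1 ranks spaced 2^(j-1) apart, so rank - r3 >= (1/ε) 2^(j-1). The window therefore contains at least (1/ε) 2^(j-1) 1s, and the relative error is at most ε.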
22Improvement
- Use modulo N' counters for pos and rank, and store the positions in the wave as modulo N' numbers, so that each takes only log N' bits.
- Keep track of both the largest 1-rank discarded (r1) and the smallest 1-rank (r2) still in the wave, so that the number of 1s can be answered in O(1) time.
- Instead of storing a single position in multiple levels, store each position only at its maximal level.
23Improvement
24Improvement
- The positions at each level are stored in a fixed-length queue, so that each time a new position is added, the position at the end of the queue is removed.
- Maintain a doubly linked list of the positions in the wave in increasing order.
- Storing the differences between consecutive positions, instead of the absolute positions, reduces the space.
25The deterministic wave algorithm
- Upon receiving a stream bit b:
- 1. Increment pos (modulo N' = 2N).
- 2. If the head (p, r) of the linked list L has expired (p <= pos - N), then discard it from L and from its queue, and store r as the largest 1-rank discarded.
- 3. If b = 1, then do:
- (a) Increment rank, and determine the corresponding wave level j, the largest j such that rank is a multiple of 2^j.
- (b) If the level j queue is full, discard the tail of the queue and splice it out of L.
- (c) Add (pos, rank) to the head of the level j queue and the tail of L.
26Answering a query for a sliding window of size N
- 1. Let r1 be the largest 1-rank discarded. (If there is no such r1, return rank as the exact answer.) Let r2 be the 1-rank at the head of the linked list L. (If L is empty, return 0.)
- 2. Return rank - r + 1, where r = r2 if r2 - r1 = 1, and otherwise r = (r1 + r2)/2.
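A minimal Python sketch of the update and query steps for the deterministic wave follows. It is illustrative only and rests on several assumptions that the slides do not state. The per-level queue lengths (⌈(1/ε + 1)/2⌉ below the top level, 1/ε + 1 at the top level) are assumed. Counters are kept as unbounded integers instead of modulo N'. An `OrderedDict` stands in for the doubly linked list L. Finally, r1 starts at 0 (a sentinel rank before the stream begins) rather than being undefined, which replaces the "no such r1" special case. The class name `DeterministicWave` is hypothetical.

```python
from collections import OrderedDict, deque


class DeterministicWave:
    """Each position is stored only at its maximal level (illustrative sketch)."""

    def __init__(self, window, inv_eps):
        self.window = window
        self.num_levels = 1
        while (1 << self.num_levels) * inv_eps < 2 * window:
            self.num_levels += 1
        full = inv_eps + 1
        half = (full + 1) // 2
        # Assumed queue lengths: half-size below the top level, full-size at the top.
        self.capacity = [half] * (self.num_levels - 1) + [full]
        self.queues = [deque() for _ in range(self.num_levels)]
        self.L = OrderedDict()  # position -> rank, oldest first (stands in for L)
        self.pos = 0
        self.rank = 0
        self.r1 = 0  # largest 1-rank discarded on expiry (0 = sentinel)

    def _level(self, rank):
        # Largest j such that rank is a multiple of 2^j, capped at the top level.
        trailing_zeros = (rank & -rank).bit_length() - 1
        return min(trailing_zeros, self.num_levels - 1)

    def add(self, bit):
        self.pos += 1                                   # step 1
        if self.L:                                      # step 2
            p, r = next(iter(self.L.items()))
            if p <= self.pos - self.window:
                del self.L[p]
                self.queues[self._level(r)].popleft()
                self.r1 = r
        if bit:                                         # step 3
            self.rank += 1
            j = self._level(self.rank)                  # (a)
            q = self.queues[j]
            if len(q) == self.capacity[j]:              # (b)
                del self.L[q.popleft()]
            q.append(self.pos)                          # (c)
            self.L[self.pos] = self.rank

    def query(self):
        """Estimate the number of 1s among the last `window` items."""
        if not self.L:
            return 0
        r2 = next(iter(self.L.values()))
        r = r2 if r2 - self.r1 == 1 else (self.r1 + r2) / 2
        return self.rank - r + 1
```

Both `add` and `query` touch only the head of L and one queue, which is where the O(1) per-item and per-query times come from.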
27- Space -
- Process time for each item - O(1)
- Estimate time - O(1)
- In related work (Datar et al)
- Space -
- Process time for each item - O(log(εN))
28Sum of Bounded Integers
- The sum over a sliding window can range from 0 to NR.
- Let N' be the smallest power of 2 greater than or equal to 2RN.
- Counters (modulo N'):
- 1. pos: the current length
- 2. total: the running sum
- l = log(2εNR) levels.
- Store a triple (p, v, z) for each item: p is the position, v is the value of the data item, and z is the partial sum through this item.
29- The answer to a query is the midpoint of the
interval [total - z2 + v2, total - z1)
30The Algorithm for the sum of last N items in a
data stream
- Upon receiving a stream value v between 0 and R:
- 1. Increment pos (modulo N').
- 2. If the head (p, v, z) of the linked list L has expired (p <= pos - N), then discard it from L and from its queue, and store z as the largest partial sum discarded.
- 3. If v > 0, then do:
- (a) Determine the largest j such that some number in (total, total + v] is a multiple of 2^j. Add v to total.
- (b) If the level j queue is full, discard the tail of the queue and splice it out of L.
- (c) Add (pos, v, total) to the head of the level j queue and the tail of L.
31Step 3a
- The desired wave level is the largest position j such that some number y in the interval (total, total + v] has 0s in all bit positions less than j.
- y - 1 and y differ in bit position j.
- If bit j changes from 1 to 0 at any point in [total, total + v], then j is not the largest.
- j is the position of the most significant bit that is 0 in total and 1 in total + v.
- j is the most significant bit that is 1 in the bitwise xor of total and total + v.
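The xor rule of step 3a can be checked with a short Python sketch. The function names are hypothetical, and the level is left uncapped here; an implementation with a fixed number of levels would presumably clamp it to the top level.

```python
def wave_level(total, v):
    """Largest j such that some y in (total, total + v] is a multiple of 2^j.

    Uses the xor rule: j is the most significant bit set in
    total XOR (total + v). Requires v > 0.
    """
    if v <= 0:
        raise ValueError("v must be positive")
    return (total ^ (total + v)).bit_length() - 1


def wave_level_bruteforce(total, v):
    """Same quantity, computed by scanning every y in (total, total + v]."""
    best = 0
    for y in range(total + 1, total + v + 1):
        trailing_zeros = (y & -y).bit_length() - 1
        best = max(best, trailing_zeros)
    return best
```

For example, with total = 5 and v = 4 the interval (5, 9] contains 8, and `wave_level(5, 4)` returns 3 because 5 xor 9 is 12 (binary 1100).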
32Answering a query for a sliding window of size N
- 1. Let z1 be the largest partial sum discarded from L. (If there is no such z1, return total as the exact answer.) Let (p2, v2, z2) be the head of the linked list L. (If L is empty, return 0.)
- 2. Return total - (z1 + z2 - v2)/2
33- Space - O((1/ε)(log N + log R)) memory words of O(log N + log R) bits
- Process time for each item - O(1)
- Estimate time - O(1)
- In related work (Datar et al)
- Space - O((1/ε)(log N + log R)) buckets of log N + log(log N + log R) bits
- Process time for each item - O(log N + log R)
34Distributed Streams
- Three definitions of a sliding window over a collection of t > 1 distributed streams:
- 1. Seeking the total number of 1s in the last N items in each of the t streams (tN items in total).
- 2. A single logical stream has been split arbitrarily among the parties. Each party receives items that include a sequence number in the logical stream. Seeking the total number of 1s in the last N items in the logical stream.
- 3. Seeking the total number of 1s in the last N items in the position-wise union of the t streams.
35Solution for First Scenario
- Apply the single-stream algorithm to each stream.
- To answer a query, each party sends its count to the Referee.
- The Referee sums the answers.
- Because each individual count is within ε relative error, so is the total.
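The last claim follows from the triangle inequality. Writing x_i for the true count at party i and \hat{x}_i for its estimate (symbols introduced here only for illustration), and assuming |\hat{x}_i - x_i| <= ε x_i for every party:

```latex
\[
\Bigl|\sum_{i=1}^{t}\hat{x}_i-\sum_{i=1}^{t}x_i\Bigr|
\;\le\;\sum_{i=1}^{t}\lvert\hat{x}_i-x_i\rvert
\;\le\;\sum_{i=1}^{t}\epsilon\,x_i
\;=\;\epsilon\sum_{i=1}^{t}x_i .
\]
```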
36Solution for Second Scenario
- To answer a query, each party sends its wave to the Referee.
- The Referee computes the maximum sequence number over all the parties, uses each wave to obtain an estimate over the resulting window, and sums the results.
- Because each individual count is within ε relative error, so is the total.
37Randomized Waves
- Contains the positions of the recent 1s in the data stream, stored at different levels.
- Each level i contains the most recently selected positions of the 1-bits, where a position is selected into level i with probability 2^(-i).
- The deterministic wave selects 1 out of every 2^i 1-bits at regular intervals.
- A randomized wave selects an expected 1 out of every 2^i 1-bits at random intervals.
- The randomized wave retains more positions per level.
38The Basic Randomized Wave
- Let N' be the smallest power of 2 that is at least 2N.
- Let d = log N'.
- Let ε < 1 be the desired relative error.
- Each party Pj maintains a basic randomized wave for its stream, consisting of d + 1 queues, Qj(0), ..., Qj(d), one for each level.
- A pseudo-random hash function h maps positions to levels, according to an exponential distribution.
39The Steps for Maintaining the Randomized Wave
- Party Pj, upon receiving a stream bit b:
- 1. Increment pos (modulo N').
- 2. Discard any position p in the tail of a queue that has expired (p <= pos - N).
- 3. If b = 1, then for l = 0, ..., h(pos) do:
- (a) If the level l queue Qj(l) is full, then discard the tail of Qj(l).
- (b) Add pos to the head of Qj(l).
- The sample for each level, stored in a queue, contains the c/ε^2 most recent positions selected into the level (c = 36).
40- Consider a queue Qj(l) whose tail is position i. Then Qj(l) contains all the 1-bits in the interval [i, pos] whose positions hash to a value greater than or equal to l. The interval [i, pos] is the range of the queue.
- As we move from level l to l + 1, the range may increase.
- The queues at lower-numbered levels may have ranges that fail to contain the window, but as we move to higher levels, we will find a level whose range contains the window.
41Answering a query for a sliding window of size n <= N
- After each party has observed pos bits:
- 1. Each party j sends its wave, Qj(0), ..., Qj(log N'), to the Referee. Let s = max(0, pos - n + 1). Then W = [s, pos] is the desired window.
- 2. For j = 1, ..., t, let lj be the minimum level such that the tail of Qj(lj) is a position p <= s.
- 3. Let l* = max{lj : j = 1, ..., t}. Let U be the union of all positions in Q1(l*), ..., Qt(l*).
- 4. Return 2^(l*) |U ∩ W|.
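The maintenance and query steps for the randomized wave can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' code. The hash `level_of` is a hypothetical stand-in built on SHA-256, shared by all parties through a common seed. The queue capacity is a free parameter. Whether a queue's range contains the window is decided from the last position evicted on overflow, rather than by inspecting the queue tail. The estimate is assumed to be 2^(l*) |U ∩ W|, the natural scaling when level l samples positions with probability 2^(-l).

```python
import hashlib
from collections import deque


def level_of(pos, d, seed=0):
    """Hypothetical hash h: a level in 0..d, with P[h(pos) >= l] close to 2^-l."""
    digest = hashlib.sha256(f"{seed}:{pos}".encode()).digest()
    x = int.from_bytes(digest[:8], "big")
    trailing_zeros = (x & -x).bit_length() - 1 if x else 64
    return min(trailing_zeros, d)


class RandomizedWave:
    """One party's randomized wave (illustrative sketch)."""

    def __init__(self, window, d, capacity, seed=0):
        self.window, self.d, self.capacity, self.seed = window, d, capacity, seed
        self.queues = [deque() for _ in range(d + 1)]  # oldest position on the left
        self.evicted = [-1] * (d + 1)  # last position dropped on overflow, per level
        self.pos = 0

    def add(self, bit):
        self.pos += 1                                        # step 1
        for q in self.queues:                                # step 2: expiry
            while q and q[0] <= self.pos - self.window:
                q.popleft()
        if bit:                                              # step 3
            for l in range(level_of(self.pos, self.d, self.seed) + 1):
                q = self.queues[l]
                if len(q) == self.capacity:
                    self.evicted[l] = q.popleft()
                q.append(self.pos)


def referee_estimate(waves, n):
    """Estimate the number of 1s in the position-wise union over the last n items.

    Assumes all parties have observed the same number of bits and share a seed.
    """
    pos, d = waves[0].pos, waves[0].d
    s = max(0, pos - n + 1)

    def covering_level(w):
        # Lowest level that has not evicted any position inside [s, pos].
        for l in range(d + 1):
            if w.evicted[l] < s:
                return l
        return d

    l_star = max(covering_level(w) for w in waves)
    union = set()
    for w in waves:
        union.update(p for p in w.queues[l_star] if p >= s)
    return (1 << l_star) * len(union)
```

When no queue has overflowed, level 0 covers the window at every party and the estimate is the exact union count; the sampling only takes effect once the level 0 queues fill up.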
42- The algorithm returns an estimate for the Union Counting Problem for any sliding window of size n <= N that is within a relative error ε with probability greater than 2/3
- Space -