Title: Sketching Asynchronous Streams Over a Sliding Window
1. Sketching Asynchronous Streams Over a Sliding Window
- Srikanta Tirthapura (Iowa State University)
- Bojian Xu (Iowa State University)
- Costas Busch (Rensselaer Polytechnic Institute)
2. Data Stream Processing
- Example I: all packets on a network link; maintain the number of distinct IP sources in the last hour
- Example II: large database; continuously maintain
  - Frequency moments
  - Median of all the elements
- Processing requirements
  - One-pass processing
  - Small workspace: polylogarithmic in the size of the data
  - Fast processing time per element
  - Approximate answers are OK
3. Data Stream Model
- Data stream: (v0,t0), (v1,t1), (v2,t2), ...
  - vi: observed value
  - ti: timestamp of creation
- Synchronous stream
  - ti arrive in ascending order
- Asynchronous stream
  - ti arrive with no order guaranteed
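The distinction between the two stream types can be illustrated with a small check. This is only a sketch: the function name `is_synchronous` and the sample data are made up for illustration, and "ascending" is assumed to allow equal timestamps.

```python
from typing import Iterable, Tuple

Element = Tuple[float, int]  # (observed value v_i, creation timestamp t_i)


def is_synchronous(stream: Iterable[Element]) -> bool:
    """Return True if timestamps never decrease in arrival order.

    Assumption: equal timestamps count as "in order".
    """
    last_t = None
    for _, t in stream:
        if last_t is not None and t < last_t:
            return False  # an element arrived after a newer one
        last_t = t
    return True


in_order = [(5, 1), (7, 2), (3, 2), (9, 4)]
out_of_order = [(2, 15), (3, 16), (2, 12), (3, 11), (2, 19)]

print(is_synchronous(in_order))      # True
print(is_synchronous(out_of_order))  # False
```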
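4. Why Asynchronous Data Streams?
[Figure: two ways an asynchronous stream arises. (1) Network delay and multi-path routing turn a synchronous stream into an asynchronous stream. (2) Merging synchronous streams without control yields an asynchronous stream.]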
5. Recent Elements
- More interested in elements with recent timestamps
- Example: network monitoring, current time 12:03 7/24/06

Source IP       Timestamp        Within last 5 mins?
129.186.9.17    11:59 7/24/06    yes (interesting)
129.186.59.7    11:12 7/23/06    no (not interesting)
129.186.13.9    11:45 7/23/06    no (not interesting)
129.186.5.63    12:01 7/24/06    yes (interesting)
6. Timestamp Sliding Window
- Timestamp sliding window over stream S: the elements (v,t) of S with c - W <= t <= c
  - c: current time
  - W: window size
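As a reference point, the window can be computed exactly by storing the whole stream, which is what a sketch is meant to avoid. The function name `window_elements` is hypothetical; the sample stream is the one used in the algorithm walk-through later in the deck.

```python
from typing import List, Tuple

Element = Tuple[float, int]  # (value, timestamp)


def window_elements(stream: List[Element], c: int, W: int) -> List[Element]:
    """Exact timestamp window: keep elements with c - W <= t <= c.

    This needs the full stream in memory, so it only serves as a baseline.
    """
    return [(v, t) for (v, t) in stream if c - W <= t <= c]


stream = [(2, 15), (3, 16), (2, 12), (3, 11), (2, 19)]
print(window_elements(stream, c=22, W=10))
# [(2, 15), (3, 16), (2, 12), (2, 19)]
```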
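7. Sliding Window - example
Current time: 17
Stream: (5,2), (19,7), (7,8), (22,8), (5,6)
[Figure: the stream elements placed on a clock-time axis running from old to recent, with the current window marked.]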
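8. Sliding Window - example
Current time: 18
Stream: (5,2), (19,7), (7,8), (22,8), (5,6), (9,11)
[Figure: the stream elements placed on a clock-time axis running from old to recent, with the current window marked.]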
9. Our Contributions
- First study of aggregate computation over recent elements of an asynchronous data stream
- Randomized algorithms for estimating the sum and median over a sliding window of an asynchronous stream
- Workspace much smaller than the size of the window
- Fast processing time per item
- Distributed aggregation over the union of asynchronous streams
10. Outline
- Problem: Sum of Recent Elements
- Intuition and Algorithm
- Union of Streams
11. Problem
- Network monitoring
- Current time: 12:03 7/24/06

Source IP       Value    Timestamp        Within last 5 mins?
129.186.9.17    423      11:59 7/24/06    yes (interesting)
129.186.59.7    32       11:12 7/23/06    no (not interesting)
129.186.13.9    145      11:45 7/23/06    no (not interesting)
129.186.5.63    101      12:01 7/24/06    yes (interesting)
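12. Sum Problem
- Given
  - Data stream S = (v0,t0), (v1,t1), (v2,t2), ...
  - Max sliding window size W
  - User inputs $\epsilon$, $\delta$
- Task: for all $w \le W$, continuously maintain an $(\epsilon, \delta)$-estimate of the sum of the values whose timestamps fall in the window, $V = \sum_{(v,t) \in S,\; c-w \le t \le c} v$
An $(\epsilon, \delta)$-estimate for X is a random variable Y such that $\Pr[\,|Y - X| > \epsilon X\,] < \delta$.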
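To make the target quantity concrete, the following sketch computes the exact window sum and tests whether a given estimate falls inside the relative-error band. Both function names are hypothetical, and the estimate value 8 is an arbitrary illustrative number, not a result from the authors. The probability part of the definition (failure with probability below delta) concerns repeated runs and is not modelled here.

```python
from typing import List, Tuple

Element = Tuple[float, int]  # (value, timestamp)


def exact_window_sum(stream: List[Element], c: int, w: int) -> float:
    """Exact sum of values with timestamps in [c - w, c]."""
    return sum(v for (v, t) in stream if c - w <= t <= c)


def within_relative_error(estimate: float, truth: float, eps: float) -> bool:
    """True when |estimate - truth| <= eps * truth, i.e. the estimate is acceptable."""
    return abs(estimate - truth) <= eps * truth


stream = [(2, 15), (3, 16), (2, 12), (3, 11), (2, 19)]
truth = exact_window_sum(stream, c=22, w=10)
print(truth)                                     # 9
print(within_relative_error(8, truth, eps=0.2))  # True
print(within_relative_error(8, truth, eps=0.1))  # False
```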
13. Previous Work
- M. Datar, A. Gionis, P. Indyk, R. Motwani. Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6):1794-1813, 2002.
- P. Gibbons and S. Tirthapura. Distributed streams algorithms for sliding windows. Theory of Computing Systems, 37:457-478, 2004.
14. Algorithm for Sum
- Problem: estimate the sum of the elements within the sliding window
- Random sampling
  - Randomly sample elements of this set
  - Compute the sum of the random sample
  - Multiply by an appropriate scaling factor
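The three sampling steps can be sketched as follows, assuming each element is kept independently with the same probability p, so the scaling factor is 1/p. The function name and the test data are illustrative only; this is plain uniform sampling, not the multi-level scheme the deck develops next.

```python
import random
from typing import List


def sampled_sum_estimate(values: List[float], p: float, rng: random.Random) -> float:
    """Estimate sum(values) from a random sample.

    Each value is kept independently with probability p; the sample sum
    is then scaled by 1/p, which makes the estimate unbiased.
    """
    sample = [v for v in values if rng.random() < p]
    return sum(sample) / p


rng = random.Random(0)   # fixed seed so the run is repeatable
values = [1] * 10_000    # true sum is 10000
est = sampled_sum_estimate(values, p=0.1, rng=rng)
print(est)               # close to 10000, varies with the seed
```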
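15. Intuition I
- To estimate the size of a set, sample the universe until enough elements are chosen from the set
[Figure: a population containing a subset "With Green Eyes"; the population is sampled with probability pj, and the sample contains some members with green eyes.]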
16. Intuition II
- Maintain many samples of fixed size
- Each element is randomly selected into the samples from higher level to lower level, until it fails at some sample or the lowest sample is reached.
- Each sample keeps the α most recent elements.
[Figure: elements within the sliding window are selected into each sample with probability pj.]
17. Intuition III
- Items with larger values should have more weight to be selected into the sample.

Example: element (v,t) = (3,7) within the sliding window
p = 1:     (3,7)
p = 1/2:   (1.5,7)
p = 1/4:   (1,7)
p = 1/8:   failed
p = 1/16, ..., 1/2^m: not reached

For an element (v,t) and a sample with probability p:
- If $vp \ge 1$: insert (vp, t) into the sample. (Deterministic insertion)
- If $vp < 1$: insert (1, t) into the sample with probability vp. (Random insertion)
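The insertion rule for a single sample can be sketched as below. The function name `insert_at_level` is hypothetical, and each call is treated in isolation, whereas in the deck an element that fails at one sample is not offered to later ones. In both branches the expected stored value is vp.

```python
import random
from typing import Optional, Tuple


def insert_at_level(v: float, t: int, p: float,
                    rng: random.Random) -> Optional[Tuple[float, int]]:
    """Return the pair to store in a sample with probability p, or None.

    Deterministic insertion: if v*p >= 1, store (v*p, t).
    Random insertion: if v*p < 1, store (1, t) with probability v*p.
    """
    if v * p >= 1:
        return (v * p, t)
    if rng.random() < v * p:
        return (1, t)
    return None  # the element failed at this sample


rng = random.Random(0)
print(insert_at_level(3, 7, 1.0, rng))  # (3.0, 7)
print(insert_at_level(3, 7, 0.5, rng))  # (1.5, 7)

# At p = 1/4 the element (3,7) is stored as (1,7) about 75% of the time.
trials = [insert_at_level(3, 7, 0.25, rng) for _ in range(10_000)]
hit_rate = sum(r is not None for r in trials) / len(trials)
print(hit_rate)
```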
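18. Algorithm for Sum
Stream: (2,15), (3,16), (2,12), (3,11), (2,19)
Current time at the arrival of each element: 17, 18, 20, 22, 22
c = 17, W = 10, [c-W, c] = [7, 17]
Level 0: (empty), t0 = -1
Level 1: (empty), t1 = -1
Level 2: (empty), t2 = -1
Level 3: (empty), t3 = -1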
19. Algorithm for Sum
Element (2,15) arrives: c = 17, W = 10, [c-W, c] = [7, 17]
Level 0: (2,15), t0 = -1 (deterministic insertion)
Level 1: (1,15), t1 = -1 (deterministic insertion)
Level 2: (1,15), t2 = -1 (random insertion)
Level 3: (empty), t3 = -1
20. Algorithm for Sum
Element (3,16) arrives: c = 18, W = 10, [c-W, c] = [8, 18]
Level 0: (3,16), (2,15), t0 = -1 (deterministic insertion)
Level 1: (1.5,16), (1,15), t1 = -1 (deterministic insertion)
Level 2: (1,16), (1,15), t2 = -1 (random insertion)
Level 3: (1,16), t3 = -1 (random insertion)
21. Algorithm for Sum
Element (2,12) arrives: c = 20, W = 10, [c-W, c] = [10, 20]
Level 0: (3,16), (2,15), (2,12), t0 = -1 (deterministic insertion)
Level 1: (1.5,16), (1,15), (1,12), t1 = -1 (deterministic insertion)
Level 2: (1,16), (1,15), t2 = -1
Level 3: (1,16), t3 = -1
22. Algorithm for Sum
Element (3,11) arrives: c = 22, W = 10, [c-W, c] = [12, 22]
(3,11) is out of the current window; the samples are unchanged.
Level 0: (3,16), (2,15), (2,12), t0 = -1
Level 1: (1.5,16), (1,15), (1,12), t1 = -1
Level 2: (1,16), (1,15), t2 = -1
Level 3: (1,16), t3 = -1
23. Algorithm for Sum
Element (2,19) arrives: c = 22, W = 10, [c-W, c] = [12, 22]
Level 0: (2,19), (3,16), (2,15), t0 = -1 (deterministic insertion); (2,12) overflows the sample
Level 1: (1,19), (1.5,16), (1,15), t1 = -1 (deterministic insertion); (1,12) overflows the sample
Level 2: (1,16), (1,15), t2 = -1
Level 3: (1,16), t3 = -1
24. Algorithm for Sum
After inserting (2,19): c = 22, W = 10, [c-W, c] = [12, 22]
ti: largest timestamp of all the elements discarded from the sample at level i
Level 0: (2,19), (3,16), (2,15), t0 = 12
Level 1: (1,19), (1.5,16), (1,15), t1 = 12
Level 2: (1,16), (1,15), t2 = -1
Level 3: (1,16), t3 = -1
25. Algorithm for Sum
Query: c = 22, W = 10
Level 0: (2,19), (3,16), (2,15), t0 = 12
Level 1: (1,19), (1.5,16), (1,15), t1 = 12
Level 2: (1,16), (1,15), t2 = -1
Level 3: (1,16), t3 = -1
- [c-W, c] = [12, 22]
- Levels 0 and 1 overflowed
- Use Level 2
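The walk-through can be replayed with the sketch below. Several parts are assumptions rather than the authors' implementation: the class name `SlidingSumSketch`, the level probabilities p_i = 1/2^i (read off the Intuition III slide), the use of one uniform draw `u` per element shared across levels to keep the samples nested, the optional `u` argument that lets the example fix the random outcomes, and the query rule of scaling the in-window sum of the first non-overflowed level by 2^i. The values u = 0.3 and u = 0.6 are chosen only so that the random insertions match the slides.

```python
import random
from typing import List, Optional, Tuple


class SlidingSumSketch:
    """Multi-level sample for window sums (illustrative sketch only)."""

    def __init__(self, num_levels: int, capacity: int, max_window: int,
                 seed: Optional[int] = None) -> None:
        self.capacity = capacity            # elements kept per sample
        self.max_window = max_window        # W
        self.samples: List[List[Tuple[float, int]]] = [[] for _ in range(num_levels)]
        self.discarded = [-1] * num_levels  # t_i: largest discarded timestamp
        self.rng = random.Random(seed)

    def insert(self, v: float, t: int, c: int, u: Optional[float] = None) -> None:
        if t < c - self.max_window:
            return                          # out of the current window
        if u is None:
            u = self.rng.random()           # one draw shared by all levels
        for i, sample in enumerate(self.samples):
            scaled = v / 2 ** i             # v * p_i with p_i = 1/2^i
            if scaled >= 1:
                sample.append((scaled, t))  # deterministic insertion
            elif u < scaled:
                sample.append((1, t))       # random insertion
            else:
                break                       # failed; lower levels are skipped
            if len(sample) > self.capacity:
                oldest = min(sample, key=lambda e: e[1])
                sample.remove(oldest)
                self.discarded[i] = max(self.discarded[i], oldest[1])

    def estimate(self, c: int, w: int) -> Optional[float]:
        start = c - w
        for i, sample in enumerate(self.samples):
            if self.discarded[i] < start:   # level i did not overflow
                return 2 ** i * sum(v for v, t in sample if start <= t <= c)
        return None                         # every level overflowed


sketch = SlidingSumSketch(num_levels=4, capacity=3, max_window=10)
arrivals = [  # (v, t, current time c, fixed uniform draw u)
    (2, 15, 17, 0.3), (3, 16, 18, 0.3), (2, 12, 20, 0.6),
    (3, 11, 22, 0.6), (2, 19, 22, 0.6),
]
for v, t, c, u in arrivals:
    sketch.insert(v, t, c, u)

print(sketch.discarded)                # [12, 12, -1, -1]
print(sketch.estimate(c=22, w=10))     # 8
```

Under these assumptions the query falls through to level 2 and returns 8, while the exact sum of the values with timestamps in [12, 22] is 9.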
26. Algorithm Complexity
- Space complexity
- Time complexity
  - Expected time for processing each item
  - Worst-case time for processing each item
  - Time for answering a query
Vmax: upper bound on the sum of all items within the sliding window
m: upper bound on the value of any single item
27. Union of Streams
[Figure: Alice observes stream 1 and Bob observes stream 2. Instead of forwarding the streams themselves to Carol, Alice sends sketch 1 and Bob sends sketch 2.]
- Sketch forwarding reduces the message complexity.
- Sketch is compact and lossless.
28. Union of Streams
- Sketch of stream 1: (3,13), (2,9), (3,6)
- Sketch of stream 2: (9,12), (7,10), (15,6)
- Sketch of the union of streams 1 and 2: each sample keeps the 3 most recent items.
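A per-level merge consistent with "each sample keeps the 3 most recent items" might look like the following. This is a sketch under assumptions: the function name `merge_level` is hypothetical, the grouping of the six pairs into two input samples is one reading of the slide, and tracking the largest discarded timestamp during the merge mirrors the single-stream walk-through rather than anything stated on this slide.

```python
from typing import List, Tuple

Element = Tuple[float, int]  # (value, timestamp)


def merge_level(sample_a: List[Element], t_a: int,
                sample_b: List[Element], t_b: int,
                capacity: int) -> Tuple[List[Element], int]:
    """Merge two samples from the same level of two sketches.

    Keeps the `capacity` most recent elements and returns them together
    with the largest timestamp discarded by either input or by the merge.
    """
    merged = sorted(sample_a + sample_b, key=lambda e: e[1], reverse=True)
    kept, dropped = merged[:capacity], merged[capacity:]
    t_union = max([t_a, t_b] + [t for _, t in dropped])
    return kept, t_union


sketch_1 = [(3, 13), (2, 9), (3, 6)]
sketch_2 = [(9, 12), (7, 10), (15, 6)]
print(merge_level(sketch_1, -1, sketch_2, -1, capacity=3))
# ([(3, 13), (9, 12), (7, 10)], 9)
```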
29. Proof
- Deterministic insertion: accurate portion
- Random insertion: 0-1 random variables, Hoeffding bound, error bounded

For an element (v,t) and a sample with probability p:
- If $vp \ge 1$: insert (vp, t) into the sample. (Deterministic insertion)
- If $vp < 1$: insert (1, t) into the sample with probability vp. (Random insertion)
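The slide names the Hoeffding bound without stating it. For reference, the standard two-sided form for independent random variables $X_1, \dots, X_n$ with $0 \le X_i \le 1$ is given below; the exact variant and constants used in the authors' proof may differ.

```latex
% Hoeffding's inequality for independent X_1, ..., X_n with 0 <= X_i <= 1
S = \sum_{i=1}^{n} X_i,
\qquad
\Pr\bigl[\, |S - \mathbb{E}[S]| \ge \lambda \,\bigr]
  \le 2 \exp\!\left( -\frac{2\lambda^{2}}{n} \right)
\quad \text{for all } \lambda > 0 .
```

Randomly inserted elements are stored with value 1 or not stored at all, so each contributes a 0-1 random variable, which is the setting in which this bound applies.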
30. Conclusions
- Aggregates on a sliding window over asynchronous streams
- First algorithms for the sum and median
- Distributed aggregation over the union of asynchronous streams
31. Future Work
- Deterministic algorithm
- Lower bounds