1
Data Stream Algorithms Intro, Sampling, Entropy
Graham Cormode, graham@research.att.com
2
Outline
  • Introduction to Data Streams
  • Motivating examples and applications
  • Data Streaming models
  • Basic tail bounds
  • Sampling from data streams
  • Sampling to estimate entropy

3
Data is Massive
  • Data is growing faster than our ability to store or index it
  • There are 3 billion telephone calls in the US each day, 30 billion emails daily, 1 billion SMS and IMs
  • Scientific data: NASA's observation satellites each generate billions of readings per day
  • IP network traffic: up to 1 billion packets per hour per router, and each ISP has many (hundreds of) routers!
  • Whole-genome sequences for many species are now available, each megabytes to gigabytes in size

4
Massive Data Analysis
  • Must analyze this massive data
  • Scientific research (monitor environment,
    species)
  • System management (spot faults, drops, failures)
  • Customer research (association rules, new offers)
  • For revenue protection (phone fraud, service
    abuse)
  • Else, why even measure this data?

5
Example Network Data
  • Networks are sources of massive data: the metadata per hour per router is gigabytes
  • Fundamental problem of data stream analysis: too much information to store or transmit
  • So process data as it arrives: one pass, small space, the data stream approach
  • Approximate answers to many questions are OK, if there are guarantees of result quality

6
IP Network Monitoring Application
Example: NetFlow IP session data
  • 24x7 IP packet/flow data streams at network elements
  • Truly massive streams arriving at rapid rates
  • AT&T/Sprint collect 1 Terabyte of NetFlow data each day
  • Often shipped off-site to a data warehouse for off-line analysis

7
Packet-Level Data Streams
  • Single 2 Gb/sec link; say the average packet size is 50 bytes
  • Number of packets/sec = 5 million
  • Time per packet = 0.2 microseconds
  • If we only capture header information per packet (src/dest IP, time, no. of bytes, etc.), that is at least 10 bytes
  • Space per second is 50 MB
  • Space per day is 4.5 TB per link (a quick check of this arithmetic follows below)
  • ISPs typically have hundreds of links!
  • Analyzing packet content streams is order(s) of magnitude harder
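A small back-of-the-envelope check of the figures on this slide, a sketch only, using the slide's own stated numbers (2 Gb/s link, 50-byte packets, 10-byte captured headers):

import_nothing = None  # no imports needed

link_rate_bits_per_sec = 2e9        # 2 Gb/sec link
avg_packet_bytes = 50               # assumed average packet size
header_bytes = 10                   # captured header info per packet

packets_per_sec = link_rate_bits_per_sec / (avg_packet_bytes * 8)   # 5 million
time_per_packet_us = 1e6 / packets_per_sec                          # 0.2 microseconds
capture_per_sec_MB = packets_per_sec * header_bytes / 1e6           # about 50 MB/sec
capture_per_day_TB = capture_per_sec_MB * 86400 / 1e6               # about 4.3 TB/day per link

print(packets_per_sec, time_per_packet_us, capture_per_sec_MB, capture_per_day_TB)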

8
Network Monitoring Queries
Off-line analysis: slow, expensive
[Diagram: a Network Operations Center (NOC) collects measurements from routers R1, R2, R3 linking peer, enterprise, PSTN, and DSL/cable networks]

9
Streaming Data Questions
  • Network managers ask questions requiring us to
    analyze the data
  • How many distinct addresses seen on the network?
  • Which destinations or groups use most bandwidth?
  • Find hosts with similar usage patterns?
  • Extra complexity comes from limited space and
    time
  • Will introduce solutions for these and other
    problems

10
Other Streaming Applications
  • Sensor networks
  • Monitor habitat and environmental parameters
  • Track many objects, intrusions, trend analysis
  • Utility Companies
  • Monitor power grid, customer usage patterns etc.
  • Alerts and rapid response in case of problems

11
Streams Defining Frequency Dbns.
  • We will consider streams that define frequency
    distributions
  • E.g. frequency of packets from source A to source
    B
  • This simple setting captures many of the core
    algorithmic problems in data streaming
  • How many distinct (non-zero) values seen?
  • What is the entropy of the frequency
    distribution?
  • What (and where) are the highest frequencies?
  • More generally, can consider streams that define
    multi-dimensional distributions, graphs,
    geometric data etc.
  • But even for frequency distributions, several
    models are relevant

12
Data Stream Models
  • We model data streams as sequences of simple tuples
  • Complexity arises from the massive length of streams
  • Arrivals-only streams:
  • Example: (x, 3), (y, 2), (x, 2) encodes the arrival of 3 copies of item x, 2 copies of y, then 2 copies of x
  • Could represent, e.g., packets on a network, or power usage
  • Arrivals and departures streams:
  • Example: (x, 3), (y, 2), (x, -2) encodes a final state of (x, 1), (y, 2)
  • Can represent fluctuating quantities, or measure differences between two distributions (a small sketch of both models follows below)
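To make the two models concrete, a minimal sketch (not from the slides) that materializes the frequency distribution defined by a stream of (item, count) tuples; allowing negative counts gives the arrivals-and-departures model:

from collections import defaultdict

def frequencies(stream):
    """Frequency distribution defined by a stream of (item, count) updates.
    Positive counts only: arrivals-only model; negative counts allowed:
    arrivals-and-departures model."""
    f = defaultdict(int)
    for item, count in stream:
        f[item] += count
    return dict(f)

# Arrivals only: (x,3), (y,2), (x,2) -> {x: 5, y: 2}
print(frequencies([("x", 3), ("y", 2), ("x", 2)]))
# Arrivals and departures: (x,3), (y,2), (x,-2) -> {x: 1, y: 2}
print(frequencies([("x", 3), ("y", 2), ("x", -2)]))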
13
Approximation and Randomization
  • Many things are hard to compute exactly over a stream
  • Is the count of all items the same in two different streams?
  • Requires linear space to compute exactly
  • Approximation: find an answer correct within some factor
  • Find an answer that is within 10% of the correct result
  • More generally, a (1 ± ε) factor approximation
  • Randomization: allow a small probability of failure
  • Answer is correct, except with probability 1 in 10,000
  • More generally, success probability (1-δ)
  • Approximation and Randomization: (ε, δ)-approximations

14
Basic Tools Tail Inequalities
  • General bounds on the tail probability of a random variable (the probability that a random variable deviates far from its expectation)
  • Basic inequalities: let X be a random variable with expectation μ and variance Var[X]. Then, for any ε > 0:

15
Tail Bounds
  • Markov Inequality:
  • For a random variable Y which takes only non-negative values:
  • Pr[Y ≥ k] ≤ E(Y)/k
  • (This will be < 1 only for k > E(Y))
  • Chebyshev's Inequality:
  • For a random variable Y:
  • Pr[|Y - E(Y)| ≥ k] ≤ Var(Y)/k²
  • Proof: set X = (Y - E(Y))²
  • E(X) = E(Y² + E(Y)² - 2Y·E(Y)) = E(Y²) + E(Y)² - 2E(Y)² = Var(Y)
  • So Pr[|Y - E(Y)| ≥ k] = Pr[(Y - E(Y))² ≥ k²]
  • Using Markov: ≤ E[(Y - E(Y))²]/k² = Var(Y)/k²

16
Outline
  • Introduction to Data Streams
  • Motivating examples and applications
  • Data Streaming models
  • Basic tail bounds
  • Sampling from data streams
  • Sampling to estimate entropy

17
Sampling From a Data Stream
  • Fundamental problem: sample m items uniformly from the stream
  • Useful: approximate a costly computation on a small sample
  • Challenge: we don't know how long the stream is
  • So when/how often to sample?
  • Two solutions, which apply to different situations:
  • Reservoir sampling (dates from the 1980s?)
  • Min-wise sampling (dates from the 1990s?)

18
Reservoir Sampling
  • Sample the first m items
  • Choose to sample the i-th item (i > m) with probability m/i
  • If sampled, randomly replace a previously sampled item
  • Optimization: when i gets large, compute which item will be sampled next and skip over the intervening items [Vitter 85]
  • (A short sketch of the basic algorithm follows below)
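A minimal sketch of the basic reservoir sampling loop (without Vitter's skipping optimization):

import random

def reservoir_sample(stream, m):
    """Maintain a uniform sample of m items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            sample.append(item)                  # keep the first m items
        elif random.random() < m / i:            # sample the i-th item with probability m/i
            sample[random.randrange(m)] = item   # replace a random previously sampled item
    return sample

print(reservoir_sample(range(10**5), 5))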

19
Reservoir Sampling - Analysis
  • Analyze the simple case: sample size m = 1
  • Probability the i-th item is the sample, for stream length n:
  • Prob. i is sampled on arrival × prob. i survives to the end
    = (1/i) × (i/(i+1)) × ((i+1)/(i+2)) × ... × ((n-1)/n)
    = 1/n (the product telescopes)
  • The case m > 1 is similar; easy to show uniform probability
  • Drawback of reservoir sampling: hard to parallelize

20
Min-wise Sampling
  • For each item, pick a random fraction (tag) between 0 and 1
  • Store the item(s) with the smallest random tag [Nath et al. 04]
  • Example tags: 0.391, 0.908, 0.291, 0.555, 0.619, 0.273 (keep the item with tag 0.273)
  • Each item has the same chance of the least tag, so the sample is uniform
  • Can run on multiple streams separately, then merge (see the sketch below)
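A minimal sketch of min-wise sampling; because each stream keeps only its smallest tag, samples from separate streams can be merged by taking the overall minimum:

import random

def minwise_sample(stream):
    """Return (tag, item) for the smallest random tag seen; item is a uniform sample."""
    best = None
    for item in stream:
        tag = random.random()              # random fraction in [0,1) per item
        if best is None or tag < best[0]:
            best = (tag, item)
    return best

# Samples from separate streams merge by keeping the overall minimum tag:
s1 = minwise_sample(["a", "b", "c"])
s2 = minwise_sample(["d", "e"])
print(min(s1, s2))                         # a uniform sample over the union of both streams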

21
Sampling Exercises
  • What happens when each item in the stream also
    has a weight attached, and we want to sample
    based on these weights?
  • Generalize the reservoir sampling algorithm to
    draw a single sample in the weighted case.
  • Generalize reservoir sampling to sample multiple
    weighted items, and show an example where it
    fails to give a meaningful answer.
  • Research problem design new streaming algorithms
    for sampling in the weighted case, and analyze
    their properties.

22
Outline
  • Introduction to Data Streams
  • Motivating examples and applications
  • Data Streaming models
  • Basic tail bounds
  • Sampling from data streams
  • Sampling to estimate entropy

23
Application of Sampling: Entropy
  • Given a long sequence of characters
  • S = <a1, a2, a3, ..., am>, each aj ∈ {1 ... n}
  • Let fi = frequency of i in the sequence
  • Compute the empirical entropy:
  • H(S) = -Σi fi/m log fi/m = -Σi pi log pi
  • Example: S = <a, b, a, b, c, a, d, a>
  • pa = 1/2, pb = 1/4, pc = 1/8, pd = 1/8
  • H(S) = ½ + ¼×2 + 1/8×3 + 1/8×3 = 7/4 (checked in code below)
  • Entropy has been promoted for anomaly detection in networks
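For reference, the empirical entropy of the example above can be checked directly (a small sketch, using log base 2 as in the slide):

from collections import Counter
from math import log2

def empirical_entropy(seq):
    """H(S) = -sum_i (f_i/m) log2(f_i/m) over the character frequencies f_i."""
    m = len(seq)
    return -sum((f / m) * log2(f / m) for f in Counter(seq).values())

print(empirical_entropy("ababcada"))   # 1.75, matching the example above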

24
Challenge
  • Goal: approximate H(S) in space sublinear (poly-log) in m (stream length) and n (alphabet size)
  • (ε,δ) approximation: the answer is (1±ε)H(S) with probability 1-δ
  • Easy if we have O(n) space: compute each fi exactly
  • More challenging if n is huge, m is huge, and we have only one pass over the input in order
  • (This is the data stream model)

25
Sampling Based Algorithm
  • Simple estimator:
  • Randomly sample a position j in the stream
  • Count how many times aj appears from position j onwards: r
  • Output X = -(r log(r/m) - (r-1) log((r-1)/m))
  • Claim: the estimator is unbiased, E[X] = H(S)
  • Proof: the prob of picking each j is 1/m, and the sum telescopes correctly
  • The variance of the estimate is not too large: Var[X] = O(log² m)
  • Observe that X ≤ log m
  • Var[X] = E[(X - E[X])²] < (max(X) - min(X))² = O(log² m)
  • (A sketch of the estimator, repeated and averaged, follows below)
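A sketch of this estimator, averaged over several repetitions to tame the variance. For simplicity it rescans the stream to count r; a true one-pass implementation would count occurrences online from the sampled position:

import random
from math import log2

def entropy_estimate(stream, reps=1000):
    """Average of 'reps' copies of the basic unbiased estimator: sample a random
    position j, let r = occurrences of stream[j] from position j to the end,
    output -(r log(r/m) - (r-1) log((r-1)/m))."""
    m = len(stream)
    total = 0.0
    for _ in range(reps):
        j = random.randrange(m)
        r = stream[j:].count(stream[j])      # occurrences of stream[j] from j onwards
        x = -(r * log2(r / m))
        if r > 1:
            x += (r - 1) * log2((r - 1) / m)
        total += x
    return total / reps

print(entropy_estimate(list("ababcada")))    # close to H(S) = 1.75 on average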

26
Analysis of Basic Estimator
  • A general technique in data streams:
  • Repeat an unbiased estimator with bounded variance in parallel, and take the average of the estimates to reduce the variance
  • Var[1/k (Y1 + Y2 + ... + Yk)] = 1/k Var[Y]
  • By Chebyshev, the number of repetitions k needs to be Var[X]/(ε² E²[X])
  • For entropy, this means space k = O(log² m / (ε² H²(S)))
  • Problem for entropy: what if H(S) is very small?
  • The space needed for an accurate approximation grows as 1/H²!

27
Low Entropy
  • But... what does a low entropy stream look like?
  • aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabaaaaa
  • Very boring: most of the time, we are only rarely surprised
  • Can there be two frequent items?
  • aabababababababaababababbababababababa
  • No! That's high entropy (≈ 1 bit / character)
  • The only way to get H(S) = o(1) is to have only one character with pi close to 1

28
Removing the frequent character
  • Write the entropy as
  • H(S) = -pa log pa + (1-pa) H(S')
  • where S' = stream S with all a's removed
  • Can show:
  • It doesn't matter if H(S') is small: as pa is large, additive error on H(S') ensures relative error on (1-pa)H(S')
  • Relative error (1-pa) on pa gives relative error on pa log pa
  • Summing both (positive) terms gives relative error overall

29
Finding the frequent character
  • Ejecting a is easy if we know in advance what it is
  • Can then compute pa exactly
  • Can find it online deterministically:
  • Assume pa > 2/3 (if not, H(S) > 0.9, and the original algorithm works)
  • Run a heavy hitters algorithm on the stream (see later)
  • Modify the analysis to find a and estimate pa up to ±ε(1-pa)
  • But... how to also compute H(S') simultaneously if we don't know a from the start... do we need two passes?

30
Always have a back up plan...
  • Idea: keep two samples to build our estimator
  • If at the end one of our samples is a, use the other
  • How to do this and ensure uniform sampling?
  • Pick the first sample with min-wise sampling
  • At the end of the stream, if the sampled character = a, we want a sample from the stream ignoring all a's
  • This is just the character achieving the smallest tag among characters distinct from the one that achieves the smallest tag
  • Can track the information to do this in a single pass, in constant space

31
Sampling Two Tokens
Stream:  B C D B B B A A A A A C
Tags:    0.627 0.549 0.228 0.366 0.770 0.191 0.408 0.202 0.173 0.082 0.217 0.815
(The minimum tag falls on one of the A's; its repeats are the A's from that position onwards)
  • Assign tags, choose the first token as before
  • Delete all occurrences of the first token
  • Choose the token with the min remaining tag; count its repeats
  • Implementation: keep track of two triples
  • (min tag, corresponding token, number of repeats)
  • (A one-pass sketch of this bookkeeping follows below)
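A one-pass, constant-space sketch of the two-triple bookkeeping described above; this is my own rendering of the idea under the stated assumptions, and the published algorithm may differ in details:

import random

def sample_two_tokens(stream):
    """Returns two triples (tag, token, repeats): the token at the position with the
    minimum tag, and the token with the minimum tag among positions holding a
    different token. 'repeats' counts occurrences of that token from its sampled
    position to the end of the stream. Returns (first, None) if only one distinct
    token is seen."""
    first = None    # (tag, token, repeats) for the overall minimum tag
    second = None   # same, restricted to tokens different from first's token
    for tok in stream:
        tag = random.random()
        # count further occurrences of the currently sampled tokens
        if first and tok == first[1]:
            first = (first[0], tok, first[2] + 1)
        elif second and tok == second[1]:
            second = (second[0], tok, second[2] + 1)
        # new overall minimum tag?
        if first is None:
            first = (tag, tok, 1)
        elif tag < first[0]:
            if tok == first[1]:
                first = (tag, tok, 1)        # same token: the sampled position moves forward
            else:
                second = first               # the old minimum is now the best distinct token
                first = (tag, tok, 1)
        # new minimum among tokens different from first's token?
        elif tok != first[1] and (second is None or tag < second[0]):
            second = (tag, tok, 1)
    return first, second

print(sample_two_tokens("BCDBBBAAAAAC"))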

32
Putting it all together
  • Can combine all these pieces
  • Build an estimator based on tracking this information, deciding whether there is a frequent character or not
  • A more involved Chernoff bounds argument improves the number of repetitions of the estimator from O(ε⁻² Var[X]/E²[X]) to O(ε⁻² Range[X]/E[X]) = O(ε⁻² log m)
  • In O(ε⁻² log m log 1/δ) space (words) we can compute an (ε,δ) approximation to H(S) in a single pass

33
Entropy Exercises
  • As a subroutine, we need to find an element that occurs more than 2/3 of the time and estimate its weight
  • How can we find a frequently occurring item?
  • How can we estimate its weight p with ε(1-p) error?
  • Our algorithm uses O(ε⁻² log m log 1/δ) space; could this be improved, or is it optimal (lower bounds)?
  • Our algorithm updates each sampled pair for every update; how quickly can we implement it?
  • (Research problem) What if there are multiple distributed streams and we want to compute the entropy of their union?

34
Outline
  • Introduction to Data Streams
  • Motivating examples and applications
  • Data Streaming models
  • Basic tail bounds
  • Sampling from data streams
  • Sampling to estimate entropy

35
Data Stream Algorithms Frequency Moments
Graham Cormode, graham@research.att.com
36
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for F∞ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

37
Last Time
  • Introduced data streams and data stream models
  • Focus on a stream defining a frequency
    distribution
  • Sampling to draw a uniform sample from the stream
  • Entropy estimation based on sampling

38
This Time Frequency Moments
  • Given a stream of updates, let fi be the number of times that item i is seen in the stream
  • Define Fk of the stream as Σi (fi)^k, the k-th Frequency Moment
  • "The Space Complexity of Approximating the Frequency Moments" by Alon, Matias, Szegedy (STOC 1996) studied this problem
  • Awarded the Gödel Prize in 2005
  • Set the pattern for many streaming algorithms to follow
  • Frequency moments are at the core of many streaming problems

39
Frequency Moments
  • F0: count 1 for each fi ≠ 0, i.e. the number of distinct items
  • F1: length of the stream, easy
  • F2: sum of the squares of the frequencies = self-join size
  • Fk: related to statistical moments of the distribution
  • F∞ (really lim k→∞ Fk^{1/k}): dominated by the largest fi, finds the largest frequency
  • Different techniques are needed for each one.
  • Mostly sketch techniques, which compute a certain kind of random linear projection of the stream

40
Sketches
  • Not every problem can be solved with sampling
  • Example: counting how many distinct items are in the stream
  • If a large fraction of items aren't sampled, we don't know if they are all the same or all different
  • Other techniques take advantage of the fact that the algorithm can see all the data even if it can't remember it all
  • (To me) a sketch is a linear transform of the input
  • Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix (a linear projection)

41
Trivial Example of a Sketch
x = 1 0 1 1 1 0 1 0 1
y = 1 0 1 1 0 0 1 0 1
  • Test if two (asynchronous) binary streams are equal: d(x,y) = 0 iff x = y, 1 otherwise
  • To test in small space: pick a random hash function h
  • Test h(x) = h(y): small chance of a false positive, no chance of a false negative
  • Compute h(x), h(y) incrementally as new bits arrive (e.g. h(x) = Σi xi·t^i mod p for a random prime p and t < p); see the sketch below
  • Exercise: extend to real-valued vectors in the update model
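A sketch of the incremental fingerprint; the specific prime 2^61-1 is only an assumption of this sketch, since the slide just requires some random prime p and t < p:

import random

P = 2**61 - 1                     # a large prime modulus (assumption of this sketch)

class Fingerprint:
    """Incremental hash h(x) = sum_i x_i * t^i mod P, updated one symbol at a time."""
    def __init__(self, t):
        self.t, self.h, self.power = t, 0, 1
    def update(self, x_i):
        self.h = (self.h + x_i * self.power) % P
        self.power = (self.power * self.t) % P

t = random.randrange(1, P)        # both parties must share the same random t
x = [1, 0, 1, 1, 1, 0, 1, 0, 1]
y = [1, 0, 1, 1, 0, 0, 1, 0, 1]
fx, fy = Fingerprint(t), Fingerprint(t)
for xi in x: fx.update(xi)
for yi in y: fy.update(yi)
print(fx.h == fy.h)               # equal iff x == y, except with probability about n/P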

42
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for F∞ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

43
Count-Min Sketch
  • Simple sketch idea that can be used as the basis of many different stream mining tasks
  • Model the input stream as a vector x of dimension U
  • Creates a small summary as an array of w × d in size
  • Uses d hash functions to map vector entries to {1..w}
  • Works on arrivals-only and arrivals-and-departures streams

Array CM[1..d, 1..w]: d rows of w counters
44
Count-Min Sketch Structure
Update (j,c); the array has d = log(1/δ) rows and w = 2/ε columns
  • Each entry in vector x is mapped to one bucket per row
  • Merge two sketches by entry-wise summation
  • Estimate x[j] by taking mink CM[k, hk(j)]
  • Guarantees error less than εF1 in size O(1/ε log 1/δ)
  • Probability of more error is less than δ
  • (A small sketch implementation follows below)
C, Muthukrishnan 04
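A small illustrative implementation of the Count-Min sketch described above; the hash functions are (a·x + b mod p) mod w stand-ins, which are pairwise independent up to the slight bias introduced by the final mod w:

import random

class CountMin:
    """Count-Min sketch: d rows of w counters. Point-query error is at most
    eps*F1 with probability 1-delta when w = 2/eps and d = log(1/delta)."""
    P = 2**31 - 1                                  # prime for the hash functions

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.counts = [[0] * w for _ in range(d)]
        self.hashes = [(random.randrange(1, self.P), random.randrange(self.P)) for _ in range(d)]

    def _h(self, k, j):
        a, b = self.hashes[k]
        return ((a * j + b) % self.P) % self.w

    def update(self, j, c=1):
        for k in range(self.d):
            self.counts[k][self._h(k, j)] += c

    def query(self, j):
        return min(self.counts[k][self._h(k, j)] for k in range(self.d))

cm = CountMin(w=200, d=5)
for j in [1, 2, 2, 3, 3, 3] * 1000:
    cm.update(j)
print(cm.query(3), cm.query(1))     # roughly 3000 and 1000; never underestimates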
45
Approximation of Point Queries
  • Approximate point query: x'[j] = mink CM[k, hk(j)]
  • Analysis: in the k-th row, CM[k, hk(j)] = x[j] + Xk,j
  • Xk,j = Σ { x[i] : i ≠ j, hk(i) = hk(j) }
  • E(Xk,j) = Σi≠j x[i]·Pr[hk(i) = hk(j)] ≤ (1/w)·Σi x[i] = εF1/2, by pairwise independence of hk
  • Pr[Xk,j ≥ εF1] = Pr[Xk,j ≥ 2E(Xk,j)] ≤ 1/2, by the Markov inequality
  • So, Pr[x'[j] ≥ x[j] + εF1] = Pr[∀k: Xk,j > εF1] ≤ (1/2)^{log 1/δ} = δ
  • Final result: with certainty, x[j] ≤ x'[j]; and with probability at least 1-δ, x'[j] < x[j] + εF1

46
Applications of Count-Min to F∞
  • The Count-Min sketch lets us estimate fi for any i (up to εF1)
  • F∞ asks to find maxi fi
  • Slow way: test every i after creating the sketch
  • Faster way: test every i after it is seen in the stream, and remember the largest estimated value
  • Alternate way:
  • keep a binary tree over the domain of input items, where each node corresponds to a subset
  • keep sketches of all nodes at the same level
  • descend the tree to find large frequencies, discarding branches with low frequency

47
Count-Min Exercises
  1. The median of a distribution is the item such that the sum of the frequencies of lexicographically smaller items is ½ F1. Use the CM sketch to find the (approximate) median.
  2. Assume the input frequencies follow a Zipf distribution, so that the i-th largest frequency is proportional to i^{-z} for z > 1. Show that the CM sketch only needs to be of size ε^{-1/z} to give the same guarantee.
  3. Suppose we have arrival and departure streams where the frequencies of items are allowed to be negative. Extend the CM sketch analysis to estimate these frequencies (note, the Markov argument no longer works).
  4. How to find the large absolute frequencies when some are negative? Or in the difference of two streams?

48
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for F∞ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

49
F2 estimation
  • The AMS sketch (for Alon-Matias-Szegedy) was proposed in 1996
  • Allows estimation of F2 (the second frequency moment)
  • Used at the heart of many streaming and non-streaming mining applications; achieves dimensionality reduction
  • Here, we describe the AMS sketch by generalizing the CM sketch
  • Uses extra hash functions g1...g_{log 1/δ}: {1...U} → {+1, -1}
  • Now, given update (j,c), add c·gk(j) to CM[k, hk(j)]
  • (A small sketch implementation follows below)
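A small illustrative AMS sketch along these lines; the hash functions below are convenience stand-ins built on Python's built-in hash, whereas the analysis needs h pairwise independent and g four-wise independent (see the construction two slides on):

class AMSSketch:
    """AMS / 'tug-of-war' sketch on the Count-Min layout: each update (j, c)
    adds c*g_k(j) to bucket h_k(j); F2 is estimated by the median over rows
    of the sum of squared bucket values."""
    def __init__(self, w, d, seed=0):
        self.w, self.d, self.seed = w, d, seed
        self.counts = [[0.0] * w for _ in range(d)]

    def _h(self, k, j):
        return hash((self.seed, "h", k, j)) % self.w

    def _g(self, k, j):
        return 1 if hash((self.seed, "g", k, j)) % 2 else -1

    def update(self, j, c=1):
        for k in range(self.d):
            self.counts[k][self._h(k, j)] += c * self._g(k, j)

    def estimate_f2(self):
        row_estimates = sorted(sum(v * v for v in row) for row in self.counts)
        return row_estimates[self.d // 2]          # median over the rows

ams = AMSSketch(w=400, d=7)
for j in range(100):
    ams.update(j, c=j % 5)                         # frequencies 0..4 repeating
true_f2 = sum((j % 5) ** 2 for j in range(100))
print(true_f2, ams.estimate_f2())                  # estimate should be close to 600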
50
F2 analysis
Update (j,c); the array has d = 8 log(1/δ) rows and w = 4/ε² columns
  • Estimate F2 = mediank Σi CM[k,i]²
  • Each row's result is Σi g(i)² xi² + Σ_{h(i)=h(j), i≠j} 2 g(i) g(j) xi xj
  • But g(i)² = (-1)² = (+1)² = 1, and Σi xi² = F2
  • g(i)g(j) has a 1/2 chance of being +1 or -1, so its expectation is 0

51
F2 Variance
  • The expectation of the row estimate Rk = Σi CM[k,i]² is exactly F2
  • The variance of row k, Var[Rk], is an expectation:
  • Var[Rk] = E[(Σ_{buckets b} CM[k,b]² - F2)²]
  • Good exercise in algebra: expand this sum and simplify
  • Many terms are zero in expectation because of terms like g(a)g(b)g(c)g(d) (degree at most 4)
  • Requires that the hash function g is four-wise independent: it behaves uniformly over subsets of size four or smaller
  • Such hash functions are easy to construct (a sketch follows below)
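One standard way to get such a g, sketched here: evaluate a random degree-3 polynomial modulo a prime and keep one bit of the result (the modulus 2^61-1 and the bit extraction, which is unbiased only up to O(1/P), are assumptions of this sketch):

import random

P = 2**61 - 1     # a Mersenne prime modulus (an assumption of this sketch)

def make_fourwise_pm1():
    """Four-wise independent hash g: int -> {+1, -1}, from a random degree-3
    polynomial mod P; the final bit extraction is unbiased up to O(1/P)."""
    a, b, c, d = (random.randrange(P) for _ in range(4))
    def g(x):
        v = (a * pow(x, 3, P) + b * pow(x, 2, P) + c * x + d) % P
        return 1 if v % 2 else -1
    return g

g = make_fourwise_pm1()
print([g(x) for x in range(12)])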

52
F2 Variance
  • Terms with odd powers of g(a) are zero in expectation:
  • g(a)g(b)g²(c), g(a)g(b)g(c)g(d), g(a)g³(b)
  • Leaves Var[Rk] ≤ Σi g⁴(i) xi⁴ + 2 Σ_{j≠i} g²(i) g²(j) xi² xj² + 4 Σ_{h(i)=h(j)} g²(i) g²(j) xi² xj² - (Σi xi⁴ + Σ_{j≠i} 2 xi² xj²) ≤ F2²/w
  • The row variance can finally be bounded by F2²/w
  • Chebyshev for w = 4/ε² gives probability ¼ of failure
  • How to amplify this to a small δ probability of failure?

53
Tail Inequalities for Sums
  • We derive stronger bounds on tail probabilities for the sum of independent Bernoulli trials via the Chernoff bound
  • Let X1, ..., Xm be independent Bernoulli trials s.t. Pr[Xi = 1] = p (Pr[Xi = 0] = 1-p)
  • Let X = Σ_{i=1}^m Xi, and let μ = mp be the expectation of X
  • Then, for any ε > 0, the Chernoff bound makes the probability that X deviates from μ by more than εμ exponentially small in μ

54
Applying Chernoff Bound
  • Each row gives an estimate that is within ε relative error with probability p > ¾
  • Take d repetitions and find the median. Why the median?
  • Because bad estimates are either too small or too large
  • Good estimates form a contiguous group in the middle
  • At least d/2 estimates must be bad for the median to be bad
  • Apply the Chernoff bound to the d independent estimates, with p = 3/4:
  • Pr[more than d/2 bad estimates] < 2 exp(-d/8)
  • So we set d = Θ(ln 1/δ) to give a δ probability of failure
  • The same outline is used many times in data streams (a toy sketch follows below)
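A toy illustration of the repeat-and-take-the-median outline; the "estimator" here is synthetic, not one of the sketches above:

import random
from statistics import median

def amplify(estimator, d):
    """Median of d independent copies of an estimator that is good with
    probability > 3/4; the median is bad only if more than d/2 copies are bad,
    which by a Chernoff bound happens with probability < 2*exp(-d/8)."""
    return median(estimator() for _ in range(d))

# Toy estimator: within +-10 of the true value 100 w.p. 3/4, garbage otherwise.
def noisy():
    if random.random() < 0.75:
        return 100 + random.uniform(-10, 10)
    return random.uniform(0, 1000)

print(amplify(noisy, d=41))      # almost always within +-10 of 100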

55
Aside on Independence
  • Full independence is expensive in a streaming setting
  • If hash functions are fully independent over n items, then we need Ω(n) space to store their description
  • Pairwise and four-wise independent hash functions can be described in a constant number of words
  • The F2 algorithm uses a careful mix of limited and full independence:
  • Each hash function is four-wise independent over all n items
  • Each repetition is fully independent of all others, but there are only O(log 1/δ) repetitions

56
AMS Sketch Exercises
  • Let x and y be binary streams of length n. The Hamming distance H(x,y) = |{ i : xi ≠ yi }|. Show how to use AMS sketches to approximate H(x,y)
  • Extend to strings drawn from an arbitrary alphabet
  • The inner product of two strings x, y is x · y = Σ_{i=1}^n xi·yi. Use AMS sketches to estimate x · y
  • Hint: try computing the inner product of the sketches. Show the estimator is unbiased (correct in expectation)
  • What form does the error in the approximation take?
  • Use Count-Min sketches for the same problem and compare the errors
  • Is it possible to build a (1±ε) approximation of x · y?
57
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for F∞ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

58
F0 Estimation
  • F0 is the number of distinct items in the stream
  • a fundamental quantity with many applications
  • Early algorithms by Flajolet and Martin [1983] gave a nice hashing-based solution
  • the analysis assumed fully independent hash functions
  • We will describe a generalized version of the FM algorithm, due to Bar-Yossef et al., using only pairwise independence

59
F0 Algorithm
  • Let m be the size of the domain of stream elements
  • Each item in the stream is from {1...m}
  • Pick a random hash function h: [m] → [m³]
  • With probability at least 1 - 1/m, there are no collisions under h
  • For each stream item i, compute h(i), and track the t distinct items achieving the smallest values of h(i)
  • Note: if the same i is seen many times, h(i) is the same
  • Let vt = the t-th smallest value of h(i) seen
  • If F0 < t, give the exact answer, else estimate F0' = t·m³/vt
  • vt/m³ ≈ the fraction of the hash domain [0...m³] occupied by the t smallest values
  • (A sketch of this estimator follows below)
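A sketch of this t-smallest-hash-values estimator; it hashes into [0, 2^61-1) rather than [m³] and uses a simple (a·x + b) mod p hash as a pairwise-independent stand-in, both assumptions of this sketch:

import random

def estimate_f0(stream, t, M=2**61 - 1):
    """Hash items into [0, M), keep the t smallest distinct hash values, and
    return t*M / (t-th smallest value); exact if fewer than t distinct items."""
    a, b = random.randrange(1, M), random.randrange(M)
    h = lambda x: (a * x + b) % M
    smallest = set()
    for item in stream:
        smallest.add(h(item))
        if len(smallest) > t:
            smallest.remove(max(smallest))     # keep only the t smallest values
    if len(smallest) < t:
        return len(smallest)                   # fewer than t distinct items: exact answer
    return t * M / max(smallest)               # max of the kept set is v_t

stream = [random.randrange(10000) for _ in range(50000)]
print(len(set(stream)), round(estimate_f0(stream, t=200)))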
60
Analysis of F0 algorithm
  • Suppose F0' = t·m³/vt > (1+ε) F0: the estimate is too high
  • So for the stream's set of items S ⊆ [m], we have
  • |{ s ∈ S : h(s) < t·m³/((1+ε)F0) }| > t
  • Because ε < 1, we have t·m³/((1+ε)F0) ≤ (1-ε/2)·t·m³/F0
  • Pr[h(s) < (1-ε/2)·t·m³/F0] ≈ (1/m³)·(1-ε/2)·t·m³/F0 = (1-ε/2)·t/F0
  • (this analysis outline hides some rounding issues)
61
Chebyshev Analysis
  • Let Y be the number of items hashing below t·m³/((1+ε)F0)
  • E[Y] = F0 · Pr[h(s) < t·m³/((1+ε)F0)] = (1-ε/2)·t
  • For each item i, the variance of the event is p(1-p) < p
  • Var[Y] = Σ_{s ∈ S} Var[h(s) < t·m³/((1+ε)F0)] < (1-ε/2)·t
  • We can sum the variances because of pairwise independence
  • Now apply Chebyshev:
  • Pr[Y > t] ≤ Pr[|Y - E[Y]| > εt/2] ≤ 4·Var[Y]/(ε²t²) < 4t/(ε²t²)
  • Set t = 20/ε² to make this probability ≤ 1/5

62
Completing the analysis
  • We have shown Pr[F0' > (1+ε) F0] < 1/5
  • Can show Pr[F0' < (1-ε) F0] < 1/5 similarly:
  • too few items hash below a certain value
  • So Pr[(1-ε) F0 ≤ F0' ≤ (1+ε) F0] > 3/5: a good estimate
  • Amplify this probability: repeat O(log 1/δ) times in parallel with different choices of the hash function h
  • Take the median of the estimates; the analysis is as before

63
F0 Issues
  • Space cost:
  • Store t hash values, so O(1/ε² log m) bits
  • Can improve to O(1/ε² + log m) with additional tricks
  • Time cost:
  • Find whether the hash value h(i) < vt
  • Update vt and the list of the t smallest if h(i) is not already present
  • Total time O(log 1/ε log m) worst case

64
Range Efficiency
  • Sometimes the input is specified as a stream of ranges [a,b]
  • [a,b] means insert all items (a, a+1, a+2, ..., b)
  • Trivial solution: just insert each item in the range
  • Range-efficient F0 [Pavan, Tirthapura 05]:
  • Start with an algorithm for F0 based on pairwise hash functions
  • Key problem: track which items hash into a certain range
  • Dives into the hash functions to divide and conquer for ranges
  • Range-efficient F2 [Calderbank et al. 05], [Rusu, Dobra 06]:
  • Start with sketches for F2, which sum hash values
  • Design new hash functions so that range sums are fast

65
F0 Exercises
  • Suppose the stream consists of a sequence of insertions and deletions. Design an algorithm to approximate F0 of the current set.
  • What happens when some frequencies are negative?
  • Give an algorithm to find F0 of the most recent W arrivals
  • Use F0 algorithms to approximate Max-dominance: given a stream of pairs (i, x(i)), approximate Σi max{ x(i) : (i, x(i)) in the stream }

66
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for F∞ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

67
Higher Frequency Moments
  • Fk for k > 2: use the sampling trick as with entropy [Alon et al. 96]
  • Uniformly pick a position in the stream of length n
  • Set r = how many times that item appears from there onwards
  • Set the estimate Fk' = n(r^k - (r-1)^k)
  • E[Fk'] = (1/n)·n·[(f1^k - (f1-1)^k) + ((f1-1)^k - (f1-2)^k) + ... + (1^k - 0^k) + ...] = f1^k + f2^k + ... = Fk
  • Var[Fk'] ≤ (1/n)·n²·[(f1^k - (f1-1)^k)² + ...]
  • Use various bounds to bound the variance by k·m^{1-1/k}·Fk²
  • Repeat k·m^{1-1/k} times in parallel to reduce the variance
  • Total space needed is O(k·m^{1-1/k}) machine words

68
Improvements
  • [Coppersmith, Kumar 04]: generalize the F2 approach
  • E.g. for F3, set p = 1/√m, and hash items onto {1-1/p, -1/p} with probability {1/p, 1-1/p} respectively
  • Compute the cube of the sum of the hash values of the stream
  • Correct in expectation; bound the variance ≤ O(√m·F3²)
  • [Indyk, Woodruff 05], [Bhuvanagiri et al. 06]: optimal solutions by extracting different frequencies
  • Use hashing to sample subsets of items and fi's
  • Combine these to build the correct estimator
  • Cost is O(m^{1-2/k} poly-log(m, n, 1/ε)) space

69
Combined Frequency Moments
Consider network traffic data: it defines a
communication graph, e.g. edge = (source,
destination) or edge = (source+port,
dest+port). This defines a (directed) multigraph. We are
interested in the underlying (support) graph on n
nodes.
  • Want to focus on the number of distinct communication pairs, not the size of the communication
  • So we want to compute moments of F0 values...

70
Multigraph Problems
  • Let G[i,j] = 1 if (i,j) appears in the stream (an edge from i to j). There are m distinct edges in total
  • Let di = Σ_{j=1}^n G[i,j] = the degree of node i
  • Find aggregates of the di's:
  • Estimate heavy di's (people who talk to many)
  • Estimate frequency moments: the number of distinct di values, the sum of squares
  • Range sums of di's (subnet traffic)

71
F∞(F0) using CM-FM
  • Find the i's such that di > φ Σi di: finds the people that talk to many others
  • The Count-Min sketch only uses additions, so it can be applied with F0 estimators in place of its counters

72
Accuracy for F∞(F0)
  • Focus on point query accuracy: estimate di
  • Can prove the estimate has only a small bias in expectation
  • The analysis is similar to the original CM sketch analysis, but now has to take account of the F0 estimation of the counts
  • Gives a bound of O(1/ε³ poly-log(n)) space
  • The product of the sizes of the sketches
  • It remains to fully understand other combinations of frequency moments, e.g. F2(F0), F2(F2) etc.

73
Exercises / Problems
  1. (Research problem) What can be computed for other combinations of frequency moments, e.g. F2 of F2 values, etc.?
  2. The F2 algorithm uses the fact that +1/-1 values square to preserve F2 but are 0 in expectation. Why won't it work to estimate F4 with h → {-1, 1, -i, i}?
  3. (Research problem) Read, understand and simplify the analysis for optimal Fk estimation algorithms
  4. Take the sampling Fk algorithm and combine it with F0 estimators to approximate Fk of node degrees
  5. Why can't we use the sketch approach for F2 of node degrees? Show where the analysis breaks down

74
Frequency Moments
  • Introduction to Frequency Moments and Sketches
  • Count-Min sketch for F∞ and frequent items
  • AMS Sketch for F2
  • Estimating F0
  • Extensions
  • Higher frequency moments
  • Combined frequency moments

75
Data Stream Algorithms Lower Bounds
Graham Cormode, graham@research.att.com
76
Streaming Lower Bounds
  • Lower bounds for data streams
  • Communication complexity bounds
  • Simple reductions
  • Hardness of Gap-Hamming problem
  • Reductions to Gap-Hamming

77
This Time Lower Bounds
  • So far, we have seen many examples of things we can do with a streaming algorithm
  • What about things we can't do?
  • What's the best we could achieve for things we can do?
  • We will show some simple lower bounds for data streams based on communication complexity

78
Streaming As Communication
  • Imagine Alice processing a stream
  • Then take the whole working memory, and send to
    Bob
  • Bob continues processing the remainder of the
    stream

79
Streaming As Communication
  • Suppose Alice's part of the stream corresponds to string x, and Bob's part corresponds to string y...
  • ...and that computing the function on the stream corresponds to computing f(x,y)...
  • ...then if f(x,y) has communication complexity Ω(g(n)), the streaming computation has a space lower bound of Ω(g(n))
  • Proof by contradiction: if there were an algorithm with better space usage, we could run it on x, then send the memory contents as a message, and hence solve the communication problem

80
Deterministic Equality Testing
x = 1 0 1 1 1 0 1 0 1
y = 1 0 1 1 0 0 1 0 1
  • Alice has string x, Bob has string y; they want to test if x = y
  • Consider a deterministic (one-round, one-way) protocol that sends a message of length m < n
  • There are 2^m possible messages, so some strings must generate the same message; this would cause an error
  • So a deterministic message (sketch) must be Ω(n) bits
  • In contrast, we saw a randomized sketch of size O(log n)

81
Hard Communication Problems
  • INDEX: x is a binary string of length n; y is an index in [n]. Goal: output x[y]. Result: the (one-way) (randomized) communication complexity of INDEX is Ω(n) bits
  • DISJOINTNESS: x and y are both length-n binary strings. Goal: output 1 if ∃i: x[i] = y[i] = 1, else 0. Result: the (multi-round) (randomized) communication complexity of DISJOINTNESS is Ω(n) bits

82
Simple Reduction to Disjointness
x = 1 0 1 1 0 1 → stream items 1, 3, 4, 6
y = 0 0 0 1 1 0 → stream items 4, 5
  • F∞: output the highest frequency in a stream
  • Input: the two strings x and y from DISJOINTNESS
  • Stream: if x[i] = 1, then put i in the stream; then the same for y
  • Analysis: if F∞ = 2, then there is an intersection; if F∞ ≤ 1, then the strings are disjoint
  • Conclusion: giving an exact answer to F∞ requires Ω(N) bits
  • Even approximating up to 50% error is hard
  • Even with randomization: the DISJ bound allows randomness

83
Simple Reduction to Index
x = 1 0 1 1 0 1 → stream items 1, 3, 4, 6
y = 5 → stream item 5
  • F0: output the number of distinct items in the stream
  • Input: the string x and index y from INDEX
  • Stream: if x[i] = 1, put i in the stream; then put y in the stream
  • Analysis: if (1-ε)F0(x∘y) > (1+ε)F0(x), then item y was new, so x[y] = 0; otherwise x[y] = 1
  • Conclusion: approximating F0 with ε < 1/N requires Ω(N) bits
  • Implies that the space to approximate must be Ω(1/ε)
  • The bound allows randomization

84
Hardness Reduction Exercises
  • Use reductions to DISJ or INDEX to show the hardness of:
  • Frequent items: find all items in the stream whose frequency > φN, for some φ
  • Sliding window: given a stream of binary (0/1) values, compute the sum of the last N values
  • Can this be approximated instead?
  • Min-dominance: given a stream of pairs (i, x(i)), approximate Σi min{ x(i) : (i, x(i)) in the stream }
  • Rank sum: given a stream of (x,y) pairs and a query (p,q) specified after the stream, approximate |{ (x,y) : x < p, y < q }|

85
Streaming Lower Bounds
  • Lower bounds for data streams
  • Communication complexity bounds
  • Simple reductions
  • Hardness of Gap-Hamming problem
  • Reductions to Gap-Hamming

86
Gap Hamming
  • The GAP-HAMM communication problem:
  • Alice holds x ∈ {0,1}^N, Bob holds y ∈ {0,1}^N
  • Promise: H(x,y) is either ≤ N/2 - √N or ≥ N/2 + √N
  • Which is the case?
  • Model: one message from Alice to Bob
  • Requires Ω(N) bits of one-way randomized communication
  • [Indyk, Woodruff 03], [Woodruff 04], [Jayram, Kumar, Sivakumar 07]

87
Hardness of Gap Hamming
  • Reduction from an instance of INDEX
  • Map string x to u by 1 → +1, 0 → -1 (i.e. ui = 2xi - 1)
  • Assume both Alice and Bob have access to public random strings rj, where each bit of rj is iid uniform {-1, +1}
  • Assume w.l.o.g. that the length of the string n is odd (important!)
  • Alice computes aj = sign(rj · u)
  • Bob computes bj = sign(rj[y])
  • Repeat N times with different random strings, and consider the Hamming distance of a1...aN with b1...bN

88
Probability of a Hamming Error
  • Consider the pair aj = sign(rj · u), bj = sign(rj[y])
  • Let w = Σ_{i ≠ y} ui·rj[i]
  • w is a sum of (n-1) values distributed iid uniform {-1, +1}
  • Case 1: w ≠ 0. Then |w| ≥ 2, since (n-1) is even,
  • so sign(aj) = sign(w), independent of x[y]
  • Then Pr[aj = bj] = Pr[sign(w) = sign(rj[y])] = ½
  • Case 2: w = 0. So aj = sign(rj · u) = sign(w + uy·rj[y]) = sign(uy·rj[y])
  • Then Pr[aj = bj] = Pr[sign(uy·rj[y]) = sign(rj[y])]
  • This probability is 1 if uy = +1, 0 if uy = -1
  • Completely biased by the answer to INDEX

89
Finishing the Reduction
  • So what is Pr[w = 0]?
  • w is a sum of (n-1) iid uniform {-1, +1} values
  • Textbook: Pr[w = 0] ≈ c/√n, for some constant c
  • Do some probability manipulation:
  • Pr[aj = bj] = ½ + c/(2√n) if x[y] = 1
  • Pr[aj = bj] = ½ - c/(2√n) if x[y] = 0
  • Amplify this bias by making strings of length N = 4n/c²
  • Apply a Chernoff bound on the N instances
  • With prob > 2/3, either H(a,b) > N/2 + √N or H(a,b) < N/2 - √N
  • If we could solve GAP-HAMMING, we could solve INDEX
  • Therefore, we need Ω(N) = Ω(n) bits for GAP-HAMMING
  • (A small simulation of the bias follows below)
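A small simulation of the bias at the heart of this reduction; the instance x is a made-up example, and this only illustrates the ½ ± c/(2√n) agreement probabilities, not the full amplification argument:

import random

def agreement_rate(x, y, N=20000):
    """Map x to u via 0 -> -1, 1 -> +1; for N public random +-1 strings r,
    Alice outputs a = sign(r . u), Bob outputs b = r[y] = sign(r[y]).
    The fraction of agreements is above/below 1/2 according to x[y]."""
    n = len(x)
    assert n % 2 == 1, "string length must be odd"
    u = [2 * xi - 1 for xi in x]
    agree = 0
    for _ in range(N):
        r = [random.choice((-1, 1)) for _ in range(n)]
        a = 1 if sum(ri * ui for ri, ui in zip(r, u)) > 0 else -1   # n odd, so never zero
        agree += (a == r[y])
    return agree / N

x = [1, 0, 1, 1, 0, 1, 0]    # a small hypothetical INDEX instance
print(agreement_rate(x, y=0), agreement_rate(x, y=1))   # above 1/2 for x[y]=1, below for x[y]=0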

90
Streaming Lower Bounds
  • Lower bounds for data streams
  • Communication complexity bounds
  • Simple reductions
  • Hardness of Gap-Hamming problem
  • Reductions to Gap-Hamming

91
Lower Bound for Entropy
  • Alice: x ∈ {0,1}^N, Bob: y ∈ {0,1}^N
  • Entropy estimation algorithm A
  • Alice runs A on enc(x) = ⟨(1,x1), (2,x2), ..., (N,xN)⟩
  • Alice sends over her memory contents to Bob
  • Bob continues A on enc(y) = ⟨(1,y1), (2,y2), ..., (N,yN)⟩

92
Lower Bound for Entropy
  • Observe there are:
  • 2·H(x,y) tokens with frequency 1 each
  • N - H(x,y) tokens with frequency 2 each
  • So, H(S) = log N + H(x,y)/N
  • Thus the size of Alice's memory contents is Ω(N). Setting ε = 1/(√N log N) shows a bound of Ω((ε log 1/ε)^{-2})

93
Lower Bound for F0
  • The same encoding works for F0 (Distinct Elements):
  • 2·H(x,y) tokens with frequency 1 each
  • N - H(x,y) tokens with frequency 2 each
  • F0(S) = N + H(x,y)
  • Either H(x,y) > N/2 + √N or H(x,y) < N/2 - √N
  • If we could approximate F0 with ε < 1/√N, we could separate the two cases
  • But the space bound Ω(N) = Ω(ε⁻²) bits
  • The dependence on ε for F0 is tight
  • Similar arguments show Ω(ε⁻²) bounds for Fk
  • The proof assumes k (and hence 2^k) are constants

94
Lower Bounds Exercises
  1. Formally argue the space lower bound for F2 via Gap-Hamming
  2. Argue space lower bounds for Fk via Gap-Hamming
  3. (Research problem) Extend the lower bounds to the case when the order of the stream is random or near-random
  4. (Research problem) Kumar conjectures that the multi-round communication complexity of Gap-Hamming is Ω(n); this would give lower bounds for multi-pass streaming

95
Streaming Lower Bounds
  • Lower bounds for data streams
  • Communication complexity bounds
  • Simple reductions
  • Hardness of Gap-Hamming problem
  • Reductions to Gap-Hamming

96
Data Stream Algorithms Extensions and Open
Problems
Graham Cormode, graham@research.att.com
97
This Time Extensions
  • We have given the basics of streaming: streams of items, frequency moments, upper and lower bounds
  • Many variations, with many open problems:
  • Streams representing different combinatorial objects
  • Streams that are distributed, correlated, uncertain
  • Systems for processing streams
  • Different models of streams
  • See also "Open Problems in Data Streams" [McGregor 07]
  • The result of a workshop held at IIT Kanpur in Dec 2006

98
Deterministic Streaming Algorithms
  • The focus so far has been on randomized algorithms
  • Many important problems can be solved deterministically!
  • Finding frequent items / heavy hitters
  • Finding quantiles of a distribution
  • For many problems, lower bounds show that randomization is necessary for sublinear space
  • Anything involving equality testing as a special case
  • Frequency moments
  • When they are possible, deterministic algorithms are often faster and use less space: more practical to implement

99
Clustering On Data Streams
  • Goal: output k cluster centers at the end; any point can be classified using these centers
  • Use a divide and conquer approach [Guha et al. 00]:
  • Buffer as many points as possible, then cluster them
  • Cluster the clusters
  • Cluster the cluster clusters, etc...
  • Each level of clustering gives up extra factors in quality
100
Geometric Streaming
  • The stream specifies a sequence of d-dimensional points
  • Answer various geometric problems such as:
  • Convex hull
  • Minimum spanning tree weight
  • Facility location
  • Minimum enclosing ball
  • A gridding approach reduces to Fk or related problems [Indyk 03]
  • Core-set: keep a carefully chosen small subset of points and evaluate on them [Har-Peled 02], [Chan 06]
  • Simple example: for the minimum enclosing ball, keep extremal points in evenly-spaced directions

101
Sliding Window Computations
  • In a sliding window, we only consider the last W items
  • W is still very large, so we want poly-log(W) solutions
  • Exponential Histograms [Datar et al. 02] and Waves [Gibbons, Tirthapura 02]:
  • A deterministic structure tracks counts in a window
  • Based on doubling bucket sizes to give relative error
  • The same structure plus sketches solves for aggregates
  • Asynchronous streams: items not in timestamp order
  • Relative error counts are possible [Busch, Tirthapura 07]
  • Extend the concept to other aggregates [C. et al. 08]

102
Time Decay
  • Assign a weight to each item as a function of its age
  • E.g. exponential decay or polynomial decay
  • Implies weighted versions of problems
  • [Cohen, Strauss 2003]:
  • Can reduce sums and counts to multiple instances of sliding window queries
  • [C., Korn, Tirthapura 2008]:
  • The same observation applies to other computations (quantiles, frequent items)
103
Multi-Pass Algorithms
  • Some situations allow multiple passes over the stream
  • E.g. scanning over slow storage (tape): random access is not possible, but we can scan multiple times
  • The earliest work in streaming [Munro, Paterson 78] studied the pass/space tradeoff for finding medians
  • Lower bounds can follow from multi-round communication complexity bounds

104
Other Massive Data Models
  • The Massive Unordered Data (MUD) model [Feldman et al. 08]
  • Abstracts computations in MapReduce/Hadoop settings
  • Can provably simulate deterministic streaming algorithms
  • What about randomized computations, multiple passes?

105
Skewed Streams
  • In practice, not all frequency distributions are worst case
  • A few items are frequent, then there is a long tail of infrequent items
  • Such skew is prevalent in network data, word frequency, paper citations, city sizes, etc.
  • Zipfian distribution with skew z > 0 (z = 1..2 is typical)
  • Analyze algorithms under the assumption of skewed data
  • Improved F2 space cost O(ε^{-2/(1+z)} log 1/δ), provided z > 1

106
Graph Streaming
Example edge stream: (4,5) (2,3) (1,3) (3,5) (1,2) (2,4) (1,5) (3,4)
  • The stream specifies a massive graph, edge by edge
  • Most natural problems have Ω(|V|) space lower bounds
  • Semi-streaming model: allow Õ(|V|) but o(|E|) space; therefore also o(|V|²) space
  • Allow one (or a few) passes to approximate:
  • Minimum spanning tree weight
  • Graph distances (based on spanners)
  • Maximum weight matching
  • Counting triangles

107
Matrix Streaming
  • The stream specifies a massive n × n matrix
  • Either by giving entries in some order, or as updates to entries
  • In one (or a few) passes, find:
  • CUR decomposition (O(1) columns C, O(1) rows R, and a carefully chosen U)
  • Page rank vector
  • Approximate matrix product
  • Singular value decomposition
  • Current methods take a small constant number of passes, and sample a constant number of rows and columns by weight
  • Sketching methods don't seem so useful here
108
Permutation Streaming
  • The stream presents a permutation of items
  • An abstraction of several settings, more of theoretical interest
  • Approximate the number of inversions in the stream:
  • Locations where i > j but i appears before j in the stream
  • Can be reduced to a variation of quantiles [Gupta, Zane 03]
  • Find the length of the longest increasing subsequence:
  • Reduce (up to a factor of 2) to a simpler function [Ergun, Jowhari 08]
  • Approximate this using a different variation of quantiles
  • Deterministic lower bound Ω(N^{1/2}); the randomized bound is open

109
Random Order Streaming
  • Lower bounds are sometimes based on carefully creating adversarial orders of streams
  • Random order streams: the order is uniformly permuted
  • Can sometimes give much better upper bounds: a prefix of the stream gives a good sample of the distribution to come
  • Lower bounds in random order give stronger evidence of robust hardness, e.g. [Chakrabarti et al. 08]
  • Hardness via the communication complexity of random partitions
  • GAP-HAMMING still has a linear lower bound
  • t-party DISJOINTNESS has an Ω(n/t) lower bound

110
Probabilistic Streams
Example: S = ⟨(x, ½), (y, 1/3), (y, ¼)⟩ encodes 6 possible worlds:
G:      ∅     {x}   {y}   {x,y}  {y,y}  {x,y,y}
Pr[G]:  ¼     ¼     5/24  5/24   1/24   1/24
  • Instead of exact values, a stream of discrete distributions
  • Specifies exponentially many possible worlds
  • Adds complexity to previously studied problems:
  • Sum and Count are easy (by linearity of expectation)
  • Avg = Sum/Count is hard! (because of the ratio) [McGregor et al. 07]
  • Linearity of expectation, summation of variance
  • Allows estimation of Fk over streams [C., Garofalakis 07]

111
Distributed Streams
  • Motivated by sensor networks: large wireless nets
  • Communication drains the battery: compute more, send less
  • Key problem: design stream summary data structures that can be combined to summarize the union of streams
  • Most sketches (AMS, Count-Min, F0) naturally distribute
  • Similar results are needed for other problems

http://www.intel.com/research/exploratory/motes.htm
base station (root, coordinator)
112
Continuous Distributed Model
  • Goal: continuously track a (global) query over the streams at the coordinator while bounding the communication
  • Large-scale network-event monitoring, real-time anomaly / DDoS attack detection, power grid monitoring, ...
  • Results are known for quantiles, Fk, clustering...
  • The cost is not much higher than a one-time computation [C. et al. 08]

113
Extensions for P2P Networks
  • Much work has focused on the specifics of sensor and wired nets
  • P2P and Grid computing present alternate models:
  • The structure of multi-hop overlay networks
  • Controlled failure model: nodes explicitly leave and join
  • Allows us to think beyond the model of highly resource-constrained sensors
  • Implementations such as OpenDHT over PlanetLab [Rhea et al. 05]

114
Authenticated Stream Aggregation
  • Wide-area query processing
  • Possibly malicious aggregators:
  • Can suppress or add spurious information
  • Authenticate query results at the querier?
  • Perhaps, to within some approximation error
  • Initial steps in [Garofalakis et al. 06]
  • Sliding window: [Hadjieleftheriou et al. 07]

115
Data Stream Algorithms
  • Slides are on the web on my website
  • A long list of references is also on the web
  • http://dimacs.rutgers.edu/graham