Title: Compact Representations in Streaming Algorithms
1. Compact Representations in Streaming Algorithms
- Moses Charikar, Princeton University
2. Talk Outline
- Statistical properties of data streams
- Distinct elements
- Frequency moments, norm estimation
- Frequent items
3. Frequency Moments (Alon, Matias, Szegedy 99)
- Stream consists of elements from {1, 2, …, n}
- m_i = number of times i occurs
- Frequency moment F_k = Σ_i m_i^k
- F_0 = number of distinct elements
- F_1 = size of stream
- F_2 = Σ_i m_i²
4. Overall Scheme
- Design an estimator (i.e. a random variable) with the right expectation
- If the estimator is tightly concentrated, maintain a number of independent copies of the estimator E1, E2, …, Er
- Obtain the estimate E from E1, E2, …, Er
- Within (1±ε) with probability 1−δ
5. Randomness
- Design the estimator assuming perfect hash functions, with as much randomness as needed
- Too much space is required to explicitly store such a hash function
- Fix this later by showing that limited randomness suffices
6. Distinct Elements
- Estimate the number of distinct elements in a data stream
- Brute force solution: maintain a list of distinct elements seen so far
- Ω(n) storage
- Can we do better?
7. Distinct Elements (Flajolet, Martin 83)
- Pick a random hash function h: [n] → [0, 1]
- Say there are k distinct elements
- Then the minimum value of h over the k distinct elements is around 1/k
- Apply h() to every element of the data stream, maintaining the minimum value
- Estimator: 1/minimum
8. (Idealized) Analysis
- Assume a perfectly random hash function h: [n] → [0, 1]
- S = set of k elements of [n]
- X = min_{a∈S} h(a)
- E[X] = 1/(k+1)
- Var[X] = O(1/k²)
- Mean of O(1/ε²) independent estimators is within (1±ε) of 1/k with constant probability
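The idealized analysis above can be simulated in a few lines. This is an illustrative sketch (the function name `distinct_estimate` and the use of a memoized dict as a stand-in for a truly random hash function are my own, not from the talk): it averages the minimum hash value over independent copies and inverts E[X] = 1/(k+1).

```python
import random

def distinct_estimate(stream, copies=200, seed=0):
    """Idealized Flajolet-Martin: hash each element to a uniform
    value in [0, 1] and track the minimum.  Over k distinct
    elements E[min] = 1/(k+1), so inverting the averaged minimum
    estimates k.  (Illustrative simulation; the real algorithm
    uses small hash families, not a memoized random table.)"""
    rng = random.Random(seed)
    mins = []
    for _ in range(copies):
        h = {}                      # simulated random hash function
        m = 1.0
        for x in stream:
            if x not in h:
                h[x] = rng.random()
            m = min(m, h[x])        # maintain the minimum hash value
        mins.append(m)
    avg = sum(mins) / copies
    return 1.0 / avg - 1            # invert E[min] = 1/(k+1)
```

Note that the code inverts the *averaged* minimum rather than averaging 1/min, since 1/min has unbounded expectation.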
9. Analysis
- Alon, Matias, Szegedy: the analysis goes through with a pairwise independent hash function h(x) = ax + b
- 2-approximation
- O(log n) space
- Many improvements: Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan
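A pairwise independent family of the form h(x) = ax + b can be stored in O(log n) bits. A minimal sketch of such a family (the modulus 2^61 − 1 and the helper name `make_pairwise_hash` are illustrative choices, not from the talk):

```python
import random

def make_pairwise_hash(p=2**61 - 1, seed=5):
    """Pairwise independent family h(x) = (a*x + b) mod p for a
    large prime p, scaled into [0, 1).  Only the two coefficients
    need to be stored: O(log n) space, as stated on the slide."""
    rng = random.Random(seed)
    a = rng.randrange(1, p)
    b = rng.randrange(p)
    def h(x):
        return ((a * x + b) % p) / p   # map field element into [0, 1)
    return h
```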
10. Estimating F2
- F2 = Σ_i m_i²
- Brute force solution: maintain counters for all distinct elements
- Sampling?
- n^{1/2} space
11. Estimating F2 (Alon, Matias, Szegedy)
- Pick a random hash function h: [n] → {+1, −1}
- h_i = h(i)
- Z = Σ_i m_i h_i
- Z initially 0; add h_i every time you see i
- Estimator: X = Z²
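The estimator on this slide can be simulated directly. The sketch below (the function name and the rows/cols parameters are my own, and it uses fully random memoized signs rather than a 4-wise family, for clarity) maintains counters Z, squares them, and combines them by median-of-means:

```python
import random

def f2_estimate(stream, rows=15, cols=50, seed=1):
    """Simulates the AMS F2 estimator: each counter Z adds a random
    sign h(i) in {+1, -1} per occurrence of i, so E[Z^2] = F2.
    Averaging cols counters per row reduces the variance; taking the
    median over rows reduces the failure probability."""
    rng = random.Random(seed)
    # one memoized sign function per counter (stand-in for 4-wise hashes)
    signs = [[{} for _ in range(cols)] for _ in range(rows)]
    Z = [[0] * cols for _ in range(rows)]
    for x in stream:
        for r in range(rows):
            for c in range(cols):
                s = signs[r][c]
                if x not in s:
                    s[x] = rng.choice((-1, 1))
                Z[r][c] += s[x]              # Z = sum_i m_i h(i)
    row_means = sorted(sum(z * z for z in row) / cols for row in Z)
    return row_means[rows // 2]              # median of means
```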
12. Analyzing the F2 estimator
- E[X] = E[Z²] = Σ_i m_i² = F2, since the cross terms E[h_i h_j] (i ≠ j) vanish by pairwise independence
13. Analyzing the F2 estimator
- Var[X] ≤ 2 F2², using 4-wise independence of h
- Median of means gives a good estimator
14. What about the randomness?
- The analysis only requires 4-wise independence of the hash function h
- Pick h from a 4-wise independent family
- O(log n) space representation, efficient computation of h(i)
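A standard construction of a 4-wise independent family is a random degree-3 polynomial over a prime field. A sketch under those assumptions (the specific prime 2^31 − 1 and the low-bit mapping to ±1 are illustrative choices; the low bit is negligibly biased since p is odd):

```python
import random

def make_4wise_hash(p=2**31 - 1, seed=2):
    """4-wise independent family: a random degree-3 polynomial over
    GF(p) for prime p.  Storing the four coefficients takes O(log n)
    bits, and h(x) is computed with three multiplications.  The low
    bit of the polynomial value maps it to {+1, -1}."""
    rng = random.Random(seed)
    a, b, c, d = (rng.randrange(p) for _ in range(4))
    def h(x):
        v = (((a * x + b) * x + c) * x + d) % p
        return 1 if v & 1 else -1
    return h
```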
15. Properties of the F2 estimator
- Z is a sketch of the data stream that allows computation of F2
- Linear function of the m_i
- Sketches can be added and subtracted
- Given two streams with frequencies m_i, n_i:
- E[(Z1 − Z2)²] = Σ_i (m_i − n_i)²
- Estimate the L2 norm of the difference
- How about the L1 norm? The Lp norm?
16. Stable Distributions
- p-stable distribution D: if X1, X2, …, Xn are i.i.d. samples from D, then m1 X1 + m2 X2 + … + mn Xn is distributed as ||(m1, m2, …, mn)||_p · X
- This is the defining property, up to a scale factor
- Gaussian distribution is 2-stable
- Cauchy distribution is 1-stable
- p-stable distributions exist for all 0 < p ≤ 2
17. Estimating the Lp norm (Indyk 00)
- Compute Z = m1 X1 + m2 X2 + … + mn Xn
- Distributed as ||(m1, m2, …, mn)||_p · X
- Estimate the scale factor of the distribution from Z1, Z2, …, Zr
- Given i.i.d. samples from a p-stable distribution, how do we estimate the scale?
- Compute a statistical property of the samples and compare it to that of the distribution
18. Estimating the scale factor
- Z_i distributed as ||(m1, m2, …, mn)||_p · X
- Estimate the scale factor of the distribution from Z1, Z2, …, Zr
- Mean(Z1, Z2, …, Zr) works for the Gaussian distribution
- For p = 1, the Cauchy distribution does not have a finite mean!
- Median(|Z1|, |Z2|, …, |Zr|) works in this case
- Note: the sketch is linear, so the same nice properties follow
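The L1 case can be simulated by drawing standard Cauchy samples via tan(π(U − 1/2)) and taking the median of |Z_j|, which equals the scale for a centered Cauchy. A minimal sketch (the function name and the choice of r are mine; a real streaming implementation would update the Z_j incrementally as elements arrive):

```python
import math
import random

def l1_estimate(freq, r=400, seed=3):
    """Simulates Indyk's L1 sketch: Z_j = sum_i m_i * X_ij with
    i.i.d. standard Cauchy X_ij (sampled as tan(pi*(U - 1/2))),
    so each Z_j is Cauchy with scale ||m||_1.  The Cauchy has no
    mean, but the median of |Z_j| equals the scale factor."""
    rng = random.Random(seed)
    samples = []
    for _ in range(r):
        z = sum(m * math.tan(math.pi * (rng.random() - 0.5)) for m in freq)
        samples.append(abs(z))
    samples.sort()
    return samples[r // 2]   # median of |Z_j| estimates ||m||_1
```

Because the sketch is linear in the frequencies, negative entries (as in a difference of two streams) are handled for free, as the slide notes.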
19. What about the randomness?
- Is there an analog of 4-wise independence for the F2 estimator?
- Key insight: the p-stable sketch computation is done in O(log n) space
- Use a pseudo-random number generator that fools any space-bounded computation (Nisan 90)
- The difference between using truly random and pseudo-random bits is negligible
- Random seed of polylogarithmic size, efficient generation of the required pseudo-random variables
20. Talk Outline
- Similarity preserving hash functions
- Similarity estimation
- Statistical properties of data streams
- Distinct elements
- Frequency moments, norm estimation
- Frequent items
21. Variants of the F2 estimator (Alon, Gibbons, Matias, Szegedy)
- Estimate the join size of two relations with frequencies (m1, m2, …) and (n1, n2, …), i.e. Σ_i m_i n_i
- Variance may be too high
22. Finding Frequent Items (C., Chen, Farach-Colton 02)
- Goal: given a data stream, return an approximate list of the k most frequent items in one pass and sub-linear space
- Applications: analyzing search engine queries, network traffic
23. Finding Frequent Items
- a_i = ith most frequent element
- m_i = its frequency
- If we had an oracle that gave us exact frequencies, we could find the most frequent items in one pass
- Solution: a data structure called a Count Sketch that gives good estimates of the frequencies of the high frequency elements at every point in the stream
24. Intuition
- Consider a single counter X with a single hash function h: a → {+1, −1}
- On seeing each element a_i, update the counter: X += h(a_i)
- X = Σ_i m_i h(a_i)
- Claim: E[X h(a_i)] = m_i
- Proof idea: cross terms cancel because of pairwise independence
25. Finding the max element
- Problem with the single counter scheme: the variance is too high
- Replace it with an array of t counters, using independent hash functions h1, …, ht
- h1: a → {+1, −1}, …, ht: a → {+1, −1}
26. Analysis of the array-of-counters data structure
- Expectation still correct
- Claim: variance of the final estimate < Σ m_i² / t
- Variance of each estimate < Σ m_i²
- Proof idea: cross terms cancel
- Set t = O(log n · Σ m_i² / (ε m_1)²) to get the answer with high probability
- Proof idea: median of averages
27. Problem with the array-of-counters data structure
- The variance of the estimator is dominated by the contribution of large elements
- Estimates for important elements such as a_k are corrupted by larger elements (variance much more than m_k²)
- To avoid collisions, replace each counter with a hash table of b counters to spread out the large elements
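Slides 24 through 27 together describe the Count Sketch. A compact simulation of the whole structure (the class name, the parameters t and b, and the memoized truly random hashes are illustrative simplifications of the pairwise independent families used in the paper):

```python
import random

class CountSketch:
    """Minimal simulation of a Count Sketch: t rows of b counters.
    Row r hashes item x to bucket g_r(x) and adds the sign h_r(x);
    the per-row estimate of m_x is h_r(x) * counter, and the median
    over rows is robust to collisions with a few heavy items."""
    def __init__(self, t=7, b=64, seed=4):
        self.t, self.b = t, b
        self.rng = random.Random(seed)
        self.table = [[0] * b for _ in range(t)]
        self.bucket = [{} for _ in range(t)]   # memoized g_r
        self.sign = [{} for _ in range(t)]     # memoized h_r

    def _hashes(self, x, r):
        if x not in self.bucket[r]:
            self.bucket[r][x] = self.rng.randrange(self.b)
            self.sign[r][x] = self.rng.choice((-1, 1))
        return self.bucket[r][x], self.sign[r][x]

    def add(self, x):
        for r in range(self.t):
            j, s = self._hashes(x, r)
            self.table[r][j] += s

    def estimate(self, x):
        ests = sorted(s * self.table[r][j]
                      for r in range(self.t)
                      for j, s in [self._hashes(x, r)])
        return ests[self.t // 2]   # median over rows
```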
28. In Conclusion
- Simple, powerful ideas at the heart of several algorithmic techniques for large data sets
- Sketches of data tailored to applications
- Many interesting research questions