Compact Representations in Streaming Algorithms - PowerPoint PPT Presentation

Slides: 26

Provided by: csPrin

Transcript and Presenter's Notes

Title: Compact Representations in Streaming Algorithms

1
Compact Representations in Streaming Algorithms

Moses Charikar, Princeton University

2
Talk Outline

Statistical properties of data streams
Distinct elements
Frequency moments, norm estimation
Frequent items

3
Frequency Moments [Alon, Matias, Szegedy 99]

Stream consists of elements from $\{1, 2, \ldots, n\}$
$m_i$ = number of times $i$ occurs
Frequency moment $F_k = \sum_i m_i^k$
$F_0$ = number of distinct elements
$F_1$ = size of stream
$F_2 = \sum_i m_i^2$
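
For reference, the frequency moments can be computed exactly when the whole frequency table fits in memory. The sketch below is a minimal illustration and not part of the original slides: the function name `frequency_moment` is hypothetical, and it assumes the usual definition $F_k = \sum_i m_i^k$, with $F_0$ counting distinct elements.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i m_i**k (linear space, reference only)."""
    counts = Counter(stream)          # m_i for every element i that occurs
    if k == 0:
        return len(counts)            # F_0: number of distinct elements
    return sum(m ** k for m in counts.values())

stream = [1, 3, 3, 2, 3, 1]
print(frequency_moment(stream, 0))    # 3 distinct elements
print(frequency_moment(stream, 1))    # 6, the length of the stream
print(frequency_moment(stream, 2))    # 2**2 + 1**2 + 3**2 = 14
```

This baseline stores one counter per distinct element, which is exactly the cost the streaming estimators in the rest of the talk try to avoid.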

4
Overall Scheme

Design estimator (i.e. random variable) with the right expectation
If estimator is tightly concentrated, maintain a number of independent copies of the estimator $E_1, E_2, \ldots, E_r$
Obtain estimate $E$ from $E_1, E_2, \ldots, E_r$
Within $(1 \pm \epsilon)$ with probability $1 - \delta$

5
Randomness

Design estimator assuming perfect hash functions,
as much randomness as needed
Too much space required to explicitly store such
a hash function
Fix later by showing that limited randomness
suffices

6
Distinct Elements

Estimate the number of distinct elements in a
data stream
Brute-force solution: maintain list of distinct elements seen so far
$\Omega(n)$ storage
Can we do better?

7
Distinct Elements [Flajolet, Martin 83]

Pick a random hash function $h: [n] \to [0, 1]$
Say there are $k$ distinct elements
Then the minimum value of $h$ over the $k$ distinct elements is around $1/k$
Apply $h$ to every element of the data stream; maintain the minimum value
Estimator: 1/minimum
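
The minimum-hash idea can be sketched in a few lines. This is an idealized illustration only: the class name `MinHashCounter` is hypothetical, and the dictionary simulates a perfectly random hash function into $(0, 1]$, so the sketch itself does not achieve small space.

```python
import random

class MinHashCounter:
    """Tracks the minimum hash value seen so far (idealized random hash)."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._h = {}            # simulated perfectly random hash; NOT small space
        self.minimum = 1.0

    def _hash(self, item):
        if item not in self._h:
            # 1 - random() lies in (0, 1], which avoids a zero minimum
            self._h[item] = 1.0 - self._rng.random()
        return self._h[item]

    def update(self, item):
        self.minimum = min(self.minimum, self._hash(item))

    def estimate(self):
        return 1.0 / self.minimum

counter = MinHashCounter(seed=1)
for item in [5, 7, 5, 5, 7, 9]:
    counter.update(item)
print(counter.estimate())       # a noisy estimate of the 3 distinct elements
```

Repeated occurrences of an element hash to the same value, so they never change the minimum. A single copy is very noisy, which is why the analysis averages many independent copies.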

8
(Idealized) Analysis

Assume perfectly random hash function $h: [n] \to [0, 1]$
$S$: set of $k$ elements of $[n]$
$X = \min_{a \in S} h(a)$
$E[X] = 1/(k+1)$
$\mathrm{Var}[X] = O(1/k^2)$
Mean of $O(1/\epsilon^2)$ independent estimators is within $(1 \pm \epsilon)$ of $1/k$ with constant probability
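
The two moments quoted above follow from a short calculation, assuming the $k$ hash values are independent and uniform on $[0, 1]$. A sketch of the derivation (not taken from the slides):

```latex
\begin{align*}
\Pr[X > x] &= (1-x)^k, \qquad 0 \le x \le 1,\\
\mathbb{E}[X] &= \int_0^1 \Pr[X > x]\,dx = \int_0^1 (1-x)^k\,dx = \frac{1}{k+1},\\
\mathbb{E}[X^2] &= \int_0^1 2x\,\Pr[X > x]\,dx = \frac{2}{(k+1)(k+2)},\\
\operatorname{Var}[X] &= \frac{2}{(k+1)(k+2)} - \frac{1}{(k+1)^2}
  = \frac{k}{(k+1)^2 (k+2)} = O\!\left(\frac{1}{k^2}\right).
\end{align*}
```

The standard deviation is therefore about as large as the mean, so by Chebyshev's inequality averaging $O(1/\epsilon^2)$ independent copies brings the relative error down to $\epsilon$ with constant probability.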

9
Analysis

[Alon, Matias, Szegedy] Analysis goes through with pairwise independent hash function $h(x) = ax + b$
2-approximation
$O(\log n)$ space
Many improvements [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan]
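
A pairwise independent family of the form $h(x) = ax + b$ is usually realized modulo a prime. The sketch below assumes that construction; the helper names `affine_hash` and `make_pairwise_hash` and the choice of prime are illustrative, not from the slides.

```python
import random

def affine_hash(a, b, p):
    """h(x) = (a*x + b) mod p for fixed coefficients a, b and a prime p."""
    return lambda x: (a * x + b) % p

def make_pairwise_hash(p, rng):
    """Draw a and b uniformly from {0, ..., p-1}; only these two numbers are stored."""
    return affine_hash(rng.randrange(p), rng.randrange(p), p)

P = 2_147_483_647            # the prime 2**31 - 1 (an illustrative choice)
h = make_pairwise_hash(P, random.Random(0))
u = h(42) / P                # scaled to [0, 1) for use as a minimum-hash value
```

Storing `a` and `b` takes $O(\log p)$ bits, which is where the $O(\log n)$ space bound comes from when $p$ is polynomial in $n$. For any two distinct inputs below `p`, the pair of hash values is uniform over all `p * p` possibilities.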

10
Estimating F2

$F_2 = \sum_i m_i^2$
Brute-force solution: maintain counters for all distinct elements
Sampling?
$n^{1/2}$ space

11
Estimating F2 [Alon, Matias, Szegedy]

Pick a random hash function $h: [n] \to \{+1, -1\}$
$h_i = h(i)$
$Z = \sum_i m_i h_i$
$Z$ initially 0; add $h_i$ every time you see $i$
Estimator $X = Z^2$
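
The update rule above translates directly into code. The sketch below is an illustration under idealized assumptions: the class name `F2Estimator` is hypothetical, and the sign table simulates a fully random hash function rather than a compact one.

```python
import random

class F2Estimator:
    """One copy of the estimator: Z = sum_i m_i * h(i), X = Z**2."""

    def __init__(self, rng):
        self._rng = rng
        self._sign = {}     # simulates a fully random h: [n] -> {+1, -1}
        self.z = 0

    def update(self, item):
        if item not in self._sign:
            self._sign[item] = self._rng.choice((1, -1))
        self.z += self._sign[item]      # add h(i) every time i is seen

    def estimate(self):
        return self.z ** 2

rng = random.Random(7)
stream = [1, 2, 2, 3, 3, 3] * 10        # m = (10, 20, 30), so F2 = 1400
copies = [F2Estimator(rng) for _ in range(2000)]
for item in stream:
    for c in copies:
        c.update(item)
mean_estimate = sum(c.estimate() for c in copies) / len(copies)
print(mean_estimate)                    # close to 1400 on average
```

Each copy keeps a single integer `z` besides its hash function. A single copy is unreliable; the accuracy comes from combining many copies.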

12
Analyzing the F2 estimator

13
Analyzing the F2 estimator

Median of means gives good estimator
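
For the estimator $X = Z^2$, the standard analysis gives $E[X] = F_2$ and $\mathrm{Var}[X] \le 2 F_2^2$. Averaging $O(1/\epsilon^2)$ copies controls the relative error with constant probability, and taking the median of $O(\log(1/\delta))$ such averages boosts the success probability to $1 - \delta$. A minimal sketch of the combining step, with the hypothetical helper name `median_of_means`:

```python
from statistics import mean, median

def median_of_means(estimates, groups):
    """Split estimates into `groups` equal-size groups, average each, return the median."""
    size = len(estimates) // groups
    if size == 0:
        raise ValueError("need at least one estimate per group")
    means = [mean(estimates[g * size:(g + 1) * size]) for g in range(groups)]
    return median(means)

# One wild outlier only spoils a single group mean (1, 1, 500.5), not the median.
print(median_of_means([1, 1, 1, 1, 1000, 1], groups=3))
```

The median step is what makes the combined estimate robust: a plain mean of the same six values would be pulled far away by the outlier.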

14
What about the randomness ?

Analysis only requires 4-wise independence of
hash function h
Pick h from 4-wise independent family
O(log n) space representation, efficient
computation of h(i)
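
One common way to obtain a 4-wise independent family is a random polynomial of degree 3 over a prime field. The sketch below assumes that construction; the names `poly_hash` and `make_sign_hash` and the prime are illustrative, and mapping to a sign through the low bit introduces a tiny bias because the prime is odd.

```python
import random

P = 2_147_483_647                      # prime 2**31 - 1; assumes all items are below P

def poly_hash(coeffs, x, p=P):
    """Evaluate c0 + c1*x + c2*x**2 + c3*x**3 mod p by Horner's rule."""
    value = 0
    for c in reversed(coeffs):
        value = (value * x + c) % p
    return value

def make_sign_hash(rng, p=P):
    """Draw 4 random coefficients; values at any 4 distinct points are independent."""
    coeffs = [rng.randrange(p) for _ in range(4)]
    # The low bit of the value picks the sign (very slightly biased since p is odd).
    return lambda x: 1 if poly_hash(coeffs, x, p) & 1 else -1

h = make_sign_hash(random.Random(3))
signs = [h(i) for i in range(1, 11)]
```

Only the four coefficients are stored, so the representation takes $O(\log n)$ bits when the prime is polynomial in $n$, and each evaluation costs a handful of multiplications.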

15
Properties of F2 estimator

Sketch of data stream that allows computation of $\sum_i m_i^2$
Linear function of $m_i$
Can be added, subtracted
Given two streams with frequencies $m_i$, $n_i$
$E[(Z_1 - Z_2)^2] = \sum_i (m_i - n_i)^2$
Estimate $L_2$ norm of difference
How about $L_1$ norm? $L_p$ norm?
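
Linearity is easy to check on a toy example. The sketch below assumes both streams are sketched with the same sign function; the frequency tables and the helper name `sketch` are made up for illustration.

```python
import random

def sketch(freqs, signs):
    """Z = sum_i m_i * h(i) for a frequency table `freqs` and a fixed sign table `signs`."""
    return sum(m * signs[i] for i, m in freqs.items())

rng = random.Random(11)
signs = {i: rng.choice((1, -1)) for i in range(1, 6)}   # shared h for both streams

m = {1: 4, 2: 1, 3: 7}
n = {1: 4, 3: 5, 5: 2}
diff = {i: m.get(i, 0) - n.get(i, 0) for i in set(m) | set(n)}

z1, z2 = sketch(m, signs), sketch(n, signs)
# Linearity: the difference of the sketches is the sketch of the difference.
assert z1 - z2 == sketch(diff, signs)
```

Because of this identity, $(Z_1 - Z_2)^2$ is an $F_2$ estimator for the difference vector, whose second moment here is $0^2 + 1^2 + 2^2 + 2^2 = 9$.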

16
Stable Distributions

p-stable distribution $D$: if $X_1, X_2, \ldots, X_n$ are i.i.d. samples from $D$, then $m_1 X_1 + m_2 X_2 + \cdots + m_n X_n$ is distributed as $\|(m_1, m_2, \ldots, m_n)\|_p X$
Defining property up to scale factor
Gaussian distribution is 2-stable
Cauchy distribution is 1-stable
p-stable distributions exist for all $0 < p \le 2$
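
The two named cases can be written out explicitly. The statements below are standard facts about Gaussian and Cauchy sums, given here as a worked instance of the defining property, with $m = (m_1, \ldots, m_n)$:

```latex
\begin{align*}
X_1,\dots,X_n \sim N(0,1)\ \text{i.i.d.}
  &\;\Longrightarrow\; \sum_i m_i X_i \sim N\Big(0,\ \sum_i m_i^2\Big),
  \ \text{i.e. distributed as } \|m\|_2\, X,\\
X_1,\dots,X_n \sim \mathrm{Cauchy}(0,1)\ \text{i.i.d.}
  &\;\Longrightarrow\; \sum_i m_i X_i \sim \mathrm{Cauchy}\Big(0,\ \sum_i |m_i|\Big),
  \ \text{i.e. distributed as } \|m\|_1\, X.
\end{align*}
```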

17
Estimating Lp norm [Indyk 00]

Compute $Z = m_1 X_1 + m_2 X_2 + \cdots + m_n X_n$
Distributed as $\|(m_1, m_2, \ldots, m_n)\|_p X$
Estimate scale factor of distribution from $Z_1, Z_2, \ldots, Z_r$
Given i.i.d. samples from p-stable distribution, how to estimate scale?
Compute statistical property of samples and compare to that of distribution

18
Estimating scale factor

$Z_i$ distributed as $\|(m_1, m_2, \ldots, m_n)\|_p X$
Estimate scale factor of distribution from $Z_1, Z_2, \ldots, Z_r$
Mean$(|Z_1|, |Z_2|, \ldots, |Z_r|)$ works for Gaussian distribution
$p = 1$: Cauchy distribution does not have finite mean!
Median$(|Z_1|, |Z_2|, \ldots, |Z_r|)$ works in this case
Note: sketch is linear, nice properties follow
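
For $p = 1$ the scale can be read off from the median of the absolute values, since the median of $|X|$ for a standard Cauchy variable is 1. The sketch below is an idealized illustration: it draws fresh Cauchy samples explicitly instead of generating them from a short seed, and the names `cauchy` and `l1_estimate` are hypothetical.

```python
import math
import random
from statistics import median

def cauchy(rng):
    """Standard Cauchy sample via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def l1_estimate(freqs, r, rng):
    """Median of |Z_1|, ..., |Z_r| where Z_j = sum_i m_i * X_ij with Cauchy X_ij."""
    zs = []
    for _ in range(r):
        zs.append(sum(m * cauchy(rng) for m in freqs.values()))
    return median(abs(z) for z in zs)

freqs = {1: 3, 2: -4, 3: 5}            # signed entries, e.g. a difference of two streams
est = l1_estimate(freqs, r=2001, rng=random.Random(5))
print(est)                             # close to the true L1 norm, 12
```

The sample mean would not work here: it does not converge for Cauchy variables, while the sample median concentrates around the scale factor.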

19
What about the randomness ?

Analog of 4-wise independence for F2 estimator?
Key insight: p-stable based sketch computation is done in $O(\log n)$ space
Use pseudo-random number generator which fools any space-bounded computation [Nisan 90]
Difference between using truly random and pseudo-random bits is negligible
Random seed of polylogarithmic size, efficient generation of required pseudo-random variables

20
Talk Outline

Similarity preserving hash functions
Similarity estimation
Statistical properties of data streams
Distinct elements
Frequency moments, norm estimation
Frequent items

21
Variants of F2 estimator [Alon, Gibbons, Matias, Szegedy]

Estimate join size $\sum_i m_i n_i$ of two relations with frequencies $(m_1, m_2, \ldots)$ and $(n_1, n_2, \ldots)$
Variance may be too high
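
A natural variant, assumed here for illustration, sketches both relations with the same sign function and multiplies the two counters: the product $Z_1 Z_2$ has expectation $\sum_i m_i n_i$. The toy check below enumerates every sign assignment, which is feasible only for tiny examples; the helper names are hypothetical.

```python
def sketch(freqs, signs):
    """Z = sum_i m_i * h(i) for a frequency table and a fixed sign table."""
    return sum(m * signs[i] for i, m in freqs.items())

def exact_mean_product(m, n, items):
    """Average Z1 * Z2 over every sign assignment on `items`."""
    total = 0
    for bits in range(2 ** len(items)):
        signs = {x: (1 if (bits >> j) & 1 else -1) for j, x in enumerate(items)}
        total += sketch(m, signs) * sketch(n, signs)
    return total / 2 ** len(items)

m = {1: 5, 2: 3, 3: 2}
n = {1: 4, 2: 1, 4: 6}
print(exact_mean_product(m, n, items=[1, 2, 3, 4]))   # join size 5*4 + 3*1 = 23
```

Individual products still vary a lot (here between negative values and values several times the join size), which is the variance problem the slide points to.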

22
Finding Frequent Items [Charikar, Chen, Farach-Colton 02]

Goal
Given a data stream, return an approximate list
of the k most frequent items in one pass and
sub-linear space
Applications
Analyzing search engine queries, network
traffic.

23
Finding Frequent Items

$a_i$: $i$th most frequent element
$m_i$: its frequency
If we had an oracle that gave us exact
frequencies, can find most frequent items in one
pass
Solution
A data structure called a Count Sketch that
gives good estimates of frequencies of the high
frequency elements at every point in the stream

24
Intuition

Consider a single counter $X$ with a single hash function $h: a \to \{+1, -1\}$
On seeing each element $a_i$, update the counter: $X \leftarrow X + h(a_i)$
$X = \sum_i m_i h(a_i)$
Claim: $E[X \cdot h(a_i)] = m_i$
Proof idea: cross-terms cancel because of pairwise independence
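
The single-counter scheme can be tried out directly. The sketch below is illustrative only: the class name `SingleCounter` is hypothetical, and the sign table simulates a random hash function rather than a compact pairwise independent one.

```python
import random

class SingleCounter:
    """One counter X = sum_i m_i * h(a_i) with random signs h."""

    def __init__(self, rng):
        self._rng = rng
        self._sign = {}          # simulates h: items -> {+1, -1}
        self.x = 0

    def _h(self, item):
        if item not in self._sign:
            self._sign[item] = self._rng.choice((1, -1))
        return self._sign[item]

    def update(self, item):
        self.x += self._h(item)

    def estimate(self, item):
        return self.x * self._h(item)

rng = random.Random(1)
stream = ["a"] * 5 + ["b"] * 2
estimates = []
for _ in range(1000):
    c = SingleCounter(rng)
    for item in stream:
        c.update(item)
    estimates.append(c.estimate("a"))
# Each estimate is 5 + 2 or 5 - 2; the average is close to the true count 5.
```

Every individual estimate is off by the full weight of the other element, which illustrates why one counter alone has too much variance.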

25
Finding the max element

Problem with the single counter scheme: variance is too high
Replace with an array of $t$ counters, using independent hash functions $h_1, \ldots, h_t$

$h_1: a \to \{+1, -1\}$
$h_t: a \to \{+1, -1\}$

26
Analysis of array of counters data structure

Expectation still correct
Claim: variance of final estimate $< \sum_i m_i^2 / t$
Variance of each estimate $< \sum_i m_i^2$
Proof idea: cross-terms cancel
Set $t = O(\log n \cdot \sum_i m_i^2 / (\epsilon m_1)^2)$ to get answer with high prob.
Proof idea: median of averages
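
The variance claims can be traced through a short calculation. The sketch below assumes the estimate of $m_i$ from counter $\ell$ is $X_\ell\, h_\ell(a_i)$ with pairwise independent signs and that the $t$ counters are independent; constants are not optimized.

```latex
\begin{align*}
X\,h(a_i) &= m_i + \sum_{j \ne i} m_j\, h(a_j)\, h(a_i),\\
\operatorname{Var}[X\,h(a_i)] &= \sum_{j \ne i} m_j^2 \;\le\; \sum_j m_j^2,\\
\operatorname{Var}\Big[\tfrac{1}{t}\sum_{\ell=1}^{t} X_\ell\, h_\ell(a_i)\Big]
  &\le \frac{1}{t}\sum_j m_j^2,\\
\Pr\Big[\Big|\tfrac{1}{t}\sum_{\ell=1}^{t} X_\ell\, h_\ell(a_i) - m_i\Big| \ge \epsilon m_1\Big]
  &\le \frac{\sum_j m_j^2}{t\,(\epsilon m_1)^2}
  \qquad \text{(Chebyshev)}.
\end{align*}
```

Taking averages over groups of $O(\sum_j m_j^2 / (\epsilon m_1)^2)$ counters makes each average accurate with constant probability, and the median of $O(\log n)$ such averages is accurate with high probability.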

27
Problem with array of counters data structure

Variance of estimator dominated by contribution
of large elements
Estimates for important elements such as $a_k$ corrupted by larger elements (variance much more than $m_k^2$)
To avoid collisions, replace each counter with a
hash table of b counters to spread out the large
elements
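
Putting the pieces together gives a table of $t$ rows with $b$ counters each. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: items are integers below the prime `P`, each row uses affine hashes modulo `P` for the bucket and the sign (only approximately pairwise independent after the final reduction), and the estimate is the median over rows.

```python
import random
from statistics import median

P = 2_147_483_647   # prime larger than any item id (assumption)

class CountSketch:
    def __init__(self, t, b, seed=None):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.table = [[0] * b for _ in range(t)]
        # Per row: coefficients of a bucket hash and of a sign hash.
        self.params = [tuple(rng.randrange(P) for _ in range(4)) for _ in range(t)]

    def _bucket_and_sign(self, row, item):
        a, c, d, e = self.params[row]
        bucket = ((a * item + c) % P) % self.b
        sign = 1 if ((d * item + e) % P) & 1 else -1
        return bucket, sign

    def update(self, item, count=1):
        for row in range(self.t):
            bucket, sign = self._bucket_and_sign(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        values = []
        for row in range(self.t):
            bucket, sign = self._bucket_and_sign(row, item)
            values.append(self.table[row][bucket] * sign)
        return median(values)

cs = CountSketch(t=5, b=64, seed=2)
for _ in range(1000):
    cs.update(1)                 # one heavy item
for item in range(2, 52):
    cs.update(item, 2)           # fifty light items, total weight 100
est = cs.estimate(1)             # off by at most the colliding light weight
```

Spreading items over `b` buckets means a frequent element only disturbs the few elements that share its bucket in a given row, and the median across rows discards the rows where such a collision happened.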

28
In Conclusion

Simple powerful ideas at the heart of several
algorithmic techniques for large data sets
Sketches of data tailored to applications
Many interesting research questions
