Compact Representations in Streaming Algorithms

1
Compact Representations in Streaming Algorithms
  • Moses Charikar, Princeton University

2
Talk Outline
  • Statistical properties of data streams
  • Distinct elements
  • Frequency moments, norm estimation
  • Frequent items

3
Frequency Moments [Alon, Matias, Szegedy 99]
  • Stream consists of elements from {1, 2, ..., n}
  • $m_i$ = number of times i occurs
  • Frequency moment: $F_k = \sum_i m_i^k$
  • $F_0$ = number of distinct elements
  • $F_1$ = size of stream
  • $F_2 = \sum_i m_i^2$

4
Overall Scheme
  • Design an estimator (i.e. a random variable) with the
    right expectation
  • If the estimator is tightly concentrated, maintain a
    number of independent copies of the estimator
    $E_1, E_2, \ldots, E_r$
  • Obtain the estimate E from $E_1, E_2, \ldots, E_r$
  • Within a factor of $(1 \pm \varepsilon)$ with probability $1 - \delta$
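One common way to combine the independent copies is the median of means, which the talk uses later for the F2 estimator. The sketch below is only an illustration: the function name `median_of_means` and the `group_size` parameter are assumptions, not something the slides specify.

```python
import statistics

def median_of_means(estimates, group_size):
    """Average the copies within each group, then take the median of the averages.

    Averaging reduces the variance; the median then boosts the success
    probability. A trailing partial group is dropped.
    """
    groups = [
        estimates[start:start + group_size]
        for start in range(0, len(estimates) - group_size + 1, group_size)
    ]
    if not groups:
        raise ValueError("need at least group_size estimates")
    return statistics.median([statistics.fmean(group) for group in groups])

# One group of wild outliers does not move the combined estimate.
combined = median_of_means([1, 1, 1, 2, 2, 2, 100, 100, 100], group_size=3)
```

Roughly, the group size controls the accuracy and the number of groups controls the failure probability.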

5
Randomness
  • Design estimator assuming perfect hash functions,
    as much randomness as needed
  • Too much space required to explicitly store such
    a hash function
  • Fix later by showing that limited randomness
    suffices

6
Distinct Elements
  • Estimate the number of distinct elements in a
    data stream
  • Brute-force solution: maintain a list of the distinct
    elements seen so far
  • $\Omega(n)$ storage
  • Can we do better?

7
Distinct Elements [Flajolet, Martin 83]
  • Pick a random hash function $h: [n] \to [0,1]$
  • Say there are k distinct elements
  • Then the minimum value of h over the k distinct elements
    is around 1/k
  • Apply h() to every element of the data stream and
    maintain the minimum value
  • Estimator: 1/minimum
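The estimator above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: SHA-256 stands in for the idealized random hash, and the names `unit_hash` and `estimate_distinct` are hypothetical.

```python
import hashlib

def unit_hash(item, seed=0):
    """Map an item to a pseudo-random value in [0, 1] (stand-in for an ideal hash)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def estimate_distinct(stream, seed=0):
    """Single-copy estimator: 1 / (minimum hash value seen in the stream)."""
    minimum = 1.0
    for item in stream:
        minimum = min(minimum, unit_hash(item, seed))
    # A hash value of exactly 0 is astronomically unlikely; guard anyway.
    return float("inf") if minimum == 0.0 else 1.0 / minimum

# Repeats do not change the minimum, so only the distinct elements matter.
stream = [i % 1000 for i in range(50_000)]
single_copy_estimate = estimate_distinct(stream)
```

A single copy is very noisy; the analysis on the following slides averages many independent copies.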

8
(Idealized) Analysis
  • Assume a perfectly random hash function
    $h: [n] \to [0,1]$
  • S = set of k elements of [n]
  • $X = \min_{a \in S} h(a)$
  • $E[X] = 1/(k+1)$
  • $\mathrm{Var}[X] = O(1/k^2)$
  • Mean of $O(1/\varepsilon^2)$ independent estimators is within
    $(1 \pm \varepsilon)$ of 1/k with constant probability
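The expectation and variance quoted above follow from a short calculation, sketched here under the idealized assumption that the k hash values are independent and uniform on [0,1]:

```latex
\begin{align*}
\Pr[X > x] &= (1 - x)^k \qquad (0 \le x \le 1) \\
E[X] &= \int_0^1 (1 - x)^k \, dx = \frac{1}{k+1} \\
E[X^2] &= \int_0^1 2x\,(1 - x)^k \, dx = \frac{2}{(k+1)(k+2)} \\
\mathrm{Var}[X] &= \frac{2}{(k+1)(k+2)} - \frac{1}{(k+1)^2}
  = \frac{k}{(k+1)^2 (k+2)} = O\!\left(\frac{1}{k^2}\right)
\end{align*}
```

Since the standard deviation is of the same order as the mean, averaging $O(1/\varepsilon^2)$ copies brings the relative error down to $\varepsilon$ with constant probability, by Chebyshev's inequality.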

9
Analysis
  • [Alon, Matias, Szegedy] Analysis goes through with a
    pairwise independent hash function h(x) = ax + b
  • 2-approximation
  • O(log n) space
  • Many improvements [Bar-Yossef, Jayram, Kumar,
    Sivakumar, Trevisan]
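A pairwise independent family of the form h(x) = ax + b can be sketched as follows. The choice of modulus, the scaling into [0,1], and the name `make_pairwise_hash` are assumptions made for illustration; the slide gives only the linear form.

```python
import random

P = 2**61 - 1  # a Mersenne prime, assumed to exceed the universe size n

def make_pairwise_hash(rng):
    """Draw h(x) = (a*x + b) mod P with a, b uniform in Z_P, scaled into [0, 1].

    Only a and b are stored, i.e. O(log n) bits when P is polynomial in n.
    """
    a = rng.randrange(P)
    b = rng.randrange(P)

    def h(x):
        return ((a * x + b) % P) / P

    return h, (a, b)

h, (a, b) = make_pairwise_hash(random.Random(42))
values = [h(x) for x in range(1000)]
```

For any two distinct inputs below P, the pair of values (before scaling) is uniform over all pairs, which is the property the analysis relies on.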

10
Estimating F2
  • $F_2 = \sum_i m_i^2$
  • Brute-force solution: maintain counters for all
    distinct elements
  • Sampling?
  • $n^{1/2}$ space

11
Estimating F2 [Alon, Matias, Szegedy]
  • Pick a random hash function $h: [n] \to \{+1, -1\}$
  • $h_i = h(i)$
  • $Z = \sum_i h_i m_i$
  • Z is initially 0; add $h_i$ every time you see i
  • Estimator: $X = Z^2$
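A minimal sketch of a single copy of this estimator follows. For clarity it stores one random sign per element seen, which is the idealized setting and uses linear space; the helper names `make_sign_hash` and `f2_estimate` are hypothetical.

```python
import random

def make_sign_hash(seed):
    """Idealized h: assign each new element an independent random sign in {+1, -1}."""
    rng = random.Random(seed)
    signs = {}

    def h(i):
        if i not in signs:
            signs[i] = rng.choice((1, -1))
        return signs[i]

    return h

def f2_estimate(stream, seed):
    """One copy of the estimator: Z accumulates h(i) per occurrence, X = Z^2."""
    h = make_sign_hash(seed)
    z = 0
    for i in stream:
        z += h(i)
    return z * z

# With frequencies (3, 4), Z is +-3 +- 4, so X is 49 or 1; its expectation is 25 = F2.
x = f2_estimate([1, 1, 1, 2, 2, 2, 2], seed=0)
```

In a real streaming setting the sign table would be replaced by a compactly stored hash function with limited independence.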

12
Analyzing the F2 estimator

13
Analyzing the F2 estimator
  • Median of means gives good estimator
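The equations for this analysis did not survive in the transcript. A standard calculation consistent with the estimator $X = Z^2$, $Z = \sum_i h_i m_i$, is sketched below; it assumes the signs $h_i$ are 4-wise independent with mean zero.

```latex
\begin{align*}
E[X] = E[Z^2] &= \sum_i m_i^2 \, E[h_i^2] + \sum_{i \neq j} m_i m_j \, E[h_i h_j]
  = \sum_i m_i^2 = F_2 \\
E[Z^4] &= \sum_i m_i^4 + 3 \sum_{i \neq j} m_i^2 m_j^2 \\
\mathrm{Var}[X] &= E[Z^4] - F_2^2 = 2 \sum_{i \neq j} m_i^2 m_j^2 \le 2 F_2^2
\end{align*}
```

Because the variance is at most a constant times the squared mean, averaging $O(1/\varepsilon^2)$ copies and taking the median of $O(\log(1/\delta))$ such averages gives a $(1 \pm \varepsilon)$ estimate with probability $1 - \delta$.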

14
What about the randomness ?
  • Analysis only requires 4-wise independence of
    hash function h
  • Pick h from 4-wise independent family
  • O(log n) space representation, efficient
    computation of h(i)
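One textbook way to obtain a 4-wise independent family is a random degree-3 polynomial over a prime field. The sketch below assumes that construction (the slide does not say which family is used) and maps the field element to a sign by its low bit, which introduces a bias of order 1/P; the name `make_fourwise_sign_hash` is hypothetical.

```python
import random

P = 2**61 - 1  # a Mersenne prime, assumed to exceed the universe size n

def make_fourwise_sign_hash(rng):
    """Random cubic over Z_P: its values at any 4 distinct points are independent.

    Only the four coefficients are stored, i.e. O(log n) bits.
    """
    coeffs = [rng.randrange(P) for _ in range(4)]

    def h(i):
        acc = 0
        for c in coeffs:  # Horner evaluation of the cubic at i
            acc = (acc * i + c) % P
        return 1 if acc & 1 else -1  # low bit -> sign (bias of order 1/P)

    return h

h = make_fourwise_sign_hash(random.Random(7))
signs = [h(i) for i in range(1, 101)]
```

Evaluating h(i) costs a handful of multiplications, so the update time per stream element stays constant.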

15
Properties of F2 estimator
  • Sketch of the data stream that allows computation
    of $\sum_i m_i^2$
  • Linear function of $m_i$
  • Can be added, subtracted
  • Given two streams, with frequencies $m_i$, $n_i$
  • $E[(Z_1 - Z_2)^2] = \sum_i (m_i - n_i)^2$
  • Estimate L2 norm of difference
  • How about L1 norm? Lp norm?
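Linearity can be checked directly: with a shared sign function, subtracting two sketches gives exactly the sketch of the difference of the frequency vectors. The example below is a small sanity check with an explicit sign table; `sign_table` and `sketch` are hypothetical helper names.

```python
import random
from collections import Counter

def sign_table(n, seed):
    """Idealized shared hash: one random sign per element of {0, ..., n-1}."""
    rng = random.Random(seed)
    return [rng.choice((1, -1)) for _ in range(n)]

def sketch(stream, signs):
    """Z = sum of h(i) over the stream, i.e. sum_i h_i * m_i."""
    return sum(signs[i] for i in stream)

signs = sign_table(10, seed=0)
stream_a = [1, 1, 2, 5, 5, 5]
stream_b = [1, 2, 2, 5]

z_a = sketch(stream_a, signs)
z_b = sketch(stream_b, signs)

# Sketch of the difference vector (m_i - n_i), computed from exact counts.
diff = Counter(stream_a)
diff.subtract(Counter(stream_b))
z_diff = sum(signs[i] * d for i, d in diff.items())

squared_l2_estimate = (z_a - z_b) ** 2  # one noisy copy of sum_i (m_i - n_i)^2
```

Both streams must be sketched with the same hash function, otherwise the subtraction is meaningless.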

16
Stable Distributions
  • p-stable distribution D: if $X_1, X_2, \ldots, X_n$ are
    i.i.d. samples from D, then
    $m_1 X_1 + m_2 X_2 + \cdots + m_n X_n$ is distributed as
    $\|(m_1, m_2, \ldots, m_n)\|_p \, X$, where X is drawn from D
  • Defining property, up to a scale factor
  • Gaussian distribution is 2-stable
  • Cauchy distribution is 1-stable
  • p-stable distributions exist for all $0 < p \le 2$

17
Estimating Lp norm [Indyk 00]
  • Compute $Z = m_1 X_1 + m_2 X_2 + \cdots + m_n X_n$
  • Distributed as $\|(m_1, m_2, \ldots, m_n)\|_p \, X$
  • Estimate the scale factor of the distribution from
    $Z_1, Z_2, \ldots, Z_r$
  • Given i.i.d. samples from a p-stable distribution,
    how to estimate the scale?
  • Compute a statistical property of the samples and
    compare to that of the distribution

18
Estimating scale factor
  • $Z_i$ distributed as $\|(m_1, m_2, \ldots, m_n)\|_p \, X$
  • Estimate the scale factor of the distribution from
    $Z_1, Z_2, \ldots, Z_r$
  • Mean($Z_1, Z_2, \ldots, Z_r$) works for the Gaussian distribution
  • p = 1: the Cauchy distribution does not have a finite
    mean!
  • Median($Z_1, Z_2, \ldots, Z_r$) works in this case
  • Note: the sketch is linear, so nice properties follow
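For p = 1 the idea can be sketched with standard Cauchy variables. Several details here are assumptions made for illustration: the per-element variables are regenerated from a seed instead of coming from a space-bounded pseudo-random generator, the median is taken over absolute values (the median of |X| for a standard Cauchy is 1, so it reads off the scale directly), and the function names are hypothetical.

```python
import math
import random
import statistics

def cauchy(copy, item, seed=0):
    """Reproducible standard Cauchy variable X for a given (copy, item) pair."""
    u = random.Random(f"{seed}:{copy}:{item}").random()
    return math.tan(math.pi * (u - 0.5))

def l1_sketch(stream, r, seed=0):
    """Maintain r counters Z_j = sum_i m_i * X_{j,i}, updated per stream element."""
    z = [0.0] * r
    for item in stream:
        for j in range(r):
            z[j] += cauchy(j, item, seed)
    return z

def estimate_l1(z):
    """Each Z_j is Cauchy with scale ||m||_1, so the median of |Z_j| estimates it."""
    return statistics.median([abs(v) for v in z])

z = l1_sketch([7] * 10 + [3] * 5, r=25)
estimate = estimate_l1(z)  # a noisy estimate of the L1 norm, here 15
```

The sample mean would not converge here because the Cauchy distribution has no finite mean, which is why the median is used.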

19
What about the randomness ?
  • Analog of 4-wise independence for F2 estimator ?
  • Key insight: the p-stable based sketch computation is
    done in O(log n) space
  • Use a pseudo-random number generator which fools
    any space-bounded computation [Nisan 90]
  • Difference between using truly random and
    pseudo-random bits is negligible
  • Random seed of polylogarithmic size, efficient
    generation of required pseudo-random variables

20
Talk Outline
  • Similarity preserving hash functions
  • Similarity estimation
  • Statistical properties of data streams
  • Distinct elements
  • Frequency moments, norm estimation
  • Frequent items

21
Variants of F2 estimator [Alon, Gibbons, Matias,
Szegedy]
  • Estimate join size of two relations $(m_1, m_2, \ldots)$
    and $(n_1, n_2, \ldots)$
  • Variance may be too high
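The formulas for this slide are missing from the transcript. A standard form of the join-size estimator, stated here as an assumption, sketches both relations with the same 4-wise independent sign function and multiplies the two counters:

```latex
\begin{align*}
Z_1 &= \sum_i h_i m_i, \qquad Z_2 = \sum_i h_i n_i \\
E[Z_1 Z_2] &= \sum_i m_i n_i \\
\mathrm{Var}[Z_1 Z_2] &\le 2 \Big( \sum_i m_i^2 \Big) \Big( \sum_i n_i^2 \Big)
\end{align*}
```

The variance bound depends on the self-join sizes of both relations, which can be far larger than the squared join size; that is the sense in which the variance may be too high.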

22
Finding Frequent Items [C, Chen, Farach-Colton 02]
  • Goal:
  • Given a data stream, return an approximate list
    of the k most frequent items in one pass and
    sub-linear space
  • Applications:
  • Analyzing search engine queries, network
    traffic.

23
Finding Frequent Items
  • $a_i$ = i-th most frequent element
  • $m_i$ = frequency of $a_i$
  • If we had an oracle that gave us exact
    frequencies, can find most frequent items in one
    pass
  • Solution
  • A data structure called a Count Sketch that
    gives good estimates of frequencies of the high
    frequency elements at every point in the stream

24
Intuition
  • Consider a single counter X with a single hash
    function $h: a \to \{+1, -1\}$
  • On seeing each element $a_i$, update the counter
    with X += $h(a_i)$
  • $X = \sum_i m_i h(a_i)$
  • Claim: $E[X \cdot h(a_i)] = m_i$
  • Proof idea: cross-terms cancel because of
    pairwise independence
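Written out, the proof idea is the following short calculation, assuming the signs $h(a_j)$ are pairwise independent with mean zero:

```latex
\begin{align*}
E[X \cdot h(a_i)] &= \sum_j m_j \, E[h(a_j)\, h(a_i)]
  = m_i \, E[h(a_i)^2] + \sum_{j \neq i} m_j \, E[h(a_j)] \, E[h(a_i)] = m_i \\
\mathrm{Var}[X \cdot h(a_i)] &= E[X^2] - m_i^2 = \sum_{j \neq i} m_j^2
\end{align*}
```

The variance line shows why one counter is not enough: every other element contributes its squared frequency to the noise.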

25
Finding the max element
  • Problem with the single counter scheme: variance
    is too high
  • Replace with an array of t counters, using
    independent hash functions $h_1, \ldots, h_t$

(Figure: array of t counters; each hash function $h_1, \ldots, h_t$ maps an element a to $\{+1, -1\}$)

26
Analysis of array of counters data structure
  • Expectation still correct
  • Claim: variance of final estimate $< \sum_i m_i^2 / t$
  • Variance of each estimate $< \sum_i m_i^2$
  • Proof idea: cross-terms cancel
  • Set $t = O(\log n \cdot \sum_i m_i^2 / (\varepsilon m_1)^2)$ to get the
    answer with high probability
  • Proof idea: median of averages

27
Problem with array of counters data structure
  • Variance of estimator dominated by contribution
    of large elements
  • Estimates for important elements such as $a_k$ are
    corrupted by larger elements (variance much more
    than $m_k^2$)
  • To avoid collisions, replace each counter with a
    hash table of b counters to spread out the large
    elements
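Putting the pieces together gives a t-by-b table of signed counters. The sketch below is a simplified illustration, not the authors' implementation: SHA-256 stands in for the independent bucket and sign hash functions, the rows are combined with a plain median, and the class and method names are hypothetical.

```python
import hashlib
import statistics

class CountSketch:
    """t rows of b signed counters; each row has its own bucket and sign hash."""

    def __init__(self, t, b, seed=0):
        self.t, self.b, self.seed = t, b, seed
        self.table = [[0] * b for _ in range(t)]

    def _hashes(self, row, item):
        # Stand-in hashes: the first 8 digest bytes pick the bucket, one further bit the sign.
        digest = hashlib.sha256(f"{self.seed}:{row}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.b
        sign = 1 if digest[8] & 1 else -1
        return bucket, sign

    def add(self, item, count=1):
        for row in range(self.t):
            bucket, sign = self._hashes(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        # Each row gives sign * counter; the median damps rows hit by heavy colliders.
        row_estimates = []
        for row in range(self.t):
            bucket, sign = self._hashes(row, item)
            row_estimates.append(sign * self.table[row][bucket])
        return statistics.median(row_estimates)

sketch = CountSketch(t=5, b=16)
for _ in range(5):
    sketch.add("query-x")
only_item_estimate = sketch.estimate("query-x")  # exact when nothing else collides
```

Spreading elements over b buckets means a heavy element only disturbs the items that share its bucket in a given row, instead of every estimate in that row.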

28
In Conclusion
  • Simple powerful ideas at the heart of several
    algorithmic techniques for large data sets
  • Sketches of data tailored to applications
  • Many interesting research questions