Estimating Distinct Elements, Optimally

1
Estimating Distinct Elements, Optimally
  • David Woodruff
  • IBM
  • Based on papers with Piotr Indyk, Daniel Kane,
    and Jelani Nelson

2
Problem Description
  • Given a long string of at most n distinct
    characters, count the number F0 of distinct
    characters
  • See characters one at a time
  • One pass over the string
  • Algorithms must use small memory and fast update
    time
  • too expensive to store set of distinct characters
  • algorithms must be randomized and settle for an
    approximate solution: output F ∈ [(1-ε)F0,
    (1+ε)F0] with, say, good constant probability

3
Algorithm History
  • Flajolet and Martin introduced problem, FOCS 1983
  • O(log n) space for fixed ε in random oracle model
  • Alon, Matias and Szegedy
  • O(log n) space/update time for fixed ε with no
    oracle
  • Gibbons and Tirthapura
  • O(ε^-2 log n) space and O(ε^-2) update time
  • Bar-Yossef et al
  • O(ε^-2 log n) space and O(log 1/ε) update time
  • O(ε^-2 log log n + log n) space and O(ε^-2) update
    time, essentially
  • Similar space bound also obtained by Flajolet et
    al in the random oracle model
  • Kane, Nelson and W
  • O(ε^-2 + log n) space and O(1) update and
    reporting time
  • All time complexities are in unit-cost RAM model

4
Lower Bound History
  • Alon, Matias and Szegedy
  • Any algorithm requires Ω(log n) bits of space
  • Bar-Yossef
  • Any algorithm requires Ω(ε^-1) bits of space
  • Indyk and W
  • If ε > 1/n^(1/9), any algorithm needs Ω(ε^-2) bits
    of space
  • W
  • If ε > 1/n^(1/2), any algorithm needs Ω(ε^-2) bits
    of space
  • Jayram, Kumar and Sivakumar
  • Simpler proof of Ω(ε^-2) bound for any ε > 1/m^(1/2)
  • Brody and Chakrabarti
  • Show above lower bounds hold even for multiple
    passes over the string
  • Combining upper and lower bounds, the complexity
    of this problem is
  • Θ(ε^-2 + log n) space and Θ(1) update and
    reporting time

5
Outline for Remainder of Talk
  • Proofs of the Upper Bounds
  • Proofs of the Lower Bounds

6
Hash Functions for Throwing Balls
  • We consider a random mapping f of B balls into C
    containers and count the number of non-empty
    containers
  • The expected number of non-empty containers is
    C - C(1-1/C)^B
  • If instead of the mapping f, we use an
    O(log(C/ε)/log log(C/ε))-wise independent mapping
    g, then
  • the expected number of non-empty containers under
    g is the same as that under f, up to a factor of
    (1 ± ε)
  • Proof based on approximate inclusion-exclusion
  • express 1 - (1-1/C)^B in terms of a series of
    binomial coefficients
  • truncate the series at an appropriate place
  • use limited independence to handle the remaining
    terms
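
The expectation on this slide, C - C(1-1/C)^B, can be evaluated directly and inverted to recover B from a count of non-empty containers. The following is a minimal Python sketch for a fully random mapping; the function names are illustrative and do not come from the talk.

```python
import math

def expected_nonempty(B: int, C: int) -> float:
    """Expected number of non-empty containers when B balls are thrown
    uniformly and independently into C containers: C - C(1 - 1/C)^B."""
    return C - C * (1.0 - 1.0 / C) ** B

def invert_nonempty(T: float, C: int) -> float:
    """Solve T = C - C(1 - 1/C)^B for B (requires 0 <= T < C and C >= 2)."""
    return math.log(1.0 - T / C) / math.log(1.0 - 1.0 / C)

if __name__ == "__main__":
    C = 1024
    for B in (10, 100, 1000):
        T = expected_nonempty(B, C)
        # Inverting the expectation gives back (approximately) B.
        print(B, round(T, 2), round(invert_nonempty(T, C), 2))
```

The inverse has the same ln(1-T/K)/ln(1-1/K) shape that appears in the estimator on the Algorithm Outline slide.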

7
Fast Hash Functions
  • Use hash functions g that can be evaluated in
    O(1) time.
  • If g is O(log(C/ε)/log log(C/ε))-wise
    independent, the natural family of polynomial
    hash functions doesn't work: evaluating a
    degree-(k-1) polynomial takes Θ(k) time
  • We use theorems due to Pagh, Pagh, and Siegel
    that construct k-wise independent families for
    large k, and allow O(1) evaluation time
  • For example, Siegel shows
  • Let |U| = u and |V| = v with u = v^c for a
    constant c > 1, and suppose the machine word size
    is Ω(log v)
  • Let k = v^o(1) be arbitrary
  • For any constant d > 0 there is a randomized
    procedure that constructs a k-wise independent
    hash family H from U to V that succeeds with
    probability 1 - 1/v^d and requires v^d space. Each
    h ∈ H can be evaluated in O(1) time
  • Can show we have sufficiently random hash
    functions that can be evaluated in O(1) time and
    represented with O(ε^-2 + log n) bits of space
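
For contrast, here is a minimal Python sketch of the natural polynomial family that the slide rules out. A random polynomial of degree k-1 over a prime field is k-wise independent, but Horner's rule needs k multiply-adds per evaluation, so the update time grows with k. The prime P and the function name are assumptions made for illustration; this is not the Pagh-Pagh or Siegel construction.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime used as the field size (assumption)

def make_poly_hash(k: int, rng: random.Random):
    """Draw h(x) = c_0 x^(k-1) + ... + c_(k-1) mod P with random coefficients.
    This family is k-wise independent on inputs in [0, P)."""
    coeffs = [rng.randrange(P) for _ in range(k)]

    def h(x: int) -> int:
        acc = 0
        for c in coeffs:  # Horner's rule: one multiply-add per coefficient
            acc = (acc * x + c) % P
        return acc

    return h

if __name__ == "__main__":
    h = make_poly_hash(8, random.Random(0))
    print(h(12345), h(67890))
```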

8
Algorithm Outline
  • Set K = 1/ε^2
  • Instantiate a lg n × K bit matrix A, initializing
    entries of A to 0
  • Pick random hash functions f: [n] → [n] and g:
    [n] → [K]
  • Obtain a constant factor approximation R to F0
    somehow
  • Update(i): Set A[1, g(i)] = 1, A[2, g(i)] = 1, ...,
    A[lsb(f(i)), g(i)] = 1
  • Estimator: Let T = |{j in [K] : A[log(16R/K), j]
    = 1}|
  • Output (32R/K) ·
    ln(1-T/K)/ln(1-1/K)
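
The outline above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: salted BLAKE2b stands in for the hash functions f and g, the rough approximation R is passed in by the caller, and rows are 0-indexed so that row r keeps an item with probability 2^-r and the estimate is scaled by 2^r (the slide's 1-indexed constants differ by a factor of 2). The class and method names are hypothetical.

```python
import hashlib
import math

class F0Sketch:
    def __init__(self, eps: float, log_n: int = 32, seed: int = 0):
        self.K = math.ceil(1.0 / eps ** 2)  # number of columns
        if self.K < 2:
            raise ValueError("eps must be small enough that K >= 2")
        self.log_n = log_n
        self.seed = seed
        # A[r][j] = 1 iff some item hashed to column j has lsb(f(i)) >= r.
        self.A = [[0] * self.K for _ in range(log_n + 1)]

    def _hash(self, tag: bytes, i: int) -> int:
        # Stand-in for a random hash function (assumption, not from the talk).
        data = tag + self.seed.to_bytes(8, "little") + i.to_bytes(8, "little")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")

    def _lsb(self, v: int) -> int:
        # Index of the least significant set bit; log_n if v == 0.
        return self.log_n if v == 0 else (v & -v).bit_length() - 1

    def update(self, i: int) -> None:
        f = self._hash(b"f", i) % (1 << self.log_n)
        g = self._hash(b"g", i) % self.K
        for r in range(self._lsb(f) + 1):
            self.A[r][g] = 1

    def estimate(self, R: float) -> float:
        # Choose the row where roughly K/16 or fewer items are expected to survive.
        r = math.ceil(math.log2(max(16.0 * R / self.K, 1.0)))
        r = min(r, self.log_n)
        T = sum(self.A[r])  # non-empty columns in row r
        if T == self.K:
            return float("inf")  # row saturated; R was far too small
        return (2 ** r) * math.log(1.0 - T / self.K) / math.log(1.0 - 1.0 / self.K)

if __name__ == "__main__":
    sk = F0Sketch(eps=0.05)
    for i in range(5000):
        sk.update(i % 1000)       # 1000 distinct items, each seen 5 times
    print(sk.estimate(R=1000.0))  # R would come from the rough estimator
```

Repeated items set bits that are already set, so the sketch depends only on the set of distinct items, as the problem requires.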

9
Space Complexity
  • Naively, A is a lg n × K bit matrix, so O(ε^-2 log
    n) space
  • Better: for each column j, store the identity of
    the largest row i(j) for which
    A[i(j), j] = 1. Note if A[i, j] = 1, then A[i', j]
    = 1 for all i' < i
  • Takes O(ε^-2 log log n) space
  • Better yet: maintain a base level I. For each
    column j, store max(i(j) - I, 0)
  • Given an O(1)-approximation R to F0 at each point
    in the stream, set
    I = log R
  • Don't need to remember i(j) if i(j) < I, since j
    won't be used in estimator
  • For the j for which i(j) ≥ I, about 1/2 such j
    will have i(j) = I, about one fourth such j will
    have i(j) = I+1, etc.
  • Total number of bits to store offsets is now only
    O(K) = O(ε^-2) with good probability at all points
    in the stream
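
To make the saving concrete, the small Python sketch below converts per-column deepest rows i(j) into offsets max(i(j) - I, 0) and counts bits under a simple cost model of one bit plus the binary length of each offset. The cost model and the function names are assumptions for illustration only.

```python
def offsets(levels, base):
    """Offsets max(i(j) - I, 0) for per-column deepest rows `levels` and base level I."""
    return [max(lvl - base, 0) for lvl in levels]

def offset_bits(levels, base):
    """Total bits if each offset costs 1 + (length of its binary representation)."""
    return sum(1 + off.bit_length() for off in offsets(levels, base))

if __name__ == "__main__":
    levels = [3, 7, 8, 8, 9, 10, 2, 8]
    print(offsets(levels, 8))      # most offsets collapse to 0
    print(offset_bits(levels, 0))  # cost without a base level
    print(offset_bits(levels, 8))  # cost with base level I = 8
```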

10
The Constant Factor Approximation
  • Previous algorithms state that at each point in
    the stream, with probability 1-δ, the output is
    an O(1)-approximation to F0
  • The space of such algorithms is O(log n log 1/δ).
  • Union-bounding over a stream of length m gives
    O(log n log m) total space
  • We achieve O(log n) space, and guarantee the
    O(1)-approximation R of the algorithm is
    non-decreasing
  • Apply the previous scheme on a log n × (log n)/(log
    log n) matrix
  • For each column, maintain the identity of the
    deepest row with value 1
  • Output 2^i, where i is the largest row containing
    a constant fraction of 1s
  • We repeat the procedure O(1) times, and output
    the median of the estimates
  • Can show the output is correct with probability
    1 - O(1/log n)
  • Then we use the non-decreasing property to
    union-bound over O(log n) events
  • We only increase the base level every time R
    increases by a factor of 2
  • Note that the base level never decreases

11
Running Time
  • Blandford and Blelloch
  • Definition: a variable length array (VLA) is a
    data structure implementing an array C_1, ..., C_n
    supporting the following operations
  • Update(i, x) sets the value of C_i to x
  • Read(i) returns C_i
  • The C_i are allowed to have bit
    representations of varying lengths len(C_i).
  • Theorem: there is a VLA using O(n + Σ_i len(C_i))
    bits of space supporting worst case O(1) updates
    and reads, assuming the machine word size is at
    least log n
  • Store our offsets in a VLA, giving O(1) update
    time for a fixed base level
  • Occasionally we need to update the base level and
    decrement offsets by 1
  • Show base level only increases after Θ(ε^-2)
    updates, so can spread this work across these
    updates, so O(1) worst-case update time
  • Copy the data structure, use it for performing
    this additional work so it doesn't interfere with
    reporting the correct answer
  • When base level changes, switch to copy
  • For O(1) reporting time, maintain a count of
    non-zero containers in a level

12
Outline for Remainder of Talk
  • Proofs of the Upper Bounds
  • Proofs of the Lower Bounds

13
1-Round Communication Complexity
[Figure: Alice holds input x, Bob holds input y; the question is "What is f(x,y)?"]
  • Alice sends a single, randomized message M(x) to
    Bob
  • Bob outputs g(M(x), y) for a randomized function
    g
  • g(M(x), y) should equal f(x,y) with constant
    probability
  • Communication cost CC(f) is |M(x)|, maximized
    over x and random bits
  • Alice creates a string s(x), runs a randomized
    algorithm A on s(x), and
    transmits the state of A(s(x)) to Bob
  • Bob creates a string s(y), continues A on s(y),
    thus computing A(s(x)∘s(y))
  • If A(s(x)∘s(y)) can be used to solve f(x,y),
    then space(A) ≥ CC(f)
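
The reduction on this slide can be sketched in Python: Alice runs the streaming algorithm on s(x), serializes its state as her single message, and Bob resumes from that state on s(y). The class below is a deliberately naive stand-in for A (it stores the full set, so it is not small-space), and all names are hypothetical; the point is only that the message length equals the size of A's state.

```python
import json

class ExactDistinct:
    """Stand-in for the streaming algorithm A (assumption: exact, not small-space)."""

    def __init__(self, seen=None):
        self.seen = set(seen or [])

    def update(self, c) -> None:
        self.seen.add(c)

    def output(self) -> int:
        return len(self.seen)

    def state(self) -> bytes:
        # The memory contents of A, serialized for transmission.
        return json.dumps(sorted(self.seen)).encode()

    @classmethod
    def from_state(cls, blob: bytes):
        return cls(json.loads(blob.decode()))

def alice(s_x) -> bytes:
    """Run A on s(x) and return its state as the one-way message M(x)."""
    alg = ExactDistinct()
    for c in s_x:
        alg.update(c)
    return alg.state()

def bob(message: bytes, s_y) -> int:
    """Resume A from Alice's state on s(y); the result is A(s(x)∘s(y))."""
    alg = ExactDistinct.from_state(message)
    for c in s_y:
        alg.update(c)
    return alg.output()

if __name__ == "__main__":
    m = alice("abca")
    print(len(m) * 8, "bits sent;", bob(m, "cde"), "distinct characters")
```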

14
The Ω(log n) Bound
  • Consider equality function: f(x,y) = 1 if and
    only if x = y, for x, y ∈ {0,1}^(n/3)
  • Well known that CC(f) = Ω(log n) for (n/3)-bit
    strings x and y
  • Let C: {0,1}^(n/3) → {0,1}^n be an error-correcting
    code with all codewords of Hamming weight n/10
  • If x = y, then C(x) = C(y)
  • If x ≠ y, then Δ(C(x), C(y)) = Ω(n), where Δ is
    Hamming distance
  • Let s(x) be any string on alphabet size n with
    i-th character appearing in s(x) if and only if
    C(x)_i = 1. Similarly define s(y)
  • If x = y, then F0(s(x)∘s(y)) = n/10. Else,
    F0(s(x)∘s(y)) = n/10 + Ω(n)
  • A constant factor approximation to F0 solves
    f(x,y)

15
The Ω(ε^-2) Bound
  • Let r = 1/ε^2. Gap Hamming promise problem for x,
    y in {0,1}^r
  • f(x,y) = 1 if Δ(x,y) > 1/(2ε^2)
  • f(x,y) = 0 if Δ(x,y) < 1/(2ε^2) - 1/ε
  • Theorem: CC(f) = Ω(ε^-2)
  • Can prove this from the Indexing function
  • Alice has w ∈ {0,1}^r, Bob has i in {1, 2, ..., r},
    output g(w, i) = w_i
  • Well-known that CC(g) = Ω(r)
  • Proof that CC(f) = Ω(r):
  • Alice sends the seed of a pseudorandom
    generator to Bob, so the parties have common
    random strings z^1, ..., z^r ∈ {0,1}^r
  • Alice sets x = coordinate-wise majority of
    {z^j : w_j = 1}
  • Bob sets y = z^i
  • Since the z^j are random, if w_i = 1, then by
    properties of majority, with good probability
    Δ(x,y) < 1/(2ε^2) - 1/ε, otherwise likely that
    Δ(x,y) > 1/(2ε^2)
  • Repeat a few times to get concentration
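
The effect of the majority step can be seen in a small simulation. The Python sketch below is an illustration under assumptions: Python's `random.Random` plays the shared random strings, indices are 0-based, majority ties go to 0, and the function names are hypothetical. When w_i = 1, Bob's string takes part in Alice's majority, so the average Hamming distance drops below r/2 by an amount on the order of sqrt(r) = 1/ε.

```python
import random

def hamming(x, y):
    """Hamming distance between two equal-length bit lists."""
    return sum(a != b for a, b in zip(x, y))

def reduce_indexing(w, i, rng):
    """One run of the reduction: w is Alice's bit list of length r, i is Bob's
    index (0-based here), and rng supplies the shared random strings."""
    r = len(w)
    z = [[rng.getrandbits(1) for _ in range(r)] for _ in range(r)]
    chosen = [z[j] for j in range(r) if w[j] == 1]
    # Alice: coordinate-wise majority of the chosen strings (ties go to 0).
    x = [1 if 2 * sum(s[k] for s in chosen) > len(chosen) else 0
         for k in range(r)]
    y = z[i]  # Bob: his i-th shared string
    return x, y

def mean_distance(w, i, trials, seed):
    """Average Hamming distance between x and y over independent runs."""
    rng = random.Random(seed)
    total = sum(hamming(*reduce_indexing(w, i, rng)) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    r = 101
    w = [1] * 51 + [0] * 50
    print("w_i = 1:", mean_distance(w, 0, 100, seed=1))
    print("w_i = 0:", mean_distance(w, r - 1, 100, seed=1))
```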

16
The Ω(ε^-2) Bound Continued
  • Need to create strings s(x) and s(y) to have
    F0(s(x)∘s(y)) decide whether Δ(x,y) > 1/(2ε^2) or
    Δ(x,y) < 1/(2ε^2) - 1/ε
  • Let s(x) be a string on n characters where
    character i appears if and only if x_i = 1.
    Similarly define s(y)
  • F0(s(x)∘s(y)) = (wt(x) + wt(y) + Δ(x,y))/2
  • Alice sends wt(x) to Bob
  • A calculation shows a (1±ε)-approximation to
    F0(s(x)∘s(y)), together with wt(x) and wt(y),
    solves the Gap-Hamming problem
  • Total communication is space(A) + log 1/ε =
    Ω(ε^-2)
  • It follows that space(A) = Ω(ε^-2)
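
The identity F0(s(x)∘s(y)) = (wt(x) + wt(y) + Δ(x,y))/2 follows because F0 counts the positions where x or y is 1. The short Python check below uses illustrative function names and shows how Bob could solve for Δ(x,y) once he knows F0 and both weights.

```python
def s(x):
    """String of characters i (0-based here) with x_i = 1."""
    return [i for i, bit in enumerate(x) if bit == 1]

def f0(stream):
    """Exact number of distinct characters in a stream."""
    return len(set(stream))

def hamming_from_f0(f0_value, wt_x, wt_y):
    """Rearranged identity: Δ(x, y) = 2·F0 - wt(x) - wt(y)."""
    return 2 * f0_value - wt_x - wt_y

if __name__ == "__main__":
    x = [1, 0, 1, 1, 0]
    y = [0, 0, 1, 1, 1]
    distinct = f0(s(x) + s(y))
    print(distinct, hamming_from_f0(distinct, sum(x), sum(y)))
```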

17
Conclusion
Combining upper and lower bounds, the streaming
complexity of estimating F0 up to a (1±ε) factor
is Θ(ε^-2 + log n) bits of space and Θ(1) update
and reporting time
  • Upper bounds based on careful combination of
    efficient hashing, sampling and various data
    structures
  • Lower bounds come from 1-way communication
    complexity