The Average Case Complexity of Counting Distinct Elements
1
The Average Case Complexity of Counting Distinct
Elements
  • David Woodruff
  • IBM Almaden

2
Problem Description
  • Given a data stream of n insertions of records,
    count the number F0 of distinct records
  • One pass over the data stream
  • Algorithms must use a small amount of memory and
    have fast update time
  • it is too expensive to store the set of distinct
    records
  • this implies algorithms must be randomized and
    must settle for an approximate output
  • F ∈ [(1-ε)F0, (1+ε)F0] with constant probability

3
The Data Stream
  • How is the data in the stream organized?
  • Usually, one assumes the data is worst-case
    ordered
  • In this case, Θ(1/ε²) bits are necessary and
    sufficient to estimate F0
  • As the quality of approximation improves, say to
    ε ≪ 1, the quadratic dependence is a major
    drawback
  • Sometimes a random ordering can be assumed
  • Suppose we are mining salaries, and they are
    ordered alphabetically by surname. If there is no
    correlation between salary and surname, then the
    stream of salaries is ordered randomly
  • The backing sample architecture assumes data
    randomly ordered by design (Gibbons, Matias,
    Poosala)
  • This model is referred to as the Random-Order
    Model (Guha, McGregor)
  • Unfortunately, even in this model, we still need
    Ω(1/ε²) bits to estimate F0
    (Chakrabarti, Cormode, McGregor)
  • Intuitively, this is because the data itself is
    still worst-case

4
Random Data Model
  • In an attempt to bypass the Ω(1/ε²) bound, we
    propose to study the case when the data comes
    from an underlying distribution
  • Problem 3 of Muthukrishnan's book: "Provide
    improved estimates for Lp sums, including
    distinct element estimation, if the input stream
    has statistical properties such as being Zipfian"
  • There is a distribution defined by probabilities
    pi, 1 ≤ i ≤ m, with Σi pi = 1
  • The next item in the stream is chosen
    independently of previous items, and is i with
    probability pi
  • We call this the Random-Data Model.
  • The Random-Data Model is contained in the
    Random-Order Model.

5
Random Data Model
  • This model for F0 was implicitly studied before:
  • by Motwani and Vassilvitskii, when the
    distribution is Zipfian. This distribution is
    useful for estimating WebGraph statistics
    • sampling-based algorithms used in practice
      impose distributional assumptions, without
      which they have poor performance (Charikar et
      al)
    • the Generalized Inverse Gaussian Poisson (GIGP)
      model studies sampling-based estimators when
      the distribution is uniform and Zipfian
  • by Guha and McGregor, for estimating the density
    function of an unknown distribution, which is
    useful in learning theory

6
Further Restriction
  • We focus on the case when each probability pi =
    1/d for an unknown value of d (so the
    distribution is uniform over a subset of [m])
  • Captures the setting of sampling with replacement
    from a set of unknown cardinality
  • For a certain range of d, we show that one can
    beat the space lower bound that holds for
    adversarial data and randomly-ordered data
  • For another choice of d, we show the lower bound
    for adversarial and randomly-ordered data also
    applies in this setting
  • The distribution is fairly robust, in the sense
    that other distributions with a few heavy items
    and remaining items that are approximately
    uniform have the same properties above

7
Our Upper Bound
  • 1-pass algorithm with an expected O((d log
    (1/ε))/(nε²) + log m) bits of space, whenever d ≥
    1/ε² and d ≤ n. The per-item processing time is
    constant.
  • Recall the distribution is uniform over a
    d-element subset of [m], and we see n samples
    from it, so this is a typical setting of
    parameters.
  • Notice that for n even slightly larger than d,
    the algorithm does much better than the Ω(1/ε²)
    lower bound in other data stream models.
  • One can show that, for every known algorithm for
    F0 in the adversarial model, with its particular
    space/time tradeoff, our algorithm is better
    either in space or in time.

8
Our Lower Bound
  • Our main technical result is that if n and d are
    Θ(1/ε²), then even estimating F0 in the random
    data model requires Ω(1/ε²) bits of space
  • This lower bound subsumes previous lower bounds,
    showing that even for a natural (random) choice
    of data, the problem is hard
  • Our choice of distribution for showing the lower
    bound was used in subsequent work by Chakrabarti
    and Brody, where it turned out to be useful for
    establishing an Ω(1/ε²) lower bound for
    constant-pass algorithms for estimating F0

9
Techniques Upper Bound
  • Very simple observation:
  • Since d ≤ n, each item should have frequency
    about n/d in the stream
  • If n/d is at least Ω(1/ε²), we can just compute n
    and the frequency of the first item in the stream
    to get a (1±ε)-approximation to d (see the sketch
    after this slide)
  • Using a balls-and-bins occupancy bound of Kamath
    et al, a good estimate of d implies a good
    estimate of F0
  • If n/d is less than 1/ε², we could instead store
    the first O(1/ε²) items (hashed appropriately for
    small space), treat them as a set S, and count
    the number of items in the remaining part of the
    stream that land in S
  • Correct, but unnecessary if d is much less than n
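
(Illustration, not from the slides.) A minimal Python sketch of the first case, assuming n/d ≥ 1/ε²: one pass tracks n and the frequency of the first item, estimates d as n/freq, and converts this to an F0 estimate via the expected occupancy d(1 - (1 - 1/d)^n). The function name and this exact form of the occupancy correction are our own choices.

    import random

    def estimate_f0_first_item(stream):
        # One pass: track the stream length n and the frequency of
        # the first item seen.
        n, first, freq = 0, None, 0
        for item in stream:
            if first is None:
                first = item
            if item == first:
                freq += 1
            n += 1
        d_hat = n / freq  # E[freq] = n/d, so n/freq concentrates near d
        # Occupancy: E[F0] = d(1 - (1 - 1/d)^n) for n uniform samples
        return d_hat * (1.0 - (1.0 - 1.0 / d_hat) ** n)

    # Example: d = 100 distinct values, n = 100000 samples
    stream = (random.randrange(100) for _ in range(100000))
    print(estimate_f0_first_item(stream))  # prints a value close to 100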

10
Techniques Upper Bound
  • Instead, record the first item x in the stream,
    and find the position i of the second occurrence
    of x in the stream
  • Position i should occur at roughly the d-th
    position in the stream, so i provides a constant
    factor approximation to d
  • Since n = Ω(d), position i should be in the first
    half of the stream with large constant
    probability
  • Now store the first i/(nε²) distinct stream
    elements in the second half of the stream, treat
    them as a set S, and count the remaining items in
    the stream that occur in S
  • Good enough for a (1±ε)-approximation (sketched
    below)
  • Space is O(log n + (i log m)/(nε²)) ≈ O(log n +
    (d log m)/(nε²))
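
(Illustration; this sketch and its helper names are ours.) With i ≈ d, a set S of about i/(nε²) distinct second-half items sees roughly |S|·(n/2)/d ≈ 1/(2ε²) collisions in expectation, so inverting the collision count gives d to within a (1±ε) factor:

    import random

    def estimate_f0_second_occurrence(stream, eps):
        stream = list(stream)  # the two phases are shown offline for clarity
        n = len(stream)
        x = stream[0]
        # i = position of the second occurrence of x; i is roughly d
        i = 1 + next(pos for pos in range(1, n) if stream[pos] == x)
        s_size = max(1, round(i / (n * eps ** 2)))
        S, collisions, trials = set(), 0, 0
        for item in stream[n // 2:]:           # second half of the stream
            if len(S) < s_size:
                S.add(item)                    # first s_size distinct items
            else:
                trials += 1
                if item in S:
                    collisions += 1            # lands in S w.p. |S|/d
        d_hat = trials * len(S) / max(collisions, 1)
        return d_hat * (1.0 - (1.0 - 1.0 / d_hat) ** n)  # occupancy step

Storing S as raw IDs costs |S| log m ≈ (d log m)/(nε²) bits, matching the space bound on the slide.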

11
Techniques Upper Bound
  • Space is O(log n + (d log m)/(nε²)), but we can
    do better
  • For each j in [m], sample j independently with
    probability 1/(iε²). In expectation the new
    distribution is now uniform over d/(iε²) items.
    If j is sampled, say j survives.
  • Go back to the previous step: store the first
    i/(nε²) distinct surviving stream elements in the
    second half of the stream, treat them as a set S,
    and count the remaining items in the stream that
    occur in S
  • Since only Θ(1/ε²) items survive, we can store S
    with only (i log 1/ε)/(nε²) bits by hashing item
    IDs down to a range of size, say, 1/ε^5
  • We estimate the distribution's support size in
    the sub-sampled stream, which is roughly d/(iε²).
    We can get a (1±ε)-approximation to this quantity
    provided it is at least Ω(1/ε²), which holds with
    high probability. Then scale by iε² to estimate
    d, and thus F0 by the previous reasoning.
  • Constant update time (see the sketch below)
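
(Illustration; a real implementation would use a seeded hash family rather than Python's built-in hash, which we use here only for brevity.) The subsampling step as a one-pass generator: each universe element survives with probability about 1/(iε²), decided consistently across all of its occurrences, and surviving IDs are compressed into a range of size about 1/ε^5:

    def subsample(stream, i, eps, seed=12345):
        p = min(1.0, 1.0 / (i * eps ** 2))         # survival probability
        small_range = max(1, int(1.0 / eps ** 5))  # compressed ID range
        for item in stream:
            # Hash-based coin flip: every occurrence of the same item
            # gets the same coin, so survival is a property of the item
            coin = (hash((seed, item)) % (2 ** 30)) / 2 ** 30
            if coin < p:
                # Compressed IDs need only O(log 1/eps) bits each
                yield hash((seed + 1, item)) % small_range

Running the previous estimator on the subsampled stream estimates the surviving support size, roughly d/(iε²); multiplying by iε² recovers d. Each item costs O(1) work, which is the constant update time claimed above.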

12
Techniques Lower Bound
  • Consider the uniform distribution μ on [d] = {1,
    2, ..., d}, where d and n are Θ(1/ε²)
  • Consider a stream of n samples from μ, where we
    choose n so that
    1-(1-1/d)^(n/2) ∈ [1/3, 2/3]
  • Let X be the characteristic vector of the first
    n/2 stream samples on the universe [d]. So Xi = 1
    if and only if item i occurs in these samples.
  • Let Y be the characteristic vector of the second
    n/2 stream samples on the universe [d]. So Yi = 1
    if and only if item i occurs in these samples.
  • Let wt(X), wt(Y) be the number of ones in the
    vectors X, Y.
  • We consider a communication game. Alice is given
    X and wt(Y), while Bob is given Y and wt(X), and
    they want to know if
    Δ(X,Y) ≥ wt(X) + wt(Y) - 2wt(X)wt(Y)/d
  • Alice and Bob should solve this problem with
    large probability, where the probability is over
    the choice of X and Y and their coin tosses
    (which can be fixed). A small simulation of this
    distribution follows.
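
(Illustration, our own simulation of the hard distribution.) Sampling instances of the game shows Δ(X,Y) hovering within O(√d) of the threshold, which is why a (1±ε)-approximation to F0, an additive Θ(εd) = Θ(√d) error after Bob's rescaling, is just barely enough to decide it:

    import random

    def game_instance(d, n):
        first = {random.randrange(d) for _ in range(n // 2)}
        second = {random.randrange(d) for _ in range(n // 2)}
        X = [int(j in first) for j in range(d)]
        Y = [int(j in second) for j in range(d)]
        wx, wy = sum(X), sum(Y)
        delta = sum(a != b for a, b in zip(X, Y))  # Hamming distance
        tau = wx + wy - 2 * wx * wy / d            # expected distance
        return delta - tau

    d = 10000      # think of d as Theta(1/eps^2)
    n = d          # then 1-(1-1/d)^(n/2) ~ 1-e^(-1/2), inside [1/3, 2/3]
    print([round(game_instance(d, n), 1) for _ in range(5)])
    # typical output: deviations on the order of sqrt(d) = 100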

13
Techniques Lower Bound
  • Strategy:
  • (1) show the space complexity S of the streaming
    algorithm is at least the one-way communication
    complexity CC of the game
  • (2) lower bound CC
  • Theorem: S ≥ CC
  • Proof: Alice runs the streaming algorithm on a
    random stream a_X generated by her characteristic
    vector X. Alice transmits the state to Bob.
  • Bob continues the computation of the streaming
    algorithm on a random stream a_Y generated by his
    characteristic vector Y.
  • At the end, the algorithm estimates F0 of a
    stream whose elements are in the support of X or
    Y (or both).
  • Notice that the two halves of the stream are
    independent in the random data model, so the
    stream generated has the right distribution (see
    the code sketch below).
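
(Illustration; the init/update/estimate interface is a generic streaming-algorithm API we assume for the sketch, not something from the talk.) The reduction in code form: the message is exactly the algorithm's memory state, so the message length is bounded by the space S:

    def one_way_protocol(alg, a_X, a_Y):
        state = alg.init()
        for item in a_X:                  # Alice runs on her half-stream
            state = alg.update(state, item)
        message = state                   # |message| <= space bound S
        for item in a_Y:                  # Bob resumes from Alice's state
            message = alg.update(message, item)
        return alg.estimate(message)      # F0 of a_X followed by a_Y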

14
Techniques Lower Bound
  • We show that the estimate of F0 can solve the
    communication game
  • Remember, the communication game is to decide
    whether
    Δ(X,Y) ≥ wt(X) + wt(Y) - 2wt(X)wt(Y)/d
  • Note the quantity on the right is the expected
    value of Δ(X,Y) (derivation below)
  • Some lemmas:
  • (1) Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1-o(1)
  • (2) Consider the variable X', distributed as X,
    but conditioned on wt(X') = k. The variable Y' is
    distributed as Y, but conditioned on wt(Y') = r.
  • X' and Y' are uniform over weight-k and weight-r
    bit strings, respectively
  • Choose k, r to be integers in [d/4, 3d/4]
  • Then for any constant δ > 0, there is a constant
    γ > 0, so that
    Pr[|Δ(X',Y') - E[Δ(X',Y')]| ≥ γd^(1/2)] ≥ 1-δ
  • Follows from the standard deviation of the
    hypergeometric distribution
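
(Our addition, for completeness.) Why the threshold is the expectation: X' and Y' are independent with coordinate marginals Pr[X'_i = 1] = k/d and Pr[Y'_i = 1] = r/d, so

    \begin{align*}
    \mathbb{E}[\Delta(X',Y')] = \sum_{i=1}^{d} \Pr[X'_i \neq Y'_i]
      = d\left(\frac{k}{d}\left(1-\frac{r}{d}\right)
             + \left(1-\frac{k}{d}\right)\frac{r}{d}\right)
      = k + r - \frac{2kr}{d}.
    \end{align*}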

15
Techniques Lower Bound
  • Δ(X,Y) = 2F0(a_X ∘ a_Y) - wt(X) - wt(Y), since
    F0(a_X ∘ a_Y) = |X ∨ Y| = (wt(X) + wt(Y) +
    Δ(X,Y))/2
  • Note that Bob has wt(X) and wt(Y), so he can
    compute the threshold
    τ = wt(X) + wt(Y) - 2wt(X)wt(Y)/d
  • If F is the output of the streaming algorithm,
    Bob simply computes 2F - wt(X) - wt(Y) and checks
    if it is greater than τ
  • F ∈ [(1-ε)F0, (1+ε)F0] with large constant
    probability
  • 2F - wt(X) - wt(Y) ∈
    [2F0 - wt(X) - wt(Y) - 2εF0,
     2F0 - wt(X) - wt(Y) + 2εF0]
    = [Δ(X,Y) - 2εF0, Δ(X,Y) + 2εF0]
    ⊆ [Δ(X,Y) - O(εd), Δ(X,Y) + O(εd)]
    = [Δ(X,Y) - O(d^(1/2)), Δ(X,Y) + O(d^(1/2))]

16
Techniques Lower Bound
  • By our lemmas, and using that E[Δ(X,Y)] = τ, with
    large constant probability either Δ(X,Y) > τ +
    γd^(1/2) or Δ(X,Y) < τ - γd^(1/2)
  • By the previous slide, Bob's value 2F - wt(X) -
    wt(Y) is in the range
    [Δ(X,Y) - O(d^(1/2)), Δ(X,Y) + O(d^(1/2))]
  • If Δ(X,Y) > τ + γd^(1/2), then 2F - wt(X) - wt(Y)
    > τ + γd^(1/2) - O(d^(1/2))
  • If Δ(X,Y) < τ - γd^(1/2), then 2F - wt(X) - wt(Y)
    < τ - γd^(1/2) + O(d^(1/2))
  • With parameters set so the O(d^(1/2)) error falls
    below γd^(1/2), Bob can use the output of the
    streaming algorithm to solve the communication
    problem, so S ≥ CC

17
Lower Bounding CC
  • Communication game: Alice is given X and wt(Y),
    while Bob is given Y and wt(X), and they want to
    know if
    Δ(X,Y) ≥ wt(X) + wt(Y) - 2wt(X)wt(Y)/d
  • Here X, Y have the distribution of being
    independent characteristic vectors of a stream of
    n/2 samples, where
    1-(1-1/d)^(n/2) ∈ [1/3, 2/3]

With large probability, d/4 ≤ wt(X), wt(Y) ≤ 3d/4.
By averaging, a correct protocol is also correct
with large probability for fixed weights i and j in
[d/4, 3d/4], so we can assume X is a random string
of Hamming weight i, and Y a random string of
Hamming weight j.
18
Lower Bounding CC
[Matrix figure: rows are the X with wt(X) = i,
columns are the Y with wt(Y) = j; the (X,Y) entry
is 1 if and only if Δ(X,Y) ≥ i + j - 2ij/d]
Since i, j ∈ [d/4, 3d/4], one can show that the
fraction of 1s in each row is in
[1/2-o(1), 1/2+o(1)].
With large probability, the message M that Alice
sends has the property that many different X cause
Alice to send M. Say such an M is large.
We show that for any large M, the fraction of 1s in
most of the columns is in, say, [1/10, 9/10]. Then
Bob doesn't know what to output.
Since each row is roughly balanced, the expected
fraction of 1s in each column is in [1/2-o(1),
1/2+o(1)].
19
Lower Bounding CC
Since each row is roughly balanced, the expected
fraction of 1s in each column is in [1/2-o(1),
1/2+o(1)]. But the variance could be huge: the
matrix could, for example, have each column
entirely 0s or entirely 1s, in which case Bob can
easily output the answer.
20
Lower Bounding CC
  • We can show this doesn't happen via the
    second-moment method
  • Let V_Y be the fraction of 1s in the column
    indexed by Y
  • Let S be the set of x which cause Alice to send a
    large message M
  • Consider a random Y. Let C_u = 1 if and only if
    Δ(u,Y) ≥ τ. So V = (1/|S|) Σ_{u in S} C_u
  • Var[V] = (1/|S|²) Σ_{u,v in S} (E[C_u C_v] -
    E[C_u]E[C_v])
  • Notice that E[C_u]E[C_v] ∈ [1/4-o(1), 1/4+o(1)],
  • while E[C_u C_v] = Pr[Δ(u,Y) ≥ τ and Δ(v,Y) ≥ τ]
    ≤ 1/2 + o(1)
  • Since S is a large set, most pairs u, v ∈ S have
    Hamming distance in [d/5, 4d/5]
  • A technical lemma shows that for such pairs,
    Pr[Δ(u,Y) ≥ τ | Δ(v,Y) ≥ τ] is a constant
    strictly less than 1
  • Hence, E[C_u C_v] is a constant strictly less
    than 1/2, and since this happens for most pairs
    u, v, we get that Var[V] is a constant strictly
    less than 1/4 (see the bound below)
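
(Our restatement of the chain of bounds just argued.) For the 1-o(1) fraction of typical pairs u, v, with c > 0 the constant from the technical lemma,

    \[
    \mathbb{E}[C_u C_v] - \mathbb{E}[C_u]\,\mathbb{E}[C_v]
      \le \left(\tfrac{1}{2} + o(1)\right)(1 - c)
        - \left(\tfrac{1}{4} - o(1)\right)
      = \tfrac{1}{4} - \tfrac{c}{2} + o(1),
    \]

and averaging over all pairs keeps Var[V] a constant strictly below 1/4.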

21
Lower Bounding CC
  • The fraction of 1s in a random column is just
    V = (1/|S|) Σ_{u in S} C_u
  • Let β be a small positive constant. By
    Chebyshev's inequality,
    Pr[|V - 1/2| > 1/2 - β] ≤ Pr[|V - E[V]| >
    1/2 - β - o(1)] ≤ Var[V]/(1/2 - β)² + o(1)
  • But we showed Var[V] is a constant strictly less
    than 1/4, so this probability is a constant
    strictly less than 1 for small enough β
  • Hence, for a random column Y, the fraction of 1s
    is in [β, 1-β] with constant probability
  • It follows that with some constant probability
    Bob outputs the wrong answer
  • Hence, most messages of Alice must be small, so
    one can show that there must be 2^Ω(d) of them,
    so the communication is Ω(d) = Ω(1/ε²)

22
Conclusions
  • Introduced random data model, and studied
    F0-estimation under distributions uniform over a
    subset of d items
  • For a certain range of d, we show that one can
    beat the space lower bound that holds for
    adversarial data and randomly-ordered data
  • For another choice of d, we show the lower bound
    for adversarial and randomly-ordered data also
    applies in this setting
  • Are there other natural distributions that admit
    more space-efficient algorithms in this model?
    Are there other useful ways of bypassing the
    Ω(1/ε²) lower bound?