Title: The Average Case Complexity of Counting Distinct Elements
1. The Average Case Complexity of Counting Distinct Elements
- David Woodruff
- IBM Almaden
2. Problem Description
- Given a data stream of n insertions of records, count the number F0 of distinct records
- One pass over the data stream
- Algorithms must use a small amount of memory and have fast update time
- too expensive to store the set of distinct records
- implies algorithms must be randomized and must settle for an approximate output
- F ∈ [(1-ε)F0, (1+ε)F0] with constant probability
3. The Data Stream
- How is the data in the stream organized?
- Usually, assume the data is worst-case ordered
- In this case, Θ(1/ε²) bits are necessary and sufficient to estimate F0
- As the quality of approximation improves to, say, ε = 1%, the quadratic dependence is a major drawback
- Sometimes a random ordering can be assumed
- Suppose we are mining salaries, and they are ordered alphabetically by surname. If there is no correlation between salary and surname, then the stream of salaries is ordered randomly
- The backing sample architecture assumes data randomly ordered by design (Gibbons, Matias, Poosala)
- This model is referred to as the Random-Order Model (Guha, McGregor)
- Unfortunately, even in this model, we still need Ω(1/ε²) bits to estimate F0 (Chakrabarti, Cormode, McGregor). Intuitively this is because the data itself is still worst-case.
4. Random Data Model
- In an attempt to bypass the Ω(1/ε²) bound, we propose to study the case when the data comes from an underlying distribution.
- Problem 3 of Muthukrishnan's book: "Provide improved estimates for Lp sums, including distinct element estimation, if the input stream has statistical properties such as being Zipfian."
- There is a distribution defined by probabilities p_i, 1 ≤ i ≤ m, with Σ_i p_i = 1.
- The next item in the stream is chosen independently of previous items, and is i with probability p_i.
- We call this the Random-Data Model (a small sampling sketch follows below).
- The Random-Data Model is contained in the Random-Order Model.
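To make the model concrete, here is a minimal sketch (my own illustration, not from the talk) that draws a stream in the Random-Data Model; `uniform_over_d_subset` builds the uniform-over-a-d-subset special case studied on slide 6.

```python
import random

def random_data_stream(probs, n, rng=random.Random(0)):
    """Draw a stream of n i.i.d. samples from the distribution (p_1, ..., p_m).

    This is the Random-Data Model: each stream item is chosen independently
    of the previous items, and equals i with probability probs[i].
    """
    items = list(range(len(probs)))
    return rng.choices(items, weights=probs, k=n)

def uniform_over_d_subset(m, d, rng=random.Random(0)):
    """Special case studied later: p_i = 1/d on an unknown d-subset of [m]."""
    support = set(rng.sample(range(m), d))
    return [1.0 / d if i in support else 0.0 for i in range(m)]

# Example: m = 1000 possible items, uniform over an unknown d = 50 of them.
probs = uniform_over_d_subset(m=1000, d=50)
stream = random_data_stream(probs, n=5000)
print("true F0 of the stream:", len(set(stream)))
```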
5. Random Data Model
- This model for F0 was implicitly studied before:
- by Motwani and Vassilvitskii when the distribution is Zipfian; this distribution is useful for estimating WebGraph statistics
- sampling-based algorithms used in practice impose distributional assumptions, without which they have poor performance (Charikar et al)
- the Generalized Inverse Gaussian Poisson (GIGP) model studies sampling-based estimators when the distribution is uniform and Zipfian
- by Guha and McGregor for estimating the density function of an unknown distribution, which is useful in learning theory
6. Further Restriction
- We focus on the case when each probability p_i is either 0 or 1/d for an unknown value of d (so the distribution is uniform over a subset of [m])
- Captures the setting of sampling with replacement from a set of unknown cardinality
- For a certain range of d, we show that one can beat the space lower bound that holds for adversarial data and randomly-ordered data
- For another choice of d, we show the lower bound for adversarial and randomly-ordered data also applies in this setting
- The distribution is fairly robust, in the sense that other distributions with a few heavy items, and remaining items that are approximately uniform, have the same properties as above
7. Our Upper Bound
- 1-pass algorithm with an expected O(d(log 1/ε)/(nε²) + log m) bits of space, whenever 1/ε² ≤ d ≤ n. The per-item processing time is constant.
- Recall the distribution is uniform over a d-element subset of [m], and we see n samples from it, so this is a typical setting of parameters.
- Notice that for n even slightly larger than d, the algorithm does much better than the Ω(1/ε²) lower bound in other data stream models.
- One can show that for every combination of known algorithms with different space/time tradeoffs for F0 in the adversarial model, our algorithm is either better in space or in time.
8. Our Lower Bound
- Our main technical result is that if n and d are Θ(1/ε²), then even estimating F0 in the random data model requires Ω(1/ε²) space
- The lower bound subsumes the previous lower bounds, showing that even for a natural (random) choice of data, the problem is hard
- Our choice of distribution for showing the lower bound was used in subsequent work by Chakrabarti and Brody, where it turned out to be useful for establishing an Ω(1/ε²) lower bound for constant-pass algorithms for estimating F0
9. Techniques: Upper Bound
- Very simple observation
- Since d ≤ n, each item should have frequency about n/d in the stream.
- If n/d is at least Ω(1/ε²), we can just compute n and the frequency of the first item in the stream to get a (1±ε)-approximation to d
- Using a balls-and-bins occupancy bound of Kamath et al, a good estimate of d implies a good estimate of F0 (a toy sketch follows below)
- If n/d is less than 1/ε², we could instead store the first O(1/ε²) items (hashed appropriately for small space), treat them as a set S, and count the number of items in the remaining part of the stream that land in S
- Correct, but unnecessary if d is much less than n.
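A toy sketch (my own illustration) of the first case: when n/d = Ω(1/ε²), the frequency of the first stream item estimates n/d, the expected number of distinct items is d(1 - (1 - 1/d)^n), and the Kamath et al. occupancy bound says F0 concentrates around that value, so an estimate of d yields an estimate of F0.

```python
import random

def estimate_d_and_f0(stream):
    """Case n/d = Omega(1/eps^2): estimate d from the first item's frequency.

    In the random-data model with a uniform source over d items, the first
    item's frequency concentrates around n/d, so d_hat = n / freq(first).
    F0 is then estimated via the occupancy formula d * (1 - (1 - 1/d)^n).
    """
    n = len(stream)
    first = stream[0]
    freq_first = sum(1 for x in stream if x == first)
    d_hat = n / freq_first
    f0_hat = d_hat * (1.0 - (1.0 - 1.0 / d_hat) ** n)
    return d_hat, f0_hat

# Toy check: d = 100, n = 10000, so n/d = 100 ~ 1/eps^2 for eps = 0.1.
rng = random.Random(1)
d, n = 100, 10000
stream = [rng.randrange(d) for _ in range(n)]
d_hat, f0_hat = estimate_d_and_f0(stream)
print("d_hat:", d_hat, "f0_hat:", f0_hat, "true F0:", len(set(stream)))
```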
10. Techniques: Upper Bound
- Instead, record the first item x in the stream, and find the position i of the second occurrence of x in the stream.
- Position i should occur at roughly the d-th position in the stream, so i provides a constant factor approximation to d
- Since n = Ω(d), position i should be in the first half of the stream with large constant probability.
- Now store the first i/(nε²) distinct stream elements in the second half of the stream, treat them as a set S, and count the remaining items in the stream that occur in S.
- Good enough for a (1±ε)-approximation (toy sketch below)
- Space is O(log n + (i log m)/(nε²)) ≈ O(log n + (d log m)/(nε²))
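A toy sketch of this estimator (my own reconstruction; the slide does not spell out how the collision count is turned into an estimate, so the final formula d ≈ |S|·remaining/hits is one natural choice, using that each later sample lands in S with probability |S|/d):

```python
import math
import random

def estimate_d_by_collisions(stream, eps):
    """Toy version of the slide-10 estimator (assumes a uniform source, d <= n).

    1. Record the first item x and the position i of its second occurrence;
       i is a constant-factor approximation to d.
    2. In the second half of the stream, store the first ceil(i/(n*eps^2))
       distinct elements as the set S.
    3. Count how many of the remaining samples land in S; each does so with
       probability |S|/d, so d is estimated as |S| * remaining / hits.
    """
    n = len(stream)
    x = stream[0]
    i = next(pos for pos in range(1, n) if stream[pos] == x)  # second occurrence of x

    s_size = max(1, math.ceil(i / (n * eps ** 2)))
    S, hits, remaining = set(), 0, 0
    for item in stream[n // 2:]:
        if len(S) < s_size:
            S.add(item)              # still filling S with distinct elements
        else:
            remaining += 1
            if item in S:
                hits += 1
    return len(S) * remaining / hits if hits else None

rng = random.Random(2)
d, n, eps = 2000, 20000, 0.1
stream = [rng.randrange(d) for _ in range(n)]
print("true d:", d, "estimate:", estimate_d_by_collisions(stream, eps))
```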
11. Techniques: Upper Bound
- Space is O(log n + (d log m)/(nε²)), but we can do better.
- For each j in [m], sample j independently with probability 1/(iε²). In expectation the new distribution is uniform over d/(iε²) items. If j is sampled, say j survives.
- Go back to the previous step: store the first i/(nε²) distinct surviving stream elements in the second half of the stream, treat them as a set S, and count the remaining items in the stream that occur in S.
- Since only Θ(1/ε²) items survive, we can store S with only (i log 1/ε)/(nε²) bits by hashing item IDs down to a range of size, say, 1/ε^5
- We estimate the distribution's support size in the sub-sampled stream, which is roughly d/(iε²). We can get a (1±ε)-approximation to this quantity provided it is at least Ω(1/ε²), which it is with high probability. Then scale by iε² to estimate d, and thus F0 by the previous reasoning (toy sketch below).
- Constant update time
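A minimal sketch of the subsampling idea (again my own illustration): items survive independently with probability 1/(iε²) via a salted per-item coin, and the surviving support size, scaled back up by iε², estimates d. For clarity the surviving support is counted exactly here; the actual algorithm estimates it in small space as in the previous step.

```python
import random

def estimate_d_with_subsampling(stream, i, eps, salt=12345):
    """Toy sketch of the slide-11 refinement (illustration only).

    Each universe item j survives independently with probability
    p = 1/(i*eps^2); a salted per-item PRNG makes the decision consistent
    across the whole stream. Only about d*p ~ 1/eps^2 items survive, and the
    surviving support size divided by p estimates d.
    """
    p = min(1.0, 1.0 / (i * eps ** 2))

    def survives(j):
        # Deterministic pseudo-random coin for item j with success probability p.
        return random.Random(salt * 1_000_003 + j).random() < p

    surviving_support = {x for x in stream if survives(x)}
    return len(surviving_support) / p

rng = random.Random(4)
d, n, eps = 2000, 20000, 0.1
stream = [rng.randrange(d) for _ in range(n)]
i = 2000   # stand-in for the constant-factor approximation to d from the previous step
print("true d:", d, "estimate:", estimate_d_with_subsampling(stream, i, eps))
```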
12. Techniques: Lower Bound
- Consider the uniform distribution μ on [d] = {1, 2, ..., d}. Here d and n are Θ(1/ε²).
- Consider a stream of n samples from μ, where we choose n so that 1 - (1 - 1/d)^(n/2) ∈ [1/3, 2/3] (a small sampler follows below)
- Let X be the characteristic vector of the first n/2 stream samples on the universe [d]. So X_i = 1 if and only if item i occurs in these samples.
- Let Y be the characteristic vector of the second n/2 stream samples on the universe [d]. So Y_i = 1 if and only if item i occurs in these samples.
- Let wt(X), wt(Y) be the number of ones in the vectors X, Y.
- We consider a communication game. Alice is given X and wt(Y), while Bob is given Y and wt(X), and they want to know whether Δ(X,Y) > τ, where Δ(X,Y) is the Hamming distance between X and Y and τ = wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d
- Alice and Bob should solve this problem with large probability, where the probability is over the choice of X and Y and their coin tosses (which can be fixed).
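A small sampler for this hard instance (my own illustration): it picks d = 1/ε², chooses n so that 1 - (1 - 1/d)^(n/2) ≈ 1/2 ∈ [1/3, 2/3], and returns X, Y, their weights, Δ(X,Y), and the threshold τ.

```python
import math
import random

def hard_instance(eps, rng=random.Random(5)):
    """Sample the lower-bound instance: d = Theta(1/eps^2), two halves of n/2 samples each."""
    d = int(1 / eps ** 2)
    half = int(d * math.log(2))        # makes 1 - (1 - 1/d)^(n/2) close to 1/2
    first = set(rng.randrange(d) for _ in range(half))
    second = set(rng.randrange(d) for _ in range(half))
    X = [1 if t in first else 0 for t in range(d)]
    Y = [1 if t in second else 0 for t in range(d)]
    wtX, wtY = sum(X), sum(Y)
    delta = sum(a != b for a, b in zip(X, Y))
    tau = wtX + wtY - 2 * wtX * wtY / d
    return X, Y, wtX, wtY, delta, tau

X, Y, wtX, wtY, delta, tau = hard_instance(eps=0.05)
# wt(X), wt(Y) land near d/2, and delta fluctuates around tau by about sqrt(d).
print(wtX, wtY, delta, tau)
```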
13. Techniques: Lower Bound
- Strategy:
- (1) show the space complexity S of the streaming algorithm is at least the one-way communication complexity CC of the game
- (2) lower bound CC
- Theorem: S ≥ CC
- Proof: Alice runs the streaming algorithm on a random stream a_X generated from her characteristic vector X. Alice transmits the state to Bob (a schematic in code follows below).
- Bob continues the computation of the streaming algorithm on a random stream a_Y generated from his characteristic vector Y.
- At the end, the algorithm estimates F0 of a stream whose elements are in the support of X or Y (or both).
- Notice that the two halves of the stream are independent in the random data model, so the stream generated has the right distribution.
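A schematic of the reduction in code (purely illustrative; `ToyF0Sketch` is a stand-in whose state is an exact set rather than a small-space sketch, but the interface, namely process items, hand the state to Bob, read off an estimate, is what the argument uses):

```python
import random

class ToyF0Sketch:
    """Stand-in for a one-pass F0 algorithm; its 'state' is an exact set."""
    def __init__(self):
        self.state = set()
    def process(self, item):
        self.state.add(item)
    def estimate(self):
        return len(self.state)

def one_way_protocol(a_X, a_Y):
    # Alice runs the streaming algorithm on the stream a_X generated from X
    # and transmits its state; the message length is at most the space S.
    alice_sketch = ToyF0Sketch()
    for item in a_X:
        alice_sketch.process(item)
    message = alice_sketch.state

    # Bob resumes the computation from Alice's state on the stream a_Y
    # generated from Y, and reads off the F0 estimate for the whole stream.
    bob_sketch = ToyF0Sketch()
    bob_sketch.state = message
    for item in a_Y:
        bob_sketch.process(item)
    return bob_sketch.estimate()

rng = random.Random(6)
d = 400
a_X = [rng.randrange(d) for _ in range(300)]
a_Y = [rng.randrange(d) for _ in range(300)]
print(one_way_protocol(a_X, a_Y), len(set(a_X) | set(a_Y)))
```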
14. Techniques: Lower Bound
- We show that the estimate of F0 can solve the communication game
- Remember, the communication game is to decide whether Δ(X,Y) > τ = wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d
- Note that the quantity τ on the right is the expected value of Δ(X,Y) (a short calculation follows below)
- Some Lemmas
- (1) Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1 - o(1)
- (2) Consider the variable X', distributed as X but conditioned on wt(X) = k, and the variable Y', distributed as Y but conditioned on wt(Y) = r.
- X' and Y' are uniform over strings of Hamming weight k and r, respectively
- Choose k, r to be integers in [d/4, 3d/4]
- Then for any constant δ > 0, there is a constant c > 0 so that Pr[|Δ(X',Y') - E[Δ(X',Y')]| ≥ c·d^(1/2)] ≥ 1 - δ
- Follows from the standard deviation of the hypergeometric distribution
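For completeness, a short calculation (my own, using only the definitions above) of why τ is the expected Hamming distance: condition on wt(X) = k and wt(Y) = r; X and Y are independent, and each coordinate has Pr[X_t = 1] = k/d and Pr[Y_t = 1] = r/d, so by linearity of expectation

```latex
\begin{align*}
\mathbb{E}[\Delta(X,Y)]
  = \sum_{t=1}^{d} \Pr[X_t \neq Y_t]
  = \sum_{t=1}^{d} \left( \frac{k}{d}\Bigl(1-\frac{r}{d}\Bigr)
       + \frac{r}{d}\Bigl(1-\frac{k}{d}\Bigr) \right)
  = k + r - \frac{2kr}{d} = \tau .
\end{align*}
```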
15. Techniques: Lower Bound
- Δ(X,Y) = 2·F0(a_X ∘ a_Y) - wt(X) - wt(Y) (derivation below)
- Note that Bob has wt(X) and wt(Y), so he can compute τ = wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d
- If F is the output of the streaming algorithm, Bob simply computes 2F - wt(X) - wt(Y) and checks if it is greater than τ
- F ∈ [(1-ε)F0, (1+ε)F0] with large constant probability
- 2F - wt(X) - wt(Y) ∈ [2F0 - wt(X) - wt(Y) - 2εF0, 2F0 - wt(X) - wt(Y) + 2εF0]
  = [Δ(X,Y) - 2εF0, Δ(X,Y) + 2εF0]
  ⊆ [Δ(X,Y) - O(εd), Δ(X,Y) + O(εd)]
  = [Δ(X,Y) - O(d^(1/2)), Δ(X,Y) + O(d^(1/2))]
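The identity in the first bullet is just inclusion-exclusion on the supports: writing A = supp(X) and B = supp(Y), so that wt(X) = |A|, wt(Y) = |B|, and F0(a_X ∘ a_Y) = |A ∪ B|,

```latex
\begin{align*}
\Delta(X,Y) = |A| + |B| - 2\,|A \cap B|
            = 2\,|A \cup B| - |A| - |B|
            = 2\,F_0(a_X \circ a_Y) - \mathrm{wt}(X) - \mathrm{wt}(Y).
\end{align*}
```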
16. Techniques: Lower Bound
- By our lemmas, and using that E[Δ(X,Y)] = τ, with large constant probability either Δ(X,Y) > τ + c·d^(1/2) or Δ(X,Y) < τ - c·d^(1/2)
- By the previous slide, Bob's value 2F - wt(X) - wt(Y) is in the range [Δ(X,Y) - O(d^(1/2)), Δ(X,Y) + O(d^(1/2))]
- If Δ(X,Y) > τ + c·d^(1/2), then 2F - wt(X) - wt(Y) > τ + c·d^(1/2) - O(d^(1/2)), which exceeds τ for a suitable choice of constants
- If Δ(X,Y) < τ - c·d^(1/2), then 2F - wt(X) - wt(Y) < τ - c·d^(1/2) + O(d^(1/2)), which is below τ
- So Bob can use the output of the streaming algorithm to solve the communication problem, so S ≥ CC
17. Lower Bounding CC
- Communication game: Alice is given X and wt(Y), while Bob is given Y and wt(X), and they want to know if Δ(X,Y) > τ = wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d
- Here X, Y are distributed as independent characteristic vectors of streams of n/2 samples each, where 1 - (1 - 1/d)^(n/2) ∈ [1/3, 2/3]
- With large probability the weights are moderate: Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1 - o(1)
- By averaging, a correct protocol is also correct with large probability for fixed weights i and j in [d/4, 3d/4], so we can assume X is a random string of Hamming weight i, and Y a random string of Hamming weight j.
18. Lower Bounding CC
[Figure: a Boolean communication matrix whose rows are indexed by the X with wt(X) = i and whose columns are indexed by the Y with wt(Y) = j]
- Since i, j ∈ [d/4, 3d/4], one can show that the fraction of 1s in each row is in [1/2 - o(1), 1/2 + o(1)] (a small numerical check follows below). Recall that an entry is 1 if and only if Δ(X,Y) > i + j - 2ij/d.
- With large probability, the message M that Alice sends has the property that many different X cause Alice to send M. Say such an M is large.
- We show that for any large M, the fraction of 1s in most of the columns (among the rows that send M) is in, say, [1/10, 9/10]. Then Bob doesn't know what to output.
- Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2 - o(1), 1/2 + o(1)].
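A small Monte Carlo check of the balanced-rows claim (my own illustration; by symmetry every row with wt(X) = i has the same column distribution, so the printed fractions differ only by sampling noise):

```python
import random

rng = random.Random(7)

def random_weight_k(d, k):
    """Uniformly random 0/1 vector of length d with exactly k ones."""
    ones = set(rng.sample(range(d), k))
    return [1 if t in ones else 0 for t in range(d)]

def entry(x, y, d):
    """Communication matrix entry: 1 iff Delta(x, y) > i + j - 2ij/d."""
    i, j = sum(x), sum(y)
    delta = sum(a != b for a, b in zip(x, y))
    return 1 if delta > i + j - 2 * i * j / d else 0

d, i, j = 400, 190, 210
for _ in range(3):
    x = random_weight_k(d, i)                              # one row of the matrix
    cols = [random_weight_k(d, j) for _ in range(2000)]    # sampled columns
    frac = sum(entry(x, y, d) for y in cols) / len(cols)
    print(frac)   # close to 1/2; the small deficit is the o(1) term in the claim
```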
19. Lower Bounding CC
- Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2 - o(1), 1/2 + o(1)].
- But the variance could be huge; e.g., the matrix might have every column entirely 0 or entirely 1 (half of each), in which case Bob can easily output the answer.
20. Lower Bounding CC
- We can show this doesn't happen by the second-moment method.
- Let V be the fraction of 1s in the column indexed by Y (among the rows in S).
- Let S be the set of x which cause Alice to send a large message M.
- Consider a random Y. Let C_u = 1 if and only if Δ(u,Y) > τ. So V = (1/|S|) Σ_{u ∈ S} C_u
- Var[V] = (1/|S|²) Σ_{u,v ∈ S} (E[C_u C_v] - E[C_u]·E[C_v])
- Notice that E[C_u]·E[C_v] ∈ [1/4 - o(1), 1/4 + o(1)]
- while E[C_u C_v] = Pr[Δ(u,Y) > τ and Δ(v,Y) > τ] ≤ 1/2
- Since S is a large set, most pairs u, v ∈ S have Hamming distance in [d/5, 4d/5]
- A technical lemma shows that for such pairs, Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] is a constant strictly less than 1
- Hence, E[C_u C_v] is a constant strictly less than 1/2, and since this happens for most pairs u, v, we get that Var[V] is a constant strictly less than 1/4
21. Lower Bounding CC
- The fraction of 1s in a random column is just V = (1/|S|) Σ_{u ∈ S} C_u
- Let γ be a small positive constant. By Chebyshev's inequality,
  Pr[|V - 1/2| > 1/2 - γ] ≤ Pr[|V - E[V]| > 1/2 - γ - o(1)] ≤ Var[V]/(1/2 - γ - o(1))²
- But we showed Var[V] is a constant strictly less than 1/4, so this probability is a constant strictly less than 1 for small enough γ.
- Hence, for a random column Y, the fraction of 1s is in [γ, 1 - γ] with constant probability.
- It follows that with some constant probability Bob outputs the wrong answer.
- Hence, most messages of Alice must be small, so one can show that there must be 2^Ω(d) of them, so that the communication is Ω(d) = Ω(1/ε²).
22. Conclusions
- Introduced the random data model, and studied F0-estimation under distributions uniform over a subset of d items
- For a certain range of d, we show that one can beat the space lower bound that holds for adversarial data and randomly-ordered data
- For another choice of d, we show the lower bound for adversarial and randomly-ordered data also applies in this setting
- Are there other natural distributions that admit more space-efficient algorithms in this model? Are there other useful ways of bypassing the Ω(1/ε²) lower bound?