Title: The Average Case Complexity of Counting Distinct Elements
1. The Average Case Complexity of Counting Distinct Elements
- David Woodruff
- IBM Almaden
2. Problem Description
- Given a data stream of n insertions of records, count the number F0 of distinct records
- One pass over the data stream
- Algorithms must use a small amount of memory and have fast update time
- too expensive to store the set of distinct records
- implies algorithms must be randomized and must settle for an approximate output
- F ∈ [(1-ε)F0, (1+ε)F0] with constant probability
3. The Data Stream
- How is the data in the stream organized?
- Usually, assume the data is worst-case ordered
- In this case, Θ(1/ε²) bits are necessary and sufficient to estimate F0
- As the quality of approximation improves to, say, ε = 1%, the quadratic dependence is a major drawback
- Sometimes a random ordering can be assumed
- Suppose we are mining salaries, and they are ordered alphabetically by surname. If there is no correlation between salary and surname, then the stream of salaries is ordered randomly
- The backing sample architecture assumes data randomly ordered by design (Gibbons, Matias, Poosala)
- This model is referred to as the Random-Order Model (Guha, McGregor)
- Unfortunately, even in this model, we still need Ω(1/ε²) bits to estimate F0 (Chakrabarti, Cormode, McGregor). Intuitively this is because the data itself is still worst-case.
4. Random Data Model
- In an attempt to bypass the Ω(1/ε²) bound, we propose to study the case when the data comes from an underlying distribution.
- Problem 3 of Muthukrishnan's book: "Provide improved estimates for Lp sums, including distinct element estimation, if the input stream has statistical properties such as being Zipfian."
- There is a distribution defined by probabilities p_i, 1 ≤ i ≤ m, with Σ_i p_i = 1.
- The next item in the stream is chosen independently of previous items, and is i with probability p_i.
- We call this the Random-Data Model (a small sampling sketch follows below).
- The Random-Data Model is contained in the Random-Order Model.
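To make the model concrete, here is a minimal sketch (my own illustration, not from the talk) that draws a stream in the Random-Data Model; `uniform_over_d_subset` builds the uniform-over-a-d-subset special case studied on slide 6.

```python
import random

def random_data_stream(probs, n, rng=random.Random(0)):
    """Draw a stream of n i.i.d. samples from the distribution (p_1, ..., p_m).

    This is the Random-Data Model: each stream item is chosen independently
    of the previous items, and equals i with probability probs[i].
    """
    items = list(range(len(probs)))
    return rng.choices(items, weights=probs, k=n)

def uniform_over_d_subset(m, d, rng=random.Random(0)):
    """Special case studied later: p_i = 1/d on an unknown d-subset of [m]."""
    support = set(rng.sample(range(m), d))
    return [1.0 / d if i in support else 0.0 for i in range(m)]

# Example: m = 1000 possible items, uniform over an unknown d = 50 of them.
probs = uniform_over_d_subset(m=1000, d=50)
stream = random_data_stream(probs, n=5000)
print("true F0 of the stream:", len(set(stream)))
```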
5. Random Data Model
- This model for F0 was implicitly studied before:
- by Motwani and Vassilvitskii when the distribution is Zipfian; this distribution is useful for estimating WebGraph statistics
- sampling-based algorithms used in practice impose distributional assumptions, without which they have poor performance (Charikar et al)
- the Generalized Inverse Gaussian Poisson (GIGP) model studies sampling-based estimators when the distribution is uniform and Zipfian
- by Guha and McGregor for estimating the density function of an unknown distribution, which is useful in learning theory
6. Further Restriction
- We focus on the case when each probability p_i is either 0 or 1/d for an unknown value of d (so the distribution is uniform over a subset of [m])
- Captures the setting of sampling with replacement from a set of unknown cardinality
- For a certain range of d, we show that one can beat the space lower bound that holds for adversarial data and randomly-ordered data
- For another choice of d, we show the lower bound for adversarial and randomly-ordered data also applies in this setting
- The distribution is fairly robust, in the sense that other distributions with a few heavy items, and remaining items that are approximately uniform, have the same properties as above
7. Our Upper Bound
- 1-pass algorithm with an expected O(d(log 1/ε)/(nε²) + log m) bits of space, whenever 1/ε² ≤ d ≤ n. The per-item processing time is constant.
- Recall the distribution is uniform over a d-element subset of [m], and we see n samples from it, so this is a typical setting of parameters.
- Notice that for n even slightly larger than d, the algorithm does much better than the Ω(1/ε²) lower bound in other data stream models.
- One can show that for every combination of known algorithms with different space/time tradeoffs for F0 in the adversarial model, our algorithm is either better in space or in time.
8. Our Lower Bound
- Our main technical result is that if n and d are Θ(1/ε²), then even estimating F0 in the random data model requires Ω(1/ε²) space
- The lower bound subsumes the previous lower bounds, showing that even for a natural (random) choice of data, the problem is hard
- Our choice of distribution for showing the lower bound was used in subsequent work by Chakrabarti and Brody, where it turned out to be useful for establishing an Ω(1/ε²) lower bound for constant-pass algorithms for estimating F0
9. Techniques: Upper Bound
- Very simple observation
- Since d ≤ n, each item should have frequency about n/d in the stream.
- If n/d is at least Ω(1/ε²), we can just compute n and the frequency of the first item in the stream to get a (1±ε)-approximation to d
- Using a balls-and-bins occupancy bound of Kamath et al, a good estimate of d implies a good estimate of F0 (a toy sketch follows below)
- If n/d is less than 1/ε², we could instead store the first O(1/ε²) items (hashed appropriately for small space), treat them as a set S, and count the number of items in the remaining part of the stream that land in S
- Correct, but unnecessary if d is much less than n.
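A toy sketch (my own illustration) of the first case: when n/d = Ω(1/ε²), the frequency of the first stream item estimates n/d, the expected number of distinct items is d(1 - (1 - 1/d)^n), and the Kamath et al. occupancy bound says F0 concentrates around that value, so an estimate of d yields an estimate of F0.

```python
import random

def estimate_d_and_f0(stream):
    """Case n/d = Omega(1/eps^2): estimate d from the first item's frequency.

    In the random-data model with a uniform source over d items, the first
    item's frequency concentrates around n/d, so d_hat = n / freq(first).
    F0 is then estimated via the occupancy formula d * (1 - (1 - 1/d)^n).
    """
    n = len(stream)
    first = stream[0]
    freq_first = sum(1 for x in stream if x == first)
    d_hat = n / freq_first
    f0_hat = d_hat * (1.0 - (1.0 - 1.0 / d_hat) ** n)
    return d_hat, f0_hat

# Toy check: d = 100, n = 10000, so n/d = 100 ~ 1/eps^2 for eps = 0.1.
rng = random.Random(1)
d, n = 100, 10000
stream = [rng.randrange(d) for _ in range(n)]
d_hat, f0_hat = estimate_d_and_f0(stream)
print("d_hat:", d_hat, "f0_hat:", f0_hat, "true F0:", len(set(stream)))
```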
10. Techniques: Upper Bound
- Instead, record the first item x in the stream, and find the position i of the second occurrence of x in the stream.
- Position i should occur at roughly the d-th position in the stream, so i provides a constant factor approximation to d
- Since n = Ω(d), position i should be in the first half of the stream with large constant probability.
- Now store the first i/(nε²) distinct stream elements in the second half of the stream, treat them as a set S, and count the remaining items in the stream that occur in S.
- Good enough for a (1±ε)-approximation (toy sketch below)
- Space is O(log n + (i log m)/(nε²)) ≈ O(log n + (d log m)/(nε²))
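A toy sketch of this estimator (my own reconstruction; the slide does not spell out how the collision count is turned into an estimate, so the final formula d ≈ |S|·remaining/hits is one natural choice, using that each later sample lands in S with probability |S|/d):

```python
import math
import random

def estimate_d_by_collisions(stream, eps):
    """Toy version of the slide-10 estimator (assumes a uniform source, d <= n).

    1. Record the first item x and the position i of its second occurrence;
       i is a constant-factor approximation to d.
    2. In the second half of the stream, store the first ceil(i/(n*eps^2))
       distinct elements as the set S.
    3. Count how many of the remaining samples land in S; each does so with
       probability |S|/d, so d is estimated as |S| * remaining / hits.
    """
    n = len(stream)
    x = stream[0]
    i = next(pos for pos in range(1, n) if stream[pos] == x)  # second occurrence of x

    s_size = max(1, math.ceil(i / (n * eps ** 2)))
    S, hits, remaining = set(), 0, 0
    for item in stream[n // 2:]:
        if len(S) < s_size:
            S.add(item)              # still filling S with distinct elements
        else:
            remaining += 1
            if item in S:
                hits += 1
    return len(S) * remaining / hits if hits else None

rng = random.Random(2)
d, n, eps = 2000, 20000, 0.1
stream = [rng.randrange(d) for _ in range(n)]
print("true d:", d, "estimate:", estimate_d_by_collisions(stream, eps))
```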
11. Techniques: Upper Bound
- Space is O(log n + (d log m)/(nε²)), but we can do better.
- For each j in [m], sample j independently with probability 1/(iε²). In expectation the new distribution is uniform over d/(iε²) items. If j is sampled, say j survives.
- Go back to the previous step: store the first i/(nε²) distinct surviving stream elements in the second half of the stream, treat them as a set S, and count the remaining items in the stream that occur in S.
- Since only Θ(1/ε²) items survive, we can store S with only (i log 1/ε)/(nε²) bits by hashing item IDs down to a range of size, say, 1/ε^5
- We estimate the distribution's support size in the sub-sampled stream, which is roughly d/(iε²). We can get a (1±ε)-approximation to this quantity provided it is at least Ω(1/ε²), which it is with high probability. Then scale by iε² to estimate d, and thus F0 by the previous reasoning (toy sketch below).
- Constant update time
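A minimal sketch of the subsampling idea (again my own illustration): items survive independently with probability 1/(iε²) via a salted per-item coin, and the surviving support size, scaled back up by iε², estimates d. For clarity the surviving support is counted exactly here; the actual algorithm estimates it in small space as in the previous step.

```python
import random

def estimate_d_with_subsampling(stream, i, eps, salt=12345):
    """Toy sketch of the slide-11 refinement (illustration only).

    Each universe item j survives independently with probability
    p = 1/(i*eps^2); a salted per-item PRNG makes the decision consistent
    across the whole stream. Only about d*p ~ 1/eps^2 items survive, and the
    surviving support size divided by p estimates d.
    """
    p = min(1.0, 1.0 / (i * eps ** 2))

    def survives(j):
        # Deterministic pseudo-random coin for item j with success probability p.
        return random.Random(salt * 1_000_003 + j).random() < p

    surviving_support = {x for x in stream if survives(x)}
    return len(surviving_support) / p

rng = random.Random(4)
d, n, eps = 2000, 20000, 0.1
stream = [rng.randrange(d) for _ in range(n)]
i = 2000   # stand-in for the constant-factor approximation to d from the previous step
print("true d:", d, "estimate:", estimate_d_with_subsampling(stream, i, eps))
```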
12. Techniques: Lower Bound
- Consider the uniform distribution μ on [d] = {1, 2, ..., d}. Here d and n are Θ(1/ε²).
- Consider a stream of n samples from μ, where we choose n so that 1 - (1 - 1/d)^(n/2) ∈ [1/3, 2/3] (a small sampler follows below)
- Let X be the characteristic vector of the first n/2 stream samples on the universe [d]. So X_i = 1 if and only if item i occurs in these samples.
- Let Y be the characteristic vector of the second n/2 stream samples on the universe [d]. So Y_i = 1 if and only if item i occurs in these samples.
- Let wt(X), wt(Y) be the number of ones in the vectors X, Y.
- We consider a communication game. Alice is given X and wt(Y), while Bob is given Y and wt(X), and they want to know whether Δ(X,Y) > τ, where Δ(X,Y) is the Hamming distance between X and Y and τ = wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d
- Alice and Bob should solve this problem with large probability, where the probability is over the choice of X and Y and their coin tosses (which can be fixed).
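A small sampler for this hard instance (my own illustration): it picks d = 1/ε², chooses n so that 1 - (1 - 1/d)^(n/2) ≈ 1/2 ∈ [1/3, 2/3], and returns X, Y, their weights, Δ(X,Y), and the threshold τ.

```python
import math
import random

def hard_instance(eps, rng=random.Random(5)):
    """Sample the lower-bound instance: d = Theta(1/eps^2), two halves of n/2 samples each."""
    d = int(1 / eps ** 2)
    half = int(d * math.log(2))        # makes 1 - (1 - 1/d)^(n/2) close to 1/2
    first = set(rng.randrange(d) for _ in range(half))
    second = set(rng.randrange(d) for _ in range(half))
    X = [1 if t in first else 0 for t in range(d)]
    Y = [1 if t in second else 0 for t in range(d)]
    wtX, wtY = sum(X), sum(Y)
    delta = sum(a != b for a, b in zip(X, Y))
    tau = wtX + wtY - 2 * wtX * wtY / d
    return X, Y, wtX, wtY, delta, tau

X, Y, wtX, wtY, delta, tau = hard_instance(eps=0.05)
# wt(X), wt(Y) land near d/2, and delta fluctuates around tau by about sqrt(d).
print(wtX, wtY, delta, tau)
```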
13. Techniques: Lower Bound
- Strategy:
- (1) show the space complexity S of the streaming algorithm is at least the one-way communication complexity CC of the game
- (2) lower bound CC
- Theorem: S ≥ CC
- Proof: Alice runs the streaming algorithm on a random stream a_X generated from her characteristic vector X. Alice transmits the state to Bob (a schematic in code follows below).
- Bob continues the computation of the streaming algorithm on a random stream a_Y generated from his characteristic vector Y.
- At the end, the algorithm estimates F0 of a stream whose elements are in the support of X or Y (or both).
- Notice that the two halves of the stream are independent in the random data model, so the stream generated has the right distribution.
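A schematic of the reduction in code (purely illustrative; `ToyF0Sketch` is a stand-in whose state is an exact set rather than a small-space sketch, but the interface, namely process items, hand the state to Bob, read off an estimate, is what the argument uses):

```python
import random

class ToyF0Sketch:
    """Stand-in for a one-pass F0 algorithm; its 'state' is an exact set."""
    def __init__(self):
        self.state = set()
    def process(self, item):
        self.state.add(item)
    def estimate(self):
        return len(self.state)

def one_way_protocol(a_X, a_Y):
    # Alice runs the streaming algorithm on the stream a_X generated from X
    # and transmits its state; the message length is at most the space S.
    alice_sketch = ToyF0Sketch()
    for item in a_X:
        alice_sketch.process(item)
    message = alice_sketch.state

    # Bob resumes the computation from Alice's state on the stream a_Y
    # generated from Y, and reads off the F0 estimate for the whole stream.
    bob_sketch = ToyF0Sketch()
    bob_sketch.state = message
    for item in a_Y:
        bob_sketch.process(item)
    return bob_sketch.estimate()

rng = random.Random(6)
d = 400
a_X = [rng.randrange(d) for _ in range(300)]
a_Y = [rng.randrange(d) for _ in range(300)]
print(one_way_protocol(a_X, a_Y), len(set(a_X) | set(a_Y)))
```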
14. Techniques: Lower Bound
- We show that the estimate of F0 can solve the communication game
- Remember, the communication game is to decide whether Δ(X,Y) > τ = wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d
- Note that the quantity τ on the right is the expected value of Δ(X,Y) (a short calculation follows below)
- Some Lemmas
- (1) Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1 - o(1)
- (2) Consider the variable X', distributed as X but conditioned on wt(X) = k, and the variable Y', distributed as Y but conditioned on wt(Y) = r.
- X' and Y' are uniform over strings of Hamming weight k and r, respectively
- Choose k, r to be integers in [d/4, 3d/4]
- Then for any constant δ > 0, there is a constant c > 0 so that Pr[|Δ(X',Y') - E[Δ(X',Y')]| ≥ c·d^(1/2)] ≥ 1 - δ
- Follows from the standard deviation of the hypergeometric distribution
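For completeness, a short calculation (my own, using only the definitions above) of why τ is the expected Hamming distance: condition on wt(X) = k and wt(Y) = r; X and Y are independent, and each coordinate has Pr[X_t = 1] = k/d and Pr[Y_t = 1] = r/d, so by linearity of expectation

```latex
\begin{align*}
\mathbb{E}[\Delta(X,Y)]
  = \sum_{t=1}^{d} \Pr[X_t \neq Y_t]
  = \sum_{t=1}^{d} \left( \frac{k}{d}\Bigl(1-\frac{r}{d}\Bigr)
       + \frac{r}{d}\Bigl(1-\frac{k}{d}\Bigr) \right)
  = k + r - \frac{2kr}{d} = \tau .
\end{align*}
```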
15. Techniques: Lower Bound
- Δ(X,Y) = 2·F0(a_X ∘ a_Y) - wt(X) - wt(Y) (derivation below)
- Note that Bob has wt(X) and wt(Y), so he can compute τ = wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d
- If F is the output of the streaming algorithm, Bob simply computes 2F - wt(X) - wt(Y) and checks if it is greater than τ
- F ∈ [(1-ε)F0, (1+ε)F0] with large constant probability
- 2F - wt(X) - wt(Y) ∈ [2F0 - wt(X) - wt(Y) - 2εF0, 2F0 - wt(X) - wt(Y) + 2εF0]
  = [Δ(X,Y) - 2εF0, Δ(X,Y) + 2εF0]
  ⊆ [Δ(X,Y) - O(εd), Δ(X,Y) + O(εd)]
  = [Δ(X,Y) - O(d^(1/2)), Δ(X,Y) + O(d^(1/2))]
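The identity in the first bullet is just inclusion-exclusion on the supports: writing A = supp(X) and B = supp(Y), so that wt(X) = |A|, wt(Y) = |B|, and F0(a_X ∘ a_Y) = |A ∪ B|,

```latex
\begin{align*}
\Delta(X,Y) = |A| + |B| - 2\,|A \cap B|
            = 2\,|A \cup B| - |A| - |B|
            = 2\,F_0(a_X \circ a_Y) - \mathrm{wt}(X) - \mathrm{wt}(Y).
\end{align*}
```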
16. Techniques: Lower Bound
- By our lemmas, and using that E[Δ(X,Y)] = τ, with large constant probability either Δ(X,Y) > τ + c·d^(1/2) or Δ(X,Y) < τ - c·d^(1/2)
- By the previous slide, Bob's value 2F - wt(X) - wt(Y) is in the range [Δ(X,Y) - O(d^(1/2)), Δ(X,Y) + O(d^(1/2))]
- If Δ(X,Y) > τ + c·d^(1/2), then 2F - wt(X) - wt(Y) > τ + c·d^(1/2) - O(d^(1/2)), which exceeds τ for a suitable choice of constants
- If Δ(X,Y) < τ - c·d^(1/2), then 2F - wt(X) - wt(Y) < τ - c·d^(1/2) + O(d^(1/2)), which is below τ
- So Bob can use the output of the streaming algorithm to solve the communication problem, so S ≥ CC
17. Lower Bounding CC
- Communication game: Alice is given X and wt(Y), while Bob is given Y and wt(X), and they want to know if Δ(X,Y) > τ = wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d
- Here X, Y are distributed as independent characteristic vectors of streams of n/2 samples each, where 1 - (1 - 1/d)^(n/2) ∈ [1/3, 2/3]
- With large probability the weights are moderate: Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1 - o(1)
- By averaging, a correct protocol is also correct with large probability for fixed weights i and j in [d/4, 3d/4], so we can assume X is a random string of Hamming weight i, and Y a random string of Hamming weight j.
18. Lower Bounding CC
[Figure: a Boolean communication matrix whose rows are indexed by the X with wt(X) = i and whose columns are indexed by the Y with wt(Y) = j]
- Since i, j ∈ [d/4, 3d/4], one can show that the fraction of 1s in each row is in [1/2 - o(1), 1/2 + o(1)] (a small numerical check follows below). Recall that an entry is 1 if and only if Δ(X,Y) > i + j - 2ij/d.
- With large probability, the message M that Alice sends has the property that many different X cause Alice to send M. Say such an M is large.
- We show that for any large M, the fraction of 1s in most of the columns (among the rows that send M) is in, say, [1/10, 9/10]. Then Bob doesn't know what to output.
- Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2 - o(1), 1/2 + o(1)].
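A small Monte Carlo check of the balanced-rows claim (my own illustration; by symmetry every row with wt(X) = i has the same column distribution, so the printed fractions differ only by sampling noise):

```python
import random

rng = random.Random(7)

def random_weight_k(d, k):
    """Uniformly random 0/1 vector of length d with exactly k ones."""
    ones = set(rng.sample(range(d), k))
    return [1 if t in ones else 0 for t in range(d)]

def entry(x, y, d):
    """Communication matrix entry: 1 iff Delta(x, y) > i + j - 2ij/d."""
    i, j = sum(x), sum(y)
    delta = sum(a != b for a, b in zip(x, y))
    return 1 if delta > i + j - 2 * i * j / d else 0

d, i, j = 400, 190, 210
for _ in range(3):
    x = random_weight_k(d, i)                              # one row of the matrix
    cols = [random_weight_k(d, j) for _ in range(2000)]    # sampled columns
    frac = sum(entry(x, y, d) for y in cols) / len(cols)
    print(frac)   # close to 1/2; the small deficit is the o(1) term in the claim
```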
19. Lower Bounding CC
- Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2 - o(1), 1/2 + o(1)].
- But the variance could be huge; e.g., the matrix might have every column entirely 0 or entirely 1 (half of each), in which case Bob can easily output the answer.
20. Lower Bounding CC
- We can show this doesn't happen by the second-moment method.
- Let V be the fraction of 1s in the column indexed by Y (among the rows in S).
- Let S be the set of x which cause Alice to send a large message M.
- Consider a random Y. Let C_u = 1 if and only if Δ(u,Y) > τ. So V = (1/|S|) Σ_{u ∈ S} C_u
- Var[V] = (1/|S|²) Σ_{u,v ∈ S} (E[C_u C_v] - E[C_u]·E[C_v])
- Notice that E[C_u]·E[C_v] ∈ [1/4 - o(1), 1/4 + o(1)]
- while E[C_u C_v] = Pr[Δ(u,Y) > τ and Δ(v,Y) > τ] ≤ 1/2
- Since S is a large set, most pairs u, v ∈ S have Hamming distance in [d/5, 4d/5]
- A technical lemma shows that for such pairs, Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] is a constant strictly less than 1
- Hence, E[C_u C_v] is a constant strictly less than 1/2, and since this happens for most pairs u, v, we get that Var[V] is a constant strictly less than 1/4
21. Lower Bounding CC
- The fraction of 1s in a random column is just V = (1/|S|) Σ_{u ∈ S} C_u
- Let γ be a small positive constant. By Chebyshev's inequality,
  Pr[|V - 1/2| > 1/2 - γ] ≤ Pr[|V - E[V]| > 1/2 - γ - o(1)] ≤ Var[V]/(1/2 - γ - o(1))²
- But we showed Var[V] is a constant strictly less than 1/4, so this probability is a constant strictly less than 1 for small enough γ.
- Hence, for a random column Y, the fraction of 1s is in [γ, 1 - γ] with constant probability.
- It follows that with some constant probability Bob outputs the wrong answer.
- Hence, most messages of Alice must be small, so one can show that there must be 2^Ω(d) of them, so that the communication is Ω(d) = Ω(1/ε²).
22. Conclusions
- Introduced the random data model, and studied F0-estimation under distributions uniform over a subset of d items
- For a certain range of d, we show that one can beat the space lower bound that holds for adversarial data and randomly-ordered data
- For another choice of d, we show the lower bound for adversarial and randomly-ordered data also applies in this setting
- Are there other natural distributions that admit more space-efficient algorithms in this model? Are there other useful ways of bypassing the Ω(1/ε²) lower bound?