Title: Finding Duplicates in a Data Stream
1. Finding Duplicates in a Data Stream
- Jaikumar Radhakrishnan, TIFR Mumbai
- Parikshit Gopalan, MSR-Silicon Valley
Work sponsored in part by the USCIS.
2. The Problem
- A stream of length $n$ over $[m]$.
- $n > m$, so some item repeats by the pigeonhole principle.
- Find a repeated item.
- Studied in databases, networking, etc.
- Pigeonhole is crucial!
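As a point of reference, the problem is trivial when memory is not a concern. The sketch below is a baseline only (the name `find_duplicate_naive` is illustrative and not from the talk): it remembers every item seen so far, so its space grows linearly with $m$ rather than polylogarithmically.

```python
def find_duplicate_naive(stream):
    """Return the first item that is seen twice, or None if all are distinct."""
    seen = set()
    for item in stream:
        if item in seen:
            return item          # second occurrence found
        seen.add(item)
    return None


if __name__ == "__main__":
    # 5 items over {1, ..., 4}: pigeonhole guarantees a repeat.
    print(find_duplicate_naive([3, 1, 4, 1, 2]))
```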
5. Three Easy Pieces
- 1. RAM model: $O(\log m)$ space, deterministic.
- 2. Streaming, deterministic:
- - [Tarui07] Hard!
- - $\Omega(m)$ space for 1 pass.
- - $m^{\Omega(1/k)}$ space for $k$ passes.
- 3. Streaming, randomized:
- - Easy when $n > 2m$ with 1 pass.
- - What happens when $n = m + 1$?
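The slide does not say why $n > 2m$ is easy, so the following is an assumption about the intended argument. Each of the at most $m$ distinct values has exactly one last occurrence, so when $n > 2m$ more than half of the positions are followed by a later repeat of the same item. Sampling one position and watching for its item to reappear then succeeds with probability above 1/2. The function name and the `rng` parameter are illustrative.

```python
import random


def find_duplicate_by_sampling(stream, n, rng=random):
    """One pass over a stream of known length n.

    Remembers only the item at one random position and reports it if the
    same item appears again later. Returns None when the sampled position
    was a last occurrence (the caller can retry with fresh randomness).
    """
    target = rng.randrange(n)        # position chosen before the pass starts
    candidate = None
    for pos, item in enumerate(stream):
        if pos == target:
            candidate = item
        elif pos > target and item == candidate:
            return candidate         # seen again after the sampled position
    return None
```

Anything this sketch returns is a genuine duplicate; randomness only affects whether it returns an answer at all.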
6. What can you do in one pass?
- Main Result: A one-pass randomized algorithm for finding duplicates in $O(\log^3 m)$ space.
- - Returns a duplicate with probability 0.9.
- - Can relax the condition $n > m$ (in many ways).
- Reduction to Dictatorship testing for Halfspaces.
- - Reduction via a new Isolation Lemma.
- Talk Outline:
- - Algorithm for FindDuplicate.
- - Some extensions.
7. Change of Notation
- Stream of increments:
- - $(i_1, c_1), (i_2, c_2), (i_3, c_3), \ldots, (i_n, c_n)$
- - where $i_j \in [m]$, $c_j \in \{-1, 1\}$.
- Defines a cumulative frequency vector:
- - $f = (f_1, \ldots, f_m)$.
- FindDuplicate restated:
- - $(i_1, 1), (i_2, 1), (i_3, 1), \ldots, (i_n, 1)$
- - Find $i \in [m]$ with $f_i \ge 2$.
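The increment notation can be made concrete with a small sketch. It materializes the whole vector $f$, which a streaming algorithm would not do, and is meant only to show what the stream defines. The name `frequency_vector` is illustrative.

```python
from collections import Counter


def frequency_vector(increments):
    """Accumulate a stream of (index, +1/-1) pairs into the vector f."""
    f = Counter()
    for i, c in increments:
        f[i] += c
    return f


if __name__ == "__main__":
    # FindDuplicate restated: every item i arrives as the increment (i, 1).
    items = [2, 5, 2, 3]
    f = frequency_vector((i, 1) for i in items)
    print([i for i in f if f[i] >= 2])   # indices with f_i >= 2
```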
8. Change of Problem
- FindPositive: Given $f$ as a stream of increments where $\sum_{i \in [m]} f_i > 0$, find $i$ with $f_i \ge 1$.
- FindDuplicate reduces to FindPositive:
- - Prepend $(1,-1), (2,-1), (3,-1), \ldots, (m,-1)$ to the input $(i_1,1), (i_2,1), (i_3,1), \ldots, (i_n,1)$.
- Comparing streams of unequal length:
- - Given $S_1$ and $S_2$ over $[m]$, where $|S_1| > |S_2|$, find $i \in S_1 \setminus S_2$.
Parameters for FindPositive: $\ell^+(f) = \sum_{f_i > 0} f_i$, $\ell^-(f) = \sum_{f_i < 0} |f_i|$. Assume $\ell^+(f) = m+1$, $\ell^-(f) = m$.
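The reduction can be sketched in a few lines. After prepending one decrement per element, $f_i$ equals the number of occurrences of $i$ minus one, so $f_i \ge 1$ exactly when $i$ is repeated, and $\sum_i f_i = n - m > 0$. The helper names below are illustrative, and `cumulative` stores the full vector only so the result can be inspected.

```python
def to_find_positive(items, m):
    """Turn a FindDuplicate stream over {1..m} into a FindPositive stream."""
    for i in range(1, m + 1):
        yield (i, -1)            # prepended decrements
    for i in items:
        yield (i, 1)             # original items as increments


def cumulative(increments):
    """Materialize the frequency vector (for inspection only)."""
    f = {}
    for i, c in increments:
        f[i] = f.get(i, 0) + c
    return f


if __name__ == "__main__":
    f = cumulative(to_find_positive([1, 3, 3, 2, 3], m=4))
    print(f, sum(f.values()))
```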
9. Finding Positives in Three Easy Steps
- 1. Isolate: Pick $S \subseteq [m]$ which is a dictatorship.
- - We say $i \in S$ is the dictator of $S$ if $f_i > \sum_{j \in S,\, j \ne i} |f_j|$.
- 2. Find: Find a candidate dictator.
- 3. Verify: Test if the suspect is indeed a dictator.
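The dictator condition is easy to check offline when $f$ is available in full. The sketch below does exactly that, as an illustration of the definition and not as a streaming procedure. It assumes the sum on the right-hand side is over absolute values; `is_dictator` and `find_dictator` are illustrative names.

```python
def is_dictator(f, S, i):
    """True if i is in S and f[i] exceeds the total |f[j]| of the rest of S."""
    if i not in S:
        return False
    rest = sum(abs(f.get(j, 0)) for j in S if j != i)
    return f.get(i, 0) > rest


def find_dictator(f, S):
    """Return the dictator of S if there is one, else None."""
    for i in S:
        if is_dictator(f, S, i):
            return i
    return None


if __name__ == "__main__":
    f = {1: 5, 2: -2, 3: 1, 4: -3}
    print(find_dictator(f, {1, 2, 3}))      # 5 > 2 + 1
    print(find_dictator(f, {1, 2, 3, 4}))   # 5 is not > 2 + 1 + 3
```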
10. Isolate the Easy Cases
Case I: Single positive $i$. Then $[m]$ is a dictatorship, since $f_i = m+1$ exceeds the total negative weight $m$.
11. Isolate the Easy Cases
Case II: Every $f_i \in \{-1, 1\}$. Sample a single random element; a singleton $\{i\}$ with $f_i = 1$ is a dictatorship.
12. Isolate the Easy Cases
Case III: $k$ positives of size $(m+1)/k$. Negatives adversarial, with absolute values summing to $m$.
- Pick $S$ by sampling with frequency $1/k$.
- Single positive of size $(m+1)/k$.
- Sum of negatives $\le m/k$.
- With prob. ¼, both events happen.
13. Isolation Lemma
- Pick $t \in \{0, \ldots, \log m\}$.
- Pick $S$ by sampling from $[m]$ w.p. $2^{-t}$.
- Then $S$ is a dictatorship w.p. $\Omega(1/\log m)$.
- - Oblivious to the vector $f$.
- - Pairwise independence suffices: $S$ is described in $O(\log m)$ bits.
- - Repeat $k = O(\log m)$ times for constant success probability.
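The sampling step of the lemma can be sketched as follows. Two simplifications are assumptions of this sketch and not part of the talk: $t$ is drawn uniformly, and elements are sampled fully independently, which stores $S$ explicitly, whereas pairwise independence would allow the $O(\log m)$-bit description. The names `sample_subset` and `isolate` are illustrative.

```python
import random


def sample_subset(m, t, rng):
    """Include each element of {1..m} independently with probability 2**-t."""
    p = 2.0 ** -t
    return {i for i in range(1, m + 1) if rng.random() < p}


def isolate(m, rng):
    """Pick a scale t in {0, ..., ceil(log2 m)} and a subset at that scale.

    The choice does not look at the frequency vector f at all.
    """
    levels = (m - 1).bit_length()        # equals ceil(log2 m) for m >= 1
    t = rng.randint(0, levels)
    return t, sample_subset(m, t, rng)


if __name__ == "__main__":
    rng = random.Random(2024)
    for _ in range(3):
        print(isolate(16, rng))
```

With $t = 0$ every element is kept, which is the right scale for the single-positive case; larger $t$ thins the set out for inputs with many positives.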
14. Isolation Lemma
- Dictatorship Game:
- - Adversary picks $f$, we pick $S$.
- - We win if $S$ is a dictatorship.
- Adversary's strategy:
- - Pick $t$ randomly, $k = 2^t$.
- - $k$ positives of size $(m+1)/k$.
- Our strategy:
- - Guess $t$, sample at freq. $2^{-t}$.
- Works for any $f$ w.p. $1/(\log m)$.
15. Isolation Lemma
- Proof Sketch:
- - Many numbers of some size: For any $f$, there are $k/\log m$ positives of size at least $m/k$ for some $k$.
- - $k$ is a power of 2.
- - Gives $1/(\log m)^2$.
- - Accounting for all sizes gives $1/(\log m)$.
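19. HalfSpaces
- Every $S \subseteq [m]$ defines a halfspace $f_S : \{-1,1\}^m \to \{-1,1\}$ given by
- - $f_S(x) = \mathrm{sign}\left(\sum_{i \in S} f_i x_i\right)$.
- Easy to compute for succinct $x$.
- If $S$ is the dictatorship of $i$, $f_S(x) = x_i$.
[Figure: the halfspace, with its two sides labeled 1 and -1.]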
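To see why a succinct $x$ makes the halfspace easy to compute, note that $\sum_{i \in S} f_i x_i$ can be accumulated one increment at a time. In the sketch below, `x` is a function from index to $\pm 1$ and `S` is a Python set, both standing in for the succinct descriptions the slide has in mind. The tie-breaking at zero is an arbitrary choice of this sketch.

```python
def halfspace_value(increments, S, x):
    """Evaluate f_S(x) = sign(sum over i in S of f_i * x(i)) in one pass.

    Only a single running total is kept; f itself is never stored.
    """
    total = 0
    for i, c in increments:
        if i in S:
            total += c * x(i)
    return 1 if total > 0 else -1    # a zero sum is mapped to -1 here


if __name__ == "__main__":
    # This stream defines f = {1: 3, 2: -1, 3: 1}, so 1 is the dictator of S.
    stream = [(1, 1), (2, -1), (1, 1), (3, 1), (1, 1)]
    S = {1, 2, 3}
    print(halfspace_value(stream, S, lambda i: -1 if i == 1 else 1))
```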
20. Finding the Dictator
[Figure: test vectors $x_1, x_2, \ldots, x_{\log m}$ laid out over the coordinates $1, \ldots, m/2, \ldots, m$.]
21. Finding the Dictator
- Compute $f_S(x)$ for all $S$ and $x$.
[Figure: the halfspaces $f_{S_1}, \ldots, f_{S_k}$.]
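One natural reading of the vectors $x_1, \ldots, x_{\log m}$ is that $x_j$ carries bit $j$ of every index. This is an assumption, since the talk's exact construction is not recoverable from the transcript. Under that reading, if $S$ is the dictatorship of $i$ then $f_S(x_j) = x_j(i)$, so the $\log m$ halfspace values spell out $i$ in binary. The sketch uses indices $0, \ldots, m-1$ and reads $f$ from a dictionary for brevity; in a stream each of the $\log m$ sums would be a running counter.

```python
def find_candidate(f, S, m):
    """Read off a candidate dictator of S from ceil(log2 m) halfspace values."""
    nbits = max(1, (m - 1).bit_length())
    candidate = 0
    for j in range(nbits):
        # x_j(i) is +1 if bit j of i is set, and -1 otherwise.
        total = sum(f.get(i, 0) * (1 if (i >> j) & 1 else -1) for i in S)
        if total > 0:                # f_S(x_j) = +1, so bit j is set
            candidate |= 1 << j
    return candidate


if __name__ == "__main__":
    f = {5: 7, 2: -3, 6: 2}          # 7 > 3 + 2, so 5 is the dictator
    print(find_candidate(f, {2, 5, 6}, m=8))
```

If $S$ is not a dictatorship the output is still some index, which is why a separate verification step is needed.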
24. Verify
- Given $S$ and $i \in S$, test if $i$ is a dictator.
- Dictatorship Test 1.0:
- - Compute $f_S(x)$ for a few random $x$'s.
- - Test if $f_S(x) = x_i$.
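A minimal sketch of Test 1.0, assuming fully random sign vectors and a frequency vector held in a dictionary (neither of which a small-space implementation would use). The names `halfspace` and `dictatorship_test_v1` and the `trials` parameter are illustrative.

```python
import random


def halfspace(f, S, x):
    """f_S(x) for an explicit sign assignment x (a dict from index to +1/-1)."""
    total = sum(f.get(j, 0) * x[j] for j in S)
    return 1 if total > 0 else -1


def dictatorship_test_v1(f, S, i, trials, rng):
    """Accept i if f_S(x) == x_i on every one of `trials` random x."""
    for _ in range(trials):
        x = {j: rng.choice((-1, 1)) for j in S}
        if halfspace(f, S, x) != x[i]:
            return False
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    f = {1: 5, 2: -2, 3: 1}
    print(dictatorship_test_v1(f, {1, 2, 3}, 1, trials=10, rng=rng))
```

A true dictator always passes. A candidate that is not a dictator can still pass by chance when only a few vectors are tried.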
25. Find + Verify
[Figure: Find takes $f_S$ and outputs a candidate $i$; Verify checks whether $f_S(x) = x_i$.]
- Pseudo-random $x$'s.
- Soundness?
26. Verify
- Dictatorship Test 1.0, revisited:
- - Not very sound.
- - Doesn't need to be.
27. Verify
- Given $S$ and $i \in S$, test if $i$ is a dictator.
- Dictatorship Test 2.0:
- - Test if $f_S(x) = x_i$.
- - Completeness: Accept if $i$ is the dictator.
- - Soundness: Reject if $f_i \le 0$.
- - If $f_i \le 0$, then $\mathrm{Cor}(f_S, x_i) \le 0$.
- - Use Nisan's PRG or 4-wise independence.
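The completeness and soundness claims can be checked exactly on tiny instances by enumerating every $x$. The sketch below does that instead of using a pseudorandom generator or 4-wise independence, so it is exponential in $|S|$ and purely illustrative. The name `correlation` and the convention that a zero sum has sign $-1$ are assumptions of the sketch.

```python
from itertools import product


def correlation(f, S, i):
    """Exact average of f_S(x) * x_i over all x in {-1,1}^S (small S only)."""
    members = sorted(S)
    total = 0
    for signs in product((-1, 1), repeat=len(members)):
        x = dict(zip(members, signs))
        s = sum(f.get(j, 0) * x[j] for j in members)
        total += (1 if s > 0 else -1) * x[i]
    return total / 2 ** len(members)


if __name__ == "__main__":
    print(correlation({1: 5, 2: -2, 3: 1}, {1, 2, 3}, 1))   # dictator
    print(correlation({1: 0, 2: 3, 3: -1}, {1, 2, 3}, 1))   # f_1 = 0
    print(correlation({1: -2, 2: 1, 3: 1}, {1, 2, 3}, 1))   # f_1 < 0
```

A dictator has correlation 1, while a coordinate with $f_i \le 0$ has correlation at most 0, so any acceptance threshold strictly between the two separates these cases.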
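28. [Figure: $k$ parallel copies of the pipeline; for each $j \in \{1, \ldots, k\}$, the candidate $i_j$ and the halfspace $f_{S_j}$ feed into a Test.]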
31. Finding Positives in Three Easy Steps
- 1. Isolate: Pick $S \subseteq [m]$ which is a dictatorship.
- - We say $i \in S$ is the dictator of $S$ if $f_i > \sum_{j \in S,\, j \ne i} |f_j|$.
- 2. Find: Find a candidate dictator.
- 3. Verify: Test if the suspect is at least a positive.
- What if $\ell_1^+(f) > \epsilon\, \ell_1^-(f)$ for some $\epsilon < 1$?
- - Can be solved in $\epsilon^{-1} \log^3 m$ space.
- - FindDuplicate for $n = m - k$ in space $k \log^3 m$.
- - Matching lower bound.
- - Generalized Isolation lemma.
32. FindDuplicate in $\ell_2$
- What if $\ell_2^+(f) > \ell_2^-(f)$?
- - Incomparable to $\ell_1^+(f) > \ell_1^-(f)$ in general.
- - Weaker guarantee for FindDuplicate.
- A 1-pass algorithm for $\ell_2$-FindDuplicate in $O(\log^3 m)$ space.
- - Reduction to Finding a Near-Dictator in a Halfspace.
- - Finding Near-Dictators.
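33. NearDictators
- We say $i \in S$ is the near-dictator of $S$ if $f_i > 10 \left(\sum_{j \in S,\, j \ne i} f_j^2\right)^{1/2}$.
- The sum $\sum_{j \in S,\, j \ne i} f_j x_j$ has magnitude concentrated around $\left(\sum_{j \in S,\, j \ne i} f_j^2\right)^{1/2}$ for most $x$.
- The halfspace $f_S(x)$ is close to being the dictatorship of $x_i$.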
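The near-dictator condition replaces the $\ell_1$ mass of the other coordinates by ten times their $\ell_2$ norm, and the two conditions really differ. The sketch below checks the definition on an instance chosen for this illustration: 121 coordinates of weight 1 have $\ell_2$ norm 11 but $\ell_1$ mass 121, so a coordinate of weight 111 is a near-dictator without being a dictator. The function name and the `factor` parameter are illustrative.

```python
import math


def is_near_dictator(f, S, i, factor=10):
    """True if f[i] > factor * (l2 norm of f restricted to S minus {i})."""
    if i not in S:
        return False
    rest = math.sqrt(sum(f.get(j, 0) ** 2 for j in S if j != i))
    return f.get(i, 0) > factor * rest


if __name__ == "__main__":
    f = {0: 111}
    f.update({j: 1 for j in range(1, 122)})       # 121 coordinates of weight 1
    S = set(f)
    l1_rest = sum(abs(v) for j, v in f.items() if j != 0)
    print(is_near_dictator(f, S, 0), f[0] > l1_rest)
```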
35. NearDictators
- For random $x$, $f_S(x)$ is close to the dictatorship of $x_i$.
- - Even for pairwise independent $x$, $\Pr[f_S(x) = x_i] > 0.9$.
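36. Finding NearDictators
- The Hadamard distribution:
- - Let $k = \log m$. Pick $y_1, \ldots, y_k \in \{-1, 1\}$.
- - Associate each $i \in [m]$ with $\alpha_i \subseteq [\log m]$.
- - Set $x_i = \prod_{j \in \alpha_i} y_j$.
- Halfspace:
- - $f(x) = \sum_i f_i x_i$
- - $\mathrm{sgn}(f) = x_i$ with probability 0.99
- Polynomial:
- - $p(y) = \sum_{\alpha_i} f(\alpha_i) \prod_{j \in \alpha_i} y_j$
- - $\mathrm{sgn}(p) = \prod_{j \in \alpha_i} y_j$ with probability 0.9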
37. Finding NearDictators
- Given the coefficients $f(\alpha_i)$ of $p(y)$.
- - $\mathrm{sgn}(p(y)) = \prod_{j \in \alpha_i} y_j$ with probability 0.9.
- - Find the set $\alpha_i$.
- Can evaluate $p(y)$ for any $y \in \{-1,1\}^k$.
- Hadamard (unique) decoding:
- - Given $p : \{-1,1\}^k \to \{-1,1\}$, find a parity that has agreement 0.9.
- - Can be solved with $O(k \log k)$ queries.
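One standard way to decode uniquely, sketched here as an assumption about what the slide intends, is to decide each coordinate separately. For a parity $\chi_\alpha$, the product $\chi_\alpha(y)\,\chi_\alpha(y')$, where $y'$ is $y$ with coordinate $j$ flipped, equals $-1$ exactly when $j \in \alpha$. If $p$ agrees with the parity on a 0.9 fraction of inputs, each such product is correct with probability at least 0.8, and a majority over $O(\log k)$ random $y$ per coordinate gives $O(k \log k)$ queries in total. The names `parity` and `decode_parity` are illustrative.

```python
import random


def parity(alpha, y):
    """Product of y[j] over j in alpha, for y a sequence of +1/-1 values."""
    out = 1
    for j in alpha:
        out *= y[j]
    return out


def decode_parity(p, k, trials, rng):
    """Recover the set alpha whose parity p (mostly) computes.

    Uses 2 * k * trials queries to p, with a majority vote per coordinate.
    """
    alpha = set()
    for j in range(k):
        votes = 0
        for _ in range(trials):
            y = [rng.choice((-1, 1)) for _ in range(k)]
            flipped = list(y)
            flipped[j] = -flipped[j]
            votes += p(tuple(y)) * p(tuple(flipped))
        if votes < 0:                # the majority of products were -1
            alpha.add(j)
    return alpha


if __name__ == "__main__":
    # Noise-free example: p is exactly the parity on coordinates {0, 2}.
    rng = random.Random(0)
    print(decode_parity(lambda y: parity({0, 2}, y), k=3, trials=5, rng=rng))
```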
38. Extensions
- $\ell_1^+(f) > \epsilon\, \ell_1^-(f)$ for some $\epsilon < 1$:
- - Can be solved in $\epsilon^{-1} \log^3 m$ space.
- - FindDuplicate for $n = m - k$ in space $k \log^3 m$.
- - Generalized Isolation lemma.
- $\ell_2^+(f) > \ell_2^-(f)$:
- - Can be solved in $\log^3 m$ space.
- - Weaker requirement for finding duplicates.
- - Finding NearDictators via Hadamard decoding.
- $\ell_p^+(f) > \ell_p^-(f)$ for $p > 2$:
- - Requires $m^{\Omega(1)}$ space.
39. Open Problems
- Conjecture: FindDuplicate can be solved using $O(\log m)$ space in one pass.
- PRGs for Halfspaces: A generator that fools halfspaces.
- - A hitting set generator [Rabani-Shpilka08].
- Sample duplicates according to frequency?
40. Thank You.