Finding Duplicates in a Data Stream

1
Finding Duplicates in a Data Stream
  • Jaikumar Radhakrishnan, TIFR Mumbai
  • Parikshit Gopalan, MSR-Silicon Valley

Work sponsored in part by the USCIS.
2
The Problem
  • A stream of length $n$ over the alphabet $[m] = \{1, \ldots, m\}$.
  • $n > m$, so some item repeats by the pigeonhole principle.
  • Find a repeated item.
  • Studied in databases, networking etc.
  • Pigeonhole is crucial!

5
Three Easy Pieces
  • 1. RAM model: $O(\log m)$ space, deterministic.
  • 2. Streaming, deterministic:
  • [Tarui07]: Hard!
  • - $\Omega(m)$ space for 1 pass.
  • - $m^{\Omega(1/k)}$ space for $k$ passes.
  • 3. Streaming, randomized:
  • - Easy when $n > 2m$ with 1 pass.
  • - What happens when $n = m + 1$?
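
The slide does not say why the case $n > 2m$ is easy. One standard argument, given here as a sketch and not necessarily the one the speakers had in mind, is to sample a uniformly random position of the stream and report its item if that item shows up again later. At most $m$ positions are the last occurrence of their item, so a random position succeeds with probability at least $(n - m)/n > 1/2$. The function name and the reservoir-sampling detail below are illustrative assumptions.

```python
import random

def find_duplicate_by_sampling(stream, rng=random):
    """One pass, O(1) words of memory.

    Keeps the item at a uniformly random position (reservoir sampling with a
    reservoir of size one) and remembers whether it reappeared afterwards.
    Returns a duplicate, or None if the sampled position was a last occurrence.
    """
    candidate = None     # item at the currently sampled position
    seen_again = False   # has the candidate reappeared since it was sampled?
    for position, item in enumerate(stream, start=1):
        if rng.randrange(position) == 0:
            # Replace the sample with probability 1/position.
            candidate, seen_again = item, False
        elif item == candidate:
            seen_again = True
    return candidate if seen_again else None
```

The error is one-sided: any item returned really is a duplicate, so independent repetitions can be run in parallel to boost the success probability. The argument breaks down when $n = m + 1$, where only one position may be a non-last occurrence.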

6
What can you do in one pass?
  • Main Result: A one-pass randomized algorithm for
    finding duplicates in $O(\log^3 m)$ space.
  • Returns a duplicate with probability 0.9.
  • Can relax the condition $n > m$ (in many ways).
  • Reduction to Dictatorship testing for
    Halfspaces.
  • Reduction via a new Isolation Lemma.
  • Talk Outline:
  • Algorithm for FindDuplicate.
  • Some extensions.

7
Change of Notation
  • Stream of increments:
  • $(i_1, c_1), (i_2, c_2), (i_3, c_3), \ldots, (i_n, c_n)$
  • where $i_j \in [m]$, $c_j \in \{-1, 1\}$.
  • Defines a cumulative frequency vector
  • $f = (f_1, \ldots, f_m)$.
  • FindDuplicate restated:
  • $(i_1, 1), (i_2, 1), (i_3, 1), \ldots, (i_n, 1)$
  • Find $i \in [m]$ with $f_i \geq 2$.
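
To make the notation concrete, the following minimal sketch builds the frequency vector explicitly. It stores all of $f$, so it uses $O(m)$ space and is only a reference for what the streaming algorithm has to achieve without storing $f$. The function names are illustrative.

```python
def frequency_vector(increments, m):
    """Return f as a list indexed 1..m (index 0 is unused)."""
    f = [0] * (m + 1)
    for i, c in increments:
        assert 1 <= i <= m and c in (-1, 1)
        f[i] += c
    return f

def duplicates(items, m):
    """FindDuplicate restated: every item i arrives as the increment (i, +1)."""
    f = frequency_vector([(i, 1) for i in items], m)
    return [i for i in range(1, m + 1) if f[i] >= 2]
```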

8
Change of Problem
  • FindPositive: Given $f$ as a stream of increments
    where $\sum_{i \in [m]} f_i > 0$, find $i$ with $f_i \geq 1$.
  • FindDuplicate reduces to FindPositive:
  • Prepend $(1, -1), (2, -1), (3, -1), \ldots, (m, -1)$
  • to the input $(i_1, 1), (i_2, 1), (i_3, 1), \ldots, (i_n, 1)$.
  • Comparing streams of unequal length:
  • Given $S_1$ and $S_2$ over $[m]$, where $|S_1| > |S_2|$, find
    $i \in S_1 \setminus S_2$.

Parameters for FindPositive: $\ell^+(f) = \sum_{f_i > 0} f_i$,
$\ell^-(f) = \sum_{f_i < 0} |f_i|$. Assume $\ell^+(f) = m + 1$,
$\ell^-(f) = m$.
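
The reduction can be checked on a small instance. In this sketch (helper names are illustrative, and the reference helper stores $f$ in full) the prepended decrements turn every item's count into count minus one, so the positive coordinates are exactly the duplicated items and the coordinates sum to $n - m > 0$.

```python
def to_find_positive(items, m):
    """Yield (1,-1), ..., (m,-1) followed by (i,+1) for each stream item i."""
    for i in range(1, m + 1):
        yield (i, -1)
    for i in items:
        yield (i, 1)

def frequency_vector(increments, m):
    """Reference helper (O(m) space): f[i] is the net count of item i."""
    f = [0] * (m + 1)  # index 0 is unused
    for i, c in increments:
        f[i] += c
    return f

items, m = [2, 4, 2, 1, 4], 4  # n = 5 > m = 4
f = frequency_vector(to_find_positive(items, m), m)
positives = [i for i in range(1, m + 1) if f[i] >= 1]
l_plus = sum(v for v in f if v > 0)
l_minus = sum(-v for v in f if v < 0)
```

Here `positives` is the set of repeated items, and `l_plus - l_minus` equals $n - m$.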
9
Finding Positives in Three Easy Steps
  • 1. Isolate: Pick $S \subseteq [m]$ which is a dictatorship.
  • We say $i \in S$ is the dictator of $S$ if
    $f_i > \sum_{j \neq i,\, j \in S} |f_j|$.
  • 2. Find: Find a candidate Dictator.
  • 3. Verify: Test if the suspect is indeed a
    Dictator.

10
Isolate the Easy Cases
Case I: Single positive $i$. $[m]$ is a dictatorship.
11
Isolate the Easy Cases
Case II: Every $f_i \in \{-1, 1\}$. Sample a single
random element.
12
Isolate the Easy Cases
Case III: $k$ positives of size $(m+1)/k$. Negatives
adversarial, sum to $m$.
  • Pick $S$ by sampling with frequency $1/k$.
  • Single positive of size $(m+1)/k$.
  • Sum of negatives about $m/k$.
  • With prob. ¼, both events happen.

13
Isolation Lemma
  • Pick $t \in \{0, \ldots, \log m\}$.
  • Pick $S$ by sampling from $[m]$ w.p. $2^{-t}$.
  • Then $S$ is a dictatorship w.p. $\Omega(1/\log m)$.
  • Oblivious to the vector $f$.
  • Pairwise independence suffices: $S$ described in
    $O(\log m)$ bits.
  • Repeat $k = O(\log m)$ times for constant success
    probability.
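
The sampling step of the lemma can be sketched as follows. This is an illustration under simplifying assumptions: it uses fully independent coin flips (the slide notes that pairwise independence suffices), it materializes $S$ as a list, and it checks the dictatorship condition against a stored $f$, none of which the streaming algorithm can afford. The function names are hypothetical.

```python
import math
import random

def dictator_of(f, S):
    """Return i in S with f[i] > sum of |f[j]| over the other j in S, else None."""
    total = sum(abs(f[j]) for j in S)
    for i in S:
        if f[i] > total - abs(f[i]):
            return i
    return None

def sample_set(m, rng):
    """Pick t uniformly from {0, ..., ceil(log2 m)}, then keep each i w.p. 2^-t."""
    t = rng.randint(0, math.ceil(math.log2(m)))
    return [i for i in range(1, m + 1) if rng.random() < 2.0 ** -t]
```

Calling `dictator_of(f, sample_set(m, rng))` repeatedly and counting the non-`None` results gives an empirical estimate of the success probability for a particular $f$.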

14
Isolation Lemma
  • Dictatorship Game:
  • Adversary picks $f$, we pick $S$.
  • - We win if $S$ is a dictatorship.

Adversary's strategy:
  • - Pick $t$ randomly, $k = 2^t$.
  • - $k$ positives of size $(m+1)/k$.
  • Our strategy:
  • - Guess $t$, sample at freq. $2^{-t}$.
  • Works for any $f$ w.p. $1/\log m$.

15
Isolation Lemma
  • Proof Sketch:
  • Many numbers of some size: For any $f$, there
    are $k/\log m$ positives of size at least $m/k$ for
    some $k$.
  • $k$ is a power of 2.
  • - Gives $1/(\log m)^2$.
  • - Accounting for all sizes gives $1/\log m$.

18
HalfSpaces
  • Every $S \subseteq [m]$ defines a halfspace
    $f_S : \{-1, 1\}^m \to \{-1, 1\}$ given by
  • $f_S(x) = \mathrm{sign}\left(\sum_{i \in S} f_i x_i\right)$

19
HalfSpaces
Easy to compute for succinct $x$.
If $S$ is the dictatorship of $i$, then $f_S(x) = x_i$.
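
A minimal sketch of why $f_S(x)$ is easy to compute in one pass: the inner sum is linear in $f$, so a single counter updated per increment suffices, provided membership in $S$ and the entries of $x$ can be computed on the fly. The callables `in_S` and `x` below are illustrative stand-ins for such succinct descriptions, and the convention $\mathrm{sign}(0) = -1$ is an assumption.

```python
def halfspace_value(increments, in_S, x):
    """Evaluate sign(sum_{i in S} f_i x_i) with one counter.

    in_S(i) -> bool and x(i) -> +1 or -1 describe S and x succinctly.
    """
    acc = 0
    for i, c in increments:
        if in_S(i):
            acc += c * x(i)
    return 1 if acc > 0 else -1  # assumed convention: sign(0) = -1
```

For example, if the stream yields $f = (3, -1, -1)$ and $S = \{1, 2, 3\}$, then item 1 is the dictator and the value equals $x_1$ for every $x$, because the other two terms contribute at most 2 in absolute value.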
20
Finding the Dictator
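  • Compute $f_S(x)$ for $x = x^1, x^2, \ldots, x^{\log m}$.

[Figure: the vectors $x^1, x^2, \ldots, x^{\log m}$ drawn over the coordinates $1, \ldots, m/2, \ldots, m$.]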
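
The figure did not survive extraction, so the exact choice of test vectors is not recoverable. One natural reading, offered here purely as an assumption, is that the $j$-th vector is $+1$ on the indices whose $j$-th bit is 1 and $-1$ elsewhere. If $S$ is a dictatorship with dictator $i$, then each $f_S(x^j) = x^j_i$ reveals one bit of $i$. The sketch below keeps one counter per bit.

```python
def find_candidate(increments, in_S, m):
    """Read off a candidate dictator bit by bit (assumed bit-vector scheme).

    Counter j accumulates sum_{i in S} f_i * x^j_i, where x^j_i = +1 if
    bit j of (i - 1) is set and -1 otherwise.
    """
    bits = max(1, (m - 1).bit_length())
    acc = [0] * bits
    for i, c in increments:
        if in_S(i):
            for j in range(bits):
                acc[j] += c if ((i - 1) >> j) & 1 else -c
    index = sum(1 << j for j in range(bits) if acc[j] > 0)
    return index + 1  # only meaningful if S really is a dictatorship
```

If $S$ is not a dictatorship the returned index is arbitrary, which is why a separate verification step is needed.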
21
Finding the Dictator
Compute $f_S(x)$ for all $S$ and $x$.
[Figure: the halfspaces $f_{S_1}, \ldots, f_{S_k}$.]
24
Verify
  • Given $S$ and $i \in S$, test if $i$ is a dictator.
  • Dictatorship Test 1.0:
  • - Compute $f_S(x)$ for a few random $x$'s.
  • Test if $f_S(x) = x_i$.

25
Find and Verify
[Diagram: Find takes $f_S$ and outputs a candidate $i$; Verify checks whether $f_S(x) = x_i$.]
  • Pseudo-random $x$'s.
  • Soundness?
26
Verify
  • Dictatorship Test 1.0:
  • - Not very sound.
  • - Doesn't need to be.

27
Verify
  • Given $S$ and $i \in S$, test if $i$ is a dictator.
  • Dictatorship Test 2.0:
  • Test if $f_S(x) = x_i$.
  • Completeness: Accept if $i$ is the dictator.
  • Soundness: Reject if $f_i \leq 0$.

- If $f_i \leq 0$, then $\mathrm{Cor}(f_S, x_i) \leq 0$.
- Use Nisan's generator or 4-wise independence.
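
The completeness and soundness claims can be checked exactly on tiny examples by enumerating every $x$. This sketch uses truly uniform $x$ and brute force, so it only illustrates the statement; it is not the pseudo-random test from the talk. The list `f` holds the coefficients restricted to $S$, indexed from 0, and $\mathrm{sign}(0) = -1$ is an assumed convention.

```python
from itertools import product

def agreement(f, i):
    """Exact Pr_x[ sign(sum_j f[j] * x_j) == x_i ] over uniform x in {-1,1}^len(f)."""
    n = len(f)
    hits = 0
    for x in product((-1, 1), repeat=n):
        s = sum(fj * xj for fj, xj in zip(f, x))
        hits += (1 if s > 0 else -1) == x[i]
    return hits / 2 ** n
```

With a dictator the agreement is 1. When $f_i = 0$ the halfspace ignores $x_i$ and the agreement is exactly 1/2, and when $f_i < 0$ it drops to at most 1/2, so repeating the test a few times separates the two cases.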
28
[Diagram: for each $j = 1, \ldots, k$, a candidate $i_j$ is found from $f_{S_j}$ and passed to Test.]
31
Finding Positives in Three Easy Steps
  • 1. Isolate: Pick $S \subseteq [m]$ which is a dictatorship.
  • We say $i \in S$ is the dictator of $S$ if
    $f_i > \sum_{j \neq i,\, j \in S} |f_j|$.
  • 2. Find: Find a candidate Dictator.
  • 3. Verify: Test if the suspect is at least a
    Positive.
  • What if $\ell_1^+(f) > \delta\, \ell_1^-(f)$ for some $\delta < 1$?
  • Can be solved in $\delta^{-1} \log^3 m$ space.
  • FindDuplicate for $n = m - k$ in space $k \log^3 m$.
  • Matching lower bound.
  • Generalized Isolation lemma.
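
The step from the relaxed condition to the bound for short streams is not spelled out on the slide. A plausible derivation, assuming the stream of length $n = m - k$ is promised to contain at least one duplicate and that the prepend reduction is applied unchanged, runs as follows.

```latex
\begin{align*}
\sum_{i \in [m]} f_i &= n - m = -k
  \quad\Longrightarrow\quad \ell_1^-(f) = \ell_1^+(f) + k, \\
\ell_1^+(f) &\geq 1 \quad \text{(some item is duplicated)}, \\
\frac{\ell_1^+(f)}{\ell_1^-(f)}
  &= \frac{\ell_1^+(f)}{\ell_1^+(f) + k} \;\geq\; \frac{1}{k + 1}, \\
\text{so any } \delta < \frac{1}{k+1} \text{ works, and }
  \delta^{-1} \log^3 m &= O(k \log^3 m).
\end{align*}
```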

32
FindDuplicate in $\ell_2$
  • What if $\ell_2^+(f) > \ell_2^-(f)$?
  • Incomparable to $\ell_1^+(f) > \ell_1^-(f)$ in general.
  • Weaker guarantee for FindDuplicate.
  • A 1-pass algorithm for $\ell_2$-FindDuplicate in
    $O(\log^3 m)$ space.
  • Reduction to Finding a Near-Dictator in a
    Halfspace.
  • Finding Near-Dictators.

33
NearDictators
  • We say $i \in S$ is the near-dictator of $S$ if
    $f_i > 10 \left(\sum_{j \neq i,\, j \in S} f_j^2\right)^{1/2}$.
  • The sum $\sum_{j \neq i,\, j \in S} f_j x_j$ has magnitude around
    $\left(\sum_{j \neq i,\, j \in S} f_j^2\right)^{1/2}$ for most $x$.
  • The halfspace $f_S(x)$ is close to being the
    dictatorship $x_i$.

35
NearDictators
  • For random $x$, $f_S(x)$ is close to the dictatorship
    $x_i$.
  • Even for pairwise independent $x$,
  • $\Pr[f_S(x) = x_i] > 0.9$
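
The bound for pairwise independent $x$ follows from Chebyshev's inequality. The short derivation below is a sketch that assumes each $x_j$ is uniform on $\{-1, 1\}$, the $x_j$ are pairwise independent, and $\sigma > 0$ (if $\sigma = 0$ the claim is immediate).

```latex
\begin{align*}
R &= \sum_{j \neq i,\, j \in S} f_j x_j, \qquad
\sigma^2 = \sum_{j \neq i,\, j \in S} f_j^2, \\
\mathbb{E}[R] &= 0, \qquad
\mathbb{E}[R^2] = \sum_{j \neq i} f_j^2\, \mathbb{E}[x_j^2]
  + \sum_{j \neq j'} f_j f_{j'}\, \mathbb{E}[x_j x_{j'}] = \sigma^2, \\
\Pr\bigl[\,|R| \geq 10\sigma\,\bigr]
  &\leq \frac{\mathbb{E}[R^2]}{100\,\sigma^2} = \frac{1}{100}, \\
|R| < 10\sigma < f_i
  &\;\Longrightarrow\; \mathrm{sign}(f_i x_i + R) = x_i, \\
\Pr\bigl[f_S(x) = x_i\bigr] &\geq 0.99 > 0.9 .
\end{align*}
```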

36
Finding NearDictators
  • The Hadamard distribution.
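  • Let $k = \log m$. Pick $y_1, \ldots, y_k \in \{-1, 1\}$.
  • Associate each $i \in [m]$ with $\alpha_i \subseteq [\log m]$.
  • Set $x_i = \prod_{j \in \alpha_i} y_j$.
  • Halfspace:
  • $f(x) = \sum_i f_i x_i$
  • $\mathrm{sgn}(f) = x_i$ with probability 0.99
  • Polynomial:
  • $p(y) = \sum_i f(\alpha_i) \prod_{j \in \alpha_i} y_j$
  • $\mathrm{sgn}(p) = \prod_{j \in \alpha_i} y_j$ with probability 0.9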

37
Finding NearDictators
  • Given the coefficients $f(\alpha_i)$ of $p(y)$.
  • $\mathrm{sgn}(p(y)) = \prod_{j \in \alpha_i} y_j$ with probability 0.9.
  • Find the set $\alpha_i$.
  • Can evaluate $p(y)$ for any $y \in \{-1, 1\}^k$.
  • Hadamard (unique) decoding:
  • Given $p : \{-1, 1\}^k \to \{-1, 1\}$, find a parity that has
    agreement 0.9.
  • Can be solved with $O(k \log k)$ queries.
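
One standard way to meet the $O(k \log k)$ query bound is coordinate-wise self-correction: for each $j$, compare $p$ at a random point and at the same point with coordinate $j$ flipped, then take a majority vote. The sketch below assumes query access to a $\pm 1$-valued `p` and uses a fixed number of votes per coordinate; both the function name and the constant 25 are illustrative choices, and the slides do not say which decoder the authors use.

```python
import random

def decode_parity(p, k, samples=25, rng=random):
    """Recover alpha from p: {-1,1}^k -> {-1,1} that agrees with the parity
    chi_alpha(y) = prod_{j in alpha} y_j on at least 0.9 of all inputs.

    Uses 2 * k * samples queries; samples should be odd so votes never tie.
    """
    alpha = set()
    for j in range(k):
        votes = 0
        for _ in range(samples):
            y = [rng.choice((-1, 1)) for _ in range(k)]
            flipped = y[:j] + [-y[j]] + y[j + 1:]
            # If p is correct at both points, the product is -1 exactly
            # when j belongs to alpha.
            votes += p(y) * p(flipped)
        if votes < 0:
            alpha.add(j)
    return alpha
```

Both query points are individually uniform, so each vote is correct with probability at least 0.8 by a union bound, and taking $O(\log k)$ votes per coordinate drives the overall failure probability below a constant.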

38
Extensions
  • $\ell_1^+(f) > \delta\, \ell_1^-(f)$ for some $\delta < 1$.
  • Can be solved in $\delta^{-1} \log^3 m$ space.
  • FindDuplicate for $n = m - k$ in space $k \log^3 m$.
  • Generalized Isolation lemma.
  • $\ell_2^+(f) > \ell_2^-(f)$.
  • Can be solved in $\log^3 m$ space.
  • Weaker requirement for finding duplicates.
  • Finding NearDictators via Hadamard decoding.
  • $\ell_p^+(f) > \ell_p^-(f)$ for $p > 2$.
  • Requires $m^{\Omega(1)}$ space.

39
Open Problems
  • Conjecture: FindDuplicate can be solved using
    $O(\log m)$ space in one pass.
  • PRGs for Halfspaces: A generator that fools
    halfspaces.
  • A hitting set generator [Rabani-Shpilka08].
  • Sample duplicates according to frequency?

40
Thank You.