Title: Exact and Approximate Pattern Matching in the Streaming Model
1 Exact and Approximate Pattern Matching in the Streaming Model
Benny Porat and Ely Porat, FOCS 2009
- Presented by Tanushree Mitra
2 Problem Statement
- Find all instances of a pattern P of length m as a contiguous substring of a text string T of length n, where m < n.
3 Contributions
- Exact pattern matching: a fully online randomized algorithm for the classical pattern matching problem
  - Time complexity: O(log m) per character that arrives
  - Space complexity: O(log m), breaking the O(m) barrier that held for this problem for a long time
- Approximate pattern matching: an algorithm for the pattern matching with k mismatches problem
  - Time complexity: O(k^2 poly(log m)) per character
  - Space complexity: O(k^3 poly(log m))
4 Applications
- Monitoring Internet traffic
- Computational Biology
- Large Scale web searching
- Viruses and Malware detection
- Automatic Stock market analysis
- Robotics
5 Background
- Brute force algorithm
  - Slide the pattern along the text, and
  - Compare it to the corresponding portion of the text
  - Time complexity: O(mn)
- A speedup is possible in each of these 2 steps.
- Sliding step: speed up by preprocessing the pattern
  - Knuth-Morris-Pratt algorithm
  - Boyer-Moore algorithm
  - Ukkonen's algorithm to construct suffix trees
- Comparison step: speed up the comparison itself
  - Rabin-Karp algorithm
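The two steps named above can be sketched side by side. This is a minimal illustration, not code from the talk: `brute_force` slides and compares at every shift, while `rabin_karp` replaces most comparisons with a rolling hash. The function names, the hash base, and the modulus are assumptions chosen for the example, and a hash hit is verified directly so the sketch never reports a false positive.

```python
def brute_force(text, pattern):
    """Slide the pattern along the text and compare at every shift: O(mn)."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


def rabin_karp(text, pattern, base=256, mod=(1 << 61) - 1):
    """Speed up the comparison step with a rolling hash (illustrative parameters)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Compare characters only when the hashes agree.
        if w_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:
            # Drop text[i] from the window and append text[i + m].
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches
```

Note that both functions keep the last m characters of the text available, which is exactly the O(m) space cost the streaming algorithm aims to avoid.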
6 Quick History
7 The Intuition
- When the Rabin-Karp algorithm is done with the ith character and advances to the next position in the text, it does not use any of the information it has gathered.
- The KMP algorithm, on the other hand, puts that information to good use.
The Idea
- Combine the key features of the KMP and Rabin-Karp algorithms to obtain an online algorithm that uses less space.
8 Definitions - Fingerprints
Fingerprint
Sliding Fingerprint
- Polynomial fingerprint
  - φ_{r,p}(S) = s_1 r + s_2 r^2 + ... + s_l r^l mod p, where p ∈ Θ(n^4) and r ∈ F_p
- False positives
  - If S_1 ≠ S_2, then the probability that φ_{r,p}(S_1) = φ_{r,p}(S_2) is < 1/n^3
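A polynomial fingerprint and a sliding variant can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the prime `P` and the value `R` are fixed constants here (the slides take p in Θ(n^4) and r from F_p, and a randomized algorithm would draw r at random), and the class name `SlidingFingerprint` and its methods `extend` and `drop_prefix` are hypothetical.

```python
P = (1 << 61) - 1  # assumed fixed prime; the slides take p in Theta(n^4)
R = 1_000_003      # assumed element of F_p; a real run would pick r at random


def fingerprint(s, r=R, p=P):
    """phi_{r,p}(s) = s_1*r + s_2*r^2 + ... + s_l*r^l mod p."""
    value, power = 0, 1
    for ch in s:
        power = power * r % p
        value = (value + ord(ch) * power) % p
    return value


class SlidingFingerprint:
    """Fingerprint of a window that grows on the right and shrinks on the left."""

    def __init__(self, r=R, p=P):
        self.r, self.p = r, p
        self.r_inv = pow(r, p - 2, p)  # Fermat inverse, valid because p is prime
        self.value = 0                 # fingerprint of the current window
        self.length = 0                # number of characters in the window
        self.power = 1                 # r^length mod p

    def extend(self, ch):
        """Append one character to the right end of the window."""
        self.power = self.power * self.r % self.p
        self.value = (self.value + ord(ch) * self.power) % self.p
        self.length += 1

    def drop_prefix(self, prefix_value, prefix_length):
        """Remove a prefix whose fingerprint and length are already known."""
        shift = pow(self.r_inv, prefix_length, self.p)
        self.value = (self.value - prefix_value) * shift % self.p
        self.length -= prefix_length
        self.power = self.power * shift % self.p
```

The point of `drop_prefix` is that the window can be slid forward using only the fingerprint of the removed prefix, without storing the characters themselves.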
9 Definitions - period(P_l)
- Period: a prefix S_p = s_1, s_2, ..., s_l of a string S of length n is defined to be a period of S iff s_i = s_{i+l} for 0 < i ≤ n - l.
- period(P_l): for a pattern P = p_1, p_2, ..., p_m, the prefix of length l is P_l = p_1, p_2, ..., p_l, 0 < l ≤ m. The shortest period of P_l is period(P_l).
Put the information to good use
- If P_l matches the text at a given index i, then there cannot be another match of P_l starting strictly between i and i + |period(P_l)|.
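The shortest period of every prefix can be computed in one pass with the KMP failure function. The sketch below is one standard way to do it, offered as an illustration rather than as the method used in the talk; the function name `shortest_period_lengths` is an assumption.

```python
def shortest_period_lengths(pattern):
    """Return a list per where per[l] is the length of the shortest period of
    the prefix pattern[:l], for 1 <= l <= len(pattern); per[0] is unused."""
    m = len(pattern)
    fail = [0] * (m + 1)  # fail[l]: length of the longest proper border of pattern[:l]
    k = 0
    for l in range(2, m + 1):
        while k > 0 and pattern[l - 1] != pattern[k]:
            k = fail[k]
        if pattern[l - 1] == pattern[k]:
            k += 1
        fail[l] = k
    # A prefix of length l with longest border b has shortest period l - b.
    return [0] + [l - fail[l] for l in range(1, m + 1)]
```

For example, the prefix `abab` has shortest period `ab` of length 2, so two occurrences of `abab` in a text can never start fewer than 2 positions apart.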
10 The Idea
- A match at the ith index indicates that we know the last m characters, so why save them?
- Preprocessing phase: calculate the sliding fingerprint of the pattern, φ_P, and of its shortest period, φ_period(P).
- Online phase: slide the fingerprint φ over the entire text.
  - While φ = φ_P, slide φ by |period(P_l)| characters, i.e. only over the positions that could be a match.
  - If we do not reach the end of the text, abort.
- False positives? They occur only with very low probability.
- Limitation: the text and pattern must satisfy stringent restrictions.
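11 Go for subpatterns
[Figure: the pattern p_1, p_2, p_3, ..., p_{m-3}, p_{m-2}, p_{m-1}, p_m with the subpatterns P_1, P_2, P_4, ..., P_{m/2}. Box labels in the figure: p_m; p_{m-2}, p_{m-1}; p_{m-6}, p_{m-5}, p_{m-4}, p_{m-3}; p_1, p_2, p_3, ..., p_{m/2}.]
- Starting point: find a position at which the smallest subpattern matches the text. The smallest subpattern has length 1, so this is easy to find.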
12 Algorithm
- Guidelines
  - Find a position where P_i is a match, then try to match P_{i+1} from the same starting point as P_i.
  - If P_{i+1} does not match, use the information that P_i is a match.
  - Check in jumps of |period(P_i)| until there is no overlap with the area where P_i matches.
- Process
  - Initialize an empty sliding fingerprint φ.
  - For each character that arrives:
    - Extend φ to include the new character.
    - If |φ| = 2^i and φ ≠ φ_i for some 0 ≤ i ≤ log m:
      - If φ overlaps the last match by at least |period(P_{i-1})| characters, slide φ by |period(P_{i-1})| characters.
      - Else, abort.
- What if there is a match that starts in the substring of the 1st process and ends in the substring of the 2nd process?
13 Exact_PM Final Algorithm: Introduce Checkpoints
- Checkpoint: start a new process at the last checkpoint of each process.
- Algorithm
  - Preprocessing
    - Initialize an empty sliding fingerprint φ.
    - For each 0 ≤ i ≤ log m, calculate the sliding fingerprint
      - φ_i of P_i, and
      - φ_{i,period} of the period of P_i
14 Final Algorithm: Online Phase
- Online phase
  - Start a new process.
  - For each character that arrives, send it to all the processes.
  - If some process aborts, start a new process.
  - If some process A reaches a checkpoint:
    - Stop the son process of A (if it has one).
    - Start a new son process of A.
15 Complexity
- Space
  - All fingerprints from preprocessing use O(log m) space.
  - Each process saves another fingerprint, and there can be at most log m processes in parallel.
  - Overall usage: O(log m) space
- Time
  - Each process spends O(1) time for each new character that arrives.
  - At any time there are at most 3 log m processes running (1. process A, 2. son process of A, 3. grandson process of A; A has to die when the great-grandson of A is created).
  - Overall running time: O(log m) per character
16 Pattern Matching (1 Mismatch)
- Partition the pattern and the text.
- We need to align every partition P_{q_i,j} of the pattern to q_i text shifts.
17 Intuition
- For each P_{q_i,j}, run q_i processes of Exact_PM.
- Process_{q_i,j,s}: the sth process of the subpattern P_{q_i,j}, for 0 ≤ s < q_i. It tries to match P_{q_i,j} to the text by treating the text as if it starts from the sth character (it receives character t when t mod q_i = (j + s) mod q_i).
- If for all q_i,
  - numOfNotMatch_{q_i,s} = 0: match.
  - numOfNotMatch_{q_i,s} = 1: exactly 1 mismatch.
- Otherwise, more than 1 mismatch.
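The counting rule above can be illustrated offline for a single alignment. This is a simplified sketch under loud assumptions: it is not streaming, a direct slice comparison stands in for each Exact_PM process, and the primes are simply the smallest ones whose product is at least m (the slides choose them from a range instead). The names `primes_with_product_at_least` and `classify_alignment` are hypothetical.

```python
def primes_with_product_at_least(m):
    """Smallest primes whose product is >= max(m, 2); an assumed choice of the q_i."""
    primes, product, candidate = [], 1, 2
    while product < max(m, 2):
        # primes holds every prime below candidate, so this is a primality test.
        if all(candidate % q for q in primes):
            primes.append(candidate)
            product *= candidate
        candidate += 1
    return primes


def classify_alignment(text, pattern, start):
    """Return 0 (match), 1 (exactly one mismatch) or 2 (more than one mismatch)
    for the pattern aligned at text[start:]; assumes the window fits in the text."""
    m = len(pattern)
    window = text[start:start + m]
    worst = 0
    for q in primes_with_product_at_least(m):
        # Subpattern j holds the pattern positions congruent to j mod q.
        # The slice comparison stands in for an Exact_PM process.
        not_matched = sum(window[j::q] != pattern[j::q] for j in range(q))
        worst = max(worst, not_matched)
    return min(worst, 2)
```

The reason the rule works in this sketch: two distinct mismatch positions differ by less than m, so they cannot be congruent modulo every chosen prime, and some q_i places them in two different subpatterns.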
18 Complexity
- Facts
  - Run Σ_{i=1}^{l} q_i^2 processes of Exact_PM.
  - There exists a constant c such that for any x there are about x / log x prime numbers between x and cx.
  - We have q_1, q_2, ..., q_l groups of partitions. Each q_i is a prime number.
- Space: O(log^4 m / log log m)
- Time: O(log^3 m / log log m)
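One way to see where these bounds come from is the following worked calculation. It is a reconstruction under assumptions not spelled out on the slide: the primes are taken from the range between log m and c log m, so each q_i is Θ(log m) and there are about log m / log log m of them; each Exact_PM process uses O(log m) space; and for each q_i an arriving character is forwarded to q_i of its processes, each costing O(log m) time as in the per-character bound for Exact_PM.

```latex
\begin{align*}
\ell &= \Theta\!\left(\frac{\log m}{\log\log m}\right), \qquad q_i = \Theta(\log m) \\
\#\text{processes} &= \sum_{i=1}^{\ell} q_i^2
  = \Theta\!\left(\frac{\log^3 m}{\log\log m}\right) \\
\text{space} &= O(\log m)\cdot\sum_{i=1}^{\ell} q_i^2
  = O\!\left(\frac{\log^4 m}{\log\log m}\right) \\
\text{time per character} &= O(\log m)\cdot\sum_{i=1}^{\ell} q_i
  = O\!\left(\frac{\log^3 m}{\log\log m}\right)
\end{align*}
```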
19 Pattern Matching (k Errors)
- Preprocessing phase: initialize a process Process_{q_i,j,s} of 1-mismatch for each q_i ∈ {q_1, q_2, ..., q_l}, 0 ≤ j < q_i and 0 ≤ s < q_i.
- Online phase: send the character t to each Process_{q_i,j,s} such that t mod q_i = (j + s) mod q_i.
  - d = all mismatches from all processes that return exactly 1 mismatch
  - d > k: more than k mismatches
20 Complexity
- Space
  - Run Σ_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m) processes of 1-mismatch in parallel.
  - Each process requires log^4 m space.
  - Overall: O(k^3 poly(log m))
- Time
  - The number of processes of the 1-mismatch algorithm is bounded by Σ_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m).
  - Running time for each character: O(log^3 m)
  - Overall: O(k^2 poly(log m))
21 Concluding Discussion
- The two-dimensional string-matching problem
- The string-matching problem with wild characters
  - Example: pattern P = abcabc is found in texts T1 = abcdcadbaccabc and T2 = abcabc
- String matching with weighted mismatch