Title: String Matching Algorithms Based upon the Uniqueness Property
C. W. Lu and R. C. T. Lee, 2007, String Matching
Algorithms Based upon the Uniqueness Property,
The 24th Workshop on Combinatorial Mathematics
and Computation Theory, pp.385-392.
- Advisor: Prof. R. C. T. Lee
- Speaker: C. W. Lu
String matching problem
- Given a text string T of length n and a pattern
string P of length m, find all occurrences of P in T.
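As a point of reference for the problem statement above, here is a minimal brute-force sketch. It is not one of the algorithms in this talk; the function name `find_all` and the 0-based positions are assumptions made for illustration.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every 0-based position where pattern occurs in text."""
    n, m = len(text), len(pattern)
    # Try every alignment of the pattern against the text.
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


print(find_all("CACTAGCCACTCTC", "TC"))
```

Every algorithm below tries to skip some of these alignments instead of testing all n - m + 1 of them.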
Rule 1: The Suffix to Prefix Rule
- Suppose u is the longest suffix of a window
that is also a prefix of P. We can then move P in
such a way that the prefix u of P aligns with
the suffix u of the window.
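Rule 1 can be sketched as follows. This is an illustration under two assumptions: the window has the same length as P, and only proper suffixes (shorter than P) are considered so that the shift is at least one. The function name is hypothetical.

```python
def suffix_prefix_shift(window: str, pattern: str) -> int:
    """Shift that aligns the longest suffix u of the window with the prefix u of P."""
    m = len(pattern)
    # Try the longest candidate first; u must be shorter than the pattern.
    for length in range(m - 1, 0, -1):
        if window.endswith(pattern[:length]):
            return m - length
    # No suffix of the window is a prefix of P: move P past the window.
    return m


print(suffix_prefix_shift("TTGACA", "CACTAG"))
```

In this example the suffix CA of the window is a prefix of CACTAG, so P moves 6 - 2 = 4 steps.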
The Uniqueness Property of a String
- For any substring V of P, if V occurs in P only
once, V is a unique substring.
- When V matches some substring of T, we can
move P in such a way that the prefix of P aligns
with the suffix of V.
Example
P = c a t a g t a g c c t
Suppose we use the
substring cc as the unique substring.
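Uniqueness can be checked by counting occurrences, including overlapping ones. The sketch below uses the pattern from this example with the spaces removed; the helper names are assumptions made for illustration.

```python
def occurrences(pattern: str, v: str) -> list[int]:
    """0-based start positions of v in pattern, overlapping occurrences included."""
    return [q for q in range(len(pattern) - len(v) + 1) if pattern.startswith(v, q)]


def is_unique(pattern: str, v: str) -> bool:
    """True if v occurs in pattern exactly once."""
    return len(occurrences(pattern, v)) == 1


P = "catagtagcct"
print(is_unique(P, "cc"), is_unique(P, "tag"))
```

Here cc occurs once, while tag occurs twice and is therefore not a unique substring.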
Algorithm 1: The Longest Prefix with Unique
Suffix Matching Algorithm
- We further relax the uniqueness requirement by noting that
the substring does not have to be unique in the
entire pattern P. In fact, a substring which is
unique in a prefix of P suffices.
- Therefore, we only have to find the longest
prefix of P whose suffix is unique within that prefix.
Example
P = CACTAGCCACTCTC
The substring TC occurs twice
in P, but it is unique in the prefix CACTAGCCACTC.
Move P 11 steps.
Example
P = CACTAGCCACTCTC
The substring G is also unique
in the prefix CACTAG.
Move P 6 steps.
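The two examples above can be reproduced with a short preprocessing sketch. The function name, the return layout and the way the shift is derived (prefix length minus the longest proper suffix of the unique substring that is also a prefix of P, following Rule 1) are assumptions made for illustration; the slides give only the resulting moves.

```python
def longest_prefix_with_unique_suffix(pattern: str, k: int) -> tuple[int, str, int]:
    """Return (prefix length, unique suffix u of size k, shift when u matches)."""
    for prefix_len in range(len(pattern), k - 1, -1):
        u = pattern[prefix_len - k:prefix_len]
        # u must not occur anywhere earlier inside this prefix.
        if pattern.find(u, 0, prefix_len - 1) == -1:
            # Longest proper suffix of u that is also a prefix of the pattern.
            overlap = next((t for t in range(k - 1, 0, -1)
                            if pattern.startswith(u[k - t:])), 0)
            return prefix_len, u, prefix_len - overlap
    raise ValueError("pattern is shorter than k")


P = "CACTAGCCACTCTC"
print(longest_prefix_with_unique_suffix(P, 2))  # TC in the prefix CACTAGCCACTC
print(longest_prefix_with_unique_suffix(P, 1))  # G in the prefix CACTAG
```

For TC the trailing C equals the first character of P, which is why the move is 11 rather than 12.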
P = CACTAGCCACTCTC
In the above example, using the unique substring
TC, we could move P 11 steps if TC matches with
TC in T; using the unique substring G, we could
move P 6 steps if G matches with G in T.
Is the unique substring TC better than the unique
substring G?
- We should notice that if the unique substring
appears in T many times, our algorithm would be
efficient.
- In general, the probability that TC in P matches
TC in T is 1/16 (assuming the size of the
alphabet is 4), and the probability that G in P
matches G in T is 1/4.
- Thus, the size of the unique substring is also
important.
P = CACTAGCCACTCTC
- For each time the substring TC in P matches TC
in T and moves P by 11 steps, the substring
G in P may match G in T four times and move P by
6 steps each time. So, we expect that the
substring G would be better than the substring TC
in general.
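The comparison above can be made concrete with a small calculation. This is an illustrative model only, not the ratio defined in the talk: it assumes the text characters are uniform and independent over an alphabet of size 4, that a match of the unique substring moves P by the stated number of steps, and that a mismatch moves P by one step.

```python
from fractions import Fraction


def expected_shift(size: int, move: int, alphabet_size: int = 4) -> Fraction:
    """Average move per alignment under the uniform-text assumption."""
    p_match = Fraction(1, alphabet_size ** size)
    # A match moves P by `move` steps, a mismatch by one step.
    return p_match * move + (1 - p_match) * 1


print(expected_shift(size=2, move=11))  # unique substring TC
print(expected_shift(size=1, move=6))   # unique substring G
```

Under these assumptions G gives 9/4 steps per alignment on average against 13/8 for TC, which agrees with the expectation stated above.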
- We now define a ratio to determine which
substring is better.
- Let Σ be the alphabet.
- The larger s is, the better the efficiency that can be
achieved in the searching phase.
Preprocessing Phase
P = CAGACGACCCCAACAGC, Σ = {A, C, G, T}, |Σ| = 4.
Find the longest prefix with a unique suffix
whose size is one.
Preprocessing Phase
- We have found the unique substring with size 1,
and we could use it to move P 3 steps.
- Next, we try to find a unique substring with
size 2 such that we could use this substring to
move P more than 3 × 4 steps.
- Thus, we only consider the substrings of
p_12 p_13 … p_16.
Searching Phase
If the unique substring mismatches, move P one
step.
Searching Phase
If the unique substring GC matches with GC in T,
move P 16 steps.
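The preprocessing and searching phases of Algorithm 1 can be put together as in the sketch below. It is a minimal illustration, not the authors' implementation: the function names are hypothetical, the size k of the unique substring is passed in rather than selected by the ratio, and the full comparison of P at the current alignment before moving is an assumption, since the slides show only the moves.

```python
def preprocess(pattern: str, k: int) -> tuple[int, str, int]:
    """Longest prefix with a unique suffix u of size k, and the move when u matches."""
    for prefix_len in range(len(pattern), k - 1, -1):
        u = pattern[prefix_len - k:prefix_len]
        if pattern.find(u, 0, prefix_len - 1) == -1:  # u is unique in the prefix
            overlap = next((t for t in range(k - 1, 0, -1)
                            if pattern.startswith(u[k - t:])), 0)
            return prefix_len, u, prefix_len - overlap
    raise ValueError("pattern is shorter than k")


def search(text: str, pattern: str, k: int = 1) -> list[int]:
    """Return all 0-based occurrences of pattern in text."""
    prefix_len, u, shift = preprocess(pattern, k)
    hits, s = [], 0
    while s + len(pattern) <= len(text):
        # Compare only the unique substring against the text first.
        if text.startswith(u, s + prefix_len - k):
            if text.startswith(pattern, s):  # assumed verification step
                hits.append(s)
            s += shift
        else:
            s += 1  # mismatch: move P one step
    return hits


P = "CAGACGACCCCAACAGC"
T = "TT" + P + P[1:] + "GG"
print(search(T, P, k=2))
```

With k = 2 the unique substring is GC and a match moves P 16 steps, so the two overlapping occurrences in this text are both reported.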
- As discussed above, the size of the unique
substring is important.
- In the following, we will introduce another
algorithm which uses a unique substring of
size one.
Algorithm 2: Longest Substring with Unique
Character Matching Algorithm
- Let x be any character in the window. In order
to have any meaningful matching of P with T, we
must find the same x in P located to the left
of x in T.
- In the preprocessing phase, we try to find the
longest substring p in P such that x occurs in p
only once. That is, p = p_i p_{i+1} … p_j
and p_j occurs in p only once.
- If the unique character x matches with x in T, we
can move P |p| steps.
Example
In this example, we would find the longest
substring p_4 p_5 … p_10 with a unique character p_10.
If the character p_10 matches with T, we can move
P 7 steps.
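The preprocessing for Algorithm 2 can be sketched as below. The pattern used on the slide is not reproduced in this transcript, so `ACGACACAAG` is a hypothetical pattern chosen so that the result mirrors the example (p_4 … p_10 in 1-based numbering, a move of 7). The code uses 0-based indices, and the function name is an assumption.

```python
def longest_substring_with_unique_last_char(pattern: str) -> tuple[int, int, int]:
    """Return (i, j, length) of the longest pattern[i..j] in which pattern[j] occurs once."""
    last_seen: dict[str, int] = {}
    best = (0, 0, 0)
    for j, ch in enumerate(pattern):
        # The substring can start right after the previous occurrence of ch.
        i = last_seen.get(ch, -1) + 1
        if j - i + 1 > best[2]:
            best = (i, j, j - i + 1)
        last_seen[ch] = j
    return best


i, j, length = longest_substring_with_unique_last_char("ACGACACAAG")
print(i + 1, j + 1, length)  # 1-based endpoints and the move
```

The move equals the length because every smaller move would align the matched text character with a different character of the substring.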
Searching Phase
If p_10 mismatches, move P one step.
Searching Phase
If p_10 matches with T, move P 7 steps.
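A complete search loop for Algorithm 2 might look like the sketch below. As before, this is an illustration under assumptions: the pattern is hypothetical, indices are 0-based, and the full comparison of P before moving is added here because the slides show only the moves.

```python
def search_unique_char(text: str, pattern: str) -> list[int]:
    """Return all 0-based occurrences of pattern, testing one unique character first."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    # Preprocessing: position j of the unique character and the move |p|.
    last_seen: dict[str, int] = {}
    j_best, shift = 0, 0
    for j, ch in enumerate(pattern):
        length = j - last_seen.get(ch, -1)
        if length > shift:
            j_best, shift = j, length
        last_seen[ch] = j
    # Searching: one character comparison decides the size of the move.
    hits, s = [], 0
    while s + len(pattern) <= len(text):
        if text[s + j_best] == pattern[j_best]:
            if text.startswith(pattern, s):  # assumed verification step
                hits.append(s)
            s += shift
        else:
            s += 1  # mismatch: move P one step
    return hits


print(search_unique_char("GGACGACACAAGACGACACAAGTT", "ACGACACAAG"))
```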
Algorithm 3: The Unique Pairwise Substring
Algorithm
- The substring p_i p_{i+1} … p_{j-1} p_j is called a unique
pairwise substring if it satisfies the condition
that p_i p_{i+1} … p_{j-1} p_j occurs in the prefix
p_1 p_2 … p_{j-1} p_j of P exactly once, and no
p_k p_{k+1} … p_{k+j-i} exists in p_1 p_2 … p_{j-1} such that
p_k = p_i and p_{k+j-i} = p_j.
Example
The substring TCG is a unique pairwise substring
because no p_k p_{k+1} p_{k+2} exists in p_1 p_2 … p_12 such
that p_k = p_11 = T and p_{k+2} = p_13 = G.
The substring CAC is not a unique pairwise
substring because there exists a substring p_2 p_3 p_4
in p_1 p_2 … p_9 such that p_2 = p_8 = C and p_4 = p_10 = C.
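The definition can be turned into a direct check. The pattern on the slide is not reproduced in this transcript, so `ATCACTAACACTCG` is a hypothetical pattern built so that TCG at positions 11 to 13 and CAC at positions 8 to 10 behave as in the example; indices in the code are 0-based, and the function name is an assumption.

```python
def is_unique_pairwise(pattern: str, i: int, j: int) -> bool:
    """True if no earlier pattern[k..k+j-i] has the same first and last characters."""
    d = j - i
    # k < i keeps the earlier candidate strictly to the left of pattern[i..j].
    return not any(pattern[k] == pattern[i] and pattern[k + d] == pattern[j]
                   for k in range(i))


P = "ATCACTAACACTCG"
print(P[11:14], is_unique_pairwise(P, 11, 13))
print(P[8:11], is_unique_pairwise(P, 8, 10))
```

Only the two end characters are compared, so a substring that passes this check also occurs exactly once in its prefix.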
- Suppose p_i p_{i+1} … p_{j-1} p_j is a unique pairwise
substring.
- If p_i and p_j match with T, we have two cases to
move P.
Case 1: There exists p_k such that p_j = p_k, where
0 ≤ k ≤ j-i-1. We can move P j-k steps.
Case 2: p_j ≠ p_k for all k, where 0 ≤ k ≤ j-i-1.
We can move P j+1 steps.
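The two cases can be sketched as a single function. Indices are 0-based, matching the range 0 ≤ k ≤ j-i-1 above; taking the largest such k in Case 1 (which gives the smallest move) is an assumption, as is the hypothetical pattern `ATCACTAACACTCG`.

```python
def pairwise_shift(pattern: str, i: int, j: int) -> int:
    """Move for a unique pairwise substring pattern[i..j] whose two ends matched."""
    # Case 1: pattern[j] also occurs among the first j - i characters.
    for k in range(j - i - 1, -1, -1):
        if pattern[k] == pattern[j]:
            return j - k
    # Case 2: no such k, so P moves past the matched character.
    return j + 1


P = "ATCACTAACACTCG"
print(pairwise_shift(P, 11, 13))  # Case 2
print(pairwise_shift(P, 5, 6))    # Case 1
```

For the pair at positions 11 and 13 no G occurs among the first two characters, so the move is 14; for the pair at positions 5 and 6 the first character already equals the A at position 6, so the move is 6.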
Example
If we choose p_11 p_12 p_13 as the unique pairwise
substring, we can move P 14 steps when p_11 and
p_13 match with T.
- There may be many unique pairwise substrings in
the pattern.
- We will select the one which is located
rightmost in the pattern.
Example
The substrings p_5 p_6, p_7 p_8 p_9 and p_11 p_12 p_13 are all
unique pairwise substrings. We would select
p_11 p_12 p_13 because it will have the largest move.
Example
If p_11 or p_13 mismatches, move P one step.
Example
If p_11 and p_13 match with T, move P 14 steps.
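Algorithm 3 as a whole can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the pair is chosen by the largest move directly (the slides select the rightmost unique pairwise substring for the same purpose), the full comparison of P before moving is an assumption, the pattern is hypothetical, and indices are 0-based.

```python
def choose_pair(pattern: str) -> tuple[int, int, int]:
    """Return (i, j, shift) for the unique pairwise substring with the largest move."""
    best = None
    for j in range(1, len(pattern)):
        for i in range(j):
            d = j - i
            if any(pattern[k] == pattern[i] and pattern[k + d] == pattern[j]
                   for k in range(i)):
                continue  # not a unique pairwise substring
            # Case 1 gives j - k for the largest k < d, Case 2 gives j + 1.
            shift = next((j - k for k in range(d - 1, -1, -1)
                          if pattern[k] == pattern[j]), j + 1)
            if best is None or shift >= best[2]:
                best = (i, j, shift)
    if best is None:
        raise ValueError("pattern must have at least two characters")
    return best


def search_pairwise(text: str, pattern: str) -> list[int]:
    """Return all 0-based occurrences of pattern in text."""
    i, j, shift = choose_pair(pattern)
    hits, s = [], 0
    while s + len(pattern) <= len(text):
        if text[s + i] == pattern[i] and text[s + j] == pattern[j]:
            if text.startswith(pattern, s):  # assumed verification step
                hits.append(s)
            s += shift
        else:
            s += 1  # either end mismatches: move P one step
    return hits


P = "ATCACTAACACTCG"
print(search_pairwise("GG" + P + "TT" + P, P))
```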
Thanks for your attention.