Title: A Fast String Searching Algorithm
1A Fast String Searching Algorithm
- Robert S. Boyer and J Strother Moore
- Communications of the ACM, vol. 20, no. 10, Oct. 1977
2Outline
- Introduction
- The Knuth-Morris-Pratt algorithm
- The Boyer-Moore algorithm
- Bad Character heuristic
- Good Suffix heuristic
- Matching Algorithm
- Experimental Result
- Conclusion
3Introduction
- String Matching
- Searching for a pattern in a text (a longer string).
- If the pattern occurs in the string, return the position of the first character of the substring that matches the pattern.
4Introduction (cont.)
- Some definitions
- m: the length of the pattern.
- n: the length of the string (or text).
- s (shift): the distance between the first character of the matched substring and the first character of the string.
- w ⊏ x: the string w is a prefix of the string x.
- w ⊐ x: the string w is a suffix of the string x.
5Introduction (cont.)
- The naive string-matching algorithm
- Time Complexity
- Θ((n-m+1)m) in the worst case.
- Θ(n²) if m = ⌊n/2⌋.
for s ← 0 to n-m do
    if pattern[1..m] = string[s+1..s+m]
        print "Pattern occurs with shift" s
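The naive matcher above can be sketched in Python as follows. This is a minimal sketch, not code from the slides: the name `naive_shifts` is hypothetical, indices are 0-based, and the shifts are collected in a list instead of being printed.

```python
def naive_shifts(pattern, string):
    """Return every shift s at which pattern occurs in string (0-based)."""
    m, n = len(pattern), len(string)
    shifts = []
    # Try each of the n - m + 1 candidate shifts; each comparison costs up to m.
    for s in range(n - m + 1):
        if string[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_shifts("ABA", "ABABA"))
```

Each candidate shift is checked independently, which is where the (n-m+1)·m worst-case bound comes from.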
6Knuth-Morris-Pratt Algorithm
[Figure: the pattern aligned against the string; labels s, q, s, k]
7Knuth-Morris-Pratt Algorithm(cont.)
- Prefix Function
- f(j) = the largest i < j such that P[1..i] = P[j-i+1..j]
- f(j) = 0 if no such i exists.
[Figure: Pq = A B A B A and Pk = A B A, with Pk ⊐ Pq]
8Knuth-Morris-Pratt Algorithm(cont.)
- Prefix Function Algorithm
f[1] ← 0
k ← 0
for q ← 2 to m do
    while k > 0 and P[k+1] ≠ P[q] do
        k ← f[k]
    if P[k+1] = P[q] then
        k ← k+1
    f[q] ← k
return f[1..m]
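A Python rendering of the prefix-function pseudocode may help when checking values by hand. This is a sketch under stated assumptions: the name `prefix_function` is hypothetical, and the list `f` is padded with an unused entry at index 0 so that `f[1..m]` lines up with the 1-based indices on the slide.

```python
def prefix_function(P):
    """Return [f(1), ..., f(m)] for pattern P, following the slide's 1-based f."""
    m = len(P)
    f = [0] * (m + 1)  # f[0] is unused padding
    k = 0
    for q in range(2, m + 1):
        # 1-based P[k+1] is P[k] in Python; 1-based P[q] is P[q-1].
        while k > 0 and P[k] != P[q - 1]:
            k = f[k]
        if P[k] == P[q - 1]:
            k += 1
        f[q] = k
    return f[1:]


if __name__ == "__main__":
    print(prefix_function("ABABA"))  # [0, 0, 1, 2, 3]
```

For P = ABABA the last value is 3, which corresponds to the longest proper prefix ABA that is also a suffix.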
9Knuth-Morris-Pratt Algorithm(cont.)
- Time Complexity
- Prefix function: O(m), by amortized analysis
- Matching function: O(n)
- Total: O(m+n), i.e. linear complexity
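The slides do not list the matching function itself, so the following is only a hedged sketch of one common way to write it. The names `kmp_search` and `prefix_table` are hypothetical, and the table here is 0-based (`f[q]` is the border length of `P[0..q]`), unlike the 1-based f on the slides.

```python
def prefix_table(P):
    """0-based prefix function: f[q] = length of the longest proper border of P[:q+1]."""
    f = [0] * len(P)
    k = 0
    for q in range(1, len(P)):
        while k > 0 and P[k] != P[q]:
            k = f[k - 1]
        if P[k] == P[q]:
            k += 1
        f[q] = k
    return f


def kmp_search(P, T):
    """Return all 0-based shifts at which P occurs in T."""
    if not P:
        return []
    f = prefix_table(P)
    shifts = []
    q = 0  # number of pattern characters currently matched
    for i, c in enumerate(T):
        # On a mismatch, fall back along the prefix function instead of rescanning T.
        while q > 0 and P[q] != c:
            q = f[q - 1]
        if P[q] == c:
            q += 1
        if q == len(P):
            shifts.append(i - len(P) + 1)
            q = f[q - 1]
    return shifts


if __name__ == "__main__":
    print(kmp_search("ABA", "ABABA"))
```

The text index `i` never moves backwards, and `q` can only decrease as often as it has increased, which is the amortized argument behind the O(n) matching bound.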
10The Boyer-Moore Algorithm
- Symbols used
- Σ: the alphabet
- patlen: the length of the pattern
- m: the number of characters at the end of the pattern that have matched so far
- char: the mismatched character
[Figure: the pattern aligned against the string; labels char, string, pattern, m]
11Characteristic
- The pattern is matched from its rightmost character to its leftmost character.
- When the pattern is relatively long and Σ is reasonably large, this algorithm is likely to be the most efficient string-matching algorithm.
12Bad Character heuristic
- Observation 1
- If char does not occur in pat:
- The pattern shifts right by j characters.
- The string pointer shifts right by patlen characters.
- Example
A D C A B C A B A
13Bad Character heuristic (cont.)
- Observation 2
- char occurs in the pattern.
- Let d1[char] be the position of the rightmost occurrence of char in the pattern, and let j be the current position of the pattern pointer.
- If j < d1[char], we shift the pattern right by 1.
- If j > d1[char], we shift the pattern right by j - d1[char].
- d1 is an array with one entry per character of Σ.
14Bad Character heuristic (cont.)
string:  A C B B A C A B C A
pattern: A B C
j = 3 and d1[B] = 2, so the pattern shifts right by 1 and the string pointer shifts right by 1 (m + pattern shift, with m = 0).
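The two observations can be sketched in Python as below. This is an illustration under assumptions, not the authors' code: `build_d1` and `bad_char_shift` are hypothetical names, a dict stands in for the array indexed by Σ, and a character that does not occur in the pattern is given position 0, so that Observation 1 (shift by j) falls out of the same formula.

```python
def build_d1(pat):
    """Map each character to the 1-based position of its rightmost occurrence in pat."""
    d1 = {}
    for pos, ch in enumerate(pat, start=1):
        d1[ch] = pos  # later (more rightward) occurrences overwrite earlier ones
    return d1


def bad_char_shift(d1, char, j):
    """Pattern shift after a mismatch on `char` with the pattern pointer at j (1-based)."""
    rightmost = d1.get(char, 0)  # assumption: 0 means "char does not occur in pat"
    if j > rightmost:
        return j - rightmost     # align the rightmost occurrence under char
    return 1                     # the occurrence is to the right of j: shift by 1


if __name__ == "__main__":
    d1 = build_d1("ABC")
    print(bad_char_shift(d1, "B", 3))
```

With pattern ABC, a mismatch on B at j = 3 gives a shift of 3 - 2 = 1, and a mismatch on a character outside the pattern at j = 3 gives a shift of 3.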
15Good Suffix heuristic
- Two sequences c1..cn and d1..dn unify if, for every i from 1 to n, either ci = di, or ci = $, or di = $, where $ is a character that does not occur in pat.
- The position of the rightmost plausible reoccurrence, rpr(j), is the greatest k ≤ patlen such that pat(j+1)..pat(patlen) and pat(k)..pat(k + patlen - j - 1) unify, and either k ≤ 1 or pat(k-1) ≠ pat(j). Here pat(i) = $ for i < 1.
16Good Suffix heuristic (cont.)
- Example
- Pattern shift = j + 1 - rpr(j)
- String pointer shift = m + j + 1 - rpr(j)
- = patlen - j + j + 1 - rpr(j)
- = d2[j] = patlen + 1 - rpr(j)
j        -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9
pat                              A  B  X  Y  C  D  E  X  Y
rpr(j)                          -7 -6 -5 -4 -3 -2  3  0  9
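The definition of rpr(j) can be checked mechanically with a direct, unoptimized search. The sketch below is an assumption-laden illustration rather than the authors' preprocessing: `rpr` and `build_d2` are hypothetical names, positions below 1 play the role of $, and the search simply tries k from the right until the definition is satisfied (quadratic or worse, unlike an O(patlen) construction).

```python
def rpr(pat, j):
    """Rightmost plausible reoccurrence of pat(j+1)..pat(patlen), with 1-based j."""
    patlen = len(pat)
    k = min(j + 1, patlen)  # the candidate occurrence must end inside the pattern
    while True:
        # Compare pat(k+t) with pat(j+1+t); positions < 1 are '$' and unify with anything.
        unify = all(k + t < 1 or pat[k + t - 1] == pat[j + t]
                    for t in range(patlen - j))
        if unify and (k <= 1 or pat[k - 2] != pat[j - 1]):
            return k
        k -= 1  # terminates: once the whole candidate lies left of position 1 it unifies


def build_d2(pat):
    """d2[j] = patlen + 1 - rpr(j) for j = 1..patlen, returned as a 0-based list."""
    patlen = len(pat)
    return [patlen + 1 - rpr(pat, j) for j in range(1, patlen + 1)]


if __name__ == "__main__":
    pat = "ABXYCDEXY"
    print([rpr(pat, j) for j in range(1, len(pat) + 1)])
    print(build_d2(pat))
```

For the pattern A B X Y C D E X Y this yields rpr(7) = 3, because X Y reoccurs at position 3 preceded by B, which differs from pat(7) = E. For j = patlen the suffix is empty, so the definition gives rpr = patlen and d2 = 1.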
17Good Suffix heuristic (cont.)
18Boyer-Moore Matching Algorithm
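i ← patlen
if n < patlen return false
j ← patlen
while j > 0 do
    if string(i) = pat(j) then
        j ← j - 1
        i ← i - 1
    else
        i ← i + max(d1(string(i)), d2(j))
        if i > n then return false
        j ← patlen
return i + 1
(Here d1(char) is the string-pointer shift for char: patlen minus the position of the rightmost occurrence of char in pat, or patlen if char does not occur in pat.)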
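Putting the two heuristics together, the matching loop can be sketched in Python as follows. This is a minimal illustration under assumptions: `boyer_moore` and `rpr` are hypothetical names, the good-suffix table is built by brute force from the rpr definition, the bad-character shift is expressed as a string-pointer shift (patlen minus the rightmost position of the character, or patlen if it is absent), and the result is a 0-based index or -1 instead of false.

```python
def rpr(pat, j):
    """Rightmost plausible reoccurrence of the suffix after 1-based position j."""
    patlen = len(pat)
    k = min(j + 1, patlen)
    while True:
        unify = all(k + t < 1 or pat[k + t - 1] == pat[j + t]
                    for t in range(patlen - j))
        if unify and (k <= 1 or pat[k - 2] != pat[j - 1]):
            return k
        k -= 1


def boyer_moore(pat, string):
    """Return the 0-based index of the first occurrence of pat in string, or -1."""
    patlen, n = len(pat), len(string)
    if patlen == 0:
        return 0
    if n < patlen:
        return -1
    # Rightmost 1-based position of each character (0 if absent, via .get below).
    last = {ch: pos for pos, ch in enumerate(pat, start=1)}
    # d2[j] for j = 1..patlen; index 0 is unused padding.
    d2 = [0] + [patlen + 1 - rpr(pat, j) for j in range(1, patlen + 1)]
    i = patlen  # 1-based string pointer, starts under the last pattern character
    while i <= n:
        j = patlen
        # Compare right to left while characters agree.
        while j > 0 and string[i - 1] == pat[j - 1]:
            j -= 1
            i -= 1
        if j == 0:
            return i  # 1-based match position i + 1, i.e. 0-based index i
        d1 = patlen - last.get(string[i - 1], 0)
        i += max(d1, d2[j])
    return -1


if __name__ == "__main__":
    print(boyer_moore("ABC", "ACBBACABCA"))  # 6
```

Since d2[j] is always greater than the number of characters already matched, the string pointer moves past the previous alignment on every mismatch, even when the bad-character shift alone would be too small.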
19Boyer-Moore Matching Algorithm
- Time Complexity
- Bad Character heuristic: O(patlen)
- Good Suffix heuristic: O(patlen)
- Matching: O(n)
- Total: O(n + patlen)
20Experimental Result
21Conclusion
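- The Boyer-Moore algorithm is sublinear in practice: it usually inspects fewer than n characters of the string, with total time O(n + patlen).
- Boyer-Moore is likely to be the most efficient string-matching algorithm when the pattern is long and the alphabet is reasonably large.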