Title: Searching a String with the Boyer-Moore Algorithm
1Searching a String with the Boyer-Moore Algorithm
yxamplegsreinfkaeijajkjalijEnknfejienanfhytoirht0
8to43508gjsfnbgfwurhqqjwnsjdlhfjsng83uu5hfaw09854w
09ruwij0w9ut94u5t943543r01355738989002211esacbnmas
dfghjklq3wwrtyiuiopun4n5ns4e2232tg7msgism8k942uq2n
ac368723245gm3mjjwihwhrhwqnqn
- Shana Rose Negin
- December 14, 2000
2Boyer-Moore String Search
- How does it work?
- Examples
- Complexity
- Acknowledgements
3How Does it Work?
- Pattern moves left to right.
- Comparisons are done right to left.
- Uses two heuristics
- Bad Character
- Good Suffix
- Each heuristic comes into play when a mismatch
occurs. Each one proposes the largest number of
characters the search pattern can safely move
forward without skipping over a possible match.
4Pattern Moves Left to Right
Start
Text:    Several hours later, Cindy
Pattern: indy
Middle
Text:    Several hours later, Cindy
Pattern:            indy
End
Text:    Several hours later, Cindy
Pattern:                       indy
5Comparisons are done right to left.
Text:    Several hours later, Cindy
Pattern:                       indy
First Comparison:  y (rightmost character of the pattern)
Second Comparison: d
Third Comparison:  n
Fourth Comparison: i (leftmost character of the pattern)
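The right-to-left comparison at a single alignment can be sketched in Python as follows. This is only an illustration: the function name compare_right_to_left and the 0-based shift s are assumptions, not part of the slides.

```python
def compare_right_to_left(text, pattern, s):
    """Compare pattern with text[s : s + len(pattern)], right to left.

    Returns the 0-based pattern index of the first mismatch,
    or -1 if every character matched.
    """
    j = len(pattern) - 1
    while j >= 0 and pattern[j] == text[s + j]:
        j -= 1  # this character matched, move one position to the left
    return j


text = "Several hours later, Cindy"
print(compare_right_to_left(text, "indy", 22))  # -1: y, d, n, i all match
print(compare_right_to_left(text, "indy", 0))   # 3: mismatch on the first comparison
```

At shift 22 all four comparisons succeed, in the order y, d, n, i. At shift 0 the very first comparison (y against e) fails, so only one comparison is made.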
6Three Parts to the Bad Character Heuristic
1. When the comparison gives a mismatch, the
bad-character heuristic proposes moving the
pattern to the right by an amount so that the bad
character from the string will match the
rightmost occurrence of the bad character in the
pattern.
2. If the bad character doesn't occur in the
pattern, then the pattern may be moved completely
past the bad character.
3. If the rightmost occurrence of the bad
character is to the right of the current bad
character position, then this heuristic makes no
proposal.
7Bad Character Heuristic
1. When the comparison gives a mismatch, the
bad-character heuristic proposes moving the
pattern to the right by an amount so that the bad
character from the string will match the
rightmost occurrence of the bad character in the
pattern.
Text:    You've got a funny face, man.
Pattern:                    cite
Shift:                        cite
Shifted two characters to line up the c's.
8Bad Character Heuristic
2. If the bad character doesn't occur in the
pattern, then the pattern may be moved completely
past the bad character.
Text:    You've got a funny face, man.
Pattern: poor
Shift:       poor
Shifted four characters because the bad character
does not occur anywhere in the pattern.
9Bad Character Heuristic
3. If the rightmost occurrence of the bad
character is to the right of the current bad
character position, then this heuristic makes no
proposal.
Text:    There are no babies here.
Pattern:             drab
Shift:             drab
The shift proposed would be negative (a move to
the left), so it is ignored.
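The three cases of the bad-character heuristic reduce to one subtraction, sketched here in Python. The function name and the 0-based mismatch index j are assumptions made for illustration. The mismatch positions passed in are chosen to reproduce the shifts described on the slides (two, four, and a negative proposal).

```python
def bad_character_shift(pattern, j, bad_char):
    """Shift proposed by the bad-character heuristic.

    j is the 0-based index in the pattern where the mismatch occurred and
    bad_char is the text character found there. A result of zero or less
    means the heuristic makes no proposal.
    """
    rightmost = pattern.rfind(bad_char)  # -1 when bad_char is not in the pattern
    return j - rightmost


print(bad_character_shift("cite", 2, "c"))  # 2: line up the c's
print(bad_character_shift("poor", 3, "v"))  # 4: move completely past the bad character
print(bad_character_shift("drab", 1, "b"))  # -2: negative, so it is ignored
```

Because rfind returns -1 for a missing character, case 2 falls out of the same formula: the pattern moves j + 1 positions, which is just past the bad character.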
10Good Suffix Heuristic
The good-suffix heuristic proposes moving the
pattern to the right by the least amount so that
another group of characters in the pattern lines
up with the good suffix (the characters already
matched) found in the text.
Text:    ...I wish I had an apple instead of...
Pattern: banana
Shift two so that the second occurrence of "an"
in "banana" matches the characters "an" in the
text.
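One way to see what the good-suffix heuristic computes is a brute-force sketch in Python: try every shift from smallest to largest and take the first one that keeps the already-matched characters consistent. The function name and the 0-based mismatch index j are assumptions for illustration, and this slow version is not the table-building method used later in the slides.

```python
def good_suffix_shift(pattern, j):
    """Smallest shift that keeps the matched suffix pattern[j+1:] consistent.

    j is the 0-based index of the mismatch, so pattern[j+1:] is the good
    suffix that already matched the text.
    """
    m = len(pattern)
    for k in range(1, m):
        # After shifting by k, pattern[i - k] sits where pattern[i] used to be.
        # Positions that fall off the left end of the pattern impose no constraint.
        if all(i - k < 0 or pattern[i - k] == pattern[i]
               for i in range(j + 1, m)):
            return k
    return m  # no other occurrence: shift by the whole pattern length


print(good_suffix_shift("banana", 2))  # 2: "ana" matched, the earlier "ana" lines up
print(good_suffix_shift("grad", 1))    # 4: "ad" occurs nowhere else in "grad"
```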
11EXAMPLE
Text:    Im_a_grad._dad_is_glad
Pattern: grad
(Figure: the pattern grad drawn under the text at
each successive alignment, with the comparisons
numbered 1 to 12. Legend: Bad-character,
Good-Suffix, Match.)
12 comparisons out of 22 characters.
12EXAMPLE
Text:    Where_are_you_moving?_What_are_you_doing?
Pattern: grad
(Figure: the pattern grad drawn under the text at
each successive alignment. Legend: Bad-character,
Good-Suffix, Match.)
10 comparisons out of 41 characters.
The last grad would extend past the end of the
remaining string, so it is discarded before it is
counted.
13Applets
- http://www.accessone.com/lorre/pages/bmi.html
- http://www.i.kyushu-u.ac.jp/takeda/PM_DEMO/e.html
14The Algorithm
Sigma = alphabet in use
T = search string (text)
P = pattern
N = length[T]
M = length[P]
L = Compute_Last_Occurrence_Function(P, M, Sigma)   (for bad-character heuristic)
Y = Compute_Good_Suffix_Function(P, M)              (for good-suffix heuristic)
s = 0
while (s <= N - M)
    j = M
    while (j > 0 AND P[j] == T[s + j])
        j--
    if (j == 0)
        print("Pattern FOUND!!! Location: ", s)
        s = s + Y[0]
    else
        s = s + max(Y[j], j - L[T[s + j]])
15Compute_Last_Occurrence_Function
Compute_Last_Occurrence_Function(P, M, Sigma)
/* Contained in the array L, there is a field for
   every letter in the alphabet. When this function
   is finished computing, the number in L[a] will
   represent the number of characters from the
   beginning of the pattern at which the rightmost a
   lies; L[b] will contain the distance from the
   beginning of the pattern for the rightmost
   occurrence of b, and so on. A value of 0 means
   the letter does not occur in the pattern.
   EXAMPLE: pattern = jeff
   L -> j = 1, e = 2, f = 4, every other letter = 0 */
    for (each character a in Sigma)   // Initialize all fields to 0
        L[a] = 0
    for (j = 1; j <= M; j++)          // For every letter in the pattern,
        L[P[j]] = j                   // record its distance from the start
    return L                          // of the pattern
/* COMPLEXITY: O(|Sigma| + M) */
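A minimal Python sketch of the last-occurrence table follows. It assumes positions are counted from 1, so that 0 can mean "not in the pattern", and it uses a dictionary in place of the array L. The function name and the choice of lowercase ASCII as the alphabet are illustrative assumptions.

```python
import string


def compute_last_occurrence(pattern, alphabet):
    """Map each character to the 1-based position of its rightmost occurrence.

    Characters of the alphabet that never occur in the pattern map to 0.
    """
    last = {ch: 0 for ch in alphabet}        # initialize all fields to 0
    for position, ch in enumerate(pattern, start=1):
        last[ch] = position                  # later occurrences overwrite earlier ones
    return last


table = compute_last_occurrence("jeff", string.ascii_lowercase)
print(table["j"], table["e"], table["f"], table["a"])  # 1 2 4 0
```

The first loop costs one step per alphabet character and the second one step per pattern character, which is where the two terms of the stated complexity come from.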
16Compute_Good_Suffix_Function
Compute_Good_Suffix_Function(P, M)
/* First get the prefix function. The fields of Y
   represent the distance of the suffix from the
   start of the pattern, using the rightmost
   character as a reference. The function then
   searches the pattern for the next rightmost
   occurrence of the suffix and recommends that
   shift. If there is no other occurrence, it
   recommends a shift of the length of the
   pattern. */
    Pi = Compute_Prefix_Function(P)
    P' = Reverse(P)
    Pi' = Compute_Prefix_Function(P')
    for (i = 0; i <= M; i++)
        Y[i] = M - Pi[M]
    for (j = 1; j <= M; j++)
        i = M - Pi'[j]
        if (Y[i] > j - Pi'[j])
            Y[i] = j - Pi'[j]
    return Y
/* COMPLEXITY: O(M) */
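The good-suffix table can be sketched in Python as below. This is a hedged port that assumes Compute_Prefix_Function is the prefix function from the Knuth-Morris-Pratt algorithm, which the slides do not spell out. Arrays are 1-based to mirror the pseudocode, so index 0 of the prefix array is unused, and the function names are illustrative.

```python
def compute_prefix_function(p):
    """pi[q] = length of the longest proper prefix of p[:q] that is also its suffix."""
    m = len(p)
    pi = [0] * (m + 1)              # 1-based; pi[0] is unused
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]               # fall back to a shorter border
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi


def compute_good_suffix(p):
    """Y[j] = shift to use after a mismatch at 1-based position j (Y[0]: after a match)."""
    m = len(p)
    pi = compute_prefix_function(p)
    pi_rev = compute_prefix_function(p[::-1])   # prefix function of the reversed pattern
    y = [m - pi[m]] * (m + 1)                   # default shift for every position
    for l in range(1, m + 1):
        j = m - pi_rev[l]
        y[j] = min(y[j], l - pi_rev[l])         # keep the smaller proposal
    return y


print(compute_good_suffix("banana"))  # [6, 6, 6, 2, 2, 2, 1]
print(compute_good_suffix("grad"))    # [4, 4, 4, 4, 1]
```

For banana, the entries equal to 2 correspond to the shift of two described on the good-suffix slide.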
17The Main Loop
while (s <= N - M)                            // for every shift
    j = M
    while (j > 0 AND P[j] == T[s + j])        // for the length of the pattern
        j--
    if (j == 0)                               // if you reach the beginning of the pattern,
        print("Pattern FOUND!!! Location: ", s)   // you found the pattern! Tell someone
        s = s + Y[0]                          // and shift the length of the pattern
    else
        s = s + max(Y[j], j - L[T[s + j]])    // else, choose the greater of the
                                              // two heuristic results
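Putting the pieces together, the main loop might look like this in Python. It is a sketch under stated assumptions rather than a definitive implementation: the helper names are invented for illustration, the prefix function is assumed to be the Knuth-Morris-Pratt one, the last-occurrence table is a dictionary holding only the pattern's characters, and matches are returned as 0-based shifts instead of being printed.

```python
def prefix_function(p):
    """1-based prefix function (index 0 unused)."""
    pi = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi


def good_suffix(p):
    """Good-suffix shifts Y[0..m], indexed by 1-based mismatch position."""
    m = len(p)
    pi, pi_rev = prefix_function(p), prefix_function(p[::-1])
    y = [m - pi[m]] * (m + 1)
    for l in range(1, m + 1):
        j = m - pi_rev[l]
        y[j] = min(y[j], l - pi_rev[l])
    return y


def boyer_moore(text, pattern):
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = {ch: i for i, ch in enumerate(pattern, start=1)}  # rightmost 1-based position
    y = good_suffix(pattern)
    matches = []
    s = 0
    while s <= n - m:                       # for every shift
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1                          # compare right to left
        if j == 0:
            matches.append(s)               # found the pattern
            s += y[0]
        else:
            bad = text[s + j - 1]
            # choose the greater of the two heuristic results
            s += max(y[j], j - last.get(bad, 0))
    return matches


print(boyer_moore("Im_a_grad._dad_is_glad", "grad"))  # [5]
print(boyer_moore("abracadabra", "abra"))             # [0, 7]
```

The good-suffix shift is always at least one, so taking the maximum guarantees progress even when the bad-character proposal is negative.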
18Complexity
Worst Case: O((n - m + 1)m + |Sigma|)
- Compute_Last_Occurrence: O(|Sigma| + m)
- Compute_Good_Suffix: O(m)
- Number of shifts: O(n - m + 1)
- Time to check each new shift: O(m)
- Total: (|Sigma| + m) + m + m(n - m + 1)
  = O(NM)
19HOWEVER...
20IN PRACTICE...
21the algorithm takes sub-linear time
22Specifically, in the best case, the algorithm's
running time is O(N/M) (length of text
over length of pattern)
23The running time is best when the letters in the
pattern don't match the letters in the text very
often. Since this is generally the case, the
average running time ends up being approximately
equivalent to the best case: O(N/M) (length
of text over length of pattern).
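The gap between the O(NM) worst case and the O(N/M) best case can be seen by counting character comparisons. The sketch below deliberately uses a simplified variant that applies only the bad-character heuristic and never shifts by less than one. That simplification, the function name, and the test strings are assumptions for illustration. On these two inputs it makes the same shifts as the two-heuristic algorithm in the slides.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by a bad-character-only search."""
    n, m = len(text), len(pattern)
    last = {ch: i for i, ch in enumerate(pattern)}   # rightmost 0-based index
    comparisons = 0
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if pattern[j] != text[s + j]:
                break
            j -= 1
        if j < 0:
            s += 1                                   # full match: move on by one
        else:
            s += max(1, j - last.get(text[s + j], -1))
    return comparisons


n, m = 40, 4
print(count_comparisons("b" * n, "a" * m))  # 10, which is N / M
print(count_comparisons("a" * n, "a" * m))  # 148, which is (N - M + 1) * M
```

In the first call the last pattern character never matches, so every alignment costs one comparison and the pattern jumps its full length. In the second call every alignment is a match, so all M characters are compared and the pattern advances by only one.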
24Conclusion
The Boyer-Moore algorithm is a very good
algorithm. Its worst-case running time, as
analyzed above, is O(NM) (Cole's paper bounds the
number of comparisons linearly in N when the
pattern does not occur in the text); its
best-case running time is sub-linear. Most of the
time it tends toward the best case rather than
the worst case. I recommend the Boyer-Moore
algorithm for searching a string.
Shana Negin 252a-as December 14, 2000 Algorithms
csc252
25Acknowledgements
- Cormen, Chapter 34.5
- Cole, Richard. "Tight Bounds on the Complexity
  of the Boyer-Moore String-Matching Algorithm."
  New York University
- http://www.accessone.com/lorre/pages/bmi.html
- http://www.i.kyushu-u.ac.jp/takeda/PM_DEMO/e.html
26Interesting Uses
William Hsu, in Computer Science at the Johns
Hopkins University, has used the Boyer-Moore
algorithm in a virus detection project.
http://www.mactech.com/articles/mactech/Vol.08/08.02/VirusDetection/
27One Problem
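16-bit Unicode has 65,536 characters, so the
alphabet Sigma is very large. Because the
last-occurrence table needs an entry for every
character in the alphabet, this makes string
searching very time consuming, even using
Boyer-Moore.
http://www-4.ibm.com/software/developer/library/text-searching.html?dwzoneunicodeBoyer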