Title: Boyer-Moore String Searching Algorithm
1. Boyer-Moore String Searching Algorithm
2. String-Searching Algorithms
- The goal of any string-searching algorithm is to
determine whether or not a match of a particular
string exists within another (typically much
longer) string.
- Many such algorithms exist, with varying
efficiencies.
- String-searching algorithms are important to a
number of fields, including computational
biology, computer science, and mathematics.
3. The Boyer-Moore String Search Algorithm
- Developed in 1977, the B-M string search
algorithm is a particularly efficient algorithm,
and has served as a standard benchmark for string
search algorithms ever since.
- This algorithm's execution time can be
sub-linear, as not every character of the string
to be searched needs to be checked.
- Generally speaking, the algorithm gets faster as
the target string becomes longer.
4. How does it work?
- The B-M algorithm takes a backward approach:
the target string is aligned with the start of
the check string, and the last character of the
target string is checked against the
corresponding character in the check string.
- In the case of a match, the second-to-last
character of the target string is compared to the
corresponding check string character, and so on
toward the front of the target string. (So far
there is no gain in efficiency over the
brute-force method.)
- In the case of a mismatch, the algorithm computes
a new alignment for the target string based on
the mismatch. This is where the algorithm gains
considerable efficiency.
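The backward comparison at a single alignment can be sketched as follows. This is only an illustration of the right-to-left check described above, not the full algorithm; the function name compare_right_to_left and the sample strings are assumptions made for the example.

```python
def compare_right_to_left(check, target, start):
    """Compare target with check[start:start + len(target)], last character first.

    Returns the index within target of the first mismatch found,
    or -1 if every character matches at this alignment.
    """
    j = len(target) - 1
    while j >= 0 and check[start + j] == target[j]:
        j -= 1          # characters agree, move one position to the left
    return j


# Alignment 0: the last target character 'r' sits over 't', so index 7 mismatches.
print(compare_right_to_left("a rockstar", "rockstar", 0))  # 7
# Alignment 2: every character agrees.
print(compare_right_to_left("a rockstar", "rockstar", 2))  # -1
```

The returned index is what the shift rules use: it tells the algorithm which check string character caused the failure and how many trailing characters had already matched.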
5. An example
- Target string: rockstar
- Check string: -------x-----
- Aligning the start of each string pairs the
final r of rockstar with x.
- Since x is not a character in rockstar, no
alignment in which the target string still
overlaps this x can match, so it makes no sense to
check alignments beginning at any later character
of the check string up to and including x, and
the B-M algorithm skips all such alignments.
- This eliminates several (7, in this case)
alignments to be checked by the algorithm, and we
needed to compare only two characters (r against
x).
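The count of 7 skipped alignments can be checked with a few lines of Python. This sketch simply lists the start offsets whose window would still cover the x; it treats the check string as schematic and does not test whether each window still fits inside this short example string. The variable names are assumptions made for the illustration.

```python
target = "rockstar"
check = "-------x-----"

m = len(target)
x_pos = m - 1   # at alignment 0 the last target character sits over index 7
assert check[x_pos] == "x" and "x" not in target

# Start offsets (other than 0) at which the target would still overlap the x.
skipped = [s for s in range(1, len(check)) if s <= x_pos < s + m]
print(skipped)      # [1, 2, 3, 4, 5, 6, 7]

# The first alignment worth checking begins just past the x.
next_start = x_pos + 1
print(next_start)   # 8
```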
6. Efficiency of the B-M Algorithm
- The average-case performance of the B-M
algorithm, for a target string of length M and
check string of length N, is roughly N/M
comparisons.
- In the best case, only one in M characters needs
to be checked.
- In the worst case, about 3N comparisons need to
be made when the target string does not occur in
the check string, leading to a complexity of
O(N). (Keeping the search linear when matches do
occur requires a refinement such as the Galil
rule.)
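The best case can be made concrete by counting comparisons. The sketch below is not the full B-M algorithm: it uses the simpler Horspool variant, which compares right to left but shifts using only a single character table, and it is enough to show the "one in M characters" behaviour. The function name search_counting and the test strings are assumptions made for the illustration.

```python
def search_counting(check, target):
    """Horspool-style search that also counts character comparisons.

    Returns (index of first match or -1, number of comparisons made).
    """
    n, m = len(check), len(target)
    # Shift per character: distance from the right end (last character excluded).
    shift = {}
    for i in range(m - 2, -1, -1):
        shift.setdefault(target[i], m - 1 - i)

    comparisons = 0
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if check[start + j] != target[j]:
                break
            j -= 1
        if j < 0:
            return start, comparisons
        # Shift by the entry for the check character under the target's last position.
        start += shift.get(check[start + m - 1], m)
    return -1, comparisons


# N = 1000, M = 10, and no character of the target occurs in the check string:
print(search_counting("a" * 1000, "b" * 10))  # (-1, 100)
```

With N = 1000 and M = 10 the search makes N/M = 100 comparisons, one per alignment, because every mismatch allows a jump of the full target length.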
7. Pre-processing Tables
- The B-M algorithm computes 2 preprocessing tables
to determine the next suitable alignment after
each failed verification.
- The first table calculates how many positions
ahead of the current position to start the next
search (based on the character which caused the
failed verification).
- The second table makes a similar calculation
based on how many characters were matched
successfully before a failed verification.
- These tables are often referred to as jump
tables, though this leads to some ambiguity with
the more common meaning of the term in computer
science, which refers to an efficient way of
transferring control from one part of a program
to another.
8. Calculation of Preprocessing Tables
- Table 1
- Starting at the last character of the target
string, move left toward the first character. At
each character, if the character is not already
in the table, add it to the table.
- This character's shift value is equal to its
distance from the right-most character in the
string. (The right-most character itself, at
distance 0, gets no entry unless it also occurs
earlier in the string.)
- All other characters receive a shift value equal
to the total length of the string.
- Example: peterpan would produce the following
table (character, shift): (A, 1), (P, 2), (R,
3), (E, 4), (T, 5), (all other characters, 8)
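The construction of Table 1 can be sketched in Python as below. The function name bad_character_table is an assumption made for the illustration, and the sketch follows the peterpan example in skipping the final character, so that n falls under "all other characters".

```python
def bad_character_table(target):
    """Map each character to its distance from the right-most character.

    Scans right to left, starting just left of the last character, and
    records only the first (right-most) occurrence of each character.
    """
    m = len(target)
    table = {}
    for i in range(m - 2, -1, -1):
        table.setdefault(target[i], m - 1 - i)
    return table


table = bad_character_table("peterpan")
print(table)                              # {'a': 1, 'p': 2, 'r': 3, 'e': 4, 't': 5}
# Characters without an entry get the full length of the target string.
print(table.get("x", len("peterpan")))    # 8
print(table.get("n", len("peterpan")))    # 8
```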
9. Calculation of Preprocessing Tables
- Table 2
- First, for each value of i less than the length
of the target string, form the partial pattern
made up of the last i characters of the target
string preceded by a mismatch for the character
before them.
- Then, determine the least number of characters
the partial pattern must be shifted left before
it matches the target string again. Positions
that fall off the left end of the target string
count as matching.
- Example: for ANPANMAN, the table would be (i,
pattern, shift): (0, (-N), 1), (1, (-A)N, 8), (2,
(-M)AN, 3), (3, (-N)MAN, 6), (4, (-A)NMAN, 6),
(5, (-P)ANMAN, 6), (6, (-N)PANMAN, 6), (7,
(-A)NPANMAN, 6). (Here, -X means "not X".)
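The Table 2 calculation, and the way the two tables combine during a search, can be sketched as follows. This is a brute-force illustration under stated assumptions rather than an efficient or definitive implementation: the function names are hypothetical, Table 2 is built by trying every shift in turn, the search takes the larger of the two suggested shifts after each mismatch, and only the first match is reported.

```python
def bad_character_table(target):
    """Table 1: distance of each character from the right-most character."""
    m = len(target)
    table = {}
    for i in range(m - 2, -1, -1):
        table.setdefault(target[i], m - 1 - i)
    return table


def good_suffix_table(target):
    """Table 2: shifts[i] is the least shift after i trailing characters
    matched and the character before them mismatched (brute force)."""
    m = len(target)
    shifts = []
    for i in range(m):
        j = m - 1 - i                      # index of the mismatching character
        for s in range(1, m + 1):
            # The matched suffix must line up with itself after shifting ...
            suffix_ok = all(k - s < 0 or target[k - s] == target[k]
                            for k in range(j + 1, m))
            # ... and the character before it must differ from the one that failed.
            differs = j - s < 0 or target[j - s] != target[j]
            if suffix_ok and differs:
                shifts.append(s)
                break
    return shifts


def boyer_moore_search(check, target):
    """Return the index of the first match of target in check, or -1."""
    n, m = len(check), len(target)
    if m == 0:
        return 0
    table1 = bad_character_table(target)
    table2 = good_suffix_table(target)
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0 and check[start + j] == target[j]:
            j -= 1
        if j < 0:
            return start
        matched = m - 1 - j
        shift1 = table1.get(check[start + j], m) - matched
        start += max(shift1, table2[matched])
    return -1


print(good_suffix_table("ANPANMAN"))                          # [1, 8, 3, 6, 6, 6, 6, 6]
print(boyer_moore_search("PANAMAN ANPANMAN", "ANPANMAN"))     # 8
```

Subtracting the number of matched characters from the Table 1 entry converts a distance measured from the right end into a shift of the whole target string; when that value is zero or negative, the Table 2 entry (always at least 1) is used instead.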
10. Comparison of String Searching Algorithm
Complexities
- Boyer-Moore: O(n)
- Naïve string search algorithm: O((n-m+1)m)
- Bitap algorithm: O(mn)
- Rabin-Karp string search algorithm: average
O(n+m)
- (n = length of the string being searched, m =
length of the target string)
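The (n-m+1)m figure for the naïve algorithm can be reproduced by counting comparisons on an unfavourable input. The sketch below is a plain brute-force search; the function name naive_search_comparisons and the input strings are assumptions made for the illustration.

```python
def naive_search_comparisons(check, target):
    """Brute-force search, left to right, trying every alignment in turn.

    Returns (index of first match or -1, number of comparisons made).
    """
    n, m = len(check), len(target)
    comparisons = 0
    for start in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if check[start + j] != target[j]:
                break
        else:
            return start, comparisons   # inner loop finished without a mismatch
    return -1, comparisons


n, m = 20, 5
# Every alignment matches m-1 characters before failing on the last one.
print(naive_search_comparisons("a" * n, "a" * (m - 1) + "b"))  # (-1, 80)
print((n - m + 1) * m)                                         # 80
```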
11. About the Creators
- Robert Boyer is a retired Professor Emeritus of
the University of Texas at Austin Computer
Science Department. He received his BA and PhD
in mathematics at UT Austin, and has authored and
co-authored several books concerning automatic
theorem-proving.
- J. Strother Moore holds the Admiral B.R. Inman
Centennial Chair in Computer Theory in the
Department of Computer Sciences at UT Austin. He
received his BS in mathematics from MIT in 1970,
and his PhD in computational logic from the
University of Edinburgh in 1973. He has authored
and co-authored several books concerning
automatic theorem-proving, some of them in
cooperation with Robert Boyer.
12. References
- Wikipedia.org
- http://www-igm.univ-mlv.fr/~lecroq/string/
- Epp, Susanna S. Discrete Mathematics with
Applications. 3rd ed., Brooks/Cole, 2004.