Title: Matching Algorithms
1Chapter 7
2Chapter Outline
- String Matching
- Straightforward matching
- Finite automata
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
- Approximate string matching
3Prerequisites
- Before beginning this chapter, you should be able to
- Create finite automata
- Use character strings
- Use one- and two-dimensional arrays
- Describe growth rates and order
4Goals
- At the end of this chapter you should be able to
- Explain the substring matching problem
- Explain the straightforward algorithm and its analysis
- Explain the use of finite automata for string matching
5Goals (continued)
- Construct and use a Knuth-Morris-Pratt automaton
- Construct and use slide and jump arrays for the Boyer-Moore algorithm
- Explain the method of approximate string matching
6Matching Algorithms
- The general problem is to find a string of characters in a larger piece of text
- Could also be used to find any pattern of bits or bytes in a larger binary file
- Algorithms find the first occurrence of the string in the larger text
- We will assume that the string length is S and the text length is T when we analyze the algorithms
7Straightforward Matching Example
8Straightforward Matching
- We compare the first character of the string with the first character of the text
- If they match, we move to the next character until we have matched the entire string or found a mismatch
- If there is a mismatch, we move the string by one place and start again
9The Straightforward Algorithm
- subLoc ← 1
- textLoc ← 1
- textStart ← 1
- while textLoc ≤ length(text) and subLoc ≤ length(substring) do
-     if text[textLoc] = substring[subLoc] then
-         textLoc ← textLoc + 1
-         subLoc ← subLoc + 1
-     else
-         textStart ← textStart + 1
-         textLoc ← textStart
-         subLoc ← 1
-     end if
- end while
- if subLoc > length(substring) then
-     return textStart
- else
-     return 0
- end if
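The pseudocode above carries over to Python almost line for line. What follows is a minimal sketch rather than a definitive implementation. The function name `straightforward_match` is an assumption made for illustration, and the sketch keeps the slides' convention of 1-based positions, with 0 meaning the string was not found.

```python
def straightforward_match(text, substring):
    """Return the 1-based position of the first match, or 0 if there is none."""
    sub_loc = 1      # next substring position to compare (1-based)
    text_loc = 1     # next text position to compare (1-based)
    text_start = 1   # where the current attempt begins in the text
    while text_loc <= len(text) and sub_loc <= len(substring):
        if text[text_loc - 1] == substring[sub_loc - 1]:
            # Characters agree: advance in both the text and the substring.
            text_loc += 1
            sub_loc += 1
        else:
            # Mismatch: slide the substring one place and start over.
            text_start += 1
            text_loc = text_start
            sub_loc = 1
    if sub_loc > len(substring):
        return text_start
    return 0
```

For example, `straightforward_match("aaab", "ab")` fails at text positions 1 and 2 before succeeding at position 3.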
10Analysis
- In the worst case, we succeed on each comparison of the string with the text except for the last
- This is possible if the string is all X characters except for one Y at the end and the text is all X characters
- In this case, we do S(T - S + 1) comparisons
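The worst-case count can be checked by counting comparisons directly. The sketch below is illustrative only: `count_comparisons` is a hypothetical helper, and it assumes the matcher tries only the T - S + 1 starting positions where the whole string still fits in the text, which is the situation the formula describes.

```python
def count_comparisons(text, substring):
    """Count character comparisons made by straightforward matching."""
    t, s = len(text), len(substring)
    comparisons = 0
    for start in range(t - s + 1):          # each place the string still fits
        for offset in range(s):
            comparisons += 1
            if text[start + offset] != substring[offset]:
                break                       # mismatch: move on to the next start
        else:
            return comparisons              # every character matched
    return comparisons


S, T = 5, 20
worst_text = "X" * T                        # text of all X characters
worst_string = "X" * (S - 1) + "Y"          # all X except a final Y
total = count_comparisons(worst_text, worst_string)
```

With S = 5 and T = 20, each of the 16 starting positions costs 5 comparisons, so `total` equals S(T - S + 1) = 80.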
11Analysis
- Natural language texts do not have this sort of pattern, so the algorithm will do better with them
- This is because there is an uneven distribution of character use in natural language
- Studies show that this algorithm uses a little over T comparisons on a natural language text
12Finite Automata
- Finite automata are used to decide whether a word is in a given language
- We could set up a finite automaton to accept the string we are looking for and then if we wind up in the accepting state, we know we found the string and can stop
13Finite Automata
- Because we will look at each text character once, this will do at most T comparisons
- However, the algorithm to construct a finite automaton from a string takes a long time
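To make the trade-off concrete, here is a small sketch of automaton-based matching. It is not the construction from any particular textbook: `build_automaton` and `automaton_match` are hypothetical names, and the transition table is built in the most direct (and slow) way, by testing which prefix of the string is the longest suffix of what has been read. Characters that do not occur in the string are assumed to send the automaton back to state 0.

```python
def build_automaton(pattern, alphabet):
    """Return delta, where delta[state][ch] is the next state."""
    m = len(pattern)
    delta = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            # Next state = length of the longest pattern prefix ending `seen`.
            k = min(m, state + 1)
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def automaton_match(text, pattern):
    """Return the 1-based position of the first match, or 0 if there is none."""
    m = len(pattern)
    if m == 0:
        return 1
    delta = build_automaton(pattern, set(pattern))
    state = 0
    for position, ch in enumerate(text, start=1):
        state = delta[state].get(ch, 0)     # one table lookup per text character
        if state == m:                      # accepting state reached
            return position - m + 1
    return 0
```

The matching loop touches each text character once, while the nested loops in `build_automaton` show where the construction cost comes from.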
14Knuth-Morris-Pratt Algorithm
- For each character comparison, we can either succeed or fail
- The Knuth-Morris-Pratt (KMP) algorithm constructs an automaton that labels the nodes with the string characters and has a success and fail link for each node
15Knuth-Morris-Pratt Algorithm
- The success links are easy to determine because they just take us to the next node
- The fail links will take us back in the automaton and are based on the string we are trying to match
- We will get a new character of the text when we succeed in matching, but will reuse that character if we fail
16Knuth-Morris-Pratt Example
- The automaton for the string ababcd would be
17Knuth-Morris-Pratt Matching Algorithm
- subLoc ← 1
- textLoc ← 1
- while textLoc ≤ length(text) and subLoc ≤ length(substring) do
-     if subLoc = 0 or text[textLoc] = substring[subLoc] then
-         textLoc ← textLoc + 1
-         subLoc ← subLoc + 1
-     else
-         subLoc ← fail[subLoc]
-     end if
- end while
- if subLoc > length(substring) then
-     return textLoc - length(substring)
- else
-     return 0
- end if
18Knuth-Morris-Pratt Failure Link Algorithm
- fail[1] ← 0
- for i ← 2 to length(substring) do
-     temp ← fail[i - 1]
-     while temp > 0 and substring[temp] ≠ substring[i - 1] do
-         temp ← fail[temp]
-     end while
-     fail[i] ← temp + 1
- end for
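The failure link construction and the matching loop can be tried out together. The sketch below is a hedged Python rendering of the two KMP routines, not an authoritative implementation. The names `build_fail` and `kmp_match` are assumptions, and index 0 of the `fail` list is left unused so that the indices line up with the 1-based pseudocode.

```python
def build_fail(substring):
    """Return the fail links as a 1-based list (index 0 is unused)."""
    s = len(substring)
    fail = [0] * (s + 1)                    # fail[1] = 0 is already in place
    for i in range(2, s + 1):
        temp = fail[i - 1]
        # Follow fail links until the characters agree or we run out of links.
        while temp > 0 and substring[temp - 1] != substring[i - 2]:
            temp = fail[temp]
        fail[i] = temp + 1
    return fail


def kmp_match(text, substring):
    """Return the 1-based position of the first match, or 0 if there is none."""
    fail = build_fail(substring)
    sub_loc = 1
    text_loc = 1
    while text_loc <= len(text) and sub_loc <= len(substring):
        if sub_loc == 0 or text[text_loc - 1] == substring[sub_loc - 1]:
            text_loc += 1                   # success link: take a new text character
            sub_loc += 1
        else:
            sub_loc = fail[sub_loc]         # fail link: reuse the same text character
    if sub_loc > len(substring):
        return text_loc - len(substring)
    return 0
```

For the string ababcd, `build_fail` gives the links 0, 1, 1, 2, 3, 1 for nodes 1 through 6, so a failure at the c node falls back to the second a rather than to the start.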
19Failure Link Analysis
- The ≠ comparison will be false at most S - 1 times
- The fail links are all smaller than their index
- temp is decreased each time the ≠ is true
- The while loop is not done on the first pass
20Failure Link Analysis
- The variable temp is incremented by 1 for the next pass because of
- The final statement of the for loop
- The increment of i
- The first statement of the for loop body (temp ← fail[i - 1])
- There are S - 2 next passes, so temp is incremented S - 2 times
- Because fail[1] = 0, temp never becomes negative
21Failure Link Analysis
- temp starts at 0 and is incremented no more than S - 2 times
- Because temp is decreased for each mismatched comparison, there are at most S - 2 failed comparisons
- There are S - 1 successful comparisons, so there are at most 2S - 3 comparisons
22Match Algorithm Analysis
- The while loop does one character comparison per pass
- Either textLoc and subLoc are incremented or subLoc is decremented
- Because textLoc starts at 1 and the loop stops once it exceeds T, it is incremented no more than T times
23Match Algorithm Analysis
- Because subLoc starts at 1, is incremented no more than T times, and never drops below 0, it is decremented no more than T times
- This means that the then clause is done no more than T times and the else clause is done no more than T times, so there are no more than 2T comparisons
24Knuth-Morris-Pratt Analysis
- The fail link construction takes at most 2S - 3 comparisons and the matching takes at most 2T comparisons
- The KMP algorithm is O(S + T), whereas the straightforward algorithm is O(S * T)
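These bounds can be checked empirically by counting comparisons. The sketch below is an instrumented variant written only for this check; `kmp_comparison_counts` is a hypothetical name, and the inner loop of the fail link construction is restructured slightly so that each character comparison can be counted.

```python
def kmp_comparison_counts(text, substring):
    """Return (fail-link comparisons, matching comparisons) for one KMP run."""
    s, t = len(substring), len(text)

    # Build the fail links, counting every character comparison.
    fail = [0] * (s + 1)                    # 1-based; fail[1] = 0
    build_count = 0
    for i in range(2, s + 1):
        temp = fail[i - 1]
        while temp > 0:
            build_count += 1
            if substring[temp - 1] == substring[i - 2]:
                break
            temp = fail[temp]
        fail[i] = temp + 1

    # Run the matcher, counting every text/substring comparison.
    match_count = 0
    sub_loc = text_loc = 1
    while text_loc <= t and sub_loc <= s:
        if sub_loc != 0:
            match_count += 1                # a real character comparison happens
        if sub_loc == 0 or text[text_loc - 1] == substring[sub_loc - 1]:
            text_loc += 1
            sub_loc += 1
        else:
            sub_loc = fail[sub_loc]
    return build_count, match_count


# Worst case for the straightforward algorithm: S = 5, T = 20.
build_count, match_count = kmp_comparison_counts("X" * 20, "XXXXY")
```

On this input the straightforward algorithm needs S(T - S + 1) = 80 comparisons, while both KMP counts stay within the 2S - 3 and 2T limits.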
25Boyer-Moore Algorithm
- If we match from the right of the string, a
mismatch might help us move the string a bigger
distance in the text to skip over other mismatch
locations that can be predicted
26Boyer-Moore Algorithm
- We also have to consider what we have already matched, so that we do not make too small a move
- If we move the string by one position to line up the two t characters we will fail quickly, but that could be predicted
27Boyer-Moore Algorithm
- This algorithm calculates a slide and a jump move
- The slide value tells us how much the pattern should be moved to line up the text character that did not match
- The jump value tells us how much to move the pattern to line up the end characters that matched with their occurrence earlier in the string
28Boyer-Moore Matching Algorithm
- textLoc ← length(pattern)
- patternLoc ← length(pattern)
- while (textLoc ≤ length(text)) and (patternLoc > 0) do
-     if text[textLoc] = pattern[patternLoc] then
-         textLoc ← textLoc - 1
-         patternLoc ← patternLoc - 1
-     else
-         textLoc ← textLoc + MAXIMUM(slide[text[textLoc]], jump[patternLoc])
-         patternLoc ← length(pattern)
-     end if
- end while
- if patternLoc = 0 then
-     return textLoc + 1
- else
-     return 0
- end if
29Deciding on a Slide Value
30Boyer-Moore Slide Array Algorithm
- for every ch in the character set do
-     slide[ch] ← length(pattern)
- end for
- for i ← 1 to length(pattern) do
-     slide[pattern[i]] ← length(pattern) - i
- end for
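A short sketch shows what the slide array looks like for a concrete pattern. This is an illustration under assumptions, not the slides' own code: `build_slide` is a hypothetical name, and instead of filling an entry for every character in the character set, it stores only the pattern's characters in a dictionary and treats any other character as having the default slide of `len(pattern)`.

```python
def build_slide(pattern):
    """Map each pattern character to its slide distance.

    Characters that are not in the returned dictionary are assumed to
    slide by len(pattern), e.g. via slide.get(ch, len(pattern)).
    """
    m = len(pattern)
    slide = {}
    for i in range(1, m + 1):               # 1-based positions, as in the slides
        # Later occurrences overwrite earlier ones, so the rightmost one wins.
        slide[pattern[i - 1]] = m - i
    return slide


slide = build_slide("datadata")
```

For the pattern datadata the result is 3 for d, 1 for t, and 0 for a, because the rightmost d, t, and a sit 3, 1, and 0 places from the end of the pattern.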
31Boyer-Moore Jump Array Algorithm
- for i ← 1 to length(pattern) do
-     jump[i] ← 2 * length(pattern) - i
- end for
- test ← length(pattern)
- target ← length(pattern) + 1
- while test > 0 do
-     link[test] ← target
-     while target ≤ length(pattern) and pattern[test] ≠ pattern[target] do
-         jump[target] ← MINIMUM(jump[target], length(pattern) - test)
-         target ← link[target]
-     end while
-     test ← test - 1
-     target ← target - 1
- end while
32Boyer-Moore Jump Array Algorithm
- for i ← 1 to target do
-     jump[i] ← MINIMUM(jump[i], length(pattern) + target - i)
- end for
- temp ← link[target]
- while target < length(pattern) do
-     while target ≤ temp do
-         jump[target] ← MINIMUM(jump[target], temp - target + length(pattern))
-         target ← target + 1
-     end while
-     temp ← link[temp]
- end while
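Putting the pieces together, the following sketch renders the jump array calculation and the Boyer-Moore matching loop in Python. It is offered as a hedged illustration of the pseudocode, not as a reference implementation. The names `build_jump` and `boyer_moore_match` are assumptions, the arrays keep an unused index 0 so the indices match the 1-based slides, the slide values are held in a dictionary with a default of `len(pattern)`, and a non-empty pattern is assumed.

```python
def build_jump(pattern):
    """Return the jump array as a 1-based list (assumes a non-empty pattern)."""
    m = len(pattern)
    jump = [0] * (m + 1)
    link = [0] * (m + 2)
    for i in range(1, m + 1):
        jump[i] = 2 * m - i                 # largest possible jump
    test, target = m, m + 1
    while test > 0:
        link[test] = target
        while target <= m and pattern[test - 1] != pattern[target - 1]:
            jump[target] = min(jump[target], m - test)
            target = link[target]
        test -= 1
        target -= 1
    for i in range(1, target + 1):
        jump[i] = min(jump[i], m + target - i)
    temp = link[target]
    while target < m:
        while target <= temp:
            jump[target] = min(jump[target], temp - target + m)
            target += 1
        temp = link[temp]
    return jump


def boyer_moore_match(text, pattern):
    """Return the 1-based position of the first match, or 0 if there is none."""
    m = len(pattern)
    if m == 0:
        return 1
    slide = {ch: m - i for i, ch in enumerate(pattern, start=1)}
    jump = build_jump(pattern)
    text_loc = pattern_loc = m              # start at the right end of the pattern
    while text_loc <= len(text) and pattern_loc > 0:
        if text[text_loc - 1] == pattern[pattern_loc - 1]:
            text_loc -= 1
            pattern_loc -= 1
        else:
            # Take the larger of the two moves, then restart at the right end.
            text_loc += max(slide.get(text[text_loc - 1], m), jump[pattern_loc])
            pattern_loc = m
    if pattern_loc == 0:
        return text_loc + 1
    return 0
```

Because every jump value is at least one more than the number of characters already matched, taking the maximum of the slide and jump values always moves the pattern forward, even when the slide value alone would not.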
33Jump Array Calculation Example
34Boyer-Moore Analysis
- The slide array calculation does O(A + P) assignments but no comparisons, where A is the size of the character set and P is the length of the pattern
- The jump array calculation at worst compares all of the pattern characters with those appearing later for O(P^2) comparisons
35Boyer-Moore Analysis
- Studies have shown that with natural language text, and a pattern of six or more characters, there are at most 0.4T comparisons
- As the length of the pattern increases, the algorithm has a lower value of about 0.25T comparisons
36Approximate String Matching
- Spelling checkers will make suggestions of close words that could have been intended for misspelled words
- This involves finding words that are close to the misspelled word
- We will talk about approximate string matching in terms of a string and text as in the other algorithms
37Common errors
- The string could have characters that are missing from the text
- The text could have characters that are missing from the string
- There could be a character in the string or the text that needs to be changed
38Errors Example
- Matching the string "ad" with the text "read" we could have
- 2 mismatches in the first position or a missing "re" from the string
- 2 mismatches in the second position or just a missing "e" from the string
39The Algorithm
- This can be complex because it might be that a better match occurs if we look at other possibilities
- In the example above, for the second position there were 2 mismatches of characters, but we get a better result if we add just one character to the string
40The Algorithm
- To keep the algorithm a little simpler, we use a larger structure to keep track of what we have found so far
- In this case, we will keep a two-dimensional array with the best matches found so far
- This array will have a row for each character of the string and a column for each character of the text
41The Array
- For each location of the array diffs[i, j], we will choose the minimum of
- diffs[i - 1, j - 1] if string[i] = text[j], otherwise diffs[i - 1, j - 1] + 1
- diffs[i - 1, j] + 1
- diffs[i, j - 1] + 1
42Example
- If we compare the string "trim" with the text "try the trumpet" we get
43Analysis
- We do not really need the entire array, but just need two columns: the current one and the previous one on which it is based
- We compare each string character with each text character and so do S * T comparisons
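The two-column idea can be sketched in a few lines of Python. This is an illustrative sketch, not the slides' own algorithm text. The name `approximate_match` is hypothetical, and the boundary values are assumptions that the slides do not state: row 0 is taken to be all zeros (a match may begin anywhere in the text) and column 0 is taken to be diffs[i, 0] = i.

```python
def approximate_match(string, text):
    """For each text position j, return the fewest differences between
    the string and some piece of the text that ends at position j."""
    s = len(string)
    previous = list(range(s + 1))           # assumed boundary: diffs[i, 0] = i
    best = []
    for text_ch in text:
        current = [0] * (s + 1)             # assumed boundary: diffs[0, j] = 0
        for i in range(1, s + 1):
            mismatch = 0 if string[i - 1] == text_ch else 1
            current[i] = min(
                previous[i - 1] + mismatch, # diffs[i-1, j-1], +1 if they differ
                current[i - 1] + 1,         # diffs[i-1, j] + 1
                previous[i] + 1,            # diffs[i, j-1] + 1
            )
        best.append(current[s])             # bottom entry of this column
        previous = current                  # keep only two columns
    return best
```

For the string "ad" and the text "read" this returns `[2, 2, 1, 0]`: the final 0 marks the exact match ending at the d. Only `previous` and `current` are kept, yet the nested loops still compare every string character with every text character.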