Title: CSE 024: Design
1. CSE 024 Design and Analysis of Algorithms
- Chapter 8 String Searching
- Sedgewick Chp19
2. Course Content
- Introduction, Algorithmic Notation and Flowcharts (Brassard Bratley Chapter 3)
- Efficiency of Algorithms (Brassard Bratley Chapter 2)
- Basic Data Structures (Brassard Bratley Chapter 5)
- Sorting (Weiss Chp 7)
- Searching (Brassard Bratley Chapter 9)
- Graph Algorithms (Weiss Chp 9)
- Randomized Algorithms (Weiss Chp 10)
- String Searching (Sedgewick Chp 19)
- NP Completeness (Sedgewick Chp 40)
3. Lecture Content
- String Applications
- String Operations
- What is Pattern Matching?
- Brute-Force Algorithm
- Knuth-Morris-Pratt Algorithm
- Boyer-Moore Algorithm
4. String Applications
- Processing text means dealing with character strings
- Strings come from a wide variety of sources
- DNA applications
- P = "CGTAAACTGCTTTAATCAAACGC"
- News headline
- R = "U.S. Men Win Soccer World Cup!"
- URL of a Web site
- S = "http://www.wiley.com/college/goodrich"
- String operations
- Breaking large strings into smaller strings
- Pattern matching
5. String Concepts
- Assume S is a string of size m.
- A substring S[i..j] of S is the string fragment between indexes i and j.
- A prefix of S is a substring S[0..i]
- A suffix of S is a substring S[i..m-1]
- i is any index between 0 and m-1
6. Examples
- S = "andrew" (indexes 0 to 5)
- Substring S[1..3] = "ndr"
- All possible prefixes of S
- "andrew", "andre", "andr", "and", "an", "a"
- All possible suffixes of S
- "andrew", "ndrew", "drew", "rew", "ew", "w"
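The index notation on the last two slides maps directly onto slicing. A minimal Python sketch, assuming both bounds are inclusive (as S[1..3] = "ndr" suggests); the helper names substring, prefixes and suffixes are illustrative and do not come from the slides:

```python
def substring(s, i, j):
    """S[i..j] with both ends inclusive."""
    return s[i:j + 1]

def prefixes(s):
    """All prefixes S[0..i] for i = 0 .. m-1, shortest first."""
    return [s[:i + 1] for i in range(len(s))]

def suffixes(s):
    """All suffixes S[i..m-1] for i = 0 .. m-1, longest first."""
    return [s[i:] for i in range(len(s))]

S = "andrew"
print(substring(S, 1, 3))
print(prefixes(S))
print(suffixes(S))
```

A string of size m has exactly m non-empty prefixes and m non-empty suffixes, and S itself is both.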
7. What is Pattern Matching?
- Definition
- given a text string T and a pattern string P, find the pattern inside the text
- T = "the rain in spain stays mainly on the plain"
- P = "n th"
- Applications
- text editors,
- Web search engines
- e.g. Google
- image analysis
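As a quick check of the definition, Python's built-in str.find returns the 0-based index of the first occurrence of a pattern, or -1 if there is none. This only illustrates what pattern matching computes; it says nothing about how the algorithms in this lecture do it:

```python
T = "the rain in spain stays mainly on the plain"
P = "n th"

pos = T.find(P)                          # index of the first occurrence
print(pos, repr(T[pos:pos + len(P)]))    # prints: 32 'n th'
```

The match is the "n th" spanning "on the".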
8. Brute-Force Algorithm
- Check each position in the text T
- to see if the pattern P starts in that position
[Figure: the pattern P = "rew" slides along the text T = "andrew"; P moves 1 char at a time through T]
9. Brute-Force Algorithm
- The obvious method for pattern matching
- Just check, for each possible position in the text
- at which the pattern could match
- Search for the first occurrence of
- a pattern p[1..M] in a text string a[1..N]
- keep one pointer (i) into the text,
- another pointer (j) into the pattern.
- As long as they point to matching characters,
- both pointers are incremented.
- If the end of the pattern is reached (j > M),
- then a match has been found.
- If i and j point to mismatching characters,
- then j is reset to point to the beginning of the pattern
- i is reset to correspond to moving the pattern to the right one position for matching against the text.
- If the end of the text is reached (i > N)
- then there is no match.
- If the pattern does not occur in the text,
- the value N+1 is returned.
int bruteSearch(int M, int N)
  int i, j
  i = 1; j = 1
  repeat
    if a[i] == p[j] then i++; j++
    else i = i-j+2; j = 1
  until (j > M) or (i > N)
  if j > M then return i-M else return i
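The two-pointer procedure described on this slide can be sketched in Python. This is a minimal sketch under stated assumptions: indexes are 0-based (the slide uses 1-based arrays), the pattern and text are passed as arguments rather than held in global arrays p and a, and "no match" is reported as len(text), the 0-based counterpart of N+1. The name brute_search is illustrative.

```python
def brute_search(pattern, text):
    """Return the 0-based index of the first match, or len(text) if none."""
    m, n = len(pattern), len(text)
    i = j = 0                      # i points into the text, j into the pattern
    while j < m and i < n:
        if text[i] == pattern[j]:
            i += 1                 # characters match: advance both pointers
            j += 1
        else:
            i = i - j + 1          # move the pattern one position to the right
            j = 0                  # and restart at the beginning of the pattern
    return i - m if j == m else n

print(brute_search("rew", "andrew"))   # prints: 3
print(brute_search("xyz", "andrew"))   # prints: 6 (no match)
```

On a mismatch the text pointer moves back by j-1 positions, which is exactly the backing up that the Knuth-Morris-Pratt algorithm avoids.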
10. Analysis of Algorithm
- Brute force pattern matching
- runs in time O(mn) in the worst case.
- But most searches of ordinary text
- take O(m+n), which is very quick.
- The brute force algorithm is fast
- when the alphabet of the text is large
- e.g. A..Z, a..z, 1..9, etc.
- It is slower
- when the alphabet is small
- e.g. 0, 1 (as in binary files, image files, etc.)
11. Algorithm Performance
- In a text-editing application,
- the inner loop of this program is seldom iterated,
- the running time is very nearly proportional
- to the number of text characters examined: O(n)
- For example, look for the pattern STING in
- A STRING SEARCHING EXAMPLE CONSISTING OF SIMPLE TEXT
- the statement j = j+1 is executed only four times before the match
- once for each S
- but twice for the first ST
12. Algorithm Performance: Binary Strings
- Example
- pattern is 00000001
- the text string is
- 00000000000000000000000000000000000000000000000000001
- then j is incremented
- 7*45 (315) times before the match
- search for 10010111 in the binary string
- 100111010010010010010111000111
- partial matches at successive attempts: 1001, 1, 10, 10010, 10010, 10010, then the full match 10010111
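The two counts quoted for brute-force search (four increments of j for STING, and 7*45 = 315 for the binary pattern) can be checked by counting the characters matched at false starts. A minimal sketch, assuming 0-based indexes and assuming the binary text is 52 zeros followed by a single 1, which is the length that makes the count come out to 7*45; the function name is illustrative:

```python
def matches_before_hit(pattern, text):
    """Return (index of first match or -1, number of characters that matched
    at false starts before that match). Each such character corresponds to
    one increment of j in the brute-force search."""
    m, n = len(pattern), len(text)
    wasted = 0
    for start in range(n - m + 1):
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start, wasted
        wasted += j                # j characters matched before the mismatch
    return -1, wasted

english = "A STRING SEARCHING EXAMPLE CONSISTING OF SIMPLE TEXT"
print(matches_before_hit("STING", english))              # prints: (32, 4)
print(matches_before_hit("00000001", "0" * 52 + "1"))    # prints: (45, 315)
```

In the English text almost every false start fails on its first character, while in the binary text each of the 45 false starts matches 7 characters before failing.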
13. Knuth-Morris-Pratt Algorithm
- The basic idea is
- when a mismatch is detected,
- our false start consists of characters
- we know in advance
- since they are in the pattern.
- take advantage of this information
- instead of backing up the i pointer
- over all those known characters
14. The Knuth-Morris-Pratt (KMP) algorithm
- looks for the pattern in the text
- in a left-to-right order
- like the brute force algorithm
- But it shifts the pattern more intelligently
- If a mismatch occurs
- between the text and pattern P at P[j],
- what is the most we can shift the pattern
- to avoid wasteful comparisons?
- Answer: the largest prefix of P[0..j-1]
- that is a suffix of P[1..j-1]
15. KMP Algorithm is a Generalization
- Assume first character in the pattern
- doesn't appear again in the pattern
- say the pattern is 10000000
- Suppose we have a false start j characters long
- at some position in the text.
- When the mismatch is detected,
- j characters have already matched,
- No need to back up the text pointer i
- none of the previous j-1 characters in the text
- can match the first character in the pattern.
- This change could be implemented by replacing
- i = i-j+2 in the program above by i++
- The practical effect of this change is limited
- such a specialized pattern is not particularly likely to occur,
- but the idea is worth thinking about
- Knuth-Morris-Pratt algorithm is a generalization.
- it is always possible to arrange things
- so that the i pointer is never decremented.
16.
- Fully skipping past the pattern
- on detecting a mismatch won't work
- when the pattern could match itself
- at the point of the mismatch.
- For example,
- searching for 10100111 in 1010100111
- First mismatch at the fifth character,
- but it is better to back up
- to the third character to continue the search,
- since otherwise we would miss the match.
- But we can figure out ahead of time what to do
- because it depends only on the pattern
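The restart position for this example can be computed straight from the rule "largest prefix of P[0..j-1] that is a suffix of P[1..j-1]". A minimal sketch, assuming 0-based indexes (so the mismatch at the fifth character is j = 4); the function name restart_index is illustrative, and the loop is a direct, unoptimized reading of the definition rather than the table construction KMP actually uses:

```python
def restart_index(pattern, j):
    """Length of the longest prefix of pattern[0:j] that is also a suffix
    of pattern[1:j]. After a mismatch at pattern[j], matching can resume
    at this pattern index without moving the text pointer."""
    for k in range(j - 1, 0, -1):          # try the longest candidate first
        if pattern[:k] == pattern[j - k:j]:
            return k
    return 0

P = "10100111"
T = "1010100111"
j = 4                                      # mismatch at the fifth character
print(restart_index(P, j))                 # prints: 2
```

Resuming at pattern index 2 (the third character) with the text pointer still at index 4 corresponds to a pattern start of 4 - 2 = 2, which is where the match actually is. Restarting at pattern index 0 instead would skip over it.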
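17. Example
[Figure: text T with pointer i and pattern P with pointer j; after a mismatch at j = 5, the pattern is shifted so that jnew = 2]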
18. Knuth-Morris-Pratt Algorithm
- The pseudocode and explanation
int kmpsearch(int M, int N)
  int i, j
  i = 1; j = 1
  repeat
    if (j == 0) or (a[i] == p[j]) then i++; j++
    else j = next[j]
  until (j > M) or (i > N)
  if j > M then return i-M else return i
- When i and j point to mismatching characters
- testing for a pattern match
- beginning at position i-j+1 in the text string
- then the next possible position for a pattern match
- is beginning at position i-next[j]+1.
- the first next[j]-1 characters at that position match
- the first next[j]-1 characters of the pattern,
- so there's no need to back up the i pointer that far
- we can simply leave the i pointer unchanged
- and set the j pointer to next[j]
19. Knuth-Morris-Pratt Algorithm
- The pseudocode and explanation
int initnext(int M, int N)
  int i, j
  i = 1; j = 0; next[1] = 0
  repeat
    if (j == 0) or (p[i] == p[j]) then i++; j++; next[i] = j
    else j = next[j]
  until i > M
- Just after i and j are incremented,
- the first j-1 characters of the pattern match
- the characters in positions p[i-j+1..i-1],
- the last j-1 characters in the first i-1 characters of the pattern.
- this is the largest j with this property,
- otherwise a possible match of the pattern with itself would be missed.
- Thus, j is exactly the value to be assigned to next[i].
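The table construction and the search fit together as sketched below in Python. This is a sketch under stated assumptions, not a transcription of the pseudocode: it uses 0-based indexes and the common "failure function" convention, where fail[j] is the length of the longest proper prefix of pattern[0..j] that is also a suffix of it, instead of the 1-based next array. The names build_fail and kmp_search are illustrative, and "no match" is reported as len(text).

```python
def build_fail(pattern):
    """fail[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of pattern[:j+1]."""
    fail = [0] * len(pattern)
    k = 0                                   # length of the current border
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]                 # fall back to a shorter border
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail

def kmp_search(pattern, text):
    """Return the 0-based index of the first match, or len(text) if none."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    fail = build_fail(pattern)
    j = 0                                   # number of pattern chars matched
    for i in range(n):                      # i only ever moves forward
        while j > 0 and text[i] != pattern[j]:
            j = fail[j - 1]                 # shift the pattern, keep i
        if text[i] == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1
    return n

print(build_fail("10100111"))               # prints: [0, 0, 1, 2, 0, 1, 1, 1]
print(kmp_search("10100111", "1010100111")) # prints: 2
```

Because the loop variable i is never decremented, the text can be consumed as a stream, one character at a time.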
20. Advantages
- KMP runs in optimal time O(m+n)
- very fast
- The algorithm never needs
- to move backwards in the input text, T
- this makes the algorithm good
- for processing very large files
- read in from external devices
- or through a network stream
21. Disadvantages
- KMP doesn't work so well
- as the size of the alphabet increases
- more chance for a mismatch
- mismatches tend to occur
- early in the pattern
- but KMP is faster
- when the mismatches occur later
22. Boyer-Moore Algorithm
- a significantly faster string searching method
- scan the pattern from right to left
- when trying to match it against the text.
- When searching for our sample pattern 10100111
- if we find matches
- on the eighth, seventh, and sixth character
- but not on the fifth,
- then we can immediately slide
- the pattern seven positions to the right,
- and check the fifteenth character next.
23. Heuristics Used
- based on two heuristics
- 1. The looking-glass technique
- find P in T by moving backwards
- through P, starting at its end
- 2. The character-jump technique
- when a mismatch occurs at T[i] = x
- the character in pattern P[j]
- is not the same as T[i]
[Figure: text T with the mismatched character x at position i, aligned against pattern P at position j]
There are 3 possible cases, tried in order.
24. Case 1
- If P contains x somewhere,
- then try to shift P right
- to align the last occurrence of x in P with T[i].
[Figure: T and P before and after the shift; the last x in P is aligned with T[i], and i and j move right, so j is at the end of P]
25. Case 2
- If P contains x somewhere,
- but a shift right
- to the last occurrence is not possible,
- then shift P right by 1 character to T[i+1].
[Figure: T and P before and after the shift; the last x in P is after position j, so P moves right by 1, and i and j move right, so j is at the end of P]
26. Case 3
- If cases 1 and 2 do not apply,
- then shift P to align P[0] with T[i+1].
[Figure: T and P before and after the shift; there is no x in P, so P[0] is aligned with T[i+1], and i and j move right, so j is at the end of P]
27. Boyer-Moore Algorithm
- Pseudocode and Explanation
int mischarsearch(int M, int N)
  int i, j
  i = M; j = M
  repeat
    if a[i] == p[j] then i--; j--
    else
      i = i+M-j+1; j = M
      if (skip[index(a[i])] > M-j+1) then i = i+skip[index(a[i])]-(M-j+1)
  until (j < 1) or (i > N)
  return i+1
- It simply improves a brute-force right-to-left pattern scan by using an array skip
- that tells, for each character in the alphabet,
- how far to skip if that character appears in the text and causes a mismatch
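The looking-glass and character-jump heuristics, with their three cases, can be sketched in Python as follows. This is a sketch of the character-jump (last-occurrence) variant only, not a transcription of the pseudocode above. Assumptions: 0-based indexes, a dictionary named last in place of the skip array, and len(text) as the "no match" value; the name bm_search is illustrative.

```python
def bm_search(pattern, text):
    """Return the 0-based index of the first match, or len(text) if none."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # last[c] = index of the last occurrence of c in the pattern
    last = {c: k for k, c in enumerate(pattern)}
    i = j = m - 1                        # start comparing at the pattern's end
    while i < n:
        if text[i] == pattern[j]:
            if j == 0:
                return i                 # whole pattern matched
            i -= 1                       # looking-glass: move backwards
            j -= 1
        else:
            l = last.get(text[i], -1)    # -1 when x = text[i] is not in P
            # Case 1 (l < j):  align the last x in P with text[i]
            # Case 2 (l > j):  shift P right by one character
            # Case 3 (l = -1): align P[0] with text[i+1]
            i += m - min(j, l + 1)
            j = m - 1                    # j goes back to the end of P
    return n

print(bm_search("rew", "andrew"))             # prints: 3
print(bm_search("10100111", "1010100111"))    # prints: 2
```

All three cases collapse into the single expression m - min(j, l + 1), which is why the loop body stays short.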
28. Boyer-Moore Example (1)
[Figure: text T and pattern P]
29. Boyer-Moore Example (2)
[Figure: text T and pattern P]
30. Boyer-Moore Algorithm Analysis
- Boyer-Moore worst case running time is
- O(nm + A)
- Boyer-Moore is fast
- when the alphabet (A) is large,
- slow
- when the alphabet is small.
- e.g. good for English text, poor for binary
- Boyer-Moore is
- significantly faster than brute force
- for searching English text.
31. Worst Case Example
[Figure: text T and pattern P]
32. Chapter 8 Text Processing