CSE 024: Design - PowerPoint PPT Presentation

CSE 024: Design & Analysis of Algorithms, Chapter 8: String Searching (Sedgewick, Chapter 19)

Provided by: Yusu73

1
CSE 024: Design & Analysis of Algorithms
  • Chapter 8: String Searching
  • Sedgewick, Chapter 19

2
Course Content
  1. Introduction, Algorithmic Notation and Flowcharts
    (Brassard & Bratley, Chapter 3)
  2. Efficiency of Algorithms (Brassard & Bratley,
    Chapter 2)
  3. Basic Data Structures (Brassard & Bratley,
    Chapter 5)
  4. Sorting (Weiss, Chapter 7)
  5. Searching (Brassard & Bratley, Chapter 9)
  6. Graph Algorithms (Weiss, Chapter 9)
  7. Randomized Algorithms (Weiss, Chapter 10)
  8. String Searching (Sedgewick, Chapter 19)
  9. NP Completeness (Sedgewick, Chapter 40)

3
Lecture Content
  • String Applications
  • String Operations
  • What is Pattern Matching?
  • Brute-Force Algorithm
  • Knuth-Morris-Pratt Algorithm
  • Boyer-Moore Algorithm

4
String Applications
  • Processing text means dealing with character strings
  • Strings come from a wide variety of sources
  • DNA applications
  • P = "CGTAAACTGCTTTAATCAAACGC"
  • News headline
  • R = "U.S. Men Win Soccer World Cup!"
  • URL of a Web site
  • S = "http://www.wiley.com/college/goodrich"
  • String operations
  • Breaking large strings into smaller strings
  • Pattern matching

5
String Concepts
  • Assume S is a string of size m.
  • A substring S[i .. j] of S is the string fragment
    between indexes i and j.
  • A prefix of S is a substring S[0 .. i]
  • A suffix of S is a substring S[i .. m-1]
  • i is any index between 0 and m-1

6
Examples
  • S = "andrew" (indexes 0 to 5)
  • Substring S[1..3] = "ndr"
  • All possible prefixes of S
  • "andrew", "andre", "andr", "and", "an", "a"
  • All possible suffixes of S
  • "andrew", "ndrew", "drew", "rew", "ew", "w"
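
The substring, prefix and suffix definitions can be checked with a few lines of Python. This is only a sketch: the helper names `prefixes` and `suffixes` are made up for illustration, and Python slices exclude the end index, so S[1..3] is written `S[1:4]`.

```python
def prefixes(s):
    """All prefixes s[0..i], longest first."""
    return [s[:i] for i in range(len(s), 0, -1)]


def suffixes(s):
    """All suffixes s[i..m-1], longest first."""
    return [s[i:] for i in range(len(s))]


S = "andrew"
print(S[1:4])        # substring S[1..3]
print(prefixes(S))
print(suffixes(S))
```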

7
What is Pattern Matching?
  • Definition
  • given a text string T and a pattern string P,
    find the pattern inside the text
  • T = "the rain in spain stays mainly on the plain"
  • P = "n th"
  • Applications
  • text editors,
  • Web search engines
  • e.g. Google
  • image analysis
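
As a quick illustration of the definition, Python's built-in `str.find` returns the 0-based index of the first occurrence of a pattern, or -1 if it is absent. The sketch below applies it to the T and P from this slide; it says nothing about which algorithm the library uses internally.

```python
T = "the rain in spain stays mainly on the plain"
P = "n th"

position = T.find(P)                 # first index where P occurs in T, or -1
print(position)                      # 32
print(T[position:position + len(P)])
```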

8
Brute-Force Algorithm
  • Check each position in the text T
  • to see if the pattern P starts in that position

[Figure: T = "andrew", P = "rew"; P moves 1 char at a time through T]
9
Brute-Force Algorithm
  • The obvious method for pattern matching
  • Just check, for each possible position in the
    text
  • at which the pattern could match,
  • whether it does in fact match
  • Search for the first occurrence of
  • a pattern p[1..M] in a text string a[1..N]
  • keep one pointer (i) into the text,
  • another pointer (j) into the pattern.
  • As long as they point to matching characters,
  • both pointers are incremented.
  • If the end of the pattern is reached (j > M),
  • then a match has been found.
  • If i and j point to mismatching characters,
  • then j is reset to point to the beginning of the
    pattern
  • i is reset to correspond to moving the pattern to
    the right one position for matching against the
    text.
  • If the end of the text is reached (i > N)
  • then there is no match.
  • If the pattern does not occur in the text,
  • the value N+1 is returned.

int bruteSearch(int M, int N)
  int i, j
  i = 1; j = 1
  repeat
    if a[i] = p[j] then i++; j++
    else i = i-j+2; j = 1
  until (j > M) or (i > N)
  if j > M then return i-M else return i
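
The two-pointer loop above can be sketched in Python as follows. This is an illustrative translation, not the course's code: it uses 0-based indexes, so the reset `i = i-j+2` becomes `i = i - j + 1`, and it returns `len(text)` (one past the last index) when there is no match, mirroring the N+1 convention. The name `brute_search` is an assumption.

```python
def brute_search(text, pattern):
    """Return the 0-based index of the first match, or len(text) if none."""
    n, m = len(text), len(pattern)
    i = j = 0
    while j < m and i < n:
        if text[i] == pattern[j]:
            i += 1              # both pointers advance on a match
            j += 1
        else:
            i = i - j + 1       # move the pattern one position to the right
            j = 0               # restart at the beginning of the pattern
    return i - m if j == m else n


print(brute_search("andrew", "rew"))   # 3
print(brute_search("andrew", "xyz"))   # 6, i.e. no match
```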
10
Analysis of Algorithm
  • Brute force pattern matching
  • runs in time O(mn) in the worst case.
  • But most searches of ordinary text
  • take O(m+n), which is very quick.
  • The brute force algorithm is fast
  • when the alphabet of the text is large
  • e.g. A..Z, a..z, 1..9, etc.
  • It is slower
  • when the alphabet is small
  • e.g. 0, 1 (as in binary files, image files, etc.)

11
Algorithm Performance
  • In a text-editing application,
  • the inner loop of this program is seldom
    iterated,
  • the running time is very nearly proportional
  • to the number of text characters examined O(n)
  • For example, look for the pattern STING in
  • A STRING SEARCHING EXAMPLE CONSISTING OF SIMPLE
    TEXT
  • the statement j = j+1 is executed only four times
  • once for each S,
  • but twice for the first ST, before the match
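
The count of four can be reproduced with a small instrumented scan. This is a sketch under assumptions: `brute_search_stats` is a hypothetical helper, it uses 0-based indexes, and "false matches" means pattern-pointer advances made during attempts that later fail.

```python
def brute_search_stats(text, pattern):
    """Return (match_index, comparisons, false_matches) for a brute-force scan."""
    n, m = len(text), len(pattern)
    comparisons = false_matches = 0
    for start in range(n - m + 1):
        k = 0
        while k < m:
            comparisons += 1
            if text[start + k] != pattern[k]:
                break
            k += 1
        if k == m:
            return start, comparisons, false_matches
        false_matches += k      # characters matched before this attempt failed
    return -1, comparisons, false_matches


text = "A STRING SEARCHING EXAMPLE CONSISTING OF SIMPLE TEXT"
print(brute_search_stats(text, "STING"))   # (32, 41, 4)
```

The match ends at text index 36, so 41 comparisons for 37 characters examined is close to one comparison per character.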

12
Algorithm Performance Binary Strings
  • Example
  • pattern is 00000001
  • the text string is
  • 00000000000000000000000000000000000000000000000000
    001
  • then j is incremented
  • 7 × 45 (315) times before the match
  • search for 10010111 in the binary string
  • 100111010010010010010111000111
  • partial matches at successive false starts, ending with the full match
  • 1001
  • 1
  • 10
  • 10010
  • 10010
  • 10010
  • 10010111
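
Both binary examples can be traced with a short script that records every false start where at least one character matched. This is a sketch: `false_starts` is a hypothetical helper, indexes are 0-based, and the first text is assumed to be 52 zeros followed by a 1, which is the length that agrees with the 7 × 45 count.

```python
def false_starts(text, pattern):
    """Return (match_index, [(start, matched_chars), ...]) for a brute-force scan."""
    m = len(pattern)
    partial = []
    for start in range(len(text) - m + 1):
        k = 0
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            return start, partial
        if k > 0:
            partial.append((start, k))
    return -1, partial


pos, partial = false_starts("0" * 52 + "1", "00000001")
print(pos, sum(k for _, k in partial))     # 45 315

pos2, partial2 = false_starts("100111010010010010010111000111", "10010111")
print(pos2, partial2)
```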

13
Knuth-Morris-Pratt Algorithm
  • The basic idea is
  • when a mismatch is detected,
  • our false start consists of characters
  • we know in advance
  • since they are in the pattern.
  • take advantage of this information
  • instead of backing up the i pointer
  • over all those known characters

14
The Knuth-Morris-Pratt (KMP) algorithm
  • looks for the pattern in the text
  • in a left-to-right order
  • like the brute force algorithm
  • But it shifts the pattern more intelligently
  • If a mismatch occurs
  • between the text and pattern P at P[j],
  • what is the most we can shift the pattern
  • to avoid wasteful comparisons?
  • Answer: the largest prefix of P[0 .. j-1]
  • that is a suffix of P[1 .. j-1]
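
The "largest prefix that is also a suffix" rule can be computed directly from its definition. The sketch below is a deliberately naive check, not the efficient table construction KMP uses, and `largest_border` is a made-up name. It uses the pattern 10100111 that appears later in these slides, with a mismatch at P[4].

```python
def largest_border(pattern, j):
    """Length of the largest prefix of pattern[0..j-1] that is a suffix of pattern[1..j-1]."""
    for k in range(j - 1, 0, -1):            # try the longest candidate first
        if pattern[:k] == pattern[j - k:j]:
            return k
    return 0


# Mismatch at P[4]: "10" is both a prefix of "1010" and a suffix of "010",
# so matching can resume with j = 2 instead of starting again from 0.
print(largest_border("10100111", 4))   # 2
```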

15
KMP Algorithm is a Generalization
  • Assume the first character in the pattern
  • doesn't appear again in the pattern
  • say the pattern is 10000000
  • Suppose we have a false start j characters long
  • at some position in the text.
  • When the mismatch is detected,
  • j characters have already matched,
  • No need to back up the text pointer i
  • none of the previous j-1 characters in the text
  • can match the first character in the pattern.
  • This change could be implemented by replacing
  • i = i-j+2 in the program above by i = i+1
  • The practical effect of this change is limited
  • such a specialized pattern is not particularly
    likely to occur,
  • but the idea is worth thinking about
  • The Knuth-Morris-Pratt algorithm is a generalization.
  • it is always possible to arrange things
  • so that the i pointer is never decremented.

16
  • Fully skipping past the pattern
  • on detecting a mismatch won't work
  • when the pattern could match itself
  • at the point of the mismatch.
  • For example,
  • searching for 10100111 in 1010100111
  • First mismatch at the fifth character,
  • but it is better to back up
  • to the third character to continue the search,
  • since otherwise we would miss the match.
  • But we can figure out ahead of time what to do
  • because it depends only on the pattern
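
To see why naive skipping fails on this example, the sketch below implements a deliberately incorrect search that never backs up the text pointer and always restarts the pattern from its first character. `skip_search` is a hypothetical name used only for this demonstration; it is not KMP.

```python
def skip_search(text, pattern):
    """Incorrect on purpose: after a mismatch, restart the pattern without backing up i."""
    i = j = 0
    while i < len(text) and j < len(pattern):
        if text[i] == pattern[j]:
            i += 1
            j += 1
        elif j == 0:
            i += 1
        else:
            j = 0               # discard the false start entirely
    return i - len(pattern) if j == len(pattern) else -1


text, pattern = "1010100111", "10100111"
print(skip_search(text, pattern))   # -1: the match is missed
print(text.find(pattern))           # 2: the match starts at the third character
```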

17
Example
[Figure: text T with pointer i and pattern P; j = 5, jnew = 2]
18
Knuth-Morris-Pratt Algorithm
  • The pseudocode and explanation

int kmpsearch(int M, int N)
  int i, j
  i = 1; j = 1
  repeat
    if (j = 0) or (a[i] = p[j]) then i++; j++
    else j = next[j]
  until (j > M) or (i > N)
  if j > M then return i-M else return i
  • When i and j point to mismatching characters
  • (testing for a pattern match
  • beginning at position i-j+1 in the text string),
  • then the next possible position for a pattern
    match
  • is beginning at position i-next[j]+1.
  • The first next[j]-1 characters at that position
    match
  • the first next[j]-1 characters of the pattern,
  • so there's no need to back up the i pointer that
    far
  • we can simply leave the i pointer unchanged
  • and set the j pointer to next[j]
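
A runnable sketch of the same idea in Python is given below. It is not a line-by-line translation of the pseudocode: it uses 0-based indexes and the common "failure table" formulation (here called `fail`, an assumed name), where `fail[k]` is the length of the longest proper prefix of `pattern[:k+1]` that is also its suffix. As in the slides, the text pointer never moves backwards.

```python
def kmp_search(text, pattern):
    """Return the 0-based index of the first match, or -1 if there is none."""
    m = len(pattern)
    if m == 0:
        return 0
    # Build the failure table from the pattern alone.
    fail = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[q] != pattern[k]:
            k = fail[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        fail[q] = k
    # Scan the text once, left to right.
    j = 0
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]     # shift the pattern, keep i where it is
        if ch == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1
    return -1


print(kmp_search("1010100111", "10100111"))   # 2
```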

19
Knuth-Morris-Pratt Algorithm
  • The pseudocode and explanation

int initnext(int M, int N)
  int i, j
  i = 1; j = 0; next[1] = 0
  repeat
    if (j = 0) or (p[i] = p[j]) then i++; j++; next[i] = j
    else j = next[j]
  until i > M
  • Just after i and j are incremented,
  • the first j-1 characters of the pattern match
  • the characters in positions p[i-j+1..i-1],
  • the last j-1 characters in the first i-1
    characters of the pattern.
  • This is the largest j with this property,
  • otherwise a possible match of the pattern with
    itself would be missed.
  • Thus, j is exactly the value to be assigned to
    next[i].
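
The table construction can be mirrored in Python while keeping the 1-based indexing of the pseudocode. In this sketch the pattern is padded with one unused character so that `p[1..M]` holds the pattern; the name `init_next` and the padding trick are assumptions made for illustration. The function returns next[1..M].

```python
def init_next(pattern):
    """Return [next[1], ..., next[M]] using the 1-based convention of the pseudocode."""
    M = len(pattern)
    p = " " + pattern            # p[1..M] are the pattern characters; p[0] is unused
    nxt = [0] * (M + 2)          # entries 1..M+1 may be written
    i, j = 1, 0
    while i <= M:
        if j == 0 or p[i] == p[j]:
            i += 1
            j += 1
            nxt[i] = j
        else:
            j = nxt[j]
    return nxt[1:M + 1]


print(init_next("10100111"))     # [0, 1, 1, 2, 3, 1, 2, 2]
```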

20
Advantages
  • KMP runs in optimal time O(m+n)
  • very fast
  • The algorithm never needs
  • to move backwards in the input text, T
  • this makes the algorithm good
  • for processing very large files
  • read in from external devices
  • or through a network stream

21
Disadvantages
  • KMP doesn't work so well
  • as the size of the alphabet increases
  • more chance for a mismatch
  • mismatches tend to occur
  • early in the pattern
  • but KMP is faster
  • when the mismatches occur later

22
Boyer-Moore Algorithm
  • a significantly faster string searching method
  • scan the pattern from right to left
  • when trying to match it against the text.
  • When searching for our sample pattern 10100111
  • if we find matches
  • on the eighth, seventh, and sixth character
  • but not on the fifth,
  • then we can immediately slide
  • the pattern seven positions to the right,
  • and check the fifteenth character next.

23
Heuristics Used
  • based on two heuristics
  • 1. The looking-glass technique
  • find P in T by moving backwards
  • through P, starting at its end
  • 2. The character-jump technique
  • when a mismatch occurs at T[i] = x
  • the character in pattern P[j]
  • is not the same as T[i]

[Figure: text T with x at index i; pattern P with b at index j]
There are 3 possible cases, tried in order.
24
Case 1
  • If P contains x somewhere,
  • then try to shift P right
  • to align the last occurrence of x in P with T[i].

[Figure: T and P before and after the shift; the last x in P is aligned with T[i], then i and j move right so that j is at the end of P]
25
Case 2
  • If P contains x somewhere,
  • but a shift right
  • to the last occurrence is not possible,
  • Then shift P right by 1 character to T[i+1].

[Figure: T and P before and after the shift; x is after the j position in P, so P shifts right by 1, then i and j move right so that j is at the end of P]
26
Case 3
  • If cases 1 and 2 do not apply,
  • then shift P to align P[0] with T[i+1].

[Figure: T and P before and after the shift; no x in P, so P[0] is aligned with T[i+1], then i and j move right so that j is at the end of P]
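
The looking-glass scan and the three character-jump cases can be sketched in Python with a last-occurrence table. This is an illustrative version under assumptions (0-based indexes, a dictionary named `last`, and only the character-jump heuristic); the single expression `m - min(j, lo + 1)` covers all three cases: lo < j is Case 1, lo > j is Case 2, and lo = -1 (x not in P) is Case 3.

```python
def boyer_moore_search(text, pattern):
    """Return the 0-based index of the first match, or -1 if there is none."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    last = {ch: k for k, ch in enumerate(pattern)}   # last index of each character in P
    i = j = m - 1                                    # start at the end of the pattern
    while i < n:
        if text[i] == pattern[j]:
            if j == 0:
                return i                             # whole pattern matched
            i -= 1                                   # looking-glass: move backwards
            j -= 1
        else:
            lo = last.get(text[i], -1)               # -1 when x does not occur in P
            i += m - min(j, lo + 1)                  # character jump
            j = m - 1
    return -1


print(boyer_moore_search("andrew", "rew"))   # 3
```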
27
Boyer-Moore Algorithm
  • Pseudocode and Explanation

int initnext(int M, int N)
  int i, j
  i = M; j = M
  repeat
    if a[i] = p[j] then i--; j--
    else
      if skip[index(a[i])] > M-j+1 then i = i + skip[index(a[i])]
      else i = i + M-j+1
      j = M
  until (j < 1) or (i > N)
  return i+1
  • It simply improves a brute-force right-to-left
    pattern scan by using an array skip
  • which tells, for each character in the alphabet,
  • how far to skip if that character appears in the
    text and causes a mismatch

28
Boyer-Moore Example (1)
[Figure: text T and pattern P]
29
Boyer-Moore Example (2)
[Figure: text T and pattern P]
30
Boyer-Moore Algorithm Analysis
  • Boyer-Moore worst case running time is
  • O(nm + A)
  • Boyer-Moore is fast
  • when the alphabet (A) is large,
  • slow
  • when the alphabet is small.
  • e.g. good for English text, poor for binary
  • Boyer-Moore is
  • significantly faster than brute force
  • for searching English text.

31
Worst Case Example
  • T = "aaaaaa"
  • P = "baaaaa"
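
The worst case can be made concrete by counting character comparisons. The sketch below instruments a character-jump search (same assumptions as before: 0-based indexes, a `last` dictionary, hypothetical name `bm_comparisons`) and runs it on a longer text of 20 a's so that the growth is visible. Every alignment costs m comparisons and the pattern then shifts by only one position.

```python
def bm_comparisons(text, pattern):
    """Return (match_index, comparisons) for a character-jump Boyer-Moore scan."""
    n, m = len(text), len(pattern)
    last = {ch: k for k, ch in enumerate(pattern)}
    comparisons = 0
    i = j = m - 1
    while i < n:
        comparisons += 1
        if text[i] == pattern[j]:
            if j == 0:
                return i, comparisons
            i -= 1
            j -= 1
        else:
            i += m - min(j, last.get(text[i], -1) + 1)
            j = m - 1
    return -1, comparisons


n, P = 20, "baaaaa"
print(bm_comparisons("a" * n, P))    # (-1, 90): m * (n - m + 1) comparisons
```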
32
Chapter 8: Text Processing