String Matching

Last modified by: Prof. WANG Lusheng



1
String Matching
  • The problem
  • Input: a text T (a very long string) and a pattern
    P (a short string).
  • Output: the index in T where a copy of P begins.

2
Some Notations and Terminologies
  • |P| and |T|: the lengths of P and T.
  • P[i]: the i-th letter of P.
  • Prefix of P: a substring of P starting with P[1].
  • P[1..i]: the prefix containing the first i
    letters of P.
  • Example: P = abcabbccaa.
  • Prefixes: a, ab, abc, abca, abcab, abcabb, ...

3
Some Notations and Terminologies
  • Suffix of P[1..i]: a substring of P[1..i] ending
    at P[i], e.g. P[3..i], P[5..i] (i > 4).
  • Example: P[1..5] = abcaa.
  • Suffixes of P[1..3]: c, bc, abc.
  • Suffixes of P[1..4]: a, ca, bca, abca.

4
Straightforward method
  • Basic idea
  • 1. i = 1.
  • 2. Start with T[i] and match P against
  •    T[i], T[i+1], ..., T[i+|P|-1]
  •    P[1], P[2],   ..., P[|P|]
  • 3. Whenever a mismatch is found, set
  •    i = i + 1 and go to step 2, as long as
    i + |P| - 1 ≤ |T|.
  • Example 1: T = ABABABCCA and P = ABABC
  • P: ABABC        A            ABABC
  • T: ABABABCCA   ABABABCCA   ABABABCCA
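
The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the presenter's code: the function name `naive_find` is hypothetical, Python strings are 0-based, and the result is converted back to the 1-based indexing used on the slides.

```python
def naive_find(text, pattern):
    """Return the 1-based index of the first occurrence of pattern, or -1."""
    n, m = len(text), len(pattern)
    # Try every start position i (0-based here; the slides count from 1).
    for i in range(n - m + 1):
        j = 0
        # Step 2: compare pattern[0..m-1] against text[i..i+m-1].
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i + 1  # convert to the slides' 1-based indexing
        # Step 3: on a mismatch the loop moves on to the next start position.
    return -1


print(naive_find("ABABABCCA", "ABABC"))  # 3
```

For Example 1 the first two alignments fail and the third succeeds, so the reported position is 3.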

5
Analysis
  • Step 2 takes O(|P|) comparisons in the worst
    case.
  • Step 2 could be repeated O(|T|) times.
  • Total running time is O(|T||P|).

6
Knuth-Morris-Pratt Method (linear time algorithm)
  • A better idea
  • In step 3, when there is a mismatch we move
    forward one position (i = i + 1).
  • We may move more than one position at a time when
    a mismatch occurs (this requires a careful study
    of the pattern P).
  • For example
  • P: ABABC         ABA
  • T: ABABABCCA   ABABABCCA

7
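  • Questions
  • How do we decide how many positions to jump
    when a mismatch occurs?
  • How much can we benefit? The running time becomes O(|T| + |P|).
  • Example 2
  • P: abcabcabcaa
  • T: abcabcabcabcaa
  • P:    abcabcab (P shifted by 3 positions)
  • (arrow label in the original figure: "back here")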

8
  • We can move forward more than one position.
    Reason?
  • Study of pattern P
  • P[1..7]  = abcabca
  • P[1..10] = abcabcabca (when trying to match P[11], we
    have a mismatch)
  • P[1..7]  =    abcabca (aligned as a suffix of P[1..10])
  • P[1..4]  =       abca (aligned as a suffix of P[1..10])
  • P[1..7] is the longest prefix that is also a
    suffix of P[1..10].
  • P[1..4] is a prefix that is a suffix of P[1..10],
    but not the longest.
  • Key: When a mismatch occurs at P[i+1], we want to
    find the longest prefix of P[1..i], other than
    P[1..i] itself, which is also a suffix of P[1..i].

9
Failure function
  • f(i) is the largest r (with r < i) such that
  • P[1] P[2] ... P[r] = P[i-r+1] P[i-r+2] ...
    P[i].
  • (prefix of length r) = (suffix of
    P[1] P[2] ... P[i] of length r)
  • That is, P[1..f(i)] is the longest prefix that is
    a suffix of P[1..i].
  • Example 3: P = ababaccc and i = 5.
  • P[1] P[2] P[3]
  •  a    b    a
  •  a    b    a    b    a
  •           P[3] P[4] P[5]   (r = 3), so f(5) = 3.

10
  • Example 4
  • P = abcabbabcabbaa
  • It is easy to verify that
  • f(1) = 0, f(2) = 0, f(3) = 0, f(4) = 1, f(5) = 2,
  • f(6) = 0, f(7) = 1, f(8) = 2, f(9) = 3, f(10) = 4,
  • f(11) = 5, f(12) = 6, f(13) = 7, f(14) = 1.
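
The values in Example 4 can be checked directly from the definition of f. The sketch below is a deliberately slow brute-force check, not the construction algorithm given later in the slides. The name `failure_by_definition` is hypothetical, and the returned list carries an unused entry at index 0 so that `f[i]` matches the slides' 1-based f(i).

```python
def failure_by_definition(pattern):
    """f[i] = largest r < i with P[1..r] equal to the last r letters of P[1..i]."""
    m = len(pattern)
    f = [0] * (m + 1)  # f[0] is an unused placeholder
    for i in range(1, m + 1):
        prefix = pattern[:i]  # P[1..i] in the slides' notation
        for r in range(i - 1, 0, -1):  # try the longest candidate first
            if prefix[:r] == prefix[i - r:]:
                f[i] = r
                break
    return f


f = failure_by_definition("abcabbabcabbaa")
print(f[1:])  # [0, 0, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 1]
```

The same function gives f(5) = 3 for P = ababaccc, which agrees with Example 3.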

11
The Scan Algorithm (draw a figure to show)
  • i indicates that T[i] is the next character in T
    to be compared with the right end of the
    pattern.
  • q indicates that P[q+1] is the next character in
    P to be compared with T[i].
  • Step 1: i = 1 and q = 0.
  • Step 2: Compare T[i] with P[q+1].
    Case 1: T[i] = P[q+1]. Set i = i + 1 and q = q + 1; if q = |P| then
    print "P occurs at i - |P|" (i has already moved past the match) and set q = f(q).
    Case 2: T[i] ≠ P[q+1] and q ≠ 0. Set q = f(q).
    Case 3: T[i] ≠ P[q+1] and q = 0. Set i = i + 1.
  • Repeat step 2 until i > |T|.
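
A minimal Python sketch of the three cases follows. It assumes a non-empty pattern and a failure table `f` supplied by the caller as a list with an unused entry at index 0, so that `f[q]` matches the slides' f(q). The function name `scan` is hypothetical.

```python
def scan(text, pattern, f):
    """Three-case scan; f is a 1-indexed failure table (f[0] is unused)."""
    n, m = len(text), len(pattern)
    found = []
    i, q = 1, 0  # 1-based text position, number of pattern letters matched
    while i <= n:
        if text[i - 1] == pattern[q]:  # case 1: T[i] = P[q+1]
            i += 1
            q += 1
            if q == m:
                found.append(i - m)  # i has already moved past the match
                q = f[q]
        elif q != 0:  # case 2: mismatch after a partial match
            q = f[q]
        else:  # case 3: mismatch on the first pattern letter
            i += 1
    return found


# Failure table of P = abcabbabcabbaa, copied from Example 4.
f_example4 = [0, 0, 0, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 1]
print(scan("abcabcabbabbabcabbabcabbaa", "abcabbabcabbaa", f_example4))  # [13]
```

Because q is reset to f(q) as soon as it reaches |P|, the lookup `pattern[q]` never runs past the end of the pattern.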

12
  • Example 5: P = abcabbabcabbaa
  • T: abcabcabbabbabcabbabcabbaa
  •    abcabb
  •       abcabbabc
  •             abc
  •               a (i = i + 1)
  •                abcabbabcabbaa (full match, q = |P|)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14
f(i): 0  0  0  1  2  0  1  2  3  4  5  6  7  1

13
Running time complexity (hard)
  • The running time of the scan algorithm is O(|T|).
  • Proof
  • There are two pointers i and p.
  • i: the next character in T to be compared.
  • p: the position in T aligned with P[1]. (See figure below)
  •    p         i
  • P: abcabcabcaa
  • T: abcabcabcabcaa
  • P:    abcabcaa
  •       p

14
  • Facts
  • 1. When a match is found, move i forward.
  • 2. When a mismatch is found, move p forward until
    p and i are the same. (When p = i and a mismatch
    occurs, move both i and p forward.)
  • Every comparison therefore moves i or p forward, neither
    pointer ever moves backward, and each can advance at most |T| times.
  • From facts 1 and 2, it is easy to see that the
    total number of comparisons is at most 2|T|.
  • Thus, the time complexity is O(|T|).
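
The counting argument can be checked empirically. The sketch below instruments a three-case scan with a comparison counter and asserts, after every comparison, that i or p = i - q has moved forward and that neither has moved back. The function name `scan_counting` is hypothetical. The failure table is passed in as a list with an unused entry at index 0, and the pattern is assumed to be non-empty.

```python
def scan_counting(text, pattern, f):
    """Run the three-case scan and return (matches, number of comparisons)."""
    n, m = len(text), len(pattern)
    i, q = 1, 0
    matches, comparisons = [], 0
    while i <= n:
        p_before, i_before = i - q, i  # p = position in T aligned with P[1]
        comparisons += 1
        if text[i - 1] == pattern[q]:
            i, q = i + 1, q + 1  # fact 1: a match moves i forward
            if q == m:
                matches.append(i - m)
                q = f[q]
        elif q > 0:
            q = f[q]  # fact 2: a mismatch moves p forward
        else:
            i += 1  # p == i: both pointers move forward
        # Neither pointer moves back, and at least one moves forward.
        assert i - q >= p_before and i >= i_before
        assert i - q > p_before or i > i_before
    return matches, comparisons


# P = abcabcabcaa; its failure values f(1)..f(11) are 0 0 0 1 2 3 4 5 6 7 1.
f_table = [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 1]
text = "abcabcabcabcaa"
matches, comparisons = scan_counting(text, "abcabcabcaa", f_table)
print(matches, comparisons, comparisons <= 2 * len(text))  # [4] 15 True
```

On this input the scan uses 15 comparisons for a text of length 14, well within the 2|T| = 28 bound.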

15
Another version of scan algorithm (code)
  • n = |T|
  • m = |P|
  • q = 0
  • for i = 1 to n
  •   while q > 0 and P[q+1] ≠ T[i] do
  •     q = f(q)
  •   if P[q+1] = T[i] then
  •     q = q + 1
  •   if q = m then
  •     print "pattern occurs at i - m + 1"
  •     q = f(q)
16
Failure Function Construction
  • Basic idea
  • Case 1: f(1) is always 0.
  • Case 2: if P[q] = P[f(q-1)+1] then f(q) = f(q-1) + 1.
  • Example: P = abcabcc
  •                 abc
  • f(1) = 0, f(2) = 0, f(3) = 0, f(4) = 1, f(5) = 2, f(6) = 3,
    f(7) = 0
  • P[4] = P[f(4-1)+1], so f(4) = f(4-1) + 1 = 1.
  • P[5] = P[f(5-1)+1], so f(5) = f(5-1) + 1 = 1 + 1 = 2.
  • P[6] = P[f(6-1)+1], so f(6) = f(6-1) + 1 = 2 + 1 = 3.

17
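  • Case 3: if P[q] ≠ P[f(q-1)+1] and f(q-1) ≠ 0, then
    consider whether P[q] = P[f(f(q-1))+1] (do it
    recursively).
  • Case 4: if P[q] ≠ P[f(q-1)+1] and f(q-1) = 0, then
    f(q) = 0.
  • Example: P = abc abc abb, computing f(9)
  • abc abc   f(8) = 5: P[9] = b ≠ P[6] = c
  • abc       f(5) = 2: P[9] = b ≠ P[3] = c
  • a         f(2) = 0: P[9] = b ≠ P[1] = a, so f(9) = 0
  • i:    1 2 3 4 5 6 7 8 9
  • f(i): 0 0 0 1 2 3 4 5 0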

18
The algorithm (code) to compute failure function
  • 1. m = |P|
  • 2. f(1) = 0
  • 3. k = 0
  • 4. for q = 2 to m do
  • 5.   k = f(q-1)
  • 6.   if (k > 0 and P[k+1] != P[q])
  •        k = f(k); goto 6
  • 7.   if (k > 0 and P[k+1] = P[q])
  •        f(q) = k + 1
  • 8.   if (k = 0)
  •        if (P[k+1] = P[q]) f(q) = 1
  •        else f(q) = 0

19
Another version
  • m = |P|
  • f(1) = 0
  • k = 0
  • for q = 2 to m do
  •   k = f(q-1)
  •   while (k > 0 and P[k+1] != P[q]) do
  •     k = f(k)
  •   if (P[k+1] = P[q]) then k = k + 1
  •   f(q) = k
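
The while-loop version translates almost line for line into Python. In this sketch the name `failure_function` is hypothetical, and the returned list has an unused entry at index 0 so that `f[q]` corresponds to the slides' f(q). Since Python strings are 0-based, `pattern[k]` stands for P[k+1] and `pattern[q - 1]` for P[q].

```python
def failure_function(pattern):
    """Return a 1-indexed failure table f (f[0] is an unused placeholder)."""
    m = len(pattern)
    f = [0] * (m + 1)
    for q in range(2, m + 1):
        k = f[q - 1]
        # Fall back to shorter prefixes while P[k+1] != P[q].
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = f[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        f[q] = k
    return f


print(failure_function("abcabcabb")[1:])  # [0, 0, 0, 1, 2, 3, 4, 5, 0]
```

The printed table matches the one given for P = abc abc abb, including the chain f(8) = 5, f(5) = 2, f(2) = 0 that leads to f(9) = 0.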

20
  • Example 3
  • i: 1 2 3 4 5 6 7 8 9 10 11 12
  • P: a b c a b c a b c a  a  c
  • f(1) = 0, f(2) = 0, f(3) = 0, f(4) = 1, f(5) = 2,
  • f(6) = 3, f(7) = 4, f(8) = 5, f(9) = 6, f(10) = 7,
  • f(11) = 1.
  • (The computation of f(11) is very interesting.)
  • Question: Do we need to compute f(12)?
  • Yes, if you want to find ALL occurrences of P.
  • No, if you just want to find the first occurrence
    of P.

21
  • Example
  • P = abcabc
  • T = abcabcabc
  •     abcabc
  •        abcabc
  • When a match is found at the end of P, continue
    with q = f(|P|).
  • i:    1 2 3 4 5 6
  • f(i): 0 0 0 1 2 3
  • Running time complexity (Fun Part, not required)
  • The running time of the failure function construction
    algorithm is O(|P|). (The proof is similar to
    that for the scan algorithm.)
  • Total running time complexity
  • The total complexity for failure function
    construction and the scan algorithm is O(|P| + |T|).
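
Putting the two parts together, a complete search might look like the sketch below. It follows the for-loop version of the scan and the while-loop version of the failure function construction. The function name `kmp_search` and the decision to return an empty list for an empty pattern are assumptions made for this illustration.

```python
def kmp_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern reports no occurrences
    # Failure table, 1-indexed as in the slides (f[0] is unused).
    f = [0] * (m + 1)
    for q in range(2, m + 1):
        k = f[q - 1]
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = f[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        f[q] = k
    # Scan the text, following the for-loop version.
    found, q = [], 0
    for i in range(1, n + 1):
        while q > 0 and pattern[q] != text[i - 1]:
            q = f[q]
        if pattern[q] == text[i - 1]:
            q += 1
        if q == m:
            found.append(i - m + 1)
            q = f[q]  # keep going so that overlapping occurrences are found
    return found


print(kmp_search("abcabcabc", "abcabc"))  # [1, 4]
```

The two reported positions overlap, which is why q is reset to f(|P|) = 3 rather than to 0 after the first match.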