String matching - PowerPoint PPT Presentation

Provided by: icsUciEd8
Learn more at: https://ics.uci.edu
Transcript and Presenter's Notes

1
String matching
2
Exact String Matching
  • Input: two strings $T[1..n]$ and $P[1..m]$, containing
    symbols from alphabet $\Sigma$.
  • Example:
  • $\Sigma = \{A, C, G, T\}$
  • $T[1..12] = \mathrm{CAGTACATCGAT}$
  • $P[1..3] = \mathrm{AGT}$
  • Goal: find all shifts $0 \le s \le n-m$ such that
    $T[s+1..s+m] = P$

3
Simple Algorithm
  • for s ← 0 to n-m
  •     Match ← 1
  •     for j ← 1 to m
  •         if T[s+j] ≠ P[j] then
  •             Match ← 0
  •             exit loop
  •     if Match = 1 then output s
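
The loop above can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the function name `naive_match` is hypothetical, and indices are 0-based, so `text[s + j]` plays the role of the slide's T[s+j] with j shifted by one.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text (0-based shifts)."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # s = 0 .. n-m
        match = True
        for j in range(m):              # compare character by character
            if text[s + j] != pattern[j]:
                match = False
                break                   # exit the inner loop at the first mismatch
        if match:
            shifts.append(s)
    return shifts


print(naive_match("CAGTACATCGAT", "AGT"))  # [1]
```

On the example from the previous slide, the only occurrence of AGT is at shift s = 1.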

4
Analysis
  • Running time of the simple algorithm:
  • Worst case: $O(nm)$
  • Average case (random text): $O(n)$ (in expectation)
  • $T_s$ = time spent on checking shift $s$
  • (the number of comparisons until the 1st mismatch)
  • $E[T_s] < 2$ (why?)
  • $E[\sum_s T_s] = \sum_s E[T_s] = O(n)$
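
One possible answer to the "why", under an assumption the slide does not spell out: each text character is drawn independently and uniformly from $\Sigma$, with $|\Sigma| \ge 2$. The $k$-th comparison at shift $s$ happens only if the first $k-1$ comparisons all matched, which has probability $|\Sigma|^{-(k-1)}$, so:

```latex
E[T_s] = \sum_{k=1}^{m} \Pr[T_s \ge k]
       = \sum_{k=1}^{m} |\Sigma|^{-(k-1)}
       < \sum_{k=0}^{\infty} |\Sigma|^{-k}
       = \frac{1}{1 - 1/|\Sigma|}
       \le 2 .
```

Linearity of expectation then gives the $O(n)$ total, since there are $n-m+1$ shifts.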

5
Worst-case
  • Is it possible to achieve O(n) for any input?
  • Knuth-Morris-Pratt '77: deterministic
  • Karp-Rabin '81: randomized
  • Digression: what is the difference between
  • an algorithm that is fast on a random input (as seen
    on the previous slide)
  • a randomized algorithm (as in the rest of this
    lecture)

6
Karp-Rabin Algorithm
  • Idea: semi-numerical approach
  • Consider all m-mers:
  • $T[1..m],\ T[2..m+1],\ \ldots,\ T[n-m+1..n]$
  • Map each $T[s+1..s+m]$ into a number $t_s$
  • Map the pattern $P[1..m]$ into a number $p$
  • Report the m-mers that map to the same value as $p$
  • Problem: how to map all m-mers in O(n) time?

7
Implementation
  • Attempt I:
  • Assume $\Sigma = \{0, 1\}$
  • (for $\{A, C, G, T\}$, convert A → 00, C → 01, G → 10, T → 11)
  • Think about each $T[s+1..s+m]$ as a number in binary
    representation, i.e.,
  • $t_s = T[s+1]\,2^{m-1} + T[s+2]\,2^{m-2} + \cdots + T[s+m]\,2^{0}$
  • Find a fast way of computing $t_{s+1}$ given $t_s$
  • Output all $s$ such that $t_s$ is equal to the number
    $p$ represented by $P$

8
Magic formula
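  • How to transform
  • $t_s = T[s+1]\,2^{m-1} + T[s+2]\,2^{m-2} + \cdots + T[s+m]\,2^{0}$
  • into
  • $t_{s+1} = T[s+2]\,2^{m-1} + T[s+3]\,2^{m-2} + \cdots + T[s+m+1]\,2^{0}$ ?
  • Three steps:
  • Subtract $T[s+1]\,2^{m-1}$
  • Multiply by 2 (i.e., shift the bits by one
    position)
  • Add $T[s+m+1]\,2^{0}$
  • Therefore: $t_{s+1} = (t_s - T[s+1]\,2^{m-1}) \cdot 2 + T[s+m+1]\,2^{0}$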
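
A small worked instance may help check the three steps. The input here is made up for illustration: binary text $T[1..4] = 1011$ and window length $m = 3$.

```latex
t_0 = 1 \cdot 2^{2} + 0 \cdot 2^{1} + 1 \cdot 2^{0} = 5 \quad (101_2), \\
t_1 = (t_0 - T[1] \cdot 2^{2}) \cdot 2 + T[4] \cdot 2^{0}
    = (5 - 4) \cdot 2 + 1 = 3 \quad (011_2).
```

The result agrees with reading the window $T[2..4] = 011$ directly as a binary number.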

9
Algorithm
  • $t_{s+1} = (t_s - T[s+1]\,2^{m-1}) \cdot 2 + T[s+m+1]\,2^{0}$
  • Can compute $t_{s+1}$ from $t_s$ using 3 arithmetic
    operations
  • Therefore, we can compute all $t_0, t_1, \ldots, t_{n-m}$ using
    O(n) arithmetic operations
  • We can compute a number corresponding to $P$ using
    O(m) arithmetic operations
  • Are we done?
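
As a rough sketch of this step, the following Python computes every window value with the rolling update. The function name `window_values` and the list-of-bits input are assumptions made for illustration; Python integers are arbitrary precision, so the values are exact, but each operation is not O(1) once m is large.

```python
def window_values(bits: list[int], m: int) -> list[int]:
    """Exact values t_0 .. t_{n-m} of all m-bit windows, via the rolling update."""
    n = len(bits)
    if m <= 0 or m > n:
        return []
    high = 1 << (m - 1)                 # 2^(m-1), the weight of the leading bit
    t = 0
    for b in bits[:m]:                  # t_0 by Horner's rule: O(m) operations
        t = 2 * t + b
    values = [t]
    for s in range(n - m):              # derive t_{s+1} from t_s
        # 0-based: bits[s] is the slide's T[s+1], bits[s + m] is T[s+m+1]
        t = (t - bits[s] * high) * 2 + bits[s + m]
        values.append(t)
    return values


print(window_values([1, 0, 1, 1], 3))  # [5, 3]
```

The same Horner loop applied to the pattern gives p in O(m) operations.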

10
Problem
  • To get O(n) time, we would need to perform each
    arithmetic operation in O(1) time
  • However, the arguments are m bits long!
  • If m is large, it is unreasonable to assume that
    operations on such big numbers can be done in
    O(1) time
  • We need to reduce the number range to something
    more manageable

11
Hashing
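  • We will instead compute
  • $t_s = (T[s+1]\,2^{m-1} + T[s+2]\,2^{m-2} + \cdots + T[s+m]\,2^{0}) \bmod q$, where
    $q$ is an appropriate prime number
  • One can still compute $t_{s+1}$ from $t_s$:
  • $t_{s+1} = ((t_s - T[s+1]\,2^{m-1}) \cdot 2 + T[s+m+1]\,2^{0}) \bmod q$
  • If $q$ is not large, each operation takes O(1) time, so we
    can compute all $t_s$ (and $p$) in O(n) time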
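
The modular update can be sketched the same way. Again `window_hashes` is a hypothetical name, and the modulus `q` is passed in by the caller; the slide does not say how large an "appropriate" prime should be, so no particular choice is implied here.

```python
def window_hashes(bits: list[int], m: int, q: int) -> list[int]:
    """Values t_s mod q for all m-bit windows, keeping every number below q."""
    n = len(bits)
    if m <= 0 or m > n:
        return []
    high = pow(2, m - 1, q)             # 2^(m-1) mod q, computed once
    t = 0
    for b in bits[:m]:                  # t_0 mod q by Horner's rule
        t = (2 * t + b) % q
    hashes = [t]
    for s in range(n - m):
        # Python's % returns a non-negative result for positive q,
        # so the subtraction cannot leave t negative.
        t = ((t - bits[s] * high) * 2 + bits[s + m]) % q
        hashes.append(t)
    return hashes


print(window_hashes([1, 0, 1, 1], 3, 7))  # [5, 3]
print(window_hashes([1, 0, 1, 1], 3, 3))  # [2, 0]
```

With q = 7 the windows 101 and 011 keep their exact values 5 and 3; with q = 3 they are reduced to 2 and 0.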

12
Problem
  • Unfortunately, we can have false positives, i.e.,
    $T[s+1..s+m] \ne P$ but $t_s \bmod q = p \bmod q$
  • Our approach:
  • Use a random $q$
  • Show that the probability of a false positive is
    small
  • ⇒ randomized algorithm
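
Putting the pieces together, a minimal sketch of the randomized matcher might look like this. Several things here are assumptions rather than content from the slides: the helper names, the bound `upper` on the random prime (an arbitrary placeholder, not a value justified by the analysis), the trial-division primality test, and the optional `verify` flag, which re-checks each hash hit by direct comparison. With `verify=False` the function reports hash matches only, so false positives are possible, as the slide describes.

```python
import random


def is_prime(x: int) -> bool:
    """Trial division; adequate for the small moduli used in this sketch."""
    if x < 2:
        return False
    i = 2
    while i * i <= x:
        if x % i == 0:
            return False
        i += 1
    return True


def random_prime(upper: int) -> int:
    """Pick a prime uniformly at random from [2, upper] by rejection sampling."""
    while True:
        candidate = random.randint(2, upper)
        if is_prime(candidate):
            return candidate


def karp_rabin(bits: list[int], pattern: list[int],
               upper: int = 1_000_003, verify: bool = True) -> list[int]:
    """Return shifts s where the window hash equals the pattern hash."""
    n, m = len(bits), len(pattern)
    if m == 0 or m > n:
        return []
    q = random_prime(upper)             # the random modulus
    high = pow(2, m - 1, q)
    p = t = 0
    for j in range(m):                  # hash of the pattern and of the first window
        p = (2 * p + pattern[j]) % q
        t = (2 * t + bits[j]) % q
    shifts = []
    for s in range(n - m + 1):
        # a hash hit is reported directly, or confirmed first when verify is set
        if t == p and (not verify or bits[s:s + m] == pattern):
            shifts.append(s)
        if s < n - m:                   # roll the hash forward to shift s+1
            t = ((t - bits[s] * high) * 2 + bits[s + m]) % q
    return shifts


print(karp_rabin([1, 0, 1, 1, 0, 1], [1, 0, 1]))
```

Verification trades the false-positive risk for extra comparisons on each hit; whether to include it is a design choice the slides leave open.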

13
Aligning two sequences
  • Longest common substring (LCS) problem (no gaps)
  • Input: two strings $T[1..n]$ and $P[1..m]$, $n \ge m$
  • Goal: largest $k$ such that
  • $T[i+1..i+k] = P[j+1..j+k]$
  • for some $i, j$
  • How can we solve this problem efficiently?
  • Hint: how can we check if the LCS has length at least $k$?
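
One possible reading of the hint, offered as a sketch rather than the presenter's intended solution: if a common substring of length k exists, so does one of every shorter length, so k can be found by binary search over a yes/no check. The function names below are hypothetical, and the check uses a Python set of substrings for brevity; a rolling hash in the style of Karp-Rabin could replace the slicing to avoid building each length-k substring explicitly.

```python
def has_common_substring(t: str, p: str, k: int) -> bool:
    """True if some length-k substring occurs in both t and p."""
    if k == 0:
        return True
    seen = {t[i:i + k] for i in range(len(t) - k + 1)}
    return any(p[j:j + k] in seen for j in range(len(p) - k + 1))


def longest_common_substring_length(t: str, p: str) -> int:
    """Binary search for the largest k that passes the check."""
    lo, hi = 0, min(len(t), len(p))     # invariant: length lo is feasible
    while lo < hi:
        mid = (lo + hi + 1) // 2        # round up so the search always progresses
        if has_common_substring(t, p, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


print(longest_common_substring_length("CAGTACATCGAT", "TACAGG"))  # 4
```

In the usage line, the longest shared run is TACA, of length 4.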