String Matching - PowerPoint PPT Presentation

Provided by: lcl2
1
String Matching
String matching: definition of the problem
(text, pattern)
The approach depends on what we are given: the text or the patterns
  • Exact matching
  • The patterns ---> Data structures for the patterns
  • 1 pattern ---> The algorithm depends on p and ?
  • k patterns ---> The algorithm depends on k, p and ?
  • Extensions
  • Regular Expressions
  • The text ---> Data structure for the text (suffix tree, ...)
  • Approximate matching
  • Dynamic programming
  • Sequence alignment (pairwise and multiple)
  • Sequence assembly: hash algorithm
  • Probabilistic search
  • Hidden Markov Models
2
Approximate string matching
For instance, given the sequence CTACTACTACGTGACT
AATACTGATCGTAGCTAC, search for the pattern ACTGA
allowing one error.
But what is the meaning of "one error"?
3
Edit distance
We accept three types of errors:
1. Mismatch: ACCGTGAT ---> ACCGAGAT
2. Insertion: ACCGTGAT ---> ACCGATGAT
3. Deletion: ACCGTGAT ---> ACCGGAT
The edit distance d between two strings is the
minimum number of substitutions, insertions and
deletions needed to transform the first string
into the second one.
d(ACT,ACT) = ?   d(ACT,AC) = ?   d(ACT,C) = ?
d(ACT,"") = ?    d(AC,ATC) = ?   d(ACTTG,ATCTG) = ?
4
Edit distance
d(ACT,ACT) = 0   d(ACT,AC) = 1   d(ACT,C) = 2
d(ACT,"") = 3    d(AC,ATC) = 1   d(ACTTG,ATCTG) = 2
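The six values above can be checked directly from the definition. The following is a minimal sketch, not part of the original slides: the function name edit_distance and the helper d are hypothetical, and the recursion is simply the definition applied to prefixes, with memoization so that each pair of prefixes is solved once.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j  # a[:0] is empty: insert the j characters of b[:j]
        if j == 0:
            return i  # b[:0] is empty: delete the i characters of a[:i]
        substitution = d(i - 1, j - 1) + (a[i - 1] != b[j - 1])
        deletion = d(i - 1, j) + 1
        insertion = d(i, j - 1) + 1
        return min(substitution, deletion, insertion)

    return d(len(a), len(b))


pairs = [("ACT", "ACT"), ("ACT", "AC"), ("ACT", "C"),
         ("ACT", ""), ("AC", "ATC"), ("ACTTG", "ATCTG")]
for first, second in pairs:
    print(f"d({first},{second}) = {edit_distance(first, second)}")
```

The loop prints the distances 0, 1, 2, 3, 1 and 2, in the same order as on the slide.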
5
Edit distance and alignment of strings
The edit distance is related to the best alignment of the strings.
Given d(ACT,ACT) = 0, d(ACT,AT) = 1 and d(ACTTG,ATCTG) = 2,
which is the best alignment in each case?
  • ACT and ACT:
      ACT
      ACT
  • ACT and AT:
      ACT
      A-T
  • ACTTG and ATCTG (two best alignments):
      ACTTG      ACT-TG
      ATCTG      A-TCTG
The alignment thus indicates the substitutions,
insertions and deletions needed to transform one string
into the other.
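To make the link between an alignment and the edit count concrete, here is a small sketch (the function name alignment_cost is hypothetical, and gaps are assumed to be written as '-'). Every column in which the two rows differ stands for one substitution, insertion or deletion.

```python
def alignment_cost(top: str, bottom: str) -> int:
    """Count the edits implied by an alignment written with '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    # A column with two different symbols is a substitution; a column with
    # a gap in one row is an insertion or a deletion. Equal columns are free.
    return sum(1 for x, y in zip(top, bottom) if x != y)


print(alignment_cost("ACT", "ACT"))        # identical strings
print(alignment_cost("ACT", "A-T"))        # one deletion
print(alignment_cost("ACTTG", "ATCTG"))    # two substitutions
print(alignment_cost("ACT-TG", "A-TCTG"))  # one deletion and one insertion
```

Both alignments of ACTTG and ATCTG cost 2, which is why either of them is a best alignment.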
6
Edit distance and alignment of strings
But what is the distance between the
strings ACGCTATGCTATACG and ACGGTAGTGACGC,
and what is the best alignment between them?
This problem was first discussed in 1966,
and the algorithm was proposed in 1968 and 1970,
using the technique called dynamic programming.
7
Edit distance and alignment of strings
Text (columns):  C T A C T A C T A C G T
Pattern (rows):  A C T G A
9
Edit distance and alignment of strings
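Text (columns):  C T A C T A C T A C G T
Pattern (rows):  A C T G A
Each cell holds the distance between a prefix of the pattern and a prefix of the text.
For example, the cell in row C and the fifth column contains the distance between AC and CTACT.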
10
Edit distance and alignment of strings
Text (columns):  C T A C T A C T A C G T
Pattern (rows):  A C T G A
First cell:      ?
11
Edit distance and alignment of strings
Text (columns):  C T A C T A C T A C G T
Pattern (rows):  A C T G A
First row:       0 ?
12
Edit distance and alignment of strings
Text (columns):  C T A C T A C T A C G T
Pattern (rows):  A C T G A
First row:       0 1 ?
Alignment for the cell with value 1:
  -
  C
13
Edit distance and alignment of strings
Text (columns):  C T A C T A C T A C G T
Pattern (rows):  A C T G A
First row:       0 1 2 ?
Alignment for the cell with value 2:
  - -
  C T
14
Edit distance and alignment of strings
Text (columns):  C T A C T A C T A C G T
Pattern (rows):  A C T G A
First row:       0 1 2 3 4 5 6 7 8
Alignment for the cell with value 6:
  - - - - - -
  C T A C T A
15
Edit distance and alignment of strings
Text (columns):  C T A C T A C T A C G T
First row:       0 1 2 3 4 5 6 7 8
First column:    A ?   C ?   T ?   G   A
16
Edit distance and alignment of strings
Text (columns):  C T A C T A C T A C G T
First row:       0 1 2 3 4 5 6 7 8
First column:    A 1   C 2   T 3   G   A
Alignment for the cell with value 3:
  A C T
  - - -
17
Edit distance and alignment of strings
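Text (columns):  C T A C T A C T A C G T
First row:       0 1 2 3 4 5 6 7 8
First column:    A 1   C 2   T 3   G   A
BA(AC,CTAC), the best alignment of AC and CTAC, ends in one of three ways, so
d(AC,CTAC) = min { d(AC,CTA) + 1,  d(A,CTA),  d(A,CTAC) + 1 }
(nothing is added to d(A,CTA) because the last characters, C and C, match)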
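The same reasoning applies to every cell. As a sketch in general notation (the symbols D, p and t are introduced here for illustration and do not appear on the slides), let D(i,j) be the distance between the first i characters of the pattern p and the first j characters of the text t:

```latex
D(0,j) = j, \qquad D(i,0) = i,
\qquad
D(i,j) = \min
\begin{cases}
  D(i,j-1) + 1 \\
  D(i-1,j-1) + [\,p_i \neq t_j\,] \\
  D(i-1,j) + 1
\end{cases}
```

Here [p_i ≠ t_j] is 1 when the two characters differ and 0 when they match. The three cases correspond, in order, to the neighbours d(AC,CTA), d(A,CTA) and d(A,CTAC) of the cell d(AC,CTAC).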
18
Edit distance and alignment of strings
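Text (columns):  C T A C T A C T A C G T
First row:       0 1 2 3 4 5 6 7 8
First column:    A 1   C 2   T 3   G   A
d(AC,CTACT) = min { d(AC,CTAC) + 1,  d(A,CTAC) + 1,  d(A,CTACT) + 1 }
(here d(A,CTAC) also costs 1 more because the last characters, C and T, differ)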
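The cell-by-cell filling shown on these slides can be sketched as follows. This is an illustrative implementation under the assumption that rows are indexed by pattern prefixes and columns by text prefixes; the name distance_table is hypothetical.

```python
def distance_table(pattern: str, text: str) -> list[list[int]]:
    """Fill the whole table: table[i][j] = d(pattern[:i], text[:j])."""
    rows, cols = len(pattern) + 1, len(text) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j  # first row: empty pattern prefix, j insertions
    for i in range(rows):
        table[i][0] = i  # first column: empty text prefix, i deletions
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = 0 if pattern[i - 1] == text[j - 1] else 1
            table[i][j] = min(
                table[i][j - 1] + 1,             # left neighbour
                table[i - 1][j - 1] + mismatch,  # diagonal neighbour
                table[i - 1][j] + 1,             # upper neighbour
            )
    return table


table = distance_table("ACTGA", "CTACTACTACGT")
print(table[0][:9])                    # start of the first row
print([row[0] for row in table[:4]])   # start of the first column
print(table[2][4], table[2][5])        # d(AC,CTAC) and d(AC,CTACT)
```

The first row starts 0 1 2 ... 8 and the first column 0 1 2 3, as on the slides; the last line prints 2 and 3 for d(AC,CTAC) and d(AC,CTACT).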
19
Edit distance and alignment of strings
  • Connect to
    http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm
  • and use the global method.
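For readers without access to that page, a global alignment can also be recovered from the table by walking back from the last cell. The sketch below is an assumption about how such a tool might work, not a description of the linked one; the name global_alignment is hypothetical, and when several best alignments exist it returns only one of them.

```python
def global_alignment(a: str, b: str) -> tuple[int, str, str]:
    """Return (distance, aligned a, aligned b), using '-' for gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j
    for i in range(rows):
        table[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(table[i - 1][j - 1] + cost,
                              table[i - 1][j] + 1,
                              table[i][j - 1] + 1)

    # Trace back from the bottom-right cell to the top-left cell,
    # always moving to a neighbour that explains the current value.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + cost:
            top.append(a[i - 1]); bottom.append(b[j - 1])  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + 1:
            top.append(a[i - 1]); bottom.append("-")       # deletion from a
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])       # insertion into a
            j -= 1
    return table[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


distance, row_a, row_b = global_alignment("ACGCTATGCTATACG", "ACGGTAGTGACGC")
print(distance)
print(row_a)
print(row_b)
```

The example runs on the two longer strings asked about earlier in the slides; the diagonal move is tried first, so substitutions are preferred over gap pairs when both are optimal.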

20
Edit distance and alignment of strings
  • How can this algorithm be applied
  • to approximate search?

to K-approximate string searching?
21
K-approximate string searching
Text (columns):  C T A C T A C T A C G T A C T G G T G A A
Pattern (rows):  A C T G A
This cell
22
K-approximate string searching
Text (columns):  C T A C T A C T A C G T A C T G G T G A A
Pattern (rows):  A C T G A
This cell gives the distance between (ACTGA,
CTGTA),
but we are only interested in the last
characters
24
K-approximate string searching
Text (columns):  C T A C G T A C T G G T G A A
Pattern (rows):  A C T G A
This cell gives the distance between (ACTGA,
CTGTA),
but we are only interested in the last
characters,
no matter where they appear in the text, then
25
K-approximate string searching
Text (columns):  C T A C G T A C T G G T G A A
First row:       0
Pattern (rows):  A C T G A
This cell gives the distance between (ACTGA,
CTGTA),
but we are only interested in the last
characters,
no matter where they appear in the text, then
27
K-approximate string searching
Text (columns):  C T A C T A C T A C G T A C T G G T G A A
First row:       0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Pattern (rows):  A C T G A
This cell gives the distance between (ACTGA,
CTGTA),
but we are only interested in the last
characters,
no matter where they appear in the text; then
the first row is filled with zeros.
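A minimal sketch of this idea follows, assuming the same recurrence as for the edit distance and changing only the first row. The names search_table and approximate_matches are hypothetical, and positions count text characters starting from 1.

```python
def search_table(pattern: str, text: str) -> list[list[int]]:
    """Distance table whose first row is all zeros (a match may start anywhere)."""
    rows, cols = len(pattern) + 1, len(text) + 1
    table = [[0] * cols for _ in range(rows)]  # first row stays at 0
    for i in range(rows):
        table[i][0] = i                        # first column as before
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = 0 if pattern[i - 1] == text[j - 1] else 1
            table[i][j] = min(table[i][j - 1] + 1,
                              table[i - 1][j - 1] + mismatch,
                              table[i - 1][j] + 1)
    return table


def approximate_matches(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Pairs (end position, distance) where a substring ending there is within k errors."""
    last_row = search_table(pattern, text)[-1]
    return [(j, dist) for j, dist in enumerate(last_row) if dist <= k]


for end, dist in approximate_matches("ACTGA", "CTACTACTACGTACTGGTGAA", 1):
    print(f"match ending at position {end} with {dist} error(s)")
```

With the zero first row, each cell of the last row holds the smallest distance between the pattern and any substring of the text ending at that column. In this text, for instance, ACTG ends at position 16 and ACTGG at position 17, and both are within one error of ACTGA.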
28
K-approximate string searching
  • Connect to
    http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm
  • and use the semi-global method.