Title: Master Course
1Master Course
MSc Bioinformatics for Health Sciences H15
Algorithms on strings and sequences
Xavier Messeguer Peypoch (http://www.lsi.upc.es/alggen)
Dep. de Llenguatges i Sistemes Informàtics
CEPBA-IBM Research Institute
Universitat Politècnica de Catalunya
2Contents
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Approximate string matching (Dynamic programming)
4. Pairwise and multiple alignment
5. Suffix trees
3Contents and bibliography
- G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings, Cambridge University Press, 2002.
- D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.
4String matching
Definition: given a long text T and a set of k patterns p1, p2, ..., pk, the string matching problem is to find all the occurrences of all the patterns in the text T.
On-line algorithms: the patterns are known in advance.
Off-line algorithms: the text is known in advance.
- Only one pattern (exact and approximate)
- Five, ten, a hundred, a thousand, ... patterns (exact)
- Extended patterns
5Master Course
First lecture, first part: (Exact) string matching of one pattern
6String matching one pattern
How do string algorithms carry out the search?
For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA,
and for the pattern TACTACGGTATGACTAA.
7String Matching Brute force algorithm
Example
Given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
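As a rough illustration of the brute force search on this example, here is a minimal Python sketch. The function name `brute_force_search` is a hypothetical choice for this note, and the text is assumed to be the letters shown above without spaces. Every window position is tried and the pattern is compared left to right.

```python
def brute_force_search(text, pattern):
    """Return the 0-based start positions of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    hits = []
    for pos in range(n - m + 1):              # try every window position
        j = 0
        while j < m and text[pos + j] == pattern[j]:
            j += 1                            # compare left to right
        if j == m:                            # all m characters matched
            hits.append(pos)
    return hits


print(brute_force_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

The window always advances by one cell, which is where the O(nm) worst case comes from.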
8String Matching Brute force algorithm
- Connect to http://www-igm.univ-mlv.fr/lecroq/string/index.html and open the Brute Force algorithm.
What is the meaning of the variables?
- y: array with the text T
- n: length of the text
- x: array with the pattern P
- m: length of the pattern
C code of the running file: connect to http://www.lsi.upc.edu/peypoch
9String Matching of one pattern
- The cost of the Brute Force algorithm is O(nm).
Can the search be done at a lower cost?
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
TACTACGGTATGACTAA
10String matching of one pattern
How do string algorithms carry out the search?
At each step a comparison is made and the window is shifted to the right.
What differentiates the algorithms?
- How the comparison is made.
- The length of the shift.
11String Matching Brute force algorithm
The cost is O(mn).
12String Matching one pattern
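Most efficient algorithms (Navarro & Raffinot):
- Horspool
- BNDM: Backward Nondeterministic Dawg Matching
- BOM: Backward Oracle Matching
[Figure: regions where Horspool, BOM and BNDM are the most efficient, plotted against the length of the pattern (horizontal axis, 2 to 256, with the word length w marked); vertical axis from 2 to 64.]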
13String Matching Horspool algorithm
The shift depends on where the last letter of the text window, call it a, appears in the pattern.
A preprocessing step is therefore needed to determine the length of the shift for each letter.
14String Matching Horspool algorithm
Example
Given the pattern ATGTA, the shift table is
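The table itself did not survive in these notes. A minimal Python sketch of how it could be computed is given below, assuming the usual Horspool rule: the shift for a letter is the distance from its rightmost occurrence in the pattern (excluding the last position) to the end of the pattern, and m for letters that do not occur. The names `horspool_shift_table` and `horspool_search` are hypothetical.

```python
def horspool_shift_table(pattern):
    """Shift for each letter of pattern[:-1]; letters not listed shift by len(pattern)."""
    m = len(pattern)
    # Later (more to the right) occurrences overwrite earlier ones.
    return {c: m - 1 - i for i, c in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return the 0-based start positions of pattern in text."""
    n, m = len(text), len(pattern)
    shift = horspool_shift_table(pattern)
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:          # compare the window
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)    # shift by the last window letter
    return hits


print(horspool_shift_table("ATGTA"))   # {'A': 4, 'T': 1, 'G': 2}; any other letter shifts 5
print(horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

Under this rule the table for ATGTA would be A: 4, T: 1, G: 2 and C: 5.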
16String Matching Horspool algorithm
- Connect to http://www-igm.univ-mlv.fr/lecroq/string/index.html and open the Horspool algorithm.
C code: connect to http://www.lsi.upc.edu/peypoch
18BNDM algorithm
How can the next state be obtained?
19BNDM algorithm example
Given the pattern ATGTA,
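Each row reads: D_i = (D_{i-1} shifted one cell to the left) & (mask of the text character read) = result.

D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )

D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )

D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )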
22BNDM algorithm example
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 )
Pattern found!
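The bit vectors above can be reproduced with a short Python sketch. It assumes the convention that seems to be used on the slides: the leftmost bit corresponds to the first letter of the pattern, the window is read from right to left, and each step is a left shift followed by an AND with the mask of the letter read. The names `bit_masks` and `bndm_trace` are hypothetical.

```python
def bit_masks(pattern):
    """Mask per letter: bit k (from the left) is 1 when pattern[k] is that letter."""
    m = len(pattern)
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - j))
    return masks


def bndm_trace(window, pattern):
    """Return D1, D2, ... as bit strings while reading window from right to left."""
    m = len(pattern)
    full = (1 << m) - 1                      # m ones
    masks = bit_masks(pattern)
    d, trace = full, []
    for c in reversed(window):
        d &= masks.get(c, 0)                 # keep positions compatible with c
        trace.append(format(d, f"0{m}b"))
        if d == 0:                           # no factor of the pattern matches
            break
        d = (d << 1) & full                  # shift left, drop the overflow bit
    return trace


print(bndm_trace("ATGTA", "ATGTA"))  # ['10001', '00010', '00100', '01000', '10000']
print(bndm_trace("ACGTA", "ATGTA"))  # ['10001', '00010', '00100', '00000']
```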
24BNDM algorithm
If the leftmost bit is set to one at step i, a prefix of P of length i is equal to a suffix of the text window. The window is then shifted m-i cells, taking the largest such i < m; otherwise it is shifted m cells.
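Putting the comparison and the shift rule together, a BNDM search could look like the Python sketch below. It follows the usual textbook formulation under the same bit convention as the example (leftmost bit = first letter of the pattern); the name `bndm_search` and the use of Python integers as bit vectors are choices made for this note, and patterns are assumed to be no longer than what fits comfortably in an integer.

```python
def bndm_search(text, pattern):
    """Return the 0-based start positions of pattern in text (BNDM sketch)."""
    n, m = len(text), len(pattern)
    full, top = (1 << m) - 1, 1 << (m - 1)
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - j))
    hits, pos = [], 0
    while pos <= n - m:
        j, last, d = m, m, full
        while d != 0:
            d &= masks.get(text[pos + j - 1], 0)   # read the window backwards
            j -= 1
            if d & top:                            # leftmost bit set: prefix found
                if j > 0:
                    last = j                       # remember the shift m - i
                else:
                    hits.append(pos)               # the whole pattern was read
            d = (d << 1) & full
        pos += last
    return hits


print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

On this text the first three windows fail exactly as in the first three traces of the example, and the fourth window reports the occurrence.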
26BOM (Backward Oracle Matching)
27Automaton Factor Oracle properties
Factor Oracle of the word G T A T G T A
[Figure: the factor oracle automaton of GTATGTA.]
The automaton recognizes every factor of the word, but it also recognizes other strings, such as G T G. It is therefore useful only to discard words that are not factors!
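To make this property concrete, here is a minimal Python sketch of the standard on-line factor oracle construction (one state per prefix, plus external transitions found by following supply links). The names `factor_oracle` and `accepts` are hypothetical. The sketch checks that G T G is accepted although it is not a factor of GTATGTA.

```python
def factor_oracle(word):
    """Return the transitions of the factor oracle of word: trans[state][letter] -> state."""
    trans = [dict() for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)           # supply link of each state
    for i, c in enumerate(word):
        trans[i][c] = i + 1                   # internal transition
        k = supply[i]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i + 1               # external transition
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans


def accepts(trans, s):
    """True when s can be read from the initial state (every state is accepting)."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


oracle = factor_oracle("GTATGTA")
print(accepts(oracle, "GTG"), "GTG" in "GTATGTA")   # True False
print(accepts(oracle, "GG"))                         # False: GG can be discarded
```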
28BOM example
- How is the comparison made?
- The Factor Oracle of the reversed pattern is built, and the text window is read backwards with it. Given the pattern ATGTATG:
A T G T A T G
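A hedged Python sketch of the BOM search follows. It builds the factor oracle of the reversed pattern, reads each window backwards, and shifts the window just past the character where the oracle fails. The names `factor_oracle` and `bom_search` are hypothetical, and the text used in the usage line is borrowed from the many-patterns example later in these notes only because it contains ATGTATG.

```python
def factor_oracle(word):
    """Transitions of the factor oracle of word: trans[state][letter] -> state."""
    trans = [dict() for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, c in enumerate(word):
        trans[i][c] = i + 1
        k = supply[i]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i + 1
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans


def bom_search(text, pattern):
    """Return the 0-based start positions of pattern in text (BOM sketch)."""
    n, m = len(text), len(pattern)
    trans = factor_oracle(pattern[::-1])      # oracle of the reversed pattern
    hits, pos = [], 0
    while pos <= n - m:
        state, j = 0, m
        while j > 0 and state is not None:
            state = trans[state].get(text[pos + j - 1])   # read backwards
            j -= 1
        if state is not None:
            hits.append(pos)                  # m letters read: only the pattern fits
        pos += j + 1                          # skip past the failing character
    return hits


print(bom_search("ACATGCTAGCTATAATAATGTATG", "ATGTATG"))  # [17]
```

When the oracle fails after reading the character at window offset j, no occurrence can overlap that character's suffix, so the window can safely start at the next cell.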
35String Matching BNDM and BOM
- Connect to http://www-igm.univ-mlv.fr/lecroq/string/index.html and open the BNDM and BOM algorithms.
C code of BNDM. C code of BOM.
36Master Course
First lecture, second part: (Exact) string matching of many patterns
37String matching many patterns
Given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC,
search for the patterns ACTGACT, GTCT, AATT, ACTGATC, TTT, GTAGC, AATACT, ACATGC, ACTGA.
38Trie
Trie of the words GTATGTA, GTAT, TAATA, GTGTA
[Figure: the trie of the four words.]
What is the cost?
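As an illustration, the trie can be sketched in Python with nested dictionaries, and the text can be scanned by walking the trie from every position. The names `build_trie` and `trie_search`, the `"$"` end-of-word marker and the sample text are assumptions made for this note. Scanned this way, the search costs O(n · lmax) in the worst case, where lmax is the length of the longest word.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" (not a DNA letter) marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
        node["$"] = w
    return root


def trie_search(text, root):
    """Return (start, word) for every word of the trie occurring in text."""
    hits = []
    for i in range(len(text)):                # one walk per start position
        node, j = root, i
        while True:
            if "$" in node:
                hits.append((i, node["$"]))
            if j == len(text) or text[j] not in node:
                break
            node = node[text[j]]
            j += 1
    return hits


trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(trie_search("AGTATGTAATA", trie))
# [(1, 'GTAT'), (1, 'GTATGTA'), (6, 'TAATA')]
```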
39Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
2. lmin = 4 (the length of the shortest pattern)
4. Start the search
40Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
in the text ACATGCTATGTGACA
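The search on this example can be sketched in Python as follows. This is a minimal version of Horspool for a set of patterns under the usual assumptions: a trie of the reversed patterns is read backwards from the end of the window, and the shift for a letter is its smallest distance to the end of any pattern (last position excluded), capped at lmin. The name `set_horspool` and the `"$"` marker are hypothetical.

```python
def set_horspool(text, patterns):
    """Return (start, pattern) for every occurrence of a pattern in text."""
    lmin = min(len(p) for p in patterns)
    root = {}                                   # trie of the reversed patterns
    for p in patterns:
        node = root
        for c in reversed(p):
            node = node.setdefault(c, {})
        node["$"] = p                           # "$" marks a complete pattern
    shift = {}                                  # letters not listed shift by lmin
    for p in patterns:
        for j, c in enumerate(p[:-1]):
            shift[c] = min(shift.get(c, lmin), len(p) - 1 - j)
    hits, end = [], lmin - 1                    # end = last cell of the window
    while end < len(text):
        node, j = root, end
        while j >= 0 and text[j] in node:       # read the text backwards
            node = node[text[j]]
            j -= 1
            if "$" in node:
                hits.append((j + 1, node["$"]))
        end += shift.get(text[end], lmin)
    return hits


print(set_horspool("ACATGCTATGTGACA", ["ATGTATG", "TATG", "ATAAT", "ATGTG"]))
# [(6, 'TATG'), (7, 'ATGTG')]
```

For these four patterns the shift table under this rule is A: 1, T: 1, G: 2 and C: 4.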
47Horspool to Wu-Manber
How can we increase the length of the shifts?
With a shift table indexed by l-mers (blocks of l letters) instead of single letters, built from the patterns
ATGTATG, TATG, ATAAT, ATGTG
48Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
in the text ACATGCTATGTGACATAATA
Experimental block length: log_|Σ|(2 · lmin · r), where r is the number of patterns
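A simplified Python sketch of the Wu-Manber idea is shown below. It is an illustration under stated assumptions, not the original implementation: only the first lmin letters of each pattern are used for the tables, the block length `B` defaults to 2 for the sake of the example, Python dictionaries stand in for the SHIFT and HASH tables, and the name `wu_manber` is hypothetical.

```python
def wu_manber(text, patterns, B=2):
    """Return (start, pattern) for every occurrence of a pattern in text."""
    lmin = min(len(p) for p in patterns)
    assert 1 <= B <= lmin
    default = lmin - B + 1                      # shift for blocks not in any pattern
    shift, bucket = {}, {}
    for p in patterns:
        for j in range(B, lmin + 1):            # block ending at position j of the prefix
            block = p[j - B:j]
            shift[block] = min(shift.get(block, default), lmin - j)
        bucket.setdefault(p[lmin - B:lmin], []).append(p)
    hits, pos = [], lmin                        # pos = exclusive end of the window
    while pos <= len(text):
        block = text[pos - B:pos]
        s = shift.get(block, default)
        if s == 0:                              # candidate: verify the bucket
            start = pos - lmin
            for p in bucket.get(block, []):
                if text.startswith(p, start):
                    hits.append((start, p))
            s = 1
        pos += s
    return hits


pats = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(wu_manber("ACATGCTATGTGACATAATA", pats))
# [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
```

With blocks of two letters, a block that appears in no pattern allows a shift of lmin - B + 1 = 3 cells here, while a single letter in the previous scheme rarely allows more than one or two.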
50String matching of many patterns
[Figure: most efficient algorithm (Wu-Manber or SBOM) as a function of lmin (horizontal axis, 5 to 45), with a vertical axis from 2 to 8, in four panels: 5, 10, 100 and 1000 patterns.]
51Horspool for a set of patterns
- How is the comparison made?
The text is compared with the patterns by means of an automaton built with all the patterns.
- How is the shift determined?
By the occurrence of the last character of the text window, a, in the patterns: specifically, its first occurrence from the right that is not in the last position and lies at a distance shorter than lmin; otherwise the shift is lmin.
53SBOM
54Factor Oracle of many patterns
[Figure: the factor oracle automaton (AFO) of GTATGTA, GTAA, TAATA and GTGTA.]
55SBOM algorithm
- How is the comparison made?
The text is compared with the patterns by means of an automaton of length lmin.
- How is the shift determined? Let a be the text character being read:
- if a does not appear in the AFO;
- if lmin characters have been read.
56SBOM algorithm example
Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG
[Figure: the factor oracle automaton (AFO) used for the search.]
in the text ACATGCTAGCTATAATAATGTATG
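A hedged Python sketch of SBOM on this example follows. It assumes the usual formulation: the factor oracle is built over the trie of the reversed prefixes of length lmin, the window of length lmin is read backwards, a failing character lets the window start just after it, and when lmin characters have been read the candidate patterns are verified and the window moves one cell. The names `build_set_oracle` and `sbom_search` are hypothetical, and a dictionary keyed by prefix replaces the terminal-state bookkeeping for brevity.

```python
from collections import deque


def build_set_oracle(words):
    """Factor oracle of a set of words: trie plus external transitions."""
    trans = [{}]
    for w in words:                              # 1) build the trie
        s = 0
        for c in w:
            if c not in trans[s]:
                trans.append({})
                trans[s][c] = len(trans) - 1
            s = trans[s][c]
    order, queue = [], deque([0])                # 2) breadth-first order of trie edges
    while queue:
        s = queue.popleft()
        for c, t in trans[s].items():
            order.append((s, c, t))
            queue.append(t)
    supply = {0: None}                           # 3) add the external transitions
    for parent, c, cur in order:
        down = supply[parent]
        while down is not None and c not in trans[down]:
            trans[down][c] = cur
            down = supply[down]
        supply[cur] = 0 if down is None else trans[down][c]
    return trans


def sbom_search(text, patterns):
    """Return (start, pattern) for every occurrence of a pattern in text."""
    lmin = min(len(p) for p in patterns)
    trans = build_set_oracle([p[:lmin][::-1] for p in patterns])
    by_prefix = {}
    for p in patterns:
        by_prefix.setdefault(p[:lmin], []).append(p)
    hits, pos = [], 0
    while pos <= len(text) - lmin:
        state, j = 0, lmin
        while j > 0 and state is not None:
            state = trans[state].get(text[pos + j - 1])   # read backwards
            j -= 1
        if state is not None:                    # lmin characters read: verify
            for p in by_prefix.get(text[pos:pos + lmin], []):
                if text.startswith(p, pos):
                    hits.append((pos, p))
        pos += j + 1                             # skip past the failing character
    return hits


pats = ["ATGTATG", "TAATG", "TAATAAT", "AATGTG"]
print(sbom_search("ACATGCTAGCTATAATAATGTATG", pats))
# [(12, 'TAATAAT'), (15, 'TAATG'), (17, 'ATGTATG')]
```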
64Algorithms: exact search of many patterns