Master Course - PowerPoint PPT Presentation

About This Presentation
Title:

Master Course

Description:

H15: Algorithms on strings and sequences ... Flexible pattern matching in strings. G. Navarro and M. Raffinot, 2002, Cambridge Uni. Press ... – PowerPoint PPT presentation

Provided by: lcl2
Transcript and Presenter's Notes

Title: Master Course


1
Master Course
MSc Bioinformatics for Health Sciences, H15: Algorithms on strings and sequences
Xavier Messeguer Peypoch (http://www.lsi.upc.es/alggen)
Dep. de Llenguatges i Sistemes Informàtics, CEPBA-IBM Research Institute, Universitat Politècnica de Catalunya
2
Contents
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Approximate string matching (Dynamic programming)
4. Pairwise and multiple alignment
5. Suffix trees
3
Contents and bibliography
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Approximate string matching (Dynamic programming)
4. Pairwise and multiple alignment
5. Suffix trees
  • Flexible pattern matching in strings. G. Navarro and M. Raffinot, Cambridge University Press, 2002
  • Algorithms on strings, trees and sequences. D. Gusfield, Cambridge University Press, 1997

4
String matching
Definition: given a long text T and a set of k patterns p1, p2, ..., pk, the string matching problem is to find all the occurrences of all the patterns in the text T.
On-line algorithms: the patterns are known in advance.
Off-line algorithms: the text is known in advance.
  • Only one pattern (exact and approximate)
  • Five, ten, a hundred, a thousand, ... patterns (exact)
  • Extended patterns
  • Suffix trees

5
Master Course
First lecture, first part: (Exact) string matching of one pattern
6
String matching one pattern
How do string matching algorithms perform the search?
For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA,
and for the pattern TACTACGGTATGACTAA.
7
String Matching Brute force algorithm
Example
Given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
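The brute force search can be sketched in a few lines. This is a minimal illustration in Python, not the C code the course links to; the function name `brute_force_search` is an assumption introduced here. It tries every window of the text and compares it with the pattern from left to right.

```python
def brute_force_search(text, pattern):
    """Return the start positions of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):          # every possible window start
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # characters match so far
        if j == m:                      # the whole pattern matched
            positions.append(i)
    return positions


# The example of this slide: pattern ATGTA in the text GTACTAGAGGACGTATGTACTG
print(brute_force_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

Each window can cost up to m comparisons and the window always advances by one cell, which is where the O(nm) worst case comes from.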
8
String Matching Brute force algorithm
  • Connect to http://www-igm.univ-mlv.fr/lecroq/string/index.html
  • and open the Brute Force algorithm

What is the meaning of the variables?
  • y: array with the text T
  • n: length of the text
  • x: array with the pattern P
  • m: length of the pattern

C code of the running file:
Connect to http://www.lsi.upc.edu/peypoch
9
String Matching of one pattern
  • The cost of Brute Force algorithm is O(nm).

Can the search be made with lower cost?
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
TACTACGGTATGACTAA
10
String matching of one pattern
How do string matching algorithms perform the search?
At each step the pattern is compared with the current text window, and the window is then shifted to the right.
What distinguishes the algorithms from one another?
  1. How the comparison is made.
  2. The length of the shift.

11
String Matching Brute force algorithm
The cost is O(mn).
12
String Matching one pattern
Most efficient algorithms (Navarro & Raffinot)
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
[Figure: map of the regions where Horspool, BOM, and BNDM are the most efficient; horizontal axis: length of the pattern (2 to 256); vertical axis ticks: 2 to 64; mark: w.]
13
String Matching Horspool algorithm
The shift depends on where the last letter of the text window, call it a, appears in the pattern.
A preprocessing step is therefore needed to determine the length of the shift for each letter.
14
String Matching Horspool algorithm
Example
Given the pattern ATGTA, the shift table is
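A minimal Python sketch of the Horspool preprocessing and search follows. The function names and the default alphabet `"ACGT"` are assumptions made for illustration; the course itself points to C code. The shift of a letter is its distance from its rightmost occurrence (excluding the last cell of the pattern) to the end of the pattern, or m if it does not occur there.

```python
def horspool_shift_table(pattern, alphabet="ACGT"):
    """Shift for each letter: distance from its rightmost occurrence
    (ignoring the last cell of the pattern) to the end of the pattern."""
    m = len(pattern)
    shift = {c: m for c in alphabet}    # letters absent from the pattern
    for j in range(m - 1):              # the last position is excluded
        shift[pattern[j]] = m - 1 - j
    return shift


def horspool_search(text, pattern, alphabet="ACGT"):
    n, m = len(text), len(pattern)
    shift = horspool_shift_table(pattern, alphabet)
    positions = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1                      # compare the window right to left
        if j < 0:
            positions.append(pos)
        # shift according to the last letter of the window
        pos += shift.get(text[pos + m - 1], m)
    return positions


print(horspool_shift_table("ATGTA"))   # {'A': 4, 'C': 5, 'G': 2, 'T': 1}
print(horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```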
16
String Matching Horspool algorithm
  • Connect to http://www-igm.univ-mlv.fr/lecroq/string/index.html
  • and open the Horspool algorithm

C code:
Connect to http://www.lsi.upc.edu/peypoch
17
String Matching one pattern
The most efficient algorithms (Navarro & Raffinot)
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
[Figure: map of the regions where Horspool, BOM, and BNDM are the most efficient; horizontal axis: length of the pattern (2 to 256); vertical axis ticks: 2 to 64; mark: w.]
18
BNDM algorithm
How can the next state be obtained?
19
BNDM algorithm example
Given the pattern ATGTA,
21
BNDM algorithm example
Given the pattern ATGTA,
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
22
BNDM algorithm example
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( ) = ( 0 0 0 0 0 )
Pattern found!

23
BNDM algorithm
24
BNDM algorithm
If the leftmost bit is set to one at step i, a prefix of P of length i is equal to a suffix of the text window.
The window is then shifted m-i cells, for the largest such i smaller than m; otherwise it is shifted m cells.
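The bit-parallel scan and the shift rule can be sketched as follows. This is an illustrative Python version, assuming the usual BNDM formulation in which the mask of each letter has bit m-1 set for the first pattern cell; the names `bndm_search`, `masks`, and `last` are assumptions introduced here. Python integers are unbounded, whereas a C implementation needs the pattern to fit in one machine word.

```python
def bndm_search(text, pattern):
    n, m = len(text), len(pattern)
    # masks[c] has bit (m - 1 - j) set when pattern[j] == c,
    # so the leftmost bit corresponds to the first pattern cell.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - j))
    full = (1 << m) - 1
    top = 1 << (m - 1)
    positions = []
    pos = 0
    while pos <= n - m:
        j, last = m, m                  # last = shift to apply afterwards
        d = full
        while d != 0:
            d &= masks.get(text[pos + j - 1], 0)   # read the window backwards
            j -= 1
            if d & top:                 # a prefix of the pattern was recognized
                if j > 0:
                    last = j            # prefix of length m - j: shift j cells
                else:
                    positions.append(pos)   # the whole pattern was read
            d = (d << 1) & full
        pos += last
    return positions


print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

For the pattern ATGTA the masks are A = 10001, T = 01010, and G = 00100, which are the vectors that appear on the right-hand side of the AND operations in the example slides.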
25
String matching one pattern
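The most efficient algorithms (Navarro & Raffinot)
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
[Figure: map of the regions where Horspool, BOM, and BNDM are the most efficient; horizontal axis: length of the pattern (2 to 256); vertical axis ticks: 2 to 64; mark: w.]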
26
BOM (Backward Oracle Matching)
27
Automaton Factor Oracle properties
Factor Oracle of the word G T A T G T A
G T A T G
T A T G
A T G
T G
G
The automaton recognizes these factors, but it also recognizes other strings, such as G T G;
it is therefore useful only for discarding words that are not factors!
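The property stated on this slide can be checked with a small sketch. The code below assumes the standard on-line factor oracle construction with a supply (failure) function; the names `build_factor_oracle` and `oracle_accepts` are illustrative, not taken from the course material.

```python
def build_factor_oracle(word):
    """Return the transitions of the factor oracle as a list of dicts,
    one per state 0..len(word); every state is accepting."""
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = word[i - 1]
        trans[i - 1][c] = i             # internal transition i-1 -> i
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i             # external transition k -> i
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def oracle_accepts(trans, s):
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


oracle = build_factor_oracle("GTATGTA")
print(oracle_accepts(oracle, "TATG"))   # True: a real factor
print(oracle_accepts(oracle, "GTG"))    # True, although GTG is not a factor
print(oracle_accepts(oracle, "TT"))     # False: certainly not a factor
```

A rejection is always reliable, while an acceptance may be a false positive, which is why the oracle is used only to discard.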
28
BOM example
  • How is the comparison made?
  • The Factor Oracle of the reversed pattern is built. Given the pattern ATGTATG:

A T G T A T G
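A hedged sketch of the BOM search built on this idea: the window is read right to left through the factor oracle of the reversed pattern, and as soon as the oracle fails the window is moved just past the failing character. The function names are assumptions made for this illustration, and the oracle construction is the standard on-line one.

```python
def build_factor_oracle(word):
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = word[i - 1]
        trans[i - 1][c] = i
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def bom_search(text, pattern):
    n, m = len(text), len(pattern)
    trans = build_factor_oracle(pattern[::-1])   # oracle of the reversed pattern
    positions = []
    pos = 0
    while pos <= n - m:
        state, j = 0, m
        while j > 0 and state is not None:
            state = trans[state].get(text[pos + j - 1])  # read backwards
            j -= 1
        if state is not None:           # m characters read without failing
            positions.append(pos)
        pos += j + 1                    # skip past the failing character
    return positions


print(bom_search("CCATGTATGTATGCC", "ATGTATG"))  # [2, 6]
```

The only word of length m that the oracle accepts is the reversed pattern itself, so reading the whole window without a failure reports an occurrence.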
34
BOM (Backward Oracle Matching)
35
String Matching BNDM and BOM
  • Connect to http://www-igm.univ-mlv.fr/lecroq/string/index.html
  • and open the BNDM and BOM algorithms

C code of BNDM
C code of BOM
36
Master Course
First lecture, second part: (Exact) string matching of many patterns
37
String matching many patterns
Given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC,
search for the patterns ACTGACT, GTCT, AATT, ACTGATC, TTT, GTAGC, AATACT, ACATGC, ACTGA.
38
Trie
Trie of the words GTATGTA, GTAT, TAATA, GTGTA
[Figure: the trie of the four words.]
What is the cost?
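As a small illustration of the cost question, here is a sketch of a trie built with nested Python dictionaries. The names `build_trie`, `count_nodes`, and `trie_contains`, and the use of `None` as the end-of-word key, are choices made for this example only. Each character of each word is visited once, so the construction takes time proportional to the total length of the words (with constant expected time per dictionary access).

```python
def build_trie(words):
    """Nested-dict trie; the key None marks the end of a stored word."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})   # reuse or create the child
        node[None] = w
    return root


def count_nodes(node):
    """Number of trie nodes, including the root."""
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key is not None)


def trie_contains(root, word):
    node = root
    for c in word:
        if c not in node:
            return False
        node = node[c]
    return None in node


trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(count_nodes(trie))            # 16: the root plus 15 letters
print(trie_contains(trie, "GTAT"))  # True
print(trie_contains(trie, "GTA"))   # False: a prefix, not a stored word
```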
39
Horspool for many patterns
Search for ATGTATG,TATG,ATAAT,ATGTG
2. lmin = 4
4. Start the search
40
Horspool for many patterns
Search for ATGTATG,TATG,ATAAT,ATGTG
The text ACATGCTATGTGACA
47
Horspool to Wu-Manber
How can we increase the length of the shifts?
With a shift table of l-mers built from the patterns ATGTATG, TATG, ATAAT, ATGTG
48
Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
in the text ACATGCTATGTGACATAATA

Experimental block length: log_|Σ|(2 · lmin · r), where r is the number of patterns
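A simplified sketch of the Wu-Manber idea: the shift table is indexed by blocks of `block` characters taken from the first lmin characters of each pattern, and a shift of zero triggers a verification of the candidate patterns. The function name, the dictionary-based tables, and the choice `block=2` are assumptions for illustration (for DNA with lmin = 4 and 4 patterns the experimental formula gives log_4(32) = 2.5); real implementations hash the blocks instead.

```python
def wu_manber_search(text, patterns, block=2):
    lmin = min(len(p) for p in patterns)    # requires lmin >= block
    default = lmin - block + 1              # shift for blocks not in any pattern
    shift, candidates = {}, {}
    for p in patterns:
        for end in range(block, lmin + 1):
            b = p[end - block:end]
            # distance from this block to the end of the lmin-prefix
            shift[b] = min(shift.get(b, default), lmin - end)
        candidates.setdefault(p[lmin - block:lmin], []).append(p)
    matches = []
    pos = lmin                               # exclusive end of the window
    while pos <= len(text):
        b = text[pos - block:pos]
        s = shift.get(b, default)
        if s == 0:                           # possible occurrence: verify
            start = pos - lmin
            for p in candidates[b]:
                if text.startswith(p, start):
                    matches.append((start, p))
            s = 1
        pos += s
    return matches


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(wu_manber_search("ACATGCTATGTGACATAATA", patterns))
# [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
```

Blocks that appear in no pattern allow a shift of lmin - block + 1, which is larger on average than the single-character shifts of Horspool when many patterns cover most letters.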
49
String matching of many patterns
50
String matching of many patterns
[Figure: four maps, for sets of 5, 10, 100, and 1000 patterns, of the regions where Wu-Manber and SBOM are the most efficient; axes: Lmin (ticks 5 to 45) and ticks 2, 4, 8.]
51
Horspool for a set of patterns
  • How is the comparison made?

The text is compared with the patterns using an automaton built with all the patterns.
  • How is the shift determined?

By where the last character of the text window, a, appears in the s patterns: specifically, its first occurrence from the right that is not in the last position and lies less than lmin from the end of the pattern; otherwise the shift is lmin.
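The two answers above can be put together in a short sketch of Horspool for a set of patterns. It assumes the usual formulation in which the automaton is a trie of the reversed patterns and the text is read backwards from the last character of the window; the function name and the `None` end marker are illustrative choices.

```python
def set_horspool_search(text, patterns):
    lmin = min(len(p) for p in patterns)
    # Automaton: trie of the reversed patterns (None marks a complete pattern).
    root = {}
    for p in patterns:
        node = root
        for c in reversed(p):
            node = node.setdefault(c, {})
        node[None] = p
    # Shift: smallest distance to the end of a pattern of an occurrence that
    # is not in the last position, never larger than lmin.
    shift = {}
    for p in patterns:
        for j in range(len(p) - 1):
            shift[p[j]] = min(shift.get(p[j], lmin), len(p) - 1 - j)
    matches = []
    pos = lmin - 1                          # last character of the window
    while pos < len(text):
        node, i = root, pos
        while i >= 0 and text[i] in node:   # read the text backwards
            node = node[text[i]]
            i -= 1
            if None in node:
                matches.append((i + 1, node[None]))
        pos += shift.get(text[pos], lmin)
    return matches


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(set_horspool_search("ACATGCTATGTGACA", patterns))
# [(6, 'TATG'), (7, 'ATGTG')]
```

For these patterns the shifts are A = 1, T = 1, G = 2, and lmin = 4 for any other letter, which shows how quickly the shifts shrink as the set of patterns grows.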
52
String matching of many patterns
53
SBOM
54
Factor Oracle of many patterns
[Figure: the factor oracle automaton, with states labeled 1,4, 3, and 2.]
The AFO of GTATGTA, GTAA, TAATA and GTGTA
55
SBOM algorithm
  • How is the comparison made?

[Figure: the text, the patterns, and the automaton of length lmin.]
  • How is the shift determined?

  • If the character a does not appear in the AFO
  • If lmin characters have been read

56
SBOM algorithm example
Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG
[Figure: the factor oracle automaton of the patterns, with states labeled 1, 4, 2, and 3.]
Text: ACATGCTAGCTATAATAATGTATG
64
Alg. Exact search of many patterns