Title: Information Retrieval
1Information Retrieval
- Modern Information Retrieval (1999)
- Ricardo Baeza-Yates and Berthier Ribeiro-Neto
- Flexible Pattern Matching in Strings (2002)
- Gonzalo Navarro and Mathieu Raffinot
- http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Pattern search algorithms (exact and approximate): string matching and pattern matching
Text indexing: suffix trees, suffix arrays
2String Matching
String matching: definition of the problem (text, pattern)
The approach depends on what we have: the text or the patterns
- The patterns ---> data structures for the patterns
- 1 pattern ---> the algorithm depends on |p| and |Σ|
- k patterns ---> the algorithm depends on k, |p| and |Σ|
- The text ---> data structure for the text (suffix tree, ...)
- Sequence alignment (pairwise and multiple)
- Sequence assembly hash algorithm
Hidden Markov Models
3Exact string matching: one pattern (text on-line)
Experimental efficiency (Navarro & Raffinot)
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
[Figure: map of the most efficient algorithm (Horspool, BOM or BNDM) as a function of the pattern length (2 to 256, with the word size w marked) and the alphabet size |Σ| (2 to 64).]
4Multiple string matching
5Trie
Construct the trie of GTATGTA,GTAT,TAATA,GTGTA
[Figure: the trie is built by inserting the patterns one at a time; patterns with a common prefix share a path from the root.]
What is the cost?
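The construction can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the name `build_trie` and the node layout (a list of dictionaries) are assumptions made for the example. Each symbol of each pattern is visited once, so with hashed child tables the expected cost is proportional to the total length of the patterns.

```python
def build_trie(patterns):
    """Return the trie as a list of nodes (hypothetical layout).

    Each node is a dict with 'children' (symbol -> node index) and
    'ends' (the patterns that end exactly at this node).
    """
    nodes = [{"children": {}, "ends": []}]  # node 0 is the root
    for pattern in patterns:
        state = 0
        for symbol in pattern:
            nxt = nodes[state]["children"].get(symbol)
            if nxt is None:
                # No edge for this symbol yet: create a new node.
                nodes.append({"children": {}, "ends": []})
                nxt = len(nodes) - 1
                nodes[state]["children"][symbol] = nxt
            state = nxt
        nodes[state]["ends"].append(pattern)
    return nodes


trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(len(trie))        # 16 nodes, root included
print(trie[4]["ends"])  # ['GTAT'] ends in the middle of the GTATGTA branch
```

GTAT adds no nodes because it is a prefix of GTATGTA, and GTGTA reuses the path G-T before branching.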
10Set Horspool algorithm
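- How is the comparison made?
By suffixes: the text window is read backwards against the trie of all reversed patterns.
- What is the next position of the window?
Let a be the last text character of the window. We shift until a is aligned with the nearest a in the trie, provided the shift is no longer than lmin; otherwise we shift by lmin.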
11Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
2. Determine lmin = 4
4. Find the patterns
15Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
text ACATGCTATGTGACA
Is the expected length of the shifts related to the number of patterns?
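The whole procedure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the slides' own code: the function name `set_horspool` and the dictionary-based trie are hypothetical, and the shift table follows the usual Horspool rule (for each symbol, the smallest distance from one of its occurrences to the end of a pattern, excluding the last position and capped at lmin), since the slide that showed the table did not survive extraction.

```python
def set_horspool(patterns, text):
    """Report (start, pattern) for every occurrence of a pattern in text."""
    lmin = min(len(p) for p in patterns)

    # Trie of the reversed patterns: trie[state] maps symbol -> next state.
    trie, ends = [{}], {}
    for p in patterns:
        state = 0
        for symbol in reversed(p):
            if symbol not in trie[state]:
                trie.append({})
                trie[state][symbol] = len(trie) - 1
            state = trie[state][symbol]
        ends.setdefault(state, []).append(p)

    # Shift table: symbols that occur in no pattern get the default lmin.
    shift = {}
    for p in patterns:
        for j, symbol in enumerate(p[:-1]):
            shift[symbol] = min(shift.get(symbol, lmin), len(p) - 1 - j)

    matches = []
    pos = lmin - 1  # index of the last character of the window
    while pos < len(text):
        state, j = 0, pos
        # Read the text backwards while the trie has a transition.
        while j >= 0 and text[j] in trie[state]:
            state = trie[state][text[j]]
            for p in ends.get(state, []):
                matches.append((j, p))
            j -= 1
        pos += shift.get(text[pos], lmin)
    return matches


print(set_horspool(["ATGTATG", "TATG", "ATAAT", "ATGTG"], "ACATGCTATGTGACA"))
# [(6, 'TATG'), (7, 'ATGTG')]
```

For these four patterns the table gives shifts of 1 for A and T, 2 for G and 4 for C. With more patterns, more symbols appear close to the end of some pattern, so the shifts tend to shrink.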
23Set Horspool algorithm → Wu-Manber algorithm
How can the length of the shifts be increased?
By reading blocks of symbols instead of only one. Given ATGTATG, TATG, ATAAT, ATGTG
[Figure: the shift table for blocks of symbols, filled in step by step.]
28Wu-Manber algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
text ACATGCTATGTGACATAATA
But given k patterns, how many symbols should we take?
$\log_{|\Sigma|}(2 \cdot \mathit{lmin} \cdot k)$
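A minimal sketch of the block-based search, assuming the usual formulation in which only the first lmin symbols of each pattern feed the tables (the slides may align the patterns differently). The names `block_size` and `wu_manber`, the parameter `B` and the dictionary tables are assumptions made for this example, and the block size must satisfy 1 <= B <= lmin.

```python
import math


def block_size(alphabet_size, lmin, k):
    """Suggested block length: log base |alphabet| of (2 * lmin * k)."""
    return math.log(2 * lmin * k, alphabet_size)


def wu_manber(patterns, text, B=2):
    """Report (start, pattern) for every occurrence, reading blocks of B symbols."""
    lmin = min(len(p) for p in patterns)
    default = lmin - B + 1  # shift for a block that occurs in no pattern
    shift, bucket = {}, {}
    for p in patterns:
        # Only the first lmin symbols of each pattern are used for the tables.
        for end in range(B, lmin + 1):
            block = p[end - B:end]
            shift[block] = min(shift.get(block, default), lmin - end)
        # Patterns are grouped by the block that ends their lmin-prefix.
        bucket.setdefault(p[lmin - B:lmin], []).append(p)

    matches = []
    pos = lmin - 1  # index of the last character of the window
    while pos < len(text):
        block = text[pos - B + 1:pos + 1]
        s = shift.get(block, default)
        if s == 0:
            # Candidate window: verify the patterns that share this block.
            start = pos - lmin + 1
            for p in bucket.get(block, []):
                if text.startswith(p, start):
                    matches.append((start, p))
            s = 1
        pos += s
    return matches


print(block_size(4, 4, 4))  # about 2.5 for the DNA alphabet, lmin = 4, k = 4
print(wu_manber(["ATGTATG", "TATG", "ATAAT", "ATGTG"], "ACATGCTATGTGACATAATA"))
# [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
```

With B = 2 and lmin = 4, a block that appears in no pattern allows a shift of 3, whereas the single-symbol table of Set Horspool for the same patterns gives shifts of 1 or 2 for three of the four DNA symbols.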
32Multiple string matching
33BOM algorithm (Backward Oracle Matching)
The next window position is determined by the last character of the text that has a transition in the automaton.
34Factor Oracle of k strings
How can we build the Factor Oracle of GTATGTA, GTAA, TAATA and GTGTA?
[Figure: the resulting automaton, with terminal states labelled by pattern number.]
36Factor Oracle of k strings
Given the Factor Oracle of GTATGTA, we insert GTAA.
Given the Factor Oracle automaton (AFO) of GTATGTA and GTAA, we insert TAATA.
Given the AFO of GTATGTA, GTAA and TAATA, we insert GTGTA.
[Figure: the automaton after each insertion, with terminal states labelled by pattern number.]
This is the Factor Oracle automaton of GTATGTA, GTAA, TAATA and GTGTA.
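For comparison, here is a sketch of the trie-based construction described in the literature (for instance in Navarro and Raffinot): build the trie, then visit its states level by level and add external transitions along a supply function. This is a different procedure from the string-by-string insertion shown in the slides and may yield a different automaton; the names `build_factor_oracle`, `accepts`, `delta` and `supply` are assumptions made for the example.

```python
from collections import deque


def build_factor_oracle(strings):
    """Return delta, where delta[state] maps symbol -> next state (0 = root)."""
    # Step 1: plain trie of the strings, remembering each state's parent edge.
    delta, parent = [{}], [None]
    for s in strings:
        state = 0
        for symbol in s:
            if symbol not in delta[state]:
                delta.append({})
                parent.append((state, symbol))
                delta[state][symbol] = len(delta) - 1
            state = delta[state][symbol]

    # Step 2: level-by-level order, fixed before any extra transition is added.
    order, queue = [], deque([0])
    while queue:
        state = queue.popleft()
        order.append(state)
        queue.extend(delta[state].values())

    # Step 3: add external transitions by walking down the supply function.
    supply = [None] * len(delta)
    for current in order[1:]:
        par, symbol = parent[current]
        down = supply[par]
        while down is not None and symbol not in delta[down]:
            delta[down][symbol] = current
            down = supply[down]
        supply[current] = 0 if down is None else delta[down][symbol]
    return delta


def accepts(delta, word):
    """True if the automaton can read word from the root (every state accepts)."""
    state = 0
    for symbol in word:
        if symbol not in delta[state]:
            return False
        state = delta[state][symbol]
    return True


oracle = build_factor_oracle(["GTATGTA", "GTAA", "TAATA", "GTGTA"])
print(accepts(oracle, "ATGTA"))  # True: a factor of GTATGTA
print(accepts(oracle, "GG"))     # False
```

The oracle recognises at least every factor of the input strings, and possibly some other words as well, which is why a search based on it still has to verify each candidate occurrence.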
48SBOM algorithm
The next window position is determined by the last character of the text that has a transition in the automaton.
49SBOM algorithm example
We search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG,
then we build the Factor Oracle automaton of GTATG, GTAAT, TAATA and GTGTA, of length lmin = 5
[Figure: the Factor Oracle automaton, with terminal states labelled 1 to 4.]
50SBOM algorithm example
Search for ATGTATG, TAATG, TAATAAT and AATGTG
Text: ACATGCTAGCTATAATAATGTATG
[Figure: the Factor Oracle automaton, with terminal states labelled 1 to 4, and the search window advancing over the text.]
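The search can be sketched as below. This is a simplified illustration rather than the slides' procedure: it follows the textbook formulation, where the oracle is built on the reversed prefixes of length lmin, whereas the strings listed on the slide (GTATG, GTAAT, TAATA, GTGTA) are the reversed suffixes of length 5, so the automaton here may differ from the one in the figure. The names `build_oracle` and `sbom` are hypothetical, and every candidate window is verified by direct comparison against all patterns instead of using terminal-state labels.

```python
from collections import deque


def build_oracle(strings):
    """Trie-based factor oracle; delta[state] maps symbol -> next state."""
    delta, parent = [{}], [None]
    for s in strings:
        state = 0
        for symbol in s:
            if symbol not in delta[state]:
                delta.append({})
                parent.append((state, symbol))
                delta[state][symbol] = len(delta) - 1
            state = delta[state][symbol]
    # Level-by-level order of the trie states, fixed before adding transitions.
    order, queue = [], deque([0])
    while queue:
        state = queue.popleft()
        order.append(state)
        queue.extend(delta[state].values())
    supply = [None] * len(delta)
    for current in order[1:]:
        par, symbol = parent[current]
        down = supply[par]
        while down is not None and symbol not in delta[down]:
            delta[down][symbol] = current
            down = supply[down]
        supply[current] = 0 if down is None else delta[down][symbol]
    return delta


def sbom(patterns, text):
    """Report (start, pattern) for every occurrence of a pattern in text."""
    lmin = min(len(p) for p in patterns)
    # Oracle of the reversed prefixes of length lmin (textbook choice).
    delta = build_oracle([p[:lmin][::-1] for p in patterns])
    matches = []
    pos = 0  # index of the first character of the window
    while pos + lmin <= len(text):
        state, j = 0, pos + lmin - 1
        # Read the window backwards while the oracle has a transition.
        while j >= pos and text[j] in delta[state]:
            state = delta[state][text[j]]
            j -= 1
        if j < pos:
            # Whole window read: verify every pattern at this position.
            for p in patterns:
                if text.startswith(p, pos):
                    matches.append((pos, p))
            pos += 1
        else:
            # text[j] has no transition: no occurrence can start at or before j.
            pos = j + 1
    return matches


print(sbom(["ATGTATG", "TAATG", "TAATAAT", "AATGTG"], "ACATGCTAGCTATAATAATGTATG"))
# [(12, 'TAATAAT'), (15, 'TAATG'), (17, 'ATGTATG')]
```

The shift to j + 1 is safe because the substring from j to the end of the window is not a factor of any length-lmin prefix, so no occurrence can start at or before j.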
58Multiple string matching