Title: Information Retrieval
1 Information Retrieval
- Modern Information Retrieval (1999)
  - Ricardo Baeza-Yates and Berthier Ribeiro-Neto
- Flexible Pattern Matching in Strings (2002)
  - Gonzalo Navarro and Mathieu Raffinot
- Algorithms on Strings (2001)
  - M. Crochemore, C. Hancart and T. Lecroq
- http://www-igm.univ-mlv.fr/lecroq/string/index.html
2 String Matching
String matching: definition of the problem (text, pattern)
The approach depends on what we have: the text or the patterns
- The patterns --> Data structures for the patterns
  - 1 pattern --> The algorithm depends on p and ?
  - k patterns --> The algorithm depends on k, p and ?
- The text --> Data structure for the text (suffix tree, ...)
- Sequence alignment (pairwise and multiple)
- Sequence assembly: hash algorithm
Hidden Markov Models
3 Index
Part 1: Suffix trees
  Algorithms on Strings, Trees and Sequences, Dan Gusfield, Cambridge University Press
Part 2: Suffix arrays
  Suffix arrays: a new method for on-line string searches, G. Myers and U. Manber
4 Suffix trees
Given the string ababaas
Suffixes (start position, suffix):
3 abaas
1 ababaas
4 baas
2 babaas
What kind of queries?
5 Applications of Suffix trees
1. Exact string matching
- Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
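The question above reduces to a property of suffixes: a pattern occurs in the text exactly when it is a prefix of some suffix. The following is a minimal sketch of that idea using a plain list of suffixes instead of a tree; the function names are illustrative only, and the trailing s of the slide's string is treated as an ordinary character.

```python
def suffixes(text):
    """Return (start position, suffix) pairs, positions counted from 1 as on the slides."""
    return [(i + 1, text[i:]) for i in range(len(text))]


def occurrences(text, pattern):
    """Start positions of every suffix that begins with the pattern."""
    return [pos for pos, suffix in suffixes(text) if suffix.startswith(pattern)]


if __name__ == "__main__":
    for pattern in ("abab", "aab", "ab"):
        print(pattern, occurrences("ababaas", pattern))
```

A suffix tree answers the same question by walking the pattern down from the root, so the work depends on the pattern length rather than on the number of suffixes.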
6 Quadratic insertion algorithm
Invariant Properties
Given the string ...
P1: the leaves of the suffixes from ? have been inserted
7-21 Quadratic insertion algorithm
Given the string ababaabbs
[Figure sequence: the suffixes are inserted into the tree one at a time. Edge and leaf labels that survive from the figures: "ababaabbs,1", "babaabbs,2", "ba", "baabbs,2".]
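As a rough illustration of why inserting the suffixes one by one costs quadratic time, here is a sketch that builds an uncompressed suffix trie (one node per character) rather than the compressed suffix tree drawn on the slides. That simplification, the nested-dictionary representation, and the "leaf" key are assumptions made for brevity, not the slides' data structure.

```python
def build_suffix_trie(text):
    """Insert every suffix character by character: about n^2 / 2 steps in total."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based start position, as in the slide labels
    return root


def leaves_below(node):
    """Collect the start positions stored anywhere below a node."""
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def find(root, pattern):
    """Walk the pattern down from the root; the leaves below are its occurrences."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return leaves_below(node)


if __name__ == "__main__":
    trie = build_suffix_trie("ababaabbs")
    print(find(trie, "ab"))
```

Each suffix is walked from the root independently of the previous one, which is where the quadratic cost comes from.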
22 Generalized suffix tree
The suffix tree of many strings is called the generalized suffix tree; it is the suffix tree of the concatenation of the strings.
For instance,
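A small sketch of the concatenation step may help. It assumes that each string is closed with its own terminator symbol (written α and β here, which is a guess at the symbols used in the figures) and that leaves are labelled with a position local to their string. The helper names are hypothetical.

```python
from bisect import bisect_right


def concatenate(strings, terminators):
    """Join the strings, closing each with its own terminator.

    Also returns the 0-based offset at which each string starts.
    """
    pieces, starts, offset = [], [], 0
    for s, t in zip(strings, terminators):
        starts.append(offset)
        pieces.append(s + t)
        offset += len(s) + len(t)
    return "".join(pieces), starts


def locate(starts, global_pos):
    """Map a 0-based position in the concatenation to (string number, 1-based local position)."""
    k = bisect_right(starts, global_pos) - 1
    return k + 1, global_pos - starts[k] + 1


if __name__ == "__main__":
    text, starts = concatenate(["ababaabb", "aabaa"], ["α", "β"])
    print(text, starts, locate(starts, 9))
```

Distinct terminators keep a suffix of one string from being confused with a substring that runs across the boundary into the next string.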
23-36 Generalized suffix tree
Construction of the suffix tree of ababaabbaaabaaβ
Given the suffix tree of ababaaba
[Figure sequence: the suffixes of the second string are added one at a time to the existing tree. Leaf labels that survive from the figures, in order of appearance: "ba,5", "aaβ,1", "bba,3", "baabba,1", "β,2", "bba,4", "baabba,2", "β,3", "β,4".]
Generalized suffix tree of ababaabbaaabaaβ
37 Applications of Generalized Suffix trees
1. The substring problem for a database of strings DB
- Does the DB contain any occurrence of the patterns abab, aab, and ab?
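The database query can be sketched with a generalized structure in which every node remembers which strings pass through it. As a simplification, the code below uses an uncompressed trie instead of a suffix tree, and the two-string database is an assumption based on the strings that appear to be concatenated in the earlier figures.

```python
def build_generalized_trie(strings):
    """Insert every suffix of every string; each node records the strings that reach it."""
    root = {"owners": set(), "children": {}}
    for sid, s in enumerate(strings, start=1):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"owners": set(), "children": {}}
                )
                node["owners"].add(sid)
    return root


def strings_containing(root, pattern):
    """Numbers (1-based) of the database strings that contain the pattern."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return set(node["owners"])


if __name__ == "__main__":
    db = build_generalized_trie(["ababaabb", "aabaa"])
    for pattern in ("abab", "aab", "ab"):
        print(pattern, sorted(strings_containing(db, pattern)))
```

Once the structure is built, each query costs time proportional to the pattern length, independent of how many strings the database holds.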
38 Applications of Generalized Suffix trees
2. The longest common substring of two strings
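In a generalized suffix tree, the longest common substring is spelt by the deepest node that has leaves from both strings below it. The sketch below applies that idea to an uncompressed trie, which is a simplification that costs quadratic time and space; the example strings are assumed from the earlier figures.

```python
def longest_common_substring(first, second):
    """Deepest path in a generalized suffix trie that is reached by both strings."""
    root = {}
    best = ""
    for owner, s in ((1, first), (2, second)):
        for start in range(len(s)):
            node = root
            for depth, ch in enumerate(s[start:], start=1):
                entry = node.setdefault(ch, {"owners": set(), "next": {}})
                entry["owners"].add(owner)
                # A node reached by suffixes of both strings spells a common substring.
                if len(entry["owners"]) == 2 and depth > len(best):
                    best = s[start:start + depth]
                node = entry["next"]
    return best


if __name__ == "__main__":
    print(longest_common_substring("ababaabb", "aabaa"))
```

If several common substrings share the maximum length, this sketch returns the first one it meets.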
39 Definition of MUM
MUM: Maximal Unique Matching
40 Applications of Generalized Suffix trees
3. Finding MUMs.
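To make the three words of the definition concrete, the following brute-force sketch reports every substring that occurs exactly once in each string (unique, matching) and cannot be extended to the left or right while still matching (maximal). It deliberately does not use a suffix tree, so it is far slower than the method the slides point to; the function names and example strings are assumptions.

```python
def count_occurrences(text, piece):
    """Number of (possibly overlapping) occurrences of piece in text."""
    return sum(
        1 for i in range(len(text) - len(piece) + 1) if text.startswith(piece, i)
    )


def find_mums(a, b):
    """Return (substring, position in a, position in b) triples, positions 1-based."""
    mums = []
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            piece = a[i:j]
            if count_occurrences(a, piece) != 1 or count_occurrences(b, piece) != 1:
                continue
            k = b.find(piece)  # the single occurrence in b
            left_ext = i > 0 and k > 0 and a[i - 1] == b[k - 1]
            right_ext = j < len(a) and k + len(piece) < len(b) and a[j] == b[k + len(piece)]
            if not left_ext and not right_ext:
                mums.append((piece, i + 1, k + 1))
    return mums


if __name__ == "__main__":
    print(find_mums("ababaabb", "aabaa"))
```

Extending a unique match keeps it unique, so checking one character on each side is enough to decide maximality.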
41 Quadratic insertion algorithm
Invariant Properties
Given the string ...
P1: the leaves of the suffixes from ? have been inserted
42 Linear insertion algorithm
Invariant Properties
Given the string ...
P1: the leaves of the suffixes from ? have been inserted
P2: the string ? is the longest string that can be spelt through the tree.
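The symbols in P2 did not survive, so the following rests on an assumption: that P2 refers to what McCreight's construction calls the head of a suffix, that is, the longest prefix of the next suffix that can already be spelt in the tree because it also starts at an earlier position. Under that reading, the sketch computes the head lengths naively. It is not a linear-time construction; it only shows the property such constructions exploit, namely that the head shrinks by at most one character from one suffix to the next.

```python
def head_lengths(text):
    """For each suffix, the length of its longest prefix that also starts earlier in the text."""
    n = len(text)
    heads = []
    for i in range(n):
        best = 0
        for j in range(i):
            k = 0
            while i + k < n and text[i + k] == text[j + k]:
                k += 1
            best = max(best, k)
        heads.append(best)
    return heads


if __name__ == "__main__":
    heads = head_lengths("ababaababb")
    print(heads)
    # Each head is at most one character shorter than the previous one.
    print(all(heads[i + 1] >= heads[i] - 1 for i in range(len(heads) - 1)))
```

Because most of the next head is already known, a linear-time algorithm can resume matching near where it stopped instead of starting again from the root.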
43-63 Linear insertion algorithm example
Given the string ababaababb...
[Figure sequence: the tree is extended step by step; position markers 6 to 9 are shown over the string. Leaf labels that survive from the figures, in order of appearance: "ababb...,5", "ababb...,3", "ababb...,4", "baababb...,2", "aababb...,2", "b...,7", "b...,8", "b...,9".]
64-72 Linear insertion algorithm
Given the string ababaababs
[Figure sequence: step-by-step construction of the tree.]
73 Index
Part 1: Suffix trees
  Algorithms on Strings, Trees and Sequences, Dan Gusfield, Cambridge University Press
Part 2: Suffix arrays
  Suffix arrays: a new method for on-line string searches, G. Myers and U. Manber
74 Suffix arrays
Given the string ababaa

Suffixes              Suffixes, lexicographically sorted
1 ababaa              6 a
2 babaa               5 aa
3 abaa                3 abaa
4 baa                 1 ababaa
5 aa                  4 baa
6 a                   2 babaa

What is the cost?
O(n log(n))
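The sorted list above can be reproduced with a direct sort of the suffixes. This is only a sketch: comparing whole suffixes makes the worst case slower than the O(n log(n)) quoted on the slide, which needs a dedicated construction algorithm.

```python
def suffix_array(text):
    """1-based start positions of the suffixes, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


if __name__ == "__main__":
    text = "ababaa"
    for pos in suffix_array(text):
        print(pos, text[pos - 1:])
```

Only the array of positions needs to be stored; each suffix is recovered from the text when it is needed.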
75 Applications of suffix arrays
1. Exact string matching
- Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
Binary search
76-77 Search with cost O(log(n) · |P|)
Invariant Properties
Algorithm
If ? < query then α ? else β ?
Cost
O(log(n) · |P|)
Can it be improved to O(log(n) + |P|)?
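The plain binary search can be sketched as follows. Because the suffixes are sorted, all suffixes that start with the pattern form one contiguous block of the array, and two binary searches find its ends. Each step compares at most |P| characters, which gives the product of log(n) and |P|. The function names are illustrative, and the suffix array is built by direct sorting for brevity.

```python
def suffix_array(text):
    """1-based start positions of the suffixes, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, pattern):
    """All 1-based positions of pattern, located with two binary searches."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th suffix in sorted order.
        return text[sa[k] - 1:sa[k] - 1 + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "ababaas"
    sa = suffix_array(text)
    for pattern in ("abab", "aab", "ab"):
        print(pattern, find_occurrences(text, sa, pattern))
```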
78-79 Fast search with cost O(log(n) + |P|)
Suffix array
1 2 ... n
Invariant Properties
Algorithm
If x < y then α ?
   x > y then β ?
   x = y then
fi
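The meaning of x and y is not recoverable from the slide; they presumably stand for common-prefix lengths, which the full method precomputes for every interval of the binary search. The sketch below shows only a simpler, well-known acceleration and should not be read as the slide's algorithm. It remembers how many pattern characters already match at the two ends of the current interval and skips that many characters at the midpoint. This usually avoids most re-comparisons, but it does not by itself guarantee the O(log(n) + |P|) worst case.

```python
def suffix_array(text):
    """1-based start positions of the suffixes, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_one(text, sa, pattern):
    """Return the 1-based position of one occurrence of pattern, or None."""
    m, n = len(pattern), len(text)
    lo, hi = -1, len(sa)  # virtual boundaries just outside the array
    l = r = 0             # pattern characters known to match at lo and at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        start = sa[mid] - 1  # 0-based start of the middle suffix
        k = min(l, r)        # every suffix between lo and hi shares this much with pattern
        while k < m and start + k < n and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            return sa[mid]   # pattern is a prefix of this suffix
        if start + k == n or text[start + k] < pattern[k]:
            lo, l = mid, k   # middle suffix is smaller than pattern
        else:
            hi, r = mid, k   # middle suffix is larger than pattern
    return None


if __name__ == "__main__":
    text = "ababaa"
    sa = suffix_array(text)
    print(find_one(text, sa, "baa"), find_one(text, sa, "aab"))
```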