Recuperació de la informació (Information Retrieval)

Slides: 58
Provided by: lcl2
Transcript and Presenter's Notes

1
Information Retrieval (Recuperació de la informació)
  • Modern Information Retrieval (1999)
  • Ricardo Baeza-Yates and Berthier Ribeiro-Neto
  • Flexible Pattern Matching in Strings (2002)
  • Gonzalo Navarro and Mathieu Raffinot
  • http://www-igm.univ-mlv.fr/~lecroq/string/index.html

Pattern search algorithms (exact and approximate): string matching and pattern matching
Text indexing: suffix trees, suffix arrays
2
String Matching
String matching: definition of the problem (text, pattern)
The approach depends on what we have in advance: the text or the patterns
  • Exact matching
  • The patterns → data structures for the patterns
  • 1 pattern → the algorithm depends on |p| and |Σ|
  • k patterns → the algorithm depends on k, |p| and |Σ|
  • Extensions
  • Regular expressions
  • The text → data structure for the text (suffix tree, ...)
  • Approximate matching
  • Dynamic programming
  • Sequence alignment (pairwise and multiple)
  • Sequence assembly: hash algorithm
  • Probabilistic search
  • Hidden Markov Models
3
Exact string matching: one pattern (text on-line)
Experimental efficiency (Navarro & Raffinot)
[Figure: regions labelled Horspool, BOM and BNDM; horizontal axis: pattern length (2 to 256, with w marked); vertical axis: |Σ| (2 to 64)]
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
4
Multiple string matching
5
Trie
Construct the trie of GTATGTA, GTAT, TAATA, GTGTA
9
Trie
Construct the trie of GTATGTA, GTAT, TAATA, GTGTA
[Figure: the completed trie of the four patterns]
What is the cost?
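
One way to approach the cost question is to build the trie and count. The sketch below is a minimal Python version, assuming one dictionary of outgoing edges per node; the names build_trie, children and terminal are illustrative and do not come from the slides. Each symbol of each pattern is visited once, so under that assumption the construction time is proportional to the total length of the patterns.

```python
def build_trie(patterns):
    """Build a trie; node 0 is the root.

    children[state][symbol] is the child reached by that symbol and
    terminal[state] lists the patterns that end exactly at that node.
    """
    children = [{}]
    terminal = [[]]
    for p in patterns:
        state = 0
        for c in p:
            if c not in children[state]:
                # Create a new node only when the edge is missing.
                children.append({})
                terminal.append([])
                children[state][c] = len(children) - 1
            state = children[state][c]
        terminal[state].append(p)
    return children, terminal


patterns = ["GTATGTA", "GTAT", "TAATA", "GTGTA"]
children, terminal = build_trie(patterns)
print(len(children))  # 16 nodes: the root plus 15 edges
```

GTAT adds no node because it is a prefix of GTATGTA; it only marks an existing node as terminal.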
10
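Set Horspool algorithm
  • How is the comparison made?

By suffixes: the text is read backwards from the end of the window through the trie of all the reversed patterns
  • What is the next position of the window?

Let a be the last symbol of the window. We shift until a is aligned with the first a in the trie at a depth no greater than lmin, or we shift by lmin if there is none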
11
Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
2. Determine lmin
14
Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
2. Determine lmin = 4
4. Find the patterns
15
Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
text: ACATGCTATGTGACA
22
Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
text: ACATGCTATGTGACA

Is the expected length of the shifts related to the number of patterns?
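
The search on this example can be sketched in Python as follows. This is a minimal reading of the steps above, not the presenter's code: the function name set_horspool and the dictionary-based trie are assumptions, and the shift table is computed as the smallest distance between a symbol and the end of any pattern, capped at lmin. Since each table entry is a minimum over all the patterns, adding patterns can only keep or shorten the shifts, which bears on the question above.

```python
def set_horspool(patterns, text):
    """Return sorted (start, pattern) pairs for every occurrence in text."""
    lmin = min(len(p) for p in patterns)

    # Trie of the reversed patterns; terminal[state] lists patterns ending there.
    children, terminal = [{}], [[]]
    for p in patterns:
        state = 0
        for c in reversed(p):
            if c not in children[state]:
                children.append({})
                terminal.append([])
                children[state][c] = len(children) - 1
            state = children[state][c]
        terminal[state].append(p)

    # shift[c]: smallest distance (1 .. lmin-1) from the last symbol of a
    # pattern back to an occurrence of c; symbols not listed shift by lmin.
    shift = {}
    for p in patterns:
        for j in range(1, lmin):
            c = p[len(p) - 1 - j]
            shift[c] = min(shift.get(c, lmin), j)

    matches = []
    pos = lmin - 1                      # index of the last symbol of the window
    while pos < len(text):
        state, i = 0, pos
        # Read the text backwards through the trie of reversed patterns.
        while i >= 0 and text[i] in children[state]:
            state = children[state][text[i]]
            i -= 1
            for p in terminal[state]:   # p ends at pos and starts at i + 1
                matches.append((i + 1, p))
        pos += shift.get(text[pos], lmin)
    return sorted(matches)


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(set_horspool(patterns, "ACATGCTATGTGACA"))
# [(6, 'TATG'), (7, 'ATGTG')]
```

Positions are 0-based. For these patterns the table gives a shift of 1 for A and T, 2 for G, and lmin = 4 for C.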
23
Set Horspool algorithm → Wu-Manber algorithm
How can the length of the shifts be increased?
By reading blocks of symbols instead of only one!
Given ATGTATG, TATG, ATAAT, ATGTG
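
To see why blocks help, the sketch below builds a shift table indexed by blocks and compares the mean advance for blocks of one and two symbols. Several things here are assumptions rather than content of the slides: the table is built from the first lmin symbols of each pattern (the usual Wu-Manber convention, as far as I know), the text is taken to be uniformly random over the alphabet ACGT, a shift of 0 is counted as an advance of one position after verification, and the names block_shift_table and mean_advance are illustrative.

```python
from itertools import product


def block_shift_table(patterns, block):
    """Shift for each block seen in the first lmin symbols of the patterns."""
    lmin = min(len(p) for p in patterns)
    default = lmin - block + 1          # shift for blocks that never occur
    table = {}
    for p in patterns:
        for end in range(block, lmin + 1):
            b = p[end - block:end]
            # Distance from the end of this block to position lmin.
            table[b] = min(table.get(b, default), lmin - end)
    return table, default


def mean_advance(patterns, block, alphabet="ACGT"):
    """Average window advance if every block is equally likely (assumption)."""
    table, default = block_shift_table(patterns, block)
    blocks = ["".join(t) for t in product(alphabet, repeat=block)]
    # A zero shift means "verify here, then advance by one position".
    return sum(max(table.get(b, default), 1) for b in blocks) / len(blocks)


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(block_shift_table(patterns, 2))
# ({'AT': 1, 'TG': 0, 'GT': 0, 'TA': 1, 'AA': 0}, 3)
print(mean_advance(patterns, 1), mean_advance(patterns, 2))
# 1.75 2.375
```

With single symbols, three of the four letters already have shift 0, while with blocks of two symbols only 5 of the 16 possible blocks appear in the table and the remaining 11 shift by 3.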
28
Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
text: ACATGCTATGTGACATAATA
31
Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
text: ACATGCTATGTGACATAATA
But given k patterns, how many symbols should we take?

$\log_{|\Sigma|}(2 \cdot lmin \cdot k)$
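
A minimal Python sketch of the whole search, under stated assumptions: the block length is the formula above rounded down to an integer (the slide does not say how to round), the tables are built from the first lmin symbols of each pattern, and candidates are verified by direct comparison instead of a second hash. The names block_length and wu_manber are illustrative.

```python
def block_length(k, lmin, alphabet_size):
    """Largest B with alphabet_size**B <= 2*lmin*k, kept within 1..lmin."""
    b = 1
    while b < lmin and alphabet_size ** (b + 1) <= 2 * lmin * k:
        b += 1
    return b


def wu_manber(patterns, text, alphabet_size=4):
    """Return sorted (start, pattern) pairs for every occurrence in text."""
    lmin = min(len(p) for p in patterns)
    block = block_length(len(patterns), lmin, alphabet_size)
    default = lmin - block + 1

    shift, bucket = {}, {}
    for p in patterns:
        for end in range(block, lmin + 1):
            b = p[end - block:end]
            shift[b] = min(shift.get(b, default), lmin - end)
        # Patterns grouped by the block that ends their first lmin symbols.
        bucket.setdefault(p[lmin - block:lmin], []).append(p)

    matches = []
    pos = lmin                           # the window is text[pos - lmin:pos]
    while pos <= len(text):
        b = text[pos - block:pos]
        s = shift.get(b, default)
        if s == 0:
            start = pos - lmin
            for p in bucket.get(b, []):  # verify each candidate pattern
                if text.startswith(p, start):
                    matches.append((start, p))
            s = 1
        pos += s
    return sorted(matches)


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(block_length(len(patterns), 4, 4))   # 2, since 4**2 <= 32 < 4**3
print(wu_manber(patterns, "ACATGCTATGTGACATAATA"))
# [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
```

For this example k = 4, lmin = 4 and |Σ| = 4, so the formula gives log_4(32) = 2.5 and the sketch reads blocks of 2 symbols.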
32
Multiple string matching
33
BOM algorithm (Backward Oracle Matching)
The next position of the window is determined by the last character of the text with a transition in the automaton
34
Factor Oracle of k strings
How can we build the Factor Oracle of GTATGTA, GTAA, TAATA and GTGTA?
[Figure: the factor oracle of the four strings, with terminal states labelled 1,4, 3 and 2]
36
Factor Oracle of k strings
Given the Factor Oracle of GTATGTA
[Figure, slides 36 to 41: the factor oracle of GTATGTA, drawn one step at a time]
we insert GTAA
42
Factor Oracle of k strings
inserting GTAA
[Figure: the oracle while GTAA is inserted; terminal state 1]
43
Factor Oracle of k strings
Given the AFO of GTATGTA and GTAA
[Figure: the oracle with terminal states 1 and 2]
we insert TAATA
44
Factor Oracle of k strings
inserting TAATA
[Figure: the oracle while TAATA is inserted; terminal states 1 and 2]
45
Factor Oracle of k strings
Given the AFO of GTATGTA, GTAA and TAATA
[Figure: the oracle with terminal states 1, 3 and 2]
we insert GTGTA
46
Factor Oracle of k strings
inserting GTGTA
[Figure: the oracle while GTGTA is inserted; terminal states 1, 3 and 2]
47
Factor Oracle of k strings
[Figure: the final oracle, with terminal states 1,4, 3 and 2]
This is the Automata Factor Oracle (AFO) of GTATGTA, GTAA, TAATA and GTGTA
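
For comparison, here is a minimal Python sketch of a factor oracle for a set of strings. It is not the string-by-string insertion drawn on the slides: it assumes the trie-based construction, in which the states of the trie are visited breadth-first and each state gets a supply link, so GTATGTA and GTGTA keep separate terminal states instead of sharing the state labelled 1,4. The names build_factor_oracle, accepts, supply and terminal are illustrative.

```python
from collections import deque


def build_factor_oracle(strings):
    """Return (delta, terminal); delta[state][symbol] is the next state."""
    # Step 1: trie of the strings, remembering parent and incoming label.
    delta, parent, label, terminal = [{}], [None], [None], [[]]
    for number, s in enumerate(strings, start=1):
        state = 0
        for c in s:
            if c not in delta[state]:
                delta.append({})
                parent.append(state)
                label.append(c)
                terminal.append([])
                delta[state][c] = len(delta) - 1
            state = delta[state][c]
        terminal[state].append(number)

    # Step 2: breadth-first order, fixed before any external edge is added.
    order, queue = [], deque([0])
    while queue:
        q = queue.popleft()
        order.extend(delta[q].values())
        queue.extend(delta[q].values())

    # Step 3: follow supply links, adding external transitions on the way.
    supply = [None] * len(delta)        # the root has no supply state
    for cur in order:
        c, down = label[cur], supply[parent[cur]]
        while down is not None and c not in delta[down]:
            delta[down][c] = cur
            down = supply[down]
        supply[cur] = 0 if down is None else delta[down][c]
    return delta, terminal


def accepts(delta, s):
    """True if s can be read from the initial state."""
    state = 0
    for c in s:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return True


strings = ["GTATGTA", "GTAA", "TAATA", "GTGTA"]
delta, terminal = build_factor_oracle(strings)
print(len(delta))                # 17 states: one per trie node
print(accepts(delta, "ATGTA"))   # True: a factor of GTATGTA
print(accepts(delta, "GG"))      # False
```

The automaton reads every factor of every string, which is the property the backward search relies on. It may also read a few strings that are not factors, which is why candidate occurrences still have to be verified.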
48
SBOM algorithm
The next position of the window is determined by the last character of the text with a transition in the automaton
49
SBOM algorithm example
We search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG;
then we build the Automata Factor Oracle of GTATG, GTAAT, TAATA and GTGTA, of length lmin = 5
[Figure: the factor oracle, with terminal states labelled 1, 4, 2 and 3]
50
SBOM algorithm example
Search for ATGTATG, TAATG, TAATAAT and AATGTG
[Figure: the factor oracle, with terminal states labelled 1, 4, 2 and 3]
text: ACATGCTAGCTATAATAATGTATG
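
The search on this example can be sketched in Python as below. Two assumptions are worth stating. First, judging from the strings GTATG, GTAAT, TAATA and GTGTA listed earlier, the oracle is built over the reversed last lmin symbols of each pattern, so a window that is read completely marks a possible end of a pattern. Second, the oracle uses a trie-based construction with supply links, which may differ from the drawing on the slides. The name sbom and the variable names are illustrative.

```python
from collections import deque


def sbom(patterns, text):
    """Return sorted (start, pattern) pairs for every occurrence in text."""
    lmin = min(len(p) for p in patterns)

    # Trie of the reversed length-lmin suffixes; owners[state] lists patterns.
    delta, parent, label, owners = [{}], [None], [None], [[]]
    for p in patterns:
        state = 0
        for c in reversed(p[-lmin:]):
            if c not in delta[state]:
                delta.append({})
                parent.append(state)
                label.append(c)
                owners.append([])
                delta[state][c] = len(delta) - 1
            state = delta[state][c]
        owners[state].append(p)

    # Turn the trie into a factor oracle (breadth-first, with supply links).
    order, queue = [], deque([0])
    while queue:
        q = queue.popleft()
        order.extend(delta[q].values())
        queue.extend(delta[q].values())
    supply = [None] * len(delta)
    for cur in order:
        c, down = label[cur], supply[parent[cur]]
        while down is not None and c not in delta[down]:
            delta[down][c] = cur
            down = supply[down]
        supply[cur] = 0 if down is None else delta[down][c]

    matches, pos = [], 0                 # the window is text[pos:pos + lmin]
    while pos + lmin <= len(text):
        state, i = 0, pos + lmin - 1
        while i >= pos and text[i] in delta[state]:
            state = delta[state][text[i]]
            i -= 1
        if i < pos:                      # whole window read: verify candidates
            end = pos + lmin
            for p in owners[state]:
                if end >= len(p) and text[end - len(p):end] == p:
                    matches.append((end - len(p), p))
            pos += 1
        else:                            # text[i] has no transition: jump past it
            pos = i + 1
    return sorted(matches)


patterns = ["ATGTATG", "TAATG", "TAATAAT", "AATGTG"]
print(sbom(patterns, "ACATGCTAGCTATAATAATGTATG"))
# [(12, 'TAATAAT'), (15, 'TAATG'), (17, 'ATGTATG')]
```

Positions are 0-based. On this text the first window, ACATG, fails on C after reading GTA backwards, so the window jumps two positions at once; AATGTG never occurs.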
58
Multiple string matching