Title: Online Approximate String Matching with Bounded Errors
1 On-line Approximate String Matching with Bounded
Errors
- Marcos Kiwi, Gonzalo Navarro and Claudio Telha
- Universidad de Chile
CPM 08
2 Outline
The Approximate String Matching (ASM) problem,
applications, and state of the art
The ASM with bounded errors problem, and an
algorithm that solves it
Complexity results for the proposed algorithms
Open questions and final comments
3 Introduction
The Approximate String Matching (ASM) problem,
applications, and state of the art
4 Approximate String Matching Problem (ASM)
- Find the positions where a pattern P appears
approximately in a text T
Example (allowing one mismatch)
Pattern: ACT
Text: ACTGATAACGTTAG
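The example on this slide can be reproduced with a brute-force scan. This is only a minimal sketch of the problem statement under the one-mismatch (substitutions only) model used in the example, not one of the algorithms discussed in the talk; the function name `find_with_mismatches` is an assumption made for illustration.

```python
def find_with_mismatches(pattern, text, max_mismatches):
    """Return the 0-based start positions where `pattern` matches a
    same-length substring of `text` with at most `max_mismatches`
    substituted characters (brute force, O(n*m) time)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        candidate = text[start:start + m]
        mismatches = sum(1 for a, b in zip(pattern, candidate) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


print(find_with_mismatches("ACT", "ACTGATAACGTTAG", 1))  # [0, 7]
```

Position 0 is an exact match (ACT) and position 7 matches with one mismatch (ACG).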
5 Sequence alignment of DNA and proteins
Search Engines
Unreliable transmission of data
Data Mining
Information Retrieval
6 State of the art in ASM
- There are algorithms that are theoretically optimal
and simultaneously efficient in practice
When the approximation measure is edit distance:
1985: Optimal in the worst case, but space is
exponential in the length of P
1994: Optimal on average, less practical (Chang &
Marr)
2004: Optimal on average, practical (Fredriksson &
Navarro)
Average: characters of P and T are chosen uniformly
and independently from an alphabet Σ.
Most of the key ASM algorithmic questions have
been solved
7 Our method
The ASM with bounded errors problem, and an
algorithm that solves it
8 Our proposal: ASM with Bounded Errors
- Break the average-case lower bound by allowing an
additional source of error
Not a bad deal: ASM is already an approximate model
ASM with bounded errors (ASM-BE)
Algorithms for ASM-BE are allowed to miss each
occurrence with probability at most ε
New parameter: error threshold 0 < ε < 1
9 Typical execution of ASM-BE algorithms
ASM-BE algorithms are allowed to miss each
occurrence with probability at most ε
ASM algorithm (one mismatch allowed)
Pattern: ACT   Text: ACTGATAACGTTAG
ASM-BE algorithm, first execution
Pattern: ACT   Text: ACTGATAACGTTAG
10 Typical execution of ASM-BE algorithms
ASM-BE algorithms are allowed to miss each
occurrence with probability at most ε
ASM algorithm (one mismatch allowed)
Pattern: ACT   Text: ACTGATAACGTTAG
ASM-BE algorithm, second execution
Pattern: ACT   Text: ACTGATAACGTTAG
11 Typical execution of ASM-BE algorithms
ASM-BE algorithms are allowed to miss each
occurrence with probability at most ε
ASM algorithm (one mismatch allowed)
Pattern: ACT   Text: ACTGATAACGTTAG
ASM-BE algorithm, third execution
Pattern: ACT   Text: ACTGATAACGTTAG
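The behaviour across executions can be mimicked with a toy simulation. The sketch below is not the algorithm of the talk: it simply takes the exact answer and drops each occurrence independently with probability ε, which is one (assumed) way to satisfy the ASM-BE guarantee. The names `exact_matches` and `toy_asm_be`, the seed, and the value ε = 0.3 are all hypothetical.

```python
import random


def exact_matches(pattern, text, max_mismatches):
    """Exact ASM answer under the mismatch model (brute force)."""
    m = len(pattern)
    return [s for s in range(len(text) - m + 1)
            if sum(a != b for a, b in zip(pattern, text[s:s + m])) <= max_mismatches]


def toy_asm_be(pattern, text, max_mismatches, epsilon, rng):
    """Toy model only: report each true occurrence unless a coin flip
    with probability `epsilon` says to miss it. No false positives."""
    return [s for s in exact_matches(pattern, text, max_mismatches)
            if rng.random() >= epsilon]


rng = random.Random(2008)  # arbitrary seed, chosen for repeatability
for run in (1, 2, 3):
    # Different runs may report different subsets of the true occurrences.
    print("execution", run, toy_asm_be("ACT", "ACTGATAACGTTAG", 1, 0.3, rng))
```

Every reported position is a true occurrence; only the set of missed occurrences varies between runs.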
12 Main Contributions
- Break the average-case lower bound by allowing an
additional source of error
Theoretically interesting, and potentially
practical
Not a bad deal, since ASM is already an approximate
model
Novel approach for ASM
New framework for ASM, with room for improvement
Non-exact ASM algorithm with a probabilistic
guarantee
Easy to extend to other distances and to
multiple-pattern ASM
13 Some notation and facts
Edit distance
Minimum number of insertions, deletions and
substitutions needed to make two strings equal
edit(fellow, follows) = 2 (fellow → follow →
follows)
n is the length of the text T
m is the length of the pattern P
k is the maximum edit distance we allow in a match
σ is the size of the alphabet Σ
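The edit distance defined above can be computed with the classic dynamic-programming recurrence. This is a minimal sketch for checking the example, assuming unit cost for every insertion, deletion and substitution; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    that turn string `a` into string `b` (O(|a|*|b|) time)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[len(b)]


print(edit_distance("fellow", "follows"))  # 2
```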
14 Some notation and facts
ASM results (worst-case time)
Lower bound: Ω(n)
Achievable with automata, but space is
exponential in m
Complexity is not known when polynomial space is
required
ASM results (average-case time)
Lower bound: Ω((k + log_σ m) n / m)
Achievable with filtering algorithms
From a practical perspective, the average case is
a better measure
This is the on-line setting: the text cannot be
preprocessed
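To get a feel for the average-case bound, the expression (k + log_σ m) n / m can be evaluated for concrete parameters. The values below are hypothetical, and asymptotic notation hides constant factors, so the result only indicates the order of magnitude relative to n.

```python
import math


def average_case_bound(n, m, k, sigma):
    """Evaluate (k + log_sigma(m)) * n / m, ignoring constant factors."""
    return (k + math.log(m, sigma)) * n / m


# Hypothetical parameters: DNA alphabet, pattern of length 64, 4 errors.
n, m, k, sigma = 1_000_000, 64, 4, 4
print(average_case_bound(n, m, k, sigma))  # about 109,375, well below n
```

With these values the bound is roughly a ninth of n, which is why the average case is sublinear in the text length when k is small relative to m.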
15 Building an ASM-BE algorithm
- Filter algorithms for ASM usually lead to an
ASM-BE algorithm
Conceptually, an ASM filter is a two-stage process
Intuitively, filters work well on average because in
the average case it is easy to tell that a pattern
does not appear
16 How a filter works
- First, the filtering sub-process discards areas
where the pattern cannot appear
[Figure: the filtering process applied to the text]
The filtering sub-process is faster than any
traditional ASM algorithm
The pattern cannot appear in discarded areas
17 How a filter works
- Second, the verification sub-process finds the
matches in non-discarded areas
[Figure: the verification process searching for P in
the non-discarded areas of the text]
The verification sub-process can be any
non-filtering algorithm
Actually, the filtering and verification processes
run in parallel
18 An ASM filter: the q-grams algorithm
More definitions
A q-gram is a string of length q
A window is a substring of the text (usually of
size O(m))
The q-grams of a string S are all the
|S| - q + 1 substrings of S of length q, counted
with multiplicity
In 1992, Ukkonen used q-grams as the basis of a
filter algorithm
Its running time is O(n) on average
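The q-gram definition can be made concrete with a multiset. This is a small sketch using Python's `collections.Counter`; the helper name `qgrams` and the sample string are assumptions for illustration.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of the len(s) - q + 1 substrings of s of length q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


grams = qgrams("ACTGACT", 3)
print(sum(grams.values()))  # 5, i.e. |S| - q + 1 with |S| = 7 and q = 3
print(grams["ACT"])         # 2, because multiplicity counts
```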
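19 An ASM filter: the q-grams algorithm
Key ideas of the algorithm
The pattern and an occurrence always share at least
(m - q + 1 - kq) q-grams
On average, any window and the pattern share
fewer q-grams
Why kq? The pattern has m - q + 1 q-grams, and each
of the at most k errors affects at most q of them
[Figure: an occurrence, its q-grams, and the q-grams
affected by errors]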
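The counting argument can be checked on a small case. The sketch below counts shared q-grams as a multiset intersection and compares the count with the bound m - q + 1 - kq; the pattern, the occurrence (one deleted character, so k = 1) and the helper names are hypothetical examples, not taken from the talk.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of all substrings of s of length q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a, b, q):
    """Number of q-grams common to a and b, counted with multiplicity."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


pattern, occurrence = "approximate", "aproximate"  # one deletion: k = 1
k, q = 1, 2
bound = len(pattern) - q + 1 - k * q               # 11 - 2 + 1 - 2 = 8
print(shared_qgrams(pattern, occurrence, q), ">=", bound)  # 9 >= 8
```

The single deletion destroys only the q-gram "pp", so 9 of the pattern's 10 q-grams survive, which is above the guaranteed 8.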
20 An ASM filter: the q-grams algorithm
The q-grams algorithm
The filter process counts the q-grams shared with the
pattern, for every window of size m - k
The verification process is only executed to find
occurrences containing those windows with at least
m - q + 1 - kq shared q-grams
Average-case analysis
The filter process can be implemented in O(n) time
(naive: O(nq))
The verification process takes O(m²) per verified
window
Windows are verified with probability O(1/m²)
when q = 2 log_σ m
Total time: O(n)
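The total follows from adding the filtering cost to the expected verification cost. The derivation below is a sketch that assumes at most n windows are examined and uses only the per-window figures stated on this slide.

```latex
\begin{align*}
\mathbb{E}[\text{total time}]
  &= \underbrace{O(n)}_{\text{filtering}}
   + \underbrace{O(n)}_{\text{number of windows}} \cdot
     \underbrace{O(1/m^{2})}_{\Pr[\text{window is verified}]} \cdot
     \underbrace{O(m^{2})}_{\text{cost of one verification}} \\
  &= O(n) + O(n) \cdot O(1) \\
  &= O(n).
\end{align*}
```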
21 Building an ASM-BE algorithm
- Filter algorithms for ASM usually lead to an
ASM-BE algorithm
Conceptually, an ASM-BE filter is a two-stage
process with a probabilistic filter: the pattern
probably does not appear in discarded areas
Essentially, the only difference is that the
filter can fail and discard some occurrences
22 A small change to the q-grams algorithm
Windows are now disjoint and their length is (m - k)/2
Any window included in an occurrence shares at least
(m - k)/2 - q + 1 - kq q-grams with the pattern
Any occurrence necessarily includes at least one
window
[Figure: the text split into disjoint windows (current
window, next window). The q-grams filter either
discards the current window and goes to the next one,
or sends it to the verification process.]
23 A small change to the q-grams algorithm
The modified q-grams algorithm (still ASM
oriented)
Partition the text into disjoint windows of size
(m - k)/2
The filter counts the q-grams shared with the pattern
per window
The verification process finds any occurrence
containing windows with at least
(m - k)/2 - q + 1 - kq shared q-grams
The average-case analysis remains the same
The filter process still takes O(n) time
The verification process still takes O(m²) per
verified window
Windows are verified with probability O(1/m²)
when q = 2 log_σ m
Total time: O(n)
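The disjoint-window filter can be sketched as follows. Several choices here are assumptions, not details from the talk: the window length is rounded down to floor((m - k)/2), a window q-gram counts as shared when it occurs anywhere in the pattern, verification uses the classic O(mn) dynamic programming (Sellers' algorithm) on the text area around the window, and occurrences are reported by their end positions. The names `sellers_ends` and `qgram_filter_search` and the sample strings are hypothetical.

```python
def sellers_ends(pattern, text, k):
    """End positions j such that some substring of `text` ending at j
    is within edit distance k of `pattern` (column-wise DP)."""
    m = len(pattern)
    col = list(range(m + 1))       # col[i]: best distance for pattern[:i]
    ends = []
    for j, c in enumerate(text):
        diag = col[0]              # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != c), old + 1, col[i - 1] + 1)
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def qgram_filter_search(pattern, text, k, q):
    """Filter disjoint windows by q-gram count, then verify survivors."""
    m, n = len(pattern), len(text)
    w = (m - k) // 2               # assumed rounding of (m - k)/2
    if w < 1:
        raise ValueError("need m - k >= 2")
    pattern_grams = {pattern[i:i + q] for i in range(m - q + 1)}
    threshold = w - q + 1 - k * q
    ends = set()
    for start in range(0, n - w + 1, w):
        window = text[start:start + w]
        shared = sum(window[i:i + q] in pattern_grams
                     for i in range(w - q + 1))
        if shared < threshold:
            continue               # discard: no occurrence contains this window
        # Any occurrence containing the window lies inside [lo, hi).
        lo = max(0, start + w - (m + k))
        hi = min(n, start + m + k)
        ends.update(lo + e for e in sellers_ends(pattern, text[lo:hi], k))
    return sorted(ends)


pattern = "approximatematching"
text = "z" * 30 + "approximatemaching" + "z" * 30   # one deleted character
print(qgram_filter_search(pattern, text, k=1, q=3))
```

On this input only the windows overlapping the planted occurrence pass the filter, and the reported end positions coincide with those of running `sellers_ends` on the whole text.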
24 The ASM-BE q-grams algorithm
- We approximate the q-grams counting by using
random sampling
The q-grams algorithm for ASM-BE
Partition the text in disjoint windows of size
(m-k)/2
Filter estimates the fraction of the q-grams
shared with the pattern per window
Verification process finds any occurrence
containing windows with an estimated fraction of
shared q-grams at least ?
Fraction of shared q-grams is estimated by
randomly choosing c q-grams of the window and
obtaining the fraction of shared q-grams on this
subset.
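The sampling step can be sketched as below. The talk does not say whether the c q-grams are drawn with or without replacement, so sampling positions with replacement is an assumption here, as are the function name `estimate_shared_fraction`, the seed, and the sample strings. A window would then be verified when the estimate reaches the chosen threshold.

```python
import random


def estimate_shared_fraction(window, pattern_grams, q, c, rng):
    """Estimate the fraction of the window's q-grams that also occur in
    the pattern by inspecting c randomly chosen q-gram positions."""
    positions = len(window) - q + 1
    if positions <= 0 or c <= 0:
        return 0.0
    hits = 0
    for _ in range(c):
        i = rng.randrange(positions)          # assumed: with replacement
        hits += window[i:i + q] in pattern_grams
    return hits / c


pattern, q = "approximatematching", 3
pattern_grams = {pattern[i:i + q] for i in range(len(pattern) - q + 1)}
rng = random.Random(2008)                     # arbitrary seed
print(estimate_shared_fraction("proximate", pattern_grams, q, 4, rng))
print(estimate_shared_fraction("prozimate", pattern_grams, q, 4, rng))
```

Only c q-grams are inspected per window instead of all of them, which is where the time saving comes from; the price is that a window belonging to an occurrence can occasionally be underestimated and discarded.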
25 Results
Complexity results of the proposed algorithms.
26 The ASM-BE algorithms
Complexity of ASM is O((k + log_σ m) n / m)
27 Conclusion
Open questions and future work
28 What did we do?
- We just scratched the surface of this new area
Introduced the framework of ASM with Bounded
Errors
Proved the existence of natural algorithms for
this framework that are easy to extend to
related problems (multiple-pattern string
matching, for example)
Showed that it is possible to break the lower
bound for ASM in this framework, with a
reasonable error probability
29 So far the error ε is mainly a dependent
parameter. Is there a natural way to introduce
this parameter in the algorithms?
What are the best possible tradeoffs (error vs.
time) that we can achieve?