Title: Bioinformatics Algorithms and Data Structures
1 Bioinformatics Algorithms and Data Structures
- Chapter 12.2.4 k-difference Inexact Matching
- Lecturer Dr. Rose
- Slides by Dr. Rose
- February 15, 2007
2 Overview
- k-difference inexact matching
- Concepts
- d-path
- Farthest-reaching d-path in a diagonal
- O(km) time and space solution
- Primer selection problem
- Formulations
- Exact matching primer
- Inexact matching primer
- k-difference primer
- O(km) time solution to k-difference primer problem
3 Overview
- Exclusion methods: fast expected time O(m)
- Partition approaches
- BYP algorithm
- Aho-Corasick exact matching algorithm
- Keyword trees
- Back to Aho-Corasick exact matching algorithm
- Algorithm for computing failure links
- Back to BYP algorithm
4 K-difference Inexact Matching
- Like the k-mismatch problem, it allows mismatches
- Harder than k-mismatch
- It also allows spaces
- End spaces in T are not counted
- |P| and |T| can be vastly different
- ⇒ we can't focus on a band of width 2k+1 centered around the main diagonal.
5 K-difference Inexact Matching
- Defn (with n = |P| and m = |T|)
- Diagonals above the main diagonal are numbered 1 through m. Diagonal i starts in cell (0,i).
- Diagonals below the main diagonal are numbered -1 through -n. Diagonal -i starts in cell (i,0).
- Row 0 is initialized to be all zeros.
- Recall: T can have free end spaces
- Setting row 0 to zeros allows the left end of T to start after a gap without any cost.
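The effect of the zero row can be seen in a plain dynamic-programming table. The following is a minimal sketch of that baseline, not the O(km) method developed on the next slides; the function name `k_difference_ends` is made up for illustration, and end positions are reported 1-based.

```python
def k_difference_ends(P, T, k):
    """Return the 1-based end positions in T of occurrences of P
    with at most k differences (plain O(nm) dynamic programming)."""
    n, m = len(P), len(T)
    # Row 0 is all zeros: an occurrence may start anywhere in T at no cost.
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        # Column 0: i characters of P aligned against nothing costs i.
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j - 1] + (P[i - 1] != T[j - 1]),  # match or mismatch
                prev[j] + 1,                           # vertical edge: space in T
                cur[j - 1] + 1,                        # horizontal edge: space in P
            )
        prev = cur
    # Row n: every column with value <= k ends an approximate occurrence.
    return [j for j in range(1, m + 1) if prev[j] <= k]
```

For instance, `k_difference_ends("abc", "xxabcxx", 0)` reports the single end position 5, even though the occurrence starts two characters into T.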
6 K-difference Inexact Matching
- Defn: a d-path is a path that starts in row 0 and specifies exactly d mismatches and spaces.
- Defn: a d-path is farthest-reaching in diagonal i if it ends in diagonal i and the index of its ending column c is ≥ the ending column of any other d-path ending in diagonal i.
- You can visualize this as the d-path that ends farthest along diagonal i.
7 K-difference Inexact Matching
- Approach
- Iterate (1 ≤ d ≤ k)
- Find the farthest-reaching d-path for each diagonal i, (-n ≤ i ≤ m)
- The farthest-reaching d-path for diagonal i is found from the farthest-reaching (d-1)-paths on diagonals i-1, i and i+1.
- Observation: any d-path reaching row n corresponds to a d-difference occurrence of P in T.
8 K-difference Inexact Matching
- Observation: a farthest-reaching 0-path in diagonal i is the longest match of T[i..m] and P[1..n].
- Q: Why is this true?
- A: A 0-path means an exact match ⇒ no deviation from the diagonal that you start on.
- Using suffix trees
- Build the suffix tree in linear time (linear in m).
- Retrieve farthest-reaching 0-paths in constant time per path.
9 K-difference Inexact Matching
- Q: How do we find the farthest-reaching d-path on diagonal i for d > 0?
- A: The d-path for diagonal i depends on the previously found (d-1)-paths on diagonals i-1, i and i+1.
- The 3 cases are
- Path R1, the farthest-reaching (d-1)-path on diagonal i+1, followed by a vertical edge to diagonal i.
10 K-difference Inexact Matching
- Since R1 is a (d-1)-path on diagonal i+1, extending it by a vertical edge (adding a space in T) to diagonal i makes it a d-path on diagonal i.
11 K-difference Inexact Matching
- The 2nd case is
- Path R2, the farthest-reaching (d-1)-path on diagonal i-1, followed by a horizontal edge to diagonal i.
- Again extending a (d-1)-path into a d-path on diagonal i.
12 K-difference Inexact Matching
- The 3rd case is
- Path R3, the farthest-reaching (d-1)-path on diagonal i, followed by a diagonal edge corresponding to a mismatch.
- Again extending a (d-1)-path into a d-path on diagonal i.
13 K-difference Inexact Matching
- R1, R2, and R3 are initially the farthest-reaching (d-1)-paths on diagonals i+1, i-1, and i, respectively.
- Each is extended by a space or a mismatch, resulting in a d-path on diagonal i.
- Each is subsequently extended along diagonal i.
- The farthest-reaching d-path on diagonal i must be one of these.
14 k-differences Algorithm
- d = 0
- /* Calculate farthest-reaching 0-paths on diagonals 0 through m */
- For i = 0 to m
- Find the longest common extension between P[1..n] and T[i..m]
- /* Calculate d-paths by extending (d-1)-paths R1, R2, and R3 */
- For d = 1 to k
- For i = -n to m
- Extend (d-1)-paths R1, R2, R3 on diagonals i+1, i-1, i to diagonal i.
- One of these is the farthest-reaching d-path on diagonal i.
- A path reaching row n defines an inexact match of P in T containing at most k differences. The column in row n indicates the end character in T.
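The loop structure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the longest common extension is computed by a naive character scan (`lce`, a hypothetical helper) rather than by the suffix-tree lookup the slides rely on, so it does not achieve the O(km) bound. Each diagonal stores the row where its current path ends, and a path is allowed to carry over unchanged from d-1 to d, so the stored value means "at most d differences". All names are made up for illustration.

```python
def lce(P, T, r, c):
    """Longest common extension of P[r:] and T[c:] (0-based), found naively."""
    length = 0
    while (r + length < len(P) and c + length < len(T)
           and P[r + length] == T[c + length]):
        length += 1
    return length


def k_differences(P, T, k):
    """Map each 1-based end position in T to the fewest differences (<= k)
    with which P matches a substring of T ending there."""
    n, m = len(P), len(T)
    NONE = -1                      # no path known on this diagonal
    off = n                        # diagonal i lives at list index i + off
    reach = [NONE] * (n + m + 1)   # reach[i + off] = last row reached on diagonal i
    for i in range(m + 1):         # farthest-reaching 0-paths, diagonals 0..m
        reach[i + off] = lce(P, T, 0, i)
    ends = {}

    def report(d):
        # A path ending in row n on diagonal i ends at column n + i of T.
        for i in range(-n, m + 1):
            if reach[i + off] == n and 0 < n + i <= m:
                ends.setdefault(n + i, d)

    report(0)
    for d in range(1, k + 1):
        new = [NONE] * (n + m + 1)
        for i in range(-n, m + 1):
            same = reach[i + off]
            best = same                              # carry over: "at most d"
            if same != NONE and same < n and same + i < m:
                best = max(best, same + 1)           # R3: mismatch edge on diagonal i
            if i + 1 <= m:
                above = reach[i + 1 + off]
                if above != NONE and above < n:
                    best = max(best, above + 1)      # R1: vertical edge from diagonal i+1
            if i - 1 >= -n:
                below = reach[i - 1 + off]
                if below != NONE and below + (i - 1) < m:
                    best = max(best, below)          # R2: horizontal edge from diagonal i-1
            if best != NONE:
                best += lce(P, T, best, best + i)    # slide along diagonal i
            new[i + off] = best
        reach = new
        report(d)
    return ends
```

Only one list of n+m+1 row indices is kept per value of d, which is where the space bound on the next slide comes from.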
15 K-difference Inexact Matching
- Space analysis
- For each d and i, we need to store the ending location of the farthest-reaching d-path.
- d ranges from 0 to k.
- There are (n+m) diagonals.
- ⇒ O(k(n+m)) = O(km) space is required, since n ≤ m.
16 K-difference Inexact Matching
- Time analysis
- Constant time to retrieve the 3 (d-1)-paths for a particular d and i.
- ⇒ O(km) for this aspect (like k-differences alignment)
- Correspondingly, O(km) extensions of paths along diagonals.
- Each path extension is a maximal identical substring in P and T, i.e., a longest common extension computation.
- Using a suffix tree, each extension takes only constant time.
- Creating the suffix tree entails linear processing of the strings: O(n+m)
- ⇒ altogether O(n+m+km) = O(km)
17 Primer (Probe) Selection Problem
- Problem: start with two strings α and β (detailed description on pages 178-179).
- Exact matching version: for each j > j0, find the shortest substring γ of α starting at position j such that γ is not a substring of β.
- Can be solved in O(|α| + |β|)
- Not too bad.
- Inexact matching version: given parameter p, for each j > j0, find the shortest substring γ of α starting at position j that has edit distance at least |γ|/p from any substring in β.
18 Primer (Probe) Selection Problem
- Inexact matching version: given parameter p, for each j > j0, find the shortest substring γ of α starting at position j that has edit distance at least |γ|/p from any substring in β.
- Q: How much work is this?
- Find the shortest prefix γ of α with edit distance at least |γ|/p from any substring in β.
- The naïve approach appears daunting.
- Let's look at a less intimidating formulation!
19 Primer (Probe) Selection Problem
- Change |γ|/p to k
- Convert the inexact matching problem to a k-differences problem.
- This works out since, in practice, |γ|/p must fall in a small range for fixed p.
- k-difference primer problem: given parameter k, for each j > j0, find the shortest substring γ of α starting at position j that has edit distance at least k from any substring in β.
20 Primer (Probe) Selection Problem
- Approach
- For each position j in α
- Find the shortest prefix of α[j..n] with edit distance ≥ k from every substring in β.
- Q: How does this compare with the k-differences inexact matching problem?
- A: It is the opposite problem.
- Find matches with at most k differences,
- versus
- Reject matches of prefixes of α[j..n] with substrings of β with fewer than k differences.
21 Primer (Probe) Selection Problem
- Solution
- Use the k-differences algorithm.
- Use α[j..n] in the place of P.
- Use β in the place of T.
- Compute the farthest-reaching d-path, d ≤ k, in each diagonal.
- d-paths, d < k, reaching row n mean no solution at j
- Q: Why?
- A: A d-path, d < k, reaching row n indicates that α[j..n] matches a substring of β with fewer than k differences.
22 Primer (Probe) Selection Problem
- Solution
- Only if no farthest-reaching (k-1)-path reaches row n can there be a primer at position j.
- In particular, if no farthest-reaching (k-1)-path reaches row r < n, then α[j..r] is a primer if r is the smallest row with this property.
- Repeat this approach for every potential starting position j in α.
- Analysis: if |α| = n and |β| = m, then the algorithm takes time O(knm).
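To make the k-difference primer problem concrete, here is a brute-force sketch. It is an assumption-laden illustration, not the O(knm) d-path method described above: for each start j it fills a dynamic-programming table (row 0 all zeros) one row at a time and stops at the first row whose minimum is at least k. The function name and the 0-based indexing are choices made for this example; `a` and `b` stand for the two input strings.

```python
def k_difference_primers(a, b, k):
    """For each 0-based start j in a, return the length of the shortest
    prefix of a[j:] whose edit distance to every substring of b is >= k,
    or None if no prefix qualifies."""
    m = len(b)
    result = []
    for j in range(len(a)):
        row = [0] * (m + 1)          # row 0: zeros, a match may start anywhere in b
        answer = None
        for r in range(1, len(a) - j + 1):
            ch = a[j + r - 1]
            new = [r] + [0] * m      # column 0: distance to the empty substring
            for c in range(1, m + 1):
                new[c] = min(row[c - 1] + (ch != b[c - 1]),  # match or mismatch
                             row[c] + 1,                     # space in b
                             new[c - 1] + 1)                 # space in a
            row = new
            # min(row) is the smallest edit distance between the length-r
            # prefix and any substring of b; it never decreases as r grows.
            if min(row) >= k:
                answer = r
                break
        result.append(answer)
    return result
```

With `a = "abab"`, `b = "ab"` and `k = 1`, the prefixes "a" and "ab" both occur in `b`, so the first primer at position 0 is "aba" (length 3).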
23 Exclusion Methods
- Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference?
- A: On average, yes. (Are we quibbling?)
- We adopt an algorithm whose expected running time is less than Θ(km)
- However, the worst case may not be better than Θ(km)
24 Exclusion Methods
- Partition idea: exclude much of T from the search
- Preliminaries
- Let σ = |Σ|, where Σ is the alphabet used in P and T.
- Let n = |P|, and m = |T|.
- Defn. An approximate occurrence of P is an occurrence with at most k mismatches or differences.
- General partition algorithm: three phases
- Partition phase
- Search phase
- Check phase
25 Exclusion Methods
- Partition phase
- Partition either T or P into r-length regions (depends on the particular algorithm)
- Search phase
- Use exact matching to search T for r-length intervals
- These are potential targets for approximate occurrences of P.
- Eliminate as many intervals as possible.
- Check phase
- Use approximate matching to check for an approximate occurrence of P around each interval surviving from the search phase.
26 BYP Method
- BYP method has O(m) expected running time.
- Partition P into r-length regions, r = ⌊n/(k+1)⌋
- Q: How many r-length regions of P are there?
- A: k+1; there may be an additional short region.
- Suppose there is a match of P and T with at most k differences.
- Q: What can we deduce about the corresponding r-length regions?
- A: There must be at least one r-length interval that exactly matches, since k differences can touch at most k of the k+1 regions.
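The partition step can be sketched in a few lines. This is a minimal illustration assuming n ≥ k+1, so that r ≥ 1; the function name `partition_pattern` is hypothetical.

```python
def partition_pattern(P, k):
    """Split P into its first k+1 regions of length r = floor(n / (k+1)).
    Any leftover short region at the end of P is ignored."""
    r = len(P) // (k + 1)
    if r == 0:
        raise ValueError("pattern too short for k+1 non-empty regions")
    regions = [P[i * r:(i + 1) * r] for i in range(k + 1)]
    return r, regions
```

For example, `partition_pattern("abcdefghij", 2)` gives r = 3 and the regions "abc", "def", "ghi"; the trailing "j" is the additional short region.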
27 BYP Method
- BYP Algorithm
- Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
- Build a keyword tree for the set of patterns 𝒫.
- Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
- ..
- Oops! We haven't talked about keyword trees or Aho-Corasick. Sooooo let's do that now.
28 Keyword Trees (section 3.4)
- Defn. The keyword tree for set P is a rooted directed tree K satisfying
- Each edge is labeled with one character
- Any two edges out of the same node have distinct labels.
- Every pattern Pi in P maps to some node v of K s.t. the path from the root to v spells out Pi
- Every leaf in K is mapped to by some pattern in P.
29 Keyword Trees
- Example from the textbook: P = {potato, poetry, pottery, science, school}
30 Keyword Trees (section 3.4)
- Observation: there is an isomorphic mapping between distinct prefixes of patterns in P and nodes in K.
- Every node corresponds to a prefix of a pattern in P.
- Conversely, every prefix of a pattern maps to a node in K.
31 Keyword Trees (section 3.4)
- If n is the total length of all patterns in P, then we can construct K in O(n), assuming a fixed Σ.
- Let Ki denote the partial keyword tree that encodes patterns P1, ..., Pi of P.
32 Keyword Trees (section 3.4)
- Consider partial keyword tree K1
- It consists of a single path of |P1| edges out of root r.
- Each edge is labeled with one character of P1
- Reading from the root to the leaf spells out P1
- The leaf is labeled 1
33 Keyword Trees (section 3.4)
- Creating K2 from K1
- Find the longest path from the root of K1 that matches a prefix of P2.
- This path ends by
- a) Either exhausting the characters of P2, or
- b) Ending at some existing node v in K1 where no extending match is possible.
- In case a), label the node where the path ends 2.
- In case b), create a new path out of v, labeled by the remaining characters of P2.
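The incremental construction can be sketched as follows. This is a minimal sketch in which the `Node` class and its field names are assumptions made for illustration; each pattern is walked from the root, reusing existing edges and creating new ones only when the match runs out.

```python
class Node:
    def __init__(self):
        self.children = {}    # edge label (one character) -> child Node
        self.pattern = None   # i if the path from the root spells out P_i


def build_keyword_tree(patterns):
    """Insert P_1, P_2, ... one at a time, producing K_1, K_2, ... in turn."""
    root = Node()
    for index, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            if ch not in node.children:      # no extending match: new edge
                node.children[ch] = Node()
            node = node.children[ch]
        node.pattern = index                 # number the node where P_i ends
    return root
```

Each character of each pattern is handled once, which gives the O(n) construction time for a fixed alphabet.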
34 Keyword Trees (section 3.4)
- Example: P1 is potato
- P2 is pot (case a)
- P2 is potty (case b)
35 Keyword Trees (section 3.4)
- Use of keyword trees for matching
- Finding occurrences of patterns in P that occur starting at position l in T
- Starting at the root r in K, follow the unique path that matches a substring of T that starts at l.
- Numbered nodes along this path indicate matched patterns in P that start at position l.
- This takes time proportional to min(n, m)
- Traversing K for each position l in T gives O(nm)
- This can be improved!
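The naive O(nm) search can be sketched like this. It is an illustrative sketch only: the tree is stored in nested dictionaries, the reserved key `"#"` holds the pattern number (so patterns are assumed not to contain `"#"`), and positions are reported 1-based. All names are made up for the example.

```python
def build_tree(patterns):
    """Keyword tree as nested dicts; "#" marks the node where a pattern ends."""
    root = {}
    for index, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node["#"] = index
    return root


def naive_keyword_search(patterns, T):
    """Restart at the root for every position l of T.
    Returns (l, i) pairs: pattern P_i occurs starting at 1-based position l."""
    root = build_tree(patterns)
    hits = []
    for l in range(len(T)):
        node, c = root, l
        # Follow the unique path matching T[l], T[l+1], ...
        while c < len(T) and T[c] != "#" and T[c] in node:
            node = node[T[c]]
            c += 1
            if "#" in node:                  # a numbered node: P_i starts at l
                hits.append((l + 1, node["#"]))
    return hits
```

Every increment of l throws away what was learned about T during the previous walk, which is the waste the failure links are designed to remove.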
36 Keyword Tree Speedup
- Observation: our naïve keyword tree is like the naïve approach to string comparison.
- Every time we increment l, we start all over at the root of K ⇒ O(nm)
- Recall: KMP avoided O(nm) by shifting to get a speedup.
- Q: Is there an analogous operation we can perform in K?
- A: Of course, why else would I ask a rhetorical question?
37 Keyword Tree Speedup
- First, we assume Pi is not a proper substring of Pj for all combinations Pi, Pj in P.
- Next, each node v in K is labeled with the string formed by concatenating the letters from the root to v.
- Defn. Let L(v) denote the label of node v.
- Defn. Let lp(v) denote the length of the longest proper suffix of string L(v) that is a prefix of some pattern in P.
38 Keyword Tree Speedup
- Example: L(v) = potat, lp(v) = 2; the suffix "at" is a prefix of P4.
39 Keyword Tree Speedup
- Note: if α is the lp(v)-length suffix of L(v), then there is a unique node labeled α.
- Example: "at" is the lp(v)-length suffix of L(v); w is the unique node labeled "at".
40 Keyword Tree Speedup
- Defn: For node v of K, let n_v be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0 then n_v is the root of K.
- Defn: The ordered pair (v, n_v) is called a failure link.
- Example: [figure]
41 Aho-Corasick (section 3.4.6)
- Algorithm AC search
- l = 1
- c = 1
- w = root of K
- Repeat
- While there is an edge (w,w') labeled character T(c)
- if w' is numbered by pattern i then
- report that Pi occurs in T starting at position l
- w = w' and c = c + 1
- end while
- w = n_w and l = c - lp(w)
- Until c > m
- Note: if the root fails to match, increment c and run the repeat loop again.
42 Aho-Corasick
When l = 4 there is a match of pot, but the next
position fails. At this point c = 9. The failure
link points to the node labeled "at" and lp(v) = 2.
⇒ l = c - lp(v) = 9 - 2 = 7
43 Computing n_v in Linear Time
- Note: if v is the root r or 1 character away from r, then n_v = r.
- Imagine n_v has been computed for every node that is k or fewer edges from r.
- How can we compute n_v for v, a node k+1 edges from r?
44 Computing n_v in Linear Time
- We are looking for n_v and L(n_v).
- Let v' be the parent of v in K and x the character on the edge connecting them.
- n_v' is known since v' is k edges from r.
- Clearly, L(n_v) must be a suffix of L(n_v') followed by x.
- First check if there is an edge (n_v', w') with label x.
- If so, then n_v = w'.
- O/w L(n_v) is a proper suffix of L(n_v') followed by x.
- Examine the failure link of n_v' (the node n_w for w = n_v') for an outgoing edge labeled x.
- If no joy, keep repeating, finally setting n_v = r if we run out of edges.
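The failure-link computation and the resulting search can be sketched together. This is a simplified illustration under the same assumption as the slides (no pattern is a proper substring of another), so no extra output links are kept; the array-based node layout and all function names are choices made for this example. Nodes are processed breadth-first so that every node k edges from the root is finished before any node k+1 edges away.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links. Node 0 is the root."""
    children, fail, number = [{}], [0], [None]
    for index, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                fail.append(0)
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = index
    # Nodes one edge from the root keep the root as their failure target.
    queue = deque(children[0].values())
    while queue:
        parent = queue.popleft()
        for x, v in children[parent].items():
            w = fail[parent]
            # Walk failure links until some node has an outgoing edge labeled x.
            while w != 0 and x not in children[w]:
                w = fail[w]
            fail[v] = children[w].get(x, 0)   # the root if we run out of edges
            queue.append(v)
    return children, fail, number


def ac_search(patterns, T):
    """Return (l, i) pairs: P_i occurs in T starting at 1-based position l."""
    children, fail, number = build_automaton(patterns)
    hits, w = [], 0
    for c, ch in enumerate(T, start=1):
        while w != 0 and ch not in children[w]:
            w = fail[w]                       # follow failure links, never back up in T
        w = children[w].get(ch, 0)
        if number[w] is not None:
            i = number[w]
            hits.append((c - len(patterns[i - 1]) + 1, i))
    return hits
```

In the test text "xxpotattooxx", the walk reaches the node labeled "potat", fails on the next "t", and jumps by failure link to the node labeled "tat" without rereading any character of T.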
45 BYP Method
- BYP method has O(m) expected running time.
- Partition P into r-length regions, r = ⌊n/(k+1)⌋
- Q: How many r-length regions of P are there?
- A: k+1; there may be an additional short region.
- Suppose there is a match of P and T with at most k differences.
- Q: What can we deduce about the corresponding r-length regions?
- A: There must be at least one r-length interval that exactly matches, since k differences can touch at most k of the k+1 regions.
46 BYP Method
- BYP Algorithm
- Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
- Build a keyword tree for the set of patterns 𝒫.
- Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
- For each i ∈ I use approximate matching to locate end points of approximate occurrences of P in T[i-n-k..i+n+k]
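The three phases can be put together in a short sketch. Two substitutions are made here and should be read as assumptions: the search phase uses Python's `str.find` in place of Aho-Corasick, and the check phase uses a plain dynamic-programming table in place of a faster k-differences routine. Function names are hypothetical, and end positions are 1-based.

```python
def k_difference_ends(P, T, k):
    """1-based end positions in T of occurrences of P with <= k differences."""
    n, m = len(P), len(T)
    prev = [0] * (m + 1)                       # row 0: all zeros
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j - 1] + (P[i - 1] != T[j - 1]),
                         prev[j] + 1,
                         cur[j - 1] + 1)
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] <= k]


def byp_search(P, T, k):
    """Partition, search, check. Returns sorted 1-based end positions."""
    n, m = len(P), len(T)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("pattern too short for k+1 non-empty regions")
    # Partition phase: the first k+1 regions of length r.
    regions = {P[j * r:(j + 1) * r] for j in range(k + 1)}
    # Search phase: exact occurrences of any region (stand-in for Aho-Corasick).
    starts = set()
    for region in regions:
        pos = T.find(region)
        while pos != -1:
            starts.add(pos)                    # 0-based start of an exact hit
            pos = T.find(region, pos + 1)
    # Check phase: approximate matching only inside T[i-n-k .. i+n+k].
    ends = set()
    for i in starts:
        lo, hi = max(0, i - n - k), min(m, i + n + k)
        for e in k_difference_ends(P, T[lo:hi], k):
            ends.add(lo + e)
    return sorted(ends)
```

When the regions are rare in T, the check phase touches only a few short windows, which is the source of the fast expected running time; in the worst case every position survives the search phase and nothing is saved.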