Title: Bioinformatics Algorithms and Data Structures
1Bioinformatics Algorithms and Data Structures
- Chapter 12.3 Exclusion Methods
- Lecturer Dr. Rose
- Slides by Dr. Rose
- February 20, 2003
2Overview
- Exclusion methods fast expected time O(m)
- Partition approaches
- BYP algorithm
- Aho-Corasick exact matching algorithm
- Keyword trees
- Back to Aho-Corasick exact matching algorithm
- Algorithm for computing failure links
- Back to BYP algorithm
3Exclusion Methods
- Q Can we improve on the Q(km) time we have seen
for k-mismatch and k-difference? - A On average, yes. (Are we quibbling?)
- We adopt a fast expected algorithm lt Q(km)
- ? the worst case may not be better than Q(km)
4Exclusion Methods
- Partition Idea exclude much of T from the search
- Preliminaries
- Let a S, where S is the alphabet used in P
and T. - Let n P , and m T .
- Defn. an approximate occurrence of P is an
occurrence with at most k mismatches or
differences. - General Partition algorithm three phases
- Partition phase
- Search Phase
- Check Phase
5Exclusion Methods
- Partition phase
- Partition either T or P into r-length regions
(depends on particular algorithm) - Search Phase
- Use exact matching to search T for r-length
intervals - These are potential targets for approximate
occurrences of P. - Eliminate as many intervals as possible.
- Check Phase
- Use approximate matching to check for an
approximate occurrence of P around each surviving
interval for the search phase.
6BYP Method
- BYP method has O(m) expected running time.
- Partition P into r-length regions, r ?n/(k1)?
- Q How many r-length regions of P are there?
- A k1, there may be an additional short region.
- Suppose there is a match of P T with at most k
differences. - Q What can we deduce about the corresponding
r-length regions? - AThere must be at least one r-length interval
that exactly matches.
7BYP Method
- BYP Algorithm
- Let P be the set of the first k1 substrings of
Ps partitioning. - Build a keyword tree for the set of patterns P.
- Use Aho-Corasik to find I, the set of starting
locations in T where a pattern in P occurs
exactly. - ..
- Oops! We havent talked about keyword trees or
Aho-Corasik. Sooooo lets do that now.
8Keyword Trees (section 3.4)
- Defn. The keyword tree for set P is a rooted
directed tree K satisfying - Each edge is labeled with one character
- Any two edges out of the same node have distinct
labels. - Every pattern Pi in P maps to some node v of K
s.t. the path from the root to v spells out Pi - Every leaf in K is mapped by some pattern in P.
9Keyword Trees
- Example From textbook P potato, poetry,
pottery, science, school
10Keyword Trees (section 3.4)
- Observation there is an isomorphic mapping
between distinct prefixes of patterns in P and
nodes in K. - Every node corresponds to a prefix of a pattern
in P. - Conversely, every prefix of a pattern maps to a
node in K.
11Keyword Trees (section 3.4)
- If n is the total length of all patterns in P,
then we can construct K in O(n), assuming a fixed
S. - Let Ki denote the partial keyword tree that
encodes patterns P1,.. Pi of P.
12Keyword Trees (section 3.4)
- Consider partial keyword tree K1
- comprised of a single path of P1 edges out of
root r. - Each edge is labeled with one character of P1
- Reading from the root to the leaf spells out P1
- The leaf is labeled 1
13Keyword Trees (section 3.4)
- Creating K2 from K1
- Find the longest path from the root of K1 that
matches a prefix of P2. - This paths ends by
- Either exhausting the characters of P2 or
- Ending at some existing node v in K1 where no
extending match is possible. - In case 2a) label the node where the path ends 2.
- In case 2b) create a new path out of v, labeled
by the remaining characters of P2.
14Keyword Trees (section 3.4)
- Example P1 is potato
- P2 is pot
- P2 is potty
15Keyword Trees (section 3.4)
- Use of keyword trees for matching
- Finding occurrences of patterns in P that occur
starting at position l in T - Starting at the root r in K, follow the unique
path that matches a substring of T that starts at
l. - Numbered nodes along this path indicate matched
patterns in P that start at position l. - This takes time proportional to min(n, m)
- Traversing K for each position l in T gives O(nm)
- This can be improved!
16Keyword Tree Speedup
- Observation Our naïve keyword tree is like the
naïve approach to string comparison. - Every time we increment l, we start all over at
the root of K ? O(nm) - Recall KMP avoided O(nm) by shifting to get a
speedup. - Q Is there an analogous operation we can perform
in K ? - A Of course, why else would I ask a rhetorical
question?
17Keyword Tree Speedup
- First, we assume Pi ? Pj for all combinations
Pi,Pj in P. - (This assumption will be relaxed later in the
full Aho-Corasick Alg.) - Next, each node v in K is labeled with the string
formed by concatenating the letters from the root
to v. - Defn. Let L(v) denote the label of node v.
- Defn. Let lp(v) denote the length of the longest
proper suffix of string L(v) that is a prefix of
some pattern in P.
18Keyword Tree Speedup
- Example L(v) potat, lp(v) 2, the suffix at
is the prefix of P4.
19Keyword Tree Speedup
- Note if a is the lp(v)-length suffix of L(v),
then there is a unique node labeled a. - Example at is the lp(v)-length suffix of L(v),
w is the unique node labeled at.
20Keyword Tree Speedup
- Defn For node v of K let nv be the unique node
in K labeled with the suffix of L(v) of length
lp(v). When lp(v) 0 then nv is the root of K. - Defn The ordered pair (v,nv) is called a failure
link. - Example
21Aho-Corasick (section 3.4.6)
- Algorithm AC search
- (we assume Pi ? Pj for all combinations Pi,Pj in
P.) - l 1
- c 1
- w root of K
- Repeat
- While there is an edge (w,w) labeled
character T(c) - if w is numbered by pattern i then
- report that Pi occurs in T starting
at position l - w w and c c 1
-
- w nw and l c - lp(w)
- Until c gt m
- Note if the root fails to match increment c and
the repeat loop again.
22Aho-Corasick
When l 4 there is a match of pot, but the next
position fails. At this point c 9. The failure
link points to the node labeled at and lp(v) 2.
? l c lp(v) 9 2 7
23Computing nv in Linear Time
- Note if v is the root r or 1 character away from
r, then nv r. - Imagine nv has been computed for for every node
that is exactly k or fewer edges from r. - How can we compute nv for v, a node k1 edges
from r? (We also want L(nv).)
24Computing nv in Linear Time
- We are looking for nv and L(nv).
- Let v be the parent of v in K and x the
character on the edge connecting them. - nv is known since v is k edges from r.
- Clearly, L(nv) must be a suffix of L(nv)
followed by x. - First check if there is an edge (nv,w) with
label x. - If so, then nv w.
- O/w L(nv) is a proper suffix of L(nv) followed
by x. - Examine nnv for an outgoing edge labeled x.
- If no joy, keep repeating, finally setting nv
r, if we run out of edges.
25Computing nv in Linear Time
- Algorithm nv
- / Initialization /
- v parent(v) in K and
- x the character on the edge (v,v)
- w nv
- / computation /
- While ((?? edge labeled x out of node w) (w ?
r)) w nw - if (? edge (w,w) label x) nv w
- else nv r
26Computing nv in Linear Time
- Thm. Alg. nv takes O(n) when applied to all nodes
in K, where n is the length of all patterns in K. - Q How can we demonstrate this?
- Consider pattern P in P, where t is the length of
P. - Since lp(v) ? lp(v) 1 ? lp() is increased by
at most t. - But the assignment (w nw) decreases lp().
- If w is assigned k times then lp(v) ? lp(v)
k. - ? Since lp() is never negative, the assignment
(w nw) is bound by t.
27Computing nv in Linear Time
- Thm. Cont.
- The total time is proportional to the loop
- While ((?? edge labeled x out of node w) (w ?
r)) w nw - Since the loops is bound by t, the length of P,
- all failure links on the path for P are set in
O(t) time. - We can apply the same logic to the other patterns
in P to yield a linear computation for all
failure links.
28Aho-Corasick (section 3.4.6)
- Relaxing the substring assumption ? i.e., Pi ? Pj
for all combinations Pi,Pj in P. - Consider P acatt, ca, T acatg
- T is matched along K until the letter g is
reached. - For the current node v, L(v) acat.
- There is no edge labeled g out of v.
- No proper suffix of acat is a prefix of acatt or
ca - Therefore nv is the root.
- Return to the root and set l 5
- At this point the algorithm will fail to match g.
- It does not find the occurrence of ca in T.
- Q What went wrong?
29Aho-Corasick (section 3.4.6)
- A The problem is that the algorithm is increases
l (shifts) to match the longest suffix of L(v)
with a prefix of some P in P. - P is not necessarily a suffix of T even if P is
embedded in T. - Soln Report fully embedded patterns as they
appear.
30Aho-Corasick (section 3.4.6)
- Q How do we find the fully embedded patterns as
they appear? - Lemmas 3.4.2 3.4.3
- Lemma 3.4.2 Pattern Pi must occur in T ending at
position c if node v is reached and there is a
directed path of failure links in K from a node v
to a node numbered with pattern i. - Lemma 3.4.3 If node v is reached then pattern Pi
occurs in T ending at position c only if v is
numbered i or there is a directed path of failure
links v to a node numbered i.
31Aho-Corasick (section 3.4.6)
- Algorithm full AC search
- (No assumption that Pi ? Pj for all combinations
Pi,Pj in P.) - (Report embedded patterns as they appear.)
- l 1
- c 1
- w root of K
- Repeat
- While there is an edge (w,w) labeled
character T(c) - if w is numbered by pattern i
- or there is a directed path of failure links
from w to a node numbered with i then - report that Pi occurs in T ending at
position c - w w and c c 1
-
- w nw and l c - lp(w)
- Until c gt m.
32Aho-Corasick (section 3.4.6)
- Q How do we recognize directed failure-link
paths to pattern-numbered nodes? - Idea associate with each node its its pattern
numbered node, if there is one. - Q Where should this be done?
- Algorithm nv must be extended.
- Caveat the time can not be more than linear!
33Aho-Corasick (section 3.4.6)
- Idea associate with each node its
pattern-numbered node, if there is one. - Mechanism create an output link from the node to
its pattern-numbered node. - How
- Compute the failure link to nv for node v.
- If nv is a pattern-numbered node vs set output
link to nv. - O/w if nv has an output link to w, set vs output
link to w. - O/w v has no output link.
- This takes O(n) time.
34Aho-Corasick (section 3.4.6)
- Analysis
- Preprocessing of patterns in P can be done in
O(n) - All occurrences in T of P ? P can be found in
time O(m k). - m is the length of T
- k is the number of occurrences of patterns P ? P
. Here we are counting the time to output each
occurrence. - Overall, we get O(nmk) time.
35BYP Method
- BYP method has O(m) expected running time.
- Partition P into r-length regions, r ?n/(k1)?
- Q How many r-length regions of P are there?
- A k1, there may be an additional short region.
- Lemma 12.3.1 Suppose there is a match of P T
with at most k differences. - Q What can we deduce about the corresponding
r-length regions? - AThere must be at least one r-length interval
that exactly matches.
36BYP Method
- BYP Algorithm
- Let P be the set of the first k1 substrings of
Ps partitioning. - Build a keyword tree for the set of patterns P.
- Use Aho-Corasik to find I, the set of starting
locations in T where a pattern in P occurs
exactly. - For each i ? I use approximate matching to locate
end points of approximate occurrences of P in
Ti-n-k..ink
37BYP Method
- Q Why do we choose the range i-n-k..ink in T,
i.e., Ti-n-k..ink ? - What is n?
- What is k?
- Why (n k) ? (n k) ?
38BYP Method
- BYP considers all possible places for an
occurrence of P in T. (lemma 12.3.1) - Step b Building the keyword tree takes O(n)
- Step c Aho-Corasick takes O(m) (since O(mk) is
O(m)) - Note there are other approaches for steps b c
(pg 272) - Step d takes time O(n) for each index in I.
- Recall, that we are interested in expected
running time O(m). Worst case may be larger.
39BYP Expected Time
- From the previous slide steps b c are already
O(m) worst case. (Why is this true?) - Ananlysis of Step d assume
- The size of our alphabet is s (s S )
- T has uniform distribution of characters
- P can be any arbitrary string
- All p ? P have the same length, r.
- What is the expected occurrence of an arbitrary
length r substring in T if T r? - A 1/sr (see next slide for explanation)
40BYP Expected Time
- A 1/sr because we assume a uniform distribution
of characters in T. - The probability of any specific character is 1/s.
- The probability of any sequence of k characters
is 1/sk. - However, T ? r, in fact T ?? r.
- Q If there are m substrings of length r in T,
what is the expected number of exact occurrences
of p in T? - A m/sr
- Q What are the expected number of occurrences in
T of patterns in P? - A m(k1)/sr (recall there are k1 patterns in P)
41BYP Expected Time
- Q How long does it take to check for a single
approximate occurrence of P in T in step d? - A dynamic programming gives O(n2) (global
alignment) - Expected checking time in step d is then
n2m(k1)/sr - For O(m) time, we need to choose k s.t.
42BYP Expected Time
- To simplify, substitute n-1 for k, and solve for
r in
Then sr n3/c , so r logs n3 - logs c But we
are given
43BYP Expected Time
- So for k n/(log n), BYP has an expected
running time of O(m) - Q Where did n/(log n) come from?
- A It is a simple expression for k in terms of n
that satisfies our equation. - To see this, substitute n/(log n) for k into
44BYP Expected Time
Letting sr n3/c , you get
Which simplifies