Bioinformatics Algorithms and Data Structures - PowerPoint PPT Presentation

Slides: 45

Provided by: john244

Learn more at: https://www.cse.sc.edu

1
Bioinformatics Algorithms and Data Structures

Chapter 12.3 Exclusion Methods
Lecturer Dr. Rose
Slides by Dr. Rose
February 20, 2003

2
Overview

Exclusion methods: fast expected time O(m)
Partition approaches
BYP algorithm
Aho-Corasick exact matching algorithm
Keyword trees
Back to Aho-Corasick exact matching algorithm
Algorithm for computing failure links
Back to BYP algorithm

3
Exclusion Methods

Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference?
A: On average, yes. (Are we quibbling?)
We adopt an algorithm whose expected running time is < Θ(km).
However, the worst case may not be better than Θ(km).

4
Exclusion Methods

Partition idea: exclude much of T from the search
Preliminaries:
Let α = |Σ|, where Σ is the alphabet used in P and T.
Let n = |P| and m = |T|.
Defn.: an approximate occurrence of P is an occurrence with at most k mismatches or differences.
General partition algorithm: three phases
Partition phase
Search phase
Check phase

5
Exclusion Methods

Partition phase:
Partition either T or P into r-length regions (depends on the particular algorithm).
Search phase:
Use exact matching to search T for r-length intervals.
These are potential targets for approximate occurrences of P.
Eliminate as many intervals as possible.
Check phase:
Use approximate matching to check for an approximate occurrence of P around each interval that survives the search phase.

6
BYP Method

The BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Suppose there is a match of P in T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one of the k+1 r-length regions must match exactly, since k differences can touch at most k of them.
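
The partition step above can be sketched in a few lines of Python. This is only an illustration: the function name partition_pattern is an assumption and does not come from the slides.

```python
def partition_pattern(p: str, k: int) -> list[str]:
    """Return the first k+1 regions of p, each of length r = floor(n / (k+1)).

    Hypothetical helper for illustration; the name is not from the slides.
    """
    n = len(p)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("pattern is too short to split into k+1 regions")
    # Characters after position (k+1)*r form the short region, which is ignored.
    return [p[i * r:(i + 1) * r] for i in range(k + 1)]
```

For example, partition_pattern("abcdefghij", 2) gives the regions abc, def and ghi, and the final j is the short region.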

7
BYP Method

BYP Algorithm
a) Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
b) Build a keyword tree for the set of patterns 𝒫.
c) Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
...
Oops! We haven't talked about keyword trees or Aho-Corasick. Sooooo let's do that now.

8
Keyword Trees (section 3.4)

Defn.: The keyword tree for a set of patterns 𝒫 is a rooted directed tree K satisfying:
Each edge is labeled with one character.
Any two edges out of the same node have distinct labels.
Every pattern Pi in 𝒫 maps to some node v of K s.t. the path from the root to v spells out Pi.
Every leaf in K is mapped to by some pattern in 𝒫.

9
Keyword Trees

Example from the textbook: 𝒫 = {potato, poetry, pottery, science, school}

10
Keyword Trees (section 3.4)

Observation: there is a one-to-one correspondence between distinct prefixes of patterns in 𝒫 and nodes in K.
Every node corresponds to a prefix of a pattern in 𝒫.
Conversely, every prefix of a pattern maps to a node in K.

11
Keyword Trees (section 3.4)

If n is the total length of all patterns in 𝒫, then we can construct K in O(n) time, assuming a fixed alphabet Σ.
Let Ki denote the partial keyword tree that encodes patterns P1, ..., Pi of 𝒫.

12
Keyword Trees (section 3.4)

Consider the partial keyword tree K1:
It consists of a single path of |P1| edges out of root r.
Each edge is labeled with one character of P1.
Reading from the root to the leaf spells out P1.
The leaf is numbered 1.

13
Keyword Trees (section 3.4)

Creating K2 from K1:
1. Find the longest path from the root of K1 that matches a prefix of P2.
2. This path ends by
a) either exhausting the characters of P2, or
b) ending at some existing node v in K1 where no extending match is possible.
In case 2a), number the node where the path ends 2.
In case 2b), create a new path out of v, labeled by the remaining characters of P2.
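
The insertion procedure above extends naturally to every Pi. The following is a minimal Python sketch of it; the list-of-dicts node layout and the name build_keyword_tree are assumptions made for illustration.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree; node 0 is the root.

    Each node is a dict with 'edges' (character -> child index) and
    'pattern' (1-based number of the pattern ending at the node, or None).
    This layout is an illustrative assumption, not the slides' notation.
    """
    nodes = [{"edges": {}, "pattern": None}]
    for number, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            # Follow the existing path while it matches (step 1) ...
            if ch not in nodes[v]["edges"]:
                # ... and grow a new path once it stops matching (case 2b).
                nodes.append({"edges": {}, "pattern": None})
                nodes[v]["edges"][ch] = len(nodes) - 1
            v = nodes[v]["edges"][ch]
        # Covers case 2a as well: the pattern may end at an existing node.
        nodes[v]["pattern"] = number
    return nodes
```

With the hypothetical input ["potato", "pot", "potty"], the second pattern only numbers an existing node, while the third adds a two-edge branch below the node for pot. Each character is handled once, which matches the O(n) construction bound for a fixed alphabet.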

14
Keyword Trees (section 3.4)

Example: P1 is potato
P2 is pot
P2 is potty

15
Keyword Trees (section 3.4)

Use of keyword trees for matching:
Finding occurrences of patterns in 𝒫 that start at position l in T:
Starting at the root r in K, follow the unique path that matches a substring of T starting at l.
Numbered nodes along this path indicate matched patterns in 𝒫 that start at position l.
This takes time proportional to min(n, m).
Traversing K for each position l in T gives O(nm).
This can be improved!
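
As a rough sketch of this naive search, the Python below restarts at the root for every start position l. The nested-dict tree and the name naive_keyword_search are illustrative assumptions; the key None marks a numbered node.

```python
def naive_keyword_search(patterns, text):
    """Report (l, i) for each pattern i that starts at 1-based position l."""
    root = {}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = number  # None cannot collide with a text character

    hits = []
    for l in range(len(text)):
        node = root  # start all over at the root for every l
        for c in range(l, len(text)):
            if text[c] not in node:
                break
            node = node[text[c]]
            if None in node:
                hits.append((l + 1, node[None]))
    return hits
```

On the hypothetical input patterns ["pot", "potato", "at"] and text "hotpotato", this reports pot and potato at position 4 and at at position 7.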

16
Keyword Tree Speedup

Observation: our naïve use of the keyword tree is like the naïve approach to string comparison.
Every time we increment l, we start all over at the root of K ⇒ O(nm)
Recall that KMP avoided O(nm) by shifting to get a speedup.
Q: Is there an analogous operation we can perform in K?
A: Of course, why else would I ask a rhetorical question?

17
Keyword Tree Speedup

First, we assume that no pattern Pi is a proper substring of another pattern Pj, for all Pi, Pj in 𝒫.
(This assumption will be relaxed later in the full Aho-Corasick alg.)
Next, each node v in K is labeled with the string formed by concatenating the letters from the root to v.
Defn.: Let L(v) denote the label of node v.
Defn.: Let lp(v) denote the length of the longest proper suffix of string L(v) that is a prefix of some pattern in 𝒫.
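
The definition of lp(v) can be checked directly with a brute-force sketch. The function below works on the label L(v) as a string, and the pattern set in the usage note is hypothetical, since the slide's figure is not part of this transcript.

```python
def lp(label, patterns):
    """Length of the longest proper suffix of label that is a prefix of a pattern.

    Brute force for illustration only; it is not the linear-time method.
    """
    # Try proper suffixes from longest to shortest.
    for length in range(len(label) - 1, 0, -1):
        suffix = label[-length:]
        if any(p.startswith(suffix) for p in patterns):
            return length
    return 0
```

With the assumed set ["potato", "other", "theater", "attach"], lp("potat", ...) is 2, because at is the longest proper suffix of potat that begins one of the patterns.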

18
Keyword Tree Speedup

Example: L(v) = potat, lp(v) = 2; the suffix at is a prefix of P4.

19
Keyword Tree Speedup

Note: if α is the lp(v)-length suffix of L(v), then there is a unique node labeled α.
Example: at is the lp(v)-length suffix of L(v); w is the unique node labeled at.

20
Keyword Tree Speedup

Defn.: For a node v of K, let nv be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0, nv is the root of K.
Defn.: The ordered pair (v, nv) is called a failure link.
Example

21
Aho-Corasick (section 3.4.6)

Algorithm AC search
(We assume no Pi is a proper substring of Pj, for all Pi, Pj in 𝒫.)
l = 1
c = 1
w = root of K
Repeat
  While there is an edge (w, w') labeled with character T(c)
    if w' is numbered by pattern i then
      report that Pi occurs in T starting at position l
    w = w' and c = c + 1
  w = nw and l = c - lp(w)
Until c > m
Note: if the root fails to match, increment c and run the repeat loop again.
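
The loop above can be sketched in Python as follows. To keep the sketch short, each node is represented by its label L(v), and failure links are found by brute force rather than by the linear-time method; both choices, the name ac_search and the example patterns are assumptions for illustration.

```python
def ac_search(patterns, text):
    """AC search, assuming no pattern is a proper substring of another.

    Returns (l, i): pattern i starts at 1-based position l of text.
    """
    number = {p: i for i, p in enumerate(patterns, start=1)}
    # Every prefix of every pattern labels exactly one node; "" is the root.
    nodes = {p[:j] for p in patterns for j in range(len(p) + 1)}

    def failure(v):
        # n_v: the node labeled with the longest proper suffix of L(v).
        for start in range(1, len(v) + 1):
            if v[start:] in nodes:
                return v[start:]
        return ""

    hits = []
    w, c, m = "", 0, len(text)  # c is 0-based here; the slides count from 1
    while c < m:
        while c < m and w + text[c] in nodes:  # edge (w, w') labeled T(c)
            w += text[c]
            c += 1
            if w in number:
                hits.append((c - len(w) + 1, number[w]))
        if w == "":
            c += 1          # the root failed to match: just advance c
        else:
            w = failure(w)  # l = c - lp(w) is implied by the new label length
    return hits
```

With the hypothetical patterns ["potato", "other", "theater", "attach"] and T = hotpotattach, the search fails after potat, follows the failure link to the node labeled at, and then reports attach starting at position 7.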

22
Aho-Corasick

Example: T = hotpotattach

When l = 4 there is a match of potat, but the next position fails. At this point c = 9. The failure link points to the node labeled at, and lp(v) = 2.
⇒ l = c − lp(v) = 9 − 2 = 7

23
Computing nv in Linear Time

Note: if v is the root r or one character away from r, then nv = r.
Imagine nv has been computed for every node that is k or fewer edges from r.
How can we compute nv for v, a node k+1 edges from r? (We also want L(nv).)

24
Computing nv in Linear Time

We are looking for nv and L(nv).
Let v' be the parent of v in K and x the character on the edge connecting them.
nv' is known since v' is k edges from r.
Clearly, L(nv) must be a suffix of L(nv') followed by x.
First check if there is an edge (nv', w') with label x.
If so, then nv = w'.
Otherwise, L(nv) is a proper suffix of L(nv') followed by x.
Examine the failure-link target of nv' for an outgoing edge labeled x.
If no joy, keep repeating, finally setting nv = r if we run out of edges.

25
Computing nv in Linear Time

Algorithm nv
/* Initialization */
v' = parent(v) in K
x = the character on the edge (v', v)
w = nv'
/* Computation */
While (¬∃ edge labeled x out of node w) ∧ (w ≠ r)
  w = nw
if (∃ edge (w, w') labeled x) then nv = w'
else nv = r
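
Applied to the nodes in breadth-first order, Algorithm nv can be sketched in Python as below. The array-based tree layout and the name build_failure_links are assumptions for illustration; fail[v] plays the role of nv.

```python
from collections import deque

def build_failure_links(patterns):
    """Return (edges, fail): edges[v] maps a character to a child node and
    fail[v] is n_v, the target of v's failure link. Node 0 is the root r."""
    edges, parent, char_in = [{}], [0], [""]
    for pattern in patterns:
        v = 0
        for ch in pattern:
            if ch not in edges[v]:
                edges.append({})
                parent.append(v)
                char_in.append(ch)
                edges[v][ch] = len(edges) - 1
            v = edges[v][ch]

    fail = [0] * len(edges)
    queue = deque(edges[0].values())  # nodes one edge from r keep n_v = r
    while queue:
        v = queue.popleft()
        queue.extend(edges[v].values())  # breadth-first: shallower nodes first
        if parent[v] == 0:
            continue
        x = char_in[v]          # character on the edge (v', v)
        w = fail[parent[v]]     # w = n_v'
        while x not in edges[w] and w != 0:
            w = fail[w]
        fail[v] = edges[w].get(x, 0)  # n_v = w' if the edge exists, else r
    return edges, fail
```

Breadth-first order guarantees that the failure link of every node closer to the root is already known when v is processed, which is the induction the preceding slides describe.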

26
Computing nv in Linear Time

Thm.: Alg. nv takes O(n) time when applied to all nodes in K, where n is the total length of all patterns in 𝒫.
Q: How can we demonstrate this?
Consider a pattern P in 𝒫, where t is the length of P.
Since lp(v) ≤ lp(v') + 1 ⇒ lp() is increased by at most t along the path for P.
But the assignment (w = nw) decreases lp().
If w is assigned k times, then lp(v) ≤ lp(v') + 1 − k.
⇒ Since lp() is never negative, the total number of assignments (w = nw) is bounded by t.

27
Computing nv in Linear Time

Thm. cont.
The total time is proportional to the number of iterations of the loop
While (¬∃ edge labeled x out of node w) ∧ (w ≠ r)
  w = nw
Since the loop is bounded by t, the length of P, all failure links on the path for P are set in O(t) time.
We can apply the same logic to the other patterns in 𝒫 to yield a linear-time computation for all failure links.

28
Aho-Corasick (section 3.4.6)

Relaxing the substring assumption, i.e., that no Pi is a proper substring of Pj, for all Pi, Pj in 𝒫.
Consider 𝒫 = {acatt, ca}, T = acatg
T is matched along K until the letter g is reached.
For the current node v, L(v) = acat.
There is no edge labeled g out of v.
No proper suffix of acat is a prefix of acatt or ca.
Therefore nv is the root.
Return to the root and set l = 5.
At this point the algorithm will fail to match g.
It does not find the occurrence of ca in T.
Q: What went wrong?

29
Aho-Corasick (section 3.4.6)

A: The problem is that the algorithm increases l (shifts) to match the longest suffix of L(v) with a prefix of some P in 𝒫.
A pattern P embedded in the matched part of T is not necessarily a suffix of it, so the shift can skip P (here ca lies inside acat but is not a suffix of it).
Soln.: Report fully embedded patterns as they appear.

30
Aho-Corasick (section 3.4.6)

Q: How do we find the fully embedded patterns as they appear?
Lemmas 3.4.2 & 3.4.3
Lemma 3.4.2: Pattern Pi must occur in T ending at position c if node v is reached and there is a directed path of failure links in K from v to a node numbered with pattern i.
Lemma 3.4.3: If node v is reached, then pattern Pi occurs in T ending at position c only if v is numbered i or there is a directed path of failure links from v to a node numbered i.

31
Aho-Corasick (section 3.4.6)

Algorithm full AC search
(The substring assumption is dropped: some Pi may be a proper substring of some Pj in 𝒫.)
(Report embedded patterns as they appear.)
l = 1
c = 1
w = root of K
Repeat
  While there is an edge (w, w') labeled with character T(c)
    if w' is numbered by pattern i
    or there is a directed path of failure links from w' to a node numbered with i then
      report that Pi occurs in T ending at position c
    w = w' and c = c + 1
  w = nw and l = c - lp(w)
Until c > m
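
A direct, unoptimized Python sketch of the full search follows. It walks the whole failure-link path at every node reached, which is simple but can exceed linear time. Nodes are again represented by their labels, and the name full_ac_search is an assumption for illustration.

```python
def full_ac_search(patterns, text):
    """Full AC search; returns (c, i): pattern i ends at 1-based position c."""
    number = {p: i for i, p in enumerate(patterns, start=1)}
    nodes = {p[:j] for p in patterns for j in range(len(p) + 1)}

    def failure(v):
        # n_v: the node labeled with the longest proper suffix of L(v).
        for start in range(1, len(v) + 1):
            if v[start:] in nodes:
                return v[start:]
        return ""

    hits = []
    w, c, m = "", 0, len(text)
    while c < m:
        while c < m and w + text[c] in nodes:
            w += text[c]
            c += 1  # c now equals the 1-based position of the last matched character
            u = w
            while u:  # w itself, then each node on its failure-link path
                if u in number:
                    hits.append((c, number[u]))
                u = failure(u)
        if w == "":
            c += 1
        else:
            w = failure(w)
    return hits
```

For 𝒫 = {acatt, ca} and T = acatg, this reports ca ending at position 3, the occurrence that the restricted search misses.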

32
Aho-Corasick (section 3.4.6)

Q: How do we recognize directed failure-link paths to pattern-numbered nodes?
Idea: associate with each node its pattern-numbered node, if there is one.
Q: Where should this be done?
A: Algorithm nv must be extended.
Caveat: the time cannot be more than linear!

33
Aho-Corasick (section 3.4.6)

Idea: associate with each node its pattern-numbered node, if there is one.
Mechanism: create an output link from the node to its pattern-numbered node.
How:
Compute the failure link to nv for node v.
If nv is a pattern-numbered node, set v's output link to nv.
Otherwise, if nv has an output link to w, set v's output link to w.
Otherwise, v has no output link.
This takes O(n) time.
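
The output-link rule can be added to the breadth-first failure-link pass in one extra step per node. The Python below is a sketch under assumed names (build_links, search) and an assumed array layout. The search loop is written in the common one-character-per-step form rather than the slides' repeat/until form.

```python
from collections import deque

def build_links(patterns):
    """Return (edges, number, fail, out) for the keyword tree; node 0 is the root."""
    edges, number = [{}], [None]
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in edges[v]:
                edges.append({})
                number.append(None)
                edges[v][ch] = len(edges) - 1
            v = edges[v][ch]
        number[v] = i
    fail = [0] * len(edges)
    out = [None] * len(edges)
    queue = deque((0, x, v) for x, v in edges[0].items())  # (parent, char, node)
    while queue:
        vp, x, v = queue.popleft()
        queue.extend((v, y, u) for y, u in edges[v].items())
        if vp != 0:
            w = fail[vp]
            while x not in edges[w] and w != 0:
                w = fail[w]
            fail[v] = edges[w].get(x, 0)
        nv = fail[v]
        # Output link: n_v if it is numbered, otherwise n_v's own output link.
        out[v] = nv if number[nv] is not None else out[nv]
    return edges, number, fail, out

def search(patterns, text):
    """Return (c, i): pattern i ends at 1-based position c of text."""
    edges, number, fail, out = build_links(patterns)
    hits, w = [], 0
    for c, ch in enumerate(text, start=1):
        while ch not in edges[w] and w != 0:
            w = fail[w]
        w = edges[w].get(ch, 0)
        u = w if number[w] is not None else out[w]
        while u is not None:  # follow output links only, never plain failure links
            hits.append((c, number[u]))
            u = out[u]
    return hits
```

Each step along an output link reports one occurrence, so the reporting work is proportional to the number of occurrences, which is what keeps the search linear.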

34
Aho-Corasick (section 3.4.6)

Analysis:
Preprocessing of the patterns in 𝒫 can be done in O(n) time.
All occurrences in T of patterns P ∈ 𝒫 can be found in time O(m + k).
m is the length of T.
k is the number of occurrences of patterns P ∈ 𝒫. Here we are counting the time to output each occurrence.
Overall, we get O(n + m + k) time.

35
BYP Method

The BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Lemma 12.3.1: Suppose there is a match of P in T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one of the k+1 r-length regions must match exactly, since k differences can touch at most k of them.

36
BYP Method

BYP Algorithm
a) Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
b) Build a keyword tree for the set of patterns 𝒫.
c) Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
d) For each i ∈ I, use approximate matching to locate end points of approximate occurrences of P in T[i−n−k..i+n+k].
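
The four steps can be put together in a short Python sketch. Two simplifications are assumed here: Python's str.find stands in for the keyword tree and Aho-Corasick in steps b and c, and the check in step d is a standard edit-distance dynamic program with a free starting point. The names byp and approx_ends are hypothetical.

```python
def approx_ends(p, window, k):
    """1-based end positions j in window where some substring ending at j
    is within k differences of p (dynamic programming, one column per j)."""
    n = len(p)
    prev = list(range(n + 1))  # column for the empty prefix of window
    ends = []
    for j, tc in enumerate(window, start=1):
        cur = [0]  # an occurrence may start anywhere in the window
        for i in range(1, n + 1):
            cur.append(min(prev[i] + 1,                      # extra text character
                           cur[i - 1] + 1,                   # extra pattern character
                           prev[i - 1] + (p[i - 1] != tc)))  # match or mismatch
        if cur[n] <= k:
            ends.append(j)
        prev = cur
    return ends

def byp(p, t, k):
    """Sketch of BYP: 1-based end positions of approximate occurrences of p in t."""
    n, m = len(p), len(t)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("pattern is too short for k differences")
    regions = {p[i * r:(i + 1) * r] for i in range(k + 1)}   # step a
    starts = set()
    for region in regions:                                   # steps b and c (str.find)
        pos = t.find(region)
        while pos != -1:
            starts.add(pos)
            pos = t.find(region, pos + 1)
    ends = set()
    for i in starts:                                         # step d
        lo, hi = max(0, i - n - k), min(m, i + n + k)
        ends.update(lo + j for j in approx_ends(p, t[lo:hi], k))
    return sorted(ends)
```

Because of Lemma 12.3.1, checking only the windows around exact region matches should find the same end points as running approx_ends over all of T.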

37
BYP Method

Q: Why do we choose the range i−n−k..i+n+k in T, i.e., T[i−n−k..i+n+k]?
What is n?
What is k?
Why −(n + k) and +(n + k)?

38
BYP Method

BYP considers all possible places for an occurrence of P in T (Lemma 12.3.1).
Step b: Building the keyword tree takes O(n) time.
Step c: Aho-Corasick takes O(m) time (since O(m + k) is O(m)).
Note: there are other approaches for steps b & c (pg. 272).
Step d takes time O(n) for each index in I.
Recall that we are interested in expected running time O(m). The worst case may be larger.

39
BYP Expected Time

From the previous slide: steps b & c are already O(m) in the worst case. (Why is this true?)
Analysis of step d: assume
The size of our alphabet is σ (σ = |Σ|)
T has a uniform distribution of characters
P can be any arbitrary string
All p ∈ 𝒫 have the same length, r.
Q: What is the expected number of occurrences of an arbitrary length-r substring in T if |T| = r?
A: 1/σ^r (see next slide for explanation)

40
BYP Expected Time

A: 1/σ^r, because we assume a uniform distribution of characters in T.
The probability of any specific character is 1/σ.
The probability of any specific sequence of k characters is 1/σ^k.
However, |T| ≠ r; in fact |T| ≫ r.
Q: If there are m substrings of length r in T, what is the expected number of exact occurrences of p in T?
A: m/σ^r
Q: What is the expected number of occurrences in T of patterns in 𝒫?
A: m(k+1)/σ^r (recall there are k+1 patterns in 𝒫)
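
To get a feel for the size of this expectation, the formula can be evaluated directly. The Python below is a small sketch; the function name and the sample values are assumptions chosen for illustration.

```python
def expected_region_hits(m, n, k, sigma):
    """Expected number of exact region occurrences, m(k+1)/sigma^r,
    with r = floor(n / (k+1)), under the uniform-text assumption."""
    r = n // (k + 1)
    return m * (k + 1) / sigma ** r

# Hypothetical values: a text of 1,000,000 characters over a 4-letter alphabet,
# a pattern of length 40, and k = 3 differences, so r = 10.
hits = expected_region_hits(m=1_000_000, n=40, k=3, sigma=4)
```

With these assumed values the result is about 3.8 expected hits, so only a handful of windows would need the more expensive check.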

41
BYP Expected Time

Q: How long does it take to check for a single approximate occurrence of P in T in step d?
A: Dynamic programming gives O(n^2) (global alignment).
Expected checking time in step d is then n^2 m(k+1)/σ^r.
For O(m) time, we need to choose k s.t.

42
BYP Expected Time