Bioinformatics Algorithms and Data Structures - PowerPoint PPT Presentation

Slides: 47

Provided by: john244

1
Bioinformatics Algorithms and Data Structures

Chapter 12.2.4 k-difference Inexact Matching
Lecturer Dr. Rose
Slides by Dr. Rose
February 15, 2007

2
Overview

k-difference inexact matching
Concepts
d-path
Farthest-reaching d-path in a diagonal
O(km) time and space solution
Primer selection problem
Formulations
Exact matching primer
Inexact matching primer
k-difference primer
O(km) time solution to k-difference primer problem

3
Overview

Exclusion methods: fast expected time O(m)
Partition approaches
BYP algorithm
Aho-Corasick exact matching algorithm
Keyword trees
Back to Aho-Corasick exact matching algorithm
Algorithm for computing failure links
Back to BYP algorithm

4
K-difference Inexact Matching

Like the k-mismatch problem, it allows mismatches
Harder than k-mismatch:
it also allows spaces
End spaces in T are not counted
|P| and |T| can be vastly different
⇒ can't focus on a band of width 2k+1 centered around the
main diagonal.

5
K-difference Inexact Matching

Defn:
Diagonals above the main diagonal are numbered 1
through m. Diagonal i starts in cell (0,i).
Diagonals below the main diagonal are numbered -1
through -n. Diagonal -i starts in cell (i,0).
Row 0 is initialized to be all zeros.
Recall: T can have free end spaces.
Setting row 0 to zeros allows the left end of
T to start after a gap without any cost.
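The effect of the all-zero row 0 can be seen in a plain dynamic-programming table. The sketch below is a minimal O(nm) illustration, not the O(km) method of this chapter; the function name k_difference_ends is made up for this example, and positions in T are reported 1-based.

```python
def k_difference_ends(P, T, k):
    """Return the 1-based positions j in T at which some substring of T
    ends that matches P with at most k differences."""
    n, m = len(P), len(T)
    prev = [0] * (m + 1)  # row 0 is all zeros: a match may start anywhere in T
    for i in range(1, n + 1):
        cur = [i] + [0] * m  # column 0: i characters of P aligned with spaces
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j - 1] + (P[i - 1] != T[j - 1]),  # match or mismatch
                prev[j] + 1,                           # space in T
                cur[j - 1] + 1,                        # space in P
            )
        prev = cur
    # Row n holds, for each column j, the fewest differences of any
    # alignment of P against a substring of T ending at j.
    return [j for j in range(1, m + 1) if prev[j] <= k]


print(k_difference_ends("abc", "xxabcxx", 1))
```

Because row 0 costs nothing, the leading "xx" of T is skipped for free; only row n is inspected at the end.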

6
K-difference Inexact Matching

Defn: a d-path is a path that starts in row 0 and
specifies exactly d mismatches and spaces.
Defn: a d-path is farthest-reaching in diagonal
i if it ends in diagonal i and the index of its
ending column c is ≥ the ending column of any
other d-path ending in diagonal i.
You can visualize this as the d-path that ends
farthest along diagonal i.

7
K-difference Inexact Matching

Approach:
Iterate over d (1 ≤ d ≤ k):
find the farthest-reaching d-path for each
diagonal i (-n ≤ i ≤ m).
The farthest-reaching d-path for diagonal i is
found from the farthest-reaching (d-1)-paths on
diagonals i-1, i, and i+1.
Observation: any d-path reaching row n
corresponds to a d-difference occurrence of P in
T.

8
K-difference Inexact Matching

Observation: a farthest-reaching 0-path in
diagonal i is the longest match of T[i..m] and
P[1..n].
Q: Why is this true?
A: A 0-path means an exact match ⇒ no deviation
from the diagonal that you start on.
Using suffix trees:
Build the suffix tree in linear time (linear in
m).
Retrieve farthest-reaching 0-paths in constant
time/path.

9
K-difference Inexact Matching

Q: How do we find the farthest-reaching d-path on
diagonal i for d > 0?
A: The d-path for diagonal i depends on the
previously found (d-1)-paths on diagonals i-1, i,
and i+1.
The 3 cases are:
Path R1, the farthest-reaching (d-1)-path on
diagonal i+1, followed by a vertical edge to
diagonal i.

10
K-difference Inexact Matching

Since R1 is a (d-1)-path on diagonal i+1,
extending it by a vertical edge (adding a space
in T) to diagonal i makes it a d-path on diagonal
i.

11
K-difference Inexact Matching

The 2nd case is
Path R2, the farthest-reaching (d-1)-path on
diagonal i-1, followed by a horizontal edge to
diagonal i.
Again extending a (d-1)-path into a d-path on
diagonal i.

12
K-difference Inexact Matching

Path R3, the farthest-reaching (d-1)-path on
diagonal i, followed by a diagonal edge
corresponding to a mismatch.
Again extending a (d-1)-path into a d-path on
diagonal i.

13
K-difference Inexact Matching

Each of R1, R2, and R3 is initially a
farthest-reaching (d-1)-path on diagonal i+1, i-1,
and i, respectively.
Each is extended by a space or a mismatch
resulting in a d-path on diagonal i.
Each is subsequently extended along diagonal i.
The farthest-reaching d-path on diagonal i must
be one of these.

14
k-differences Algorithm

d = 0
/* Calculate farthest-reaching 0-paths on diagonals 0 through m */
For i = 0 to m
    Find the longest common extension between P[1..n] and T[i..m]
/* Calculate d-paths by extending (d-1)-paths R1, R2, and R3 */
For d = 1 to k
    For i = -n to m
        Extend (d-1)-paths R1, R2, R3 on diagonals i+1, i-1, i to diagonal i.
        One of these is the farthest-reaching d-path on diagonal i.
A path reaching row n defines an inexact
match of P in T containing
at most k differences. The column in row n
indicates the end character in T.
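The algorithm above can be sketched in a few lines. This is a minimal illustration under some stated assumptions: the longest common extension is computed by direct character comparison rather than with a suffix tree (so the O(km) bound is not achieved), paths are capped at the table boundary (so a stored path carries at most d differences rather than exactly d), and the names lce and k_differences are invented for this example.

```python
def lce(P, T, r, c):
    """Longest common extension of P[r:] and T[c:], by direct comparison."""
    length = 0
    while (r + length < len(P) and c + length < len(T)
           and P[r + length] == T[c + length]):
        length += 1
    return length


def k_differences(P, T, k):
    """1-based end positions in T of occurrences of P with at most k differences."""
    n, m = len(P), len(T)
    ends = set()
    # reach[i] = last row of the farthest-reaching path found on diagonal i,
    # i.e. the path ends in cell (reach[i], reach[i] + i).
    reach = {i: lce(P, T, 0, i) for i in range(m + 1)}  # d = 0
    ends.update(n + i for i, row in reach.items() if row == n)
    for d in range(1, k + 1):
        nxt = {}
        for i in range(-n, m + 1):
            candidates = []
            if i + 1 in reach:
                candidates.append(reach[i + 1] + 1)  # R1: vertical edge
            if i - 1 in reach:
                candidates.append(reach[i - 1])      # R2: horizontal edge
            if i in reach:
                candidates.append(reach[i] + 1)      # R3: mismatch edge
            if not candidates:
                continue  # no (d-1)-path can reach this diagonal yet
            row = min(max(candidates), n, m - i)     # stay inside the table
            row += lce(P, T, row, row + i)           # slide along diagonal i
            nxt[i] = row
            if row == n:
                ends.add(n + i)  # column reached in row n
        reach = nxt
    return sorted(j for j in ends if j >= 1)


print(k_differences("abc", "xxabcxx", 1))
```

Only one value per diagonal is kept for the current d, which is enough to report end positions; recovering the alignments themselves would require keeping all k+1 levels, as in the space analysis.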

15
K-difference Inexact Matching

Space analysis:
For each d and i, we need to store the ending location
of the farthest-reaching d-path.
d ranges from 0 to k.
There are n+m+1 diagonals (-n through m).
⇒ O(km) space is required.

16
K-difference Inexact Matching

Time analysis:
Constant time to retrieve the 3 (d-1)-paths for a
particular d and i.
⇒ O(km) for this aspect (like k-differences
alignment)
Correspondingly, O(km) extensions of paths along
diagonals.
Each path extension is a maximal identical
substring in P and T, i.e., a longest common
extension computation.
Using a suffix tree, each extension takes only constant time.
Creating the suffix tree entails linear
processing of the strings: O(n+m)
⇒ altogether O(n + m + km) = O(km)

17
Primer (Probe) Selection Problem

Problem: start with two strings a and b (detailed
description on pages 178-179).
Exact matching version: ∀ j > j0, find the
shortest substring g of a starting at a[j] s.t. g is
not a substring of b.
Can be solved in O(|a| + |b|)
Not too bad.
Inexact matching version: Given parameter p, ∀ j >
j0, find the shortest substring g of a starting at
a[j] that has edit distance at least |g|/p from any
substring in b.

18
Primer (Probe) Selection Problem

Inexact matching version: Given parameter p, ∀ j >
j0, find the shortest substring g of a starting at
a[j] that has edit distance at least |g|/p from any
substring in b.
Q: How much work is this?
Find the shortest prefix g of a with edit
distance at least |g|/p from any substring in b.
The naïve approach appears daunting.
Let's look at a less intimidating formulation!

19
Primer (Probe) Selection Problem

Change |g|/p to k
Convert the inexact matching problem to a
k-differences problem.
This works out since, in practice, |g|/p must
fall in a small range for fixed p.
k-difference primer problem: Given parameter k,
∀ j > j0, find the shortest substring g of a
starting at a[j] that has edit distance at least k
from any substring in b.

20
Primer (Probe) Selection Problem

Approach:
For each position j in a:
Find the shortest prefix of a[j..n] with edit
distance ≥ k from every substring in b.
Q: How does this compare with the k-differences
inexact matching problem?
A: It is the opposite problem.
Find matches with at most k differences,
versus
Reject prefixes of a[j..n] that match
substrings of b with fewer than k differences.

21
Primer (Probe) Selection Problem

Solution:
Use the k-differences algorithm.
Use a[j..n] in the place of P.
Use b in the place of T.
Compute the farthest-reaching d-path, d ≤ k, in
each diagonal.
d-paths, d < k, reaching row n mean no solution
at j.
Q: Why?
A: A d-path, d < k, reaching row n indicates a[j..n] matches a
substring of b with fewer than k differences.

22
Primer (Probe) Selection Problem

Solution:
Only if no farthest-reaching (k-1)-path reaches
row n can there be a primer at position j.
In particular, if no farthest-reaching
(k-1)-path reaches row r < n, then a[j..r] is a
primer if r is the smallest row with this
property.
Repeat this approach for every potential starting
position j in a.
Analysis: if |a| = n and |b| = m, then the
algorithm takes time O(knm).
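The "smallest row r" condition can be checked directly with a full dynamic-programming table: no path with fewer than k differences reaches row r exactly when every entry of row r is at least k. The sketch below uses that plain table instead of d-paths, so it costs O(nm) per start position rather than O(km); the function name shortest_primer_at and the 0-based index j are assumptions made for this example.

```python
def shortest_primer_at(a, b, j, k):
    """Shortest prefix of a[j:] whose edit distance to every substring of b
    is at least k, or None if no prefix qualifies (j is 0-based)."""
    m = len(b)
    prev = [0] * (m + 1)  # row 0: a substring of b may start anywhere
    for r in range(1, len(a) - j + 1):
        ch = a[j + r - 1]
        cur = [r] + [0] * m
        for c in range(1, m + 1):
            cur[c] = min(prev[c - 1] + (ch != b[c - 1]),
                         prev[c] + 1,
                         cur[c - 1] + 1)
        # min(cur) is the smallest edit distance between the prefix of
        # length r and any substring of b (column 0 is the empty substring).
        if min(cur) >= k:
            return a[j:j + r]
        prev = cur
    return None


print(shortest_primer_at("acgtt", "acga", 0, 1))
```

With k = 1 this reduces to the exact matching version: the answer is the shortest prefix that does not occur in b at all.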

23
Exclusion Methods

Q: Can we improve on the Θ(km) time we have seen
for k-mismatch and k-difference?
A: On average, yes. (Are we quibbling?)
We adopt an algorithm with fast expected time, < Θ(km)
⇒ the worst case may not be better than Θ(km)

24
Exclusion Methods

Partition Idea: exclude much of T from the search
Preliminaries:
Let α = |Σ|, where Σ is the alphabet used in P
and T.
Let n = |P| and m = |T|.
Defn. an approximate occurrence of P is an
occurrence with at most k mismatches or
differences.
General Partition algorithm: three phases
Partition phase
Search Phase
Check Phase

25
Exclusion Methods

Partition phase
Partition either T or P into r-length regions
(depends on particular algorithm)
Search Phase
Use exact matching to search T for r-length
intervals
These are potential targets for approximate
occurrences of P.
Eliminate as many intervals as possible.
Check Phase
Use approximate matching to check for an
approximate occurrence of P around each interval
that survives the search phase.

26
BYP Method

BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Suppose there is a match of P in T with at most k
differences.
Q: What can we deduce about the corresponding
r-length regions?
A: There must be at least one r-length interval
that exactly matches.
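The partition and the pigeonhole argument can be illustrated briefly: k differences can damage at most k of the k+1 regions, so at least one region survives intact. This is a minimal sketch assuming n ≥ k+1 (so that r ≥ 1); the function name partition_regions is made up for this example.

```python
def partition_regions(P, k):
    """First k+1 consecutive regions of P, each of length r = floor(n/(k+1)).
    Any leftover short region at the end of P is ignored."""
    r = len(P) // (k + 1)
    return [P[q * r:(q + 1) * r] for q in range(k + 1)]


regions = partition_regions("abcdefgh", 2)  # r = 8 // 3 = 2
print(regions)

# A copy of P with two substitutions still contains one region exactly.
damaged = "abXdeYgh"
print([region for region in regions if region in damaged])
```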

27
BYP Method

BYP Algorithm
Let 𝒫 be the set of the first k+1 substrings of
P's partitioning.
Build a keyword tree for the set of patterns 𝒫.
Use Aho-Corasick to find I, the set of starting
locations in T where a pattern in 𝒫 occurs
exactly.
...
Oops! We haven't talked about keyword trees or
Aho-Corasick. Sooooo let's do that now.

28
Keyword Trees (section 3.4)

Defn. The keyword tree for set P is a rooted
directed tree K satisfying
Each edge is labeled with one character
Any two edges out of the same node have distinct
labels.
Every pattern Pi in P maps to some node v of K
s.t. the path from the root to v spells out Pi
Every leaf in K is mapped by some pattern in P.

29
Keyword Trees

Example from the textbook: P = {potato, poetry,
pottery, science, school}

30
Keyword Trees (section 3.4)

Observation there is an isomorphic mapping
between distinct prefixes of patterns in P and
nodes in K.
Every node corresponds to a prefix of a pattern
in P.
Conversely, every prefix of a pattern maps to a
node in K.

31
Keyword Trees (section 3.4)

If n is the total length of all patterns in P,
then we can construct K in O(n) time, assuming a fixed
alphabet Σ.
Let Ki denote the partial keyword tree that
encodes patterns P1, ..., Pi of P.

32
Keyword Trees (section 3.4)

Consider partial keyword tree K1:
comprised of a single path of |P1| edges out of
root r.
Each edge is labeled with one character of P1
Reading from the root to the leaf spells out P1
The leaf is labeled 1

33
Keyword Trees (section 3.4)

Creating K2 from K1:
1. Find the longest path from the root of K1 that
matches a prefix of P2.
2. This path ends by either
2a) exhausting the characters of P2, or
2b) ending at some existing node v in K1 where no
extending match is possible.
In case 2a) label the node where the path ends 2.
In case 2b) create a new path out of v, labeled
by the remaining characters of P2, and label its final node 2.
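The incremental construction can be sketched with nested dictionaries. This is one possible representation, not the textbook's; the names build_keyword_tree and node_for, and the dictionary keys "edges" and "pattern", are choices made for this example.

```python
def build_keyword_tree(patterns):
    """Insert the patterns one at a time. Each node is a dict holding its
    outgoing edges and the number of the pattern ending there (or None)."""
    root = {"edges": {}, "pattern": None}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            # Follow the matching edge if it exists; otherwise start a
            # new path out of the current node (case 2b).
            if ch not in node["edges"]:
                node["edges"][ch] = {"edges": {}, "pattern": None}
            node = node["edges"][ch]
        node["pattern"] = number  # label the node where the pattern ends
    return root


def node_for(root, prefix):
    """Node reached by spelling out prefix from the root, or None."""
    node = root
    for ch in prefix:
        node = node["edges"].get(ch)
        if node is None:
            return None
    return node


tree = build_keyword_tree(["potato", "pot", "potty"])
print(node_for(tree, "pot")["pattern"], sorted(node_for(tree, "pot")["edges"]))
```

Each character of each pattern is handled once, which matches the O(n) construction bound for a fixed alphabet.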

34
Keyword Trees (section 3.4)

Example: P1 is potato
P2 is pot (case 2a), or
P2 is potty (case 2b)

35
Keyword Trees (section 3.4)

Use of keyword trees for matching
Finding occurrences of patterns in P that occur
starting at position l in T
Starting at the root r in K, follow the unique
path that matches a substring of T that starts at
l.
Numbered nodes along this path indicate matched
patterns in P that start at position l.
This takes time proportional to min(n, m)
Traversing K for each position l in T gives O(nm)
This can be improved!
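The naive search restarts at the root for every start position l. The sketch below shows that behaviour; the pattern set is hypothetical, chosen only so that the text hotpotattach contains matches, and the function name naive_keyword_search is made up for this example.

```python
def naive_keyword_search(patterns, T):
    """Report (l, i) whenever pattern Pi occurs in T starting at 1-based
    position l, restarting at the root of the keyword tree for every l."""
    root = {"edges": {}, "pattern": None}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            if ch not in node["edges"]:
                node["edges"][ch] = {"edges": {}, "pattern": None}
            node = node["edges"][ch]
        node["pattern"] = number

    hits = []
    for l in range(len(T)):
        node = root  # start all over at the root
        for c in range(l, len(T)):
            node = node["edges"].get(T[c])
            if node is None:
                break  # the unique matching path ends here
            if node["pattern"] is not None:
                hits.append((l + 1, node["pattern"]))
    return hits


print(naive_keyword_search(["pot", "tat", "at"], "hotpotattach"))
```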

36
Keyword Tree Speedup

Observation Our naïve keyword tree is like the
naïve approach to string comparison.
Every time we increment l, we start all over at
the root of K ⇒ O(nm)
Recall KMP avoided O(nm) by shifting to get a
speedup.
Q Is there an analogous operation we can perform
in K ?
A Of course, why else would I ask a rhetorical
question?

37
Keyword Tree Speedup

First, we assume Pi is not a proper substring of Pj
for all pairs Pi, Pj in P.
Next, each node v in K is labeled with the string
formed by concatenating the letters from the root
to v.
Defn. Let L(v) denote the label of node v.
Defn. Let lp(v) denote the length of the longest
proper suffix of string L(v) that is a prefix of
some pattern in P.

38
Keyword Tree Speedup

Example: L(v) = potat, lp(v) = 2; the suffix "at"
is a prefix of P4.
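The definition of lp(v) can be evaluated directly by trying proper suffixes from longest to shortest. The pattern set on the slide's figure is not reproduced in this transcript, so the set used below is hypothetical, picked so that "at" is a prefix of some pattern; the function name lp follows the slide's notation.

```python
def lp(label, patterns):
    """Length of the longest proper suffix of label that is a prefix of
    some pattern (0 if there is none). Brute force, straight from the
    definition."""
    for length in range(len(label) - 1, 0, -1):
        suffix = label[-length:]
        if any(p.startswith(suffix) for p in patterns):
            return length
    return 0


print(lp("potat", ["potato", "attach"]))
```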

39
Keyword Tree Speedup

Note: if a is the lp(v)-length suffix of L(v),
then there is a unique node labeled a.
Example: "at" is the lp(v)-length suffix of L(v);
w is the unique node labeled "at".

40
Keyword Tree Speedup

Defn: For node v of K, let n_v be the unique node
in K labeled with the suffix of L(v) of length
lp(v). When lp(v) = 0, n_v is the root of K.
Defn: The ordered pair (v, n_v) is called a failure
link.
Example

41
Aho-Corasick (section 3.4.6)

Algorithm AC search
l = 1
c = 1
w = root of K
Repeat
    While there is an edge (w, w') labeled with character T(c)
        if w' is numbered by pattern i then
            report that Pi occurs in T starting at position l
        w = w' and c = c + 1
    w = n_w and l = c - lp(w)
Until c > m
Note: if the root fails to match, increment c and
run the repeat loop again.
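A runnable version of the search loop might look as follows. It is a sketch under the same assumption as the slides (no pattern is a proper substring of another), so it reports a match only when a numbered node is entered. The Node class, the function names, and the 0-based counters are choices made for this example; failure links are filled in breadth-first.

```python
from collections import deque


class Node:
    def __init__(self, depth):
        self.edges = {}      # character -> child node
        self.pattern = None  # pattern number if a pattern ends here
        self.depth = depth   # |L(v)|
        self.fail = None     # n_v, the target of the failure link


def build(patterns):
    root = Node(0)
    for number, p in enumerate(patterns, start=1):
        node = root
        for ch in p:
            if ch not in node.edges:
                node.edges[ch] = Node(node.depth + 1)
            node = node.edges[ch]
        node.pattern = number
    root.fail = root
    queue = deque()
    for child in root.edges.values():
        child.fail = root  # nodes one edge from the root fail to the root
        queue.append(child)
    while queue:
        v = queue.popleft()
        for x, child in v.edges.items():
            w = v.fail
            while w is not root and x not in w.edges:
                w = w.fail
            child.fail = w.edges.get(x, root)
            queue.append(child)
    return root


def ac_search(patterns, T):
    """Return (l, i) pairs: pattern Pi occurs in T at 1-based position l."""
    root = build(patterns)
    hits, w, l, c = [], root, 0, 0
    while c < len(T):
        while c < len(T) and T[c] in w.edges:
            w = w.edges[T[c]]
            c += 1
            if w.pattern is not None:
                hits.append((l + 1, w.pattern))
        if w is root:
            c += 1           # the root fails to match: skip this character
            l = c
        else:
            w = w.fail       # follow the failure link
            l = c - w.depth  # depth of the new node equals lp of the old one
    return hits


print(ac_search(["potato", "tattoo", "theater", "other"], "xpotattooy"))
```

In the example run, the search matches "potat", fails on the next character, and follows the failure link to the node labeled "tat" without re-reading any character of T.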

42
Aho-Corasick

Example: T = hotpotattach

When l = 4 there is a match of potat, but the next
position fails. At this point c = 9. The failure
link points to the node labeled "at" and lp(v) = 2.
⇒ l = c - lp(v) = 9 - 2 = 7

43
Computing nv in Linear Time

Note: if v is the root r or 1 character away from
r, then n_v = r.
Imagine n_v has been computed for every node
that is k or fewer edges from r.
How can we compute n_v for v, a node k+1 edges
from r?

44
Computing nv in Linear Time

We are looking for n_v and L(n_v).
Let v' be the parent of v in K and x the
character on the edge connecting them.
n_v' is known since v' is k edges from r.
Clearly, L(n_v) must be a suffix of L(n_v')
followed by x.
First check if there is an edge (n_v', w') with
label x.
If so, then n_v = w'.
O/w L(n_v) is a proper suffix of L(n_v') followed
by x.
Examine the failure link of n_v' (the node n_{n_v'}) for an outgoing edge labeled x.
If no joy, keep repeating, finally setting n_v =
r, if we run out of edges.
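The level-by-level computation of n_v can be sketched by identifying each node with its label L(v). This keeps the example short and easy to inspect, but building and hashing label strings costs more than constant time per step, so the sketch shows the logic rather than the linear bound; a pointer-based tree is needed for that. The function name failure_links and the pattern sets are assumptions for this example.

```python
def failure_links(patterns):
    """Map each node label L(v) to L(n_v). The root is the empty string."""
    # Every prefix of every pattern is the label of exactly one node.
    nodes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    fail = {"": ""}
    # Process nodes in order of increasing distance from the root.
    for label in sorted(nodes - {""}, key=len):
        parent, x = label[:-1], label[-1]
        if parent == "":
            fail[label] = ""  # one character away from the root
            continue
        w = fail[parent]      # n_v' for the parent v'
        while w != "" and w + x not in nodes:
            w = fail[w]       # no edge labeled x: follow the next link
        fail[label] = w + x if w + x in nodes else ""
    return fail


links = failure_links(["potato", "attach"])
print(links["potat"])
```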

45
BYP Method

BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Suppose there is a match of P in T with at most k
differences.
Q: What can we deduce about the corresponding
r-length regions?
A: There must be at least one r-length interval
that exactly matches.

46
BYP Method

BYP Algorithm
Let 𝒫 be the set of the first k+1 substrings of
P's partitioning.
Build a keyword tree for the set of patterns 𝒫.
Use Aho-Corasick to find I, the set of starting
locations in T where a pattern in 𝒫 occurs
exactly.
For each i ∈ I, use approximate matching to locate
end points of approximate occurrences of P in
T[i-n-k..i+n+k]
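The three phases can be put together in a short sketch. Two simplifications are assumed here: Python's built-in str.find stands in for the Aho-Corasick search over the keyword tree, and the check phase uses a plain dynamic-programming table on each window. The function name byp is made up for this example, and n ≥ k+1 is assumed so that the regions are non-empty.

```python
def byp(P, T, k):
    """1-based end positions in T of occurrences of P with at most k
    differences, found by partition, exact search, then local checking."""
    n, m = len(P), len(T)
    r = n // (k + 1)
    # Partition phase: the first k+1 regions of P, each of length r.
    regions = {P[q * r:(q + 1) * r] for q in range(k + 1)}

    # Search phase: exact occurrences of any region (stand-in for Aho-Corasick).
    starts = set()
    for region in regions:
        pos = T.find(region)
        while pos != -1:
            starts.add(pos)
            pos = T.find(region, pos + 1)

    # Check phase: approximate matching in a window around each hit.
    ends = set()
    for i in starts:
        lo, hi = max(0, i - n - k), min(m, i + n + k)
        window = T[lo:hi]
        prev = [0] * (len(window) + 1)
        for a in range(1, n + 1):
            cur = [a] + [0] * len(window)
            for b in range(1, len(window) + 1):
                cur[b] = min(prev[b - 1] + (P[a - 1] != window[b - 1]),
                             prev[b] + 1,
                             cur[b - 1] + 1)
            prev = cur
        ends.update(lo + b for b in range(1, len(window) + 1) if prev[b] <= k)
    return sorted(ends)


print(byp("abc", "xxabcxx", 1))
```

When few regions occur exactly in T, most of T is never passed to the quadratic check, which is where the fast expected running time comes from.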