Bioinformatics Algorithms and Data Structures

1
Bioinformatics Algorithms and Data Structures
  • Chapter 12.2.4 k-difference Inexact Matching
  • Lecturer Dr. Rose
  • Slides by Dr. Rose
  • February 15, 2007

2
Overview
  • k-difference inexact matching
  • Concepts
  • d-path
  • Farthest-reaching d-path in a diagonal
  • O(km) time and space solution
  • Primer selection problem
  • Formulations
  • Exact matching primer
  • Inexact matching primer
  • k-difference primer
  • O(km) time solution to k-difference primer problem

3
Overview
  • Exclusion methods fast expected time O(m)
  • Partition approaches
  • BYP algorithm
  • Aho-Corasick exact matching algorithm
  • Keyword trees
  • Back to Aho-Corasick exact matching algorithm
  • Algorithm for computing failure links
  • Back to BYP algorithm

4
K-difference Inexact Matching
  • Like the k-mismatch problem, it allows mismatches
  • Harder than k-mismatch
  • allows spaces
  • End spaces in T are not counted
  • P and T can be vastly different
  • ⇒ can't focus on a 2k+1 band centered around the
    diagonal.

5
K-difference Inexact Matching
  • Defn
  • Diagonals above the main diagonal are numbered 1
    through m. Diagonal i starts in cell (0,i).
  • Diagonals below the main diagonal are numbered -1
    through -n. Diagonal -i starts in cell (i,0).
  • Row 0 is initialized to be all zeros.
  • Recall T can have free end spaces
  • Setting row 0 to be zeros allows the left end of
    T to start after a gap without any cost.
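As a point of reference before the diagonal-based method, the effect of the zero-initialized row 0 can be seen in the plain O(nm) dynamic program. The sketch below is not from the slides: the function name `k_difference_ends` and the choice to report 1-based end positions in T are assumptions made for illustration.

```python
def k_difference_ends(P, T, k):
    """Return the 1-based positions j in T at which some substring of T
    ends that matches P with at most k differences."""
    n, m = len(P), len(T)
    prev = [0] * (m + 1)  # row 0 is all zeros: a match may start anywhere in T
    for i in range(1, n + 1):
        cur = [i] + [0] * m  # column 0: i characters of P against nothing
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j - 1] + (P[i - 1] != T[j - 1]),  # match or mismatch
                prev[j] + 1,                           # space in T
                cur[j - 1] + 1,                        # space in P
            )
        prev = cur
    # Reading the answer off row n means end spaces in T are not counted.
    return [j for j in range(1, m + 1) if prev[j] <= k]


print(k_difference_ends("abc", "xxabcxx", 1))  # [4, 5, 6]
```

The three end positions correspond to "ab" (one space), "abc" (exact), and "abcx" (one space).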

6
K-difference Inexact Matching
  • Defn: a d-path is a path that starts in row 0 and
    specifies exactly d mismatches and spaces.
  • Defn: a d-path is farthest-reaching in diagonal
    i if it ends in diagonal i and the index of its
    ending column c is ≥ the ending column of any
    other d-path ending in diagonal i.
  • You can visualize this as a d-path that ends
    farthest in diagonal i.

7
K-difference Inexact Matching
  • Approach
  • Iterate (1 ≤ d ≤ k)
  • find the farthest-reaching d-path for each
    diagonal i, (-n ≤ i ≤ m)
  • The farthest-reaching d-path for diagonal i is
    found from the farthest-reaching (d-1)-paths on
    diagonals i-1, i and i+1.
  • Observation: any d-path reaching row n
    corresponds to a d-difference occurrence of P in
    T.

8
K-difference Inexact Matching
  • Observation: a farthest-reaching 0-path in
    diagonal i is the longest match of T[i..m] and
    P[1..n].
  • Q: Why is this true?
  • A: A 0-path means an exact match ⇒ no deviation
    from the diagonal that you start on.
  • Using suffix trees
  • Build the suffix tree in linear time (linear in
    m).
  • Retrieve farthest-reaching 0-paths in constant
    time/path.

9
K-difference Inexact Matching
  • Q: How do we find the farthest-reaching d-path on
    diagonal i for d > 0?
  • A: The d-path for diagonal i depends on the
    previously found (d-1)-paths on diagonals i-1, i
    and i+1.
  • The 3 cases are:
  • Path R1, the farthest-reaching (d-1)-path on
    diagonal i+1, followed by a vertical edge to
    diagonal i.

10
K-difference Inexact Matching
  • Since R1 is a (d-1)-path on diagonal i+1,
    extending it by a vertical edge (adding a space
    in T) to diagonal i makes it a d-path on diagonal
    i.

11
K-difference Inexact Matching
  • The 2nd case is
  • Path R2, the farthest-reaching (d-1)-path on
    diagonal i-1, followed by a horizontal edge to
    diagonal i.
  • Again extending a (d-1)-path into a d-path on
    diagonal i.

12
K-difference Inexact Matching
  • Path R3, the farthest-reaching (d-1)-path on
    diagonal i, followed by a diagonal edge
    corresponding to a mismatch.
  • Again extending a (d-1)-path into a d-path on
    diagonal i.

13
K-difference Inexact Matching
  • Each of R1, R2, and R3 is initially a
    farthest-reaching (d-1)-path on diagonal i+1, i-1,
    and i, respectively.
  • Each is extended by a space or a mismatch
    resulting in a d-path on diagonal i.
  • Each is subsequently extended along diagonal i.
  • The farthest-reaching d-path on diagonal i must
    be one of these.

14
k-differences Algorithm
  • d = 0
  • /* Calculate farthest-reaching 0-paths on
    diagonals 0 through m */
  • For i = 0 to m
  • Find the longest common extension between
    P[1..n] and T[i..m]
  • /* Calculate d-paths by extending (d-1)-paths R1,
    R2, and R3 */
  • For d = 1 to k
  • For i = -n to m
  • extend (d-1)-paths R1, R2, R3 on diagonals i+1,
    i-1, i to diagonal i.
  • One of these is the farthest-reaching d-path on
    diagonal i.
  • A path reaching row n defines an inexact
    match of P in T containing
    at most k differences. The column in row n
    indicates the end character in T.
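The loop structure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecturer's implementation: `lce` is a naive longest-common-extension scan standing in for the constant-time suffix-tree query, the table `far` stores for each diagonal the farthest row reached with at most d differences, and the names `lce`, `far`, `NEG`, and `k_differences` are all hypothetical. Indices into the Python strings are 0-based; reported end positions are 1-based columns of T.

```python
def lce(P, T, r, c):
    """Naive longest common extension of P[r:] and T[c:].
    (A suffix tree with LCA preprocessing would answer this in O(1).)"""
    length = 0
    while (r + length < len(P) and c + length < len(T)
           and P[r + length] == T[c + length]):
        length += 1
    return length


def k_differences(P, T, k):
    """1-based end columns in T of occurrences of P with <= k differences."""
    n, m = len(P), len(T)
    NEG = -(n + m + k + 2)       # sentinel: no path on this diagonal yet
    # far[i + n] = farthest row reached on diagonal i, for -n <= i <= m
    far = [NEG] * (n + m + 1)
    for i in range(0, m + 1):    # d = 0: slide down diagonals 0..m from row 0
        far[i + n] = lce(P, T, 0, i)
    for d in range(1, k + 1):
        prev = far
        far = [NEG] * (n + m + 1)
        for i in range(-n, m + 1):
            r1 = prev[i + 1 + n] + 1 if i < m else NEG   # vertical edge from i+1
            r2 = prev[i - 1 + n] if i > -n else NEG      # horizontal edge from i-1
            r3 = prev[i + n] + 1                         # mismatch on diagonal i
            row = min(max(r1, r2, r3), n, m - i)         # stay inside the table
            if row < max(0, -i):
                continue                                 # diagonal not reachable yet
            far[i + n] = row + lce(P, T, row, row + i)   # extend along diagonal i
    # A path reaching row n on diagonal i ends at column n + i of T.
    return sorted(n + i for i in range(-n, m + 1)
                  if far[i + n] == n and n + i >= 1)


print(k_differences("abc", "xxabcxx", 1))  # [4, 5, 6]
```

Because the naive `lce` scans characters, this sketch does not achieve the O(km) bound; it only mirrors the order in which the farthest-reaching d-paths are computed.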

15
K-difference Inexact Matching
  • Space analysis
  • For each d and i, we need to store the ending
    location of the farthest-reaching d-path.
  • d ranges from 0 to k.
  • There are (n+m) diagonals.
  • ⇒ O(km) space is required.

16
K-difference Inexact Matching
  • Time analysis
  • Constant time to retrieve 3 (d-1)-paths for
    particular d and i.
  • ⇒ O(km) for this aspect (like k-differences
    alignment)
  • Corresponding O(km) extensions of paths along
    diagonal.
  • Each path extension is a maximal identical
    substring in P and T, i.e., a longest common
    extension computation.
  • Using a suffix tree entails only constant time.
  • Creating the suffix tree entails linear
    processing of strings O(n+m)
  • ⇒ altogether O(n+m+km) = O(km)

17
Primer (Probe) Selection Problem
  • Problem: start with two strings α and β (detailed
    description on page 178-179).
  • Exact matching version: ∀ j > j0, find the
    shortest substring γ of α starting at α[j] s.t. γ
    is not a substring of β.
  • Can be solved in O(|α| + |β|)
  • Not too bad.
  • Inexact matching version: Given parameter p, ∀ j >
    j0, find the shortest substring γ of α starting at
    α[j] that has edit distance at least |γ|/p from any
    substring in β.

18
Primer (Probe) Selection Problem
  • Inexact matching version: Given parameter p, ∀ j >
    j0, find the shortest substring γ of α starting at
    α[j] that has edit distance at least |γ|/p from any
    substring in β.
  • Q: How much work is this?
  • find the shortest prefix γ of α with edit
    distance at least |γ|/p from any substring in β.
  • The naïve approach appears daunting.
  • Let's look at a less intimidating formulation!

19
Primer (Probe) Selection Problem
  • Change |γ|/p to k
  • Convert the inexact matching problem to a
    k-differences problem.
  • This works out since in practice, |γ|/p must
    fall in a small range for fixed p.
  • k-difference primer problem: Given parameter k,
    ∀ j > j0, find the shortest substring γ of α
    starting at α[j] that has edit distance at least k
    from any substring in β.

20
Primer (Probe) Selection Problem
  • Approach
  • For each position j in a
  • Find the shortest prefix of α[j..n] with edit
    distance ≥ k from every substring in β.
  • Q: How does this compare with the k-differences
    inexact matching problem?
  • A: It is the opposite problem.
  • Find matches with at most k differences,
  • versus
  • Reject matches of prefixes of α[j..n] with
    substrings of β with fewer than k differences.

21
Primer (Probe) Selection Problem
  • Solution
  • Use k-differences algorithm.
  • Use α[j..n] in the place of P.
  • Use β in the place of T.
  • Compute the farthest-reaching d-path, d ≤ k, in
    each diagonal.
  • d-paths, d < k, reaching row n, mean no solution
    at j
  • Q: Why?
  • A: a d-path, d < k, indicates α[j..n] matches a
    substring of β with fewer than k differences.

22
Primer (Probe) Selection Problem
  • Solution
  • Only if no farthest-reaching (k-1)-path reaches
    row n can there be a primer at position j.
  • In particular, if no farthest-reaching
    (k-1)-path reaches row r < n then α[j..r] is a
    primer if r is the smallest row with this
    property.
  • Repeat this approach for every potential starting
    position j in α.
  • Analysis: if |α| = n and |β| = m, then the
    algorithm takes time O(knm).
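To make the "smallest row r" criterion concrete, here is a small sketch for a single starting position j. It deliberately swaps in the plain row-by-row dynamic program (row 0 all zeros) for the farthest-reaching d-path computation, so it illustrates the criterion but not the O(km)-per-position bound. The function name `shortest_k_difference_primer` and the 0-based index j are assumptions for illustration, and the empty string is counted as a substring of β.

```python
def shortest_k_difference_primer(alpha, beta, j, k):
    """Shortest prefix of alpha[j:] whose edit distance to every substring
    of beta is at least k, or None if no prefix qualifies."""
    m = len(beta)
    prev = [0] * (m + 1)              # row 0: a match may start anywhere in beta
    for r in range(1, len(alpha) - j + 1):
        ch = alpha[j + r - 1]
        cur = [r] + [0] * m
        for c in range(1, m + 1):
            cur[c] = min(prev[c - 1] + (ch != beta[c - 1]),  # match or mismatch
                         prev[c] + 1,                        # space in beta
                         cur[c - 1] + 1)                     # space in alpha
        # min(cur) is the distance from this prefix to its closest substring
        # of beta; if it is >= k, no path with fewer than k differences
        # reaches row r.
        if min(cur) >= k:
            return alpha[j:j + r]
        prev = cur
    return None


print(shortest_k_difference_primer("acgtacgt", "acgaacga", 0, 1))  # acgt
print(shortest_k_difference_primer("acgtacgt", "acgaacga", 0, 2))  # acgtacgt
```

With k = 1 the answer is simply the shortest prefix that does not occur in β. With k = 2 both t characters are needed, because β contains no t and each one costs at least one difference.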

23
Exclusion Methods
  • Q: Can we improve on the Θ(km) time we have seen
    for k-mismatch and k-difference?
  • A: On average, yes. (Are we quibbling?)
  • We adopt a fast expected-time algorithm, < Θ(km)
  • ⇒ the worst case may not be better than Θ(km)

24
Exclusion Methods
  • Partition idea: exclude much of T from the search
  • Preliminaries
  • Let α = |Σ|, where Σ is the alphabet used in P
    and T.
  • Let n = |P| and m = |T|.
  • Defn. an approximate occurrence of P is an
    occurrence with at most k mismatches or
    differences.
  • General Partition algorithm: three phases
  • Partition phase
  • Search Phase
  • Check Phase

25
Exclusion Methods
  • Partition phase
  • Partition either T or P into r-length regions
    (depends on particular algorithm)
  • Search Phase
  • Use exact matching to search T for r-length
    intervals
  • These are potential targets for approximate
    occurrences of P.
  • Eliminate as many intervals as possible.
  • Check Phase
  • Use approximate matching to check for an
    approximate occurrence of P around each interval
    surviving from the search phase.

26
BYP Method
  • BYP method has O(m) expected running time.
  • Partition P into r-length regions, r = ⌊n/(k+1)⌋
  • Q: How many r-length regions of P are there?
  • A: k+1; there may be an additional short region.
  • Suppose there is a match of P and T with at most k
    differences.
  • Q: What can we deduce about the corresponding
    r-length regions?
  • A: There must be at least one r-length interval
    that exactly matches.
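The partition step and the pigeonhole argument can be checked on a small case. The helper name `partition_regions` and the example strings are assumptions for illustration only.

```python
def partition_regions(P, k):
    """Split P into consecutive regions of length r = floor(n / (k + 1)).
    Returns the first k + 1 regions as (0-based start, text) pairs; a
    shorter leftover region at the end of P is ignored."""
    r = len(P) // (k + 1)
    if r == 0:
        raise ValueError("pattern too short for k differences")
    return [(s, P[s:s + r]) for s in range(0, r * (k + 1), r)]


regions = partition_regions("abcdefghij", 2)
print(regions)  # [(0, 'abc'), (3, 'def'), (6, 'ghi')]

# An occurrence with 2 differences can damage at most 2 of the 3 regions,
# so at least one region still appears exactly.
damaged = "abcXefghYj"
print([text for _, text in regions if text in damaged])  # ['abc']
```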

27
BYP Method
  • BYP Algorithm
  • Let 𝒫 be the set of the first k+1 substrings of
    P's partitioning.
  • Build a keyword tree for the set of patterns 𝒫.
  • Use Aho-Corasick to find I, the set of starting
    locations in T where a pattern in 𝒫 occurs
    exactly.
  • ..
  • Oops! We haven't talked about keyword trees or
    Aho-Corasick. Sooooo let's do that now.

28
Keyword Trees (section 3.4)
  • Defn. The keyword tree for set P is a rooted
    directed tree K satisfying
  • Each edge is labeled with one character
  • Any two edges out of the same node have distinct
    labels.
  • Every pattern Pi in P maps to some node v of K
    s.t. the path from the root to v spells out Pi
  • Every leaf in K is mapped by some pattern in P.

29
Keyword Trees
  • Example from textbook: P = {potato, poetry,
    pottery, science, school}

30
Keyword Trees (section 3.4)
  • Observation there is an isomorphic mapping
    between distinct prefixes of patterns in P and
    nodes in K.
  • Every node corresponds to a prefix of a pattern
    in P.
  • Conversely, every prefix of a pattern maps to a
    node in K.

31
Keyword Trees (section 3.4)
  • If n is the total length of all patterns in P,
    then we can construct K in O(n), assuming a fixed
    Σ.
  • Let Ki denote the partial keyword tree that
    encodes patterns P1, ..., Pi of P.

32
Keyword Trees (section 3.4)
  • Consider partial keyword tree K1
  • composed of a single path of |P1| edges out of
    root r.
  • Each edge is labeled with one character of P1
  • Reading from the root to the leaf spells out P1
  • The leaf is labeled 1

33
Keyword Trees (section 3.4)
  • Creating K2 from K1
  • Find the longest path from the root of K1 that
    matches a prefix of P2.
  • This path ends by
  • (2a) Either exhausting the characters of P2 or
  • (2b) Ending at some existing node v in K1 where no
    extending match is possible.
  • In case 2a) label the node where the path ends 2.
  • In case 2b) create a new path out of v, labeled
    by the remaining characters of P2.
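The incremental construction (K1, then K2, and so on) can be sketched with nested dictionaries. The node layout (`"edges"` and `"pattern"` keys) and the function name `build_keyword_tree` are assumptions for illustration; the slides do not prescribe a representation.

```python
def build_keyword_tree(patterns):
    """Each node is a dict: 'edges' maps one character to a child node and
    'pattern' holds the 1-based pattern number for numbered nodes."""
    root = {"edges": {}, "pattern": None}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            # Follow the existing edge if there is one; otherwise branch
            # off with a new node (case 2b).
            node = node["edges"].setdefault(ch, {"edges": {}, "pattern": None})
        node["pattern"] = number  # characters exhausted: number this node
    return root


tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
print(sorted(tree["edges"]))  # ['p', 's']
```

Each character of each pattern is handled once, which matches the O(n) construction bound when dictionary lookups are treated as constant time.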

34
Keyword Trees (section 3.4)
  • Example: P1 is potato
  • P2 is pot (case 2a)
  • P2 is potty (case 2b)

35
Keyword Trees (section 3.4)
  • Use of keyword trees for matching
  • Finding occurrences of patterns in P that occur
    starting at position l in T
  • Starting at the root r in K, follow the unique
    path that matches a substring of T that starts at
    l.
  • Numbered nodes along this path indicate matched
    patterns in P that start at position l.
  • This takes time proportional to min(n, m)
  • Traversing K for each position l in T gives O(nm)
  • This can be improved!
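A sketch of this naive search, restarting at the root for every position l, might look like the following. The pattern set and the names `build_keyword_tree` and `naive_search` are hypothetical; only the text "hotpotattach" is borrowed from a later slide.

```python
def build_keyword_tree(patterns):
    """Keyword tree as nested dicts (same layout assumption as before)."""
    root = {"edges": {}, "pattern": None}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            node = node["edges"].setdefault(ch, {"edges": {}, "pattern": None})
        node["pattern"] = number
    return root


def naive_search(patterns, T):
    """Return (l, i) pairs: pattern i occurs in T starting at 1-based l."""
    root = build_keyword_tree(patterns)
    hits = []
    for start in range(len(T)):          # every position l in T
        node, c = root, start
        # Follow the unique path matching T from this start position.
        while c < len(T) and T[c] in node["edges"]:
            node = node["edges"][T[c]]
            c += 1
            if node["pattern"] is not None:
                hits.append((start + 1, node["pattern"]))
    return hits


print(naive_search(["pot", "tat", "attach"], "hotpotattach"))
# [(4, 1), (6, 2), (7, 3)]
```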

36
Keyword Tree Speedup
  • Observation: Our naïve keyword tree is like the
    naïve approach to string comparison.
  • Every time we increment l, we start all over at
    the root of K ⇒ O(nm)
  • Recall KMP avoided O(nm) by shifting to get a
    speedup.
  • Q: Is there an analogous operation we can perform
    in K?
  • A: Of course, why else would I ask a rhetorical
    question?

37
Keyword Tree Speedup
  • First, we assume Pi ⊄ Pj for all combinations
    Pi, Pj in P (no pattern is a proper substring of
    another).
  • Next, each node v in K is labeled with the string
    formed by concatenating the letters from the root
    to v.
  • Defn. Let L(v) denote the label of node v.
  • Defn. Let lp(v) denote the length of the longest
    proper suffix of string L(v) that is a prefix of
    some pattern in P.

38
Keyword Tree Speedup
  • Example: L(v) = potat, lp(v) = 2; the suffix "at"
    is a prefix of P4.

39
Keyword Tree Speedup
  • Note: if α is the lp(v)-length suffix of L(v),
    then there is a unique node labeled α.
  • Example: "at" is the lp(v)-length suffix of L(v);
    w is the unique node labeled "at".

40
Keyword Tree Speedup
  • Defn: For node v of K let nv be the unique node
    in K labeled with the suffix of L(v) of length
    lp(v). When lp(v) = 0 then nv is the root of K.
  • Defn: The ordered pair (v, nv) is called a failure
    link.
  • Example

41
Aho-Corasick (section 3.4.6)
  • Algorithm AC search
  • l = 1
  • c = 1
  • w = root of K
  • Repeat
  • While there is an edge (w, w') labeled
    character T(c)
  • if w' is numbered by pattern i then
  • report that Pi occurs in T starting
    at position l
  • w = w' and c = c + 1
  • w = nw and l = c - lp(w)
  • Until c > m
  • Note: if the root fails to match, increment c and
    run the repeat loop again.

42
Aho-Corasick
  • Example: T = hotpotattach

When l = 4 there is a match of potat, but the next
position fails. At this point c = 9. The failure
link points to the node labeled "at" and lp(v) = 2.
⇒ l = c - lp(v) = 9 - 2 = 7

43
Computing nv in Linear Time
  • Note: if v is the root r or 1 character away from
    r, then nv = r.
  • Imagine nv has been computed for every node
    that is k or fewer edges from r.
  • How can we compute nv for v, a node k+1 edges
    from r?

44
Computing nv in Linear Time
  • We are looking for nv and L(nv).
  • Let v' be the parent of v in K and x the
    character on the edge connecting them.
  • nv' is known since v' is k edges from r.
  • Clearly, L(nv) must be a suffix of L(nv')
    followed by x.
  • First check if there is an edge (nv', w') with
    label x.
  • If so, then nv = w'.
  • Otherwise L(nv) is a proper suffix of L(nv')
    followed by x.
  • Examine the node reached by the failure link of
    nv' for an outgoing edge labeled x.
  • If no joy, keep repeating, finally setting nv =
    r, if we run out of edges.
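The failure-link computation and the AC search loop can be sketched together. This is an illustrative sketch under assumptions, not the textbook's code: nodes are integer ids stored in parallel lists, the names `build_automaton` and `ac_search` are hypothetical, and the pattern set {potato, attach} is invented so that the earlier "hotpotattach" walk-through can be reproduced. As on the slides, it assumes no pattern is a proper substring of another.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links. Node 0 is the root r."""
    edges, number, depth = [{}], [None], [0]
    for idx, pat in enumerate(patterns, start=1):
        v = 0
        for ch in pat:
            if ch not in edges[v]:
                edges.append({})
                number.append(None)
                depth.append(depth[v] + 1)
                edges[v][ch] = len(edges) - 1
            v = edges[v][ch]
        number[v] = idx
    fail = [0] * len(edges)          # root and its children fail to the root
    queue = deque(edges[0].values())
    while queue:                     # breadth-first: parents before children
        parent = queue.popleft()
        for x, v in edges[parent].items():
            w = fail[parent]         # start from the parent's failure link
            while w != 0 and x not in edges[w]:
                w = fail[w]          # no edge labeled x: follow the next link
            fail[v] = edges[w].get(x, 0)
            queue.append(v)
    return edges, number, depth, fail


def ac_search(patterns, T):
    """Return (l, i) pairs: pattern i occurs in T starting at 1-based l."""
    edges, number, depth, fail = build_automaton(patterns)
    hits, w, c = [], 0, 0            # c is a 0-based index into T
    while c < len(T):
        if T[c] in edges[w]:
            w = edges[w][T[c]]
            c += 1
            if number[w] is not None:
                hits.append((c - depth[w] + 1, number[w]))
        elif w == 0:
            c += 1                   # the root fails to match: skip T(c)
        else:
            w = fail[w]              # l implicitly becomes c - lp(w)
    return hits


print(ac_search(["potato", "attach"], "hotpotattach"))  # [(7, 2)]
```

In this run the search matches "potat" from position 4, fails on the next character, follows the failure link to the node labeled "at" (depth 2), and continues from there to find "attach" starting at position 7. Here lp(v) is recovered as the depth of the node the failure link points to.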

45
BYP Method
  • BYP method has O(m) expected running time.
  • Partition P into r-length regions, r = ⌊n/(k+1)⌋
  • Q: How many r-length regions of P are there?
  • A: k+1; there may be an additional short region.
  • Suppose there is a match of P and T with at most k
    differences.
  • Q: What can we deduce about the corresponding
    r-length regions?
  • A: There must be at least one r-length interval
    that exactly matches.

46
BYP Method
  • BYP Algorithm
  • Let 𝒫 be the set of the first k+1 substrings of
    P's partitioning.
  • Build a keyword tree for the set of patterns 𝒫.
  • Use Aho-Corasick to find I, the set of starting
    locations in T where a pattern in 𝒫 occurs
    exactly.
  • For each i ∈ I use approximate matching to locate
    end points of approximate occurrences of P in
    T[i-n-k..i+n+k]
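The three phases can be put together in a short sketch. Two simplifications are assumptions of this example rather than part of the BYP method: the search phase uses Python's `str.find` in place of Aho-Corasick, and the check phase uses the plain O(nm) dynamic program in each window. The names `approx_ends` and `byp_search` are hypothetical.

```python
def approx_ends(P, T, k):
    """1-based end positions in T of occurrences of P with <= k differences
    (plain dynamic program with row 0 all zeros)."""
    m = len(T)
    prev = [0] * (m + 1)
    for i in range(1, len(P) + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j - 1] + (P[i - 1] != T[j - 1]),
                         prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] <= k]


def byp_search(P, T, k):
    n, m = len(P), len(T)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("pattern too short for k differences")
    # Partition phase: the first k + 1 regions of length r.
    regions = {P[s:s + r] for s in range(0, r * (k + 1), r)}
    # Search phase: 0-based starts of exact region occurrences in T
    # (str.find is a stand-in for Aho-Corasick here).
    starts = set()
    for region in regions:
        pos = T.find(region)
        while pos != -1:
            starts.add(pos)
            pos = T.find(region, pos + 1)
    # Check phase: approximate matching only in a window around each hit.
    ends = set()
    for i in starts:
        lo, hi = max(0, i - n - k), min(m, i + n + k)
        ends.update(lo + e for e in approx_ends(P, T[lo:hi], k))
    return sorted(ends)


print(byp_search("abc", "xxabcxx", 1))  # [4, 5, 6]
```

On text where region hits are rare, most of T is never handed to the check phase, which is where the expected-time saving comes from.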