Bioinformatics Algorithms and Data Structures

1
Bioinformatics Algorithms and Data Structures
  • Chapter 12.3: Exclusion Methods
  • Lecturer: Dr. Rose
  • Slides by Dr. Rose
  • February 20, 2003

2
Overview
  • Exclusion methods: fast expected time O(m)
  • Partition approaches
  • BYP algorithm
  • Aho-Corasick exact matching algorithm
  • Keyword trees
  • Back to Aho-Corasick exact matching algorithm
  • Algorithm for computing failure links
  • Back to BYP algorithm

3
Exclusion Methods
  • Q: Can we improve on the Θ(km) time we have seen
    for k-mismatch and k-difference?
  • A: On average, yes. (Are we quibbling?)
  • We adopt an algorithm whose expected time is < Θ(km)
  • ⇒ the worst case may not be better than Θ(km)

4
Exclusion Methods
  • Partition idea: exclude much of T from the search
  • Preliminaries
  • Let a ∈ Σ, where Σ is the alphabet used in P
    and T.
  • Let n = |P| and m = |T|.
  • Defn. An approximate occurrence of P is an
    occurrence with at most k mismatches or
    differences.
  • General partition algorithm: three phases
  • Partition phase
  • Search phase
  • Check phase

5
Exclusion Methods
  • Partition phase
  • Partition either T or P into r-length regions
    (depends on the particular algorithm)
  • Search phase
  • Use exact matching to search T for r-length
    intervals.
  • These are potential targets for approximate
    occurrences of P.
  • Eliminate as many intervals as possible.
  • Check phase
  • Use approximate matching to check for an
    approximate occurrence of P around each surviving
    interval from the search phase. (A sketch of this
    three-phase structure follows below.)
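
A minimal Python sketch of this three-phase structure (the helper callables
exact_search and approx_check are hypothetical placeholders, not from the
text; like BYP below, this version partitions P rather than T):

    def partition_search_check(P, T, k, exact_search, approx_check):
        # Partition phase: split P into k+1 regions of length r = |P| // (k+1).
        r = len(P) // (k + 1)
        regions = [P[i*r:(i+1)*r] for i in range(k + 1)]

        # Search phase: exact occurrences of any region give candidate positions.
        candidates = exact_search(regions, T)      # e.g. a set of start positions

        # Check phase: run an approximate matcher only around the survivors.
        hits = []
        for i in candidates:
            lo, hi = max(0, i - len(P) - k), min(len(T), i + len(P) + k)
            hits.extend(approx_check(P, T[lo:hi], k))
        return hits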

6
BYP Method
  • BYP method has O(m) expected running time.
  • Partition P into r-length regions, r = ⌊n/(k+1)⌋
  • Q: How many r-length regions of P are there?
  • A: k+1; there may be an additional short region.
  • Suppose there is a match of P in T with at most k
    differences.
  • Q: What can we deduce about the corresponding
    r-length regions?
  • A: There must be at least one r-length interval
    that matches exactly. (A small example follows
    below.)
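
For concreteness (numbers chosen here purely for illustration): with n = 20
and k = 3, r = ⌊20/4⌋ = 5, so P is cut into four length-5 regions; an
alignment with at most 3 differences can corrupt at most 3 of them, so at
least one region must appear in T exactly.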

7
BYP Method
  • BYP Algorithm
  • Let P be the set of the first k+1 substrings of
    P's partitioning.
  • Build a keyword tree for the set of patterns P.
  • Use Aho-Corasick to find I, the set of starting
    locations in T where a pattern in P occurs
    exactly.
  • ...
  • Oops! We haven't talked about keyword trees or
    Aho-Corasick. Sooooo let's do that now.

8
Keyword Trees (section 3.4)
  • Defn. The keyword tree for set P is a rooted
    directed tree K satisfying:
  • Each edge is labeled with one character.
  • Any two edges out of the same node have distinct
    labels.
  • Every pattern Pi in P maps to some node v of K
    s.t. the path from the root to v spells out Pi.
  • Every leaf in K is mapped to by some pattern in P.

9
Keyword Trees
  • Example from the textbook: P = {potato, poetry,
    pottery, science, school}

10
Keyword Trees (section 3.4)
  • Observation: there is an isomorphic mapping
    between distinct prefixes of patterns in P and
    nodes in K.
  • Every node corresponds to a prefix of a pattern
    in P.
  • Conversely, every prefix of a pattern maps to a
    node in K.

11
Keyword Trees (section 3.4)
  • If n is the total length of all patterns in P,
    then we can construct K in O(n) time, assuming a
    fixed alphabet Σ.
  • Let Ki denote the partial keyword tree that
    encodes patterns P1, ..., Pi of P.

12
Keyword Trees (section 3.4)
  • Consider the partial keyword tree K1:
  • comprised of a single path of |P1| edges out of
    the root r.
  • Each edge is labeled with one character of P1.
  • Reading from the root to the leaf spells out P1.
  • The leaf is labeled 1.

13
Keyword Trees (section 3.4)
  • Creating K2 from K1:
  • Find the longest path from the root of K1 that
    matches a prefix of P2.
  • This path ends by either:
  • (a) exhausting the characters of P2, or
  • (b) ending at some existing node v in K1 where no
    extending match is possible.
  • In case (a), label the node where the path ends 2.
  • In case (b), create a new path out of v, labeled
    by the remaining characters of P2, and label its
    leaf 2. (A construction sketch follows below.)
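
A minimal Python sketch of this construction (class and function names are
mine, not from the text); each pattern is inserted by walking the longest
matching path from the root and then branching with its leftover characters:

    class Node:
        def __init__(self):
            self.children = {}     # character -> child Node
            self.pattern = None    # i if this node spells out pattern Pi

    def build_keyword_tree(patterns):
        root = Node()
        for i, p in enumerate(patterns, start=1):
            v = root
            for ch in p:                       # follow / extend the path for p
                v = v.children.setdefault(ch, Node())
            v.pattern = i                      # the node spelling p is numbered i
        return root

    K = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])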

14
Keyword Trees (section 3.4)
  • Example: P1 is potato
  • Case (a): P2 is pot
  • Case (b): P2 is potty

15
Keyword Trees (section 3.4)
  • Use of keyword trees for matching
  • Finding occurrences of patterns in P that occur
    starting at position l in T
  • Starting at the root r in K, follow the unique
    path that matches a substring of T that starts at
    l.
  • Numbered nodes along this path indicate matched
    patterns in P that start at position l.
  • This takes time proportional to min(n, m)
  • Traversing K for each position l in T gives O(nm)
  • This can be improved!
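
A sketch of this naive search (building on the Node/build_keyword_tree sketch
above): for every start position l we walk from the root until the tree or
the text runs out, reporting numbered nodes along the way.

    def naive_tree_search(root, T):
        hits = []                              # (pattern number, 1-based start l)
        for l in range(len(T)):
            v, c = root, l
            while c < len(T) and T[c] in v.children:
                v = v.children[T[c]]
                c += 1
                if v.pattern is not None:
                    hits.append((v.pattern, l + 1))
        return hits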

16
Keyword Tree Speedup
  • Observation: our naïve keyword-tree search is like
    the naïve approach to string comparison.
  • Every time we increment l, we start all over at
    the root of K ⇒ O(nm).
  • Recall that KMP avoided O(nm) by shifting the
    pattern to get a speedup.
  • Q: Is there an analogous operation we can perform
    in K?
  • A: Of course; why else would I ask a rhetorical
    question?

17
Keyword Tree Speedup
  • First, we assume Pi is not a substring of Pj for
    all pairs Pi, Pj in P.
  • (This assumption will be relaxed later in the
    full Aho-Corasick algorithm.)
  • Next, each node v in K is labeled with the string
    formed by concatenating the letters on the path
    from the root to v.
  • Defn. Let L(v) denote the label of node v.
  • Defn. Let lp(v) denote the length of the longest
    proper suffix of string L(v) that is a prefix of
    some pattern in P.

18
Keyword Tree Speedup
  • Example: L(v) = potat, lp(v) = 2; the suffix at
    is a prefix of P4.

19
Keyword Tree Speedup
  • Note: if a is the lp(v)-length suffix of L(v),
    then there is a unique node labeled a.
  • Example: at is the lp(v)-length suffix of L(v),
    and w is the unique node labeled at.

20
Keyword Tree Speedup
  • Defn. For node v of K, let nv be the unique node
    in K labeled with the suffix of L(v) of length
    lp(v). When lp(v) = 0, nv is the root of K.
  • Defn. The ordered pair (v, nv) is called a failure
    link.
  • Example:

21
Aho-Corasick (section 3.4.6)
  • Algorithm AC search
  • (We assume Pi is not a substring of Pj for all
    pairs Pi, Pj in P.)
  • l = 1
  • c = 1
  • w = root of K
  • Repeat
  • While there is an edge (w, w') labeled with
    character T(c):
  • if w' is numbered by pattern i then
  • report that Pi occurs in T starting
    at position l
  • w = w' and c = c + 1
  • w = nw and l = c - lp(w)   (lp of the old w)
  • Until c > m
  • Note: if the root fails to match, increment c and
    run the repeat loop again.
  • (A Python rendering follows below.)
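
A Python rendering of this loop (a sketch, not the book's code). It assumes
each node also carries .fail (the failure link nv) and .depth (= |L(v)|),
computed as on the later slides; note that lp of the old w equals the depth
of its failure link, so l = c - depth(nw).

    def ac_search(root, T):
        hits = []                     # (pattern number, 1-based start position l)
        w, l, c = root, 1, 1
        m = len(T)
        while c <= m:
            while c <= m and T[c - 1] in w.children:
                w = w.children[T[c - 1]]
                if w.pattern is not None:
                    hits.append((w.pattern, l))
                c += 1
            if w is root:
                c += 1                # the root failed to match: just advance c
                l = c
            else:
                w = w.fail            # w = nw
                l = c - w.depth       # l = c - lp(old w)
        return hits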

22
Aho-Corasick
  • Example: T = hotpotattach

When l = 4 there is a match of pot, but the next
position fails. At this point c = 9. The failure
link points to the node labeled at and lp(v) = 2.
⇒ l = c - lp(v) = 9 - 2 = 7
23
Computing nv in Linear Time
  • Note: if v is the root r or 1 character away from
    r, then nv = r.
  • Imagine nv has been computed for every node
    that is exactly k or fewer edges from r.
  • How can we compute nv for v, a node k+1 edges
    from r? (We also want L(nv).)

24
Computing nv in Linear Time
  • We are looking for nv and L(nv).
  • Let v' be the parent of v in K and x the
    character on the edge connecting them.
  • nv' is known since v' is k edges from r.
  • Clearly, L(nv) must be a suffix of L(nv')
    followed by x.
  • First check if there is an edge (nv', w') with
    label x.
  • If so, then nv = w'.
  • Otherwise, L(nv) is a proper suffix of L(nv')
    followed by x.
  • Examine nnv' (the failure link of nv') for an
    outgoing edge labeled x.
  • If no joy, keep repeating, finally setting nv =
    r if we run out of failure links.

25
Computing nv in Linear Time
  • Algorithm nv
  • /* Initialization */
  • v' = parent(v) in K and
  • x = the character on the edge (v', v)
  • w = nv'
  • /* Computation */
  • While ((∄ edge labeled x out of node w) and (w ≠
    r)) w = nw
  • if (∃ edge (w, w') labeled x) then nv = w'
  • else nv = r
  • (A breadth-first Python version follows below.)
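
A breadth-first Python version of this computation (a sketch building on the
Node class above; .fail plays the role of nv and .depth is |L(v)|). Nodes are
processed level by level, so the parent's failure link is always known first.

    from collections import deque

    def compute_failure_links(root):
        root.fail, root.depth = root, 0
        queue = deque()
        for child in root.children.values():     # nodes one edge from r: nv = r
            child.fail, child.depth = root, 1
            queue.append(child)
        while queue:
            vp = queue.popleft()                  # vp plays the role of v'
            for x, v in vp.children.items():
                v.depth = vp.depth + 1
                w = vp.fail                       # start from nv'
                while x not in w.children and w is not root:
                    w = w.fail                    # w = nw
                v.fail = w.children.get(x, root)  # nv = w' if an x-edge exists, else r
                queue.append(v)
        return root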

26
Computing nv in Linear Time
  • Thm. Algorithm nv takes O(n) time when applied to
    all nodes in K, where n is the total length of
    all patterns in K.
  • Q: How can we demonstrate this?
  • Consider pattern P in P, where t is the length of
    P.
  • Since lp(v) ≤ lp(v') + 1 ⇒ lp() increases by at
    most t in total along the path for P.
  • But each assignment (w = nw) decreases lp().
  • If w is assigned k times while computing nv, then
    lp(v) ≤ lp(v') + 1 - k.
  • ⇒ Since lp() is never negative, the total number
    of (w = nw) assignments is bounded by t.

27
Computing nv in Linear Time
  • Thm. (cont.)
  • The total time is proportional to the loop
  • While ((∄ edge labeled x out of node w) and (w ≠
    r)) w = nw
  • Since the total number of loop iterations is
    bounded by t, the length of P,
  • all failure links on the path for P are set in
    O(t) time.
  • We can apply the same logic to the other patterns
    in P to yield a linear-time computation of all
    failure links.

28
Aho-Corasick (section 3.4.6)
  • Relaxing the substring assumption (that Pi is not
    a substring of Pj for any pair Pi, Pj in P).
  • Consider P = {acatt, ca}, T = acatg
  • T is matched along K until the letter g is
    reached.
  • For the current node v, L(v) = acat.
  • There is no edge labeled g out of v.
  • No proper suffix of acat is a prefix of acatt or
    ca.
  • Therefore nv is the root.
  • Return to the root and set l = 5.
  • At this point the algorithm will fail to match g.
  • It does not find the occurrence of ca in T.
  • Q: What went wrong?

29
Aho-Corasick (section 3.4.6)
  • A: The problem is that the algorithm increases
    l (shifts) to match the longest suffix of L(v)
    with a prefix of some P in P.
  • A pattern embedded in the matched part of T is
    not necessarily a suffix of that matched part, so
    the shift can skip over it.
  • Soln: report fully embedded patterns as they
    appear.

30
Aho-Corasick (section 3.4.6)
  • Q: How do we find the fully embedded patterns as
    they appear?
  • Lemmas 3.4.2 & 3.4.3
  • Lemma 3.4.2: Pattern Pi must occur in T ending at
    position c if node v is reached and there is a
    directed path of failure links in K from node v
    to a node numbered with pattern i.
  • Lemma 3.4.3: If node v is reached, then pattern Pi
    occurs in T ending at position c only if v is
    numbered i or there is a directed path of failure
    links from v to a node numbered i.

31
Aho-Corasick (section 3.4.6)
  • Algorithm full AC search
  • (No assumption that Pi is not a substring of Pj
    for pairs Pi, Pj in P.)
  • (Report embedded patterns as they appear.)
  • l = 1
  • c = 1
  • w = root of K
  • Repeat
  • While there is an edge (w, w') labeled with
    character T(c):
  • if w' is numbered by pattern i
  • or there is a directed path of failure links
    from w' to a node numbered with i then
  • report that Pi occurs in T ending at
    position c
  • w = w' and c = c + 1
  • w = nw and l = c - lp(w)
  • Until c > m.

32
Aho-Corasick (section 3.4.6)
  • Q: How do we recognize directed failure-link
    paths to pattern-numbered nodes?
  • Idea: associate with each node its
    pattern-numbered node, if there is one.
  • Q: Where should this be done?
  • Algorithm nv must be extended.
  • Caveat: the time cannot be more than linear!

33
Aho-Corasick (section 3.4.6)
  • Idea: associate with each node its
    pattern-numbered node, if there is one.
  • Mechanism: create an output link from the node to
    its pattern-numbered node.
  • How:
  • Compute the failure link nv for node v.
  • If nv is a pattern-numbered node, set v's output
    link to nv.
  • Otherwise, if nv has an output link to w, set v's
    output link to w.
  • Otherwise, v has no output link.
  • This takes O(n) time. (A sketch follows below.)
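
A sketch of this mechanism in Python: call set_output_link(v) right after
v.fail is assigned in the breadth-first pass above, so nv's own output link
is already known (root.out should be initialised to None first).

    def set_output_link(v):
        nv = v.fail
        if nv.pattern is not None:      # nv is a pattern-numbered node
            v.out = nv
        elif nv.out is not None:        # inherit nv's output link, if any
            v.out = nv.out
        else:
            v.out = None                # v has no output link

    # Reporting in the full AC search then follows output links from each node:
    #     u = w if w.pattern is not None else w.out
    #     while u is not None:
    #         report that pattern u.pattern ends at position c
    #         u = u.out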

34
Aho-Corasick (section 3.4.6)
  • Analysis
  • Preprocessing of the patterns in P can be done in
    O(n) time.
  • All occurrences in T of patterns P ∈ P can be
    found in time O(m + k).
  • m is the length of T
  • k is the number of occurrences of patterns P ∈ P.
    Here we are counting the time to output each
    occurrence.
  • Overall, we get O(n + m + k) time.

35
BYP Method
  • BYP method has O(m) expected running time.
  • Partition P into r-length regions, r = ⌊n/(k+1)⌋
  • Q: How many r-length regions of P are there?
  • A: k+1; there may be an additional short region.
  • Lemma 12.3.1: Suppose there is a match of P in T
    with at most k differences.
  • Q: What can we deduce about the corresponding
    r-length regions?
  • A: There must be at least one r-length interval
    that matches exactly.

36
BYP Method
  • BYP Algorithm
  • a. Let P be the set of the first k+1 substrings of
    P's partitioning.
  • b. Build a keyword tree for the set of patterns P.
  • c. Use Aho-Corasick to find I, the set of starting
    locations in T where a pattern in P occurs
    exactly.
  • d. For each i ∈ I, use approximate matching to
    locate the end points of approximate occurrences
    of P in T[i-n-k .. i+n+k].
    (A sketch assembled from the earlier pieces
    follows below.)
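
A sketch of the whole pipeline, assembled from the earlier sketches
(build_keyword_tree, compute_failure_links, ac_search); the check in step d
is a plain O(n^2) semi-global alignment, as discussed on the later slides.

    def k_difference_ends(P, W, k):
        # DP with a free start in W: prev[j] = best edit distance of P against
        # a substring of W ending at position j (1-based); report ends with <= k.
        prev = [0] * (len(W) + 1)
        for i in range(1, len(P) + 1):
            cur = [i] + [0] * len(W)
            for j in range(1, len(W) + 1):
                cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                             prev[j - 1] + (P[i - 1] != W[j - 1]))
            prev = cur
        return [j for j in range(1, len(W) + 1) if prev[j] <= k]

    def byp(P, T, k):
        n, r = len(P), len(P) // (k + 1)
        pieces = [P[i*r:(i+1)*r] for i in range(k + 1)]            # step a
        root = compute_failure_links(build_keyword_tree(pieces))   # step b
        ends = set()
        for _, i in ac_search(root, T):                            # step c: i = start in T
            lo, hi = max(0, i - 1 - (n + k)), min(len(T), i + n + k)
            for e in k_difference_ends(P, T[lo:hi], k):            # step d
                ends.add(lo + e)                                   # 1-based end in T
        return sorted(ends)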

37
BYP Method
  • Q: Why do we choose the range i-n-k .. i+n+k in T,
    i.e., T[i-n-k .. i+n+k]?
  • What is n?
  • What is k?
  • Why n + k on each side of i?

38
BYP Method
  • BYP considers all possible places for an
    occurrence of P in T. (Lemma 12.3.1)
  • Step b: building the keyword tree takes O(n).
  • Step c: Aho-Corasick takes O(m) (since O(m + k) is
    O(m)).
  • Note: there are other approaches for steps b & c
    (pg. 272).
  • Step d takes time O(n^2) for each index in I.
  • Recall that we are interested in an expected
    running time of O(m). The worst case may be
    larger.

39
BYP Expected Time
  • From the previous slide, steps b & c are already
    O(m) worst case. (Why is this true?)
  • Analysis of step d: assume
  • The size of our alphabet is s (s = |Σ|)
  • T has a uniform distribution of characters
  • P can be any arbitrary string
  • All p ∈ P have the same length, r.
  • What is the expected number of occurrences of an
    arbitrary length-r substring p in T if |T| = r?
  • A: 1/s^r (see the next slide for an explanation)

40
BYP Expected Time
  • A: 1/s^r because we assume a uniform distribution
    of characters in T.
  • The probability of any specific character is 1/s.
  • The probability of any specific sequence of k
    characters is 1/s^k.
  • However, |T| ≠ r; in fact |T| >> r.
  • Q: If there are m substrings of length r in T,
    what is the expected number of exact occurrences
    of p in T?
  • A: m/s^r
  • Q: What is the expected number of occurrences in
    T of patterns in P?
  • A: m(k+1)/s^r (recall there are k+1 patterns in P;
    a small numeric example follows below)
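
As a small numeric check (values chosen only for illustration): for DNA with
s = 4, n = 40 and k = 4, r = ⌊40/5⌋ = 8, so each piece is expected to occur
m/4^8 = m/65536 times, and the k+1 = 5 pieces together about 5m/65536 times,
i.e. roughly one candidate interval per 13,000 text positions.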

41
BYP Expected Time
  • Q: How long does it take to check for a single
    approximate occurrence of P in T in step d?
  • A: Dynamic programming gives O(n^2) (global
    alignment).
  • The expected checking time in step d is then
    n^2 m(k+1)/s^r.
  • For O(m) expected time, we need to choose k s.t.
    n^2 (k+1)/s^r ≤ c for some constant c, where
    r = ⌊n/(k+1)⌋.

42
BYP Expected Time
  • To simplify, substitute n-1 for k (so k+1 = n) and
    solve for r in n^2 (k+1)/s^r = c.

Then s^r = n^3/c, so r = log_s(n^3) - log_s(c) =
3 log_s(n) - log_s(c). But we are given r = ⌊n/(k+1)⌋.
43
BYP Expected Time
  • So for k = n/(log n), BYP has an expected
    running time of O(m).
  • Q: Where did n/(log n) come from?
  • A: It is a simple expression for k in terms of n
    that satisfies our equation.
  • To see this, substitute n/(log n) for k into the
    constraint from the previous slide.

44
BYP Expected Time
Letting s^r = n^3/c, you get r = log_s(n^3/c),
which simplifies to r = 3 log_s(n) - log_s(c).