Algorithms for Regulatory Motif Discovery - PowerPoint PPT Presentation

About This Presentation
Title:

Algorithms for Regulatory Motif Discovery

Description:

Algorithms for Regulatory Motif Discovery. Xiaohui Xie. University of California, Irvine ... Probabilistic model for motif discovery ... – PowerPoint PPT presentation

Learn more at: https://ics.uci.edu

Transcript and Presenter's Notes

1
Algorithms for Regulatory Motif Discovery
  • Xiaohui Xie
  • University of California, Irvine

2
Positional weight matrix representation
Lambda cI/cro binding sites
Pos   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
A     9  9 94 25  1 71  1  1  1  9 17 32  9 17  1 48  1 71 63
C    17 17  1 25 94  9 86 55  9 40 71  9  1  1  1  1  1  1  9
G     9  1  1  1  1  1  1  9 71 40  9 55 86  9 94 25  1 17 17
T    63 71  1 48  1 17  9 32 17  9  1  1  1 71  1 25 94  9  9
[Figure: sequence logo]
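As a rough illustration of how such a matrix can be used, the sketch below stores the four rows from the slide, normalizes each column into probabilities, and reads off the most frequent base per position. The function names (`column_probabilities`, `consensus`) are hypothetical, and treating the entries as relative weights that can be normalized column by column is an assumption, since the slide does not state their units.

```python
# Matrix from the slide: one row per base, one column per site position.
PWM = {
    "A": [9, 9, 94, 25, 1, 71, 1, 1, 1, 9, 17, 32, 9, 17, 1, 48, 1, 71, 63],
    "C": [17, 17, 1, 25, 94, 9, 86, 55, 9, 40, 71, 9, 1, 1, 1, 1, 1, 1, 9],
    "G": [9, 1, 1, 1, 1, 1, 1, 9, 71, 40, 9, 55, 86, 9, 94, 25, 1, 17, 17],
    "T": [63, 71, 1, 48, 1, 17, 9, 32, 17, 9, 1, 1, 1, 71, 1, 25, 94, 9, 9],
}


def column_probabilities(pwm):
    """Normalize each column so that its four entries sum to 1 (assumed units)."""
    width = len(pwm["A"])
    probs = {base: [0.0] * width for base in pwm}
    for i in range(width):
        total = sum(pwm[base][i] for base in pwm)
        for base in pwm:
            probs[base][i] = pwm[base][i] / total
    return probs


def consensus(pwm):
    """Most frequent base at each position; ties go to the first of A, C, G, T."""
    width = len(pwm["A"])
    return "".join(max("ACGT", key=lambda b: pwm[b][i]) for i in range(width))
```

Position 10 is a tie between C and G in this matrix, so the consensus base there depends only on the tie-breaking rule chosen here.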
3
Probabilistic model for motif discovery
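  • Input: a set of sequences $S_1, S_2, \ldots, S_N$, each of
    length $w$, i.e. $|S_i| = w$ for all $i$.
  • Two models:
  • Motif model $P(y \mid \theta)$
  • Background model $P(y \mid \theta_0)$
  • Hidden variables $z_1, z_2, \ldots, z_N$, where $z_i = 1$ if
    $S_i$ is a motif site and $z_i = 0$ otherwise.
  • $P(S_i \mid z_i = 1) = P(S_i \mid \theta)$ and $P(S_i \mid z_i = 0) = P(S_i \mid \theta_0)$
  • $P(z_i = 1 \mid S_i) = \dfrac{P(S_i \mid z_i = 1)\,P(z_i = 1)}{P(S_i \mid z_i = 1)\,P(z_i = 1) + P(S_i \mid z_i = 0)\,P(z_i = 0)}$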
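The Bayes-rule step above can be sketched numerically as follows. Everything concrete here is an assumption made for illustration: the function names, the toy width-3 motif model, the uniform background, and the prior P(z = 1) = 0.5. The slide does not specify the form of either model, so both are represented as per-position base probabilities.

```python
def site_likelihood(site, probs):
    """P(site | model), where probs[base][i] is the probability of base at position i."""
    p = 1.0
    for i, base in enumerate(site):
        p *= probs[base][i]
    return p


def posterior_motif(site, motif, background, prior=0.5):
    """P(z = 1 | site) by Bayes' rule; `prior` is the assumed P(z = 1)."""
    pm = site_likelihood(site, motif) * prior
    pb = site_likelihood(site, background) * (1.0 - prior)
    return pm / (pm + pb)


# Toy models (hypothetical values, not taken from the slides).
motif = {
    "A": [0.7, 0.1, 0.1],
    "C": [0.1, 0.7, 0.1],
    "G": [0.1, 0.1, 0.7],
    "T": [0.1, 0.1, 0.1],
}
background = {base: [0.25, 0.25, 0.25] for base in "ACGT"}
```

With these toy numbers, a site such as `"ACG"` that agrees with the motif at every position gets a posterior close to 1, while `"TTT"` gets a posterior close to 0.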

4
Probabilistic model for motif discovery
  • Input: a set of sequences $S_1, S_2, \ldots, S_N$, each of
    length $w$, i.e. $|S_i| = w$ for all $i$.
  • Parameter estimation problem:
  • Find the $\theta$ and $\theta_0$ that maximize
    $P(S_1, S_2, \ldots, S_N \mid \theta, \theta_0)$
  • If $z_i = 1$ for all $i$, then $\theta_{ij} = c_{ij}/N$, where $c_{ij}$ is
    the number of times letter $j$ occurs at position $i$ in
    the set of sequences.
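For the special case where every input sequence is a motif site, the count-based estimate can be sketched as below. The function name `estimate_motif` and the example sites in the usage note are hypothetical; the sketch also assumes all sites have the same length and uses no pseudocounts, which the slide does not mention.

```python
def estimate_motif(sites, alphabet="ACGT"):
    """Return theta with theta[j][i] = c_ij / N, the fraction of sites
    that carry letter j at position i (no pseudocounts)."""
    n = len(sites)
    w = len(sites[0])
    counts = {j: [0] * w for j in alphabet}
    for site in sites:
        for i, j in enumerate(site):
            counts[j][i] += 1
    return {j: [c / n for c in counts[j]] for j in alphabet}
```

For example, `estimate_motif(["ACG", "ACT", "AAG", "TCG"])` gives probability 0.75 to A at the first position, since three of the four sites start with A.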

5
String matching
6
Exact String Matching
  • Input: two strings $T[1..n]$ and $P[1..m]$, containing
    symbols from an alphabet $\Sigma$.
  • Example:
  • $\Sigma = \{A, C, G, T\}$
  • $T[1..12] = $ CAGTACATCGAT
  • $P[1..3] = $ AGT
  • Goal: find all shifts $0 \le s \le n-m$ such that
    $T[s+1..s+m] = P$

7
Simple Algorithm
  • for s ← 0 to n-m
  •   Match ← 1
  •   for j ← 1 to m
  •     if T[s+j] ≠ P[j] then
  •       Match ← 0
  •       exit loop
  •   if Match = 1 then output s
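The pseudocode above translates almost line by line into the following sketch. The function name `naive_match` is hypothetical; Python strings are 0-indexed, so the shift `s` reported here corresponds to the occurrence T[s+1..s+m] in the slide's 1-indexed notation.

```python
def naive_match(text, pattern):
    """Return every shift s such that text[s:s+m] equals pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        match = True
        for j in range(m):
            if text[s + j] != pattern[j]:
                match = False
                break  # stop at the first mismatch
        if match:
            shifts.append(s)
    return shifts
```

On the example from the previous slide, `naive_match("CAGTACATCGAT", "AGT")` reports the single shift 1.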

8
Analysis
  • Running time of the simple algorithm:
  • Worst case: $O(nm)$
  • Average case (random text): $O(n)$ in expectation
  • $T_s$: time spent on checking shift $s$
  • (the number of comparisons until the first mismatch)
  • $E[T_s] < 2$ (why?)
  • $E[\sum_s T_s] = \sum_s E[T_s] = O(n)$
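One way to answer the "why" above, under an assumption the slide leaves implicit: suppose the text characters are independent and uniformly distributed over an alphabet of size $\sigma \ge 2$. The $k$-th comparison at shift $s$ happens only if the first $k-1$ comparisons all matched, which has probability $\sigma^{-(k-1)}$. Summing these tail probabilities gives the bound:

```latex
E[T_s] \;=\; \sum_{k=1}^{m} \Pr[T_s \ge k]
       \;=\; \sum_{k=1}^{m} \sigma^{-(k-1)}
       \;<\; \sum_{k=0}^{\infty} \sigma^{-k}
       \;=\; \frac{1}{1 - 1/\sigma}
       \;\le\; 2 .
```

Linearity of expectation over the $n-m+1$ shifts then yields the $O(n)$ total.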

9
Worst-case
  • Is it possible to achieve $O(n)$ for any input?
  • Knuth-Morris-Pratt '77: deterministic
  • Karp-Rabin '81: randomized
  • Digression: what is the difference between
  • an algorithm that is fast on a random input (as seen
    on the previous slide), and
  • a randomized algorithm (as in the rest of this
    lecture)?

10
Karp-Rabin Algorithm
  • Idea: a semi-numerical approach
  • Consider all m-mers:
  • $T[1..m],\ T[2..m+1],\ \ldots,\ T[n-m+1..n]$
  • Map each $T[s+1..s+m]$ into a number $t_s$
  • Map the pattern $P[1..m]$ into a number $p$
  • Report the m-mers that map to the same value as $p$
  • Problem: how to map all m-mers in $O(n)$ time?

11
Implementation
  • Attempt I:
  • Assume $\Sigma = \{0, 1\}$
  • (for $\{A, C, G, T\}$, convert A → 00, C → 01, G → 10, T → 11)
  • Think of each $T[s+1..s+m]$ as a number in binary
    representation, i.e.,
  • $t_s = T[s+1]\,2^{m-1} + T[s+2]\,2^{m-2} + \cdots + T[s+m]\,2^0$
  • Find a fast way of computing $t_{s+1}$ given $t_s$
  • Output all $s$ such that $t_s$ is equal to the number
    $p$ represented by $P$
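A small sketch of the encoding step may help: it expands a DNA string into bits with the 2-bit code from the slide and evaluates the binary number directly from its definition. The names `CODE`, `to_bits`, and `bits_to_number` are hypothetical, and this direct evaluation costs O(m) per m-mer, so it is only the slow baseline that the fast update is meant to replace.

```python
# 2-bit code from the slide: A -> 00, C -> 01, G -> 10, T -> 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def to_bits(seq):
    """Expand a DNA string into a list of 0/1 values, two bits per base."""
    bits = []
    for base in seq:
        bits.append(CODE[base] >> 1)  # high bit
        bits.append(CODE[base] & 1)   # low bit
    return bits


def bits_to_number(bits):
    """Value of the bit list read as a binary number, most significant bit first."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value
```

For instance, `to_bits("AGT")` is `[0, 0, 1, 0, 1, 1]`, which `bits_to_number` maps to 11.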

12
Magic formula
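  • How to transform
  • $t_s = T[s+1]\,2^{m-1} + T[s+2]\,2^{m-2} + \cdots + T[s+m]\,2^0$
  • into
  • $t_{s+1} = T[s+2]\,2^{m-1} + T[s+3]\,2^{m-2} + \cdots + T[s+m+1]\,2^0$ ?
  • Three steps:
  • Subtract $T[s+1]\,2^{m-1}$
  • Multiply by 2 (i.e., shift the bits by one
    position)
  • Add $T[s+m+1]\,2^0$
  • Therefore: $t_{s+1} = (t_s - T[s+1]\,2^{m-1}) \cdot 2 + T[s+m+1]\,2^0$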
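The three steps can be sketched on a list of bits as follows. The function name `all_window_values` is hypothetical, and the code relies on Python's arbitrary-precision integers, so it illustrates the update rule rather than the running time. With 0-indexed lists, `bits[s]` plays the role of T[s+1] and `bits[s + m]` the role of T[s+m+1].

```python
def all_window_values(bits, m):
    """Return [t_0, ..., t_{n-m}] for a list of 0/1 values, using the rolling update."""
    n = len(bits)
    if m > n:
        return []
    t = 0
    for j in range(m):  # t_0 computed directly, most significant bit first
        t = 2 * t + bits[j]
    values = [t]
    high = 2 ** (m - 1)  # weight of the leading bit of a window
    for s in range(n - m):
        # subtract the leading bit, shift by one position, add the new trailing bit
        t = (t - bits[s] * high) * 2 + bits[s + m]
        values.append(t)
    return values
```

For `bits = [1, 0, 1, 1, 0]` and `m = 3`, the windows 101, 011, 110 give the values 5, 3, 6.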

13
Algorithm
  • $t_{s+1} = (t_s - T[s+1]\,2^{m-1}) \cdot 2 + T[s+m+1]\,2^0$
  • We can compute $t_{s+1}$ from $t_s$ using 3 arithmetic
    operations
  • Therefore, we can compute all of $t_0, t_1, \ldots, t_{n-m}$ using
    $O(n)$ arithmetic operations
  • We can compute a number corresponding to $P$ using
    $O(m)$ arithmetic operations
  • Are we done?

14
Problem
  • To get $O(n)$ time, we would need to perform each
    arithmetic operation in $O(1)$ time
  • However, the arguments are $m$ bits long!
  • If $m$ is large, it is unreasonable to assume that
    operations on such big numbers can be done in
    $O(1)$ time
  • We need to reduce the number range to something
    more manageable

15
Hashing
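  • We will instead compute
  • $t_s = \left(T[s+1]\,2^{m-1} + T[s+2]\,2^{m-2} + \cdots + T[s+m]\,2^0\right) \bmod q$, where
    $q$ is an appropriate prime number
  • One can still compute $t_{s+1}$ from $t_s$:
  • $t_{s+1} = \left((t_s - T[s+1]\,2^{m-1}) \cdot 2 + T[s+m+1]\,2^0\right) \bmod q$
  • If $q$ is not large, we can compute all $t_s$ (and
    $p$) in $O(n)$ time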
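The modular version of the update can be sketched as below, for a fixed modulus `q` passed in by the caller. The function name `all_fingerprints` is hypothetical. The power 2^(m-1) is reduced mod q once up front, so every intermediate value stays below roughly 2q; Python's `%` returns a non-negative result for a positive modulus, so no extra correction is needed after the subtraction.

```python
def all_fingerprints(bits, m, q):
    """Return [t_0 mod q, ..., t_{n-m} mod q] for a list of 0/1 values."""
    n = len(bits)
    if m > n:
        return []
    high = pow(2, m - 1, q)  # 2^(m-1) mod q, computed once
    t = 0
    for j in range(m):
        t = (2 * t + bits[j]) % q
    fingerprints = [t]
    for s in range(n - m):
        t = ((t - bits[s] * high) * 2 + bits[s + m]) % q
        fingerprints.append(t)
    return fingerprints
```

For `bits = [1, 0, 1, 1, 0]`, `m = 3`, and `q = 5`, the exact window values 5, 3, 6 reduce to the fingerprints 0, 3, 1.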

16
Problem
  • Unfortunately, we can have false positives, i.e.,
    $T[s+1..s+m] \ne P$ but $t_s \bmod q = p \bmod q$
  • Our approach:
  • Use a random $q$
  • Show that the probability of a false positive is
    small
  • ⇒ randomized algorithm
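Putting the pieces together, a minimal sketch of the randomized matcher might look like this. Several parts are assumptions not found in the slides: the names `is_prime`, `random_prime`, and `karp_rabin`, the bound `10**6` on the prime, the trial-division primality test, and above all the explicit comparison on each fingerprint hit. That comparison removes false positives at extra cost; the slides instead argue that a random `q` makes them unlikely.

```python
import random


def is_prime(k):
    """Trial division; adequate for the small bound assumed here."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True


def random_prime(bound, rng=random):
    """Pick a prime uniformly from [2, bound] by rejection sampling."""
    while True:
        k = rng.randint(2, bound)
        if is_prime(k):
            return k


def karp_rabin(text_bits, pattern_bits, bound=10**6, rng=random):
    """Return all shifts s where pattern_bits occurs in text_bits."""
    text_bits, pattern_bits = list(text_bits), list(pattern_bits)
    n, m = len(text_bits), len(pattern_bits)
    if m == 0 or m > n:
        return []
    q = random_prime(bound, rng)
    high = pow(2, m - 1, q)
    p = t = 0
    for j in range(m):
        p = (2 * p + pattern_bits[j]) % q
        t = (2 * t + text_bits[j]) % q
    shifts = []
    for s in range(n - m + 1):
        # Fingerprint hit, then an explicit check (added here) to rule out a false positive.
        if t == p and text_bits[s:s + m] == pattern_bits:
            shifts.append(s)
        if s < n - m:
            t = ((t - text_bits[s] * high) * 2 + text_bits[s + m]) % q
    return shifts
```

Because of the explicit check, the output does not depend on which prime is drawn; only the amount of wasted verification work does.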

17
Aligning two sequences
  • Longest common substring (LCS) problem (no gaps)
  • Input: two strings $T[1..n]$ and $P[1..m]$, $n \ge m$
  • Goal: the largest $k$ such that
  • $T[i+1..i+k] = P[j+1..j+k]$
  • for some $i, j$
  • How can we solve this problem efficiently?
  • Hint: how can we check whether the LCS has length $k$?
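One way to follow the hint is sketched below, with two assumptions that go beyond the slide. First, the search over `k` is a binary search, which works because a common substring of length k implies one of length k-1. Second, the check for a common k-mer uses a Python set of substrings as a simple stand-in for fingerprints, which costs more time and memory than a rolling hash would. The function names are hypothetical.

```python
def has_common_substring(t, p, k):
    """True if some length-k substring occurs in both t and p."""
    if k == 0:
        return True
    seen = {t[i:i + k] for i in range(len(t) - k + 1)}
    return any(p[j:j + k] in seen for j in range(len(p) - k + 1))


def longest_common_substring_length(t, p):
    """Largest k for which t and p share a substring of length k."""
    lo, hi = 0, min(len(t), len(p))
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if has_common_substring(t, p, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

For example, `"CAGTACATCGAT"` and `"TTACATGG"` share `"TACAT"` and nothing longer, so the function returns 5.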

18
Checking for a common m-mer