Algorithms for Regulatory Motif Discovery


Learn more at: https://ics.uci.edu


1
Algorithms for Regulatory Motif Discovery

Xiaohui Xie
University of California, Irvine

2
Positional weight matrix representation
Lambda cI/cro binding sites
Pos  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
A    9  9 94 25  1 71  1  1  1  9 17 32  9 17  1 48  1 71 63
C   17 17  1 25 94  9 86 55  9 40 71  9  1  1  1  1  1  1  9
G    9  1  1  1  1  1  1  9 71 40  9 55 86  9 94 25  1 17 17
T   63 71  1 48  1 17  9 32 17  9  1  1  1 71  1 25 94  9  9
[Figure: sequence logo]

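
As a rough illustration of how such a matrix can be used, the sketch below stores the four rows from the slide, rescales each column to sum to 1, and reads off a consensus sequence. The names `PWM`, `normalize`, and `consensus` are made up for this example, and the tie-breaking rule (first base in A, C, G, T order) is an assumption, not something the slide specifies.

```python
# Matrix values copied from the slide; rows are bases, columns are positions 1..19.
PWM = {
    "A": [9, 9, 94, 25, 1, 71, 1, 1, 1, 9, 17, 32, 9, 17, 1, 48, 1, 71, 63],
    "C": [17, 17, 1, 25, 94, 9, 86, 55, 9, 40, 71, 9, 1, 1, 1, 1, 1, 1, 9],
    "G": [9, 1, 1, 1, 1, 1, 1, 9, 71, 40, 9, 55, 86, 9, 94, 25, 1, 17, 17],
    "T": [63, 71, 1, 48, 1, 17, 9, 32, 17, 9, 1, 1, 1, 71, 1, 25, 94, 9, 9],
}


def normalize(pwm):
    """Rescale each column so that its four entries sum to 1."""
    width = len(pwm["A"])
    totals = [sum(pwm[b][i] for b in "ACGT") for i in range(width)]
    return {b: [pwm[b][i] / totals[i] for i in range(width)] for b in "ACGT"}


def consensus(pwm):
    """Pick the highest-weight base per column.

    Ties go to the first base in A, C, G, T order (an arbitrary choice).
    """
    width = len(pwm["A"])
    return "".join(max("ACGT", key=lambda b: pwm[b][i]) for i in range(width))
```

Calling `consensus(PWM)` gives `TTATCACCGCCGGTGATAA`; position 10 is a C/G tie in the matrix, so the C there is only an artifact of the tie-breaking rule.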
3
Probabilistic model for motif discovery

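Input: a set of sequences $S_1, S_2, \ldots, S_N$, each of length $w$, i.e. $|S_i| = w$ for all $i$.
Two models:
Motif model: $P(y \mid \theta)$
Background model: $P(y \mid \phi)$
Hidden variables $z_1, z_2, \ldots, z_N$, where $z_i = 1$ if $S_i$ is a motif site and $z_i = 0$ otherwise.
$P(S_i \mid z_i = 1) = P(S_i \mid \theta)$ and $P(S_i \mid z_i = 0) = P(S_i \mid \phi)$
$$P(z_i = 1 \mid S_i) = \frac{P(S_i \mid z_i = 1)\,P(z_i = 1)}{P(S_i \mid z_i = 1)\,P(z_i = 1) + P(S_i \mid z_i = 0)\,P(z_i = 0)}$$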
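
The posterior on this slide is a direct application of Bayes' rule, and a small sketch may make it concrete. The code below assumes that both models factorize over positions (one probability table per position) and that the prior P(z = 1) is supplied by the caller; neither assumption is stated on the slide, and the function names are invented for illustration.

```python
def site_likelihood(site, model):
    """P(site | model), assuming positions are independent.

    `model` is a list with one dict per position, mapping base -> probability.
    """
    prob = 1.0
    for i, base in enumerate(site):
        prob *= model[i][base]
    return prob


def posterior_motif(site, motif, background, prior=0.5):
    """P(z = 1 | site) by Bayes' rule; `prior` is P(z = 1), an assumed input."""
    numerator = site_likelihood(site, motif) * prior
    denominator = numerator + site_likelihood(site, background) * (1.0 - prior)
    return numerator / denominator


# Toy models of width 3 (hypothetical numbers, not taken from the slides).
motif = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}] * 3
background = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}] * 3
```

With these toy tables, a site such as `"AAA"` gets a posterior close to 1 and `"CCC"` a posterior close to 0, while identical motif and background models return the prior unchanged.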

4
Probabilistic model for motif discovery

Input: a set of sequences $S_1, S_2, \ldots, S_N$, each of length $w$, i.e. $|S_i| = w$ for all $i$.
Parameter estimation problem:
Find the $\theta$ and $\phi$ that maximize $P(S_1, S_2, \ldots, S_N \mid \theta, \phi)$.
If $z_i = 1$ for all $i$, then $\theta_{ij} = c_{ij}/N$, where $c_{ij}$ is the number of times letter $j$ occurs at position $i$ in the set of sequences.
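
For the special case where every input sequence is a motif site, the count-based estimate above can be sketched in a few lines. The function name `estimate_motif` and the return layout (one dict per position) are choices made for this example only.

```python
from collections import Counter


def estimate_motif(sites, alphabet="ACGT"):
    """Estimate motif parameters as c_ij / N.

    c_ij is the number of times letter j occurs at position i across the
    N equal-length sites. Returns one dict per position.
    """
    n = len(sites)
    w = len(sites[0])
    if any(len(s) != w for s in sites):
        raise ValueError("all sites must have the same length")
    theta = []
    for i in range(w):
        counts = Counter(s[i] for s in sites)
        theta.append({j: counts[j] / n for j in alphabet})
    return theta
```

For example, `estimate_motif(["ACG", "ACT", "AGG", "TCG"])` assigns probability 0.75 to A at the first position. Letters that never occur at a position get probability 0 under this plain estimate.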

5
String matching
6
Exact String Matching

Input: two strings $T[1..n]$ and $P[1..m]$, containing symbols from an alphabet $\Sigma$.
Example:
$\Sigma = \{A, C, G, T\}$
$T[1..12] = \text{CAGTACATCGAT}$
$P[1..3] = \text{AGT}$
Goal: find all shifts $0 \le s \le n-m$ such that $T[s+1..s+m] = P$.

7
Simple Algorithm

for s ← 0 to n-m
    Match ← 1
    for j ← 1 to m
        if T[s+j] ≠ P[j] then
            Match ← 0
            exit loop
    if Match = 1 then output s
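
The pseudocode above translates almost line for line into Python. The sketch below uses 0-based string indexing, so `text[s + j]` with `j` from 0 to m-1 plays the role of T[s+j] with j from 1 to m; the shift `s` keeps the same meaning. The function name `naive_match` is an illustrative choice.

```python
def naive_match(text, pattern):
    """Return all shifts s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        match = True
        for j in range(m):
            if text[s + j] != pattern[j]:
                match = False
                break  # stop at the first mismatch
        if match:
            shifts.append(s)
    return shifts
```

On the example from the previous slide, `naive_match("CAGTACATCGAT", "AGT")` returns `[1]`, since the pattern matches text positions 2 through 4 in 1-based numbering.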

8
Analysis

Running time of the simple algorithm:
Worst-case: $O(nm)$
Average-case (random text): $O(n)$ (in expectation)
$T_s$ = time spent on checking shift $s$ (the number of comparisons until the first mismatch)
$E[T_s] < 2$ (why?)
$E[\sum_s T_s] = \sum_s E[T_s] = O(n)$
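
One way to answer the "why" is sketched below. It assumes that the text characters are independent and uniformly distributed over an alphabet of size $\sigma \ge 2$; the slide only says "random text", so this model is an assumption. Under it, the $k$-th comparison at shift $s$ takes place only if the first $k-1$ characters matched, which happens with probability $\sigma^{-(k-1)}$.

```latex
\begin{align*}
E[T_s] &= \sum_{k=1}^{m} \Pr[T_s \ge k]
        = \sum_{k=1}^{m} \sigma^{-(k-1)}
        < \frac{1}{1 - 1/\sigma}
        \le 2, \\
E\Big[\sum_{s=0}^{n-m} T_s\Big] &= \sum_{s=0}^{n-m} E[T_s]
        < 2\,(n - m + 1) = O(n).
\end{align*}
```

The second line uses only linearity of expectation, so it does not require the comparisons at different shifts to be independent.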

9
Worst-case

Is it possible to achieve $O(n)$ for any input?
Knuth-Morris-Pratt '77: deterministic
Karp-Rabin '81: randomized
Digression: what is the difference between
an algorithm that is fast on a random input (as seen on the previous slide), and
a randomized algorithm (as in the rest of this lecture)?

10
Karp-Rabin Algorithm

Idea: a semi-numerical approach
Consider all m-mers: $T[1..m], T[2..m+1], \ldots, T[n-m+1..n]$
Map each $T[s+1..s+m]$ into a number $t_s$
Map the pattern $P[1..m]$ into a number $p$
Report the m-mers that map to the same value as $p$
Problem: how to map all m-mers in $O(n)$ time?

11
Implementation

Attempt I:
Assume $\Sigma = \{0, 1\}$
(for $\{A, C, G, T\}$, convert A → 00, C → 01, G → 10, T → 11)
Think of each $T[s+1..s+m]$ as a number in binary representation, i.e.,
$t_s = T[s+1] \cdot 2^{m-1} + T[s+2] \cdot 2^{m-2} + \cdots + T[s+m] \cdot 2^0$
Find a fast way of computing $t_{s+1}$ given $t_s$
Output all $s$ such that $t_s$ is equal to the number $p$ represented by $P$
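
To make the binary encoding concrete, the sketch below converts a DNA string to bits with the 2-bit code from the slide and then evaluates the binary number with Horner's rule. The names `CODE`, `to_bits`, and `encode` are illustrative and not part of the lecture.

```python
# 2-bit code from the slide: A = 00, C = 01, G = 10, T = 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def to_bits(seq):
    """Expand each base into its two bits, high bit first."""
    bits = []
    for base in seq:
        bits.append(CODE[base] >> 1)
        bits.append(CODE[base] & 1)
    return bits


def encode(bits):
    """Value of the bit list read as a binary number.

    Horner's rule computes sum(bits[k] * 2**(m-1-k)) with m multiplications.
    """
    value = 0
    for b in bits:
        value = 2 * value + b
    return value
```

For instance, `to_bits("AGT")` is `[0, 0, 1, 0, 1, 1]` and `encode` of that list is 11, the value of the binary string 001011.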

12
Magic formula

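How to transform
$t_s = T[s+1] \cdot 2^{m-1} + T[s+2] \cdot 2^{m-2} + \cdots + T[s+m] \cdot 2^0$
into
$t_{s+1} = T[s+2] \cdot 2^{m-1} + T[s+3] \cdot 2^{m-2} + \cdots + T[s+m+1] \cdot 2^0$ ?
Three steps:
Subtract $T[s+1] \cdot 2^{m-1}$
Multiply by 2 (i.e., shift the bits by one position)
Add $T[s+m+1] \cdot 2^0$
Therefore: $t_{s+1} = (t_s - T[s+1] \cdot 2^{m-1}) \cdot 2 + T[s+m+1] \cdot 2^0$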
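
A single update step can be checked numerically. In the sketch below, `roll` applies the three steps (subtract, double, add) and `window_value` recomputes the window from scratch for comparison; both names are made up for this example, and the bit list is an arbitrary test input.

```python
def window_value(bits):
    """Value of a bit window read as a binary number, computed from scratch."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def roll(t_s, outgoing, incoming, m):
    """One update: drop the leading bit, shift left, append the new bit.

    `outgoing` is the bit leaving the window, `incoming` the bit entering it.
    """
    return (t_s - outgoing * 2 ** (m - 1)) * 2 + incoming


bits = [1, 0, 1, 1, 0]  # arbitrary example text over {0, 1}
m = 3
t0 = window_value(bits[0:3])                # window 101
t1 = roll(t0, bits[0], bits[3], m)          # should equal window 011
t2 = roll(t1, bits[1], bits[4], m)          # should equal window 110
```

Here `t0`, `t1`, and `t2` come out as 5, 3, and 6, matching the values of the windows 101, 011, and 110 computed directly.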

13
Algorithm

$t_{s+1} = (t_s - T[s+1] \cdot 2^{m-1}) \cdot 2 + T[s+m+1] \cdot 2^0$
We can compute $t_{s+1}$ from $t_s$ using 3 arithmetic operations
Therefore, we can compute all $t_0, t_1, \ldots, t_{n-m}$ using $O(n)$ arithmetic operations
We can compute a number corresponding to $P$ using $O(m)$ arithmetic operations
Are we done?
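
The steps on this slide can be put together as a minimal sketch over a binary alphabet. It compares exact window values, so it relies on Python's arbitrary-precision integers; the count of arithmetic operations is O(n + m), but each operation acts on numbers with up to m bits. The function name `match_by_value` is hypothetical.

```python
def match_by_value(text_bits, pattern_bits):
    """Return all shifts s whose m-bit window has the same value as the pattern."""
    n, m = len(text_bits), len(pattern_bits)
    if m == 0 or m > n:
        return []
    # O(m) operations: values of the pattern and of the first window.
    p = 0
    t = 0
    for k in range(m):
        p = 2 * p + pattern_bits[k]
        t = 2 * t + text_bits[k]
    high = 2 ** (m - 1)  # weight of the leading bit in a window
    shifts = []
    for s in range(n - m + 1):
        if t == p:
            shifts.append(s)
        if s < n - m:
            # Three arithmetic operations per shift: subtract, double, add.
            t = (t - text_bits[s] * high) * 2 + text_bits[s + m]
    return shifts
```

For example, `match_by_value([1, 0, 1, 1, 0, 1], [1, 0, 1])` returns `[0, 3]`. If DNA is first expanded to two bits per base, only even bit shifts correspond to base positions, so odd shifts would need to be filtered out.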

14
Problem

To get $O(n)$ time, we would need to perform each arithmetic operation in $O(1)$ time
However, the arguments are $m$ bits long!
If $m$ is large, it is unreasonable to assume that operations on such big numbers can be done in $O(1)$ time
We need to reduce the number range to something more manageable

15
Hashing