Title: Motif finding problem
1Motif finding problem
- Some important conserved sequences come in short, dispersed blocks called motifs.
- How can we find such motifs? If examples of the sequence exist, this is relatively straightforward, but what if they don't?
- The problem is common in DNA motif finding, e.g. transcription-factor binding sites, splicing regulators.
- We usually cannot just build a multiple alignment and take the best-aligning region, because spacing, order, and information content are too variable.
- We need a method that will scan through all the sequences at all positions, looking for short regions of similarity.
2Short motifs may be anywhere in sequences
We want to find the position of this motif AND
we don't know where it is or what its sequence
is.
3Motif-finding Approaches
- Expectation maximization (EM)
- Gibbs sampling
- Phylogeny-accountable methods (variants of what
we will discuss, but account for how related
target sequences are).
4Sequence Probability
- We need to define the probability of seeing a particular residue sequence of interest in a larger sequence block.
- For this to make sense, we need a specific expectation (a model) for the sequence and a background model for the complete sequence.
- This is similar to amino acid score matrices, but we will define a different scoring method for motif finding, based on entropy.
5A simple example model for a 3-nucleotide sequence
Consider a case in which the model favors A then
T then T
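As a minimal sketch of such a model, the snippet below stores one probability distribution per motif position and multiplies the per-position probabilities. The numbers in `MODEL` and the function name `model_probability` are illustrative assumptions, not values from the lecture.

```python
# Hypothetical position-specific model favoring A, then T, then T.
# Each row gives P(residue) at one motif position and sums to 1.
MODEL = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

def model_probability(seq, model=MODEL):
    """Probability of seq under the model: product over positions."""
    if len(seq) != len(model):
        raise ValueError("sequence length must equal motif width")
    p = 1.0
    for residue, column in zip(seq, model):
        p *= column[residue]
    return p

print(model_probability("ATT"))  # about 0.343
print(model_probability("GCC"))  # about 0.001
```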
6Computer algorithm digression
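Using a direct product of per-position probabilities:
The P-value quickly becomes extremely small. Even
using double precision, it will exceed available
precision for protein sequences longer than about
20 residues. So sum the logarithms of the probabilities instead of multiplying the probabilities themselves.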
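A small sketch of the numerical issue, assuming a per-residue probability of 0.05 (roughly 1/20 for proteins). The function names and the length of 400 are arbitrary choices made to show the effect clearly: the direct product underflows to zero, while the sum of logarithms stays finite.

```python
import math

def product_probability(probs):
    """Direct product; can underflow to 0.0 for long sequences."""
    p = 1.0
    for x in probs:
        p *= x
    return p

def log_probability(probs):
    """Sum of natural logs; stays finite as long as every prob > 0."""
    return sum(math.log(x) for x in probs)

# Assumed per-residue probability, for illustration only.
short = [0.05] * 20
long_ = [0.05] * 400

print(product_probability(short), log_probability(short))
print(product_probability(long_), log_probability(long_))
```

Log-probabilities also turn the product into a sum, so scores for neighboring windows can be compared by simple subtraction.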
7Sequence matches and backgrounds
- If the sequences we are searching have a composition identical to our expected sequence, a background model is not needed.
- But to generalize, we will assume the background might NOT have the same composition.
8Computing background probability
Simplest case: all nucleotides are equally frequent.
General case: make residue counts for the entire sequence set that you are analyzing.
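Both cases can be sketched in a few lines. The function name `background_frequencies` and the toy sequences are made up for illustration.

```python
from collections import Counter

def background_frequencies(sequences, alphabet="ACGT"):
    """Residue frequencies pooled over the whole sequence set."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts[r] for r in alphabet)
    return {r: counts[r] / total for r in alphabet}

# Simplest case: all nucleotides equally frequent.
uniform = {r: 0.25 for r in "ACGT"}

# General case: count residues in the sequences being analyzed (toy data).
seqs = ["AATTAT", "TTAACG"]
print(background_frequencies(seqs))
```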
9Differentiating from background
Now consider a test sequence that is nearly all A and T. This background predicts that ATT is common by chance.
Compare this to one that is nearly all C and G, where ATT is rare by chance.
Clearly we should be much more impressed if we find ATT in the second case.
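To put rough numbers on this comparison, the sketch below computes the chance probability of ATT under two hypothetical backgrounds. The compositions in `AT_RICH` and `GC_RICH` are assumptions chosen for illustration.

```python
def background_probability(seq, bg):
    """Chance probability of seq if residues are drawn independently."""
    p = 1.0
    for residue in seq:
        p *= bg[residue]
    return p

# Hypothetical background compositions.
AT_RICH = {"A": 0.45, "T": 0.45, "C": 0.05, "G": 0.05}
GC_RICH = {"A": 0.05, "T": 0.05, "C": 0.45, "G": 0.45}

print(background_probability("ATT", AT_RICH))  # about 0.091
print(background_probability("ATT", GC_RICH))  # about 0.000125
```

With these assumed compositions, ATT is roughly 700 times less likely to appear by chance in the GC-rich background.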
10Putting them together
We want to know the DEGREE to which our model
favors a particular sequence compared to
background
This can be a very large number (>> 1) if the sequence matches the model much better, or a small number (< 1) if the sequence matches the background better.
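One way to express this degree of preference is the ratio of the model probability to the background probability. The sketch below assumes that form, with a made-up ATT-favoring model and a uniform background. It expects the sequence and the model to have the same length.

```python
def odds_ratio(seq, model, bg):
    """P(seq | model) / P(seq | background), position by position."""
    ratio = 1.0
    for residue, column in zip(seq, model):
        ratio *= column[residue] / bg[residue]
    return ratio

# Hypothetical model favoring A, T, T and a uniform background.
MODEL = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(odds_ratio("ATT", MODEL, UNIFORM))  # about 21.95, model favored
print(odds_ratio("GCC", MODEL, UNIFORM))  # about 0.064, background favored
```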
11Entropy measure
- This method uses a pure entropy measure of residue conservation in each column.
- The less variable the column is, the better it scores (low entropy).
- It takes no account of amino acid similarities (e.g. L, V, I all have very similar side chains).
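A short sketch of a column entropy calculation, assuming Shannon entropy in bits computed from observed residue frequencies. The function name `column_entropy` and the example columns are made up.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in Counter(column).values())

print(column_entropy("AAAAAAAA"))  # 0.0, perfectly conserved
print(column_entropy("AACCGGTT"))  # 2.0, maximally variable for DNA
print(column_entropy("LLLLVVII"))  # 1.5, L/V/I counted as unrelated
```

The last line illustrates the third point: a column of chemically similar residues scores as poorly as any other mixed column.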
12For entropy measures, where do the model qi come
from?
Note: this is a strict entropy definition of conservation.
13Deriving one column of the motif model
12 sequences. Pseudocounts are commonly taken to be proportional to the general amino acid frequency (Fi).
(For this example I have also taken the background from the general amino acid frequencies.)
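A sketch of one column of the model with pseudocounts, assuming the common form q_i = (c_i + B * F_i) / (N + B), where c_i is the observed count, N the number of sequences, and B a total pseudocount weight. The weight `pseudo_weight`, the uniform placeholder for F_i, and the 12-residue column are all assumptions for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_model(column, freqs, pseudo_weight=1.0):
    """q_i = (c_i + B * F_i) / (N + B) for one motif column."""
    counts = Counter(column)
    n = len(column)
    return {
        aa: (counts[aa] + pseudo_weight * freqs[aa]) / (n + pseudo_weight)
        for aa in freqs
    }

# Placeholder: uniform F_i. Real use would plug in observed frequencies.
F = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

column = "LLLLLLLVVIIM"  # 12 made-up residues, one per sequence
q = column_model(column, F)
print(round(q["L"], 4), round(q["W"], 4))
```

The pseudocounts keep every q_i above zero, so a residue never seen in the column (W here) does not force a match score of exactly zero.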
14Gibbs Sampler starting state
A single motif of fixed width chosen at random
from each of many sequences
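The starting state can be sketched as one random start position per sequence. The function name `random_start`, the toy sequences, and the seed are made up.

```python
import random

def random_start(sequences, width, rng=random):
    """One random start position per sequence for a motif of given width."""
    return [rng.randrange(len(s) - width + 1) for s in sequences]

seqs = ["ACGTACGTAC", "TTGACCATGA", "GGCATTACCA"]  # toy data
rng = random.Random(0)  # seeded only to make the example repeatable
starts = random_start(seqs, width=4, rng=rng)
motifs = [s[i:i + 4] for s, i in zip(seqs, starts)]
print(starts, motifs)
```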
15One round of Gibbs Sampling
- One sampled motif is removed from the set of N motifs.
- The remaining N-1 motifs are aligned and a motif model is derived from them as before.
- The motif model is used to test the match quality at every position in the removed sequence.
16Selecting a motif match
Assign a quality value according to
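A sketch of scanning one sequence, assuming the quality value is the model-to-background odds ratio described earlier. The model, background, and test sequence are made up.

```python
def match_qualities(seq, model, bg):
    """Odds ratio P(window | model) / P(window | background) per start."""
    width = len(model)
    scores = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for offset, column in enumerate(model):
            residue = seq[start + offset]
            ratio *= column[residue] / bg[residue]
        scores.append(ratio)
    return scores

# Hypothetical model favoring A, T, T and a uniform background.
MODEL = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BG = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

scores = match_qualities("GCATTGC", MODEL, BG)
print(scores.index(max(scores)))  # 2, where ATT starts
```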
17sampling round cont.
- Select one among all possible positions according to their match quality.
- Selection can be either probabilistic (strict sampling approach) or fixed at the single best match (converges a little faster but is more prone to getting stuck at local maxima).
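The two selection rules can be sketched side by side. The function name `select_position` and the quality values are made up, and probabilistic selection is assumed to mean sampling with weights proportional to match quality.

```python
import random

def select_position(qualities, probabilistic=True, rng=random):
    """Pick a start position by weighted sampling or by best match."""
    positions = range(len(qualities))
    if probabilistic:
        return rng.choices(positions, weights=qualities, k=1)[0]
    return max(positions, key=qualities.__getitem__)

qualities = [0.064, 0.448, 21.952, 0.448, 0.064]  # made-up match qualities
rng = random.Random(1)
picks = [select_position(qualities, rng=rng) for _ in range(1000)]

print(picks.count(2) / 1000)  # mostly, but not always, position 2
print(select_position(qualities, probabilistic=False))  # 2
```

The occasional choice of a weaker position is the noise that lets the sampler escape poor alignments.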
18Iterate!!
At each iteration, simply remove a randomly chosen sequence from the model set and
1) recompute the model
2) scan along the removed sequence
3) probabilistically select a match
The removed sequence can be chosen in a fixed order and/or the best match can be chosen probabilistically or strictly.
Note: fixed order and fixed best match is close to an EM (expectation maximization) algorithm.
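The whole loop can be sketched compactly for DNA. This is an illustration under stated assumptions, not the lecture's implementation: the model uses pseudocounts proportional to the background, match quality is the model-to-background odds ratio, and all names, the iteration count, and the toy sequences are made up.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def build_model(motifs, bg, pseudo_weight=1.0):
    """Column-wise q = (count + B * bg) / (N + B) from aligned motifs."""
    n = len(motifs)
    model = []
    for column in zip(*motifs):
        counts = Counter(column)
        model.append({r: (counts[r] + pseudo_weight * bg[r]) / (n + pseudo_weight)
                      for r in ALPHABET})
    return model

def match_qualities(seq, model, bg):
    """Odds ratio of model to background for every window in seq."""
    width = len(model)
    out = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for k, column in enumerate(model):
            residue = seq[start + k]
            ratio *= column[residue] / bg[residue]
        out.append(ratio)
    return out

def gibbs_sampler(seqs, width, iterations=2000, seed=0):
    rng = random.Random(seed)
    total = sum(len(s) for s in seqs)
    bg = {r: sum(s.count(r) for s in seqs) / total for r in ALPHABET}
    # Starting state: one random motif position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))  # remove a randomly chosen sequence
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
        model = build_model(others, bg)          # 1) recompute the model
        q = match_qualities(seqs[i], model, bg)  # 2) scan removed sequence
        starts[i] = rng.choices(range(len(q)), weights=q, k=1)[0]  # 3) sample
    return starts

# Toy data with the same 6-mer planted in each sequence.
seqs = ["GCGCTTGACAGCGC", "CCTTGACAGGCGCG", "GGCGCGCTTGACAC", "TTGACACGCGGCGG"]
starts = gibbs_sampler(seqs, width=6)
print([s[p:p + 6] for s, p in zip(seqs, starts)])
```

Because step 3 is a random draw, different seeds can end in different alignments, and a single run is not guaranteed to recover the planted motif.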
19early iterations have no real pattern
20 but by chance two real motifs line up
- Once two real motif alignments occur, they induce a slight bias toward selecting other copies of similar sequences during the iterative process.
- Here, for example, is an early Gibbs Sampler run where some likely real matches have been made.
- Two of them have probably recruited additional members.
Note: in this example, motif regions are in similar positions in all the sequences; this is not necessary.
21After two real motifs align, they bias new choices
Match quality
probabilistically favored because slightly better
match
22If a good set of motifs is found
Match quality
23End point of a run
Overview
Alignment
24Local optima, noise, and sliding
Two very different types of local optima can
occur
The easy one is really a pseudo-optimum and is
solved by motif sliding
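A sketch of one possible sliding step, assuming "sliding" means shifting all current motif positions together by the same small offset and keeping the shift if the alignment scores better. The score used here (negative summed column entropy), the names, and the toy data are assumptions.

```python
import math
from collections import Counter

def alignment_score(seqs, starts, width):
    """Negative summed column entropy: higher means more conserved."""
    total = 0.0
    windows = [s[p:p + width] for s, p in zip(seqs, starts)]
    for column in zip(*windows):
        n = len(column)
        total -= sum((c / n) * math.log2(n / c)
                     for c in Counter(column).values())
    return total

def slide(seqs, starts, width, max_shift=3):
    """Try shifting every start by the same offset; keep the best."""
    best_score = alignment_score(seqs, starts, width)
    best_starts = list(starts)
    for shift in range(-max_shift, max_shift + 1):
        shifted = [p + shift for p in starts]
        if all(0 <= p <= len(s) - width for s, p in zip(seqs, shifted)):
            score = alignment_score(seqs, shifted, width)
            if score > best_score:
                best_score, best_starts = score, shifted
    return best_starts

# Toy data: TTGACA starts at position 2, but the sampler found position 4.
seqs = ["GCTTGACAGC", "CATTGACATG", "AGTTGACACT"]
print(slide(seqs, [4, 4, 4], width=6))  # [2, 2, 2]
```

The off-by-two alignment already shares four conserved columns, which is why the sampler can settle there and why a collective shift is needed to fix it.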
25Local optima, noise, and sliding (cont).
- The more difficult local optima are true (favorably scoring) motif alignments that are not the best (they might not even be close in space or score quality).
- These are reduced by the noise in the algorithm, which tends to jiggle out of poor local optima.
- They require multiple independently started runs to solve: you simply take the best motif set found (but there is no guarantee!).
26Run graphic sampler example