Motif finding problem - PowerPoint PPT Presentation


Title:

Motif finding problem


Slides: 27

Provided by: jamesh78


Transcript and Presenter's Notes

Title: Motif finding problem

1
Motif finding problem

Some important conserved sequences come in short
dispersed blocks called motifs.
How can we find such motifs? If examples of the
sequence exist, this is relatively
straightforward, but what if they don't?
Problem is common in DNA motif finding, e.g.
transcription-factor binding sites, splicing
regulators.
We usually cannot just multiply align and take the
best aligning region, because spacing, order, and
information content are too variable.
We need a method that will scan through all the
sequences at all positions looking for short
regions of similarity.

2
Short motifs may be anywhere in sequences
We want to find the position of this motif AND
we don't know where it is or what its sequence
is.
3
Motif-finding Approaches

Expectation maximization (EM)
Gibbs sampling
Phylogeny-accountable methods (variants of what
we will discuss, but account for how related
target sequences are).

4
Sequence Probability

We need to define the probability of seeing a
particular residue sequence of interest in a
larger sequence block.
For this to make sense, we need a specific
expectation (a model) for the sequence and a
background model for the complete sequence.
This is similar to amino acid score matrices,
but we will define a different scoring method for
motif finding, based on entropy.

5
A simple example model for a 3-nucleotide sequence
Consider a case in which the model favors A then
T then T
6
Computer algorithm digression
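Using the direct product of per-position probabilities, $P = \prod_i p_i$,
the P-value quickly becomes extremely small. Even
using double precision, it will exceed available
precision for protein sequences longer than about
20 residues. So work with logarithms and sum them instead: $\log P = \sum_i \log p_i$.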
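The precision problem with a long product of probabilities, and the usual remedy of summing logarithms, can be sketched as follows. This is a minimal illustration, not the presenter's code: the per-residue probability of 0.05 and the sequence lengths are assumed values chosen only to make the effect visible, and the function names are hypothetical.

```python
import math

def sequence_probability(probs):
    """Direct product of per-position probabilities (can underflow to 0.0)."""
    p = 1.0
    for x in probs:
        p *= x
    return p

def sequence_log_probability(probs):
    """Sum of natural-log probabilities; stays finite for long sequences."""
    return sum(math.log(x) for x in probs)

# Assumed per-residue probability, roughly 1/20 for a flat amino acid model.
per_residue = 0.05
for length in (20, 200, 400):
    probs = [per_residue] * length
    print(length, sequence_probability(probs), sequence_log_probability(probs))
```

The length at which the direct product becomes unusable depends on the individual probabilities and on the floating-point type, whereas the log sum only grows linearly in magnitude with sequence length.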
7
Sequence matches and backgrounds

If the sequences we are searching have a
composition identical to our expected sequence,
a background model is not needed.
But to generalize we will assume the background
might NOT have the same composition.

8
Computing background probability
Simplest case: all nucleotides are
equally frequent.
General case: make residue counts for the entire
sequence set that you are analyzing.
9
Differentiating from background
Now consider a test sequence that is nearly all A
and T. This background predicts
Compared to one that is nearly all C and G
Clearly we should be much more impressed if we
find ATT in the second case
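A small sketch of the comparison above: estimate the background from residue counts over the whole sequence set, then ask how likely ATT is under each background. The two sequence sets are made-up examples and the function names are hypothetical.

```python
from collections import Counter

def background_frequencies(sequences, alphabet="ACGT"):
    """Residue frequencies counted over the entire sequence set."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts[a] for a in alphabet)
    return {a: counts[a] / total for a in alphabet}

def background_probability(word, freqs):
    """Probability of seeing `word` by chance under the background."""
    p = 1.0
    for ch in word:
        p *= freqs[ch]
    return p

# Made-up sequence sets: one nearly all A/T, one nearly all C/G.
at_rich = ["ATTATAATTA", "TTATATAAGC"]
gc_rich = ["GCCGCGGCCG", "CGGCGCCGAT"]

p_at = background_probability("ATT", background_frequencies(at_rich))
p_gc = background_probability("ATT", background_frequencies(gc_rich))
print(p_at, p_gc)
```

ATT is far less probable by chance under the C/G-rich background, which is why finding it there is more impressive.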
10
Putting them together
We want to know the DEGREE to which our model
favors a particular sequence compared to
background
This can be a very large number (>> 1) if the
sequence matches the model much better, or a small
number (< 1) if the sequence matches the
background better.
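One way to express this degree of preference is the ratio of the model probability to the background probability, computed here via a log-odds sum. The model values below are assumed for illustration (a 3-position model favoring A, then T, then T), as is the uniform background; neither is taken from the slides.

```python
import math

# Assumed motif model favoring A, then T, then T (one dict per position).
MODEL = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed uniform background.
UNIFORM = {base: 0.25 for base in "ACGT"}

def log_odds(word, model, background):
    """Sum over positions of log(model probability / background probability)."""
    return sum(math.log(model[i][ch] / background[ch]) for i, ch in enumerate(word))

def odds_ratio(word, model, background):
    """Model-to-background ratio: >> 1 favors the model, < 1 the background."""
    return math.exp(log_odds(word, model, background))

print(odds_ratio("ATT", MODEL, UNIFORM))
print(odds_ratio("GCC", MODEL, UNIFORM))
```

Summing logs and exponentiating once at the end avoids the small-product problem when the motif is wider than a few positions.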
11
Entropy measure

this method uses a pure entropy measure of
residue conservation in each column.
the less variable the column is, the better it
scores (low entropy).
takes no account of amino acid similarities
(e.g. L, V, I all have very similar side chains).
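The per-column entropy described above can be sketched as follows, using Shannon entropy in bits over the observed residue frequencies. The function name and the example columns are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(column_entropy("AAAA"))  # fully conserved column
print(column_entropy("AATT"))  # two residues, equally frequent
print(column_entropy("ACGT"))  # four residues, equally frequent
# Similar side chains (L, V, I) get no credit: both columns score the same.
print(column_entropy("LLVI"), column_entropy("LLDK"))
```

Because the measure only looks at how often each symbol occurs, a column mixing L, V, and I scores exactly like a column mixing chemically unrelated residues with the same counts.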

12
For entropy measures, where do the model q_i come
from?
Note: this is a strict entropy definition of
conservation.
13
Deriving one column of the motif model
12 sequences. Pseudocounts are commonly taken to
be proportional to the general amino acid
frequency (F_i).
(For this example I have also taken the background
from the general amino acid frequencies.)
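A sketch of deriving one column of the motif model with pseudocounts proportional to the background frequencies. The formula q_i = (c_i + B * F_i) / (N + B), the total pseudocount weight B, the flat 1/20 background, and the 12-residue column are all assumptions for illustration; real general amino acid frequencies would replace the flat values.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_model(column, background, pseudo_weight=1.0):
    """q_i = (c_i + B * F_i) / (N + B) for every residue i in the background."""
    counts = Counter(column)
    n = len(column)
    return {
        a: (counts[a] + pseudo_weight * background[a]) / (n + pseudo_weight)
        for a in background
    }

# Placeholder background: flat 1/20 instead of measured frequencies.
flat_background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}

# Made-up column from 12 aligned sequences.
column = "LLLLLLLLLVVI"
q = column_model(column, flat_background)
print(q["L"], q["V"], q["I"], q["W"])
```

The pseudocounts keep residues that were never observed in the column (W here) at a small nonzero probability, so a single unseen residue does not force a match score to zero.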
14
Gibbs Sampler starting state
A single motif of fixed width chosen at random
from each of many sequences
15
One round of Gibbs Sampling

one sampled motif is removed from the set of N
motifs.
the remaining N-1 motifs are aligned and a motif
model is derived from them as before.
the motif model is used to test the match
quality at every position in the removed sequence
by

16
Selecting a motif match
Assign a quality value according to
17
sampling round cont.

Select one among all possible positions
according to their match quality.
Can be either probabilistic (strict sampling
approach) or fixed at single best match
(converges a little faster but more prone to
getting stuck at local maxima).
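The two selection rules can be sketched as below: sampling a position with probability proportional to its match quality, or always taking the single best match. The function name, its parameters, and the quality values are hypothetical.

```python
import random

def select_position(qualities, probabilistic=True, rng=random):
    """Pick a start position from per-position match qualities."""
    if probabilistic:
        # Strict sampling: probability proportional to match quality.
        return rng.choices(range(len(qualities)), weights=qualities, k=1)[0]
    # Fixed choice: the single best-scoring position.
    return max(range(len(qualities)), key=qualities.__getitem__)

# Made-up match qualities for five candidate positions.
qualities = [0.2, 0.1, 6.0, 0.4, 1.3]
rng = random.Random(0)
print(select_position(qualities, probabilistic=False))
print([select_position(qualities, rng=rng) for _ in range(10)])
```

With sampling, a lower-quality position is still chosen occasionally, which is what lets the procedure move away from a poor alignment.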

18
Iterate!!
At each iteration, simply remove a randomly
chosen sequence from the model set and
1) recompute the model
2) scan along the removed sequence
3) probabilistically select a match
The removed sequence can instead be chosen in a
fixed order, and the best match can be chosen
probabilistically or strictly.
Note: fixed order and fixed best match is close
to an EM (expectation maximization) algorithm.
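Putting the iteration together, a compact sketch of the sampling loop for DNA might look like this. It is an illustration under assumptions, not the presenter's implementation: sequences contain only A, C, G, T, each is at least as long as the motif width, match quality is taken to be the model-to-background ratio, and the pseudocount weight, iteration count, and function names are hypothetical.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def build_model(motifs, background, pseudo_weight=1.0):
    """Per-column probabilities from the aligned motifs, with pseudocounts."""
    n = len(motifs)
    model = []
    for column in zip(*motifs):
        counts = Counter(column)
        model.append({a: (counts[a] + pseudo_weight * background[a]) / (n + pseudo_weight)
                      for a in ALPHABET})
    return model

def match_qualities(seq, model, background):
    """Model-to-background ratio at every start position of seq."""
    width = len(model)
    qualities = []
    for start in range(len(seq) - width + 1):
        q = 1.0
        for i, ch in enumerate(seq[start:start + width]):
            q *= model[i][ch] / background[ch]
        qualities.append(q)
    return qualities

def gibbs_sampler(seqs, width, iterations=1000, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    counts = Counter("".join(seqs))
    total = sum(counts[a] for a in ALPHABET)
    background = {a: counts[a] / total for a in ALPHABET}
    # Starting state: one motif of fixed width at a random position per sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        k = rng.randrange(len(seqs))  # remove a randomly chosen sequence
        motifs = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, positions)) if j != k]
        model = build_model(motifs, background)                  # 1) recompute the model
        qualities = match_qualities(seqs[k], model, background)  # 2) scan the removed sequence
        # 3) probabilistically select a match
        positions[k] = rng.choices(range(len(qualities)), weights=qualities, k=1)[0]
    return positions

# Made-up sequences for a quick run.
example = ["ACGTACGTGATTACAGG", "TTGATTACACCGTAC", "CCGATTACATTTGGA"]
print(gibbs_sampler(example, width=7, iterations=200))
```

Replacing the random choice of `k` with a fixed cycle and the `rng.choices` call with an argmax gives the fixed-order, fixed-best-match variant mentioned above.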
19
early iterations have no real pattern
20
but by chance two real motifs line up

once two real motif alignments occur, they
induce a slight bias toward selecting other
copies of similar sequences during the iterative
process.
here for example is an early Gibbs Sampler run
where some likely real matches have been made.
two of them have probably recruited additional
members

Note: in this example, motif regions are in
similar positions in all the sequences; this is
not necessary.
21
After two real motifs align, they bias new choices
[Figure: match quality at each position; one position is
probabilistically favored because it is a slightly better match.]
22
If a good set of motifs is found
[Figure: match quality at each position.]
23
End point of a run
[Figure panels: Overview, Alignment.]
24
Local optima, noise, and sliding
Two very different types of local optima can
occur.
The easy one is really a pseudo-optimum and is
solved by motif sliding.
25
Local optima, noise, and sliding (cont).

The more difficult local optima are true
(favorably scoring) motif alignments that are not
the best (might not even be close in space or
score quality).
Reduced by the noise in the algorithm, which tends to
jiggle it out of poor local optima.
Require multiple independently started runs to
solve: you simply take the best motif set found
(but no guarantee!).

26
Run graphic sampler example
