Motif finding problem - PowerPoint PPT Presentation

Title:

Motif finding problem

Slides: 27
Provided by: jamesh78

Transcript and Presenter's Notes


1
Motif finding problem
  • Some important conserved sequences come in short
    dispersed blocks called motifs.
  • How can we find such motifs? If examples of the
    sequence exist, this is relatively
    straightforward, but what if they don't?
  • Problem is common in DNA motif finding, e.g.
    transcription-factor binding sites, splicing
    regulators.
  • We usually cannot just multiply align and take the
    best-aligning region, because spacing, order, and
    information content are too variable.
  • We need a method that will scan through all the
    sequences at all positions looking for short
    regions of similarity.

2
Short motifs may be anywhere in sequences
We want to find the position of this motif AND
we don't know where it is or what its sequence
is.
3
Motif-finding Approaches
  • Expectation maximization (EM)
  • Gibbs sampling
  • Phylogeny-accountable methods (variants of what
    we will discuss, but account for how related
    target sequences are).

4
Sequence Probability
  • We need to define the probability of seeing a
    particular residue sequence of interest in a
    larger sequence block.
  • For this to make sense, we need a specific
    expectation (a model) for the sequence and a
    background model for the complete sequence.
  • This is similar to amino acid score matrices,
    but we will define a different scoring method for
    motif finding, based on entropy.

5
A simple example model for a 3-nucleotide sequence
Consider a case in which the model favors A then
T then T
6
Computer algorithm digression
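Using a direct product of per-position
probabilities, the P-value quickly becomes
extremely small. Even using double precision, it
will exceed available precision for protein
sequences longer than about 20 residues. So it is
safer to work with logarithms (sums of
log-probabilities).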
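A common way around this kind of precision loss is to add log-probabilities instead of multiplying probabilities. The sketch below is only an illustration: the function names and the per-position probability of 0.05 are assumptions, not taken from the slides, and the sequence is made deliberately long so that the direct product underflows.

```python
import math

def direct_product(probs):
    """Multiply per-position probabilities directly (can underflow to 0.0)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_probability(probs):
    """Sum natural-log probabilities; stays finite where the product underflows."""
    return sum(math.log(p) for p in probs)

# Hypothetical input: 300 positions, each with probability 0.05.
probs = [0.05] * 300
print(direct_product(probs))   # too small to represent in double precision
print(log_probability(probs))  # roughly 300 * ln(0.05)
```

Two sequences can then be compared by subtracting their log-probabilities rather than dividing two vanishingly small numbers.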
7
Sequence matches and backgrounds
  • If the sequences we are searching have a
    composition identical to our expected sequence,
    a background model is not needed.
  • But to generalize we will assume the background
    might NOT have the same composition.

8
Computing background probability
Simplest case: all nucleotides are equally
frequent.
General case: make residue counts for the entire
sequence set that you are analyzing.
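The general case can be sketched as a simple count over all sequences. This is a minimal illustration, not the presenter's code: the function name `background_frequencies` and the two toy sequences are hypothetical. In the simplest case the result would just be 0.25 for each nucleotide.

```python
from collections import Counter

def background_frequencies(sequences, alphabet="ACGT"):
    """Estimate background residue frequencies from counts over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Count only residues that belong to the alphabet (skips N, gaps, etc.).
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues from the alphabet were found")
    return {ch: counts[ch] / total for ch in alphabet}

# Hypothetical toy sequence set, rich in A and T.
seqs = ["ATTATA", "TTAACG"]
bg = background_frequencies(seqs)
print(bg)
```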
9
Differentiating from background
Now consider a test sequence that is nearly all A
and T. This background predicts that ATT will turn
up often by chance.
Compare that to one that is nearly all C and G,
where ATT is expected only rarely.
Clearly we should be much more impressed if we
find ATT in the second case.
10
Putting them together
We want to know the DEGREE to which our model
favors a particular sequence compared to
background.
This ratio (model probability over background
probability) can be a very large number (>> 1) if
the sequence matches the model much better, or a
small number (< 1) if the sequence matches the
background better.
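As a rough worked example of this comparison, the sketch below scores ATT against a model that favors A then T then T, using two different backgrounds. All numbers are hypothetical (the slide's own model values are not preserved in this transcript), and the function names are illustrative.

```python
def model_probability(seq, model):
    """Probability of seq under a position-specific model (list of dicts)."""
    p = 1.0
    for pos, residue in enumerate(seq):
        p *= model[pos][residue]
    return p

def background_probability(seq, background):
    """Probability of seq under a position-independent background."""
    p = 1.0
    for residue in seq:
        p *= background[residue]
    return p

def likelihood_ratio(seq, model, background):
    """How strongly the model favors seq compared to background."""
    return model_probability(seq, model) / background_probability(seq, background)

# Hypothetical model favoring A, then T, then T.
model = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Hypothetical backgrounds: one AT-rich, one GC-rich.
at_rich = {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45}
gc_rich = {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}

print(likelihood_ratio("ATT", model, at_rich))  # modest ratio
print(likelihood_ratio("ATT", model, gc_rich))  # much larger ratio
```

With these made-up numbers, finding ATT against the GC-rich background gives a far larger ratio, which matches the intuition that it is the more surprising find.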
11
Entropy measure
  • This method uses a pure entropy measure of
    residue conservation in each column.
  • The less variable the column is, the better it
    scores (low entropy).
  • It takes no account of amino acid similarities
    (e.g. L, V, I all have very similar side chains).
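The idea of scoring a column by its entropy can be sketched as follows. This assumes plain Shannon entropy in bits over the observed residue frequencies; the exact formula used in the slides is not preserved here, and the function name is illustrative.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    # Each residue contributes -p * log2(p), where p is its observed frequency.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(column_entropy("LLLLLLLL"))  # perfectly conserved column
print(column_entropy("LVILVILV"))  # L, V, I treated as unrelated residues
```

Note that the second column scores as poorly conserved even though L, V and I are chemically similar, which is exactly the limitation named in the last bullet.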

12
For entropy measures, where do the model q_i come
from?
Note: this is a strict entropy definition of
conservation.
13
Deriving one column of the motif model
12 sequences. Pseudocounts are commonly taken to
be proportional to the general amino acid
frequencies (F_i).
(For this example I have also taken the background
from the general amino acid frequencies.)
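One common way to combine observed counts with pseudocounts is q_i = (c_i + B * F_i) / (N + B), where c_i is the count of residue i in the column, N is the number of sequences, and B is the total pseudocount weight. The sketch below assumes that form; the slide's own formula is not preserved, `pseudo_weight` is a hypothetical parameter, and uniform frequencies are used as a placeholder for a real table of F_i.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_model(column, general_freqs, pseudo_weight=1.0):
    """Column probabilities with pseudocounts proportional to general frequencies.

    Assumed form: q_i = (c_i + B * F_i) / (N + B), with B = pseudo_weight.
    """
    counts = Counter(column)
    n = len(column)
    return {
        aa: (counts[aa] + pseudo_weight * freq) / (n + pseudo_weight)
        for aa, freq in general_freqs.items()
    }

# Placeholder only: uniform values stand in for real general frequencies F_i.
uniform_freqs = {aa: 1 / 20 for aa in AMINO_ACIDS}
column = "LLLLLLLLIIVM"  # hypothetical column taken from 12 aligned sequences
q = column_model(column, uniform_freqs)
print(q["L"], q["I"], q["W"])
```

The pseudocounts keep every q_i above zero, so a residue never seen in the column (W here) does not force a match probability of exactly zero.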
14
Gibbs Sampler starting state
A single motif of fixed width chosen at random
from each of many sequences
15
One round of Gibbs Sampling
  • One sampled motif is removed from the set of N
    motifs.
  • The remaining N-1 motifs are aligned and a motif
    model is derived from them as before.
  • The motif model is used to test the match
    quality at every position in the removed sequence
    by comparing how well each window fits the model
    versus the background.

16
Selecting a motif match
Assign a quality value according to
17
Sampling round (cont.)
  • Select one among all possible positions
    according to their match quality.
  • Selection can be either probabilistic (strict
    sampling approach) or fixed at the single best
    match (converges a little faster but is more
    prone to getting stuck at local maxima).
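The two selection modes can be sketched like this. It is a minimal illustration with hypothetical names; it assumes the match qualities are non-negative and not all zero, and it uses Python's standard `random.choices` for the weighted draw.

```python
import random

def choose_position(qualities, probabilistic=True, rng=random):
    """Pick a motif start position from per-position match qualities."""
    if not qualities:
        raise ValueError("no candidate positions")
    positions = range(len(qualities))
    if probabilistic:
        # Strict sampling: each position is drawn in proportion to its quality.
        return rng.choices(positions, weights=qualities, k=1)[0]
    # Fixed choice: always take the single best match.
    return max(positions, key=lambda i: qualities[i])

qualities = [0.2, 0.1, 6.5, 0.4]  # hypothetical match qualities
print(choose_position(qualities, probabilistic=False))
print(choose_position(qualities, probabilistic=True))
```

In probabilistic mode the best position is only the most likely pick, not a certain one, which is what lets the sampler escape poor early choices.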

18
Iterate!!
At each iteration, simply remove a randomly
chosen sequence from the model set and
1) recompute the model
2) scan along the removed sequence
3) probabilistically select a match
The removed sequence can be chosen in a fixed
order and/or the best match can be chosen
probabilistically or strictly.
Note: fixed order and fixed best match is close to
an EM (expectation maximization) algorithm.
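Putting the three steps together, one possible sketch of the whole loop for DNA is shown below. It is not the presenter's implementation: the function names, the pseudocount weight, the iteration count and the likelihood-ratio score are all assumptions, and it expects at least two sequences containing only A, C, G and T, each at least `width` long.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def build_model(motifs, background, pseudo_weight=1.0):
    """Per-column probabilities with pseudocounts proportional to background."""
    n = len(motifs)
    model = []
    for col in range(len(motifs[0])):
        counts = Counter(m[col] for m in motifs)
        model.append({
            ch: (counts[ch] + pseudo_weight * background[ch]) / (n + pseudo_weight)
            for ch in ALPHABET
        })
    return model

def match_qualities(seq, model, background):
    """Model/background likelihood ratio for every window of the sequence."""
    width = len(model)
    qualities = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for offset in range(width):
            ch = seq[start + offset]
            ratio *= model[offset][ch] / background[ch]
        qualities.append(ratio)
    return qualities

def gibbs_sample(seqs, width, iterations=2000, seed=0):
    """Return one sampled motif start per sequence (no optimality guarantee)."""
    rng = random.Random(seed)
    counts = Counter("".join(seqs))
    total = sum(len(s) for s in seqs)
    background = {ch: counts[ch] / total for ch in ALPHABET}
    # Starting state: one window of fixed width chosen at random per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))  # remove a randomly chosen sequence
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
        model = build_model(others, background)                   # 1) recompute
        qualities = match_qualities(seqs[i], model, background)   # 2) scan
        starts[i] = rng.choices(range(len(qualities)),
                                weights=qualities, k=1)[0]        # 3) sample
    return starts

# Hypothetical toy data with a shared 6-mer; the result is a sample, not a proof.
toy = ["GGACGTACTTGA", "TCACGTACGACC", "CAACGTACAGTT", "TTGACGTACCAG"]
print(gibbs_sample(toy, width=6))
```

Replacing the random choice of `i` with a fixed order and the weighted draw with the single best match would give the EM-like variant mentioned in the note.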
19
early iterations have no real pattern
20
but by chance two real motifs line up
  • once two real motif alignments occur, they
    induce a slight bias toward selecting other
    copies of similar sequences during the iterative
    process.
  • here for example is an early Gibbs Sampler run
    where some likely real matches have been made.
  • two of them have probably recruited additional
    members

Note: in this example, motif regions are in
similar positions in all the sequences; this is
not necessary.
21
After two real motifs align, they bias new choices
Match quality
probabilistically favored because slightly better
match
22
If a good set of motifs is found
Match quality
23
End point of a run
Overview
Alignment
24
Local optima, noise, and sliding
Two very different types of local optima can
occur.
The easy one is really a pseudo-optimum and is
solved by motif sliding.
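One plausible reading of motif sliding is to shift every motif window left or right by the same small amount and keep the shift that scores best, which repairs an alignment that is out of phase with the real motif. The sketch below assumes that reading and uses summed column entropy as the score; the function names, `max_shift`, and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def alignment_entropy(motifs):
    """Sum of per-column Shannon entropies (bits); lower means more conserved."""
    n = len(motifs)
    total = 0.0
    for column in zip(*motifs):
        for c in Counter(column).values():
            total -= (c / n) * math.log2(c / n)
    return total

def slide_motif(seqs, starts, width, max_shift=2):
    """Try shifting all windows together; return the best-scoring start list."""
    best_starts = list(starts)
    best_score = alignment_entropy(
        [s[p:p + width] for s, p in zip(seqs, starts)])
    for shift in range(-max_shift, max_shift + 1):
        shifted = [p + shift for p in starts]
        # Skip shifts that would push any window off the end of its sequence.
        if any(p < 0 or p + width > len(s) for s, p in zip(seqs, shifted)):
            continue
        score = alignment_entropy(
            [s[p:p + width] for s, p in zip(seqs, shifted)])
        if score < best_score:
            best_starts, best_score = shifted, score
    return best_starts

# Hypothetical case: the shared motif starts at 2, the sampler settled on 3.
seqs = ["GGACGTACTT", "TCACGTACGA", "CAACGTACAG"]
print(slide_motif(seqs, [3, 3, 3], width=6))
```

Entropy alone can favor low-complexity stretches, so a real implementation would more likely score shifts against a background model.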
25
Local optima, noise, and sliding (cont.)
  • The more difficult local optima are true
    (favorably scoring) motif alignments that are not
    the best (might not even be close in space or
    score quality).
  • These are reduced because the noisy algorithm
    tends to jiggle out of poor local optima.
  • They require multiple independently started runs
    to solve: you simply take the best motif set
    found (but there is no guarantee!).

26
Run graphic sampler example