Title: Bioinformatics
Lecture 10
Bioinformatics
- Finding signals and motifs in DNA and proteins
- Expectation Maximization Algorithm
- MEME
- The Gibbs sampler
Finding signals and motifs in DNA and proteins
- An alignment of sequences is intrinsically connected with another essential task: finding certain signals and motifs (highly conserved ungapped blocks) shared by some of the sequences.
- A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. Motifs are represented as position-dependent scoring matrices that describe the score of each possible letter at each position in the pattern (a small sketch of such a matrix follows this list).
- Another related task is searching biological databases for sequences that contain one or more known motifs.
- These objectives are critical in the analysis of genes and proteins, as any gene or protein contains a set of different motifs and signals. Complete knowledge about the locations and structure of such motifs and signals leads to a comprehensive description of a gene or protein and indicates a potential function.
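To make the position-dependent scoring matrix idea concrete, here is a minimal sketch; the function name, the pseudocount, and the toy sites and background are my own assumptions, not part of any specific tool.

```python
import math
from collections import Counter

def position_scoring_matrix(aligned_sites, background, pseudocount=0.5):
    """Build a simple position-dependent (log-odds) scoring matrix from
    equal-length, ungapped motif instances.

    aligned_sites: list of equal-length strings (motif occurrences)
    background:    dict mapping each letter to its background frequency
    """
    width = len(aligned_sites[0])
    alphabet = sorted(background)
    matrix = []
    for col in range(width):
        counts = Counter(site[col] for site in aligned_sites)
        total = len(aligned_sites) + pseudocount * len(alphabet)
        column = {}
        for letter in alphabet:
            freq = (counts[letter] + pseudocount) / total
            # log-odds score of observing `letter` at this motif position
            column[letter] = math.log2(freq / background[letter])
        matrix.append(column)
    return matrix

# toy DNA example with made-up motif occurrences and a uniform background
sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT"]
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = position_scoring_matrix(sites, bg)
score = sum(pssm[i][letter] for i, letter in enumerate("TATAAT"))
```

Scanning a longer sequence then amounts to summing the column scores over every window of the motif width and keeping the highest-scoring positions.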
The eMOTIF method of motif analysis
- eMOTIF is a very useful method for identifying motifs in proteins.
- An MSA of a particular set of proteins is submitted to eMOTIF, which essentially searches for consensus sequence(s) and identifies the conserved motifs.
- The probability of a motif is estimated from the frequencies of the individual amino acids in the SwissProt database, as a product of the probabilities of each position in the consensus (see the sketch after this list).
- The result could be as follows: "This motif matches 25 out of the 30 sequences supplied. It will match 1 in 10^19 random sequences, or less than 1 sequence in the current SWISS-PROT database."
- A motif can then be searched for in the Swiss-Prot database.
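The "product of probabilities of each position in the consensus" can be illustrated with a small sketch. The consensus pattern, the amino acid frequencies, and the function name below are invented for illustration; eMOTIF's own pattern syntax and the real SwissProt frequencies are not reproduced here.

```python
def random_match_probability(pattern, aa_freq):
    """Probability that a single random position-by-position draw from the
    background frequencies matches the pattern; None means 'any residue'."""
    p = 1.0
    for allowed in pattern:
        p *= 1.0 if allowed is None else sum(aa_freq[a] for a in allowed)
    return p

# hypothetical consensus [ILV]-G-x-G-[ST]; the frequencies are illustrative only
aa_freq = {"I": 0.06, "L": 0.10, "V": 0.07, "G": 0.07, "S": 0.07, "T": 0.05}
pattern = [{"I", "L", "V"}, {"G"}, None, {"G"}, {"S", "T"}]
print(random_match_probability(pattern, aa_freq))
```

Multiplying such a per-position probability by the number of positions in the database gives, roughly, the expected number of random matches, which is how statements like "1 in 10^19 random sequences" are read.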
eMOTIF
(Figure: eMOTIF output, illustrating the true positives among the matched sequences.)
eMOTIF search for database sequences containing a given motif
Expectation Maximization (EM) Algorithm
- This algorithm is used to identify conserved regions in unaligned DNA and protein sequences.
- Assume that a set of sequences is expected to share a common sequence pattern.
- An initial guess is made as to the location and size of the site of interest in each of the sequences, and these parts are loosely aligned.
- This alignment provides an estimate of the base or amino acid composition of each column in the site.
- The EM algorithm consists of two steps, which are repeated iteratively (a high-level sketch follows this list).
- Step 1, the expectation step: the column-by-column composition of the site is used to estimate the probability of finding the site at each position in each of the sequences. These probabilities in turn provide new estimates of the expected base or amino acid distribution for each column in the site.
- Step 2, the maximization step: the new counts of bases or amino acids for each position in the site found in step 1 replace the previous set.
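A high-level sketch of this two-step loop for DNA follows. It is my own minimal implementation, not MEME; the function name em_motif, the pseudocounts, and the use of a uniform background are assumptions made for the sake of a short, runnable example.

```python
import numpy as np

def em_motif(seqs, width, n_iter=100, alphabet="ACGT"):
    """Minimal EM loop for ungapped motif discovery (sketch only)."""
    idx = {c: i for i, c in enumerate(alphabet)}
    rng = np.random.default_rng(0)
    # initial guess: random column frequencies, uniform background
    motif = rng.dirichlet(np.ones(len(alphabet)), size=width)
    background = np.full(len(alphabet), 1.0 / len(alphabet))

    for _ in range(n_iter):
        counts = np.full((width, len(alphabet)), 0.01)   # small pseudocounts
        for seq in seqs:
            # E-step: relative likelihood of the motif at every start position
            weights = []
            for s in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    a = idx[seq[s + j]]
                    w *= motif[j, a] / background[a]
                weights.append(w)
            weights = np.array(weights)
            weights /= weights.sum()
            # M-step contribution: weighted base counts for each site column
            for s, w in enumerate(weights):
                for j in range(width):
                    counts[j, idx[seq[s + j]]] += w
        motif = counts / counts.sum(axis=1, keepdims=True)
    return motif
```

Each pass performs the expectation step (the position weights) and the maximization step (the weighted counts) once per sequence; in practice the loop is stopped when the column frequencies stop changing.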
Expectation Maximization (EM) Algorithm
(Figure: a set of unaligned sequences, each drawn as a run of background positions O with a candidate motif block XXXX somewhere inside it.)
Columns defined by a preliminary alignment of the sequences provide initial estimates of the frequencies of the bases or amino acids in each motif column; columns not in the motif provide the background frequencies.
Bases   Background   Site column 1   Site column 2
G       0.27         0.4             0.1
C       0.25         0.4             0.1
A       0.25         0.2             0.1
T       0.23         0.2             0.7
Total   1.00         1.00            1.00
Expectation Maximization (EM) Algorithm
(Figure: two candidate positions for the motif, A and B, in sequence 1.)
The resulting score gives the likelihood that the motif matches position A, B, or any other position in sequence 1. Repeat for all other positions and find the most likely location. Then repeat for the remaining sequences.
EM Algorithm: 1st (expectation) step calculations
- Assume that seq1 is 100 bases long and that the length of the site is 20 bases.
- Suppose that the site starts in column 1 and that its first two positions are A and T.
- The site will then end at position 20; positions 21 and 22 do not belong to the site. Assume that these two positions are also A and T.
- The probability of this location of the site in seq1 is given by Psite1,seq1 = 0.2 (for A in site position 1) x 0.7 (for T in site position 2) x the site-column probabilities for the next 18 positions in the site x 0.25 (for A in the first flanking position) x 0.23 (for T in the second flanking position) x the background probabilities for the remaining 78 flanking positions.
- The same procedure is applied to calculate Psite2,seq1 through Psite78,seq1, providing a comparative set of probabilities for the site location (a worked sketch of this calculation follows this list).
- The probability of the best location in seq1, say at site k, is the site probability at k divided by the sum of the site probabilities over all possible locations.
- The procedure is then repeated for all other sequences.
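The calculation above can be transcribed directly into code. This is only a sketch: the helper names are my own, and the site columns beyond the two given in the earlier table are placeholders.

```python
# Background and the first two site-column frequencies from the earlier table;
# the remaining 18 site columns are placeholders, since only two are given.
background = {"G": 0.27, "C": 0.25, "A": 0.25, "T": 0.23}
site = [{"G": 0.4, "C": 0.4, "A": 0.2, "T": 0.2},
        {"G": 0.1, "C": 0.1, "A": 0.1, "T": 0.7}] + [dict(background)] * 18

def site_probability(seq, start, site, background):
    """Unnormalised probability that the site occupies seq[start:start+width],
    with every other position explained by the background frequencies."""
    width = len(site)
    p = 1.0
    for i, base in enumerate(seq):
        p *= site[i - start][base] if start <= i < start + width else background[base]
    return p

def location_probabilities(seq, site, background):
    """Normalised probability of the site starting at each possible position."""
    probs = [site_probability(seq, s, site, background)
             for s in range(len(seq) - len(site) + 1)]
    total = sum(probs)
    return [p / total for p in probs]
```

location_probabilities returns the per-position weights that the maximization step on the next slide uses.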
EM Algorithm: 2nd (maximization) step calculations
- The site probabilities for each sequence calculated in the 1st step are then used to create a new table of expected base counts for each of the site positions, using the site probabilities as weights.
- Suppose that P(site 1 in seq 1) = Psite1,seq1 / (Psite1,seq1 + Psite2,seq1 + ... + Psite78,seq1) = 0.01, and that P(site 2 in seq 1) = 0.02.
- These values are then added to the previous table, as shown in the table below.
- This procedure is repeated for every other possible first column in seq1, and the process then continues for all other sequences, resulting in a new version of the table (a sketch of this weighted-count update follows the table).
- The expectation and maximization steps are repeated until the estimates of base frequencies no longer change.
Bases            Background   Site column 1   Site column 2
G                0.27         0.4             0.1
C                0.25         0.4             0.1
A                0.25         0.2 + 0.01      0.1
T                0.23         0.2             0.7 + 0.02
Total/weighted   1.00         1.00            1.00
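A sketch of how such a weighted-count table is accumulated follows. The function name, the pseudocount argument and the data layout are my own; real EM implementations differ in detail.

```python
def update_counts(seqs, site_weights, width, alphabet="ACGT", pseudo=0.0):
    """Maximization-step sketch: accumulate base counts for each site column,
    weighting every possible start position by its probability from the E-step.
    site_weights[k][s] is P(site starts at position s in seqs[k])."""
    counts = {c: [pseudo] * width for c in alphabet}
    for seq, weights in zip(seqs, site_weights):
        for start, w in enumerate(weights):
            for col in range(width):
                counts[seq[start + col]][col] += w
    # normalise each column so the new frequencies sum to 1
    freqs = {c: [0.0] * width for c in alphabet}
    for col in range(width):
        total = sum(counts[c][col] for c in alphabet)
        for c in alphabet:
            freqs[c][col] = counts[c][col] / total
    return freqs
```

Feeding the normalised frequencies back into the expectation step and repeating until they stop changing completes the EM cycle described above.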
Multiple EM for Motif Elicitation - MEME
MEME Summary Line
- This line gives the width (width), the number of occurrences in the training set (sites), the log likelihood ratio (llr), and the E-value of the motif. Each motif describes a pattern of a fixed width; no gaps are allowed in MEME motifs. MEME numbers the motifs consecutively from one as it finds them. MEME usually finds the most statistically significant (low E-value) motifs first.
- The statistical significance of a motif is based on its log likelihood ratio, its width and number of occurrences, the background letter frequencies (given in the command line summary), and the size of the training set.
- The E-value is an estimate of the expected number of motifs with the given log likelihood ratio (or higher), and with the same width and number of occurrences, that one would find in a similarly sized set of random sequences. (In random sequences each position is independent, with letters chosen according to the background letter frequencies.)
- The log likelihood ratio is the logarithm of the ratio of the probability of the occurrences of the motif given the motif model (likelihood given the motif) versus their probability given the background model (likelihood given the null model); see the formula after this list. (Normally the background model is a 0-order Markov model using the background letter frequencies, but higher-order Markov models may be specified via the -bfile option to MEME.)
- Clicking on the buttons to the left of the motif summary line takes you to the previous motif (P) or the next motif (N).
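Restating the log likelihood ratio definition above in symbols (the notation is mine, not MEME's output format; the second equality assumes the 0-order background model mentioned above):

```latex
\mathrm{llr}
  = \log \frac{\Pr(\text{occurrences} \mid \text{motif model})}
              {\Pr(\text{occurrences} \mid \text{background model})}
  = \sum_{i=1}^{\text{sites}} \sum_{j=1}^{\text{width}}
      \log \frac{p_{j}(s_{ij})}{b(s_{ij})}
```

Here s_ij is the letter at position j of the i-th occurrence, p_j is the motif's column-j frequency, and b is the background frequency of that letter.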
MEME Summary Line
MEME
MOTIF 1: width = 26, sites = 5, llr = 244, E-value = 5.0e-006
MEME
The Gibbs Sampler
- The Gibbs sampler algorithm is slightly different from the EM approach. The method also searches for the statistically most probable motifs and can find the optimal width and number of motifs in each sequence.
- The method iterates through two steps. In the first step, a random start position for the motif is chosen in all sequences but one. These sequences are then aligned and used to form an initial guess of the motif.
- The objective of the next step is to find the most probable pattern common to the left-out sequence (and, in subsequent iterations, to all of the sequences) by sliding the motif back and forth until the ratio of the motif probability to the background probability is a maximum.
- Then the next sequence is left out and the process is repeated until the residue frequencies in each motif no longer change. The number of iterations may range from several hundred to several thousand.
- Several additional statistical procedures are used to improve the performance of the algorithm. The Gibbs sampler has been used to align sequences with very little sequence similarity.
Steps of the Gibbs sampler algorithm
A. Estimate the amino acid or base frequencies in the motif columns of all but one sequence; the non-motif positions provide the background frequencies. The location of the motif in each sequence is initially chosen at random, and these randomly placed motif windows, aligned on top of each other, give the first estimate of the motif composition.
(Figure: every sequence except the left-out one, drawn as n positions x with M marking the random location of the motif; the aligned M columns form the initial motif.)

B. Use the estimate from A to calculate the ratio of the motif probability to the background probability at each position in the left-out sequence. This ratio for each possible location in the sequence is the weight of that position.
(Figure: the motif window slid along every possible position of the left-out sequence.)

C. Choose a new location for the motif in the left-out sequence by a random selection, using the weights to bias the choice.
(Figure: the estimated location of the motif in the left-out sequence.)

D. Repeat steps A to C many times; a sketch of this loop follows below.
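The following is a minimal sketch of steps A to D, my own simplified implementation rather than the published Gibbs sampling program; the uniform background, the pseudocount of 0.5, and the fixed motif width are assumptions made to keep the example short and runnable.

```python
import random

def gibbs_sampler(seqs, width, n_iter=1000, alphabet="ACGT", pseudo=0.5):
    """Minimal Gibbs sampling sketch for ungapped motif discovery,
    following steps A-D above (fixed width, one motif per sequence)."""
    rng = random.Random(0)
    # random initial start positions for the motif in every sequence (step A)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    background = {a: 1.0 / len(alphabet) for a in alphabet}  # simplifying assumption

    def column_freqs(exclude):
        # motif column frequencies estimated from all sequences except `exclude`
        freqs = [{a: pseudo for a in alphabet} for _ in range(width)]
        for k, (seq, st) in enumerate(zip(seqs, starts)):
            if k == exclude:
                continue
            for j in range(width):
                freqs[j][seq[st + j]] += 1
        for col in freqs:
            total = sum(col.values())
            for a in alphabet:
                col[a] /= total
        return freqs

    for _ in range(n_iter):
        k = rng.randrange(len(seqs))            # leave one sequence out
        freqs = column_freqs(k)                 # step A
        seq = seqs[k]
        weights = []                            # step B: motif/background ratios
        for s in range(len(seq) - width + 1):
            w = 1.0
            for j in range(width):
                w *= freqs[j][seq[s + j]] / background[seq[s + j]]
            weights.append(w)
        # step C: pick a new start position at random, biased by the weights
        starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

After many iterations the sampled start positions concentrate on consistent motif locations, and the column frequencies computed from them stop changing, which is the stopping behaviour described on the previous slide.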