Title: V2 - Predicting TM helices from sequence
1 V2 - Predicting TM helices from sequence
- Review
- 20 - 25% of all genes code for transmembrane proteins
- High energetic cost of dehydrating the peptide bond
- The amino acid side chains in the TM region must be non-polar
- The polar peptide bond must self-satisfy its H-bonding potential
- Therefore, polypeptide chains form alpha-helices or beta-sheets in the hydrophobic core of the membrane.
Jones, Bioinformatics 23, 538 (2007)
2 TM protein topology prediction
- Relies on two major topological features
- TM helices are generally formed by hydrophobic stretches
- Gunnar von Heijne observed a bias towards positively charged residues in the regions flanking the hydrophobic stretches, especially on the intracellular side of the membrane
- This feature has been termed the positive-inside rule.
- Short loops are found to be enriched with Lys and Arg residues on the intracellular side and depleted on the outside.
Jones, Bioinformatics 23, 538 (2007)
3 TM prediction: Kyte-Doolittle hydrophobicity scale (1982)
Assign a hydropathy value to each amino acid. Use
a sliding window to identify membrane regions.
Sum the hydrophobicity scale over all w
residues in the window of length w. Use a
threshold T to assign a segment as a predicted
membrane helix. A window of w = 19 residues could
best discriminate between membrane and globular
proteins. A threshold of T > 1.6 was suggested for
the average over 19 residues.
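The sliding-window rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sequence is assumed to contain only the 20 standard amino acids, and windows are reported by their start index rather than their centre.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_averages(seq, w=19):
    """Average hydropathy of every window of length w (hypothetical helper)."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + w]) / w for i in range(len(seq) - w + 1)]

def tm_window_starts(seq, w=19, threshold=1.6):
    """Start indices of windows whose average hydropathy exceeds the threshold."""
    return [i for i, avg in enumerate(window_averages(seq, w)) if avg > threshold]
```

Overlapping windows above the threshold would still have to be merged into one predicted segment; that step is left out here.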
4 First prediction of TM topology: TopPred
TopPred (von Heijne 1992) predicts the complete
topology of membrane proteins by using
- hydrophobicity analysis
- automatic generation of possible topologies
- ranking these topologies by the positive-inside rule.
TopPred uses a sliding trapezoid
window to detect segments of outstanding
hydrophobicity. The two bases of the trapezoid
are 11 and 21 residues long. Whether a candidate
segment is accepted as a TM helix is decided by
the topology that yields the optimal difference
between the number of positively charged residues
on the inside and on the outside.
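A trapezoid window can be pictured as a weighting profile that is flat over the 11 central residues and falls off linearly over the flanks, up to the full 21 residues. The sketch below is an assumption about the exact ramp shape and normalisation, not TopPred's published weights; the function names and the `scale` argument (any per-residue hydrophobicity dictionary) are hypothetical.

```python
def trapezoid_weights(full=21, core=11):
    """Flat weights over the core, linear ramps on both flanks, normalised to sum 1.

    Assumes full - core is even so that both flanks have the same length.
    """
    flank = (full - core) // 2
    ramp = [(k + 1) / (flank + 1) for k in range(flank)]
    weights = ramp + [1.0] * core + ramp[::-1]
    total = sum(weights)
    return [x / total for x in weights]

def trapezoid_profile(seq, scale, full=21, core=11):
    """Weighted hydrophobicity for every window start position."""
    weights = trapezoid_weights(full, core)
    values = [scale[aa] for aa in seq]
    return [
        sum(wt * v for wt, v in zip(weights, values[i:i + full]))
        for i in range(len(seq) - full + 1)
    ]
```

Compared with a rectangular window, residues near the window edges contribute less, so the ends of a hydrophobic stretch are weighted more softly.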
5 MEMSAT uses dynamic programming
MEMSAT (Jones, 1994) implemented statistical
tables (log likelihoods) compiled from
well-characterized TM proteins and a dynamic
programming algorithm to recognize membrane
topology models by expectation maximization.
Expectation maximization searches for the
model that best explains the given data. Given a
function that calculates the total probability
for the match of a given model with a given
sequence, the resulting model from expectation
maximization should correspond to the maximum of
this function.
MODEL DEFINITION
The first requirement for expectation maximization
is the definition of a model. In the case of
transmembrane prediction such a model includes
parameters for
- the number of membrane-spanning segments, n,
- the topology, t (N-terminus in or out),
- the length, l, and the location, i, in the sequence of each segment.
6 MEMSAT structural states
Residues are classified as being in one of 5
structural states:
- Li: inside loop
- Lo: outside loop
- Hi: inside helix end
- Hm: helix middle
- Ho: outside helix end
Helix end caps are arbitrarily defined to span
4 adjacent residues (one helical turn).
Idea:
(1) Compile propensities of amino acids for the 5 states.
(2) Calculate a score relating a given sequence to a predicted topology.
(3) Find the optimal score by dynamic programming.
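To make the five states concrete, the following sketch labels every residue of a sequence given a topology model. The representation of a model as a list of `(start, length)` helices plus an `n_term_inside` flag, and the function name, are assumptions made for illustration only.

```python
def label_states(m, helices, n_term_inside=True, cap=4):
    """Return one state label (Li, Lo, Hi, Hm, Ho) per residue.

    m is the sequence length; helices is a sorted list of (start, length)
    pairs with 0-based starts. Each helix gets a cap of `cap` residues at
    both ends, so its length must be at least 2 * cap.
    """
    labels = []
    inside = n_term_inside
    pos = 0
    for start, length in helices:
        if length < 2 * cap:
            raise ValueError("helix too short for two end caps")
        # Loop leading up to this helix lies on the current side.
        labels += ["Li" if inside else "Lo"] * (start - pos)
        first, last = ("Hi", "Ho") if inside else ("Ho", "Hi")
        labels += [first] * cap + ["Hm"] * (length - 2 * cap) + [last] * cap
        inside = not inside  # each helix crosses the membrane once
        pos = start + length
    labels += ["Li" if inside else "Lo"] * (m - pos)
    return labels
```

For example, `label_states(30, [(5, 20)])` describes a single helix whose N-terminal loop is inside and whose C-terminal loop is outside.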
7 MEMSAT propensities
For each of the 5 structural classes, log
likelihood ratios for each of the 20 amino acids
were calculated as
$$s_i = \ln\left(\frac{q_i}{p_i}\right)$$
where p_i is the relative frequency of occurrence
(or fraction) of amino acid i in all the
sequences in the data set, and q_i is the relative
frequency of occurrence of amino acid i in a
particular structural class. A positive score
indicates a higher than expected frequency for a
given amino acid to be found in a particular
structural class and a negative score a lower
than expected frequency. To circumvent the
problem of classifying globular domains as loops,
loops longer than 100 residues are not classified
as loops, and are ignored in the calculation of
the q_i values. These oversized loops are, however,
included in the calculation of the overall
relative frequencies of occurrence p_i.
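As a small illustration, a log likelihood ratio of this kind can be computed from two pools of residues: all residues in the data set, and the residues assigned to one structural class. The natural logarithm, the function name and the plain-string inputs are assumptions of this sketch; excluding loops longer than 100 residues from the class pool is left to the caller.

```python
import math
from collections import Counter

def log_likelihood_ratios(all_residues, class_residues):
    """Return {amino acid: ln(q / p)} for amino acids seen in the class.

    p is the fraction of the amino acid among all residues, q its fraction
    within the structural class. Amino acids absent from the class are
    skipped to avoid taking the logarithm of zero.
    """
    p_counts = Counter(all_residues)
    q_counts = Counter(class_residues)
    n_all, n_class = len(all_residues), len(class_residues)
    return {
        aa: math.log((q_counts[aa] / n_class) / (p_counts[aa] / n_all))
        for aa in q_counts
    }
```

With a toy data set `"LLKKLLKK"` and a class pool `"LLLK"`, Leu gets a positive score and Lys a negative one, matching the interpretation given above.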
8 Memsat
9 Memsat
10 Memsat Topology prediction
Using the propensities shown in Figures 2 and 3,
it is possible to calculate a score relating to
the compatibility of a given sequence with a
given topology and secondary structure. We need
to find the structural model with the best
score. For a sequence of length m, and a given
transmembrane topology (n, t), there are
approximately 9^n ((m - 21n) / n)^n possible
models. E.g. for a 7-helix TM topology and a
sequence of length 250, this gives ca. 7 × 10^14
different models. Clearly a brute-force approach
is inappropriate.
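The estimate is easy to check numerically. The function below simply evaluates 9^n ((m - 21n) / n)^n; its name is hypothetical.

```python
def approx_model_count(m, n):
    """Approximate number of models for n helices in a sequence of length m."""
    return 9 ** n * ((m - 21 * n) / n) ** n

# Example from the slide: 7 helices in 250 residues, roughly 7e14 models.
count = approx_model_count(250, 7)
```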
11 Memsat Topology prediction
Despite the apparent complexity of the problem,
the score for a particular residue depends solely
on the identity of the residue and its
structural environment (Li, Lo, Hi, Hm, or Ho).
Therefore, as a result of this single
dimensionality, it is straightforward to
formulate a dynamic programming solution to the
problem, which ensures that the global
optimum model is found every time. The
overall problem of determining the optimal
position and length of n TM helices in a sequence
of length m is divided into n subproblems:
determine the optimal position and length of a
single TM helix along with its associated
C-terminal coil segment.
12 Memsat Topology prediction
s_il: score associated with a TM helix of length
l at position i in the given sequence. This
score is calculated according to the diagram
shown in Figure 1, where the helix is divided
into three sections (two caps of length 4, and a
center region of length l - 8). Whether the cap
and its associated loop are inside or outside
depends on the initially specified membrane
topology. In order to find the best set of
s^i_lj, MEMSAT uses a recursive algorithm almost
identical to the algorithms used for pairwise
sequence alignment (Needleman & Wunsch, 1970).
13 Insertion: Needleman-Wunsch algorithm
- Trace-back yields the best alignment in the matrix.
- Start in the bottom right corner and follow the arrows to the top left corner.
- This gives the best alignment of these two sequences:
COELACANTH
-PELICAN--
        C   O   E   L   A   C   A   N   T   H
    0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
P  -1  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
E  -2  -2  -2  -1  -2  -3  -4  -5  -6  -7  -8
L  -3  -3  -3  -2   0  -1  -2  -3  -4  -5  -6
I  -4  -4  -4  -3  -1  -1  -2  -3  -4  -5  -6
C  -5  -3  -4  -4  -2  -2   0  -1  -2  -3  -4
A  -6  -4  -4  -5  -3  -1  -1   1   0  -1  -2
N  -7  -5  -5  -5  -4  -2  -2   0   2   1   0
Lecture 2, winter semester 2007/08
Softwarewerkzeuge der Bioinformatik (Software Tools for Bioinformatics)
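The matrix on this slide is consistent with a scoring scheme of +1 for a match and -1 for a mismatch or gap. That scheme is inferred from the numbers, not stated on the slide. Under that assumption, a minimal Needleman-Wunsch sketch with trace-back looks like this (function name hypothetical):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a (columns) against b (rows).

    Returns the filled score matrix and the two aligned strings.
    """
    rows, cols = len(b) + 1, len(a) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        F[0][j] = j * gap
    for i in range(rows):
        F[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[j - 1] == b[i - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align two residues
                          F[i - 1][j] + gap,      # gap in a
                          F[i][j - 1] + gap)      # gap in b
    # Trace back from the bottom right corner, preferring diagonal moves.
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[j - 1] == b[i - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            top.append(a[j - 1]); bottom.append(b[i - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            top.append(a[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(b[i - 1]); i -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))
```

Called as `needleman_wunsch("COELACANTH", "PELICAN")`, the sketch ends with a score of 0 in the bottom right cell and, with ties resolved in favour of diagonal moves, returns the alignment COELACANTH / -PELICAN--.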
14 Memsat dynamic programming approach
Define a matrix: number of TM segments × length of
total sequence. A third parameter is also needed:
the length of every TM helix. The pathway with the
highest score will then contain the correct number
of TM helices, each with the correct length, at the
correct position.
Expectation maximization aspect: optimize the log
likelihood scores for the residue propensities.
15 Memsat Topology prediction
Define a score matrix S_ij (i = 1 ... n, j = 1 ... m)
as
$$S_{ij} = \begin{cases} \max_{l} \left[ s^{i}_{lj} + \max_{k \ge j + l + A} S_{i+1,k} \right] & i < n \\ \max_{l} s^{i}_{lj} & i = n \end{cases}$$
where A is the minimum length of a loop
segment. In this example, i varies from 1 to 3
helices, and the position j runs up to 115
minus the length of a TM helix. s^i_lj: score of
TM helix number i of length l at position j in
the given sequence. The second maximum considers
the following helices.
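A recurrence of this kind can be sketched as follows. The sketch assumes that S for helix i at position j is the best helix score over all lengths l plus the best S for helix i + 1 at any position at least `min_loop` residues after the helix end. The function name, the caller-supplied `score(i, l, j)` callback and the default length range and loop length are all hypothetical choices for illustration, not MEMSAT's actual parameters.

```python
NEG_INF = float("-inf")

def best_helix_placement(n, m, score, lmin=17, lmax=25, min_loop=1):
    """Return (best score, [(start, length), ...]) for n helices in m residues.

    score(i, l, j) scores helix number i (1-based) of length l starting at
    0-based position j; it can encode the in/out topology through i.
    """
    S = [[NEG_INF] * m for _ in range(n + 2)]
    choice = [[None] * m for _ in range(n + 2)]
    for i in range(n, 0, -1):
        # tail[k] = (best S[i+1][k'] over k' >= k, position of that best value)
        tail = [(NEG_INF, None)] * (m + 1)
        if i < n:
            for k in range(m - 1, -1, -1):
                here = (S[i + 1][k], k)
                tail[k] = here if here[0] > tail[k + 1][0] else tail[k + 1]
        for j in range(m):
            for l in range(lmin, min(lmax, m - j) + 1):
                total, nxt = score(i, l, j), None
                if i < n:
                    rest, nxt = tail[min(j + l + min_loop, m)]
                    total += rest
                if total > S[i][j]:
                    S[i][j], choice[i][j] = total, (l, nxt)
    start = max(range(m), key=lambda j: S[1][j])
    if S[1][start] == NEG_INF:
        raise ValueError("sequence too short for the requested number of helices")
    best, helices, j = S[1][start], [], start
    for i in range(1, n + 1):
        l, nxt = choice[i][j]
        helices.append((j, l))
        j = nxt
    return best, helices
```

Because the best continuation for each position is precomputed in `tail`, the work grows with n × m × (number of allowed lengths) instead of with the number of possible models.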
16 Memsat Topology prediction
Having computed the score matrix S, the highest
value in the column for helix 1 (i = 1) is the
score for the best path through the matrix, which
represents the optimal lengths and positions of
n TM helices in the given sequence. The highest
value in column 2 is the optimal path score for
n - 1 helices, but with inverted topology, and
this can be extended to the other columns.
17 Memsat Topology prediction
In this way, only two score matrices need to be
calculated to evaluate all possible membrane
topologies for a given case: one with helix 1
(column 1) defined with the N-terminus on the
inside, and the other with the helix 1 N-terminus
on the outside. If we calculated two matrices
for n = 7, one matrix would therefore provide
optimal paths for topologies +7, -6, +5, -4, +3,
-2, and +1, and the other would provide paths for
-7, +6, -5, +4, -3, +2, and -1 (where +ve
indicates the N-terminus inside). The
appropriate score for the N-terminal loop must be
added to the appropriate matrix values.
18 Memsat Topology prediction
19 MEMSAT3 (Jones, 2007)
Replace the log likelihood propensities by the
prediction of a neural network classifier similar
to PSIPRED (also developed by Jones). Use a
feed-forward neural network comprising 399 inputs
(19 × 21), 15 hidden units and 4 output units.
The input is a sequence window of 19 residue
positions with 21 inputs per residue. This encoding
is the same as that used by the PSIPRED secondary
structure prediction method (Jones, 1999), though
with a slightly longer window in this case. A
number of different output encodings were
considered, but it was decided that a minimal
encoding of just four outputs would be preferable
for optimal neural network training. The four
output targets are cytoplasmic (O_in),
non-cytoplasmic (O_out), transmembrane segment
(O_tm) and signal peptide (O_sig). Optimize the
neural network weights on a training data set.
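The input layout of 19 positions × 21 inputs can be illustrated with a simple encoder. As a loudly stated simplification, the sketch uses a one-hot vector for each residue, whereas a profile-based method would feed alignment-derived values into the 20 amino acid inputs. The use of the 21st input to flag window positions beyond the chain ends is likewise an assumption of this sketch, as are the names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(seq, center, window=19):
    """Return a flat list of window * 21 inputs centred on residue `center`."""
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        column = [0.0] * 21
        if 0 <= pos < len(seq):
            column[AMINO_ACIDS.index(seq[pos])] = 1.0  # one-hot stand-in
        else:
            column[20] = 1.0  # window overhangs a chain terminus
        features.extend(column)
    return features
```

The resulting list has 19 × 21 = 399 entries, one per network input.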
20 MEMSAT3 (Jones, 2007)
To calculate the most probable topology based on
the neural network output, the MEMSAT dynamic
programming algorithm (Jones et al., 1994) was
used. In the original MEMSAT method, five
different regions of a TM protein were
defined. At every position in the target
sequence, the four neural network outputs were
combined to generate scores for the
MEMSAT topology search,
where O_x represents the raw neural network output
x.
21 MEMSAT3 (Jones, 2007)
MEMSAT3 easily predicts long TM helices, but has
problems with short, half-spanning
helices. Also, the recognition of signal
peptides should be improved further.
22 Using evolutionary information
It is known from predicting secondary structures
of globular proteins that using multiple sequence
alignment information improves prediction
accuracy significantly. PHDtm predicts the location
and topology of TM helices with a system of neural
networks. It was later combined with dynamic
programming.
23 Using grammatical rules
The lipid bilayer constrains the structure of the
membrane-passing regions of proteins in many
ways. TMHMM (Sonnhammer et al. 1998, Krogh et al.
2001) and HMMTOP (Tusnády & Simon 1998, 2001)
implement Hidden Markov Models.
24 Using grammatical rules
TMHMM uses a cyclic model with 7 states for
- TM helix core
- TM helix caps on the N- and C-terminal side
- non-membrane region on the cytoplasmic side
- 2 non-membrane regions on the non-cytoplasmic side (for short and long loops, to account for different membrane insertion mechanisms)
- a globular domain state in the middle of each non-membrane region
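The idea of a cyclic model decoded by dynamic programming can be shown with a deliberately tiny example. The sketch below is not the TMHMM architecture: it uses only four states (inside loop, helix going in to out, outside loop, helix going out to in), three residue classes, and made-up probabilities chosen purely for illustration. Decoding uses the standard Viterbi algorithm in log space.

```python
import math

STATES = ["i", "Mio", "o", "Moi"]  # inside loop, helix in->out, outside loop, helix out->in
START = {"i": 0.5, "o": 0.5}
TRANS = {  # cyclic grammar: i -> Mio -> o -> Moi -> i
    "i": {"i": 0.9, "Mio": 0.1},
    "Mio": {"Mio": 0.9, "o": 0.1},
    "o": {"o": 0.9, "Moi": 0.1},
    "Moi": {"Moi": 0.9, "i": 0.1},
}
# Toy emission probabilities over hydrophobic (h), positive (+) and other (x).
LOOP_IN = {"h": 0.2, "+": 0.4, "x": 0.4}   # positive-inside bias
LOOP_OUT = {"h": 0.2, "+": 0.1, "x": 0.7}
HELIX = {"h": 0.9, "+": 0.02, "x": 0.08}
EMIT = {"i": LOOP_IN, "Mio": HELIX, "o": LOOP_OUT, "Moi": HELIX}

def residue_class(aa):
    if aa in "AVLIMFWC":
        return "h"
    return "+" if aa in "KR" else "x"

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Most probable state path for a non-empty sequence under the toy model."""
    obs = [residue_class(aa) for aa in seq]
    score = {s: log(START.get(s, 0.0)) + log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + log(TRANS[r].get(s, 0.0)))
            new_score[s] = score[prev] + log(TRANS[prev].get(s, 0.0)) + log(EMIT[s][symbol])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

For the toy sequence `"KKKK" + "L" * 20 + "SSSS"` the decoded path is an inside loop, one helix and an outside loop. The transition table is what enforces the grammar: a helix can only be followed by a loop on the opposite side.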
25 TMHMM types of errors
26 Availability of prediction methods
Many of these servers are also available through
the meta-server META-PP at the site of Burkhard
Rost.
27 Most methods get number of helices right
All methods based on advanced algorithms tend to
underestimate TM helices (obs > prd).
a Data set: sequence-unique subset of 36 high-resolution TM helical proteins from PDB. This is the largest subset of all 105 high-resolution membrane chains that fulfils the condition that no pair in the set has significant sequence similarity as defined in Rost (1999).
b Methods
c Per-segment accuracy: Qok, percentage of proteins for which all TM helices are predicted correctly (allowed deviation of up to 3 residues); Qobs_htm, percentage of all observed helices that are correctly predicted; Qprd_htm, percentage of all predicted helices that are correctly predicted; TOPO, percentage of proteins for which the topology (orientation of helices) is correctly predicted (empty for methods that do not predict topology).
d Per-residue accuracy: Q2, percentage of correctly predicted residues in two states (membrane helix / non-membrane helix); Qobs_2T, percentage of all observed TMH residues that are correctly predicted; Qprd_2T, percentage of all predicted TMH residues that are correctly predicted; Qobs_2N, percentage of all observed non-TMH residues that are correctly predicted; Qprd_2N, percentage of all predicted non-TMH residues that are correctly predicted.
e ERROR: the estimates for per-segment accuracy resulted from a bootstrap experiment with M = 100 and K = 18; the estimates for per-residue accuracy were obtained by standard deviations over Gaussian distributions for the respective score.
f Numbers in italics: two standard deviations below the numerically highest value in each column (set in bold letters).
NOTE: all methods are tested on the same set of proteins. However, the numbers are NOT from a cross-validation experiment, i.e. some methods may have used some of the proteins for training. Generally, newer methods are more likely to be overestimated than older ones. In particular, HMMTOP2, TMHMM1, and WW have been developed using ALL the proteins listed here.
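The per-residue measures defined in footnote d can be computed directly from an observed and a predicted state string. In this sketch the strings use `T` for TM helix residues and `N` for everything else; that encoding, the function name and the dictionary keys are assumptions for illustration.

```python
def per_residue_accuracy(observed, predicted):
    """Return Q2 and the observed/predicted accuracies for both states, in percent."""
    if len(observed) != len(predicted):
        raise ValueError("state strings must have equal length")
    pairs = list(zip(observed, predicted))
    tp = sum(o == "T" and p == "T" for o, p in pairs)  # helix predicted as helix
    tn = sum(o == "N" and p == "N" for o, p in pairs)
    fp = sum(o == "N" and p == "T" for o, p in pairs)  # overpredicted helix residue
    fn = sum(o == "T" and p == "N" for o, p in pairs)  # missed helix residue

    def pct(part, whole):
        return 100.0 * part / whole if whole else float("nan")

    return {
        "Q2": pct(tp + tn, len(pairs)),
        "Qobs_2T": pct(tp, tp + fn),
        "Qprd_2T": pct(tp, tp + fp),
        "Qobs_2N": pct(tn, tn + fp),
        "Qprd_2N": pct(tn, tn + fn),
    }
```

Reporting both the observed-based and the predicted-based values matters because a method that predicts too few helix residues can still score well on Qprd_2T while missing much of what is observed.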
28 Future directions
- Meta servers yield improved predictions. > 90% correct topologies can be obtained by a simple majority vote between the results of various methods.
- TM helix prediction and signal peptide prediction should be combined.
- Useful: databases for particular families of TM proteins and sequence motifs, e.g. GPCR database.
- Membrane-specific substitution matrices improve database searches, e.g. PHAT by Henikoff & Henikoff: improved alignments of TM proteins.
- Account for helix-helix interactions.
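A simple majority vote over topology predictions might look like the sketch below. Representing a topology as a pair `(number of helices, N-terminus inside)` and the method labels in the example are assumptions; a real meta server would also have to reconcile helix positions.

```python
from collections import Counter

def majority_topology(predictions):
    """Return the topology predicted by more than half of the methods, else None.

    predictions maps a method label to a hashable topology description,
    here assumed to be (number of TM helices, N-terminus inside?).
    """
    if not predictions:
        return None
    topology, votes = Counter(predictions.values()).most_common(1)[0]
    return topology if votes > len(predictions) / 2 else None

# Hypothetical example with three unnamed methods.
consensus = majority_topology({
    "method_a": (7, True),
    "method_b": (7, True),
    "method_c": (6, False),
})
```

Returning `None` when no topology reaches a majority keeps the caller from treating a split vote as a confident prediction.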
29 TopPredΔG
Use ΔG values from the Hessa predictor and a variant
of the TopPred method. Briefly, a sliding window
of fixed length (l = 21 residues) is scanned
across the protein sequence, and ΔG_app values are
calculated for each sequence position.
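$$\Delta G_{app}^{pred} = \sum_{i=1}^{l} \Delta G_{app}^{aa(i)} + c_0 \sqrt{\left(\sum_{i=1}^{l} \Delta G_{app}^{aa(i)} \sin(100^\circ i)\right)^2 + \left(\sum_{i=1}^{l} \Delta G_{app}^{aa(i)} \cos(100^\circ i)\right)^2} + c_1 + c_2 l + c_3 l^2$$
Here l is the length of the TM segment, ΔG_app^aa(i)
is the contribution of amino acid aa in position
i, and c_0 to c_3 are fitted constants. The
expression under the square root is the
hydrophobic moment.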
Bernsel. PNAS 105, 7177 (2008)
30 TopPredΔG
(1) All minima < ΔG_low are identified and marked
as certain TM segments. All minima above
ΔG_low but below a second, higher cutoff value
(ΔG_high) are marked as putative TM segments.
(2) All possible topologies, including all
certain TM segments and either including or
excluding each of the putative TM segments, are
generated, and the topology that best complies
with the positive-inside rule is chosen as the
final prediction. The parameters ΔG_low and
ΔG_high were optimized over a benchmark set of
known transmembrane topologies.
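Step (2) can be sketched as an exhaustive enumeration. In this simplified illustration every Lys and Arg in every loop is counted, segments are half-open `(start, end)` index pairs that do not overlap, and the score is the difference in positive charges between inside and outside loops. These choices, and the function name, are assumptions; the published method applies further rules that are not reproduced here.

```python
from itertools import product

def best_topology(seq, certain, putative):
    """Return (charge bias, n_term_inside, segments) of the best topology.

    certain and putative are lists of (start, end) segments. Every subset of
    the putative segments is tried together with both orientations.
    """
    best = None
    for keep in product([False, True], repeat=len(putative)):
        segments = sorted(certain + [seg for seg, k in zip(putative, keep) if k])
        loops, prev = [], 0
        for start, end in segments:
            loops.append(seq[prev:start])
            prev = end
        loops.append(seq[prev:])
        charges = [sum(aa in "KR" for aa in loop) for loop in loops]
        even, odd = sum(charges[0::2]), sum(charges[1::2])
        # Loops alternate sides; even-numbered loops share the N-terminal side.
        for n_term_inside, bias in ((True, even - odd), (False, odd - even)):
            if best is None or bias > best[0]:
                best = (bias, n_term_inside, segments)
    return best
```

With k putative segments there are 2^k subsets and two orientations each, which stays cheap because only a handful of segments typically fall between the two cutoffs.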
Bernsel. PNAS 105, 7177 (2008)
31 TopPredΔG
Bernsel. PNAS 105, 7177 (2008)
32 TopPredΔG
Multi-sequence results are obtained with input
from multiple sequence alignments. TopPredΔG
works as well as the best statistical
methods that include hundreds of optimized
parameters.
Bernsel. PNAS 105, 7177 (2008)
33 TopPredΔG
Generally, the missed helices have both
higher ΔG_app values (4.4 kcal/mol vs. 0.76
kcal/mol) and a higher fraction of surface area
in contact with the surrounding protein (67%
buried vs. 54% buried surface area) than found
for the complete dataset. There is a strong
tendency for highly exposed helices to have lower
ΔG_app values (Fig. 2A), indicating that such
helices need to be able to insert efficiently by
themselves, in the absence of stabilizing
interactions with surrounding protein.
Bernsel. PNAS 105, 7177 (2008)
34 TopPredΔG
Fig. 2A shows that a good part of the surface of
the high-ΔG_app helices is buried already within
the same polypeptide chain. However, the mean
ΔG_app for the most exposed group of helices
(0-20% buried) is considerably higher when
considering area buried against the chain than
against the whole protein complex, indicating
that a number of helices with relatively high
ΔG_app are efficiently buried (>20%) only upon
oligomerization.
Bernsel. PNAS 105, 7177 (2008)
35 TopPredΔG
Along the same line of thought, there should be more
opportunities for helix-helix interactions in
proteins containing many TM helices, and such
helices might thus be expected to be more polar
on average. Indeed, the mean ΔG_app increases with
the number of TM helices in the protein (Fig.
2B). Among the overpredicted helices, more than
half are reentrant regions, i.e., they partly
penetrate the membrane but enter and exit from
the same side.
Bernsel. PNAS 105, 7177 (2008)
36 Summary
TM helices are typically continuous stretches of
mostly hydrophobic residues. Simple methods that
sum up hydrophobicities work acceptably but not
really well. Advanced methods include additional
features such as the positive-inside rule. The
currently most successful methods are based on
Hidden Markov Models or Neural Networks.
Evaluating prediction accuracy should be done using
carefully separated training and test sets. It
is possible to discriminate signal peptides and
TM helices, e.g. with Octopus. The new method
TopPredΔG utilizes experimental insertion free energies.