Title: Predicting cellular localization
1Predicting cellular localization
2Eukaryotic protein localization
3Why localize?
- Subcellular localization is a key functional
characteristic of proteins.
- To co-operate for a common physiological function
(metabolic pathway, signal transduction cascade,
structural associate etc.), proteins must be
localized in the same cellular compartment.
http://mendel.imp.univie.ac.at/CELL_LOC/
4Correct Localization is required for
pathway/complex formation
- A set of many co-operating proteins is
responsible for a physiological function
(metabolic pathway, signal transduction cascade,
structural associate etc.).
- Subcellular localization is an essential
characteristic for this level.
- For proper functioning, the protein has to be
translocated to the correct intra- or
extracellular compartments in a soluble form or
attached to a membrane.
http://mendel.imp.univie.ac.at/CELL_LOC/
5Computer-Aided Approaches for the Assignment of
Subcellular Localization
- Automatic, computer-aided selection methods are
clearly the only way to identify interesting
attractive target proteins among the haystack of
new gene sequence data.
- One of the helpful decision criteria is the
probable subcellular localization of the gene
products.
- For example, in a search for virulence factors of
pathogenic bacteria or easily accessible entry
points for pharmaceutical drugs, extracellular
proteins are good candidates.
http://mendel.imp.univie.ac.at/CELL_LOC/
6Computer-Aided Approaches for the Assignment of
Subcellular Localization
- For a primary screening of gene sequences, the
first step is a general classification into
intracellular, extracellular, membrane-related
(both with transmembrane regions and with lipid
anchors) and viral proteins.
- In the case of eukaryotes, it is desirable to
specify the intracellular location further with
respect to organelles (mitochondrion,
chloroplast, endoplasmic reticulum and Golgi
apparatus, nucleus).
http://mendel.imp.univie.ac.at/CELL_LOC/
7Predicting subcellular localization by homology
with characterized proteins
- Subcellular localization can often be assigned by
searching for homologous sequences.
- This is an easy task for a few new proteins but
very difficult for thousands of sequences
contained in new genomes.
- Even with the most advanced retrieval systems and
relying on the well-annotated SWISS-PROT, it is
impossible to get exhaustive classifications with
respect to subcellular localization.
http://mendel.imp.univie.ac.at/CELL_LOC/
8Prediction method 2: analysis of sequence
properties
- First attempts to classify proteins with respect
to cellular localization based on amino acid
sequence properties: Nishikawa and Ooi (J.Biochem.
1982)
- Amino acid composition, disulphide bonds, the
secondary structural class related to function
and localization
- Early results were promising, but based on a
small sample.
http://mendel.imp.univie.ac.at/CELL_LOC/
9Prediction by signal peptide detection
- Some proteins have sequence signals that
determine their translocation to organelles or
outside the cell
- Claros et al. Curr.Op.Struct.Biol. (1997).
- These patterns are not clear cut, especially for
the intracellular organelle targeting peptides
- Prediction accuracy is limited
- Nielsen et al. Prot.Eng. (1997) v.10, 1
- Combinations of compositional and signal sequence
analyses have been used in expert systems for the
prediction of cellular localization
- Nakai and Kanehisa, Genomics (1992)
- In general not systematic and not rigorously
tested
http://mendel.imp.univie.ac.at/CELL_LOC/
10Extracting information from sequence
- Signal peptides: short sequences in the protein
used to target the protein for specific cellular
compartments.
- Signal patches (clusters of amino acids in close
proximity in 3D structure, but distant in primary
sequence) are also found
- Examination of amino acids at structure surface
can be particularly helpful: subtle preferences
of different amino acids for different
environments
11Trans-membrane helix prediction
13Helical membrane proteins
- Key components in cell-cell signalling
- Mediate transport of ions and solutes across
membrane
- Crucial for recognition of self
- Major class of drug targets
- More than 50% of prescription drugs act on GPCRs
(G-protein coupled receptors)
- Multi-billion dollar industry
14Many predicted, few known
- Solved structures available for very few membrane
proteins
- Predicted 10K helical membrane proteins in human
genome (25% of genome!)
Chen and Rost, 2002
15Helical membrane proteins challenge bioinformatics
- Very little info about 3D structures
- Very hard to crystallize
- Hardly traceable by nuclear magnetic resonance
(NMR) spectroscopy
- Relatively easy to identify (rough) location of
helices through low-resolution experiments
- C-terminal fusion with indicator proteins
- Antibody binding
Chen and Rost, 2002
16Concepts for predicting TM helix location and
topology
- Hydrophobicity scales provide simple criteria for
prediction
- TM helices are predominantly non-polar
- TM helix length between 12 and 35 aa
- Globular regions between membrane helices
typically shorter than 60 aa
- Positive-inside rule (von Heijne)
- Connecting loop regions on inside have more
positive charge than loop regions on outside
Chen and Rost, 2002
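The positive-inside rule above can be illustrated with a short Python sketch. It assumes the helix positions are already known (for example from a hydrophobicity scan) and simply compares the number of Lys/Arg residues in the loops on either side of the membrane. The function name, the span format and the choice to count every loop in full are assumptions made for illustration; this is not the procedure of any published program.

```python
def predict_orientation(seq, helices):
    """Guess the side of the N-terminus from K/R counts in the loops.

    helices: sorted list of (start, end) spans, 0-based, end-exclusive.
    Loops alternate sides of the membrane, so loops 0, 2, 4, ... lie on
    the N-terminal side and loops 1, 3, 5, ... on the opposite side.
    """
    # Cut the sequence into the loops that flank and connect the helices.
    bounds = [0] + [pos for span in helices for pos in span] + [len(seq)]
    loops = [seq[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds), 2)]

    # Count positively charged residues (Lys, Arg) in each loop.
    positives = [sum(loop.count(aa) for aa in "KR") for loop in loops]
    n_side = sum(positives[0::2])
    other_side = sum(positives[1::2])

    if n_side == other_side:
        return "undetermined"
    # The side carrying more positive charge is taken to be the inside.
    return "N-in" if n_side > other_side else "N-out"


if __name__ == "__main__":
    toy = "MKRKK" + "L" * 20 + "DDSE" + "L" * 20 + "RKRGK"
    print(predict_orientation(toy, [(5, 25), (29, 49)]))
```

The toy sequence is artificial; on real proteins the charge difference is often small, which is one reason topology predictions are less reliable than helix predictions.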
17Hydrophobicity scales
- Kyte and Doolittle (1982)
- Hydropathy scale, moving window approach
- Window of 19 residues discriminated best between
membrane and globular
- Other work equally successful
- Drawback: methods fail to discriminate between
membrane regions and highly hydrophobic globular
segments
Chen and Rost, 2002
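A minimal Python sketch of the moving-window hydropathy approach described on this slide. The scale values are the Kyte and Doolittle hydropathy indices and the 19-residue window follows the slide; the cutoff of 1.6 and the function names are assumptions chosen for illustration. As the slide notes, a scan like this cannot tell a TM helix from a very hydrophobic globular segment.

```python
# Kyte-Doolittle hydropathy indices for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence.

    Entry i is the mean over residues i .. i + window - 1.
    Raises KeyError for non-standard residue codes.
    """
    values = [KD[aa] for aa in seq.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    profile = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        profile.append(total / window)
    return profile


def candidate_tm_windows(seq, window=19, cutoff=1.6):
    """0-based start positions of windows whose mean exceeds the cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h > cutoff]


if __name__ == "__main__":
    toy = "D" * 20 + "L" * 19 + "K" * 20
    print(candidate_tm_windows(toy))
```

Overlapping candidate windows would still need to be merged into helix segments before they could be compared with observed helices.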
18Other clues
- Amino acid preferences for membrane and
non-membrane proteins
- Training data for methods derived from proteins
identified as containing TM helices, as well as
other secondary structure types
- Higher accuracy
Chen and Rost, 2002
19Including topology helps
- TopPred (von Heijne, 1992)
- Topology prediction, using hydrophobicity
analysis, possible topologies ranked by
positive-inside rule
- SOSUI (Hirokawa et al, 1998)
- Combined KD hydropathy, amphiphilicity, relative
and net charges, protein length
Chen and Rost, 2002
20Including homology helps
- Alignment of homologs known to help secondary
structure prediction (Rost and Sander, 1993)
- Note: for 20-30% of proteins in any genome, no
identifiable homologs can be found!
- PHDhtm: first method using homology info for
membrane prediction
- Uses neural networks, dynamic programming (DP),
multiple alignment
- One of the most accurate prediction methods
Chen and Rost, 2002
21Including homology helps
- TMAP (Persson and Argos, 1996)
- Derived amino acid propensities from known TMs
- 4-residue caps of membrane helices
- 21-residue TM segments
- Found at outside of membrane: N D G F P W Y V
- Found mostly inside: A R C K
- Used these propensities to improve prediction
Chen and Rost, 2002
22Grammatical rules
- TMHMM pioneered building models of predicted
membrane proteins in one consistent methodology
- Sonnhammer et al 1998, Krogh et al 2001
- Similar concept implemented in HMMTOP
- Tusnady and Simon, 1998
- MEMSAT similar to HMMTOP
- Jones et al, 1994
Chen and Rost, 2002
23Topology questions
- The topology of a TM protein indicates its
orientation with respect to the membrane
- Which regions are outside (extracellular) and
which are cytoplasmic
- Predicted topologies turn out to be wrong roughly
as often as they're correct
Chen and Rost, 2002
24Sequence information aiding TM recognition
- Hydrophobic stretches (for lipid bilayer)
- Positive inside rule
- Von Heijne 1986, 1994
- Abundance of positively charged residues
- Improved predictions through use of
- sliding windows
- Multiple alignment
- Neural networks
Chen and Rost, 2002
25Errors in TM prediction
- Under-prediction (False negative)
- Over-prediction (False positive)
- False merge
- two adjacent helices predicted to be one helix
- False split
- One long helix predicted to be two
- Inexact placement of helices
Chen and Rost, 2002
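The error types listed on this slide can be made concrete by comparing observed and predicted helix spans. The Python sketch below is illustrative only: the span format, the function names and the minimum overlap of 3 residues are assumptions, not the evaluation protocol of any particular benchmark. Inexact placement is not scored here.

```python
def overlap(a, b):
    """Number of residues shared by two (start, end) end-exclusive spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def classify_errors(observed, predicted, min_overlap=3):
    """Sort helix-level disagreements into the four error types."""
    errors = {
        "under_prediction": [],  # observed helix with no prediction
        "over_prediction": [],   # predicted helix with no observed helix
        "false_merge": [],       # one prediction covering 2+ observed helices
        "false_split": [],       # one observed helix hit by 2+ predictions
    }
    for obs in observed:
        hits = [p for p in predicted if overlap(obs, p) >= min_overlap]
        if not hits:
            errors["under_prediction"].append(obs)
        elif len(hits) > 1:
            errors["false_split"].append(obs)
    for pred in predicted:
        hits = [o for o in observed if overlap(pred, o) >= min_overlap]
        if not hits:
            errors["over_prediction"].append(pred)
        elif len(hits) > 1:
            errors["false_merge"].append(pred)
    return errors


if __name__ == "__main__":
    observed = [(10, 30), (40, 60), (80, 110), (130, 150)]
    predicted = [(12, 58), (80, 92), (98, 110), (200, 220)]
    for kind, spans in classify_errors(observed, predicted).items():
        print(kind, spans)
```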
26Prediction accuracy (1)
- Performance accuracy overestimated significantly!
- Developers have overrated their methods by
15-50% (Chen et al, unpublished)
- Why do developers overestimate their method
accuracy?
- Validation performed on proteins closely related
to training sequences (and thus not indicative of
performance on novel sequences)
Chen and Rost, 2002
27Prediction accuracy (2)
- Membrane helices are not entirely conserved
across species
- Implies that even related proteins may have
different topologies (number of TM helices,
orientation) and perform different cellular
functions
- N.B. There is no indication that the authors
meant to imply that proteins that are globally
alignable have differences in their TM domain
locations or numbers
- Measures of accuracy of prediction not comparable
across methods, due to lack of standard benchmark
- Benchmark dataset now available at EBI
Chen and Rost, 2002
28Chen et al findings
- Most TM methods get the number of helices right
for most membrane proteins
- 86% of TMH residues predicted by best methods
- 70-75% of proteins get all TM helices predicted
correctly by top methods
- Topology correct for only half of all proteins
Chen and Rost, 2002
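To show what per-residue and per-protein figures like those above measure, here is a small Python sketch. It assumes each protein is represented by two equal-length label strings, with "M" marking a membrane-helix residue and any other character a non-membrane residue; that encoding and the function names are hypothetical. Matching the number of helices is a weaker criterion than predicting every helix correctly, so the second score is only a rough proxy.

```python
import re


def per_residue_recall(observed, predicted):
    """Fraction of observed TM residues that are also predicted as TM."""
    if len(observed) != len(predicted):
        raise ValueError("label strings must have the same length")
    tm_positions = [i for i, label in enumerate(observed) if label == "M"]
    if not tm_positions:
        return None  # undefined for proteins without observed TM residues
    hits = sum(1 for i in tm_positions if predicted[i] == "M")
    return hits / len(tm_positions)


def count_helices(labels):
    """Number of uninterrupted runs of 'M' in a label string."""
    return len(re.findall(r"M+", labels))


def dataset_scores(pairs):
    """Summarize a list of (observed, predicted) label-string pairs."""
    recalls = [per_residue_recall(o, p) for o, p in pairs]
    recalls = [r for r in recalls if r is not None]
    same_count = sum(count_helices(o) == count_helices(p) for o, p in pairs)
    return {
        "mean_residue_recall": sum(recalls) / len(recalls) if recalls else None,
        "fraction_same_helix_count": same_count / len(pairs) if pairs else None,
    }


if __name__ == "__main__":
    pairs = [("--MMMM--MMMM", "--MMM---MMMM"), ("MMMM----", "MM-MM---")]
    print(dataset_scores(pairs))
```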
29Prediction accuracy (4)
- Some papers have claimed that simple
hydrophobicity scales are as accurate as more
sophisticated methods
- Chen et al disagree
Chen and Rost, 2002
30Prediction accuracy (5)
- All methods confuse membrane helices with signal
peptides
- Best separation provided by ALOM2 (Nakai and
Kanehisa)
- Optimized to sort proteins into classes of
sub-cellular localization
Since Rost's paper, the Phobius server was
developed to integrate TM and signal peptide
prediction:
http://www.ebi.ac.uk/Tools/phobius/index.html
Chen and Rost, 2002
31Prediction accuracy (6)
- Most methods wrongly predict membrane helices in
globular proteins
- Most methods overestimate their ability to
distinguish between globular and membrane
proteins
Chen and Rost, 2002
32Emerging and future developments
- Improved prediction by averaging over many
methods (i.e., consensus approaches)
- Promponas and colleagues' CoPreTHi combined 7
methods, requiring 3 to agree
- Nilsson et al, 2000, used 5 methods
- Accuracy correlated with number of methods
agreeing
Chen and Rost, 2002
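A consensus of this kind can be sketched as a per-residue vote. The Python example below assumes each method's output has already been converted to a label string of the same length, with "M" for a predicted membrane residue; the threshold of 3 agreeing methods echoes the slide, but the voting scheme itself is a simplification for illustration and is not claimed to be the algorithm used by CoPreTHi or by Nilsson et al.

```python
def consensus_labels(predictions, min_votes=3):
    """Label a residue 'M' when at least min_votes methods call it TM."""
    lengths = {len(p) for p in predictions}
    if len(lengths) != 1:
        raise ValueError("all predictions must cover the same sequence")
    consensus = []
    # zip(*predictions) walks the sequence one residue (column) at a time.
    for column in zip(*predictions):
        votes = sum(1 for label in column if label == "M")
        consensus.append("M" if votes >= min_votes else "-")
    return "".join(consensus)


if __name__ == "__main__":
    methods = [
        "--MMMM--",
        "-MMMM---",
        "--MMM---",
        "------MM",
        "--------",
    ]
    print(consensus_labels(methods, min_votes=3))
```

The number of agreeing methods at each position could also be kept as a rough confidence value, in line with the observation that accuracy correlates with agreement.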
33Emerging and future developments
Chen and Rost, 2002
- Amphiphilic (aka amphipathic) alpha helix
identification can improve prediction
- Helical-membrane and signal peptide predictions
must be combined explicitly
- Best signal peptide prediction tool is SignalP
(Nielsen et al 1997)
- PSORT, HMMTOP and TMHMM integrate these
predictions
- More thorough combination is still missing
Except, of course, for Phobius, released since
this paper
34Emerging and future developments
- Databases of TM proteins being produced and
curated
- Membrane-specific substitution matrices improve
database search for TM proteins
- Current substitution matrices based on globular
proteins
- Henikoff and Henikoff have a
membrane-helix-specific substitution matrix, PHAT
Chen and Rost, 2002
35Sequence conservation in TM domains
- Residues on helix-helix interface tend to be more
conserved than those facing the lipid bilayer
- Conservation in TM helices greater than in
structurally variable regions but not as
significant as in enzyme active sites and other
functionally critical regions (KS observation)
36More data from structural studies of TM proteins
- Solved membrane protein structures have also
shown that helical propensities are different in
the membrane.
- Glycine and proline, which are thought to be
helix-breakers in soluble proteins, occur in the
transmembrane helices of cytochrome c oxidase
- Tsukihara et al, 1995.
- Studying known structures has revealed that
aromatic residues are often in the bilayer
interface, possibly anchoring the transmembrane
helix in the bilayer
- Pawagi et al, 1994.
37More data from structural studies
- Serine and threonine can satisfy hydrogen bond
donors and acceptors by hydrogen bonding to
backbone carbonyls, making membrane localization
favorable (Engelman et al 1986)
- Analysis of solved membrane proteins shows TM
length ranges from 14 to 36 aa (varying due to
variations in lipid bilayer width)
- Canonical alpha helix prediction methods derived
from soluble proteins are not as effective at
predicting TM-located helices
38TMHMM provides a grammar to parse sequences into
subregions
39TMHMM author findings
- TMHMM correctly predicts 97-98% of the
transmembrane helices.
- TMHMM can discriminate between soluble and
membrane proteins with both specificity and
sensitivity better than 99%
- although the accuracy drops when signal peptides
are present
- This high degree of accuracy allowed authors to
predict reliably integral membrane proteins in a
large collection of genomes.
- Based on these predictions, authors estimate that
20-30% of all genes in most genomes encode
membrane proteins
- which is in agreement with previous estimates.
- Proteins with Nin-Cin topologies are strongly
preferred in all examined organisms
- except Caenorhabditis elegans, where the large
number of 7TM receptors increases the counts for
Nout-Cin topologies.
40Aspects of model
- Specialized modeling of various regions
- Helix caps
- Middle of helix
- Regions close to membrane
- Globular domains (all modeled identically)
- TM amino acid stats derived from known TM domains
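The idea of a grammar that parses a sequence into subregions can be illustrated with a toy hidden Markov model decoded by the Viterbi algorithm. The Python sketch below is far simpler than TMHMM: it has only four states (inside loop, helix running in-to-out, outside loop, helix running out-to-in), three residue classes, and hand-picked probabilities. Every state name and number here is an assumption made for illustration; none is taken from the TMHMM model, which also has cap and globular states.

```python
import math

HYDROPHOBIC = set("AVILMFWC")
POSITIVE = set("KR")

# i = inside loop, io = helix going in-to-out, o = outside loop,
# oi = helix going out-to-in. The allowed transitions enforce the
# grammar: loops alternate sides and are separated by helices.
STATES = ["i", "io", "o", "oi"]
START = {"i": 0.5, "o": 0.5}
TRANS = {
    "i": {"i": 0.9, "io": 0.1},
    "io": {"io": 0.95, "o": 0.05},
    "o": {"o": 0.9, "oi": 0.1},
    "oi": {"oi": 0.95, "i": 0.05},
}
# Emission classes: "h" hydrophobic, "+" Lys/Arg, "x" everything else.
HELIX = {"h": 0.8, "+": 0.02, "x": 0.18}
EMIT = {
    "i": {"h": 0.2, "+": 0.3, "x": 0.5},   # inside loops are K/R rich
    "o": {"h": 0.2, "+": 0.1, "x": 0.7},
    "io": HELIX,
    "oi": HELIX,
}


def residue_class(aa):
    if aa in HYDROPHOBIC:
        return "h"
    return "+" if aa in POSITIVE else "x"


def log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Most probable state path for seq under the toy model."""
    if not seq:
        return []
    obs = [residue_class(aa) for aa in seq.upper()]
    score = {s: log(START.get(s, 0)) + log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES, key=lambda p: score[p] + log(TRANS[p].get(s, 0)))
            new_score[s] = (score[prev] + log(TRANS[prev].get(s, 0))
                            + log(EMIT[s][symbol]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    toy = "KRKRS" + "L" * 20 + "SDSTG" + "L" * 20 + "KRKRK"
    print("".join({"i": "i", "o": "o", "io": "M", "oi": "M"}[s]
                  for s in viterbi(toy)))
```

Because helices and topology come out of one decoding pass, the model cannot produce a helix count that contradicts its predicted orientation, which is the appeal of the grammar-style approach.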
41Training data
42Signal peptide prediction
46Chloroplast transit peptides are hard to detect
49A plant GPCR??
Arabidopsis thaliana GCR2
50only one Arabidopsis putative GPCR protein
(GCR1) has been characterized in plants (17-20),
and no ligand has been defined for any plant GPCR
51Transmembrane structure prediction suggests that
GCR2 is a membrane protein with seven
transmembrane helices
- Despite bold claim of 7TM GPCR, only two
prediction servers used, no confidence values
indicated, and figure ended up in Supplemental
Material!
DAS
TMpred
53Liu et al predicted GCR2 as a seven-transmembrane
protein (7TM), using the TMpred and DAS
programs, but did not report score thresholds to
evaluate the confidence of these predictions.
TMpred and DAS are known to erroneously predict
transmembrane helices within soluble proteins
(55% and 83% false positive rates, respectively).
GCR2 alignment with LanC superfamily (non 7-TM
GPCR)
56We initially predicted that GCR2 was a
seven-transmembrane protein using TMpred and
DAS software programs (2). We further used 12
distinct software programs to predict the
topological structure of GCR2 and found that 9 of
them (TMHMM, SOSUI, and DAS-TMfilter
excluded) showed that GCR2 is a transmembrane
protein with various numbers of transmembrane
domains. TMHMM has underpredicted
transmembrane domains in many instances (3), and
the only other reported GPCR in Arabidopsis, GCR1
(4), was predicted to be a three-transmembrane
protein by SOSUI. In addition, about 14% of
known transmembrane proteins (established by
crystal structure or biochemical evidence) cannot
be correctly predicted by available software (3).
Thus, computational prediction of membrane
proteins is not yet a mature science and mainly
serves to generate hypotheses for experimental
testing
58Discussion
Based on the evidence presented, who do you think
is right? TM prediction and validation are
challenging for bioinformaticians and
experimentalists alike!
59TM/Signal Peptide/Localization prediction servers
- Phobius http://phobius.sbc.su.se
- Combined topology and signal peptide prediction
- TMHMM http://www.cbs.dtu.dk/services/TMHMM/
- TM helix prediction
- TargetP http://www.cbs.dtu.dk/services/TargetP/
- Subcellular localization of eukaryotic proteins
- SignalP http://www.cbs.dtu.dk/services/SignalP/
- Predicts the presence and location of signal
peptide cleavage sites
60Summary points
- Protein localization is a critical aspect of
protein function
- Methods for predicting localization can
over-estimate their expected accuracy
- Datasets used in validation typically differ from
one method to the next, so results are not
comparable
- Expect false positive and false negative
predictions
- Consensus prediction using various types of
information and predictions is your best bet for
improving accuracy