Predicting cellular localization

Slides: 61

Provided by: KimmenSj2

1
Predicting cellular localization

Bioe C144/C244
Fall 2010

2
Eukaryotic protein localization
3
Why localize?

Subcellular localization is a key functional
characteristic of proteins.
To co-operate for a common physiological function
(metabolic pathway, signal transduction cascade,
structural associate etc.), proteins must be
localized in the same cellular compartment.

http://mendel.imp.univie.ac.at/CELL_LOC/
4
Correct Localization is required for
pathway/complex formation

A set of many co-operating proteins is
responsible for a physiological function
(metabolic pathway, signal transduction cascade,
structural associate etc.).
Subcellular localization is an essential
characteristic for this level.
For proper functioning, the protein has to be
translocated to the correct intra- or
extracellular compartments in a soluble form or
attached to a membrane.

http://mendel.imp.univie.ac.at/CELL_LOC/
5
Computer-Aided Approaches for the Assignment of
Subcellular Localization

Automatic, computer-aided selection methods are
clearly the only way to identify attractive
target proteins in the haystack of
new gene sequence data.
One of the helpful decision criteria is the
probable subcellular localization of the gene
products.
For example, in a search for virulence factors of
pathogenic bacteria or easily accessible entry
points for pharmaceutical drugs, extracellular
proteins are good candidates.

http://mendel.imp.univie.ac.at/CELL_LOC/
6
Computer-Aided Approaches for the Assignment of
Subcellular Localization

For a primary screening of gene sequences, the
first step is a general classification into
intracellular, extracellular, membrane-related
(both with transmembrane regions and with lipid
anchors) and viral proteins.
In the case of eukaryotes, it is desirable to
further specify intracellular location by
organelle (mitochondrion, chloroplast,
endoplasmic reticulum and Golgi apparatus,
nucleus).

http://mendel.imp.univie.ac.at/CELL_LOC/
7
Predicting subcellular localization by homology
with characterized proteins

Subcellular localization can often be assigned by
searching for homologous sequences.
This is an easy task for a few new proteins but
very difficult for thousands of sequences
contained in new genomes.
Even with the most advanced retrieval systems and
relying on the well-annotated SWISS-PROT, it is
impossible to get exhaustive classifications with
respect to subcellular localization.

http://mendel.imp.univie.ac.at/CELL_LOC/
8
Prediction method 2: analysis of sequence
properties

First attempts to classify proteins with respect
to cellular localization based on amino acid
sequence properties: Nishikawa and Ooi (J. Biochem.
1982)
Amino acid composition, disulphide bonds, and the
secondary structural class are related to function
and localization
Early results were promising, but based on a
small sample.

http://mendel.imp.univie.ac.at/CELL_LOC/
9
Prediction by signal peptide detection

Some proteins have sequence signals that
determine their translocation to organelles or
outside the cell
Claros et al., Curr. Op. Struct. Biol. (1997).
These patterns are not clear cut, especially for
the intracellular organelle targeting peptides;
prediction accuracy is limited
Nielsen et al., Prot. Eng. (1997) v.10, 1
Combinations of compositional and signal sequence
analyses have been used in expert systems for the
prediction of cellular localization
Nakai and Kanehisa, Genomics (1992)
In general not systematic and not rigorously
tested

http://mendel.imp.univie.ac.at/CELL_LOC/
10
Extracting information from sequence

Signal peptides: short sequences in the protein
used to target the protein to specific cellular
compartments.
Signal patches (clusters of amino acids in close
proximity in 3D structure, but distant in primary
sequence) are also found
Examination of amino acids at the structure surface
can be particularly helpful: subtle preferences
of different amino acids for different
environments

11
Trans-membrane helix prediction
12
(No Transcript)
13
Helical membrane proteins

Key components in cell-cell signalling
Mediate transport of ions and solutes across
membrane
Crucial for recognition of self
Major class of drug targets
More than 50% of prescription drugs act on GPCRs
(G-protein coupled receptors)
Multi-billion dollar industry

14
Many predicted, few known

Solved structures available for very few membrane
proteins
Predicted 10K helical membrane proteins in human
genome (25% of genome!)

Chen and Rost, 2002
15
Helical membrane proteins challenge bioinformatics

Very little info about 3D structures
Very hard to crystallize
Hardly traceable by nuclear magnetic resonance
(NMR) spectroscopy
Relatively easy to identify (rough) location of
helices through low-resolution experiments
C-terminal fusion with indicator proteins
Antibody binding

Chen and Rost, 2002
16
Concepts for predicting TM helix location and
topology

Hydrophobicity scales provide simple criteria for
prediction
TM helices are predominantly non-polar
TM helix length between 12-35 aa
Globular regions between membrane helices
typically shorter than 60 aa
Positive-inside rule (von Heijne)
Connecting loop regions on inside have more
positive charge than loop regions on outside

Chen and Rost, 2002
17
Hydrophobicity scales

Kyte and Doolittle (20 yrs ago)
Hydropathy scale, moving window approach
Window of 19 residues discriminated best between
membrane and globular
Other work equally successful
Drawback: methods fail to discriminate between
membrane regions and highly hydrophobic globular
segments

Chen and Rost, 2002
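
The moving-window idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' program: the function names are hypothetical, and the 1.6 cutoff is an assumption taken from the commonly quoted Kyte-Doolittle threshold for 19-residue windows. As the drawback above notes, a hydrophobic globular segment would pass this test too.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, cutoff=1.6):
    """0-based start positions of windows whose mean exceeds the cutoff.

    The cutoff is an assumed default, not a validated parameter.
    """
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h > cutoff]


if __name__ == "__main__":
    # Toy sequence: polar flank, 19 leucines, charged flank.
    toy = "D" * 20 + "L" * 19 + "K" * 20
    print(candidate_tm_windows(toy))
```

Overlapping hits would normally be merged into one candidate segment before any topology step.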
18
Other clues

Amino acid preferences for membrane and
non-membrane proteins
Training data for methods derived from proteins
identified as containing TM helices, as well as
other secondary structure types
Higher accuracy

Chen and Rost, 2002
19
Including topology helps

TopPred (von Heijne, 1992)
Topology prediction, using hydrophobicity
analysis, possible topologies ranked by
positive-inside rule
SOSUI (Hirokawa et al, 1998)
Combined KD hydropathy, amphiphilicity, relative
and net charges, protein length

Chen and Rost, 2002
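
As a rough sketch of how the positive-inside rule can rank the two possible orientations once helix positions are known, the Python below counts Lys and Arg in alternating loops. It is a simplified illustration and not the TopPred algorithm itself: the function names and the half-open (start, end) helix coordinates are assumptions, and real tools apply further refinements.

```python
def loops_between(seq_len, helices):
    """Non-membrane segments flanking the helices.

    helices: sorted, non-overlapping (start, end) pairs, 0-based, end exclusive.
    """
    loops, prev_end = [], 0
    for start, end in helices:
        loops.append((prev_end, start))
        prev_end = end
    loops.append((prev_end, seq_len))
    return loops


def predict_n_terminus(seq, helices):
    """Return 'in' or 'out' for the N-terminus using the positive-inside rule."""
    loops = loops_between(len(seq), helices)
    positives = [
        sum(seq[a:b].count(res) for res in "KR") for a, b in loops
    ]
    # Loops 0, 2, 4, ... lie on the same side as the N-terminus;
    # loops 1, 3, 5, ... lie on the opposite side.
    same_side = sum(positives[0::2])
    other_side = sum(positives[1::2])
    # The side carrying more positive charge is taken to be the inside.
    # Ties default to 'in' (an arbitrary choice for this sketch).
    return "in" if same_side >= other_side else "out"


if __name__ == "__main__":
    toy = "KKRK" + "L" * 20 + "DDSE" + "L" * 20 + "RRKK"
    print(predict_n_terminus(toy, [(4, 24), (28, 48)]))
```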
20
Including homology helps

Alignment of homologs known to help secondary
structure prediction (Rost and Sander, 1993)
Note: for 20-30% of proteins in any genome, no
identifiable homologs can be found!
PHDhtm: first method using homology info for
membrane prediction
Uses neural networks, DP (dynamic programming),
and multiple alignment;
one of the most accurate prediction methods

Chen and Rost, 2002
21
Including homology helps

TMAP (Persson and Argos, 1996)
Derived amino acid propensities from known TMs
4-residue caps of membrane helices
21 residue TM segments
Found at outside of membrane: N D G F P W Y V
Found mostly inside: A R C K
Used these propensities to improve prediction

Chen and Rost, 2002
22
Grammatical rules

TMHMM pioneered building models of predicted
membrane proteins in one consistent methodology
Sonnhammer et al 1998, Krogh et al 2001
Similar concept implemented in HMMTOP
Tusnady and Simon, 1998
MEMSAT similar to HMMTOP
Jones et al, 1994

Chen and Rost, 2002
23
Topology questions

The topology of a TM protein indicates its
orientation with respect to the membrane
which regions are outside (extracellular) and
which are cytoplasmic
Predicted topologies turn out to be wrong roughly
as often as they're correct

Chen and Rost, 2002
24
Sequence information aiding TM recognition

Hydrophobic stretches (for lipid bilayer)
Positive inside rule
Von Heijne 1986, 1994
Abundance of positively charged residues
Improved predictions through use of
sliding windows
Multiple alignment
Neural networks

Chen and Rost, 2002
25
Errors in TM prediction

Under-prediction (False negative)
Over-prediction (False positive)
False merge
two adjacent helices predicted to be one helix
False split
One long helix predicted to be two
Inexact placement of helices

Chen and Rost, 2002
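
The error categories on this slide can be made concrete by comparing observed and predicted helix segments. The Python sketch below is one possible bookkeeping scheme, not the evaluation protocol of Chen and Rost: the function names and the minimum overlap of 3 residues are assumptions chosen for illustration, and inexact placement is not scored.

```python
def overlaps(a, b, min_overlap=3):
    """True if segments a and b (start, end; end exclusive) share enough residues."""
    return min(a[1], b[1]) - max(a[0], b[0]) >= min_overlap


def classify_errors(observed, predicted, min_overlap=3):
    """List (error_type, segment) pairs for helix-level prediction errors."""
    errors = []
    for obs in observed:
        hits = [p for p in predicted if overlaps(obs, p, min_overlap)]
        if not hits:
            errors.append(("under-prediction", obs))  # helix missed entirely
        elif len(hits) > 1:
            errors.append(("false split", obs))       # one helix predicted as several
    for pred in predicted:
        hits = [o for o in observed if overlaps(o, pred, min_overlap)]
        if not hits:
            errors.append(("over-prediction", pred))  # helix predicted where none exists
        elif len(hits) > 1:
            errors.append(("false merge", pred))      # several helices predicted as one
    return errors


if __name__ == "__main__":
    observed = [(10, 30), (40, 60), (80, 100)]
    predicted = [(12, 58), (120, 140)]
    for kind, segment in classify_errors(observed, predicted):
        print(kind, segment)
```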
26
Prediction accuracy (1)

Performance accuracy overestimated significantly!
Developers have overrated their methods by
15-50% (Chen et al, unpublished)
Why do developers overestimate their method
accuracy?
Validation performed on proteins closely related
to training sequences (and thus not indicative of
performance on novel sequences)

Chen and Rost, 2002
27
Prediction accuracy (2)

Membrane helices are not entirely conserved
across species
Implies that even related proteins may have
different topologies (number of TM helices, orientation)
and perform different cellular functions
N.B. There is no indication that the authors
meant to imply that proteins that are globally
alignable have differences in their TM domain
locations or numbers
Measures of accuracy of prediction not comparable
across methods, due to lack of standard benchmark
Benchmark dataset now available at EBI

Chen and Rost, 2002
28
Chen et al findings

Most TM methods get the number of helices right
for most membrane proteins
86% of TMH residues predicted by best methods
70-75% of proteins get all TM helices predicted
correctly by top methods
Topology correct for only half of all proteins

Chen and Rost, 2002
29
Prediction accuracy (4)

Some papers have claimed that simple
hydrophobicity scales are as accurate as more
sophisticated methods
Chen et al disagree

Chen and Rost, 2002
30
Prediction accuracy (5)

All methods confuse membrane helices with signal
peptides
Best separation provided by ALOM2 (Nakai and
Kanehisa)
Optimized to sort proteins into classes of
sub-cellular localization

Since Rost's paper, the Phobius server was
developed to integrate TM and signal peptide
prediction: http://www.ebi.ac.uk/Tools/phobius/index.html
Chen and Rost, 2002
31
Prediction accuracy (6)

Most methods wrongly predict membrane helices in
globular proteins
Most methods overestimate their ability to
distinguish between globular and membrane
proteins

Chen and Rost, 2002
32
Emerging and future developments

Improved prediction by averaging over many
methods (i.e., consensus approaches)
Promponas and colleagues' CoPreTHi combined 7
methods, requiring 3 to agree
Nilsson et al, 2000, used 5 methods
Accuracy correlated with number of methods
agreeing

Chen and Rost, 2002
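
A per-residue voting scheme is one simple way to picture the consensus approach described on this slide. The Python sketch below is an assumption-laden illustration, not CoPreTHi's actual implementation: each method's output is represented as a string with "M" for membrane and "-" otherwise, and the function name and default threshold are hypothetical.

```python
def consensus(predictions, min_agree=3):
    """Combine per-residue predictions from several methods by voting.

    predictions: equal-length strings, 'M' = membrane, '-' = not membrane.
    Returns (consensus_string, votes_per_residue).
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence")
    votes = [sum(p[i] == "M" for p in predictions) for i in range(length)]
    call = "".join("M" if v >= min_agree else "-" for v in votes)
    return call, votes


if __name__ == "__main__":
    methods = ["--MMMM--", "-MMMM---", "--MMM---", "---MM-M-"]
    call, votes = consensus(methods)
    print(call)
    print(votes)
```

Keeping the vote counts alongside the call reflects the observation that accuracy tends to track the number of methods that agree.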
33
Emerging and future developments
Chen and Rost, 2002

Amphiphilic (aka amphipathic) alpha helix
identification can improve prediction
Helical-membrane and signal peptide predictions
must be combined explicitly
Best signal peptide prediction tool is SignalP
(Nielsen et al 1997)
PSORT, HMMTOP and TMHMM integrate these
predictions
More thorough combination is still missing

Except, of course, for Phobius, released since
this paper
34
Emerging and future developments

Databases of TM proteins being produced and
curated
Membrane-specific substitution matrices improve
database search for TM proteins
Current substitution matrices based on globular
proteins
Henikoff and Henikoff have membrane-helix-specific
substitution matrix PHAT

Chen and Rost, 2002
35
Sequence conservation in TM domains

Residues on helix-helix interface tend to be more
conserved than those facing the lipid bilayer
Conservation in TM helices greater than
structurally variable regions but not as
significant as enzyme active sites and other
functionally critical regions (KS observation)

36
More data from structural studies of TM proteins

Solved membrane protein structures have also
shown that helical propensities are different in
the membrane.
Glycine and proline, which are thought to be
helix-breakers in soluble proteins, occur in the
transmembrane helices of cytochrome c oxidase
Tsukihara et al, 1995.
Studying known structures has revealed that
aromatic residues are often in the bilayer
interface, possibly anchoring the transmembrane
helix in the bilayer
Pawagi et al, 1994.

37
More data from structural studies

Serine and threonine can satisfy hydrogen bond
donors and acceptors by hydrogen bonding to
backbone carbonyls, making membrane localization
favorable (Engelman et al 1986)
Analysis of solved membrane proteins shows that TM
length ranges from 14-36 aa (varying due to
variations in lipid bilayer width)
Canonical alpha helix prediction methods derived
from soluble proteins are not as effective at
predicting TM-located helices

38
TMHMM provides a grammar to parse sequences into
subregions
39
TMHMM author findings

TMHMM correctly predicts 97-98% of the
transmembrane helices.
TMHMM can discriminate between soluble and
membrane proteins with both specificity and
sensitivity better than 99%,
although the accuracy drops when signal peptides
are present
This high degree of accuracy allowed authors to
reliably predict integral membrane proteins in a
large collection of genomes.
Based on these predictions, authors estimate that
20-30% of all genes in most genomes encode
membrane proteins,
which is in agreement with previous estimates.
Proteins with Nin-Cin topologies are strongly
preferred in all examined organisms
except Caenorhabditis elegans, where the large
number of 7TM receptors increases the counts for
Nout-Cin topologies.

40
Aspects of model

Specialized modeling of various regions
Helix caps
Middle of helix
Regions close to membrane
Globular domains (all modeled identically)
TM amino acid stats derived from known TM domains
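
To illustrate what parsing a sequence into subregions with a hidden Markov model means, here is a toy Viterbi decoder in Python. It is emphatically not TMHMM: the four states, the three residue classes, and every probability below are made-up assumptions for this sketch, whereas the real model has specialized submodels (caps, helix core, loops) with parameters estimated from known TM proteins.

```python
import math

# Toy states: i = inside loop, h = helix going in->out,
# o = outside loop, H = helix going out->in.
STATES = "ihoH"
START = {"i": 0.5, "o": 0.5}
TRANS = {
    "i": {"i": 0.9, "h": 0.1},
    "h": {"h": 0.9, "o": 0.1},
    "o": {"o": 0.9, "H": 0.1},
    "H": {"H": 0.9, "i": 0.1},
}
HYDROPHOBIC, POSITIVE = set("AILMFVWC"), set("KR")
# Emission probabilities for (hydrophobic, positive, other); assumed values.
EMIT = {
    "i": (0.2, 0.3, 0.5),
    "o": (0.2, 0.1, 0.7),
    "h": (0.8, 0.02, 0.18),
    "H": (0.8, 0.02, 0.18),
}


def residue_class(aa):
    """Map a residue to 0 (hydrophobic), 1 (positive) or 2 (other)."""
    return 0 if aa in HYDROPHOBIC else 1 if aa in POSITIVE else 2


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Most probable state path for a non-empty sequence under the toy model."""
    obs = [residue_class(aa) for aa in seq]
    score = {s: log(START.get(s, 0.0)) + log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log(TRANS[p].get(s, 0.0)))
            new_score[s] = score[prev] + log(TRANS[prev].get(s, 0.0)) + log(EMIT[s][o])
            pointer[s] = prev
        back.append(pointer)
        score = new_score
    state = max(STATES, key=score.get)
    path = [state]
    for pointer in reversed(back):  # trace back from the best final state
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("KRKRSK" + "L" * 20 + "DSDSDS"))
```

Because the allowed transitions force the order loop, helix, loop on alternating sides, the decoded path yields helix locations and topology in one pass, which is the sense in which the model acts as a grammar.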

41
Training data
42
Signal peptide prediction
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
Chloroplast transit peptides are hard to detect
47
(No Transcript)
48
(No Transcript)
49
A plant GPCR??
Arabidopsis thaliana GCR2
50
Only one Arabidopsis putative GPCR protein
(GCR1) has been characterized in plants (17-20),
and no ligand has been defined for any plant GPCR
51
Transmembrane structure prediction suggests that
GCR2 is a membrane protein with seven
transmembrane helices
-Despite bold claim of 7TM GPCR, only two
prediction servers used, no confidence values
indicated, and figure ended up in Supplemental
Material!
DAS
TMpred
52
(No Transcript)
53
Liu et al predicted GCR2 as a seven-transmembrane
protein (7TM), using the TMpred and DAS
programs, but did not report score thresholds to
evaluate the confidence of these predictions.
TMpred and DAS are known to erroneously predict
transmembrane helices within soluble proteins
(55% and 83% false positive rates, respectively)
GCR2 alignment with LanC superfamily (non 7-TM
GPCR)
54
(No Transcript)
55
(No Transcript)
56
We initially predicted that GCR2 was a
seven-transmembrane protein using TMpred and
DAS software programs (2). We further used 12
distinct software programs to predict the
topological structure of GCR2 and found that 9 of
them (TMHMM, SOSUI, and DAS-TMfilter
excluded) showed that GCR2 is a transmembrane
protein with various numbers of transmembrane
domains. TMHMM has underpredicted
transmembrane domains in many instances (3), and
the only other reported GPCR in Arabidopsis, GCR1
(4), was predicted to be a three-transmembrane
protein by SOSUI. In addition, about 14% of
known transmembrane proteins (established by
crystal structure or biochemical evidence) cannot
be correctly predicted by available software (3).
Thus, computational prediction of membrane
proteins is not yet a mature science and mainly
serves to generate hypotheses for experimental
testing
57
(No Transcript)
58
Discussion
Based on the evidence presented, who do you think
is right? TM prediction and validation is
challenging for bioinformaticians
and experimentalists alike!
59
TM/Signal Peptide/Localization prediction servers

Phobius: http://phobius.sbc.su.se
-combined topology and signal peptide prediction
TMHMM: http://www.cbs.dtu.dk/services/TMHMM/
-TM helix prediction

TargetP: http://www.cbs.dtu.dk/services/TargetP/
-subcellular localization of eukaryotic proteins
SignalP: http://www.cbs.dtu.dk/services/SignalP/
-predicts the presence and location of signal
peptide cleavage sites

60
Summary points

Protein localization is a critical aspect of
protein function
Methods for predicting localization can
over-estimate their expected accuracy
Datasets used in validation typically differ from
one method to the next, so results are not
comparable
Expect false positive and false negative
predictions
Consensus prediction using various types of
information and predictions is your best bet for
improving accuracy