Title: Shan Sundararaj
1Protein Subcellular Localization
- Shan Sundararaj
- University of Alberta
- Edmonton, AB
- ss23_at_ualberta.ca
2Why is Localization Important?
- Function is dependent on context
- Co-localization of proteins of related function
- Valuable annotation for new proteins
- Design of proteins with specific targets
- Drug targeting
- Accessibility
- Membrane-bound > cytoplasmic > nuclear
3Why is Localization Important?
- 1999 Nobel Prize in Physiology or Medicine awarded to Günter Blobel
- For the discovery that proteins have intrinsic signals that govern their transport and localization in the cell
4Bacteria
Gram Positive (3-4 states): cytoplasm, cytoplasmic membrane, cell wall, extracellular
Gram Negative (5 states): cytoplasm, cytoplasmic membrane, periplasm, outer membrane, extracellular
5Eukaryotic Cell
- Compartmentalized
- Diverse range of specific organelles
- Plants: chloroplasts, chromoplasts, other plastids
- Muscle: sarcoplasm
- Various endosomes, vesicles
(modified from Voet & Voet, Biochemistry, Wiley-VCH 1992)
6Yet more categories
Chloroplast
Mitochondrion
Yeast specific
7Level of Annotation
- As simple as two states
- membrane protein vs. non-membrane protein
- secreted protein vs. non-secreted protein
- Gross compartments
- cytoplasm, inner membrane, periplasm, cell wall, outer membrane, extracellular
- nucleus, mitochondria, peroxisome, vacuole
- Fine compartments
- Mitochondrial matrix, bud neck, spindle pole
- Any of 1425 GO cellular compartments
8Localization signaling
- Proteins must have intrinsic signals for their localization: a cellular address
- E.g. N-terminal signal sequences
321 Nuclear Inner Membrane Lane, Nucleus, Intracellular County, Eukaryotic Cell CL34V3M3
9Localization signaling
- Some signals are easily recognizable
- Signal peptidase cleavage site, consensus sequence for secretion → extracellular
- Address printed neatly, postal code
- Others are difficult to understand
- Outer membrane β-barrel proteins, no consensus sequence, few sequence restraints
- Sloppy address, different kind of code that we don't understand yet
10Experimental determination
- Since we don't fully understand the language of proteins, our knowledge must often come from inference
- Predicting localization is like sorting mail based only on examples of where some mail has gone before
- Important to have good data sets of proteins with known localizations
11Datasets
- DBSubLoc (http://www.bioinfo.tsinghua.edu.cn/guotao/download.html)
- Combines SwissProt and PIR localization annotations (64051 proteins!)
- PSORT-B
- (http://www.psort.org/dataset/)
- 1591 Gram -ve proteins, 576 Gram +ve proteins
- SignalP
- (http://www.cbs.dtu.dk/ftp/signalp/)
- 940 plant and 2738 human proteins
- YPL
- (http://bioinfo.mbb.yale.edu/genome/localize/)
- 2956 yeast proteins
12Experimental Methods
- Electron microscopy
- GFP tagging / fluorescence microscopy
- Subcellular fractionation followed by detection
- Western blotting
- Mass spectrometry
13Electron Microscopy
- Highest resolution, can work at the level of a single protein complex
- Immunolabel proteins of interest in conjunction with colloidal gold, and visualize
- Combined with electron tomography, can even visualize unlabeled complexes
(from Koster and Klumperman, Nat Rev Mol Cell
Biol, Sep 2003, S6-10)
14Fluorescence Microscopy
- Tag gene at either 3' or 5' end
- Using GFP (or RFP, YFP, CFP, etc.)
- Using an epitope tag and a fluorescently labeled antibody
- Be careful not to remove signal peptides!
- Also use a subcellular-specific marker or stain
- Visualize with confocal fluorescence microscopy and analyze images for co-localization
15Specific co-labeling (yeast)
- Early Golgi: Cop1
- Endosome: Snf7
- ER to Golgi: Sec13
- Golgi apparatus: Anp1
- Late Golgi: Chc1
- Lipid particle: Erg6
- Mitochondrion: MitoTracker
- Nucleus: DAPI
- Nucleolus: Sik1
- Nuclear periphery: Nic96
- Peroxisome: Pex3
- Vacuole: FM4-64
Nuclear-specific DAPI staining
16Subcellular Fractionation
tissue homogenate
- Centrifuge at 1000 g. Pellet: unbroken cells, nuclei, chloroplasts. Transfer supernatant.
- Centrifuge at 10,000 g. Pellet: mitochondria. Transfer supernatant.
- Centrifuge at 100,000 g. Pellet: microsomal fraction (ER, Golgi, lysosomes, peroxisomes). Transfer supernatant.
- Supernatant: cytosol, soluble enzymes
17Detergent Fractionation
Cells
- Extraction with Digitonin/EDTA. Supernatant: cytoplasmic fraction
- Pellet: extraction with Triton X-100/EDTA. Supernatant: organelle membranes
- Pellet: extraction with SDS/EDTA. Nuclear and cytoskeletal fractions (in SDS)
18Fractionation → Identification
- Once fractionated, take compartment of interest and separate proteins
- 2D gel or chromatography
- Identify separated proteins
- Mass spectrometry for high-throughput
- Western blot for specific proteins
19Fractionation in proteomics
20Recent High-Throughput Exp.
- Kumar et al., Genes Dev 2002, 16:707-719
- Epitope-tagged >60% of ORFs, visualized with fluorescently labeled antibody
- 2744 localizations (44% of S. cerevisiae genes)
- Huh et al., Nature 2003, 425:686-691
- GFP tagged all ORFs, RFP tagged compartments
- 4156 localizations (75% of S. cerevisiae genes)
- Combined, now nearly 87% of yeast proteins have a localization annotation
21Predictions from known data
- Enough experimental data exists to build highly
accurate computational predictors of localization
22Predictions from known data
- Different information used for predictions
- Sequence motifs
- N-terminal secretory signal peptides, mitochondrial targeting peptide, chloroplast transit peptide
- C-terminal peroxisome import signal, ER retention signal
- Mid-sequence nuclear localization signals
- Amino acid composition
- AA frequency, dipeptide composition
- Homology
- Sequence comparison to proteins of known localization
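The composition features named above (AA frequency, dipeptide composition) can be sketched roughly as follows. This is a minimal illustration, not the feature code of any particular predictor; the function names and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq):
    """Fraction of each standard amino acid in seq (20 features)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: (counts[a] / total if total else 0.0) for a in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in seq (400 features)."""
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    total = sum(pairs[k] for k in keys)
    return {k: (pairs[k] / total if total else 0.0) for k in keys}


example = "MKKLLA"  # hypothetical toy sequence
aa = aa_composition(example)
dp = dipeptide_composition(example)
print(round(aa["K"], 3), round(dp["KK"], 3), len(dp))
```

Either dictionary can be flattened into a fixed-length vector and passed to a classifier.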
23N-terminal signal peptides
24N-terminal signal peptides
- Common structure of signal peptides
- positively charged n-region, followed by a
hydrophobic h-region and a neutral but polar
c-region.
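As a rough illustration of the n-region / h-region pattern described above, the sketch below flags sequences with a positively charged residue near the N-terminus followed by a hydrophobic stretch. The residue sets, region lengths, and cutoffs are assumptions chosen for illustration, the c-region is not checked, and this is not how SignalP or any other published tool works.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residue set
POSITIVE = set("KR")           # positively charged residues


def longest_hydrophobic_run(segment):
    """Length of the longest run of consecutive hydrophobic residues."""
    best = run = 0
    for aa in segment:
        run = run + 1 if aa in HYDROPHOBIC else 0
        best = max(best, run)
    return best


def looks_like_signal_peptide(seq, n_len=6, window=30, min_run=7):
    """Crude check: positive charge in the first n_len residues,
    then a hydrophobic run of at least min_run within the window."""
    seq = seq.upper()
    has_positive = any(aa in POSITIVE for aa in seq[:n_len])
    h_run = longest_hydrophobic_run(seq[n_len:window])
    return has_positive and h_run >= min_run


# Illustrative sequences (hypothetical inputs)
candidate = "MKKTAIAIAVALAGFATVAQAAPKDNTWYTG"
hydrophilic = "MSDEEQNGSTPQDEGSSTNQDEEQNGSTPQ"
print(looks_like_signal_peptide(candidate), looks_like_signal_peptide(hydrophilic))
```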
25More work to do
- Multiple bacterial secretion pathways
- C-terminal signal peptides
- Internal mitochondrial transit peptides
- Structural aspects of targeting
- Gene re-localization
- Still a lot to discover in how signaling works!
26Computational methods for predicting localization
- Expert rule based methods
- Artificial Neural Nets (ANN)
- Hidden Markov Models (HMM)
- Naïve Bayes (NB)
- Support Vector Machines (SVM)
- Combination of above methods
27Naïve Bayes
- Assumption
- Features are conditionally independent, given class labels
- Structure
- 1-level tree
- Class labels: root
- Features: leaf nodes
- Prediction
- $\mathrm{class}(f) = \arg\max_c \, P(C=c)\, P(F=f \mid C=c)$
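The prediction rule can be sketched as a small Naïve Bayes classifier over annotation tokens. This is a minimal sketch under assumptions: features are treated as token presence/absence, add-one smoothing is used, and the training examples are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (set_of_tokens, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        vocab |= tokens
        for t in tokens:
            token_counts[label][t] += 1
    return class_counts, token_counts, vocab


def predict(tokens, model):
    """Return argmax_c of log P(C=c) + sum over features of log P(F=f | C=c)."""
    class_counts, token_counts, vocab = model
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / n)  # log prior
        for t in vocab:
            # P(token present | c), with add-one smoothing (an assumption)
            p = (token_counts[c][t] + 1) / (n_c + 2)
            score += math.log(p if t in tokens else 1 - p)
        scores[c] = score
    return max(scores, key=scores.get), scores


# Invented toy training data
examples = [
    ({"Signal", "Secreted"}, "extracellular"),
    ({"Signal", "Fungicide"}, "extracellular"),
    ({"Kinase", "ATP-binding"}, "cytoplasm"),
    ({"Ribosomal protein"}, "cytoplasm"),
]
model = train(examples)
label, scores = predict({"Signal"}, model)
print(label)
```

Working in log space turns the product of probabilities into a sum, which avoids numerical underflow when there are many features.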
28Artificial Neural Network
- Excellent for modeling non-linear input/output relationships
- Robust to noise in training data
- Widely used in bioinformatics
29Support Vector Machines
- Input vectors are separated into positive vs. negative instances
- Map to new feature space
- Find hyperplane that best separates the two classes by distance
30Evaluating Predictors - Precision
Predicted
True
- Number of proteins correctly labeled as cyt divided by the total number of proteins labeled as cyt
- How often the label is correct
- If there are 90 proteins correctly labeled as cyt, and 10 proteins incorrectly labeled as cyt, then the precision is 90/100 = 0.90.
31Evaluating Predictors - Sensitivity
Predicted
True
- Number of proteins correctly labeled as cytoplasmic divided by the total number of proteins that are cytoplasmic
- How many of the true results were retrieved (also called recall)
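Both measures can be computed from paired true and predicted labels, as in the sketch below. The counts reuse the 90 correct / 10 incorrect "cyt" labels from the precision example; the 30 missed cytoplasmic proteins are an assumed number added so that recall can be shown.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one class label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 90 correct cyt, 10 wrongly labeled cyt, 30 true cyt missed (assumed)
true = ["cyt"] * 90 + ["nuc"] * 10 + ["cyt"] * 30
pred = ["cyt"] * 90 + ["cyt"] * 10 + ["nuc"] * 30
precision, recall = precision_recall(true, pred, "cyt")
print(precision, recall)
```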
32Predictions from known data
- Different information used for predictions
- Sequence motifs
- N-terminal secretory signal peptides, mitochondrial targeting peptide, chloroplast transit peptide
- C-terminal peroxisome import signal, ER retention signal
- Mid-sequence nuclear localization signals
- Amino acid composition
- AA frequency, dipeptide composition
- Homology
- Sequence comparison to proteins of known localization
33TargetP, SignalP, P http://www.cbs.dtu.dk/services/
- Sequence-based methods
- TargetP (85-90% recall)
- Predicts mitochondria/chloroplast/secreted
- Contains SignalP and ChloroP
- LipoP
- lipoproteins and signal peptides in Gram negative bacteria
- SecretomeP
- non-classical secretion in eukaryotes
34SignalP result
Cleavage site
Prediction: Signal peptide
Signal peptide probability: 0.945
Signal anchor probability: 0.000
Max cleavage site probability: 0.723 between pos. 28 and 29
35Organellar Prediction
- Predotar (http://www.inra.fr/predotar/) (80% recall)
- Mitochondrial and plastid sequences: N-terminal sequences
- MitoPred (http://mitopred.sdsc.edu/)
- Mitochondrial: PFAM domains, AA composition
- MitoProteome (http://www.mitoproteome.org/)
- Database of experimentally predicted human mitochondrial proteins
- MitoP (http://ihg.gsf.de/mitop2/)
- Combines data from multiple experimental and computational sources to give a consensus score for each mitochondrial protein in yeast and human
36The PSORT Family
- PSORT: plant sequences
- PSORT II: eukaryotic sequences
- iPSORT: eukaryotic N-term. signal sequences
- PSORT-B: bacterial sequences
37PSORT-B http://www.psort.org/psortb/
38PSORT-B - methods
- Signal peptides: Non-cytoplasmic
- AA composition/patterns
- SVMs trained for each location vs. all other locations
- Transmembrane helices: Inner membrane
- HMMTOP
- PROSITE motifs: all localizations
- Outer membrane motifs: Outer membrane
- Homology to proteins of known localization
- SCL-BLAST
Integration with a Bayesian network
39PSORT-B results
- SeqID: Unannotated_bacterial2
- Analysis Report
- CMSVM-: Unknown (No details)
- CytoSVM-: Cytoplasmic (No details)
- ECSVM-: Unknown (No details)
- HMMTOP-: Unknown (No internal helices found)
- Motif-: Unknown (No motifs found)
- OMPMotif-: Unknown (No motifs found)
- OMSVM-: Unknown (No details)
- PPSVM-: Unknown (No details)
- Profile-: Unknown (No matches to profiles found)
- SCL-BLAST-: Cytoplasmic (matched 118438 Cyto. protein)
- SCL-BLASTe-: Unknown (No matches against database)
- Signal-: Unknown (No signal peptide detected)
- Localization Scores
- Cytoplasmic 9.97
- CytoplasmicMembrane 0.01
- Periplasmic 0.01
- OuterMembrane 0.00
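One way to turn a score list like the one above into a single call is to take the top-scoring site and report it only if it clears a cutoff. The sketch below assumes simple "name score" lines; the function names are hypothetical, and the 7.5 cutoff is an assumed illustrative threshold that does not come from these slides.

```python
def parse_scores(lines):
    """Parse 'SiteName score' lines into a dict of floats."""
    scores = {}
    for line in lines:
        parts = line.replace("-", " ", 1).split() if line.startswith("-") else line.split()
        if len(parts) == 2:
            try:
                scores[parts[0]] = float(parts[1])
            except ValueError:
                continue  # skip lines whose second field is not a number
    return scores


def final_prediction(scores, cutoff=7.5):
    """Top-scoring site if it reaches the (assumed) cutoff, else 'Unknown'."""
    site, best = max(scores.items(), key=lambda kv: kv[1])
    return site if best >= cutoff else "Unknown"


report = [
    "Cytoplasmic 9.97",
    "CytoplasmicMembrane 0.01",
    "Periplasmic 0.01",
    "OuterMembrane 0.00",
]
scores = parse_scores(report)
print(final_prediction(scores))
```

Returning "Unknown" below the cutoff trades recall for precision: fewer proteins get a label, but the labels that are given are more reliable.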
40Proteome Analyst http://www.cs.ualberta.ca/bioinfo/PA/Sub/
41Proteome Analyst - Method
42Proteome Analyst - Feature Extraction
43Proteome Analyst Feature Extraction
- TOP 3 Homologs
- → AFP1_ARATH
- AFP1_BRANA
- AFP2_ARATH
- KW: Plant defense; Fungicide; Signal; Multigene family; Pyrrolidone carboxylic acid
- DR InterPro: IPR002118; IPR003614
- CC Subcellular location: Secreted
- Token Set: Plant defense, Fungicide, Signal, Multigene family, Pyrrolidone carboxylic acid, IPR002118, IPR003614, Secreted
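The token set shown here can be assembled by pooling the annotation fields of a homolog. The sketch below uses a plain dictionary as a stand-in for a parsed database entry; the dictionary layout and function name are assumptions for illustration, not Proteome Analyst's actual data structures.

```python
def build_token_set(annotation, fields=("KW", "DR InterPro", "CC Subcellular location")):
    """Pool values from the chosen annotation fields, keeping order, dropping repeats."""
    tokens = []
    for field in fields:
        for tok in annotation.get(field, []):
            if tok not in tokens:
                tokens.append(tok)
    return tokens


# Annotation values copied from the slide for the top homolog
afp1_arath = {
    "KW": ["Plant defense", "Fungicide", "Signal", "Multigene family",
           "Pyrrolidone carboxylic acid"],
    "DR InterPro": ["IPR002118", "IPR003614"],
    "CC Subcellular location": ["Secreted"],
}
tokens = build_token_set(afp1_arath)
print(tokens)
```

Passing a different `fields` tuple gives a different token set, for example one without the subcellular location field.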
44PASub - Results
Contribution of each token
Log scale
Features
45PASub - Interpretation
- Bars represent -log probability, so a little difference is a lot!
- Naïve Bayes chosen as classifier because of transparency of method
- Each token gives a probability that can be summed and shown graphically
- Neural network actually has higher recall
- Can change token set, ask to explain with different features
46Save Time: Pre-computed Genomes
- PSORT-B 2.0
- http://www.psort.org/genomes/
- 103 Gram -ve bacteria, 45 Gram +ve bacteria
- Proteome Analyst
- http://www.cs.ualberta.ca/bioinfo/
- Human, mouse, fly, yeast, Plasmodium falciparum,
E. coli, B. subtilis