Title: Protein Feature Identification
1Protein Feature Identification
- Microbiology 343
- David Wishart
- david.wishart_at_ualberta.ca
- To show that almost everything you do in the lab or what you need to do to work with a protein can be done on a computer
- Learning methods and algorithms for predicting composition and sequence features
- Learning when to use these tools
- Exhibit far more sequence and chemical complexity than DNA or RNA
- Properties and structure are defined by the sequence and side chains of their constituent amino acids
- The engines of life
- >95% of all drug targets are proteins
- Favorite topic of the post-genomic era
4The Post-genomic Challenge
- How to rapidly identify a protein?
- How to rapidly purify a protein?
- How to identify post-translational modifications?
- How to find information about function?
- How to find information about activity?
- How to find information about location?
- How to find information about structure?
Answer: Look at Protein Features
5Protein Features
Sequence View Structure View
6Different Types of Features
- Composition Features
- Mass, pI, Absorptivity, Rg, Volume
- Sequence Features
- Active sites, Binding Sites, Targeting, Location, Property Profiles, 2° structure
- Structure Features
- Supersecondary Structure, Global Fold, ASA,
7Where To Go
8Compositional Features
- Molecular Weight
- Amino Acid Frequency
- Isoelectric Point
- UV Absorptivity
- Solubility, Size, Shape
- Radius of Gyration
- Free Energy of Folding
9Molecular Weight
10Molecular Weight
- Useful for SDS PAGE and 2D gel analysis
- Useful for deciding on SEC matrix
- Useful for deciding on the molecular weight cutoff (MWC) for dialysis
- Essential in synthetic peptide analysis
- Essential in peptide sequencing (classical or mass-spectrometry based)
- Essential in proteomics and high throughput protein characterization
11Molecular Weight
- Crude MW calculation: MW = 110 × Numres
- Exact MW calculation: MW = Σ AAi × MWi
- Remember to add 1 water (18.01 amu) after adding all residues
- Note isotopic weights
- Corrections for CHO, PO4, Acetyl, CONH2
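The crude and exact calculations above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the table holds standard average-isotope residue masses (monoisotopic masses would be needed for mass spectrometry), and no corrections for modifications such as CHO, PO4, acetyl or CONH2 are applied.

```python
# Average residue masses in daltons (free amino acid minus one water).
# Assumption: average isotopic weights; swap in monoisotopic values for MS work.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.015  # one water is added back for the free N- and C-termini


def crude_mw(seq):
    """Crude estimate: 110 Da per residue."""
    return 110.0 * len(seq)


def exact_mw(seq):
    """Sum of residue masses plus one water; unmodified chain assumed."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER
```

For a single glycine, `exact_mw("G")` returns the mass of free glycine (about 75.07 Da), which is a quick sanity check that the water has been added once and only once.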
12Amino Acid versus Residue
Amino Acid Residue
13Protein Identification via MW
- MOWSE
- http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
- CombSearch
- http://ca.expasy.org/tools/CombSearch/
- Mascot
- http://www.matrixscience.com/search_form_select.html
- AACompSim/AACompIdent
- http://ca.expasy.org/tools/
14Molecular Weight Proteomics
2-D Gel QTOF Mass Spectrometry
15Amino Acid Frequency
- Deviations greater than 2X average indicate something of interest
- High K or R indicates possible nucleoprotein
- High C content indicates a stable but hard-to-fold protein
- High G, P, Q, or N suggests a lack of stable structure
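The "greater than 2X average" rule of thumb can be sketched as follows. The reference percentages are approximate database-wide averages included only for illustration (an assumption, not values from these slides), and the function names are hypothetical. Only over-representation is flagged here.

```python
from collections import Counter

# Approximate average amino acid frequencies (percent) across a large
# protein database. Assumption: illustrative values; substitute the
# averages for your own organism or database.
AVERAGE_PERCENT = {
    "A": 8.3, "R": 5.5, "N": 4.1, "D": 5.5, "C": 1.4, "Q": 3.9, "E": 6.8,
    "G": 7.1, "H": 2.3, "I": 5.9, "L": 9.7, "K": 5.8, "M": 2.4, "F": 3.9,
    "P": 4.7, "S": 6.6, "T": 5.4, "W": 1.1, "Y": 2.9, "V": 6.9,
}


def composition(seq):
    """Percent frequency of each of the 20 standard amino acids."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * counts.get(aa, 0) / len(seq) for aa in AVERAGE_PERCENT}


def unusual_residues(seq, fold=2.0):
    """Residues whose frequency exceeds `fold` times the reference average."""
    return {
        aa: round(pct / AVERAGE_PERCENT[aa], 2)
        for aa, pct in composition(seq).items()
        if pct > fold * AVERAGE_PERCENT[aa]
    }
```

A lysine-rich sequence such as `"KKKKGASLVE"` would be flagged for K only, which is the kind of signal the slide associates with a possible nucleoprotein.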
16Isoelectric Point (pI)
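- The pH at which a protein has a net charge = 0
- Q = Σ Ni/(1 + 10^(pH − pKi)) for basic groups; acidic groups contribute −Ni/(1 + 10^(pKi − pH))
Transcendental equation: there is no closed-form solution for pH, so the pI is found numerically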
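Because the net-charge equation cannot be solved for pH directly, the pI is usually found by a numerical search. The sketch below uses bisection on the net charge, which decreases monotonically with pH. The pKa values are one commonly used set and are an assumption here; published sets differ and shift the answer by a few tenths of a pH unit. Function names are hypothetical.

```python
# Assumption: one commonly used set of side-chain and terminal pKa values.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5, "N_TERM": 8.6}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "C_TERM": 3.6}


def net_charge(seq, ph):
    """Net charge Q of an unmodified linear chain at the given pH."""
    seq = seq.upper()
    q = 0.0
    for group, pka in PKA_POSITIVE.items():
        n = 1 if group == "N_TERM" else seq.count(group)
        q += n / (1 + 10 ** (ph - pka))   # protonated basic groups carry +1
    for group, pka in PKA_NEGATIVE.items():
        n = 1 if group == "C_TERM" else seq.count(group)
        q -= n / (1 + 10 ** (pka - ph))   # deprotonated acidic groups carry -1
    return q


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection search for the pH at which the net charge is zero."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid   # still positive: the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2
```

The result ignores tertiary-structure effects on individual pKa values, so it should be read as an estimate rather than a measurement.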
17Isoelectric Point
- Calculation is only approximate (± 1 pH)
- Does not include 3° structure interactions
- Can be used in developing purification protocols via ion exchange chromatography
- Can be used in estimating spot location for isoelectric focusing gels
- Can be used to decide on best pH to store or analyze protein
18UV Spectroscopy
19UV Absorptivity
- UV (Ultraviolet light) has a wavelength of 200 to 400 nm
- Most proteins and peptides (and all nucleic acids) absorb UV light quite strongly
- UV spectroscopy is the most common form of spectroscopy performed today
- UV spectra can be used to identify or classify some proteins or protein classes
20UV Absorptivity
- OD280 = (5690 × W + 1280 × Y)/MW × Conc. (W, Y = number of Trp and Tyr residues)
- Conc. = OD280 × MW/(5690 × W + 1280 × Y)
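The two OD280 relations can be sketched directly in Python using the coefficients from the slide (5690 per Trp, 1280 per Tyr). The function names are hypothetical, and the concentration is assumed to be in mg/mL with a 1 cm path length so that dividing by MW converts it to molar units.

```python
EXT_TRP = 5690  # molar absorptivity contribution per Trp at 280 nm
EXT_TYR = 1280  # molar absorptivity contribution per Tyr at 280 nm


def molar_absorptivity(seq):
    """Estimated molar absorptivity at 280 nm from Trp and Tyr counts."""
    seq = seq.upper()
    return EXT_TRP * seq.count("W") + EXT_TYR * seq.count("Y")


def od280(seq, mw, conc_mg_per_ml):
    """Predicted OD280 for a given concentration (assumes 1 cm path length)."""
    return molar_absorptivity(seq) / mw * conc_mg_per_ml


def concentration(seq, mw, od):
    """Concentration (mg/mL) from a measured OD280."""
    eps = molar_absorptivity(seq)
    if eps == 0:
        raise ValueError("sequence has no Trp or Tyr; OD280 cannot be used")
    return od * mw / eps
```

A protein with no Trp or Tyr has essentially no predicted absorbance at 280 nm, which is why the sketch refuses to back-calculate a concentration in that case.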
- Hydrophobicity indicates Solubility
- Indicates Stability
- Indicates Location (membrane or cytoplasm)
- Indicates Globularity or tendency to form
spherical structure
- Average Hydrophobicity: AH = Σ AAi × Hi
- Hydrophobic Ratio: RH = Σ H(−)/Σ H(+)
- Hydrophobic Ratio: RHP = philic/phobic
- Linear Charge Density: LIND = (K + R + D + E + H + 2)/
- Solubility: SOL = RH + LIND − 0.05 AH
- Average AH = 2.5 ± 2.5; Insoluble > 0.1; Unstructured < −6
- Average RH = 1.2 ± 0.4; Insoluble < 0.8; Unstructured > 1.9
- Average RHP = 0.9 ± 0.2; Insoluble < 0.7; Unstructured > 1.4
- Average LIND = 0.25; Insoluble < 0.2; Unstructured > 0.4
- Average SOL = 1.6 ± 0.5; Insoluble < 1.1; Unstructured > 2.5
23Different Types of Features
- Composition Features
- Mass, pI, Absorptivity, Hydrophobicity
- Sequence Features
- Active sites, Binding Sites, Targeting, Location, Property Profiles, 2° structure
- Structure Features
- Supersecondary Structure, Global Fold, ASA,
24Sequence Features
25Sites that Support Pattern Queries
- OWL Database
- http://umber.sbs.man.ac.uk/dbbrowser/OWL/
- PIR Website
- http://pir.georgetown.edu/pirwww/search/patmatch.h
- ScanProsite
- http://ca.expasy.org/tools/scanprosite/
- PattinProt
- http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?p
26Regular Expressions
- C[ACG]T - Matches CAT, CCT and CGT only
- C.T - Matches CAT, CaT, C1T, CXT, not CT
- CA?T - Matches CT or CAT only
- C+T - Matches CT, CCT, CCCT, CCCCT, ...
- S[A-I,L-Q,T-Z]?LK[A-I,L-Q,T-Z]?A - Matches SLKA
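The regular expressions listed above can be tried out with Python's standard `re` module. This is a small sketch for experimenting; the helper name `matches` is hypothetical, and `re.fullmatch` is used so that a pattern must account for the whole string rather than a substring.

```python
import re


def matches(pattern, text):
    """True if the whole of `text` is matched by `pattern`."""
    return re.fullmatch(pattern, text) is not None


# Character class: exactly one of A, C or G between C and T
print(matches("C[ACG]T", "CGT"), matches("C[ACG]T", "CTT"))

# Dot: any single character, but one must be present
print(matches("C.T", "C1T"), matches("C.T", "CT"))

# Question mark: the preceding character is optional
print(matches("CA?T", "CT"), matches("CA?T", "CAAT"))

# Plus: one or more of the preceding character
print(matches("C+T", "CCCT"), matches("C+T", "T"))

# Optional residue classes flanking a fixed LK core
print(matches("S[A-I,L-Q,T-Z]?LK[A-I,L-Q,T-Z]?A", "SLKA"))
```

To scan a protein sequence for every occurrence of a motif rather than test a whole string, `re.finditer` with the same pattern returns each match along with its start position.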
27PROSITE Pattern Expressions
- C-[ACG]-T - Matches CAT, CCT and CGT only
- C-x-T - Matches CAT, CCT, CDT, CET, etc.
- C-{A}-T - Matches every CXT except CAT
- C(1,3)-T - Matches CT, CCT, CCCT
- C-A(2)-[TP] - Matches CAAT, CAAP
- [LIV]-[VIC]-x(2)-G-[DENQ]-x-[LIVFM](2)-G
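PROSITE patterns map almost one-to-one onto regular expressions, so a small converter is enough to scan sequences locally. The sketch below is a simplified illustration covering only the elements shown above (single residues, `x`, `[...]`, `{...}` and repeat counts); the function name is hypothetical, and the `<` and `>` terminal anchors of the full PROSITE syntax are not handled.

```python
import re

# One pattern element: [ABC], {ABC} or a single letter, with an optional
# repeat count written as (n) or (n,m).
_ELEMENT = re.compile(
    r"(?P<core>\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern to a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core = m.group("core")
        if core in ("x", "X"):
            core = "."                          # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # {A} means anything except A
        repeat = ""
        if m.group("lo"):
            hi = "," + m.group("hi") if m.group("hi") else ""
            repeat = "{" + m.group("lo") + hi + "}"
        parts.append(core + repeat)
    return "".join(parts)
```

The converted string can be passed to `re.finditer` to locate every occurrence of the motif in a protein sequence.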
28Sequence Feature Databases
- PROSITE - http://ca.expasy.org/prosite/
- BLOCKS - http://www.blocks.fhcrc.org/
- DOMO - http://www.infobiogen.fr/services/domo/
- PFAM - http://pfam.wustl.edu
- PRINTS - http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
- SEQSITE - PepTool
29Phosphorylation Sites
30Phosphorylation Sites
31Signaling Sites
32Protease Cut Sites
33Binding Sites
34Family Signature Sequences
35Enzyme Active Sites
36Better Methods for Sequence Feature ID
- Sequence Profiles/Scoring Matrices
- Neural Networks
- Hidden Markov Models
- Bayesian Belief Nets
- Reference Point Logistics
37What Can Be Predicted?
- O-Glycosylation Sites
- Phosphorylation Sites
- Protease Cut Sites
- Nuclear Targeting Sites
- Mitochondrial Targ Sites
- Chloroplast Targ Sites
- Signal Sequences
- Signal Sequence Cleav.
- Peroxisome Targ Sites
- ER Targeting Sites
- Transmembrane Sites
- Tyrosine Sulfation Sites
- GPInositol Anchor Sites
- PEST sites
- Coiled-Coil Sites
- T-Cell/MHC Epitopes
- Protein Lifetime
- A whole lot more.
38Cutting Edge Sequence Feature Servers
- Membrane Helix Prediction
- http://www.cbs.dtu.dk/services/TMHMM-2.0/
- T-Cell Epitope Prediction
- http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm
- O-Glycosylation Prediction
- http://www.cbs.dtu.dk/services/NetOGlyc/
- Phosphorylation Prediction
- http://www.cbs.dtu.dk/services/NetPhos/
- Protein Localization Prediction
- http://psort.nibb.ac.jp/
39Subcellular Localization
40Profiles & Motifs are Useful
- Helped identify active site of HIV protease
- Helped identify SH2/SH3 class of STPs
- Helped identify important GTP oncoproteins
- Helped identify hidden leucine zipper in HGA
- Used to scan for lectin binding domains
- Regularly used to predict T-cell epitopes
41Amino Acid Property Profiles
42Amino Acid Property Profiles
- Intent is to predict a protein's physical properties directly from sequence as opposed to composition or wet chemistry
- Offers a more detailed, graphical view of sequence-specific properties than compositional analysis (more powerful?)
- Underlying assumption is that amino acid properties are additive
43Common Property Profiles
- Hydrophobicity (Watch Scales!)
- Helical Wheel (Not a True Profile)
- Hydrophobic Moments (Helix & Beta sheet)
- Flexibility (Thermal B Factors)
- Surface Accessibility (ASA)
- Antigenicity (B-cell epitopes/T-cell epitopes)
44Hydrophobicity Profile
- Plotted using <H>i = Σ Hn/(2k + 1), summed over the window n = i − k to i + k
- Shows location of membrane spanning regions, epitopes, surface exposed AAs, etc.
- B factors from X-ray crystallography
- Potentially identifies antigenic and active sites
from sequence data alone
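The windowed average used for the hydrophobicity profile can be sketched as below. The Kyte-Doolittle scale, the 19-residue window (k = 9) and the 1.6 cutoff for candidate membrane-spanning stretches are assumptions chosen for illustration; other scales and window sizes give different profiles, which is the point of the "Watch Scales!" warning. Function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumption: one of many published scales).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(seq, k=9):
    """Return (center index, <H>) for every full window of width 2k + 1."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    width = 2 * k + 1
    return [
        (i, sum(values[i - k:i + k + 1]) / width)
        for i in range(k, len(values) - k)
    ]


def candidate_membrane_windows(seq, k=9, threshold=1.6):
    """Center positions (0-based) whose window average exceeds the cutoff."""
    return [i for i, h in hydrophobicity_profile(seq, k) if h > threshold]
```

Positions within k residues of either terminus have no full window and are left out of the profile; runs of consecutive flagged centers mark the stretches worth inspecting as possible membrane-spanning helices.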
46Membrane Spanning Regions
47Predicting via Hydrophobicity
Bacteriorhodopsin, OmpA
48Predicting via Hydrophobicity
49Predicting via Neural Nets
- PHDhtm http://cubic.bioc.columbia.edu/predictprotein/submit_adv.html
- TMAP http://www.mbb.ki.se/tmap/index.html
- TMPred http://www.ch.embnet.org/software/TMPRED
50Secondary Structure
51Secondary Structure Prediction
52Secondary Structure Prediction
- Statistical (Chou-Fasman, GOR)
- Homology or Nearest Neighbor (Levin)
- Physico-Chemical (Lim, Eisenberg)
- Pattern Matching (Cohen, Rooman)
- Neural Nets (Qian & Sejnowski, Karplus)
- Evolutionary Methods (Barton, Niemann)
- Combined Approaches (Rost, Levin, Argos)
53Chou-Fasman Statistics
54Prediction Performance
55Best of the Best
- PredictProtein-PHD (72%)
- http://cubic.bioc.columbia.edu/predictprotein
- Jpred (73-75%)
- http://www.compbio.dundee.ac.uk/www-jpred/
- PSIpred (77%)
- http://bioinf.cs.ucl.ac.uk/psipred/
- Proteus (88%)
- http://
56Sample Exam Questions
- Here is the sequence for protein X, calculate its molar absorptivity
- Here is the sequence for protein Y, try to locate the likely membrane spanning regions and explain your reasoning
- Here is the sequence for protein Z, show the tryptic cleavage points
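For the tryptic cleavage question, a simple rule-based sketch is shown below. It assumes the commonly taught rule that trypsin cuts on the C-terminal side of K or R unless the next residue is P; real digests also show missed cleavages and other exceptions that this ignores. Function names are hypothetical.

```python
def tryptic_cleavage_points(seq):
    """Positions (number of residues before the cut) where trypsin cleaves.

    Assumed rule: cut after K or R, except when the next residue is P.
    """
    seq = seq.upper()
    return [
        i + 1
        for i, aa in enumerate(seq[:-1])
        if aa in "KR" and seq[i + 1] != "P"
    ]


def tryptic_peptides(seq):
    """Split a sequence into the peptides expected from a complete digest."""
    seq = seq.upper()
    peptides, start = [], 0
    for cut in tryptic_cleavage_points(seq):
        peptides.append(seq[start:cut])
        start = cut
    peptides.append(seq[start:])  # C-terminal peptide
    return peptides
```

For the sequence `AKRPGKLR`, the cuts fall after residue 2 (K) and residue 6 (K), while the R at position 3 is skipped because it precedes a proline.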