Title: Protein Feature Identification
1Protein Feature Identification
- Microbiology 343
- David Wishart
- david.wishart_at_ualberta.ca
2Objectives
- To show that almost everything you do in the lab, or need to do to work with a protein, can be done on a computer
- Learning methods and algorithms for predicting composition and sequence features
- Learning when to use these tools
3Proteins
- Exhibit far more sequence and chemical complexity than DNA or RNA
- Properties and structure are defined by the sequence and side chains of their constituent amino acids
- The engines of life
- >95% of all drug targets are proteins
- Favorite topic of post-genomic era
4The Post-genomic Challenge
- How to rapidly identify a protein?
- How to rapidly purify a protein?
- How to identify post-translational modifications?
- How to find information about function?
- How to find information about activity?
- How to find information about location?
- How to find information about structure?
Answer: Look at Protein Features
5Protein Features
ACEDFHIKNMF SDQWWIPANMC ASDFDPQWERE LIQNMDKQERT QA
TRPQDS...
Sequence View Structure View
6Different Types of Features
- Composition Features
- Mass, pI, Absorptivity, Rg, Volume
- Sequence Features
- Active sites, Binding Sites, Targeting, Location, Property Profiles, 2° structure
- Structure Features
- Supersecondary Structure, Global Fold, ASA, Volume
7Where To Go
http://www.expasy.org/tools/
8Compositional Features
- Molecular Weight
- Amino Acid Frequency
- Isoelectric Point
- UV Absorptivity
- Solubility, Size, Shape
- Radius of Gyration
- Free Energy of Folding
9Molecular Weight
10Molecular Weight
- Useful for SDS-PAGE and 2D gel analysis
- Useful for deciding on a size-exclusion chromatography (SEC) matrix
- Useful for deciding on the MW cutoff (MWCO) for dialysis
- Essential in synthetic peptide analysis
- Essential in peptide sequencing (classical or mass-spectrometry based)
- Essential in proteomics and high-throughput protein characterization
11Molecular Weight
- Crude MW calculation: $MW = 110 \times \text{Numres}$
- Exact MW calculation: $MW = \sum_i AA_i \times MW_i$
- Remember to add 1 water (18.01 amu) after adding all residues
- Note isotopic weights
- Corrections for CHO, PO4, Acetyl, CONH2
12Amino Acid versus Residue
[Figure: structures of a free amino acid (H2N-CHR-COOH) and of a residue within a chain (-NH-CHR-CO-); panels: Amino Acid, Residue]
13Protein Identification via MW
- MOWSE
- http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
- CombSearch
- http://ca.expasy.org/tools/CombSearch/
- Mascot
- http://www.matrixscience.com/search_form_select.html
- AACompSim/AACompIdent
- http://ca.expasy.org/tools/
14Molecular Weight & Proteomics
2-D Gel QTOF Mass Spectrometry
15Amino Acid Frequency
- Deviations greater than 2X average indicate something of interest
- High K or R indicates possible nucleoprotein
- High Cs indicate stable but hard-to-fold protein
- High G, P, Q, or N says lack of stable structure
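The "greater than 2X average" rule of thumb can be sketched as follows. The function names are hypothetical, and the default reference of 5% per residue (a uniform distribution over 20 amino acids) is only a placeholder assumption; a real analysis would pass in database-wide average frequencies.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(seq):
    """Return the fractional frequency of each residue present in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def enriched_residues(seq, reference=None, fold=2.0):
    """Return residues whose frequency exceeds fold x the reference frequency."""
    if reference is None:
        # Assumption: uniform 5% baseline, used only for illustration.
        reference = {aa: 0.05 for aa in AMINO_ACIDS}
    freqs = amino_acid_frequencies(seq)
    return {
        aa: f for aa, f in freqs.items()
        if aa in reference and f > fold * reference[aa]
    }


if __name__ == "__main__":
    # A lysine-rich toy sequence: only K should be flagged.
    print(enriched_residues("KKKKKKAGSTLVEDQNPFWY"))
```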
16Isoelectric Point (pI)
- The pH at which a protein has a net charge = 0
- $Q = \sum_i N_i / (1 + 10^{pH - pK_i})$
Transcendental equation: it has no closed-form solution for pH, so the pI is found numerically
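Because the charge equation cannot be solved for pH in closed form, the pI is usually found by a numerical search. The sketch below uses bisection. Several things here are assumptions rather than part of the slide: the function names, the pK values (one approximate textbook-style set; published scales differ), and the sign convention, in which the slide's term is used for basic groups and acidic groups contribute a negative charge with the exponent reversed.

```python
# Assumed pK values for ionizable groups (illustrative only; scales differ).
POSITIVE_PK = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PK = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(seq, ph):
    """Net charge Q at a given pH from Henderson-Hasselbalch terms."""
    counts = {"Nterm": 1, "Cterm": 1}
    for aa in seq.upper():
        counts[aa] = counts.get(aa, 0) + 1
    # Basic groups: N_i / (1 + 10^(pH - pK_i)), each carrying +1 when protonated.
    positive = sum(counts.get(g, 0) / (1 + 10 ** (ph - pk))
                   for g, pk in POSITIVE_PK.items())
    # Acidic groups: N_i / (1 + 10^(pK_i - pH)), each carrying -1 when ionized.
    negative = sum(counts.get(g, 0) / (1 + 10 ** (pk - ph))
                   for g, pk in NEGATIVE_PK.items())
    return positive - negative


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection search for the pH where the net charge crosses zero."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print(round(isoelectric_point("ACEDFHIKNMF"), 2))
```

Bisection works here because the net charge decreases monotonically as pH rises, so there is exactly one zero crossing between pH 0 and 14.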
17Isoelectric Point
- Calculation is only approximate (+/- 1 pH unit)
- Does not include 3° structure interactions
- Can be used in developing purification protocols via ion exchange chromatography
- Can be used in estimating spot location for isoelectric focusing gels
- Can be used to decide on best pH to store or analyze protein
18UV Spectroscopy
19UV Absorptivity
- UV (ultraviolet) light has a wavelength of 200 to 400 nm
- Most proteins and peptides (and all nucleic acids) absorb UV light quite strongly
- UV spectroscopy is the most common form of spectroscopy performed today
- UV spectra can be used to identify or classify some proteins or protein classes
20UV Absorptivity
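- $OD_{280} = (5690 \times W + 1280 \times Y) / MW \times \text{Conc.}$
- $\text{Conc.} = OD_{280} \times MW / (5690 \times W + 1280 \times Y)$
- W and Y are the numbers of Trp and Tyr residues in the sequence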
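A short sketch of the absorptivity calculation, using the coefficients from this slide (5690 per Trp and 1280 per Tyr, in M^-1 cm^-1). The function names and the `path_cm` parameter are assumptions added for illustration; with a 1 cm path the result follows the slide's formula and comes out in mg/mL (equivalently g/L).

```python
def molar_absorptivity(seq):
    """Molar absorptivity at 280 nm from Trp (W) and Tyr (Y) counts."""
    seq = seq.upper()
    return 5690 * seq.count("W") + 1280 * seq.count("Y")


def concentration_mg_per_ml(od280, seq, mw, path_cm=1.0):
    """Conc. = OD280 x MW / (5690 x W + 1280 x Y), per cm of path length."""
    epsilon = molar_absorptivity(seq)
    if epsilon == 0:
        raise ValueError("no Trp or Tyr: A280 cannot be used for this sequence")
    return od280 * mw / (epsilon * path_cm)


if __name__ == "__main__":
    seq = "SDQWWIPANMC"  # second block of the example sequence: two Trp, no Tyr
    print(molar_absorptivity(seq))  # 2 x 5690
```

Proteins with no Trp or Tyr have essentially no absorbance at 280 nm, which is why the sketch raises an error rather than dividing by zero.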
21Hydrophobicity
- Indicates Solubility
- Indicates Stability
- Indicates Location (membrane or cytoplasm)
- Indicates Globularity or tendency to form
spherical structure
22Hydrophobicity
- Average Hydrophobicity: $AH = \sum_i AA_i \times H_i$
- Hydrophobic Ratio: $RH = \sum H(-) / \sum H(+)$
- Hydrophobic Ratio: $RHP = \text{philic} / \text{phobic}$
- Linear Charge Density: $LIND = (K + R + D + E + H + 2) / \text{Numres}$
- Solubility: $SOL = RH + LIND - 0.05\,AH$
- Average AH = -2.5 ± 2.5; Insol > 0.1; Unstrc < -6
- Average RH = 1.2 ± 0.4; Insol < 0.8; Unstrc > 1.9
- Average RHP = 0.9 ± 0.2; Insol < 0.7; Unstrc > 1.4
- Average LIND = 0.25; Insol < 0.2; Unstrc > 0.4
- Average SOL = 1.6 ± 0.5; Insol < 1.1; Unstrc > 2.5
23Different Types of Features
- Composition Features
- Mass, pI, Absorptivity, Hydrophobicity
- Sequence Features
- Active sites, Binding Sites, Targeting, Location, Property Profiles, 2° structure
- Structure Features
- Supersecondary Structure, Global Fold, ASA, Volume
24Sequence Features
AHGQSDFILDEADGMMKSTVPN
HGFDSAAVLDEADHILQWERTY
GGGNDEYIVDEADSVIASDFGH
[LIVM][LIVM]DEAD[LIVM][LIVM] (EIF 4A ATP DEPENDENT HELICASE)
25Sites that Support Pattern Queries
- OWL Database
- http://umber.sbs.man.ac.uk/dbbrowser/OWL/
- PIR Website
- http://pir.georgetown.edu/pirwww/search/patmatch.html
- ScanProsite at ExPASy
- http://ca.expasy.org/tools/scanprosite/
- PattinProt
- http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_pattinprot.html/
26Regular Expressions
- C[ACG]T - Matches CAT, CCT and CGT only
- C.T - Matches CAT, CaT, C1T, CXT, not CT
- CA?T - Matches CT or CAT only
- C+T - Matches CT, CCT, CCCT, CCCCT
- C(HE)?A[TP] - Matches CHEAT, CAT, CHEAP, CAP
- S[A-I,L-Q,T-Z]?LK[A-I,L-Q,T-Z]?A - Matches SLKA
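The example patterns on this slide can be checked with Python's standard `re` module. The sketch assumes they are written with the usual regular-expression metacharacters (square brackets for a character class, `+` for one or more, `?` for optional), so the first is read as `C[ACG]T`. The helper names are hypothetical, and the motif search uses the DEAD-box example sequences from the Sequence Features slide.

```python
import re

# (pattern, strings that should match in full, strings that should not)
EXAMPLES = [
    (r"C[ACG]T", ["CAT", "CCT", "CGT"], ["CTT"]),
    (r"C.T", ["CAT", "CaT", "C1T", "CXT"], ["CT"]),
    (r"CA?T", ["CT", "CAT"], ["CAAT"]),
    (r"C+T", ["CT", "CCT", "CCCT", "CCCCT"], ["T"]),
    (r"C(HE)?A[TP]", ["CHEAT", "CAT", "CHEAP", "CAP"], ["CHAT"]),
]


def matches(pattern, text):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def find_motif(pattern, sequence):
    """Return (0-based start, matched text) for every hit in a sequence."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]


if __name__ == "__main__":
    for pattern, good, bad in EXAMPLES:
        print(pattern, [matches(pattern, s) for s in good + bad])
    print(find_motif(r"[LIVM]{2}DEAD", "AHGQSDFILDEADGMMKSTVPN"))
```

`re.fullmatch` tests an entire string, while `re.finditer` scans a longer sequence for every occurrence, which is the mode a pattern-query server works in.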
27PROSITE Pattern Expressions
- C - [ACG] - T - Matches CAT, CCT and CGT only
- C - X - T - Matches CAT, CCT, CDT, CET, etc.
- C - {A} - T - Matches every CXT except CAT
- C(1,3) - T - Matches CT, CCT, CCCT
- C - A(2) - [TP] - Matches CAAT, CAAP
- [LIV] - [VIC] - X(2) - G - [DENQ] - X - [LIVFM](2) - G
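PROSITE-style patterns map almost one-to-one onto regular expressions, which the following sketch illustrates. It assumes the conventional PROSITE notation: `[...]` for allowed residues, `{...}` for excluded residues, `x` or `X` for any residue, and `(n)` or `(n,m)` for repeats. The function name is hypothetical and only this subset is handled; terminus anchors (`<`, `>`) and other extensions are not.

```python
import re

# One pattern element: a residue, [allowed set] or {excluded set},
# optionally followed by a repeat count (n) or range (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + repr(element))
        core, low, high = m.groups()
        if core in ("x", "X"):
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


if __name__ == "__main__":
    p = "[LIV] - [VIC] - X(2) - G - [DENQ] - X - [LIVFM](2) - G"
    print(prosite_to_regex(p))
```

Note one design choice: `X` is treated as the wildcard, as on the slide, so a literal unknown residue coded as X cannot be expressed in this sketch.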
28Sequence Feature Databases
- PROSITE - http://ca.expasy.org/prosite/
- BLOCKS - http://www.blocks.fhcrc.org/
- DOMO - http://www.infobiogen.fr/services/domo/
- PFAM - http://pfam.wustl.edu
- PRINTS - http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
- SEQSITE - PepTool
29Phosphorylation Sites
[Figure: phosphorylated residues pY, pT and pS, each carrying a PO4 group]
30Phosphorylation Sites
31Signaling Sites
32Protease Cut Sites
33Binding Sites
34Family Signature Sequences
35Enzyme Active Sites
36Better Methods for Sequence Feature ID
- Sequence Profiles/Scoring Matrices
- Neural Networks
- Hidden Markov Models
- Bayesian Belief Nets
- Reference Point Logistics
37What Can Be Predicted?
- O-Glycosylation Sites
- Phosphorylation Sites
- Protease Cut Sites
- Nuclear Targeting Sites
- Mitochondrial Targeting Sites
- Chloroplast Targeting Sites
- Signal Sequences
- Signal Sequence Cleavage
- Peroxisome Targeting Sites
- ER Targeting Sites
- Transmembrane Sites
- Tyrosine Sulfation Sites
- GPI (Glycosylphosphatidylinositol) Anchor Sites
- PEST sites
- Coiled-Coil Sites
- T-Cell/MHC Epitopes
- Protein Lifetime
- A whole lot more.
38Cutting Edge Sequence Feature Servers
- Membrane Helix Prediction
- http://www.cbs.dtu.dk/services/TMHMM-2.0/
- T-Cell Epitope Prediction
- http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm
- O-Glycosylation Prediction
- http://www.cbs.dtu.dk/services/NetOGlyc/
- Phosphorylation Prediction
- http://www.cbs.dtu.dk/services/NetPhos/
- Protein Localization Prediction
- http://psort.nibb.ac.jp/
39Subcellular Localization
http://www.cs.ualberta.ca/bioinfo/PA/Sub/
40Profiles & Motifs are Useful
- Helped identify active site of HIV protease
- Helped identify SH2/SH3 class of STPs
- Helped identify important GTP oncoproteins
- Helped identify hidden leucine zipper in HGA
- Used to scan for lectin binding domains
- Regularly used to predict T-cell epitopes
41Amino Acid Property Profiles
42Amino Acid Property Profiles
- Intent is to predict a protein's physical properties directly from sequence, as opposed to composition or wet chemistry
- Offers a more detailed, graphical view of sequence-specific properties than compositional analysis (more powerful?)
- Underlying assumption is that amino acid properties are additive
43Common Property Profiles
- Hydrophobicity (Watch Scales!)
- Helical Wheel (Not a True Profile)
- Hydrophobic Moments (Helix & Beta sheet)
- Flexibility (Thermal B Factors)
- Surface Accessibility (ASA)
- Antigenicity (B-cell epitopes/T-cell epitopes)
44Hydrophobicity Profile
- Plotted using $\langle H \rangle_i = \sum_{n=i-k}^{i+k} H_n / (2k + 1)$
- Shows location of membrane spanning regions,
epitopes, surface exposed AAs, etc.
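The profile is a moving average of per-residue hydrophobicity over a window of 2k + 1 residues. The sketch below is one way to compute it. The slides do not say which scale to use ("Watch Scales!"), so the Kyte-Doolittle scale here is an assumption, as are the function names and the defaults: a 19-residue window (k = 9) and a cutoff of 1.6, a commonly quoted heuristic for candidate membrane-spanning segments on that scale.

```python
# Kyte-Doolittle hydropathy values (assumed scale; others give different profiles).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(seq, k=9):
    """<H>_i = sum of H_n for n in [i-k, i+k], divided by 2k+1.

    Returns one value per window centre i = k .. len(seq)-k-1.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    window = 2 * k + 1
    return [sum(values[i - k:i + k + 1]) / window
            for i in range(k, len(values) - k)]


def candidate_membrane_segments(seq, k=9, threshold=1.6):
    """Return (start, end) ranges of window centres (0-based, end exclusive)
    whose averaged hydrophobicity exceeds the threshold."""
    profile = hydrophobicity_profile(seq, k)
    segments, start = [], None
    for offset, value in enumerate(profile):
        centre = offset + k
        if value > threshold and start is None:
            start = centre
        elif value <= threshold and start is not None:
            segments.append((start, centre))
            start = None
    if start is not None:
        segments.append((start, k + len(profile)))
    return segments


if __name__ == "__main__":
    toy = "D" * 10 + "L" * 21 + "D" * 10  # hydrophobic stretch flanked by charged residues
    print(candidate_membrane_segments(toy))
```

Peaks in the profile mark candidate membrane-spanning regions, while troughs tend to mark surface-exposed, potentially antigenic stretches.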
45Flexibility
- B factors from X-ray crystallography
- Potentially identifies antigenic and active sites
from sequence data alone
46Membrane Spanning Regions
47Predicting via Hydrophobicity
Bacteriorhodopsin, OmpA
48Predicting via Hydrophobicity
49Predicting via Neural Nets
- PHDhtm http://cubic.bioc.columbia.edu/predictprotein/submit_adv.html
- TMAP http://www.mbb.ki.se/tmap/index.html
- TMPred http://www.ch.embnet.org/software/TMPRED_form.html
ACDEGF...
50Secondary Structure
51Secondary Structure Prediction
52Secondary Structure Prediction
- Statistical (Chou-Fasman, GOR)
- Homology or Nearest Neighbor (Levin)
- Physico-Chemical (Lim, Eisenberg)
- Pattern Matching (Cohen, Rooman)
- Neural Nets (Qian & Sejnowski, Karplus)
- Evolutionary Methods (Barton, Niemann)
- Combined Approaches (Rost, Levin, Argos)
53Chou-Fasman Statistics
54Prediction Performance
55Best of the Best
- PredictProtein-PHD (72%)
- http://cubic.bioc.columbia.edu/predictprotein
- Jpred (73-75%)
- http://www.compbio.dundee.ac.uk/www-jpred/
- PSIpred (77%)
- http://bioinf.cs.ucl.ac.uk/psipred/
- Proteus (88%)
- http://129.128.185.184:8080/proteus/
56Sample Exam Questions
- Here is the sequence for protein X, calculate its molar absorptivity
- Here is the sequence for protein Y, try to locate the likely membrane spanning regions and explain your reasoning
- Here is the sequence for protein Z, show the tryptic cleavage points
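For the tryptic cleavage question, a minimal sketch is given below. It assumes the usual textbook rule (trypsin cuts on the C-terminal side of K or R, but not when the next residue is P); the function name is hypothetical, and real digests also show missed cleavages that this does not model.

```python
import re

# Zero-width split point: preceded by K or R, not followed by P.
# Splitting on an empty match requires Python 3.7 or later.
TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")


def tryptic_digest(seq):
    """Return the peptides from a complete tryptic digest of seq."""
    # A site at the very end of the sequence yields an empty string, which is dropped.
    return [peptide for peptide in TRYPSIN_SITE.split(seq.upper()) if peptide]


if __name__ == "__main__":
    print(tryptic_digest("AKRPGKA"))  # the R-P bond is not cut
```

Joining the returned peptides reproduces the input sequence, which is a quick sanity check when marking cleavage points by hand.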