Title: BCB 444/544
1BCB 444/544
Lecture 25 Secondary Structure Prediction
Oct 21
- Thanks to Drena Dobbs for many borrowed & modified PPTs
2Required Reading (before lecture)
- Wed Oct 21 - for Lecture 25
- Chp 14
- Fri Oct 23 for Lecture 26
- Chp 16
3Homework Assignments
- HW 4 posted
- Due Monday, October 26th by 5pm
4Required Reading
- Yang Zhang (2008) Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 18:342-348.
5544 Projects
6Exam II
- Exam II will be next Friday, October 31st
- More information coming soon
7Chou and Fasman
- Start by computing amino acid propensities to belong to a given type of secondary structure. Propensities > 1 mean that residue type i is likely to be found in the corresponding secondary structure type.
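Such a propensity is just a ratio of frequencies. A minimal sketch, using made-up counts rather than real PDB statistics:

```python
def propensity(count_in_state, total_in_state, count_overall, total_overall):
    """Chou-Fasman-style propensity: frequency of a residue type within one
    secondary-structure state divided by its overall frequency.
    Values > 1 mean the residue is over-represented in that state."""
    freq_in_state = count_in_state / total_in_state
    freq_overall = count_overall / total_overall
    return freq_in_state / freq_overall

# Illustrative (invented) counts: Ala appears 120 times among 1000 helix
# residues, and 830 times among 10000 residues overall.
p_ala_helix = propensity(120, 1000, 830, 10000)   # > 1, so a helix former
```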
8The GOR method
Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, eight residues on either side are considered. A helix propensity table contains information about the propensity of residues at 17 positions when the conformation of residue j is helical, so the helix propensity table has 20 x 17 entries. Similar tables are built for strands and turns. GOR simplification: the predicted state of AAj is calculated as the sum of the position-dependent propensities of all residues around AAj.
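The GOR sum can be sketched as follows. The propensity tables here are toy placeholders indexed by (amino acid, window offset), not the real GOR parameters:

```python
def gor_score(seq, j, table, half_window=8):
    """Sum the position-dependent propensities of the residues in the
    window j-8 .. j+8 (positions past the sequence ends are skipped)."""
    score = 0.0
    for offset in range(-half_window, half_window + 1):
        k = j + offset
        if 0 <= k < len(seq):
            score += table.get((seq[k], offset), 0.0)
    return score

def predict_state(seq, j, tables):
    """tables maps a state name ('H', 'E', 'T', ...) to its 20 x 17
    propensity table; the highest-scoring state is the prediction."""
    return max(tables, key=lambda state: gor_score(seq, j, tables[state]))
```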
9Consensus Data Mining (CDM)
- Developed by Jernigan Group at ISU
- Basic premise: a combination of 2 complementary methods can enhance performance by harnessing the distinct advantages of both; combines FDM + GOR V
- FDM - Fragment Data Mining - exploits availability of sequence-similar fragments in the PDB, which can lead to highly accurate prediction - much better than GOR V - for such fragments, but such fragments are not available in many cases
- GOR V - Garnier, Osguthorpe, Robson V - predicts secondary structure of less similar fragments with good performance; these are protein fragments for which the FDM method cannot find suitable structures
- For references & additional details: http://gor.bb.iastate.edu/cdm/
10Neural networks
- The most successful methods for predicting secondary structure are based on neural networks. The overall idea is that neural networks can be trained to recognize amino acid patterns in known secondary structure units, and to use these patterns to distinguish between the different types of secondary structure.
- Neural networks classify input vectors or examples into categories (2 or more)
- They are loosely based on biological neurons.
11Biological Neurons
Dendrites receive inputs; Axon gives output. Image from Christos Stergiou and Dimitrios Siganos, http://www.doc.ic.ac.uk/nd/surprise_96/journal/vol4/cs11/report.html
12Artificial Neuron (Perceptron)
Image from Christos Stergiou and Dimitrios Siganos, http://www.doc.ic.ac.uk/nd/surprise_96/journal/vol4/cs11/report.html
13The perceptron
[Diagram: inputs X1 ... XN, weighted by w1 ... wN, feed a threshold unit T that produces the output]
The perceptron classifies the input vector X into two categories. If the weights and threshold T are not known in advance, the perceptron must be trained. Ideally, the perceptron should be trained to return the correct answer on all training examples, and to perform well on examples it has never seen. The training set must contain both types of data (i.e. with 1 and 0 outputs).
14The perceptron
Notes:
- The input is a vector X, and the weights can be stored in another vector W.
- The perceptron computes the dot product S = X·W
- The output F is a function of S; it is often discrete (i.e. 1 or 0), in which case the function is the step function. For continuous output, a sigmoid is often used.
[Plot: step function and sigmoid, both ranging from 0 to 1, with the sigmoid crossing 1/2 at the threshold]
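A minimal sketch of this forward pass, covering both output choices (function and parameter names are mine):

```python
import math

def perceptron(x, w, threshold, discrete=True):
    """Compute S = X . W, then apply either the step function
    (output 1 if S >= T, else 0) or a sigmoid centred on the threshold."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    if discrete:
        return 1 if s >= threshold else 0
    return 1.0 / (1.0 + math.exp(-(s - threshold)))  # equals 1/2 at S = T
```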
15The perceptron
Training a perceptron: find the weights W that minimize the error function
Error(W) = Σi=1..P (F(W·Xi) - t(Xi))²
P: number of training data; Xi: training vectors; F(W·Xi): output of the perceptron; t(Xi): target value for Xi
Use steepest descent:
- compute the gradient of the error
- update the weight vector: W ← W - ε · gradient
- iterate
(ε = learning rate)
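The steepest-descent loop can be sketched like this, using a sigmoid output so the gradient of the squared error has a closed form; the threshold is folded into a trailing bias weight, and all names here are mine:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_perceptron(data, n_inputs, lr=0.5, epochs=2000):
    """data: list of (input vector, target) pairs with targets 0 or 1.
    Minimizes sum_i (F(W.Xi) - t(Xi))^2 by steepest descent."""
    w = [0.0] * (n_inputs + 1)            # last entry acts as the -threshold
    for _ in range(epochs):
        for x, t in data:
            xb = list(x) + [1.0]          # append constant bias input
            f = sigmoid(sum(xi * wi for xi, wi in zip(xb, w)))
            grad_scale = 2 * (f - t) * f * (1 - f)   # dError/dS, using f' = f(1-f)
            w = [wi - lr * grad_scale * xi for wi, xi in zip(w, xb)]
    return w

# OR is linearly separable, so training should succeed on it.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
```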
16Biological Neural Network
Image from http://en.wikipedia.org/wiki/Biological_neural_network
17Artificial Neural Network
A complete neural network is a set of perceptrons interconnected such that the outputs of some units become the inputs of other units. Many topologies are possible! Neural networks are trained just like perceptrons, by minimizing an error function.
18Neural networks and Secondary Structure prediction
- Experience from Chou and Fasman and GOR has shown that:
  - In predicting the conformation of a residue, it is important to consider a window around it.
  - Helices and strands occur in stretches
  - It is important to consider multiple sequences
19PHD Secondary structure prediction using NN
20PHD Input
For each residue, consider a window of size 13:
13 x 20 = 260 values
21Sequence→Structure Network
- For each amino acid aj, a window of 13 residues (aj-6 ... aj ... aj+6) is considered
- The corresponding rows of the sequence profile are fed into the neural network, and the output is 3 probabilities for aj: P(aj, alpha), P(aj, beta) and P(aj, other)
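Assembling the 13 x 20 = 260 input values can be sketched like this. For simplicity the sketch one-hot encodes single residues; PHD actually feeds in sequence-profile rows from a multiple alignment:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_input(seq, j, width=13):
    """Concatenate one 20-value column per window position around residue j.
    Positions past the sequence ends contribute all-zero columns."""
    half = width // 2
    vec = []
    for k in range(j - half, j + half + 1):
        column = [0.0] * len(AMINO_ACIDS)
        if 0 <= k < len(seq):
            column[AMINO_ACIDS.index(seq[k])] = 1.0
        vec.extend(column)
    return vec
```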
22PHD Network 1: Sequence → Structure
[Diagram: Network 1 maps 13 x 20 input values to 3 output values: Pa(i), Pb(i), Pc(i)]
23Structure→Structure network
- For each aj, PHD now considers a window of 17 residues: the probabilities P(ak, alpha), P(ak, beta) and P(ak, other) for k in [j-8, j+8] are fed into the second-layer neural network, which again produces probabilities that residue aj is in each of the 3 possible conformations
24PHD Network 2: Structure → Structure
For each residue, consider a window of size 17
[Diagram: Network 2 maps 17 x 3 = 51 input values (Pa, Pb, Pc for each window position) to 3 output values: Pa(i), Pb(i), Pc(i)]
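Assembling the 17 x 3 = 51 inputs for the second network works the same way as for the first; a sketch (names are mine):

```python
def second_stage_input(probs, j, width=17):
    """probs: one (Pa, Pb, Pc) triple per residue from the first network.
    Concatenate the triples for residues j-8 .. j+8; positions past the
    sequence ends contribute zero triples."""
    half = width // 2
    vec = []
    for k in range(j - half, j + half + 1):
        vec.extend(probs[k] if 0 <= k < len(probs) else (0.0, 0.0, 0.0))
    return vec
```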
25PHD
- Jury system: PHD has trained several neural networks with different training sets; all neural networks are applied to the test sequence, and the results are averaged
- Prediction: for each position, the secondary structure with the highest average score is output as the prediction
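The jury step is just an average over networks followed by an argmax; a minimal sketch:

```python
def jury_predict(per_network_probs):
    """per_network_probs: for one residue, a list over networks of
    (P_helix, P_strand, P_other) triples. Average them and output the
    state with the highest mean score."""
    n = len(per_network_probs)
    avg = [sum(p[i] for p in per_network_probs) / n for i in range(3)]
    states = ("H", "E", "C")      # helix, strand, other
    return states[avg.index(max(avg))]
```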
26Secondary Structure Prediction Methods
- 1st Generation methods
  - Ab initio - used relatively small dataset of structures available
  - Chou-Fasman - based on amino acid propensities (3-state)
  - GOR - also propensity-based (4-state)
- 2nd Generation methods
  - based on much larger datasets of structures now available
  - GOR II, III, IV, SOPM, GOR V, FDM
- 3rd Generation methods
  - Homology-based & Neural network based
  - PHD, PSIPRED, SSPRO, PROF, HMMSTR, CDM
- Meta-Servers
  - combine several different methods
  - Consensus & Ensemble based
  - JPRED, PredictProtein, Proteus
27Secondary Structure Prediction Servers
- Prediction Evaluation?
  - Q3 score - % of residues correctly predicted (3-state) - in cross-validation experiments
- Best results? Meta-servers
  - http://expasy.org/tools/ (scroll for 2° structure prediction)
  - http://www.russell.embl-heidelberg.de/gtsp/secstrucpred.html
  - JPred: www.compbio.dundee.ac.uk/www-jpred
  - PredictProtein: http://www.predictprotein.org/ (Rost, Columbia)
- Best "individual" programs? ??
  - CDM: http://gor.bb.iastate.edu/cdm/ (Sen & Jernigan, ISU)
  - FDM (not available separately as server) (Cheng & Jernigan, ISU)
  - GOR V: http://gor.bb.iastate.edu/ (Kloczkowski & Jernigan, ISU)
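The Q3 score itself is straightforward to compute from aligned 3-state strings; a sketch:

```python
def q3(predicted, observed):
    """Q3: percentage of residues whose 3-state label (H/E/C) is correct."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must be the same length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```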
28Secondary Structure Prediction for Different Types of Proteins/Domains
- For Complete proteins
  - Globular Proteins - use methods previously described
  - Transmembrane (TM) Proteins - use special methods (next slides)
- For Structural Domains (many under development)
  - Coiled-Coil Domains (Protein interaction domains)
  - Zinc Finger Domains (DNA binding domains)
  - others
29SS Prediction for Transmembrane Proteins
- Transmembrane (TM) Proteins
  - Only a few in the PDB - but ~30% of cellular proteins are membrane-associated!
  - Hard to determine experimentally, so prediction is important
  - TM domains are relatively 'easy' to predict!
    - Why? Constraints due to the hydrophobic environment
- 2 main classes of TM proteins
  - α-helical
  - β-barrel
30SS Prediction for TM α-Helices
- α-Helical TM domains
  - Helices are 17-25 amino acids long (span the membrane)
  - Predominantly hydrophobic residues
  - Helices oriented perpendicular to membrane
  - Orientation can be predicted using the "positive inside" rule
    - Residues at the cytosolic (inside or cytoplasmic) side of a TM helix, near the hydrophobic anchor, are more positively charged than those on the lumenal (inside an organelle in eukaryotes) or periplasmic side (space between inner & outer membranes in gram-negative bacteria)
  - Alternating polar & hydrophobic residues provide clues to interactions among helices within the membrane
- Servers?
  - TMHMM or HMMTOP - ~70% accuracy - confused by hydrophobic signal peptides (short hydrophobic sequences that target proteins to the endoplasmic reticulum, ER)
  - Phobius - ~94% accuracy - uses distinct HMM models for TM helices & signal peptide sequences
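The hydrophobicity constraint behind these predictions can be illustrated with a simple sliding-window scan over the standard Kyte-Doolittle hydropathy scale. This is only a crude sketch of the principle, not how TMHMM or Phobius (which are HMM-based) actually work:

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy over a sliding window of roughly TM-helix length;
    a sustained run of high values suggests a membrane-spanning helix."""
    half = window // 2
    return [sum(KD[c] for c in seq[i - half:i + half + 1]) / window
            for i in range(half, len(seq) - half)]
```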
31SS Prediction for TM β-Barrels
- β-Barrel TM domains
  - β-strands are amphipathic (partly hydrophobic, partly hydrophilic)
  - Strands are 10-22 amino acids long
  - Every 2nd residue is hydrophobic, facing the lipid bilayer
  - Other residues are hydrophilic, facing the "pore" or opening
- Servers? Harder problem, fewer servers
  - TBBPred - uses NN or SVM
  - Accuracy?
32Prediction of Coiled-Coil Domains
- Coiled-coils
  - Superhelical protein motifs or domains, with two or more interacting α-helices that form a "bundle"
  - Often mediate inter-protein (& intra-protein) interactions
  - 'Easy' to detect in primary sequence
    - Internal repeat of 7 residues (heptad)
    - Positions 1 & 4: hydrophobic (facing helical interface)
    - Positions 2, 3, 5, 6, 7: hydrophilic (exposed to solvent)
  - Helical wheel representation - can be used to manually detect these, based on amino acid sequence
- Servers?
  - Coils, Multicoil - probability-based methods
  - 2Zip - for Leucine zippers (special type of CC in TFs) - characterized by Leu-rich motif L-X(6)-L-X(6)-L-X(6)-L
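The Leu-rich motif above translates directly into a regular expression (X = any residue). A sketch of that pattern match, not of the 2Zip algorithm itself:

```python
import re

# L-X(6)-L-X(6)-L-X(6)-L: leucines spaced one heptad apart.
LEUCINE_ZIPPER = re.compile(r"L.{6}L.{6}L.{6}L")

def find_zipper_motifs(seq):
    """Return (start, end) spans of non-overlapping motif matches."""
    return [m.span() for m in LEUCINE_ZIPPER.finditer(seq)]
```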