Title: BCB 444/544
1BCB 444/544
Lecture 25 Secondary Structure Prediction
Oct 21
- Thanks to Drena Dobbs for many borrowed & modified PPTs
2Required Reading (before lecture)
- Wed Oct 21 - for Lecture 25
- Chp 14
- Fri Oct 23 for Lecture 26
- Chp 16
3Homework Assignments
- HW 4 posted
- Due Monday, October 26th by 5pm
4Required Reading
- Yang Zhang (2008) Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 18:342-348.
5544 Projects
6Exam II
- Exam II will be next Friday, October 31st
- More information coming soon
7Chou and Fasman
- Start by computing amino acid propensities to belong to a given type of secondary structure. Propensities > 1 mean that residue type i is likely to be found in the corresponding secondary structure type.
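Such a propensity is just a ratio of frequencies. A minimal sketch, using made-up counts rather than real PDB statistics:

```python
def propensity(count_in_state, total_in_state, count_overall, total_overall):
    """Chou-Fasman-style propensity: frequency of a residue type within one
    secondary-structure state divided by its overall frequency.
    Values > 1 mean the residue is over-represented in that state."""
    freq_in_state = count_in_state / total_in_state
    freq_overall = count_overall / total_overall
    return freq_in_state / freq_overall

# Illustrative (invented) counts: Ala appears 120 times among 1000 helix
# residues, and 830 times among 10000 residues overall.
p_ala_helix = propensity(120, 1000, 830, 10000)   # > 1, so a helix former
```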
8The GOR method
Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, eight residues on either side are considered. A helix propensity table contains information about the propensity of residues at 17 positions when the conformation of residue j is helical, so the helix propensity table has 20 x 17 entries. Similar tables are built for strands and turns. GOR simplification: the predicted state of AAj is calculated as the sum of the position-dependent propensities of all residues around AAj.
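The GOR sum can be sketched as follows. The propensity tables here are toy placeholders indexed by (amino acid, window offset), not the real GOR parameters:

```python
def gor_score(seq, j, table, half_window=8):
    """Sum the position-dependent propensities of the residues in the
    window j-8 .. j+8 (positions past the sequence ends are skipped)."""
    score = 0.0
    for offset in range(-half_window, half_window + 1):
        k = j + offset
        if 0 <= k < len(seq):
            score += table.get((seq[k], offset), 0.0)
    return score

def predict_state(seq, j, tables):
    """tables maps a state name ('H', 'E', 'T', ...) to its 20 x 17
    propensity table; the highest-scoring state is the prediction."""
    return max(tables, key=lambda state: gor_score(seq, j, tables[state]))
```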
9Consensus Data Mining (CDM)
- Developed by Jernigan Group at ISU
- Basic premise: a combination of 2 complementary methods can enhance performance by harnessing the distinct advantages of both; combines FDM + GOR V
- FDM - Fragment Data Mining - exploits availability of sequence-similar fragments in the PDB, which can lead to highly accurate prediction - much better than GOR V - for such fragments, but such fragments are not available in many cases
- GOR V - Garnier, Osguthorpe, Robson V - predicts secondary structure of less similar fragments with good performance; these are protein fragments for which the FDM method cannot find suitable structures
- For references & additional details: http://gor.bb.iastate.edu/cdm/
10Neural networks
- The most successful methods for predicting secondary structure are based on neural networks. The overall idea is that neural networks can be trained to recognize amino acid patterns in known secondary structure units, and to use these patterns to distinguish between the different types of secondary structure.
- Neural networks classify input vectors or examples into categories (2 or more)
- They are loosely based on biological neurons.
11Biological Neurons
Dendrites receive inputs; Axon gives output. Image from Christos Stergiou and Dimitrios Siganos, http://www.doc.ic.ac.uk/nd/surprise_96/journal/vol4/cs11/report.html
12Artificial Neuron (Perceptron)
Image from Christos Stergiou and Dimitrios Siganos, http://www.doc.ic.ac.uk/nd/surprise_96/journal/vol4/cs11/report.html
13The perceptron
[Diagram: inputs X1 ... XN, weighted by w1 ... wN, feed a threshold unit T that produces the output]
The perceptron classifies the input vector X into two categories. If the weights and threshold T are not known in advance, the perceptron must be trained. Ideally, the perceptron should be trained to return the correct answer on all training examples, and to perform well on examples it has never seen. The training set must contain both types of data (i.e. with 1 and 0 outputs).
14The perceptron
Notes:
- The input is a vector X, and the weights can be stored in another vector W.
- The perceptron computes the dot product S = X·W
- The output F is a function of S; it is often discrete (i.e. 1 or 0), in which case the function is the step function. For continuous output, a sigmoid is often used.
[Plot: step function and sigmoid, both ranging from 0 to 1, with the sigmoid crossing 1/2 at the threshold]
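A minimal sketch of this forward pass, covering both output choices (function and parameter names are mine):

```python
import math

def perceptron(x, w, threshold, discrete=True):
    """Compute S = X . W, then apply either the step function
    (output 1 if S >= T, else 0) or a sigmoid centred on the threshold."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    if discrete:
        return 1 if s >= threshold else 0
    return 1.0 / (1.0 + math.exp(-(s - threshold)))  # equals 1/2 at S = T
```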
15The perceptron
Training a perceptron: find the weights W that minimize the error function
Error(W) = Σi=1..P (F(W·Xi) - t(Xi))²
P: number of training data; Xi: training vectors; F(W·Xi): output of the perceptron; t(Xi): target value for Xi
Use steepest descent:
- compute the gradient of the error
- update the weight vector: W ← W - ε · gradient
- iterate
(ε = learning rate)
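The steepest-descent loop can be sketched like this, using a sigmoid output so the gradient of the squared error has a closed form; the threshold is folded into a trailing bias weight, and all names here are mine:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_perceptron(data, n_inputs, lr=0.5, epochs=2000):
    """data: list of (input vector, target) pairs with targets 0 or 1.
    Minimizes sum_i (F(W.Xi) - t(Xi))^2 by steepest descent."""
    w = [0.0] * (n_inputs + 1)            # last entry acts as the -threshold
    for _ in range(epochs):
        for x, t in data:
            xb = list(x) + [1.0]          # append constant bias input
            f = sigmoid(sum(xi * wi for xi, wi in zip(xb, w)))
            grad_scale = 2 * (f - t) * f * (1 - f)   # dError/dS, using f' = f(1-f)
            w = [wi - lr * grad_scale * xi for wi, xi in zip(w, xb)]
    return w

# OR is linearly separable, so training should succeed on it.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
```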
16Biological Neural Network
Image from http://en.wikipedia.org/wiki/Biological_neural_network
17Artificial Neural Network
A complete neural network is a set of perceptrons interconnected such that the outputs of some units become the inputs of other units. Many topologies are possible! Neural networks are trained just like perceptrons, by minimizing an error function.
18Neural networks and Secondary Structure prediction
- Experience from Chou and Fasman and GOR has shown that:
  - In predicting the conformation of a residue, it is important to consider a window around it.
  - Helices and strands occur in stretches
  - It is important to consider multiple sequences
19PHD Secondary structure prediction using NN
20PHD Input
For each residue, consider a window of size 13:
13 x 20 = 260 values
21Sequence→Structure Network
- For each amino acid aj, a window of 13 residues (aj-6 ... aj ... aj+6) is considered
- The corresponding rows of the sequence profile are fed into the neural network, and the output is 3 probabilities for aj: P(aj, alpha), P(aj, beta) and P(aj, other)
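Assembling the 13 x 20 = 260 input values can be sketched like this. For simplicity the sketch one-hot encodes single residues; PHD actually feeds in sequence-profile rows from a multiple alignment:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_input(seq, j, width=13):
    """Concatenate one 20-value column per window position around residue j.
    Positions past the sequence ends contribute all-zero columns."""
    half = width // 2
    vec = []
    for k in range(j - half, j + half + 1):
        column = [0.0] * len(AMINO_ACIDS)
        if 0 <= k < len(seq):
            column[AMINO_ACIDS.index(seq[k])] = 1.0
        vec.extend(column)
    return vec
```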
22PHD Network 1: Sequence → Structure
[Diagram: Network 1 maps 13 x 20 input values to 3 output values: Pa(i), Pb(i), Pc(i)]
23Structure→Structure network
- For each aj, PHD now considers a window of 17 residues: the probabilities P(ak, alpha), P(ak, beta) and P(ak, other) for k in [j-8, j+8] are fed into the second-layer neural network, which again produces probabilities that residue aj is in each of the 3 possible conformations
24PHD Network 2: Structure → Structure
For each residue, consider a window of size 17
[Diagram: Network 2 maps 17 x 3 = 51 input values (Pa, Pb, Pc for each window position) to 3 output values: Pa(i), Pb(i), Pc(i)]
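Assembling the 17 x 3 = 51 inputs for the second network works the same way as for the first; a sketch (names are mine):

```python
def second_stage_input(probs, j, width=17):
    """probs: one (Pa, Pb, Pc) triple per residue from the first network.
    Concatenate the triples for residues j-8 .. j+8; positions past the
    sequence ends contribute zero triples."""
    half = width // 2
    vec = []
    for k in range(j - half, j + half + 1):
        vec.extend(probs[k] if 0 <= k < len(probs) else (0.0, 0.0, 0.0))
    return vec
```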
25PHD
- Jury system: PHD has trained several neural networks with different training sets; all neural networks are applied to the test sequence, and the results are averaged
- Prediction: for each position, the secondary structure with the highest average score is output as the prediction
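The jury step is just an average over networks followed by an argmax; a minimal sketch:

```python
def jury_predict(per_network_probs):
    """per_network_probs: for one residue, a list over networks of
    (P_helix, P_strand, P_other) triples. Average them and output the
    state with the highest mean score."""
    n = len(per_network_probs)
    avg = [sum(p[i] for p in per_network_probs) / n for i in range(3)]
    states = ("H", "E", "C")      # helix, strand, other
    return states[avg.index(max(avg))]
```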
26Secondary Structure Prediction Methods
- 1st Generation methods
  - Ab initio - used relatively small dataset of structures available
  - Chou-Fasman - based on amino acid propensities (3-state)
  - GOR - also propensity-based (4-state)
- 2nd Generation methods
  - based on much larger datasets of structures now available
  - GOR II, III, IV, SOPM, GOR V, FDM
- 3rd Generation methods
  - Homology-based & Neural network based
  - PHD, PSIPRED, SSPRO, PROF, HMMSTR, CDM
- Meta-Servers
  - combine several different methods
  - Consensus & Ensemble based
  - JPRED, PredictProtein, Proteus
27Secondary Structure Prediction Servers
- Prediction Evaluation?
  - Q3 score - % of residues correctly predicted (3-state) - in cross-validation experiments
- Best results? Meta-servers
  - http://expasy.org/tools/ (scroll for 2° structure prediction)
  - http://www.russell.embl-heidelberg.de/gtsp/secstrucpred.html
  - JPred: www.compbio.dundee.ac.uk/www-jpred
  - PredictProtein: http://www.predictprotein.org/ (Rost, Columbia)
- Best "individual" programs? ??
  - CDM: http://gor.bb.iastate.edu/cdm/ (Sen & Jernigan, ISU)
  - FDM (not available separately as server) (Cheng & Jernigan, ISU)
  - GOR V: http://gor.bb.iastate.edu/ (Kloczkowski & Jernigan, ISU)
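The Q3 score itself is straightforward to compute from aligned 3-state strings; a sketch:

```python
def q3(predicted, observed):
    """Q3: percentage of residues whose 3-state label (H/E/C) is correct."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must be the same length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```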
28Secondary Structure Prediction for Different Types of Proteins/Domains
- For Complete proteins
  - Globular Proteins - use methods previously described
  - Transmembrane (TM) Proteins - use special methods (next slides)
- For Structural Domains (many under development)
  - Coiled-Coil Domains (Protein interaction domains)
  - Zinc Finger Domains (DNA binding domains)
  - others
29SS Prediction for Transmembrane Proteins
- Transmembrane (TM) Proteins
  - Only a few in the PDB - but ~30% of cellular proteins are membrane-associated!
  - Hard to determine experimentally, so prediction is important
  - TM domains are relatively 'easy' to predict!
    - Why? Constraints due to the hydrophobic environment
- 2 main classes of TM proteins
  - α-helical
  - β-barrel
30SS Prediction for TM α-Helices
- α-Helical TM domains
  - Helices are 17-25 amino acids long (span the membrane)
  - Predominantly hydrophobic residues
  - Helices oriented perpendicular to membrane
  - Orientation can be predicted using the "positive inside" rule
    - Residues at the cytosolic (inside or cytoplasmic) side of a TM helix, near the hydrophobic anchor, are more positively charged than those on the lumenal (inside an organelle in eukaryotes) or periplasmic side (space between inner & outer membranes in gram-negative bacteria)
  - Alternating polar & hydrophobic residues provide clues to interactions among helices within the membrane
- Servers?
  - TMHMM or HMMTOP - ~70% accuracy - confused by hydrophobic signal peptides (short hydrophobic sequences that target proteins to the endoplasmic reticulum, ER)
  - Phobius - ~94% accuracy - uses distinct HMM models for TM helices & signal peptide sequences
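The hydrophobicity constraint behind these predictions can be illustrated with a simple sliding-window scan over the standard Kyte-Doolittle hydropathy scale. This is only a crude sketch of the principle, not how TMHMM or Phobius (which are HMM-based) actually work:

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy over a sliding window of roughly TM-helix length;
    a sustained run of high values suggests a membrane-spanning helix."""
    half = window // 2
    return [sum(KD[c] for c in seq[i - half:i + half + 1]) / window
            for i in range(half, len(seq) - half)]
```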
31SS Prediction for TM β-Barrels
- β-Barrel TM domains
  - β-strands are amphipathic (partly hydrophobic, partly hydrophilic)
  - Strands are 10-22 amino acids long
  - Every 2nd residue is hydrophobic, facing the lipid bilayer
  - Other residues are hydrophilic, facing the "pore" or opening
- Servers? Harder problem, fewer servers
  - TBBPred - uses NN or SVM
  - Accuracy?
32Prediction of Coiled-Coil Domains
- Coiled-coils
  - Superhelical protein motifs or domains, with two or more interacting α-helices that form a "bundle"
  - Often mediate inter-protein (& intra-protein) interactions
  - 'Easy' to detect in primary sequence
    - Internal repeat of 7 residues (heptad)
    - Positions 1 & 4: hydrophobic (facing helical interface)
    - Positions 2, 3, 5, 6, 7: hydrophilic (exposed to solvent)
  - Helical wheel representation - can be used to manually detect these, based on amino acid sequence
- Servers?
  - Coils, Multicoil - probability-based methods
  - 2Zip - for Leucine zippers (special type of CC in TFs) - characterized by Leu-rich motif L-X(6)-L-X(6)-L-X(6)-L
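The Leu-rich motif above translates directly into a regular expression (X = any residue). A sketch of that pattern match, not of the 2Zip algorithm itself:

```python
import re

# L-X(6)-L-X(6)-L-X(6)-L: leucines spaced one heptad apart.
LEUCINE_ZIPPER = re.compile(r"L.{6}L.{6}L.{6}L")

def find_zipper_motifs(seq):
    """Return (start, end) spans of non-overlapping motif matches."""
    return [m.span() for m in LEUCINE_ZIPPER.finditer(seq)]
```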