Title: PREDICTING PROTEIN SECONDARY STRUCTURE USING ARTIFICIAL NEURAL NETWORKS
1PREDICTING PROTEIN SECONDARY STRUCTURE USING
ARTIFICIAL NEURAL NETWORKS
- Sudhakar Reddy
- Patrick Shih
- Chrissy Oriol
- Lydia Shih
2Proteins And Secondary Structure
Sudhakar Reddy
3Project Goals
- To predict the secondary structure of a protein
using artificial neural networks.
4STRUCTURES
- Primary structure: the linear arrangement of amino
acid (a.a.) residues that constitute the
polypeptide chain.
5SECONDARY STRUCTURE
- Localized organization of parts of a polypeptide
chain, through hydrogen bonds between different
residues.
- Without any stabilizing interactions, a
polypeptide assumes a random coil structure.
- When stabilizing hydrogen bonds form, the
polypeptide backbone folds periodically into one
of the following geometric arrangements:
- ALPHA HELIX
- BETA SHEET
- U-TURNS
6ALPHA HELIX
- The polypeptide backbone is folded into a spiral
that is held in place by hydrogen bonds between
backbone oxygen atoms and hydrogen atoms.
- The carbonyl oxygen of each peptide bond is
hydrogen bonded to the amide hydrogen of the a.a.
4 residues toward the C-terminus.
- Each alpha helix has 3.6 a.a. per turn.
- Side chains point outward from the backbone.
- The hydrophobic/hydrophilic quality of the helix is
determined entirely by the side chains, because the polar
groups of the peptide backbone are already
involved in H-bonding within the helix and thus are
unable to affect its hydrophobicity or hydrophilicity.
7ALPHA HELIX
8THE BETA SHEET
- Consists of laterally packed beta strands.
- Each beta strand is a short (5-8 residues),
nearly fully extended polypeptide chain.
- Hydrogen bonding between backbone atoms in
adjacent beta strands, within either the same or
different polypeptide chains, forms a beta sheet.
- Orientation can be either parallel or
anti-parallel. In both arrangements side chains
project from both faces of the sheet.
9THE BETA SHEET
10THE BETA SHEET
11TURNS
- Composed of 3-4 residues; compact, U-shaped
secondary structures stabilized by H-bonds
between their end residues.
- Located on the surface of the protein, forming a
sharp bend that redirects the polypeptide
backbone back toward the interior.
- Glycine and proline are commonly present.
- Without these turns, a protein would be large,
extended and loosely packed.
12TURNS
13MOTIFS
- Motifs: regular combinations of secondary
structures.
- Coiled-coil motif
- Helix-loop-helix (Ca)
- Zinc-finger motif
14COILED-COIL MOTIF
15HELIX-LOOP-HELIX (CA)
16ZINC-FINGER MOTIF
17FUTURE
- Protein structure identification is key to
understanding biological function and its role in
health and disease.
- Characterizing a protein structure is helpful in the
development of new agents and devices to treat
disease.
- The challenge of unraveling the structure lies in
developing methods for accurately and reliably
understanding this relationship.
- Most of the current protein structures have been
characterized by NMR and X-ray diffraction.
- Revolution in sequencing studies: the sequence
database keeps growing, but only 3000 structures are known.
18ADVANTAGE
- Because very few conformations of a protein are possible
and structure and sequence are directly related
to each other, we can unravel the secondary
structure by developing an efficient algorithm
that compares new sequences with the ones
available, and use the results in the health care industry.
19WHY SECONDARY STRUCTURE?
- Prediction of secondary structure is an
essential intermediate step on the way to
predicting the full 3-D structure of a protein.
- If the secondary structure of a protein is known,
it is possible to derive a comparatively small
number of possible tertiary structures using
knowledge about the ways that secondary
structural elements pack.
20Artificial Neural Network (ANN)
Peichung Shih
21Biological Neural Network
22Artificial Neural Network
- X1k: input from X1
- X2k: input from X2
- W1k: weight of X1
- W2k: weight of X2
- X0k: bias term
- W0k: weight of bias term
- qk: output of node k
23Artificial Neural Network - Example
[Figure: worked example of a node; labels: 7, 1, Output 1]
24Paradigms of ANN - Overview
- Perceptron
- Adaline / Madaline
- Backpropagation (BP)
25Paradigms of ANN - Feedforward
26Paradigms of ANN - Feedback
27Paradigms of ANN - Supervised
28Paradigms of ANN - Unsupervised
29Paradigms of ANN - Overview
- Perceptron
- Adaline / Madaline
- Backpropagation (BP)
30- One of the earliest learning networks was
proposed by Rosenblatt in the late 1950s.
RULE: net = w1 I1 + w2 I2; if net > Θ (the threshold) then output
o = 1, otherwise o = 0.
MODEL
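The threshold rule above can be sketched in a few lines of Python. This is only an illustration: the function name `perceptron_output` and the weights and threshold used in the demonstration (w1 = w2 = 1, threshold 1.5) are assumptions chosen here, not values taken from the slides.

```python
def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > threshold else 0


# Illustrative values (assumed, not from the slides): w1 = w2 = 1, threshold = 1.5.
# With these values the node fires only when both inputs are 1.
for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, perceptron_output((i1, i2), (1.0, 1.0), 1.5))
```

With these assumed values the node behaves like a logical AND, which is the operation used in the example slides that follow.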
31- Perceptron Example: AND Operation
Initial Network
[Figure: initial network; labels: W, W, 1]
32-39- Perceptron Example: AND Operation (continued, figure slides)
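The figures for the AND example did not survive extraction, so the following is a minimal sketch of how a perceptron could learn AND, assuming the standard perceptron learning rule (weights move by rate x error x input). The function name `train_perceptron`, the zero starting weights, the learning rate of 1 and the use of a bias weight in place of an explicit threshold are all assumptions, not the values shown on the slides.

```python
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(weights, bias, inputs):
    """Threshold unit: fire when the weighted sum plus bias is positive."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if net > 0 else 0


def train_perceptron(samples, rate=1.0, max_epochs=100):
    """Assumed standard perceptron rule: w += rate * (target - output) * input."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                weights = [w + rate * error * x for w, x in zip(weights, inputs)]
                bias += rate * error
                errors += 1
        if errors == 0:  # every sample classified correctly: stop
            break
    return weights, bias


weights, bias = train_perceptron(AND_SAMPLES)
for inputs, target in AND_SAMPLES:
    print(inputs, "->", predict(weights, bias, inputs), "(target", target, ")")
```

Because AND is linearly separable, the loop stops after a handful of passes over the four input patterns.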
40 41 42
[Figures: network diagrams; labels: 0.5, 1, 1, -2, 1, 1, 1.5, 1, 1, 1, 1, 1, 1]
43How Many Hidden Nodes?
We have indicated the number of layers needed.
However, no indication is provided as to the
optimal number of nodes per layer. There is no
formal method to determine this optimal number;
typically, one uses trial and error.
44Hidden Units   Q3 (%)
0                62.50
5                61.60
10               61.50
15               62.60
20               62.30
30               62.50
40               62.70
60               61.40
45JNET AND JPRED
CHRISSY ORIOL
46JNET
- Multiple Alignment
- Neural Network
- Consensus of methods
47TRAINING AND TESTS
- 480 proteins for training (1996 PDB)
- 406 proteins for testing (2000 PDB)
- Blind test
- 7-fold cross validation test
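As a rough illustration of the 7-fold cross-validation mentioned above, the sketch below splits a list of 480 placeholder protein identifiers into seven folds and uses each fold once as the held-out set. The function name `k_fold_splits` and the round-robin assignment of proteins to folds are assumptions; the slides do not say how the folds were built.

```python
def k_fold_splits(items, k=7):
    """Yield (train, test) pairs; each item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment (assumed)
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


proteins = ["protein_%03d" % i for i in range(480)]  # placeholder identifiers
for fold_number, (train, test) in enumerate(k_fold_splits(proteins), start=1):
    print("fold", fold_number, "train:", len(train), "test:", len(test))
```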
48MULTIPLE ALIGNMENTS
49ALIGNMENTS
- Multiple sequence alignment constructed
- Generation of profiles
- Frequency counts of each residue / total
residues in the column (expressed as a percentage)
- Each residue scored by its value from BLOSUM62;
the scores were averaged based on the number
of sequences in that column
- Profile HMM generated by HMMER2
- PSI-BLAST (Position Specific Iterative Basic
Local Alignment Search Tool)
- Frequency of residue
- PSSM (Position Specific Scoring Matrix)
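The frequency profile described above (count of each residue divided by the total residues in the column, as a percentage) could be computed along these lines. The function name `column_frequencies`, the toy alignment and the choice to leave gap characters (`-` and `.`) out of the totals are assumptions made for this sketch.

```python
from collections import Counter

GAP_CHARS = "-."  # assumed gap symbols; gaps are left out of the column totals


def column_frequencies(alignment):
    """Return, for each alignment column, a dict of residue -> percentage."""
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa not in GAP_CHARS]
        counts = Counter(residues)
        total = len(residues)
        profile.append(
            {aa: 100.0 * n / total for aa, n in counts.items()} if total else {}
        )
    return profile


toy_alignment = ["MK", "MR", "LK", ".K"]  # hypothetical two-column alignment
for position, frequencies in enumerate(column_frequencies(toy_alignment), start=1):
    print(position, frequencies)
```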
50HMM PROFILE
- Uses
- Statistical descriptions of a sequence family's
consensus
- Position-specific scores for residues,
insertions and deletions
- Profiles
- Capture important information about the degree
of conservation at different positions
- Capture the varying degree to which gaps, insertions and
deletions are permitted
51PSI-BLAST PROFILE
- Full-length sequences from the initial PSI-BLAST
search are extracted from the database and ordered
by p-value
- Align a and b
- Align c to the profile of a and b
- Iterate the addition of each sequence from the PSI-BLAST
search until all are aligned
- Remove gaps in a and the columns below the gaps
to form a restrained profile which better
represents sequence a
- Result: alignment profile based on the query sequence to
be predicted
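The gap-removal step (dropping every alignment column in which the query sequence a has a gap) might look like the sketch below. The function name `restrain_to_query`, the gap symbols and the toy alignment are assumptions used only for illustration.

```python
def restrain_to_query(alignment, query_index=0, gap_chars="-."):
    """Keep only the columns where the query sequence has a residue."""
    query = alignment[query_index]
    keep = [i for i, residue in enumerate(query) if residue not in gap_chars]
    return ["".join(sequence[i] for i in keep) for sequence in alignment]


# Hypothetical alignment: the first sequence is the query (sequence a).
aligned = ["AC-DE", "ACGDE", "A-GD-"]
for row in restrain_to_query(aligned):
    print(row)
```

After this step every remaining column corresponds to one residue of the query, so the profile has exactly one position per residue to be predicted.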
52PSI-BLAST PROFILE
- Iterative
- Low-complexity sequences polluted the search
profile
- Filtered database to mask out
- Low-complexity sequences (SEG)
- Coiled-coil regions (HELIXFILT)
- Transmembrane helices (HELIXFILT)
53NEURAL NETWORK
54NEURAL NETWORK
- Two neural networks used
- 1st
- Sliding window of 17 residues
- 9 hidden nodes
- 3 outputs
- 2nd
- Sliding window of 19 residues
- 9 hidden nodes
- 3 outputs
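To make the first network's shape concrete (17-residue window, 9 hidden nodes, 3 outputs), here is a sketch of the windowing and a forward pass. Several things are assumptions rather than the Jnet implementation: the one-hot encoding of residues (Jnet feeds profile values instead), the padding symbol `X`, the sigmoid activation, and the random, untrained weights.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed symbol for window positions beyond the chain ends
ALPHABET = AMINO_ACIDS + PAD


def sliding_windows(sequence, size=17):
    """Return one window per residue, centred on that residue."""
    half = size // 2
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + size] for i in range(len(sequence))]


def encode(window):
    """One-hot encode a window (an assumed input encoding)."""
    vector = []
    for residue in window:
        index = ALPHABET.index(residue) if residue in ALPHABET else len(ALPHABET) - 1
        vector.extend(1.0 if i == index else 0.0 for i in range(len(ALPHABET)))
    return vector


def make_layer(n_inputs, n_nodes, rng):
    """Random weights; the last weight of each node is its bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)] for _ in range(n_nodes)]


def forward(layer, inputs):
    """Sigmoid of the weighted sum plus bias, for every node in the layer."""
    return [
        1.0 / (1.0 + math.exp(-(w[-1] + sum(wi * xi for wi, xi in zip(w, inputs)))))
        for w in layer
    ]


rng = random.Random(0)
hidden_layer = make_layer(17 * len(ALPHABET), 9, rng)  # 9 hidden nodes
output_layer = make_layer(9, 3, rng)                   # 3 outputs (helix, strand, coil)

sequence = "MRQQLEMQKKQIMMQILTPEARSRLANLRLTRPDF"
for window in sliding_windows(sequence)[:3]:
    outputs = forward(output_layer, forward(hidden_layer, encode(window)))
    print(window, [round(o, 3) for o in outputs])
```

The weights here are untrained, so the printed outputs carry no predictive meaning; the sketch only shows how each residue is turned into a fixed-size input and mapped to three output values.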
55CONSENSUS COMBINATION OF PREDICTION METHODS
56CONSENSUS COMBINATION OF PREDICTION METHODS
- Jury agreement (identical predictions by all
methods): Q3 = 82%
- No jury: Q3 = 76.4%
- Trained another neural network
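The jury idea can be sketched as follows: a position gets a consensus state only when every method predicts the same state, and is otherwise flagged as "no jury". The function name `jury`, the `*` marker for no-jury positions and the toy predictions are assumptions for illustration; the slides state that a further neural network was trained rather than how the flagged positions were encoded.

```python
NO_JURY = "*"  # assumed marker for positions where the methods disagree


def jury(predictions):
    """Return the consensus string; disagreeing positions are flagged."""
    consensus = []
    for states in zip(*predictions):
        consensus.append(states[0] if len(set(states)) == 1 else NO_JURY)
    return "".join(consensus)


# Hypothetical predictions from three methods for a six-residue stretch.
methods = ["HHHE--", "HHHE--", "HH-E-E"]
print(jury(methods))
```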
57ASSESSMENT OF ACCURACY
Segment Overlap
Confidence: C = 10 x (outmax - outnext)
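A small sketch of the confidence value, assuming outmax is the largest network output for a residue and outnext the second largest. The function name `confidence` and the example output values are hypothetical.

```python
def confidence(outputs):
    """C = 10 x (largest output - second-largest output)."""
    ranked = sorted(outputs, reverse=True)
    return 10 * (ranked[0] - ranked[1])


# Hypothetical outputs for (helix, strand, coil) at two residues.
print(confidence([0.9, 0.1, 0.2]))    # clear winner: high confidence
print(confidence([0.45, 0.40, 0.15])) # close call: low confidence
```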
58RIBONUCLEASE A
KEY: H = helix; E = strand; B = buried
residue; - = exposed residue; no jury
59JNET OUTPUT
YourSeq      : MRQQLEMQKKQIMMQILTPEARSRLANLRLTRPDFVEQIELQLIQLAQMGRVRSKITDEQLKELLKRVAGKKREIKISRK
YA60_PYRHO   : ERALIEAQIQAILRKILTPEARERLARVKLVRPELARQVELILVQLYQAGQITERIDDAKLKRILAQIEAKRREFRIKW.
TF19_HUMAN   : ..KHREAEMRSILAQVLDQSARARLSNLALVKPEKTKAVENYLIQMARYGQLSEKVSEQGLIEILKKVSQQEKTTTVKFN
Q9VUZ8       : ..MRAQEEMKSILSQVLDQQARARLNTLKVSKPEKAQMFENMVIRMAQMGQVRGKLDDAQFVSILESVNAQQSKSSVKYD
YRGK_CAEEL   : ARAENQETAKGMISQILDQAAMQRLSNLAVAKPEKAQMVEAALINMARRGQLSGKMTDDGLKALMERVSAQQKATSVKFD
Y691_METJA   : ..ALLEAEMQALLRKILTPEARERLERIRLARPEFAEAVEVQLIQLAQLGRLPIPLSDEDFKALLERISALKRKREIKIV
YK68_ARCFU   : MRRQVEAQKKAILRAILEPEAKERLSRLKLAHPEIAEAVENQLIYLAQAGRIQSKITDKMLVEILKRVQPKKRETRIIRK
YF69_SCHPO   : ..QEVQDEMRNLLSQILEHPARDRLRRIALVRKDRAEAVEELLLRMAKTGQISHKISEPELIELLEKISGEKRNETKIVI
YMW4_YEAST   : .AGGGENSAPAAIANFLEPQALERLSRVALVRRDRAQAVETYLKKLIATNNVTHKITEAEIVSILNGIAKQQNNSKIIFE

             : 1---------11--------21--------31--------41--------51--------61--------71--------
OrigSeq      : MRQQLEMQKKQIMMQILTPEARSRLANLRLTRPDFVEQIELQLIQLAQMGRVRSKITDEQLKELLKRVAGKKREIKISRK

jalign       : --HHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHH--------HHHHHHHHHHHHHH----EE---
jfreq        : -HHHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHH--------HHHHHHHHHHHH----EEEEE--
jhmm         : -HHHHHHHHHHHHHHH---HHHHHHHHHHH----HHHHHHHHHHHHHHH--------HHHHHHHHHHHHHH---EEEEE-
jnet         : -HHHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHH--------HHHHHHHHHHHH-----EEEEE-
jpssm        : --HHHHHHHHHHHHHH--HHHHHHH-HEEEE---HHHHHHHHHHHHHHH--------HHHHHHHHHHHH-----EEE---

Jpred        : -HHHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHH--------HHHHHHHHHHHH-----EEEE--

MCoil        : --------------------------------------------------------------------------------
MCoilDI      : --------------------------------------------------------------------------------
MCoilTRI     : --------------------------------------------------------------------------------
Lupas 21     : --------------------------------------------------------------------------------
Lupas 14     : --------------------------------------------------------------------------------
Lupas 28     : --------------------------------------------------------------------------------

Jnet_25      : ---BB---B--BBB-BB---B--BB--B-BB---BB-BBB-BBB-BB-BB-B---B----BB-BB--B--------B---
Jnet_5       : -----------BB--B----B---B--B----------B---B--B--------------B--BB---------------
Jnet_0       : --------------------------------------B---B--B--------------B-------------------
Jnet Rel     : 79889998888998643697888849188454657899999999988626987657778999999986007883747728
60JPRED SERVER: Consensus web server
- JNET: default method
- PREDATOR
- Method focused on predicting hydrogen
bonds
- PHD (PredictProtein)
- Neural network using multiple sequence alignment
profiles
61JPRED SERVER cont.
- NNSSP: Nearest-neighbor SS prediction
- DSC: Discrimination of protein Secondary
structure Class
- Based on dividing secondary structure
prediction into its basic concepts
and then using simple, linear statistical
methods to combine the concepts for prediction
- ZPRED
- Physicochemical information
- MULPRED
- Single sequence method combination
62
YourSeq      : MRQQLEMQKKQIMMQILTPEARSRLANLRLTRPDFVEQIELQLIQLAQMGRVRSKITDEQLKELLKRVAGKKREIKISRK
YA60_PYRHO   : ERALIEAQIQAILRKILTPEARERLARVKLVRPELARQVELILVQLYQAGQITERIDDAKLKRILAQIEAKRREFRIKW.
TF19_HUMAN   : ..KHREAEMRSILAQVLDQSARARLSNLALVKPEKTKAVENYLIQMARYGQLSEKVSEQGLIEILKKVSQQEKTTTVKFN
Q9VUZ8       : ..MRAQEEMKSILSQVLDQQARARLNTLKVSKPEKAQMFENMVIRMAQMGQVRGKLDDAQFVSILESVNAQQSKSSVKYD
YRGK_CAEEL   : ARAENQETAKGMISQILDQAAMQRLSNLAVAKPEKAQMVEAALINMARRGQLSGKMTDDGLKALMERVSAQQKATSVKFD
Y691_METJA   : ..ALLEAEMQALLRKILTPEARERLERIRLARPEFAEAVEVQLIQLAQLGRLPIPLSDEDFKALLERISALKRKREIKIV
YK68_ARCFU   : MRRQVEAQKKAILRAILEPEAKERLSRLKLAHPEIAEAVENQLIYLAQAGRIQSKITDKMLVEILKRVQPKKRETRIIRK
YF69_SCHPO   : ..QEVQDEMRNLLSQILEHPARDRLRRIALVRKDRAEAVEELLLRMAKTGQISHKISEPELIELLEKISGEKRNETKIVI
YMW4_YEAST   : .AGGGENSAPAAIANFLEPQALERLSRVALVRRDRAQAVETYLKKLIATNNVTHKITEAEIVSILNGIAKQQNNSKIIFE
consv        : --3-273433568336-522-43--25838573836556-2384484316682-37581274298238323542-3422-

             : 1---------11--------21--------31--------41--------51--------61--------71--------
OrigSeq      : MRQQLEMQKKQIMMQILTPEARSRLANLRLTRPDFVEQIELQLIQLAQMGRVRSKITDEQLKELLKRVAGKKREIKISRK

jalign       : --HHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHH--------HHHHHHHHHHHHHH----EE---
jfreq        : -HHHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHH--------HHHHHHHHHHHH----EEEEE--
jhmm         : -HHHHHHHHHHHHHHH---HHHHHHHHHHH----HHHHHHHHHHHHHHH--------HHHHHHHHHHHHHH---EEEEE-
jnet         : -HHHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHH--------HHHHHHHHHHHH-----EEEEE-
jpssm        : --HHHHHHHHHHHHHH--HHHHHHH-HEEEE---HHHHHHHHHHHHHHH--------HHHHHHHHHHHH-----EEE---
mul          : --HHHHHHHHHHHHHHHHH--HHHHHHHH-H--HHHHHHHHHHHHHH----------HHHHHHHHHHHHHHH--H-EEE-
nnssp        : HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH--HHHHHHHHHHHHHHHHH--------HHHHHHHHHHHHH-----EEEEE
phd          : ---HHHHHHHHHHHHHHHHHHHHHHHHHHH--HHHHHHHHHHHHHHHHH--------HHHHHHHHHHHHHH----EEE--
pred         : ---HHHHHHHHHHHHHHHHHHHHHHHHHHHHH-HHHHHHHHHHHHHHH-------HHHHHHHHHHHHHHHHHHHHH----
zpred        : --HHHHHHHHHHHHHEHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-EE----HHHHHHHHHHHHHHHHH---EE--

Jpred        : -HHHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHH--------HHHHHHHHHHHHH----EEEE--

PHDHtm       : --------------------------------------------------------------------------------
MCoil        : --------------------------------------------------------------------------------
MCoilDI      : --------------------------------------------------------------------------------
MCoilTRI     : --------------------------------------------------------------------------------
Lupas 21     : --------------------------------------------------------------------------------
Lupas 14     : --------------------------------------------------------------------------------
Lupas 28     : --------------------------------------------------------------------------------

PHDacc       : ----B---B-BBBBBBB---B---BB-B-BB----B-BB-BBBB-BB-BB-B---B----B--BB--B------B-B-U-
Jnet_25      : ---BB---B--BBB-BB---B--BB--B-BB---BB-BBB-BBB-BB-BB-B---B----BB-BB--B--------B---
Jnet_5       : -----------BB--B----B---B--B----------B---B--B--------------B--BB---------------
Jnet_0       : --------------------------------------B---B--B--------------B-------------------

PHD Rel      : 97527999999999999899999999986315269999999999999964332235649999999999962356225319
Pred Rel     : 00777700999990990609990999886606668099999999009677787757768989909999957077777000
Jnet Rel     : 79889998888998643697888849188454657899999999988626987657778999999986007883747728
63Accuracy Evaluation
64Methods
- Per-residue accuracy
- Q3 measurement: the traditional way
- Matthews correlation coefficient
- Per-segment accuracy
- SOV measurement: CASP2
- Subcategorizing the incorrect predictions
- Over-prediction: alpha/beta predicted where the structure is coil
- Under-prediction: coil predicted where the structure is alpha/beta
- Wrong prediction: alpha predicted where the structure is beta, or vice
versa
65How to measure Q3
- Qindex
- Qhelix, Qstrand and Qcoil for a single
conformational state
- Qi = (number of residues correctly
predicted in state i)/(number of residues
observed in state i) x 100
- Q3 for all three states
- Q3 = (number of residues correctly
predicted)/(number of all residues) x 100
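The two definitions above translate directly into code. In this sketch the function name `q_index` is an assumption, and the observed and predicted strings are the ones used later in the deck's SOV example (C = coil, E = strand).

```python
def q_index(observed, predicted, state=None):
    """Q3 when state is None, otherwise Qi for the given observed state."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if state is None or o == state]
    if not pairs:
        raise ValueError("no residues observed in state %r" % state)
    correct = sum(1 for o, p in pairs if o == p)
    return 100.0 * correct / len(pairs)


observed = "CCEEECCCCCCEEEEEECCC"
predicted = "CCCCCCCEEEEECCCEECCC"
print("Q3     =", q_index(observed, predicted))
print("Q_E    =", round(q_index(observed, predicted, "E"), 1))
print("Q_coil =", round(q_index(observed, predicted, "C"), 1))
```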
66How to measure Matthews correlation coefficients
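The formula on this slide was an image and is not recoverable, so the sketch below uses the standard per-state Matthews correlation coefficient, C = (p n - u o) / sqrt((p + u)(p + o)(n + u)(n + o)), where p counts residues correctly predicted in the state, n residues correctly rejected, u under-predicted residues and o over-predicted residues. The function name `matthews` and the convention of returning 0 when the denominator vanishes are assumptions.

```python
import math


def matthews(observed, predicted, state):
    """Per-state Matthews correlation coefficient."""
    p = n = u = o = 0
    for obs, pred in zip(observed, predicted):
        if obs == state and pred == state:
            p += 1  # correctly predicted in the state
        elif obs != state and pred != state:
            n += 1  # correctly rejected
        elif obs == state:
            u += 1  # under-predicted (missed)
        else:
            o += 1  # over-predicted
    denominator = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denominator if denominator else 0.0


observed = "CCEEECCCCCCEEEEEECCC"
predicted = "CCCCCCCEEEEECCCEECCC"
print(round(matthews(observed, predicted, "E"), 4))
```

Unlike Qi, the coefficient penalizes over-prediction as well as under-prediction, so a method cannot score well on a state simply by predicting it everywhere.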
67Problems in per-residue accuracy
- It does not reflect 3D structure.
- Example: assigning the entire myoglobin chain as
a single helix gives a Q3 score of 80.
- Conformational variation is observed at secondary
structure segment ends.
- Example: a prediction with a low Q3 value can still predict the fold
well.
68Q: What is a good measure? A: A structurally
oriented measure
- A structurally oriented measure considers the
following:
- Type and position of secondary structure segments
rather than a per-residue assignment of
conformational state.
- Natural variation of segment boundaries among
families of homologous proteins.
69How to measure SOV
70SOV Example
- Observed (S1):  CCEEECCCCCCEEEEEECCC
- Predicted (S2): CCCCCCCEEEEECCCEECCC
- Minov: length of the actual overlap of segments s1 and s2
- Maxov: total extent covered by s1 and s2 together
71SOV Example Cont.
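[Figure: overlapping E segments, each labelled S(E), over the region EEECCCCCCEEEEEE]
Contribution of each overlapping pair: (minov(s1, s2) + delta(s1, s2)) / maxov(s1, s2)
delta(s1, s2) = min{ (10 - 1); (1); (15/2); (10/2) }
delta(s1, s2) = min{ (6 - 2); (2); (15/2); (10/2) }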
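The sketch below computes SOV for a single state on the example strings, assuming the definition of Zemla et al. (1999): each overlapping pair contributes (minov + delta) / maxov x len(s1), delta = min(maxov - minov, minov, len(s1)/2, len(s2)/2) with integer halves, and the sum is normalized by the lengths of the observed segments (counted once per overlapping pair, plus observed segments with no overlap). The function names are assumptions, and the half-length terms here are taken from the actual segment lengths.

```python
def segments(states, state):
    """Return inclusive (start, end) index pairs for runs of the given state."""
    runs, start = [], None
    for i, s in enumerate(states):
        if s == state and start is None:
            start = i
        elif s != state and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(states) - 1))
    return runs


def sov_state(observed, predicted, state):
    """Segment overlap score (percent) for one conformational state."""
    total, norm = 0.0, 0
    predicted_segments = segments(predicted, state)
    for s1 in segments(observed, state):
        len1 = s1[1] - s1[0] + 1
        overlapping = [s2 for s2 in predicted_segments
                       if s2[0] <= s1[1] and s2[1] >= s1[0]]
        if not overlapping:
            norm += len1  # observed segment with no predicted partner
            continue
        for s2 in overlapping:
            len2 = s2[1] - s2[0] + 1
            minov = min(s1[1], s2[1]) - max(s1[0], s2[0]) + 1
            maxov = max(s1[1], s2[1]) - min(s1[0], s2[0]) + 1
            delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
            total += (minov + delta) / maxov * len1
            norm += len1
    if norm == 0:
        raise ValueError("state %r is not observed" % state)
    return 100.0 * total / norm


observed = "CCEEECCCCCCEEEEEECCC"
predicted = "CCCCCCCEEEEECCCEECCC"
print("SOV(E) =", round(sov_state(observed, predicted, "E"), 1))
```

For the two overlapping pairs in this example the code finds minov = 1, maxov = 10 and minov = 2, maxov = 6, matching the first two terms inside each min on the slide.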
72Evaluation-Step 1(query sequence)
- Hypothetical Protein
- MRQQLEMQKKQIMMQILTPEARSRLANLRLTRPDFVEQIELQLIQLAQMGR
VRSKITDEQLKELLKRVAGKKREIKISRK
- 80 residues
- Methanothermobacter thermautotrophicus
- Structure solved by NMR
- Christendat, D., et al. Nat. Struct. Biol. 7 (10),
903-909 (2000)
73Evaluation-Step 2 (programs)
74Servers
- APSSP: http://imtech.ernet.in/raghava/apssp/
- JPred: http://jura.ebi.ac.uk:8888/
- PHD: http://cubic.bioc.columbia.edu/predictprotein
- PROFsec: http://cubic.bioc.columbia.edu/predictprotein
- PSIpred: http://insulin.brunel.ac.uk/psiform.html
- SAM-T99sec: http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html
75Evaluation-Step 3
- Conversion of DSSP secondary structure from 8
states to 3 states
H = alpha helix; E = beta strand; L = coil (others)
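One way to perform the 8-to-3 state reduction is a simple lookup, sketched below. The slide only names the three target states, so the exact grouping used here (H, G, I to H; E, B to E; everything else to L) is an assumption; it is a commonly used reduction, but other groupings exist and change the measured accuracy.

```python
# Assumed grouping of the DSSP codes; not stated on the slide.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(states):
    """Map an 8-state DSSP string to 3 states (H, E, L)."""
    return "".join(DSSP_TO_3.get(s, "L") for s in states)


print(reduce_dssp("HHHHGGGTTEEEEBSS  "))
```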
76Evaluation-Step 4
- First column: protein sequence (AA) in one-letter
code
- Second column: observed (OSEC) secondary
structure
- Third column: predicted (PSEC) secondary
structure
- http://predictioncenter.llnl.gov/local/sov/sov.html
77Evaluation-Result
78EVA: Evaluation of Automatic protein structure
prediction
http://cubic.bioc.columbia.edu/eva/sec/graph/common3.jpg
79Conclusion
- Jpred is a pioneer among methods that give high
Q3 and SOV scores.
- Secondary structure prediction using a jury of
neural networks is one of the best methods.