Title: Computational Physics Applied to Macromolecules
1Prediction of structural and functional features in proteins starting from the residue sequence
INTRODUCTION TO NEURAL NETWORKS
2MAPPING PROBLEMS Secondary structure
Covalent structure TTCCPSIVARSNFNVCRLPGTPEAIC
ATYTGCIIIPGATCPGDYAN
3MAPPING PROBLEMS Topology of transmembrane
proteins
Topography
position of Trans Membrane Segments along the
sequence
ALALMLCMLTYRHKELKLKLKK ALALMLCMLTYRHKELKLKLKK
ALALMLCMLTYRHKELKLKLKK
4First generation methods Single residue statistics
- Propensity scales: for each residue, the association between the residue and the different features is statistically evaluated
- Physical and chemical features of residues
- A propensity value for any structure can be associated with any residue. HOW?
5Secondary structure Chou-Fasman propensity scale
Given a set of known structures we can count how many times a residue is associated with a structure.
Example:
ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh
N(A,h) = 7, N(A,c) = 1, N = 40
P(A,h) = 7/40, P(A,c) = 1/40
Is that enough for estimating a propensity?
6Secondary structure Chou-Fasman propensity scale
Given a set of known structures we can count how many times a residue is associated with a structure.
Example:
ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh
N(A,h) = 7, N(A,c) = 1, N = 40
P(A,h) = 7/40, P(A,c) = 1/40
We need to estimate how independent the residue-to-structure association is.
P(h) = 27/40, P(c) = 13/40
7Secondary structure Chou-Fasman propensity scale
Given a set of known structures we can count how many times a residue is associated with a structure.
Example:
ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh
N(A,h) = 7, N(A,c) = 1, N = 40
P(A,h) = 7/40, P(A,c) = 1/40
P(h) = 27/40, P(c) = 13/40
If the structure is independent of the residue: P(A,h) = P(A) P(h)
The ratio P(A,h) / (P(A) P(h)) is the propensity
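The propensity ratio defined on this slide can be checked numerically on the 40-residue example. The following is a minimal sketch, not the original Chou-Fasman implementation; the function name `propensities` is an illustrative choice, and a single short sequence is far too little data for a real scale.

```python
from collections import Counter

def propensities(sequence, structure):
    """Return {(residue, state): P(r,s) / (P(r) * P(s))} for two aligned strings."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    n = len(sequence)
    pair_counts = Counter(zip(sequence, structure))   # N(r, s)
    residue_counts = Counter(sequence)                # N(r)
    state_counts = Counter(structure)                 # N(s)
    return {
        (r, s): (c / n) / ((residue_counts[r] / n) * (state_counts[s] / n))
        for (r, s), c in pair_counts.items()
    }

# The example from the slide: 8 helix, 13 coil, 19 helix positions.
seq = "ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL"
sst = "h" * 8 + "c" * 13 + "h" * 19
scale = propensities(seq, sst)
print(round(scale[("A", "h")], 3), round(scale[("A", "c")], 3))
```

With N(A) = 8, the helix propensity of alanine is (7/40) / ((8/40)(27/40)) = 280/216, i.e. greater than 1, while its coil propensity is below 1.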
8Secondary structure Chou-Fasman propensity scale
Given a LARGE set of examples, a propensity value
can be computed for each residue and each
structure type
9Secondary structure Chou-Fasman propensity scale
Given a new sequence, a secondary structure prediction can be obtained by plotting the propensity values for each structure, residue by residue.
Considering three secondary structures (H, E, C), the overall accuracy, as evaluated on an uncorrelated set of sequences with known structure, is very low: Q3 = 50-60%
10Secondary structure Chou-Fasman propensity scale
http://www.expasy.ch/cgi-bin/protscale.pl
11Transmembrane alpha-helices Kyte-Doolittle scale
It is computed taking into consideration the octanol-water partition coefficient, combined with the propensity of the residues to be found in known transmembrane helices
Ala  1.800   Arg -4.500   Asn -3.500   Asp -3.500
Cys  2.500   Gln -3.500   Glu -3.500   Gly -0.400
His -3.200   Ile  4.500   Leu  3.800   Lys -3.900
Met  1.900   Phe  2.800   Pro -1.600   Ser -0.800
Thr -0.700   Trp -0.900   Tyr -1.300   Val  4.200
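A scale like this is typically applied as a sliding-window average along the sequence. The sketch below uses the values listed above; the function name and the default window of 19 residues are assumptions for illustration and are not stated on the slide.

```python
# Kyte-Doolittle values from the slide, keyed by one-letter code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Average hydropathy over every full window (window size is an assumed parameter)."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KD[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Example on the sequence used earlier in the slides, with a short window.
profile = hydropathy_profile("ALALMLCMLTYRHKELKLKLKK", window=7)
print([round(v, 2) for v in profile])
```

High positive averages flag hydrophobic stretches that are candidate transmembrane segments; the threshold to apply is a separate choice not covered here.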
12Second generation methods GOR
The structure of a residue in a protein strongly depends on the sequence context.
It is possible to estimate the influence of a residue in determining the structure of a residue close to it along the sequence. Usually windows from -8/+8 to -13/+13 are considered.
Coefficients P(A,s,i) estimate the contribution of residue A in determining the structure s of a residue that is i positions away along the sequence
13Secondary structure GOR method
Q3 = 65% (considering three secondary structures (H, E, C), and evaluating the overall accuracy on an uncorrelated set of sequences with known structure)
The contribution of each position in the window is independent of the others. No correlation among the positions in the window is taken into account.
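The independence assumption means that the score of each structure is just a sum of per-position coefficients over the window. A minimal sketch of that summation follows; the coefficient table is a made-up toy, the tie-breaking rule (first state listed wins) is an assumption, and real GOR coefficients are estimated from a database of known structures.

```python
def gor_predict(sequence, coeffs, half_window=8, states="CHE"):
    """Predict one state per residue by summing independent window contributions.

    coeffs maps (state, offset, residue) to a contribution; missing entries count as 0.
    """
    prediction = []
    for j in range(len(sequence)):
        scores = {}
        for s in states:
            total = 0.0
            for i in range(-half_window, half_window + 1):
                k = j + i
                if 0 <= k < len(sequence):      # positions outside the chain are skipped
                    total += coeffs.get((s, i, sequence[k]), 0.0)
            scores[s] = total
        # Ties go to the first state in `states` (an arbitrary choice).
        prediction.append(max(states, key=lambda s: scores[s]))
    return "".join(prediction)

# Hypothetical toy coefficients, for illustration only.
toy = {("H", 0, "A"): 1.0, ("H", 1, "A"): 0.5, ("E", 0, "V"): 1.0}
print(gor_predict("GAVG", toy, half_window=1))
```

Because each term depends on a single window position, no pairwise correlation between positions can be captured, which is the limitation the slide points out.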
14A more efficient method Neural Networks
Alternative computing algorithm: analogies with the computation in the nervous system.
1) The nervous system is made up of elementary computing units: neurons
2) The electric signal flows in a determined direction (dendrites -> axon) (Principle of dynamic polarization)
3) There is no cytoplasmic continuity among the neurons. Each neuron communicates specifically with some neighbouring neurons by means of synapses (Principle of connective specificity)
15Tools out of machine learning approaches
Neural Networks can learn the mapping from
sequence to secondary structure
Training
Data Base Subset
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
General rules
EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE
Known mapping
16Neural network for secondary structure prediction
Output
Input
M P I L K Q K P I H Y H P N H G E A K G
Input window (one column per residue, one row per residue type):
   K P I H Y H P N H
A  0 0 0 0 0 0 0 0 0
C  0 0 0 0 0 0 0 0 0
D  0 0 0 0 0 0 0 0 0
E  0 0 0 0 0 0 0 0 0
F  0 0 0 0 0 0 0 0 0
G  0 0 0 0 0 0 0 0 0
H  0 0 0 1 0 1 0 0 1
I  0 0 1 0 0 0 0 0 0
K  1 0 0 0 0 0 0 0 0
L  0 0 0 0 0 0 0 0 0
M  0 0 0 0 0 0 0 0 0
N  0 0 0 0 0 0 0 1 0
P  0 1 0 0 0 0 1 0 0
Q  0 0 0 0 0 0 0 0 0
R  0 0 0 0 0 0 0 0 0
S  0 0 0 0 0 0 0 0 0
T  0 0 0 0 0 0 0 0 0
V  0 0 0 0 0 0 0 0 0
W  0 0 0 0 0 0 0 0 0
Y  0 0 0 0 1 0 0 0 0
Usually: input window of 17-23 residues, 4-15 hidden neurons
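The binary input matrix on this slide is a one-hot encoding of a sequence window. A minimal sketch of that encoding follows; the function name is illustrative, and padding positions beyond the chain ends with all-zero columns is an assumption, since the slide does not say how chain termini are handled.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # row order used on the slide

def one_hot_window(sequence, center, window=9):
    """Return `window` columns of 20 binary values centred on `center` (0-based)."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    columns = []
    for pos in range(center - half, center + half + 1):
        column = [0] * len(ALPHABET)
        if 0 <= pos < len(sequence):            # assumed: off-chain positions stay all-zero
            column[ALPHABET.index(sequence[pos])] = 1
        columns.append(column)
    return columns

seq = "MPILKQKPIHYHPNHGEAKG"
encoded = one_hot_window(seq, center=10)        # window K P I H Y H P N H
h_row = [col[ALPHABET.index("H")] for col in encoded]
print(h_row)
```

The network input is the concatenation of these columns, so a 9-residue window gives 9 x 20 = 180 binary inputs.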
18Third generation methods evolutionary information
19The Network Architecture for Secondary Structure
Prediction
The First Network (Sequence to Structure)
20The Network Architecture for Secondary Structure
Prediction
The Second Network (Structure to Structure)
21The Performance on the Task of Secondary
Structure Prediction
22Combining different networks Q3 = 76-78%
23Secondary Structure Prediction
From sequence
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
To secondary structure
And to the reliability of the prediction
7997688899999988776886778999887679956889999999
24SERVERS
- PredictProtein, Burkhard Rost (Columbia Univ.): http://cubic.bioc.columbia.edu/predictprotein/
- PsiPRED, David Jones (UCL): http://bioinf.cs.ucl.ac.uk/psipred/
- JPred, Geoff Barton (Dundee Univ.)
- SecPRED: http://www.biocomp.unibo.it
25Chameleon sequences
QEALEIA
26We extract, from a set of 822 non-homologous proteins (174,192 residues):
- 2,452 5-mer chameleons
- 107 6-mer chameleons
- 16 7-mer chameleons
- 1 8-mer chameleon
The total number of residues in chameleons is 26,044, from 755 protein chains (15%)
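The extraction step can be sketched as a scan over all k-mers of a set of annotated chains. The definition used here is an assumption: a chameleon is taken to be a k-mer found at least once entirely in helix (H) and at least once entirely in strand (E). The two toy chains and their annotations are hypothetical; only the 7-mer QEALEIA comes from the slides.

```python
from collections import defaultdict

def find_chameleons(chains, k, states=("H", "E")):
    """Return the k-mers observed in every one of `states`, each time in a uniform conformation."""
    seen = defaultdict(set)
    for sequence, structure in chains:
        for i in range(len(sequence) - k + 1):
            ss = structure[i:i + k]
            # Keep the k-mer only if all k residues share one of the target states.
            if ss[0] in states and ss == ss[0] * k:
                seen[sequence[i:i + k]].add(ss[0])
    return sorted(kmer for kmer, confs in seen.items() if len(confs) == len(states))

# Hypothetical chains: the same 7-mer annotated once as helix and once as strand.
chains = [
    ("QEALEIAG", "HHHHHHHC"),
    ("KQEALEIA", "CEEEEEEE"),
]
print(find_chameleons(chains, 7))
```

Note that every chameleon of length k contains shorter chameleons, so counts for different k are not independent.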
27Prediction of the Secondary Structure of
Chameleon sequences with Neural Networks
28The Prediction of Chameleons with Neural Networks
29Other neural network-based predictors
- Secondary structure
- Topology of transmembrane proteins
- Cysteine bonding state
- Contact maps of proteins
- Interaction sites on protein surface
30Prediction of the cysteine bonding state
Tryparedoxin-I from Crithidia fasciculata (1QK8)
MSGLDKYLPGIEKLRRGDGEVEVKSLAGKLVFFYFSASWCPPCRGFTPQL
IEFYDKFHES KNFEVVFCTWDEEEDGFAGYFAKMPWLAVPFAQSEAVQK
LSKHFNVESIPTLIGVDADSG DVVTTRARATLVKDPEGEQFPWKDAP
Free cysteines
Cys68
Disulphide bonded cysteines
Cys40
Cys43
31A neural network-based method for predicting the
disulfide connectivity in proteins
32The Protein Folding
T T C C P S I V A R S N F N V C R L P G T P E A L
C A T Y T G C I I I P G A T C P G D Y A N
33The Protein Folding
RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAED
CMRTCGGA
34Disulfide bonds
2 -SH -> -S-S- + 2 H+ + 2 e-
S-S distance ≈ 2.2 Å
Torsion angle C-S-S-C ≈ 90°
Bond energy ≈ 3 kcal/mol
35Intra-chain disulfide bonds in proteins
Of 1259 proteins (a non redundant PDB subset)
36Intra-chain disulfide bonds in proteins
Distribution of disulfide bonds in the SCOP
domains
- 99% of the disulfide bonds are intra-domain
37Problem no 1
Starting from the protein sequence can we
discriminate whether a cysteine residue is
disulfide-bonded?
Prediction of the disulfide-bonding state of
cysteines in proteins
38Perceptron (input sequence profile)
bonded
Non bonded
NGDQLGIKSKQEALCIAARRNLDLVLVAP
39Plotting the trained weights
40Is it possible to add a syntax?
Free states
Bonded states
41A path
42A path
P(seq) = P(1 | Begin) × P(C40 | 1) × ...
43A path
P(seq) = P(1 | Begin) × P(C40 | 1) × ...
× P(2 | 1) × P(C43 | 2) × ...
44A path
P(seq) = P(1 | Begin) × P(C40 | 1) × ...
× P(2 | 1) × P(C43 | 2) × ...
× P(4 | 2) × P(C68 | 4) × ...
45A path
Begin
1
2
3
4
P(seq) = P(1 | Begin) × P(C40 | 1) × ...
× P(2 | 1) × P(C43 | 2) × ...
× P(4 | 2) × P(C68 | 4) × ...
× P(End | 4)
End
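The probability of a sequence along one path is the product of the transition and emission terms written above. A minimal sketch follows; every probability value in it is a hypothetical placeholder (the slides give no numbers), and the state names simply reuse the labels 1, 2, 4 of the example path.

```python
import math

def path_probability(states, symbols, transition, emission):
    """Multiply transition and emission probabilities along Begin -> states -> End."""
    if len(states) != len(symbols):
        raise ValueError("one symbol must be emitted per state")
    log_p = math.log(transition[("Begin", states[0])])
    for k, (state, symbol) in enumerate(zip(states, symbols)):
        log_p += math.log(emission[(state, symbol)])
        nxt = states[k + 1] if k + 1 < len(states) else "End"
        log_p += math.log(transition[(state, nxt)])
    return math.exp(log_p)      # log-space sum avoids underflow on long paths

# Hypothetical values for the path Begin -> 1 -> 2 -> 4 -> End.
transition = {("Begin", "1"): 0.5, ("1", "2"): 1.0, ("2", "4"): 0.4, ("4", "End"): 1.0}
emission = {("1", "C40"): 0.2, ("2", "C43"): 0.3, ("4", "C68"): 0.5}
p = path_probability(["1", "2", "4"], ["C40", "C43", "C68"], transition, emission)
print(p)
```

Comparing this quantity over all allowed paths, and keeping the largest, is what selects the most probable assignment of bonding states.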
464 possible paths
48Prediction for Tryparedoxin
49Prediction for Tryparedoxin
50Prediction for Tryparedoxin
51Performance
Neural Network
Hybrid system
B = cysteine bonding state, F = cysteine free state.
WD = whole database (969 proteins, 4136 cysteines). RD = reduced database, in which the chains containing only one cysteine are removed (782 proteins, 3949 cysteines).
- Martelli PL, Fariselli P, Malaguti L, Casadio R. Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng. 15:951-953 (2002)
52Problem no 2
When the bonding state of cysteines is known can
we predict the connectivity pattern of disulfide
bonds?
Prediction of the connectivity of disulfide bonds
in proteins
53Prediction of disulfide connectivity in proteins
Bovine trypsin Inhibitor 6PTI
54Prediction of disulfide connectivity in proteins
as a problem of maximum-weight perfect matching
Representation
Protein sequence
The undirected weighted graph has V = 2B vertices (number of cysteines) and E = 2B(2B-1)/2 undirected edges (weighted by the strength of the interaction, W)
55From the Graph Theory
- It is not necessary to compute all the possible connectivity patterns, whose number is Π_{i ≤ B} (2i-1) = (2B-1)!!
- Given a complete graph G(2B, E), the matching with the maximum weight can be computed in O(B^3) time with the Edmonds-Gabow algorithm
Gabow, H.N. (1975). Technical Report CU-CS-075-75, Dept. of Comp. Sci., Colorado University
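To make the matching formulation concrete, the sketch below enumerates every perfect matching and keeps the heaviest one. This is plain exhaustive enumeration, not the Edmonds-Gabow algorithm; it is only practical for a handful of cysteines, which is exactly why the polynomial-time algorithm matters. The cysteine positions and edge weights are hypothetical.

```python
def perfect_matchings(vertices):
    """Yield every perfect matching of an even-sized list as a list of pairs."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching

def best_connectivity(cysteines, weight):
    """Return the matching with maximum total weight; weight is keyed by frozenset pairs."""
    return max(
        perfect_matchings(list(cysteines)),
        key=lambda m: sum(weight[frozenset(pair)] for pair in m),
    )

# Hypothetical edge weights for four bonded cysteines (B = 2).
w = {
    frozenset((5, 14)): 0.1, frozenset((30, 38)): 0.2,
    frozenset((5, 30)): 0.9, frozenset((14, 38)): 0.8,
    frozenset((5, 38)): 0.3, frozenset((14, 30)): 0.4,
}
print(best_connectivity([5, 14, 30, 38], w))
print(len(list(perfect_matchings(list(range(8))))))  # number of patterns for B = 4
```

The count of enumerated patterns grows as (2B-1)!!: 3 for B = 2, 15 for B = 3, 105 for B = 4.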
56How to assign the costs (W) of the edges in the
graph
57Assumption for each cysteine all its sequence
nearest neighbours make contacts
58Frequency distribution of disulfide bonds with
respect to sequence separation (726 proteins)
59Neural Networks for predicting the edge values
60Accuracy (Qp) of EG vs NN
61- The state of the art
- Prediction of bonding states is quite satisfactory
- Prediction of connectivity needs to be improved
62Prediction of Foldons
Piero Fariselli
63The Folding Problem as a Mapping Problem
Covalent structure TTCCPSIVARSNFNVCRLPGTPEAIC
ATYTGCIIIPGATCPGDYAN
64- We can collect from the PDB database some 1500 chains of known structure, from which to derive non-redundant information relating sequence to
- secondary structure
- structural and functional motifs
- 3D structure
65- Evolutionary information
- Multiple Sequence Alignment (MSA) of similar sequences
- Sequence profile: for each position, a 20-valued vector contains the amino acid composition of the aligned sequences.
MSA (rows: aligned sequences, columns: sequence position)
 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K
Sequence profile (rows: residue types, columns: sequence position, values in %)
A    0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D    0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E    0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F    0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G   10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H    0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K    0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I    0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L    0   0   0   0   0   0   0  30   0   0   0   0   0  60   0   0
M    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
N    0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q    0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R    0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S    0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T   20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V    0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y   70  10   0  90   0   0   0   0   0   0   0   0   0   0   0   0
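A sequence profile can be derived from the alignment column by column. The sketch below uses the ten aligned sequences from this slide; normalising over non-gap residues only is an assumption, inferred from column values such as 33 and 100 in the slide's profile.

```python
from collections import Counter

def sequence_profile(alignment, gap="-"):
    """Return one {residue: percent} dict per column, computed over non-gap residues."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        total = sum(counts.values())
        profile.append(
            {res: 100.0 * n / total for res, n in counts.items()} if total else {}
        )
    return profile

msa = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]
profile = sequence_profile(msa)
print({res: round(pct) for res, pct in profile[0].items()})
```

Each column dict can be expanded into a 20-valued vector (zero for absent residues) to replace the one-hot input of a network.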
66Prediction of Initiation Sites of Protein Folding
The Folding Process
67Frustration in proteins
- The simultaneous minimisation of all the
interaction energies is impossible
68The network architecture
..ALS.......QGFLLIARQPPFTYFTV......HW..
69The prediction efficiency of the network
Q2 = 0.85, Q(H) = 0.67, Q(nonH) = 0.93, SovHpred = 0.85
C = 0.63, Pc(H) = 0.80, Pc(nonH) = 0.86, SovHobs = 0.76
70Theoretical background
The conformation of residue R depends both on
local (window W) and non local (context C)
interactions.
The convergence theorem ensures that O_i = Probability(Structure_R = i | W)
If, for some i, O_i ≈ 1, then the structure of residue R depends mainly on W and only slightly on C
71- Averaging over all the contexts (performed by
NN)
- When the pattern is self-stabilising (W
dependent)
- Then Anfinsen's hypothesis can be cast in a local form
72Relationship between the reliability index and
the Shannon entropy
73INPUT
MAS..... QLMLKDFLNRTPL.........GHI
O_α
O_non-α
S = -Σ_i O_i log O_i
74 Protein segments correctly predicted in α-helical structure
Entropy: Shannon entropy in (ln 2)/10 units (S = -Σ_i o_i ln(o_i))
NC: number of protein segments correctly predicted in α-helix
NT: total number of protein segments predicted in α-helix
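The entropy of the network outputs can be computed directly from the formula above. This is a minimal sketch: rescaling the outputs so that they sum to 1 is an assumption, and the conversion helper simply divides by (ln 2)/10 to express the value in the units quoted on the slide.

```python
import math

def shannon_entropy(outputs):
    """S = -sum_i o_i ln(o_i), after rescaling the outputs to sum to 1 (assumed step)."""
    total = sum(outputs)
    probs = [o / total for o in outputs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def in_slide_units(entropy):
    """Express an entropy given in nats in units of (ln 2)/10."""
    return entropy / (math.log(2) / 10)

# Two-output network (helix, non-helix): a confident and an undecided prediction.
print(in_slide_units(shannon_entropy([0.95, 0.05])))
print(in_slide_units(shannon_entropy([0.5, 0.5])))
```

For two outputs the entropy ranges from 0 (fully confident) to ln 2 (undecided), i.e. from 0 to 10 in the slide's units, so low values mark the most reliable predictions.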
75Profile of the smoothed entropy (S5) for the
hen egg lysozyme (132L)
S5
Protein chain
76Hen egg lysozyme (132L)
N-terminus
C-terminus
77Frequency distribution of predicted helical
segments as a function of their entropy
value
Threshold value
78An example of the data base of minimally
frustrated protein fragments
http://www.biocomp.unibo.it/DB/
79Training set from PDB
Database of minimally frustrated α-helical segments
80Comparison of minimally frustrated segments
with putative folding initiation sites
experimentally determined
Not yet experimentally detected
81Comparison of minimally frustrated segments with
peptides extracted from proteins
Muñoz and Serrano, 1994.
82Minimally frustrated α-helical segments are useful for determining
- de-novo design of α-helices
83Structure prediction of membrane proteins
85Outer Membrane proteins (all β-Transmembrane proteins)
Inner Membrane proteins (all α-Transmembrane proteins)
86Outer Membrane
Inner Membrane
β-barrel
α-helices
Bilayer
Bacteriorhodopsin (Halobacterium salinarum)
Porin (Rhodobacter capsulatus)
87Predictors of the Topology of Membrane Proteins
88Prediction of transmembrane segments
89Neural Network for the prediction of TMS in β-barrel membrane proteins (Jacoboni et al., 2001)
Input: sequence profile, window of 9 residues
5 hidden neurons
2 output neurons: TM, nonTM
90A generic model for membrane proteins (TMHMM)
91Sequence-profile-based HMM
 0 85  0  0  5  0  0  0  0  2  0  8  0  0  0  0  0  0  0  0
 0  0  0  0  4  0 13  0  4  0  5  0  6  0  0 23  0  1 44  0
 0  0 22  0 23  0  0  5  0 23  0  3  0 11  0  0  2  0 11  0
 0 34  0  0  0 24  0  0  0  0  0  2  0 22  0 18  0  0  0  0
 8  0  0  0  0  0  0  0  0  0  0  0 92  0  0  0  0  0  0  0
90  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 77  0 23
 3  0  2  7  4  0  8  6  1  3  6  5  5 12  5  6 17  2  2  6
For proteins A = 20
Constraints
0 ≤ s_t(n) ≤ S for all t, n; S = 100
Martelli et al., Bioinformatics 18, S46-53, 2002
92- The new algorithms make it possible
- to feed HMMs with sequence profiles
- to eventually couple NNs and HMMs (Hidden Neural Networks)
- Advantages
- Higher performance than standard HMMs
- Increased discrimination capability of a given class
Martelli et al., Bioinformatics, 2002; Martelli et al., Protein Eng., 2002
93Prediction of the Topology of α-Transmembrane Proteins
The prediction accuracy of topography is 92%
The prediction accuracy of topology is 81%
94Prediction of the Topology of β-Transmembrane Proteins
The prediction accuracy of topography is 73%
The prediction accuracy of topology is 73%
LPS (Out)
Periplasmic (In)
Topology
position of N and C termini with respect to the
bilayer
95The discriminative capability of the HMM model
I(s | M) = -(1/L) log P(s | M)
96An application: modeling the 3D structure of eukaryotic β-barrel proteins
973D structure prediction of proteins
New folds
Existing folds
Membrane proteins
Building by homology
Ab initio prediction
Threading / fold recognition
Axis: homology (%), from 0 to 100
98Structural alignment of VDAC with the template
99A low resolution 3D model of VDAC (the sequence from Neurospora crassa)
Casa
100A low resolution 3D model of VDAC location of
mutated residues
Casadio et al., FEBS Lett 520:1-7 (2002)
101Predictors of membrane protein structures can be used to filter genomes and find new membrane proteins without sequence homologues
102FISHING NEW OUTER MEMBRANE PROTEINS IN
GRAM-NEGATIVE BACTERIA
103Proteins have intrinsic signals that govern their transport and localization in the cell: a hydrophobic secretion marker (or signal peptide)
Signal peptides in protein sequences
MRAKLLGIVLTTPIAISSFASTETLSFTPDNINADISLGTLSGKTKERVY
LAEEGGRKVSQLDWK FNNAAIIKGAINWDLMPQISIGAAGWTTLGSRGG
NMVDQDWMDSSNPGTWTDESRHPDTQL NYANEFDLNIKGWLLNEPNYRL
GLMAGYQESRYSFTARGGSYIYSSEEGFRDDIGSFPNGER AIGYKQRFK
MPYIGLTGSYRYEDFELGGTFKYSGWVESSDNDEHYDPGKRITYRSKVKD
QNY YSVAVNAGYYVTPNAKVYVEGAWNRVTNKKGNTSLYDHNNNTSDYS
KNGAGIENYNFITTAG LKYTF
Sequences of outer membrane proteins have signal peptides: the secretion marker is also a marker of outer membrane proteins
104Signal Peptide prediction
Signal peptide, mature protein
Cleavage site
105Two neural networks
SignalNet: predicts whether a given residue position belongs to the signal peptide
CleavageNet: predicts whether a given residue position is the cleavage site
106SignalNet Accuracy
107CleavageNet Accuracy
109Performance of SignalNN on 2160 annotated proteins
Q2 = 96%, Qsignal = 82%, Qnon-signal = 97%
Psignal = 78%, Pnon-signal = 98%
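The indices on this slide are not defined in the transcript. The sketch below assumes the usual two-class definitions: Q2 is the overall fraction of correct predictions, Q(class) is the fraction of observed members of a class that are correctly predicted, and P(class) is the fraction of predictions for a class that are correct. The labels "S" and "N" and the toy data are hypothetical.

```python
def two_state_scores(observed, predicted, positive="S"):
    """Return Q2 and per-class Q and P indices under the assumed definitions."""
    pairs = list(zip(observed, predicted))
    tp = sum(o == positive and p == positive for o, p in pairs)
    tn = sum(o != positive and p != positive for o, p in pairs)
    fp = sum(o != positive and p == positive for o, p in pairs)
    fn = sum(o == positive and p != positive for o, p in pairs)
    # Each class is assumed to be both observed and predicted at least once.
    return {
        "Q2": (tp + tn) / len(pairs),
        "Q_pos": tp / (tp + fn),     # coverage of the positive class
        "Q_neg": tn / (tn + fp),
        "P_pos": tp / (tp + fp),     # reliability of positive predictions
        "P_neg": tn / (tn + fn),
    }

# Toy example: S = signal peptide, N = non-signal.
scores = two_state_scores("SSSSNNNNNN", "SSSNNNNNNS")
print(scores)
```

Reporting both Q and P per class matters when classes are unbalanced, as with signal peptides, because a high Q2 can hide weak performance on the minority class.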
110Predictors of Membrane Topography Rate of false positives
The predictors are tested on 809 globular proteins with sequence identity ≤ 25%
0.5% have at least 1 α-TM helix predicted
5.6% have at least 2 β-TM strands predicted
111PROTEOME
HUNTER
Signal peptide
Yes
No
All-α TM
All-α TM
No
All-β TM
112Predicting globular, inner and outer membrane
proteins in genomes of Gram-negative bacteria
with Hunter
the number of new proteins predicted in the
class with Hunter, out of the non-annotated
region