Computational Physics Applied to Macromolecules

Author: biocomp. Last modified by: Piero. Created: 5/25/2003 2:13:37 PM. Presentation format: PowerPoint (PPT).

Slides: 114

Provided by: bioc161

1
Prediction of structural and functional features
in proteins starting from the residue
sequence INTRODUCTION TO NEURAL NETWORKS
2
MAPPING PROBLEMS Secondary structure
Covalent structure TTCCPSIVARSNFNVCRLPGTPEAIC
ATYTGCIIIPGATCPGDYAN
3
MAPPING PROBLEMS Topology of transmembrane
proteins
Topography
position of Trans Membrane Segments along the
sequence
ALALMLCMLTYRHKELKLKLKK ALALMLCMLTYRHKELKLKLKK
ALALMLCMLTYRHKELKLKLKK
4
First generation methods Single residue statistics

Propensity scales
For each residue
The association between each residue and the
different features is statistically evaluated
Physical and chemical features of residues
A propensity value for any structure can be
associated to any residue
HOW?

5
Secondary structure Chou-Fasman propensity scale
Given a set of known structures, we can count how
many times a residue is associated with a
structure. Example:
ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh
N(A,h) = 7, N(A,c) = 1, N = 40
P(A,h) = 7/40, P(A,c) = 1/40
Is that enough for estimating a propensity?
6
Secondary structure Chou-Fasman propensity scale
Given a set of known structures, we can count how
many times a residue is associated with a
structure. Example:
ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh
N(A,h) = 7, N(A,c) = 1, N = 40
P(A,h) = 7/40, P(A,c) = 1/40
We need to estimate how independent the
residue-to-structure association is.
P(h) = 27/40, P(c) = 13/40
7
Secondary structure Chou-Fasman propensity scale
Given a set of known structures, we can count how
many times a residue is associated with a
structure. Example:
ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh
N(A,h) = 7, N(A,c) = 1, N = 40
P(A,h) = 7/40, P(A,c) = 1/40
P(h) = 27/40, P(c) = 13/40
If the structure is independent of the residue:
P(A,h) = P(A) P(h)
The ratio P(A,h) / (P(A) P(h)) is the propensity
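The counting on this slide can be reproduced in a few lines. The following is a minimal sketch, not the original Chou-Fasman implementation: the function name `propensity` is hypothetical, and the only data used are the 40-residue example sequence and its h/c labels from the slide.

```python
# Example from the slide: 40 residues labelled helix (h) or coil (c).
seq = "ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL"
sst = "h" * 8 + "c" * 13 + "h" * 19


def propensity(seq, sst, residue, structure):
    """Return P(residue, structure) / (P(residue) * P(structure))."""
    n = len(seq)
    # Joint frequency: positions where both the residue and the structure match.
    joint = sum(1 for r, s in zip(seq, sst) if r == residue and s == structure) / n
    p_res = seq.count(residue) / n
    p_str = sst.count(structure) / n
    return joint / (p_res * p_str)


prop_A_h = propensity(seq, sst, "A", "h")  # (7/40) / ((8/40) * (27/40))
prop_A_c = propensity(seq, sst, "A", "c")  # (1/40) / ((8/40) * (13/40))
print(round(prop_A_h, 3), round(prop_A_c, 3))
```

A value above 1 means the residue is found in that structure more often than independence would predict; a value below 1 means less often. With only 40 residues the estimate is of course not reliable, which is why a large set of examples is needed.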
8
Secondary structure Chou-Fasman propensity scale
Given a LARGE set of examples, a propensity value
can be computed for each residue and each
structure type
9
Secondary structure Chou-Fasman propensity scale
Given a new sequence, a secondary structure
prediction can be obtained by plotting the
propensity values for each structure, residue by
residue. Considering three secondary
structures (H, E, C), the overall accuracy, as
evaluated on an uncorrelated set of sequences
with known structure, is very low: Q3 = 50-60%
10
Secondary structure Chou-Fasman propensity scale
http://www.expasy.ch/cgi-bin/protscale.pl
11
Transmembrane alpha-helices Kyte-Doolittle scale
It is computed taking into consideration the
octanol-water partition coefficient, combined
with the propensity of the residues to be found
in known transmembrane helices
Ala  1.800   Arg -4.500   Asn -3.500   Asp -3.500
Cys  2.500   Gln -3.500   Glu -3.500   Gly -0.400
His -3.200   Ile  4.500   Leu  3.800   Lys -3.900
Met  1.900   Phe  2.800   Pro -1.600   Ser -0.800
Thr -0.700   Trp -0.900   Tyr -1.300   Val  4.200
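A hydropathy scale like this is usually applied as a sliding-window average along the sequence. The sketch below assumes that usage; the function name `hydropathy_profile` and the window length of 7 are illustrative choices that are not taken from the slides, and the test sequence is the one shown on the transmembrane topology slide.

```python
# Kyte-Doolittle values from the slide, keyed by one-letter code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Average hydropathy over each full window; one value per window start."""
    return [
        sum(KD[a] for a in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


profile = hydropathy_profile("ALALMLCMLTYRHKELKLKLKK", window=7)
# High averages mark hydrophobic stretches, low averages mark polar ones.
print([round(v, 2) for v in profile])
```

Segments whose average stays high over a long enough window are candidate transmembrane helices; the threshold to use is a separate choice not covered here.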
12
Second generation methods GOR
The structure of a residue in a protein strongly
depends on the sequence context. It is possible
to estimate the influence of a residue in
determining the structure of a residue close to it
along the sequence. Usually windows from -8/+8 to
-13/+13 are considered. Coefficients P(A,s,i)
estimate the contribution of residue A in
determining the structure s for a residue that is
i positions away along the sequence
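The window idea can be sketched as an additive score: for each position and each structure, sum the coefficients of all residues in the window and pick the structure with the highest total. This is only an illustration of the scheme described above, not the published GOR parameters: the function `gor_predict` and every coefficient value in `coef` are made up for the example.

```python
def gor_predict(seq, coef, states="HEC", half_window=8):
    """Predict one state per residue by summing window coefficients.

    coef maps (residue, state, offset) to a contribution; missing
    entries count as 0. Offsets run from -half_window to +half_window.
    """
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for s in states:
            total = 0.0
            for i in range(-half_window, half_window + 1):
                k = j + i
                if 0 <= k < len(seq):  # ignore window positions off the ends
                    total += coef.get((seq[k], s, i), 0.0)
            scores[s] = total
        prediction.append(max(states, key=lambda s: scores[s]))
    return "".join(prediction)


# Hypothetical coefficients, for illustration only.
coef = {
    ("A", "H", 0): 1.0, ("A", "H", 1): 0.5, ("A", "H", -1): 0.5,
    ("P", "C", 0): 1.0,
    ("V", "E", 0): 1.0,
}
predicted = gor_predict("AAPV", coef, half_window=1)
print(predicted)
```

Because the score is a plain sum, each window position contributes independently of the others, which is exactly the limitation that motivates the move to neural networks.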
13
Secondary structure: GOR method
Q3 = 65% (considering three secondary
structures (H, E, C), and evaluating the overall
accuracy on an uncorrelated set of sequences with
known structure). The contribution of each
position in the window is independent of the
others. No correlation among the positions in
the window is taken into account.
14
A more efficient method Neural Networks
An alternative computing algorithm, by analogy with
computation in the nervous system.
1) The nervous system is made up of elementary
computing units: neurons
2) The electric signal flows in a determined
direction (dendrites -> axon)
(Principle of dynamic polarization)
3) There is no cytoplasmic continuity among the neurons.
Each neuron communicates specifically with some
neighboring neurons by means of synapses
(Principle of connective specificity)
15
Tools out of machine learning approaches
Neural Networks can learn the mapping from
sequence to secondary structure
Training
Data Base Subset
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
General rules
EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE
Known mapping
16
Neural network for secondary structure prediction
Output
Input
Sequence: M P I L K Q K P I H Y H P N H G E A K G
Input window (9 residues): K P I H Y H P N H

   K P I H Y H P N H
A  0 0 0 0 0 0 0 0 0
C  0 0 0 0 0 0 0 0 0
D  0 0 0 0 0 0 0 0 0
E  0 0 0 0 0 0 0 0 0
F  0 0 0 0 0 0 0 0 0
G  0 0 0 0 0 0 0 0 0
H  0 0 0 1 0 1 0 0 1
I  0 0 1 0 0 0 0 0 0
K  1 0 0 0 0 0 0 0 0
L  0 0 0 0 0 0 0 0 0
M  0 0 0 0 0 0 0 0 0
N  0 0 0 0 0 0 0 1 0
P  0 1 0 0 0 0 1 0 0
Q  0 0 0 0 0 0 0 0 0
R  0 0 0 0 0 0 0 0 0
S  0 0 0 0 0 0 0 0 0
T  0 0 0 0 0 0 0 0 0
V  0 0 0 0 0 0 0 0 0
W  0 0 0 0 0 0 0 0 0
Y  0 0 0 0 1 0 0 0 0

Usually: input window of 17-23 residues; 4-15 hidden neurons
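The input coding on this slide gives each window position 20 binary values, with a single 1 marking the residue type. A minimal sketch follows, assuming the alphabetical one-letter row order used on the slide; the function name `one_hot_window` and the zero padding at the chain ends are assumptions, not something the slides specify.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # row order used on the slide


def one_hot_window(seq, center, window=9):
    """Encode the window centred on `center` as a flat 0/1 vector.

    Each window position contributes 20 values. Positions that fall
    outside the sequence are left as all zeros (an assumption).
    """
    half = window // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        column = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(seq):
            column[AMINO_ACIDS.index(seq[pos])] = 1
        vector.extend(column)
    return vector


seq = "MPILKQKPIHYHPNHGEAKG"
x = one_hot_window(seq, center=10, window=9)  # window KPIHYHPNH, centred on Y
print(len(x), sum(x))
```

With a 17-residue window the same coding yields 340 input values, so the number of weights grows quickly with the window size.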
17
(No Transcript)
18
Third generation methods evolutionary information
19
The Network Architecture for Secondary Structure
Prediction
The First Network (Sequence to Structure)
20
The Network Architecture for Secondary Structure
Prediction
The Second Network (Structure to Structure)
21
The Performance on the Task of Secondary
Structure Prediction
22
Combining different networks: Q3 = 76-78%
23
Secondary Structure Prediction
From sequence
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
To secondary structure
And to the reliability of the prediction
7997688899999988776886778999887679956889999999
24
SERVERS
PredictProtein: Burkhard Rost (Columbia Univ.) http://cubic.bioc.columbia.edu/predictprotein/
PsiPRED: David Jones (UCL) http://bioinf.cs.ucl.ac.uk/psipred/
JPred: Geoff Barton (Dundee Univ.)
SecPRED: http://www.biocomp.unibo.it
25
Chameleon sequences
QEALEIA
26
We extract
from a set of 822 non-homologous
proteins (174,192 residues):
2,452 5-mer chameleons, 107 6-mer chameleons,
16 7-mer chameleons, 1 8-mer chameleon
The total number of residues in chameleons is
26,044 out of 755 protein chains (15%)
27
Prediction of the Secondary Structure of
Chameleon sequences with Neural Networks
28
The Prediction of Chameleons with Neural Networks
29
Other neural network-based predictors

Secondary structure
Topology of transmembrane proteins
Cysteine bonding state
Contact maps of proteins
Interaction sites on protein surface

30
Prediction of the cysteine bonding state
Tryparedoxin-I from Crithidia fasciculata (1QK8)
MSGLDKYLPGIEKLRRGDGEVEVKSLAGKLVFFYFSASWCPPCRGFTPQL
IEFYDKFHES KNFEVVFCTWDEEEDGFAGYFAKMPWLAVPFAQSEAVQK
LSKHFNVESIPTLIGVDADSG DVVTTRARATLVKDPEGEQFPWKDAP
Free cysteines
Cys68
Disulphide bonded cysteines
Cys40
Cys43
31
A neural network-based method for predicting the
disulfide connectivity in proteins
32
The Protein Folding
T T C C P S I V A R S N F N V C R L P G T P E A L
C A T Y T G C I I I P G A T C P G D Y A N
33
The Protein Folding
RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAED
CMRTCGGA
34
Disulfide bonds
2 -SH -> -S-S- + 2 H+ + 2 e-
S-S distance ≈ 2.2 Å
Torsion angle C-S-S-C ≈ 90°
Bond energy ≈ 3 kcal/mol
35
Intra-chain disulfide bonds in proteins
Of 1259 proteins (a non redundant PDB subset)
36
Intra-chain disulfide bonds in proteins
Distribution of disulfide bonds in the SCOP
domains

99% of the disulfide bonds are intra-domain

37
Problem no 1
Starting from the protein sequence can we
discriminate whether a cysteine residue is
disulfide-bonded?
Prediction of the disulfide-bonding state of
cysteines in proteins
38
Perceptron (input sequence profile)
bonded
Non bonded
NGDQLGIKSKQEALCIAARRNLDLVLVAP
39
Plotting the trained weights
40
Is it possible to add a syntax?
Free states
Bonded states
41
A path
42
A path
P(seq) = P(1 | Begin) × P(C40 | 1) × ...
43
A path
P(seq) = P(1 | Begin) × P(C40 | 1) × ...
× P(2 | 1) × P(C43 | 2) × ...
44
A path
P(seq) = P(1 | Begin) × P(C40 | 1) × ...
× P(2 | 1) × P(C43 | 2) × ...
× P(4 | 2) × P(C68 | 4) × ...
45
A path
Begin
1
2
3
4
P(seq) = P(1 | Begin) × P(C40 | 1) × ...
× P(2 | 1) × P(C43 | 2) × ...
× P(4 | 2) × P(C68 | 4) × ...
× P(End | 4)
End
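The probability of one path is the product of the transition terms between consecutive states and the terms for the cysteines assigned to each state. The sketch below multiplies only the terms written out on the slide for the path Begin, 1, 2, 4, End. All probability values are hypothetical placeholders, the function name `path_probability` is made up, and the factors hidden behind the "..." on the slide are deliberately left out.

```python
def path_probability(path, cysteines, transitions, emissions):
    """Multiply transition and emission terms along one path.

    path        list of visited states, without Begin and End
    cysteines   cysteine labels, one per visited state
    transitions dict mapping (previous_state, state) to a probability
    emissions   dict mapping (cysteine, state) to a probability
    """
    states = ["Begin"] + list(path) + ["End"]
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= transitions[(prev, cur)]
    for state, cys in zip(path, cysteines):
        p *= emissions[(cys, state)]
    return p


# Hypothetical numbers, chosen only to make the example concrete.
transitions = {("Begin", 1): 0.5, (1, 2): 1.0, (2, 4): 0.5, (4, "End"): 1.0}
emissions = {("C40", 1): 0.9, ("C43", 2): 0.8, ("C68", 4): 0.7}

p_path = path_probability([1, 2, 4], ["C40", "C43", "C68"], transitions, emissions)
print(p_path)
```

Repeating the calculation for every allowed path and keeping the most probable one gives the predicted bonding state of each cysteine.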
46
4 possible paths
47
(No Transcript)
48
Prediction for Tryparedoxin
49
Prediction for Tryparedoxin
50
Prediction for Tryparedoxin
51
Performance
Neural Network
Hybrid system
B = cysteine bonded state, F = cysteine free state.
WD = whole database (969 proteins, 4136
cysteines). RD = reduced database, in which the
chains containing only one cysteine are removed
(782 proteins, 3949 cysteines).

Martelli PL, Fariselli P, Malaguti L, Casadio R.
Prediction of the disulfide bonding state of
cysteines in proteins with hidden neural
networks. Protein Eng. 15:951-953 (2002)

52
Problem no 2
When the bonding state of cysteines is known can
we predict the connectivity pattern of disulfide
bonds?
Prediction of the connectivity of disulfide bonds
in proteins
53
Prediction of disulfide connectivity in proteins
Bovine trypsin Inhibitor 6PTI
54
Prediction of disulfide connectivity in proteins
as a problem of maximum-weight perfect matching
Representation
Protein sequence
An undirected weighted graph with V = 2B vertices
(the number of cysteines) and E = 2B(2B-1)/2 undirected
edges (W = strength of the interaction)
55
From the Graph Theory

It is not necessary to compute all the possible
connectivity patterns (∏_{i≤B} (2i-1))

Given a complete graph G(2B, E),
the matching with the maximum weight can be
computed in O(B^3) time
with the Edmonds-Gabow algorithm

Gabow, H.N. (1975). Technical
Report CU-CS-075-75, Dept. of Comp. Sci., Colorado
University
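To see why an efficient algorithm matters, the sketch below does the opposite: it enumerates every connectivity pattern and keeps the heaviest one. This brute-force search is a stand-in for illustration, not the Edmonds-Gabow algorithm, and it is only practical for a handful of cysteines. The weight matrix `W` is hypothetical and the function names are made up.

```python
def perfect_matchings(vertices):
    """Yield every way of pairing up an even-sized list of vertices."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching


def best_matching(weights):
    """Return the perfect matching with the largest total edge weight."""
    vertices = list(range(len(weights)))
    return max(
        perfect_matchings(vertices),
        key=lambda m: sum(weights[i][j] for i, j in m),
    )


# Number of patterns for B = 3 bonds (6 cysteines): 1 * 3 * 5 = 15.
n_patterns = sum(1 for _ in perfect_matchings(list(range(6))))

# Hypothetical symmetric edge weights for 4 bonded cysteines (B = 2).
W = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.3, 0.2],
    [0.2, 0.3, 0.0, 0.8],
    [0.1, 0.2, 0.8, 0.0],
]
best = best_matching(W)
print(n_patterns, best)
```

The number of patterns grows as 1 × 3 × 5 × ... × (2B-1), so already at B = 10 there are more than 650 million of them, while a polynomial-time matching algorithm stays fast.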
56
How to assign the costs (W) of the edges in the
graph
57
Assumption: for each cysteine, all its nearest
neighbours in sequence make contacts
58
Frequency distribution of disulfide bonds with
respect to sequence separation (726 proteins)
59
Neural Networks for predicting the edge values
60
Accuracy (Qp) of EG vs NN
61

The state of the art
Prediction of bonding states is quite
satisfactory
Prediction of connectivity needs to be improved

62
Prediction of Foldons
Piero Fariselli
63
The Folding Problem as a Mapping Problem
Covalent structure TTCCPSIVARSNFNVCRLPGTPEAIC
ATYTGCIIIPGATCPGDYAN
64

We can collect from the PDB data base some 1500
chains of known structures from which to derive
non redundant information relating sequence to
secondary structure
structural and functional motifs
3D structure

Evolutionary information
Multiple Sequence Alignment (MSA) of similar
sequences
Sequence profile: for each position, a 20-valued
vector contains the amino acid composition of
the aligned sequences.

MSA (rows: sequences; columns: sequence position)

 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K

Sequence profile (rows: residue types; columns: sequence position)

    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
A   0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D   0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E   0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F   0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G  10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H   0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K   0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I   0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M   0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N   0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q   0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R   0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S   0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T  20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V   0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W   0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y  70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
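A profile of this kind can be computed column by column from the alignment. The sketch below is one plausible way to do it, assuming that gaps are ignored and that frequencies are expressed as percentages of the non-gap residues in each column; the function name `sequence_profile` is hypothetical. Under these assumptions it reproduces, for example, the first column of the slide (Y 70, T 20, G 10) and the value 33 in column 6.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(msa):
    """Return one dict per alignment column: residue -> percentage.

    Gap characters ('-') are skipped, so each non-empty column sums to 100.
    """
    profile = []
    for pos in range(len(msa[0])):
        column = [s[pos] for s in msa if s[pos] != "-"]
        if column:
            freqs = {a: 100.0 * column.count(a) / len(column) for a in AMINO_ACIDS}
        else:
            freqs = {a: 0.0 for a in AMINO_ACIDS}
        profile.append(freqs)
    return profile


# The ten aligned sequences from the slide.
msa = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]
profile = sequence_profile(msa)
print({a: v for a, v in profile[0].items() if v > 0})
```

Each column of the profile is the 20-valued vector that replaces the single 1 of the one-residue input coding, so the network sees which substitutions evolution tolerates at that position.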
66
Prediction of Initiation Sites of Protein Folding
The Folding Process
67
Frustration in proteins

The simultaneous minimisation of all the
interaction energies is impossible

68
The network architecture
..ALS.......QGFLLIARQPPFTYFTV......HW..
69
The prediction efficiency of the network
Q2 = 0.85   Q(H) = 0.67   Q(nonH) = 0.93   SovHpred = 0.85
C = 0.63   Pc(H) = 0.80   Pc(nonH) = 0.86   SovHobs = 0.76
70
Theoretical background
The conformation of residue R depends both on
local (window W) and non-local (context C)
interactions.
The convergence theorem ensures that
O_i = Probability(Structure_R = i | W)
If, for any i, O_i ≈ 1, then the structure of
residue R depends mainly on W and only slightly
on C
71

Anfinsen's hypothesis

Averaging over all the contexts (performed by
NN)

When the pattern is self-stabilising (W
dependent)

Then Anfinsen's hypothesis can be cast in a
local form

72
Relationship between the reliability index and
the Shannon entropy
73
INPUT
MAS..... QLMLKDFLNRTPL.........GHI
O_α
O_non-α
S = -Σ_i O_i log O_i
74
Protein segments correctly predicted in
α-helical structure
Entropy: Shannon entropy in (ln 2)/10 units
(S = -Σ_i o_i ln(o_i))
NC: number of protein segments correctly predicted in α-helix
NT: total number of protein segments predicted in
α-helix
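The entropy of the network outputs can be computed directly from the definition. The sketch below assumes that the outputs are first normalised so that they sum to 1, which the slides do not state; the function names are hypothetical.

```python
import math


def shannon_entropy(outputs):
    """S = -sum_i o_i ln(o_i), after normalising the outputs to sum to 1."""
    total = sum(outputs)
    probs = [o / total for o in outputs]
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_in_slide_units(outputs):
    """Express the entropy in units of (ln 2)/10, as in the table legend."""
    return shannon_entropy(outputs) / (math.log(2) / 10)


s_uncertain = shannon_entropy([0.5, 0.5])    # maximum for two outputs: ln 2
s_confident = shannon_entropy([0.99, 0.01])  # close to zero
print(round(s_uncertain, 4), round(s_confident, 4))
```

For a two-output network the value therefore ranges from 0 (one output close to 1) to 10 units (both outputs equal), and a low entropy marks a prediction that depends mainly on the local window.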
75
Profile of the smoothed entropy (S5) for the
hen egg lysozyme (132L)
S5
Protein chain
76
Hen egg lysozyme (132L)
N-terminus
C-terminus
77
Frequency distribution of predicted helical
segments as a function of their entropy
value
Threshold value
78
An example of the data base of minimally
frustrated protein fragments
http://www.biocomp.unibo.it/DB/
79
Training set from PDB
Database of minimally frustrated α-helical
segments
80
Comparison of minimally frustrated segments
with putative folding initiation sites
experimentally determined
Not yet experimentally detected
81
Comparison of minimally frustrated segments with
peptides extracted from proteins
Muñoz and Serrano, 1994.
82
Minimally frustrated α-helical segments are
useful for determining

Folding initiation sites

α-helix stability

de novo design of α-helices

83
Structure prediction of membrane proteins
84
(No Transcript)
85
Outer Membrane proteins (all β-Transmembrane
proteins)
Inner Membrane proteins (all α-Transmembrane
proteins)
86
Outer Membrane
Inner Membrane
β-barrel
α-helices
Bilayer
Bacteriorhodopsin (Halobacterium salinarum)
Porin (Rhodobacter capsulatus)
87
Predictors of the Topology of Membrane Proteins
88
Prediction of transmembrane segments
89
Neural Network for the prediction of TMS in
β-barrel membrane proteins (Jacoboni et al.,
2001)
Output: 2 neurons (TM, nonTM)
Hidden layer: 5 neurons
Input: sequence profile, window of 9 residues
[Figure: network diagram with the input sequence profile]
90
A generic model for membrane proteins (TMHMM)
91
Sequence-profile-based HMM
[Sequence profile: 20 values per row]
 0 85  0  0  5  0  0  0  0  2  0  8  0  0  0  0  0  0  0  0
 0  0  0  0  4  0 13  0  4  0  5  0  6  0  0 23  0  1 44  0
 0  0 22  0 23  0  0  5  0 23  0  3  0 11  0  0  2  0 11  0
 0 34  0  0  0 24  0  0  0  0  0  2  0 22  0 18  0  0  0  0
 8  0  0  0  0  0  0  0  0  0  0  0 92  0  0  0  0  0  0  0
90  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 77  0 23
 3  0  2  7  4  0  8  6  1  3  6  5  5 12  5  6 17  2  2  6
For proteins, A = 20
Constraints: 0 ≤ s_t(n) ≤ S for all t, n; S = 100
Martelli et al., Bioinformatics 18, S46-53, 2002
92

The new algorithms make it possible
to feed HMMs with sequence profiles
to eventually couple NNs and HMMs (Hidden Neural
Networks)
Advantages
Higher performance than standard HMMs
Increased discrimination capability of a given
class

Martelli et al., Bioinformatics, 2002; Martelli et
al., Protein Eng., 2002
93
Prediction of the Topology of α-Transmembrane
Proteins
The prediction accuracy of topography is 92%
The prediction accuracy of topology is 81%
94
Prediction of the Topology of β-Transmembrane
Proteins
The prediction accuracy of topography is 73%
The prediction accuracy of topology is 73%
LPS (Out)

Periplasmic (In)

Topology
position of N and C termini with respect to the
bilayer
95
The discriminative capability of the HMM model
I(s | M) = -(1/L) log P(s | M)
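This score is the negative log-probability of the sequence under the model, divided by the sequence length so that long and short sequences can be compared. A minimal sketch, assuming the natural logarithm (the slide does not give the base) and assuming that log P(s | M) is already available, for example from a forward algorithm; the function name is hypothetical.

```python
import math


def discriminative_index(log_prob, length):
    """I(s | M) = -(1/L) * log P(s | M), with log P(s | M) passed in directly.

    Working with the log-probability avoids numerical underflow, since
    P(s | M) itself becomes vanishingly small for long sequences.
    """
    return -log_prob / length


# Toy case: a sequence of 10 residues, each emitted with probability 0.5.
toy_index = discriminative_index(10 * math.log(0.5), 10)
print(round(toy_index, 4))
```

Sequences that fit the model well get a low index, so a threshold on this value can separate members of the modelled class from other proteins.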
96
An application modeling the 3D structure of
eukaryotic β-barrel proteins
97
3D structure prediction of proteins
New folds
Existing folds
Membrane proteins
Building by homology
Ab initio prediction
Threading/ fold recognition
[Axis: Homology (%), from 0 to 100]
98
Structural alignment of VDAC with the template
99
A low-resolution 3D model of VDAC (the sequence
from Neurospora crassa)
Casa
100
A low resolution 3D model of VDAC location of
mutated residues
Casadio et al., FEBS Lett 520:1-7 (2002)
101
Predictors of membrane protein structures can be
used to filter genomes and find new membrane
proteins without sequence homologues
102
FISHING NEW OUTER MEMBRANE PROTEINS IN
GRAM-NEGATIVE BACTERIA
103
Proteins have intrinsic signals that govern their
transport and localization in the cell: a
hydrophobic secretion marker (or signal peptide)
Signal peptides in protein sequences
MRAKLLGIVLTTPIAISSFASTETLSFTPDNINADISLGTLSGKTKERVY
LAEEGGRKVSQLDWK FNNAAIIKGAINWDLMPQISIGAAGWTTLGSRGG
NMVDQDWMDSSNPGTWTDESRHPDTQL NYANEFDLNIKGWLLNEPNYRL
GLMAGYQESRYSFTARGGSYIYSSEEGFRDDIGSFPNGER AIGYKQRFK
MPYIGLTGSYRYEDFELGGTFKYSGWVESSDNDEHYDPGKRITYRSKVKD
QNY YSVAVNAGYYVTPNAKVYVEGAWNRVTNKKGNTSLYDHNNNTSDYS
KNGAGIENYNFITTAG LKYTF
Sequences of outer membrane proteins have signal
peptides: the secretion marker is also a marker
of outer membrane proteins
104
Signal Peptide prediction
Signal peptide / Mature protein
Cleavage site
105
2 Neural Networks
SignalNet
Predicts if a given residue position belongs to
the signal peptide
CleavageNet
Predicts if a given residue position is the
cleavage site
106
SignalNet Accuracy
107
CleavageNet Accuracy
108
(No Transcript)
109
Performance of SignalNN on 2160 annotated proteins
Q2 = 96%   Qsignal = 82%   Qnon-signal = 97%
Psignal = 78%   Pnon-signal = 98%
110
Predictors of Membrane Topography Rate of false
positives
The predictors are tested on 809 globular
proteins with sequence identity ≤ 25%
0.5% have at least 1 α-TM helix predicted
5.6% have at least 2 β-TM strands predicted
111
PROTEOME
HUNTER
Signal peptide
Yes
No
All-α TM
All-α TM
No
All-β TM
112
Predicting globular, inner and outer membrane
proteins in genomes of Gram-negative bacteria
with Hunter
the number of new proteins predicted in the
class with Hunter, out of the non-annotated
region
113
(No Transcript)
