Title: Protein structure prediction
1Protein structure prediction
Alexander Churbanov
University of Nebraska at Omaha
CSCI 8980
February 14, 2002
2Structure of the presentation
- Introduction
- Protein native structure
- Computational methods of finding a native
structure
- Common methods and principles
- Specific methods
- Homology finding
- Threading
- Modeling on lattice
3Introduction
- In Greek mythology, Sisyphus is condemned to an
eternity of hard labor. His labor is frustrating
and fruitless, for just as he is about to achieve
his goal, his work is undone and he must start
again from the beginning
- Those who work in protein structure prediction
seem to share the same fate
4Problem of protein structure prediction
- Proteins are key molecules in all life processes
- The function of a protein is directly related to
its three-dimensional structure
- Knowing and understanding the structure of
proteins will have a tremendous impact on the
understanding of biological processes, medical
discoveries, and biotechnological inventions
5Problem of protein structure prediction
- For over 30 years, there has been an ardent
search for methods to predict the
three-dimensional (3D) structure from the
sequence
- Many methods were found which initially looked
very promising, but the hope has always been
dashed
6Problem of protein structure prediction
- Given a sequence of amino acids, predict the
unique 3D fold of the molecule that minimizes its
free energy
[Figure: diagram labeled "Primary structure" (Lys, Gly, Leu), "Computational methods of prediction", "Physical methods of prediction", "Practical use of the 3D structural knowledge"]
7General structure of an amino acid
- Each amino acid consists of
- A common main chain part, containing the heavy
atoms N, C, O, Cα, which form the amide plane
- A side chain (residue) of 0 to 10 additional atoms
8Peptide bond
- A peptide bond connects the carboxyl group of the
first amino acid with the amino group of the
second amino acid
- Peptide bonds are planar and rigid
9Sequence of amino acids
- A sequence of amino acids, connected by peptide
bonds, forms a protein
- There is no flexibility for rotation around the
peptide bond
- There is more flexibility for the protein to
rotate around the N-Cα bond (called the φ-angle)
and around the C-Cα bond (ψ-angle)
- These angles are restricted to small regions in
natural proteins
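The two backbone rotation angles are dihedral (torsion) angles defined by four consecutive backbone atoms: C(i-1), N(i), Cα(i), C(i) for the angle around the N-Cα bond, and N(i), Cα(i), C(i), N(i+1) for the angle around the Cα-C bond. The following is a minimal sketch of how such an angle could be computed from atom coordinates. The function name `dihedral_degrees` and the sample coordinates are assumptions made for illustration, not part of the presentation, and the sign of the result follows this particular formula.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees) around the p1-p2 bond for four 3D points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    if norm == 0.0:
        raise ValueError("p1 and p2 must be distinct points")
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    s0 = _dot(b0, b1)
    s2 = _dot(b2, b1)
    v = (b0[0] - s0 * b1[0], b0[1] - s0 * b1[1], b0[2] - s0 * b1[2])
    w = (b2[0] - s2 * b1[0], b2[1] - s2 * b1[1], b2[2] - s2 * b1[2])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: the central bond lies along the z axis.
cis = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

For a real protein, the four points would be the backbone atom coordinates listed above, read from a structure file.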
10Part of Protein (PheAspAla)
11Protein folding
- Using the freedom of rotations, the protein can
fold into a specific and unique three dimensional
structure (called conformation), forming a native
structure
12Computational methods to find a protein structure
- The unique 3D arrangement of a protein corresponds
to the lowest free energy conformation
- Most computational approaches for solving the
protein folding problem look for the lowest free
energy conformation
- Two principal methods are currently in use for
computing the lowest energy conformation
- Molecular dynamics
- Monte Carlo
13Molecular dynamics
- Forces acting on each atom at a particular state
of the system are calculated using an empirical
force field
- Atoms are allowed to move with the accelerations
resulting from these forces, changing the conformation
- Once the atoms have moved significantly, the acting
forces are recalculated (every 10^-15 sec)
- Even supercomputers can simulate only 10^-9 sec
of folding time, which is insufficient
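The loop described on this slide (compute forces, move atoms, recompute forces) can be sketched with a velocity Verlet integrator. This is only an illustration under strong assumptions: a single coordinate, a harmonic "bond" instead of an empirical force field, and arbitrary units. The names `harmonic_force`, `velocity_verlet` and `total_energy` are hypothetical. With a step of 10^-15 sec, covering 10^-9 sec of real time would take about 10^6 such iterations, each requiring a full force evaluation.

```python
def harmonic_force(x, k=1.0, x0=1.0):
    """Force of a toy harmonic bond with stiffness k and rest length x0."""
    return -k * (x - x0)

def total_energy(x, v, mass=1.0, k=1.0, x0=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * k * (x - x0) ** 2

def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate one coordinate; forces are recomputed after every move."""
    f = force(x)
    trajectory = [x]
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # move the atom
        f = force(x)                       # recalculate the force
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        trajectory.append(x)
    return x, v, trajectory

x_end, v_end, trajectory = velocity_verlet(
    x=1.5, v=0.0, force=harmonic_force, mass=1.0, dt=0.01, steps=1000)
```

A useful sanity check for any such integrator is that the total energy stays nearly constant over the run.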
14Monte Carlo method
- Used with a simplified model of the protein (does
not consider the structure of every amino acid)
- The procedure makes a random move from the current
conformation and evaluates the resulting energy
change
- If the new conformation is better, it replaces the
old one, and the process repeats
- The method is not powerful enough to find an optimal
conformation even for simple cases
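The procedure above can be sketched as follows. The conformation is reduced to a short list of angles and the energy is a made-up rugged function, so `toy_energy` and `greedy_monte_carlo` are illustrative assumptions, not a real protein model. The acceptance rule is the one stated on the slide: keep a move only if it lowers the energy. Many Monte Carlo schemes also accept some uphill moves (the Metropolis criterion); that variant is not shown here.

```python
import math
import random

def toy_energy(angles):
    """Made-up rugged energy over a list of angles (radians)."""
    return sum((1 - math.cos(a)) + 0.5 * (1 - math.cos(3 * a)) for a in angles)

def greedy_monte_carlo(angles, energy, steps, max_step=0.5, seed=0):
    """Random single-angle moves; a move is kept only if the energy drops."""
    rng = random.Random(seed)
    current = list(angles)
    current_energy = energy(current)
    for _ in range(steps):
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-max_step, max_step)   # random move
        trial_energy = energy(trial)
        if trial_energy < current_energy:              # better: replace old one
            current, current_energy = trial, trial_energy
    return current, current_energy

start = [2.0, -2.5, 3.0]
final, final_energy = greedy_monte_carlo(start, toy_energy, steps=2000)
```

Because only improving moves are accepted, a run of this kind can stop in a local minimum, which is one way to read the slide's remark that the method fails to find the optimum even in simple cases.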
15Knowledge based structure prediction methods
- The most successful structure prediction tools
are knowledge-based, using a combination of
statistical theory and empirical rules
- The most successful theoretical approach is
homology modelling
16Homology modeling
- Given a sequence of unknown fold (denote it U), if U
has significant sequence similarity to a protein
of known structure (T) (i.e., if the pairwise
sequence identity is > 25%), it is possible to
construct an approximate 3D model which has a
correct fold but inaccurate loop regions
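The 25% criterion can be illustrated with a small calculation on an existing pairwise alignment. This sketch assumes one common definition of sequence identity (identical residues divided by the number of aligned, gap-free columns); other definitions exist and give different numbers. The function names and the example alignment are hypothetical.

```python
def percent_identity(aligned_u, aligned_t, gap="-"):
    """Percent of identical residues over gap-free alignment columns."""
    if len(aligned_u) != len(aligned_t):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    matches = 0
    for a, b in zip(aligned_u, aligned_t):
        if a == gap or b == gap:
            continue            # skip columns with a gap in either sequence
        aligned += 1
        if a == b:
            matches += 1
    if aligned == 0:
        return 0.0
    return 100.0 * matches / aligned

def is_homology_candidate(aligned_u, aligned_t, threshold=25.0):
    """True if identity exceeds the threshold quoted on the slide."""
    return percent_identity(aligned_u, aligned_t) > threshold

# Hypothetical alignment of U against a template T.
identity = percent_identity("ACDE-G", "ACQEFG")
```

Here four of the five gap-free columns match, so the identity is 80%.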
17Homology modeling
- The basic assumption of homology modelling is
that U and the homologous template protein of
known structure (T) have nearly identical
backbone structure in the aligned regions
- One new generation of alignment methods is based
on Hidden Markov Models and another on genetic
algorithms
18Homology modeling
- For sequence identities down to about 30%, U and T
will still have the same fold, but the number of
loops inserted grows and the divergence between U
and T becomes considerable
- Modelling of loop regions is still a difficult
problem: even the best methods only rarely
achieve atomic accuracy, and the modelled loops are
often completely different from the correct structure
19Homology modeling
- A pessimistic view is that the accuracy of the
resulting 3D predictions is typically at the
level of ribbon plots, i.e. only the mutual
orientation of elements such as helices and
sheets can be identified
- The optimistic version is that even down to
levels of 30% sequence identity, homology
modelling occasionally yields correct predictions
at atomic resolution
20Three difficult problems of homology modeling
- Remote homology modelling (< 25%) has three
obstacles to overcome
- the remote homology between U and T has to be
detected
- U and T have to be aligned correctly
- the homology modelling procedure has to be
tailored to the harder problem of extremely low
sequence identity
21Solution to the first problem
- In the early 1990s, there was a great deal of
optimism that the first obstacle, the detection
of similar folds, would be solved by threading
methods
- The basic idea is to thread the sequence of U
into the backbone 3D structure of T, at each step
evaluating the 'fitness of sequence for
structure' using environment-based or
knowledge-based mean-force potentials
22Protein threading
- Many proteins in nature are homologous
- They have different primary structure
- They form similar conformations to carry out the
same function in living matter
- There are groups of proteins having the same
evolutionary origin
23Protein threading
- Most proteins share the same secondary structure
motifs
- Helices
- Extended strands forming sheets
- Specific turns
- Random coils
24Protein threading
- Threading means mapping a given sequence to a
given structure
- To assign a structure to a sequence, one would
then need to thread the sequence through all
known conformations, evaluating compatibility,
and assign the most compatible structure to the
sequence
- Upon discovery of a structure completely different
from any known one, enter it into the database of
structures
25Protein threading
- Structure is presented by the black trace
- Sequence (at the top) is threaded through the
structure, encoding an alignment (at the bottom)
- Zero means structure deletion, values greater
than one mean sequence deletion, while one is a
fit
26Protein threading
- The size of the search space to thread a sequence
of length k into a structure of size n could be
found as a selection with repetition
- The search space is huge and the problem appears to
be NP-complete (Unger, R., Moult, J., 1993)
27Protein threading
- In order to reduce the complexity of the search task, (m
1) core and m non-core regions are introduced
- Usually α-helices and β-sheets are core regions,
connected by loops
- Total number of amino acids in core regions is c
28Protein threading
- Although suffering from some inherent limitations
(such as prediction of the right structure with
completely wrong threading), the method became a
significant tool in protein structure prediction
- Any threading procedure must contain two major
components
- An alignment algorithm to position a sequence on
a structure
- A score function to evaluate the energy of the
sequence in the given conformation
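The two components can be sketched in a deliberately simplified form. The assumptions here are considerable: each template structure is reduced to a string of per-position environments ("b" for buried, "e" for exposed), the alignment is gapless sliding rather than a real threading alignment, and the score is a toy rule rewarding hydrophobic residues in buried positions and polar residues in exposed ones. The residue grouping in `HYDROPHOBIC`, all function names, and the two-entry library are hypothetical.

```python
# Assumed hydrophobic set, for illustration only.
HYDROPHOBIC = set("AVLIMFWC")

def position_score(residue, environment):
    """Toy score function: -1 for a compatible position, +1 otherwise."""
    buried = environment == "b"
    hydrophobic = residue in HYDROPHOBIC
    return -1 if buried == hydrophobic else 1

def thread_gapless(sequence, environments):
    """Alignment step: slide the sequence along the template, no gaps."""
    if len(sequence) > len(environments):
        raise ValueError("sequence is longer than the template")
    best = None
    for offset in range(len(environments) - len(sequence) + 1):
        score = sum(position_score(r, environments[offset + i])
                    for i, r in enumerate(sequence))
        if best is None or score < best[0]:
            best = (score, offset)
    return best

def assign_fold(sequence, library):
    """Thread through every template and keep the most compatible one."""
    results = []
    for name, environments in library.items():
        if len(environments) >= len(sequence):
            score, offset = thread_gapless(sequence, environments)
            results.append((score, name, offset))
    if not results:
        raise ValueError("no template is long enough")
    score, name, offset = min(results)
    return name, score, offset

library = {"foldA": "eebebe", "foldB": "bbbbbb"}
best_fold = assign_fold("LKLK", library)
```

Real threading methods allow gaps in loop regions and use statistically derived potentials, which is where the search becomes hard.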
29Protein threading possible implementations
- Protein threading could be implemented using
- Enumeration for small problems
- Dynamic programming to find core regions to
freeze
- Monte Carlo variants with Gibbs sampling
- Branch and bound search
- Genetic programming with constraints seems to be
a decent alternative in comparison with other
methods
30Protein structure prediction on lattice
- Another way to model protein folding in 3D space
is to assume certain simplifications
- Modeling on a lattice is a way to fight the
complexity of the prediction problem
- Though the problem on a lattice is still
NP-complete, we can significantly expand the size
of the protein modeled
31Protein simplification for lattice model
- Monomers (or residues) are represented using a
unified size
- Bond length is unified
- The positions of the monomers are restricted to
positions in a lattice
- Simplified energy function
32HP - model
- The 20-letter alphabet of amino acids is reduced to
a two-letter alphabet, namely H and P
- H represents a non-polar or hydrophobic amino acid
- P represents a polar or hydrophilic amino acid
33The energy function
- The energy function for HP-model is given by the
matrix
- Energy contribution of a contact between two
monomers is -1 if both are H-monomers, and 0
otherwise
34Contact energy
- Two monomers form a contact in some specific
conformation if they are not connected via a
bond, but occupy neighboring positions in the
conformation
- A conformation with minimal energy is just a
conformation with the maximal number of contacts
between H-monomers
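The contact definition can be turned into a short energy calculation. This sketch assumes a 2D square lattice, a conformation written as a string of unit moves (U, D, L, R), and the convention that each H-H contact contributes -1, so that minimal energy corresponds to the maximal number of contacts. The function names and the example walk are hypothetical.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def walk_to_coords(moves, start=(0, 0)):
    """Turn a move string into lattice positions; reject overlapping chains."""
    coords = [start]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    return coords

def hp_energy(sequence, coords):
    """Minus the number of H-H contacts between non-bonded lattice neighbors."""
    if len(sequence) != len(coords):
        raise ValueError("one lattice position is needed per monomer")
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):      # j = i + 1 is a bond
            if sequence[i] == "H" and sequence[j] == "H":
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:   # lattice neighbors
                    energy -= 1
    return energy

folded = hp_energy("PHPPHHPH", walk_to_coords("URULULD"))
extended = hp_energy("PHPPHHPH", walk_to_coords("RRRRRRR"))
```

In the folded walk, the H at chain position 5 touches the H-monomers at positions 2 and 8 (counting from 1), while the fully extended chain has no contacts at all.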
35Sample conformation
- A sample conformation for the sequence PHPPHHPH
in the two-dimensional lattice with energy -2
[Figure: sample conformation of PHPPHHPH on the 2D lattice]
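For a chain this short, the lowest-energy conformation on the 2D square lattice can be found by enumerating every self-avoiding walk. The sketch below assumes that each H-H contact contributes -1; the names `hp_energy` and `best_conformation` are hypothetical. Enumeration is only feasible for small chains, since the number of walks grows exponentially with length.

```python
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def hp_energy(sequence, coords):
    """Minus the number of H-H contacts between non-bonded lattice neighbors."""
    index = {pos: i for i, pos in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in STEPS:
            j = index.get((x + dx, y + dy))
            # j > i + 1 counts each pair once and skips bonded neighbors.
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1
    return energy

def best_conformation(sequence):
    """Exhaustively enumerate self-avoiding walks; return (energy, coords)."""
    n = len(sequence)
    best_energy, best_coords = 0, None

    def extend(coords, occupied):
        nonlocal best_energy, best_coords
        if len(coords) == n:
            energy = hp_energy(sequence, coords)
            if best_coords is None or energy < best_energy:
                best_energy, best_coords = energy, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in STEPS:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                coords.append(nxt)
                occupied.add(nxt)
                extend(coords, occupied)
                coords.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best_energy, best_coords

energy, coords = best_conformation("PHPPHHPH")
```

For PHPPHHPH the search confirms that -2 is the minimum: on a square lattice only monomers an odd number of positions apart can be neighbors, which leaves at most two possible H-H contacts for this sequence.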
36Cubic lattice
37Native conformation
38Vertical and horizontal contribution to the
surface of a conformation in
[Figure: panels labeled "Vertical contribution to the surface" and "Horizontal contribution to the surface"]
39Conclusions
- Native 3D structures of proteins are encoded by a
linear sequence of amino acid residues
- To predict 3D structure from sequence is a task
challenging enough to have occupied a generation
of researchers
- Have they finally succeeded in their goal? The
bad news is no, we still cannot predict the
structure for an arbitrary sequence
- The good news is that we have come closer, and
growing databases facilitate the task.