Title: Genetic Threading
1Genetic Threading
- By J.Yadgari and A.Amir
- Published special issue on Bioinformatics in
Journal of Constraints, June 2001
Alexandre Tchourbanov University of Nebraska at
Lincoln CSCE 421-821 December 4, 2001
2Structure of the presentation
- Introduction to protein native structure
- Methods of finding a native structure
- Physical
- Computational
- Common methods and principles
- Protein threading method
- Protein threading using genetic approach
3Problem of protein structure prediction
- Proteins are key molecules in all life processes
- The function of a protein directly related to its
three dimensional structure - Knowing and understanding the structure of
proteins will have a tremendous impact on
understanding of biological processes, medical
discoveries, and biotechnological inventions
4Problem of protein structure preduction
- Given a sequence of amino acids, predict the
unique 3D folding of molecule minimizing its free
energy
1
2
3
Lys
Computational Methods of prediction
Practical use of the 3D structural knowledge
Gly
Leu
Physical methods of prediction
Primary structure
5Protein structure
- A protein is built up from a chain of amino acids
linked by peptide bonds - There are 20 amino acids that can be divided into
several classes based on size and other chemical
and physical properties - Depending on type of a residue, protein could be
either hydrophilic (water loving) or hydrophobic
(water hating)
6General structure of an amino acid
- Each amino acid consists of
- Common main chain part, containing the heavy
atoms N, C, O, C? forming amide plane - Chain residue of size 0 10 additional atoms
7Peptide bond
- Peptide bond connects carboxyl group of the first
amino acid with amino group of the second acid - Peptide bonds are planar and rigid
8Sequence of amino acids
- Sequence of amino acids, connected by peptide
bonds, form protein - There is no flexibility for rotation around
peptide bond - There is more flexibility for protein to rotate
around N-C?-bond (called the ?-angle) and around
C-C?-bond (?-angle) - These angles are restricted to small regions in
natural proteins
9Part of Protein (PheAspAla)
10Protein folding
- Using the freedom of rotations, the protein can
fold into a specific and unique three dimensional
structure (called conformation), forming a native
structure
11Physical methods of determiningprotein native
structure
X-ray crystallography
Physical methods
NMR (Nuclear Magnetic Resonance)
- X-ray crystallography requires significant
amounts of purified protein molecules (1014) to
grow a crystal and protein needs to crystallize - NMR method applicable to proteins of small and
average size, which do not crystallize - Both methods are expensive and give coherent
results on the same protein, proving to be
correct - Structure of many important proteins is still
unknown
12Protein structure in X-ray crystallography
- X-ray diffraction pattern is recorded and
processed using FFT to form electron density map - Regions of map with the highest electron density
reveal the location of atomic nuclei
13Family of structures in NMR method
- Absorption of radio frequency energy is recorded
as a 2D spectrum - Possible 3D structures are constructed by
computer according to NMR signal
14Computational methods to find a protein structure
- The unique 3D arrangement of protein corresponds
to lowest free energy conformation - Most computational approaches for solving the
protein folding problem look for the lowest free
energy conformation - Two principal methods are currently in use for
computing the lowest energy conformation - Molecular dynamics
- Monte Carlo
15Molecular dynamics
- Forces acting on each atom at a particular state
of the system are calculated using an empirical
force field - Atoms allowed to move with accelerations
resulting from forces, changing conformation - Once atom moved significantly, acting forces are
recalculated (every 10-15 sec) - Even super computers can simulate only 10-9 sec
of folding time, which is insufficient
16Monte Carlo method
- Used with simplified model of protein (does not
consider structure of every amino acid) - Procedure makes random move from current
conformation and evaluates resulting energy
changes - If new conformation is better, it replaces old
one with newly generated, and process repeats - Method is not powerful enough to find an optimal
conformation even for simple cases
17Protein threading
- Many proteins in nature are homologous, having
different primary structure, but forming the same
conformation to carry out the same functionality
in a living matter and having the same
evolutionary origin - Most protein share the secondary structure
motifs - Helices
- Extended strands forming sheets
- Specific turns
- Random coils
18Protein threading
- Threading means mapping a given sequence to a
given structure - To assign a structure to a sequence one would
then need to thread the sequence through all
known conformations, evaluating compatibility,
and assign the most compatible structure to the
sequence - Upon discovery of completely different structure
from any known, enter it into database of
structures
19Protein threading
- Structure is presented by the black trace
- Sequence (at the top) is threaded through the
structure, encoding an alignment (at the bottom) - Zero means structure deletion, values greater
that one mean sequence deletion, while one is a
fit
20Protein threading
- The size of the search space to thread sequence
of length k into structure of size n could be
found as a selection with repetition - Search space is huge and problem appears to be
NP-complete Unger,R., Moult,J. (1993)
21Protein threading
- In order to reduce complexity of search task, (m
1) core and m non-core regions are introduced - Usually ?-helices and ?-sheets are core regions,
connected by loops - Total number of amino acids in core regions is c
22Protein threading
- Although suffering from some inherent limitations
(such as prediction of the right structure with
completely wrong threading), method became a
significant tool in protein structure prediction - Any threading procedure must contain two major
components - An alignment algorithm to position a sequence on
a structure - Score function to evaluate the energy of the
sequence in given conformation
23Protein threading possible implementations
- Protein threading could be implemented using
- Enumeration for small problems,
- Dynamic programming to find core regions to
freeze, - Monte Carlo variants with Gibbs sampling
- Branch and bound search
- Genetic programming with constraints seems to be
a decent alternative in comparison with other
methods
24Protein threading using genetic programming
- Genetic Algorithms are parallel computational
tools that are based on the principle of
diversity and selection - Solutions are represented as strings, for example
11111100111311 - Sum of all terms in the string needs to be equal
to the number of amino acids in the sequence, as
well as length of the string equal to the length
of the structure
25Protein threading using genetic programming
- These strings are maintained as a population that
undergoes evolutionary process via generic
operators such as - Replication (copying of the string to the next
generation) - Mutation (changing bits in the string)
- Crossover (concatenating a prefix of one string
with suffix of another) - Energy function is a good candidate to evaluate
fit of an offspring
26Energy function
- Energy functions are subject to minimizations
- Energy functions are calculated by extracting
from the structural database frequencies of
interactions between pairs of residues as a
function of amino acids types and distance - Tendency of certain hydrophilic residues to be on
the surface can be approximated by energy term
related to the position
27Implementing mutation
- An example of mutation could be transformation of
1111100111311 into 11111100211211, which is also
a valid encoding - We need to have validity check every time we do
mutation and compensate for problems - Reverting of substrings is especially interesting
mutation, since it does not violate a valid
structure of the solution
28Implementing crossovers
Parent 1
1
1
2
0
1
1
2
0
1
1
1
1
1
1
Parent 2
1
1
1
1
1
1
0
0
1
1
1
3
1
1
Offsprings
1
1
2
0
1
1
2
0
1
1
1
3
1
1
1
1
1
1
1
1
0
0
1
1
1
1
1
1
29Following issues were addressed
- The linear trade-off between population size and
the number of generations - Optimal level of mutation rate
- Locality of mutation operator
- Locality of the crossover operator
- Regular mutations versus reverse mutations
- Magnitude of the mutation operation
- Quality control of the crossover operation
30Results
- For authors examples, the optimal performance is
achieved with population size of 300 solutions
and duration of 1000 generations - The optimal rate of mutations is 0.25 to 0.3 of
the populations
31The minimal energy of threading runs
32The average energy of the population during
threading
33Structural comparisons
Difference between sequence deletions
and structure deletions plots
Structural alignment
Most similar threading alignment
Least similar threading alignment
34Maximal mutation magnitude
Average score of 5 runs after 600 generations
Average score of 5 runs after 2000 generations
35Summary
- The running time of a GA depends linearly on the
number of solutions in the population (i.e.
population size) and also depends linearly on the
number of generations the process is repeated - Genetic algorithms method is a feasible and
efficient approach to threading - It is especially encouraging that the threading
alignments are quite similar, quantitatively, to
the structural alignments
36Summary
- Changing the locality of the mutation and
crossover operation does not show a consistent
change in the performance of the algorithm - Mutations of high magnitude are
counterproductive, probably because changes
between the template and the assigned structure
do not tend to concentrate in single position - Using crossover under strict quality control was
shown not to be effective, since genetic
mechanism has quality control itself
37Summary
- The success of the reverse mutation is quite
surprising and should be further explored
38Future work
- Threading algorithms should be tested on their
ability to assign a conformation for new and
unknown sequence - Authors plan to implement the genetic algorithm
in a complete threading package, with all the
necessary components and to test it in a
realistic prediction setup.