Title: Gene Finding
1. Genetic Algorithms and Protein Folding
Based on lecture by Dr. Steffen Schulze-Kremer
http://www.techfak.uni-bielefeld.de/bcd/Curric/ProtEn/proten.html
2. A genetic algorithm is a heuristic method that
operates on pieces of information like nature
does on genes in the course of evolution.
- Individuals are represented by a linear string of
letters of an alphabet (in nature: nucleotides; in
genetic algorithms: bits).
- Individuals are allowed to mutate, cross over and
reproduce.
- A fitness function evaluates individuals.
- Depending on the generation replacement mode, a
subset of parents and offspring enters the next
reproduction cycle.
- After a number of iterations the population
consists of individuals that are well adapted in
terms of the fitness function.
- It cannot be proven that the individuals of a
final generation contain an optimal solution for
the objective encoded in the fitness function.
3. I. Initialise a population of individuals.
- This can be done either randomly or with
domain-specific background knowledge to start the search
with promising seed individuals. (Where available,
the latter is always recommended.)
- Individuals are represented as a string of bits.
- A fitness function must be defined that takes as
input an individual and returns a number (or a
vector) that can be used as a measure for the
quality (fitness) of that individual.
- The application should be formulated in a way
that the desired solution to the problem
coincides with the most successful individual
according to the fitness function.
4. II. Evaluate all individuals of the initial
population.
III. Generate new individuals. The
reproduction probability for an individual is
proportional to its relative fitness within the
current generation.
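The fitness-proportional reproduction rule of step III can be sketched as roulette-wheel selection. This is only an illustration, assuming a non-negative fitness where higher is better; the function name `select_parents` and the toy population and scores are hypothetical.

```python
import random

def select_parents(population, fitnesses, k, rng=None):
    """Draw k parents; each draw picks an individual with
    probability fitness / sum(fitnesses)."""
    rng = rng or random.Random()
    if min(fitnesses) < 0 or sum(fitnesses) <= 0:
        raise ValueError("fitness values must be non-negative and not all zero")
    # random.choices normalises the weights, which gives the relative fitness.
    return rng.choices(population, weights=fitnesses, k=k)

population = ["0101", "1100", "1111", "0000"]
fitnesses = [1.0, 2.0, 4.0, 0.0]   # hypothetical scores, higher is better
parents = select_parents(population, fitnesses, k=6, rng=random.Random(0))
```

When the fitness is an energy (lower is better), the values would first have to be mapped to non-negative weights; how that mapping is done is a design choice not specified on the slide.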
5. Crossover
[Figure: two-point crossover and uniform crossover, each illustrated on example bit strings]
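The two crossover variants named on this slide might be sketched as follows. The function names are hypothetical, individuals are assumed to be equal-length strings of `0` and `1`, and the 0.5 swap probability in the uniform variant is an assumption.

```python
import random

def two_point_crossover(a, b, rng=None):
    """Swap the segment between two random cut points."""
    rng = rng or random.Random()
    if len(a) != len(b) or len(a) < 3:
        raise ValueError("parents must have equal length of at least 3")
    # Cut sites lie between positions, so there are len(a) - 1 of them.
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def uniform_crossover(a, b, rng=None):
    """Decide independently at every position whether the parents swap bits."""
    rng = rng or random.Random()
    if len(a) != len(b):
        raise ValueError("parents must have equal length")
    pairs = [(y, x) if rng.random() < 0.5 else (x, y) for x, y in zip(a, b)]
    return "".join(p[0] for p in pairs), "".join(p[1] for p in pairs)

parent_a, parent_b = "0" * 10, "1" * 10
c1, c2 = two_point_crossover(parent_a, parent_b, random.Random(1))
u1, u2 = uniform_crossover(parent_a, parent_b, random.Random(1))
```

Using all-zero and all-one parents makes it easy to see which positions each child inherited from which parent.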
6. Genetic Operators
- Mutation. Substitute one or more bits of an
individual randomly by a new value (0 or 1).
- Variation. Change the bits in a way that the
number encoded by them is slightly incremented
or decremented.
- Crossover. Exchange parts (single bits or
strings of bits) of one individual with the
corresponding parts of another individual.
Originally, only one-point crossover was
performed, but theoretically one can process up
to L - 1 different crossover sites (with L as the
length of the individual).
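The mutation and variation operators described above could look like this for bit-string individuals. This is a sketch under assumptions: the per-bit mutation rate, the step size of one, the unsigned-integer interpretation and the clamping at the ends of the range are all illustrative choices, and the function names are hypothetical.

```python
import random

def mutate(bits, rate, rng=None):
    """Replace each bit, with probability `rate`, by a freshly drawn 0 or 1."""
    rng = rng or random.Random()
    return "".join(rng.choice("01") if rng.random() < rate else b for b in bits)

def variate(bits, rng=None):
    """Increment or decrement the unsigned integer encoded by `bits` by one."""
    rng = rng or random.Random()
    value = int(bits, 2) + rng.choice((-1, 1))
    value = max(0, min(value, 2 ** len(bits) - 1))  # clamp to the representable range
    return format(value, "0{}b".format(len(bits)))

unchanged = mutate("010101", rate=0.0)
mutated = mutate("0000", rate=1.0, rng=random.Random(3))
varied = variate("0100", rng=random.Random(0))
```

Note that a freshly drawn bit may equal the old one, so a mutation does not always change the individual.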
7. IV. Select individuals for the new parent
generation. Schemes:
1) The complete offspring is selected while all
parents are discarded (original genetic algorithm).
This is motivated by the biological model and is
called total generation replacement.
2) The n best individuals (from old and new
generation) are selected. This method is called
elitist generation replacement.
V. Go back to step II until either a desired
fitness value has been reached or a predefined
number of iterations has been performed.
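The two replacement schemes and the stopping rule of step V can be sketched as below. The names `total_replacement`, `elitist_replacement`, `evolve` and the toy `breed` and `count_ones` helpers are hypothetical, and a scalar fitness where higher is better is assumed.

```python
def total_replacement(parents, offspring, n):
    """Scheme 1: discard all parents and keep (the first n of) the offspring."""
    return list(offspring)[:n]

def elitist_replacement(parents, offspring, n, fitness):
    """Scheme 2: keep the n fittest individuals of parents and offspring combined."""
    return sorted(list(parents) + list(offspring), key=fitness, reverse=True)[:n]

def evolve(population, fitness, breed, target, max_iterations):
    """Step V: iterate until the target fitness is reached or iterations run out."""
    for _ in range(max_iterations):
        if max(map(fitness, population)) >= target:
            break
        offspring = breed(population)
        population = elitist_replacement(population, offspring, len(population), fitness)
    return population

def count_ones(individual):          # toy fitness
    return individual.count("1")

def breed(population):               # toy reproduction: set the first 0 bit to 1
    return [p.replace("0", "1", 1) for p in population]

final = evolve(["0000", "0001"], count_ones, breed, target=4, max_iterations=10)
```

Elitist replacement never loses the best individual found so far, whereas total replacement follows the biological model more closely.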
8. Init the first generation
9. Representation Formalism
- Hybrid approach: the genetic algorithm is configured
to operate on numbers, not bit strings as in the
original genetic algorithm.
- Disadvantages
- The mathematical foundation of genetic algorithms
holds only for binary representations, although
some of the mathematical properties are also
valid for a floating point representation.
- Binary representations run faster in many
applications.
- An additional encoding/decoding process may be
required to map numbers onto bit strings.
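The encoding/decoding step mentioned in the last bullet might look like the following fixed-point mapping. It is a sketch only: the linear quantisation, the 8-bit width and the angle range of -180 to 180 degrees are assumptions chosen for illustration, and `encode` and `decode` are hypothetical names.

```python
def encode(value, low, high, n_bits):
    """Map a number in [low, high] onto a bit string of length n_bits."""
    levels = 2 ** n_bits - 1
    index = round((value - low) / (high - low) * levels)
    return format(index, "0{}b".format(n_bits))

def decode(bits, low, high):
    """Map a bit string back to a number in [low, high]."""
    levels = 2 ** len(bits) - 1
    return low + int(bits, 2) / levels * (high - low)

bits = encode(60.0, -180.0, 180.0, n_bits=8)
angle = decode(bits, -180.0, 180.0)
```

The round trip is only exact up to the quantisation step, here 360/255 degrees, which is one reason a representation that operates directly on numbers can be attractive.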
10. Protein Structure Prediction
- Individuals: protein conformations
- Fitness function: force field
11. Representation
Cartesian 3D coordinates are not a good choice.
12.
- The frequency of each torsion angle in intervals
of 10° was determined and the ten most frequently
occurring intervals are made available for
substitution of individual torsion angles by the
MUTATE operator.
- At the beginning of the run, individuals were
initialized with either a completely extended
conformation where all torsion angles are 180° or
by a random selection from the ten most
frequently occurring intervals of each torsion
angle.
- For the ω torsion angle the constant value of
180° was used because of the rigidity of the
peptide bond between the atoms C(i) and N(i+1).
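The two initialisation modes described above can be sketched as follows. Everything specific here is an assumption made for illustration: the interval tables are placeholder values rather than measured frequencies, only phi, psi and omega are modelled (side-chain angles are left out), and the midpoint of a chosen 10° interval is used as the angle value.

```python
import random

# Placeholder left edges of ten 10-degree intervals per angle (NOT real statistics).
FREQUENT_INTERVALS = {
    "phi": list(range(-150, -50, 10)),
    "psi": list(range(-60, 40, 10)),
}

def extended_conformation(n_residues):
    """All torsion angles set to 180 degrees."""
    return [{"phi": 180.0, "psi": 180.0, "omega": 180.0} for _ in range(n_residues)]

def random_conformation(n_residues, intervals, rng=None):
    """Draw phi and psi from the listed intervals; omega stays fixed at 180."""
    rng = rng or random.Random()
    conformation = []
    for _ in range(n_residues):
        residue = {"omega": 180.0}
        for name in ("phi", "psi"):
            left_edge = rng.choice(intervals[name])
            residue[name] = left_edge + 5.0   # midpoint of the interval (assumption)
        conformation.append(residue)
    return conformation

extended = extended_conformation(46)
randomised = random_conformation(46, FREQUENT_INTERVALS, random.Random(0))
```

The same interval tables could also feed a MUTATE operator that replaces a single torsion angle by a value from one of its frequent intervals.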
13. Search Space
- Generally, molecules with n atoms have 3n - 6
degrees of freedom.
- 100 residues × approximately 20 atoms per residue
-> 5994 degrees of freedom
- Systems of equations with this number of
variables are analytically intractable today.
- Discrete approximation:
(5 torsion angles per residue
× 5 likely values per torsion angle)^100 = 25^100 conformations
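The arithmetic behind these figures can be checked directly. The variable names are illustrative. The last line is an added comparison, not part of the slide: if each of the 5 angles in each residue were counted as an independent choice, the count would be 5^500, which is larger still.

```python
n_residues = 100
atoms_per_residue = 20                      # approximate, as on the slide
n_atoms = n_residues * atoms_per_residue

degrees_of_freedom = 3 * n_atoms - 6
print(degrees_of_freedom)                   # 5994

angles_per_residue = 5
values_per_angle = 5

# Count as given on the slide: (5 x 5)^100 = 25^100
slide_count = (angles_per_residue * values_per_angle) ** n_residues
print(len(str(slide_count)))                # number of decimal digits

# Added comparison: every angle chosen independently gives 5^(5 * 100)
independent_count = values_per_angle ** (angles_per_residue * n_residues)
print(len(str(independent_count)))
```

Either way the number of conformations is far beyond exhaustive enumeration, which is the motivation for a heuristic search.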
14. Fitness Function - Potential Energy
CHARMM energy function, a sum of the following terms:
- bond length potential (set to const)
- bond angle potential (set to const)
- torsion angle potential
- improper torsion angle potential (set to const)
- van der Waals pair interactions
- electrostatic potential
- hydrogen bonds (set to const)
- interaction with the solvent (set to const -> in vacuum)
Simplified to the terms that are not held constant:
torsion angle potential, van der Waals pair
interactions and electrostatic potential.
(Since there are no interactions with the
solvent, there is not enough force to drive the
protein to a compact folded state.)
15. Simplified Energy Function
- Empirical relation between the number of residues
and the diameter
- Pseudo-entropic term
16. First test protein: crambin, 46 a.a.
Table 3. Steric Energies in the Last Generation
Table 2. R.m.s. Deviations to Native Crambin
The genetic algorithm favoured individuals with
lowest total energy which in this case was most
easily achieved by optimising electrostatic
contributions.
Simple summation of different components has the
disadvantage that components with larger numbers
would dominate the fitness function whether or
not they are important or of any significance at
all for a particular conformation.
In other words -> bad fitness function
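The dominance effect of a simple sum can be seen in a small example. The numbers are invented for illustration (arbitrary units, lower is better) and are not taken from the tables referenced above.

```python
def summed_fitness(components):
    """Scalar fitness as the plain sum of all energy components."""
    return sum(components.values())

# Hypothetical energy components of two conformations.
conformations = {
    "A": {"torsion": 12.0, "van_der_waals": 30.0, "electrostatic": -1000.0},
    "B": {"torsion": 3.0, "van_der_waals": -20.0, "electrostatic": -850.0},
}

ranking = sorted(conformations, key=lambda name: summed_fitness(conformations[name]))
print(ranking)   # lowest summed energy first
```

Conformation A ranks first although it is worse than B in both the torsion and the van der Waals term, because the electrostatic term is an order of magnitude larger than the others.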
17. Improvements
- Instead of using separate phi and psi value
distributions, apply a phi-psi (2D) clustering
procedure.
- Use a secondary structure prediction algorithm (70%
accuracy).
- Specialised genetic operators
- LOCAL TWIST (local conformation changes by
performing the ring closure algorithm for
polymers)
The LOCAL TWIST operator led to significant
improvements in prediction accuracy and also to a
substantial decrease in overall computation time.
18. Improvements (2)
Fitness function -> vector
r.m.s. only for verification
19. Vector Fitness Function
- Candidate selection for the next generation
- If there is an individual that has better (i.e.
lower) values in each fitness component, then we
take it. Continue until no unambiguously better
individuals are found.
- Then remove the worst individuals, i.e. those
with higher values in each fitness component than
any other individual.
- The remaining set of individuals is heuristically
reduced until the exact number of individuals for
the next generation is reached. This is done by
iteratively removing an individual with the worst
fitness value in a randomly selected fitness
component.
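One possible reading of this three-stage selection is sketched below. The slide leaves details open, so the interpretation is an assumption: "better in each component" is taken as strictly lower in every component than every other remaining candidate, and the function names are hypothetical. Fitness vectors are tuples where lower is better.

```python
import random

def dominates(a, b):
    """True if vector a is lower (better) than b in every component."""
    return all(x < y for x, y in zip(a, b))

def find(pool, test):
    """Index of the first candidate for which test(candidate, other) holds
    against every other member of the pool, or None."""
    return next((i for i, c in enumerate(pool)
                 if all(test(c, o) for j, o in enumerate(pool) if j != i)), None)

def select_next_generation(candidates, size, rng=None):
    rng = rng or random.Random()
    pool, selected = list(candidates), []
    # 1. Take unambiguously better individuals while any exist.
    while pool and len(selected) < size:
        winner = find(pool, dominates)
        if winner is None:
            break
        selected.append(pool.pop(winner))
    # 2. Remove individuals that are worse than all others in every component.
    while len(selected) + len(pool) > size:
        loser = find(pool, lambda c, o: dominates(o, c))
        if loser is None:
            break
        pool.pop(loser)
    # 3. Heuristic reduction: drop the worst in a randomly chosen component.
    while len(selected) + len(pool) > size:
        k = rng.randrange(len(pool[0]))
        pool.pop(max(range(len(pool)), key=lambda i: pool[i][k]))
    return selected + pool

candidates = [(1, 1), (2, 5), (5, 2), (9, 9), (3, 3)]
next_generation = select_next_generation(candidates, size=3, rng=random.Random(0))
```

Here (1, 1) is taken in stage 1, (9, 9) is removed in stage 2, and stage 3 discards either (2, 5) or (5, 2) depending on the randomly chosen component.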
20. Capability of Genetic Algorithm in General?
Tests on other proteins (LOCAL TWIST and rms
fitness) also gave close-to-native conformations
(less than 3.0 Å).
Conclusion: with an appropriate fitness
function, the genetic algorithm achieves the desired
results.
21. Test case: crambin, 46 a.a.
polar, , , , hydro, Crippen and
solvent
I. Fitness vector
, hydro, Crippen, solvent decreased with
rms
polar, , misled the algorithm to a
non-native conformation
-> rms 6.27
II. Fitness vector
Crippen, clash, hydro and scatter
constraints on the secondary structures
-> rms 4.36
trypsin inhibitor -> 6.65
22. Conclusions
- Genetic algorithms proved to be an efficient
search tool for 3-D representations of proteins.
For a 3-D protein model with a simple, additive
force field as fitness function and using a
rather small population, the genetic algorithm
produced several individuals (i.e. protein
conformations) of dissimilar topology but each
with highly optimized fitness values.
- Given an appropriate fitness function, the genetic
algorithm application described here finds the
desired solution within only small deviations.
- The major problem lies in the fitness function.
If there were one or a set of indicators that
return 1 for "the object is a native protein
conformation" and 0 for "the object is not a native
protein conformation", one could expect the genetic
algorithm approach to deliver reasonably accurate
ab initio predictions. However, neither
mathematical models nor empirical, semi-empirical or
statistical force fields are yet accurate enough
to reliably discriminate native from non-native
conformations without additional constraints.
Thus, the genetic algorithm produces
(sub-)optimal conformations in a different sense
than that of nativeness.
Notice that the same problem (the fitness or scoring
function) exists in the protein docking problem.
The correct transformation (within 3-5 Å) is found
in realistic time (in almost all cases). However,
assigning a high score to the native complex is a
problematic task. We don't yet know a proper
scoring function.
23. Side Chain Placement
rms 1.86