Gene Finding - PowerPoint PPT Presentation

Provided by: maxs9

Transcript and Presenter's Notes

Title: Gene Finding


1
Genetic Algorithms and Protein Folding
Based on a lecture by Dr. Steffen Schulze-Kremer
http://www.techfak.uni-bielefeld.de/bcd/Curric/ProtEn/proten.html
2
A genetic algorithm is a heuristic method that
operates on pieces of information the way nature
operates on genes in the course of evolution.
  • Individuals are represented by a linear string of
    letters of an alphabet (in nature nucleotides, in
    genetic algorithms bits).
  • Individuals are allowed to mutate, cross over and
    reproduce.
  • A fitness function evaluates the individuals.
  • Depending on the generation replacement mode, a
    subset of parents and offspring enters the next
    reproduction cycle.
  • After a number of iterations the population
    consists of individuals that are well adapted in
    terms of the fitness function.
  • It cannot be proven that the individuals of a
    final generation contain an optimal solution for
    the objective encoded in the fitness function.

3
  • Step I: initialise a population of individuals.
  • This can be done either randomly or with domain
    specific background knowledge to start the search
    with promising seed individuals. (Where available,
    the latter is always recommended.)
  • Individuals are represented as a string of bits.
  • A fitness function must be defined that takes as
    input an individual and returns a number (or a
    vector) that can be used as a measure for the
    quality (fitness) of that individual.
  • The application should be formulated in a way
    that the desired solution to the problem
    coincides with the most successful individual
    according to the fitness function.

4
II. Evaluate all individuals of the initial
population.
III. Generate new individuals. The reproduction
probability for an individual is proportional to
its relative fitness within the current generation.
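
Step III can be sketched as fitness-proportional ("roulette wheel") selection. This is only a minimal illustration: the function name `select_parents`, the toy population and the toy fitness (number of 1 bits, higher is better) are assumptions made for the example and do not come from the lecture.

```python
import random

def select_parents(population, fitnesses, k, rng=random):
    """Draw k parents with probability proportional to relative fitness."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("fitness-proportional selection needs a positive total fitness")
    # Relative fitness of each individual within the current generation.
    weights = [f / total for f in fitnesses]
    return rng.choices(population, weights=weights, k=k)

# Toy example (assumed data): fitness is the number of 1 bits.
rng = random.Random(0)
population = ["0000", "0011", "0111", "1111"]
fitnesses = [ind.count("1") for ind in population]
parents = select_parents(population, fitnesses, k=1000, rng=rng)
```

An individual with zero fitness is never drawn here, and fitter individuals are drawn more often. When lower values are better, as with an energy, the values would first have to be mapped to positive weights.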
5
Crossover
Two-point crossover:
0101001111000011010101011110111
1010101101011100101110001010101
Uniform crossover:
0101001111000011010101011110111
1010101101011100101110001010101
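
The two crossover variants can be sketched as follows, using the two 31-bit parent strings from the slide. The cut positions (8 and 20) and the 0.5 swap probability are assumptions for illustration; the slide does not state which segments were exchanged.

```python
import random

def two_point_crossover(a, b, i, j):
    """Swap the segment [i, j) between two equal-length bit strings."""
    assert len(a) == len(b) and 0 <= i <= j <= len(a)
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def uniform_crossover(a, b, rng=random):
    """Swap each position independently with probability 0.5."""
    assert len(a) == len(b)
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return "".join(child1), "".join(child2)

p1 = "0101001111000011010101011110111"
p2 = "1010101101011100101110001010101"
c1, c2 = two_point_crossover(p1, p2, 8, 20)   # assumed cut points
u1, u2 = uniform_crossover(p1, p2, random.Random(1))
```

In both variants every position of the two children together carries exactly the bits the two parents had at that position; only their assignment to the children changes.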
6
Genetic Operators
  • Mutation. Substitute one or more bits of an
    individual randomly by a new value (0 or 1).
  • Variation. Change the bits in such a way that the
    number encoded by them is slightly incremented or
    decremented.
  • Crossover. Exchange parts (single bits or strings
    of bits) of one individual with the corresponding
    parts of another individual. Originally, only
    one-point crossover was performed, but in theory
    one can use up to L - 1 different crossover sites
    (with L the length of the individual).
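
A minimal sketch of the mutation and variation operators on bit strings. The per-bit mutation rate, the step size of one and the clamping at the ends of the range are assumptions made for this example.

```python
import random

def mutate(bits, rate, rng=random):
    """Replace each bit, with probability `rate`, by a freshly drawn 0 or 1."""
    return "".join(rng.choice("01") if rng.random() < rate else b for b in bits)

def variate(bits, rng=random):
    """Increment or decrement the integer encoded by `bits` by one (clamped)."""
    value = int(bits, 2) + rng.choice((-1, 1))
    value = max(0, min(2 ** len(bits) - 1, value))
    return format(value, f"0{len(bits)}b")

mutated = mutate("0101001111000011", rate=0.1, rng=random.Random(0))
varied = variate("1000", rng=random.Random(0))
```

Note that a freshly drawn bit may equal the old one, so a "mutated" position does not always change. Variation can flip several bits at once (for example 0111 to 1000) while changing the encoded number by only one.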
7
IV. Select individuals for the new parent
generation. Schemes:
  1) The complete offspring is selected while all
     parents are discarded (original genetic
     algorithm). This is motivated by the biological
     model and is called total generation replacement.
  2) The n best individuals (from the old and the new
     generation) are selected. This method is called
     elitist generation replacement.
V. Go back to step II until either a desired
fitness value has been reached or a predefined
number of iterations has been performed.
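
Steps I to V can be put together in a short loop. This is a sketch under stated assumptions, not the implementation used in the lecture: the function `run_ga`, its parameters, the one-point crossover, the mutation rate of 1/length and the toy fitness (number of 1 bits, higher is better, non-negative) are all chosen for illustration.

```python
import random

def run_ga(fitness, length, pop_size=20, generations=50,
           elitist=True, target=None, seed=0):
    rng = random.Random(seed)
    # Step I: random initial population of bit strings.
    pop = ["".join(rng.choice("01") for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(generations):
        # Step II: evaluate all individuals.
        scores = [fitness(ind) for ind in pop]
        # Step V (stop test): desired fitness value reached.
        if target is not None and max(scores) >= target:
            break
        # Step III: fitness-proportional reproduction with
        # one-point crossover and bitwise mutation.
        weights = [s + 1e-9 for s in scores]  # offset avoids all-zero weights
        offspring = []
        while len(offspring) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            child = "".join(rng.choice("01") if rng.random() < 1.0 / length else c
                            for c in child)
            offspring.append(child)
        # Step IV: generation replacement.
        if elitist:
            # n best individuals from old and new generation.
            pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
        else:
            # Total generation replacement: parents are discarded.
            pop = offspring
    return max(pop, key=fitness)

best = run_ga(lambda ind: ind.count("1"), length=16, target=16)
other = run_ga(lambda ind: ind.count("1"), length=16, elitist=False)
```

With elitist replacement the best fitness in the population can never decrease from one generation to the next; with total replacement it can, because good parents may be lost.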
8
Init the first generation
9
Representation Formalism
  • Hybrid approach: the genetic algorithm is configured
    to operate on numbers, not bit strings as in the
    original genetic algorithm.
  • Disadvantages:
  • the mathematical foundation of genetic algorithms
    holds only for binary representations, although
    some of the mathematical properties are also
    valid for a floating point representation.
  • Binary representations run faster in many
    applications.
  • An additional encoding/decoding process may be
    required to map numbers onto bit strings.
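
The encoding/decoding step mentioned in the last bullet could look like the following fixed-point mapping. The function names, the 10-bit width and the angle range of -180 to 180 degrees are assumptions for illustration.

```python
def encode(value, lo, hi, n_bits):
    """Map a number in [lo, hi] onto a bit string of length n_bits."""
    levels = 2 ** n_bits - 1
    step = round((value - lo) / (hi - lo) * levels)
    return format(step, f"0{n_bits}b")

def decode(bits, lo, hi):
    """Map a bit string back onto a number in [lo, hi]."""
    levels = 2 ** len(bits) - 1
    return lo + int(bits, 2) / levels * (hi - lo)

bits = encode(60.0, -180.0, 180.0, 10)
angle = decode(bits, -180.0, 180.0)
```

The round trip is exact only up to the resolution of the grid, here 360/1023 degrees, which is one reason a representation that works on numbers directly avoids this extra step.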

10
Protein Structure Prediction
Individuals: protein conformations
Fitness function: force field
11
Representation
Cartesian 3D coordinates are not a good choice
12
  • The frequency of each torsion angle in intervals
    of 10° was determined, and the ten most frequently
    occurring intervals are made available for
    substitution of individual torsion angles by the
    MUTATE operator.
  • At the beginning of the run, individuals were
    initialized with either a completely extended
    conformation, where all torsion angles are 180°, or
    by a random selection from the ten most
    frequently occurring intervals of each torsion
    angle.
  • For the ω torsion angle the constant value of
    180° was used because of the rigidity of the
    peptide bond between the atoms C_i and N_{i+1}.
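
The interval statistics and the two initialisation modes can be sketched as below. The observed angles are hypothetical, only phi, psi and omega are modelled (side-chain angles are left out), and picking the midpoint of a chosen interval is an assumption, since the slide does not say which value inside an interval is used.

```python
import random
from collections import Counter

def top_intervals(angles_deg, width=10, keep=10):
    """Return the `keep` most frequent `width`-degree intervals as (start, end)."""
    bins = Counter(int(((a + 180.0) % 360.0) // width) for a in angles_deg)
    return [(-180 + k * width, -180 + (k + 1) * width)
            for k, _ in bins.most_common(keep)]

def midpoint(interval):
    return (interval[0] + interval[1]) / 2.0

def mutate_angle(intervals, rng=random):
    """MUTATE-style substitution: draw a new angle from the frequent intervals."""
    return midpoint(rng.choice(intervals))

def init_individual(n_residues, phi_intervals, psi_intervals, extended, rng=random):
    """Return one conformation as a list of (phi, psi, omega) tuples in degrees."""
    conformation = []
    for _ in range(n_residues):
        if extended:
            phi, psi = 180.0, 180.0
        else:
            phi = mutate_angle(phi_intervals, rng)
            psi = mutate_angle(psi_intervals, rng)
        conformation.append((phi, psi, 180.0))  # omega is held at 180 degrees
    return conformation

# Hypothetical observed phi angles, for illustration only.
observed_phi = [-63.0, -61.5, -65.2, -119.0, -121.3, -58.9, 57.0, -64.1]
phi_top = top_intervals(observed_phi)
extended = init_individual(3, phi_top, phi_top, extended=True)
randomised = init_individual(3, phi_top, phi_top, extended=False,
                             rng=random.Random(0))
```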

13
Search Space
  • Generally, molecules with n atoms have 3n - 6
    degrees of freedom ->
  • 100 residues × approximately 20 atoms per residue
    = 2000 atoms, i.e. 3 × 2000 - 6 = 5994 degrees of
    freedom
  • Systems of equations with this number of
    variables are analytically intractable today.
  • Discrete approximation:
  • (5 torsion angles per residue ×
    5 likely values per torsion angle)^100 = 25^100
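
The arithmetic on this slide can be checked in a few lines. The variable names are chosen for the example. The last two lines are an added comparison, not a figure from the slide: if the five angles of a residue are varied independently, each residue has 5^5 combinations rather than 5 × 5, so the slide's count is a lower figure than a full enumeration would give.

```python
n_residues = 100
atoms_per_residue = 20            # approximate value from the slide
n_atoms = n_residues * atoms_per_residue

# Degrees of freedom of a molecule with n atoms: 3n - 6.
degrees_of_freedom = 3 * n_atoms - 6

# Discrete approximation as stated on the slide: (5 * 5)^100.
slide_estimate = (5 * 5) ** n_residues
slide_digits = len(str(slide_estimate))

# Comparison (not from the slide): 5 angles with 5 values each, per residue.
independent_estimate = (5 ** 5) ** n_residues
independent_digits = len(str(independent_estimate))
```

Either way the number of conformations is far too large to enumerate, which is the point the slide makes in favour of a heuristic search.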

14
Fitness Function - Potential Energy
CHARMM energy function terms:
  • bond length potential (set to const)
  • bond angle potential (set to const)
  • torsion angle potential
  • improper torsion angle potential (set to const)
  • van der Waals pair interactions
  • electrostatic potential
  • hydrogen bonds (set to const)
  • interaction with the solvent (set to const -> in
    vacuum)
Simplified to the terms not set to const: torsion
angle potential, van der Waals pair interactions
and electrostatic potential
(since there are no interactions with the
solvent, there is not enough force to drive the
protein to a compact folded state)
15
Simplified Energy Function
Empirical relation between the number of residues
and the diameter
Pseudo-entropic term
16
First test protein: crambin, 46 a.a.
Table 3. Steric Energies in the Last Generation
Table 2. R.m.s. Deviations to Native Crambin
The genetic algorithm favoured individuals with
the lowest total energy, which in this case was
most easily achieved by optimising the
electrostatic contributions.
Simple summation of different components has the
disadvantage that components with larger numbers
dominate the fitness function, whether or not
they are important or of any significance at all
for a particular conformation.
In other words -> a bad fitness function
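
The dominance effect of simple summation can be seen with two made-up conformations. The numbers below are invented purely for illustration (arbitrary units) and are not taken from the tables of the lecture; the component names follow the simplified energy function.

```python
def total_energy(components):
    """Scalar fitness: plain sum of all energy components (lower is better)."""
    return sum(components.values())

# Made-up energy components, for illustration only.
conf_a = {"torsion": 5.0, "van_der_waals": -10.0, "electrostatic": -400.0}
conf_b = {"torsion": 1.0, "van_der_waals": -60.0, "electrostatic": -300.0}

# The summed fitness prefers conf_a ...
preferred = min((conf_a, conf_b), key=total_energy)

# ... although conf_b is better in two of the three components.
terms_where_b_is_better = [k for k in conf_a if conf_b[k] < conf_a[k]]
```

Because the electrostatic term has the largest magnitude, it alone decides the ranking, regardless of how the other terms compare.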
17
Improvements
  • Instead of using separate phi and psi value
    distributions, apply a phi-psi (2D) clustering
    procedure.
  • Use a secondary structure prediction algorithm
    (70% accuracy).
  • Specialised genetic operators:
  • LOCAL TWIST (local conformation changes by
    performing the ring closure algorithm for
    polymers)

The LOCAL TWIST operator led to significant
improvements in prediction accuracy and also to a
substantial decrease in overall computation time.
18
Improvements (2)
Fitness function -> vector
r.m.s. only for verification
19
Vector Fitness Function
  • Candidate selection for the next generation
  • If there is an individual that has better (i.e.
    lower) values in each fitness component, then we
    take it. Continue until no unambiguously better
    individuals are found.
  • Then remove the worst individuals, i.e. those
    with higher values in each fitness component than
    any other individual.
  • The remaining set of individuals is heuristically
    reduced until the exact number of individuals for
    the next generation is reached. This is done by
    iteratively removing an individual with the worst
    fitness value in a randomly selected fitness
    component.
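
The three selection stages described above can be sketched as follows. This is one possible reading of the bullets, not the lecture's code: individuals are represented only by their fitness vectors (tuples, lower is better), and the names `better_in_all` and `select_next_generation` are made up for the example.

```python
import random

def better_in_all(a, b):
    """True if vector a is lower than b in every fitness component."""
    return all(x < y for x, y in zip(a, b))

def select_next_generation(candidates, n, rng=random):
    pool = list(candidates)
    chosen = []
    # 1) Take individuals that are better than all others in every component.
    while pool and len(chosen) < n:
        best = next((c for c in pool
                     if all(better_in_all(c, o) for o in pool if o is not c)), None)
        if best is None:
            break
        chosen.append(best)
        pool.remove(best)
    # 2) Remove individuals that are worse than all others in every component.
    while len(chosen) + len(pool) > n:
        worst = next((c for c in pool
                      if all(better_in_all(o, c) for o in pool if o is not c)), None)
        if worst is None:
            break
        pool.remove(worst)
    # 3) Heuristic reduction: drop the worst individual in a random component.
    while len(chosen) + len(pool) > n:
        k = rng.randrange(len(pool[0]))
        pool.remove(max(pool, key=lambda c: c[k]))
    return chosen + pool

candidates = [(1, 1), (2, 5), (5, 2), (3, 3), (9, 9)]
survivors = select_next_generation(candidates, n=3, rng=random.Random(0))
```

In this example (1, 1) is taken in the first stage, (9, 9) is removed in the second, and the third stage removes either (2, 5) or (5, 2) depending on the randomly chosen component, so (3, 3) always survives.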

20
Capability of Genetic Algorithm in General?
Tests on other proteins (LOCAL TWIST and rms
fitness) also gave close to native conformations
(less than 3.0 Å)
Conclusion: with an appropriate fitness function,
the genetic algorithm achieves the desired
results.
21
Test case: crambin, 46 a.a.
I. Fitness vector:
polar, …, …, …, hydro, Crippen and solvent
  • …, hydro, Crippen and solvent decreased with rms
  • polar, …, … misled the algorithm to a non-native
    conformation
  • -> rms 6.27
II. Fitness vector:
Crippen, clash, hydro and scatter
constraints on the secondary structures
  • -> rms 4.36
  • trypsin inhibitor -> 6.65
22
Conclusions
  • Genetic algorithms proved to be an efficient
    search tool for 3-D representations of proteins.
    For a 3-D protein model with a simple, additive
    force field as fitness function and using a
    rather small population the genetic algorithm
    produced several individuals (i.e. protein
    conformations) of dissimilar topology but each
    with highly optimized fitness values.
  • Given an appropriate fitness function the genetic
    algorithm application described here finds the
    desired solution within only small deviations.
  • The major problem lies in the fitness function.
    If there were one indicator, or a set of
    indicators, that returned 1 for "the object is a
    native protein conformation" and 0 for "the object
    is not a native protein conformation", one could
    expect the genetic algorithm approach to deliver
    reasonably accurate ab initio predictions. However,
    neither mathematical models nor empirical,
    semi-empirical or statistical force fields are yet
    accurate enough to reliably discriminate native
    from non-native conformations without additional
    constraints. Thus, the genetic algorithm produces
    (sub-)optimal conformations in a different sense
    than that of nativeness.

Notice that the same problem (the fitness or
scoring function) exists in the protein docking
problem. The correct transformation (within 3-5 Å)
is found in realistic time (in almost all cases).
However, assigning a high score to the native
complex is a problematic task. We do not yet know
a proper scoring function.
23
Side Chain Placement
rms 1.86