Title: Discrete and Genetic Algorithms in Bioinformatics
1 Discrete and Genetic Algorithms in Bioinformatics
2 Discrete Algorithms
- Discrete mathematics lies at the foundation of modern computer science.
- Most algorithms taught in computer science are discrete.
- Discrete algorithms emphasize worst-case analysis.
- Many sequence manipulation algorithms in bioinformatics are discrete.
3 Natural Problems (1)
- Natural problems: problems arising from nature, which are guaranteed to have feasible solutions if the data is collected accurately.
- But because of noise in the sampled data, such solutions are hard to come by.
- To tackle these problems, one should focus on real data rather than worst-case analysis.
4 Natural Problems (2)
- Techniques that take advantage of the natural constraints of these problems do not necessarily work for general data (especially the worst case), but can perform very well on well-structured problems.
- Examples: many computational problems arising from biology, speech recognition, and image processing.
5 Constraints with Errors
- In ordinary constrained optimization problems, one naturally assumes that the constraints are correct.
- What if these constraints are inconsistent?
  - Then no feasible solution satisfies them.
- What if every constraint is only partially correct?
6 Explicit Solution Candidates
- In ordinary optimization problems, most algorithms do not generate plausible solutions in the interim.
- However, there are advantages to having some solution candidates when there are errors in the constraints.
7 Plausible Solution Candidates
- For some optimization problems, machine learning approaches generate plausible solutions in the interim.
- Solutions get better as the machine learning approach iteratively refines solution patterns.
- A better solution emerges from the cooperation of plausible solution candidates.
8 Fitness Landscape
- Each solution candidate has a fitness score for the optimization problem.
- A fitness landscape shows the distribution of fitness over the whole search space.
- Solution candidates are ranked by their fitness.
9 Genetic Algorithm
- A search technique for finding exact or approximate solutions to optimization problems.
- It is based on the principle of evolution.
  - Survival of the fittest in natural selection
- Two basic processes from evolution:
  - Inheritance (passing of features from one generation to the next)
  - Competition (survival of the fittest)
10 Basic description of GA
- The algorithm starts with a set of solutions (represented by chromosomes) called the population.
- Solutions from one population are taken and used to form a new population.
- The hope is that the new population (offspring) will be better than the old one (parents).
- Solutions are selected to form new solutions according to their fitness: the more suitable they are, the more chances they have to reproduce.
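The selection rule above, where fitter solutions get more chances to reproduce, can be sketched as fitness-proportionate (roulette-wheel) selection. This is only one common way to realize the idea and is not necessarily the scheme used in the work presented here; the name `select_parents` and the toy fitness values are assumptions made for illustration.

```python
import random

def select_parents(population, fitness, k, rng=random):
    """Draw k parents with probability proportional to fitness.

    Assumes every fitness value is non-negative and at least one is positive.
    """
    weights = [fitness(ind) for ind in population]
    # Sampling is with replacement, so a fit individual may be picked several times.
    return rng.choices(population, weights=weights, k=k)

# Hypothetical population of three candidates with made-up scores.
pop = ["a", "b", "c"]
score = {"a": 1.0, "b": 3.0, "c": 6.0}
print(select_parents(pop, score.get, k=5, rng=random.Random(0)))
```

Candidate "c" holds 60% of the total fitness here, so on average it fills about 60% of the parent slots.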
11 GA in Pseudo-code
- Choose initial population
- Evaluate the fitness of each individual in the population
- Repeat
  - Select best-ranking individuals to reproduce
  - Breed new generation through crossover and mutation (genetic operations) and give birth to offspring
  - Evaluate the individual fitness of the offspring
  - Replace worst-ranked part of population with offspring
- Until termination
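The pseudo-code above can be sketched as a small generic loop. The version below is a minimal illustration under stated assumptions: the function `evolve`, its parameter names and defaults, the fixed generation budget used as the termination rule, and the toy bit-string problem (maximize the number of 1s) are all hypothetical and not taken from the slides.

```python
import random

def evolve(init, fitness, crossover, mutate, pop_size=30, n_parents=10,
           n_offspring=10, generations=50, rng=None):
    """Generic GA loop; requires n_parents >= 2 and n_offspring < pop_size."""
    if rng is None:
        rng = random.Random()
    # Choose the initial population and rank it by fitness, best first.
    population = sorted((init(rng) for _ in range(pop_size)),
                        key=fitness, reverse=True)
    for _ in range(generations):  # termination: fixed budget (an assumption)
        parents = population[:n_parents]  # select best-ranking individuals
        offspring = []
        for _ in range(n_offspring):
            p1, p2 = rng.sample(parents, 2)
            offspring.append(mutate(crossover(p1, p2, rng), rng))
        # Replace the worst-ranked part of the population with the offspring.
        population = population[:pop_size - n_offspring] + offspring
        population.sort(key=fitness, reverse=True)
    return population[0]

# Toy problem: bit strings of length 20, fitness = number of 1s.
def init_bits(rng, n=20):
    return [rng.randint(0, 1) for _ in range(n)]

def one_point(p1, p2, rng):
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def flip_one(ind, rng):
    child = list(ind)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

best = evolve(init_bits, sum, one_point, flip_one, rng=random.Random(0))
print(sum(best), "ones out of", len(best))
```

Because the best-ranked individuals are always carried over, the best fitness in the population never decreases from one generation to the next in this sketch.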
12 Building Block Hypothesis
- Building block: a short, highly fit schema that benefits the solution.
- The global optimal solution is made up of building blocks.
- Identify, recombine, and resample small building blocks to form a new solution with potentially higher fitness.
- By working with these particular building blocks, we reduce the complexity of the problem.
13 The Fitness Function
- Plays the role of a judge.
- Gives a higher score if the individual has more building blocks.
- Refine the fitness function based on the evolution results.
14 Physical Mapping
15 Cutting and reassembling a DNA sequence
- Cut a DNA sequence into small pieces in different ways and reassemble them.
  - The small pieces (called clones) are still too large to sequence completely.
  - Biologically, probes are used to mark the clones.
  - Each probe can mark several clones; each clone can contain several probes.
16 The Physical Mapping Problem with Noisy Genomic Data
Journal of Computational Biology 10(5), 709-735, 2003
- Each row represents a clone; each column represents a probe.
- Diagram on the left: the input clone-probe matrix.
- Diagram on the right: after the probes are rearranged, the clones are placed in their correct positions.
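In the error-free case, a correct probe arrangement makes the 1s in every clone row contiguous (the consecutive ones property). The check below is a minimal sketch of that condition for one given column order; the function name and the small 3-clone by 4-probe matrix are hypothetical and not data from the cited paper.

```python
def has_consecutive_ones(matrix, order):
    """Return True if, with columns arranged as `order`, the 1s in every
    row of the clone-probe matrix form a single contiguous block."""
    for row in matrix:
        arranged = [row[j] for j in order]
        ones = [i for i, v in enumerate(arranged) if v == 1]
        # A contiguous block of k ones spans exactly k positions.
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

# Hypothetical matrix: rows are clones, columns are probes.
clones = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
print(has_consecutive_ones(clones, [0, 1, 2, 3]))  # False: input column order
print(has_consecutive_ones(clones, [0, 2, 1, 3]))  # True: probes rearranged
```

This only tests a single ordering; finding an ordering that passes, especially with noisy data, is the search problem the rest of the deck addresses.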
17 Consecutive Ones with Errors
18 False Positives and False Negatives
19 A genetic algorithm for physical mapping
- A two-stage genetic algorithm
  - First stage: generate the neighborhood information among probes
  - Second stage: generate the maximum-length sequence of connecting probes
20 The first stage of GA (GA1)
- Purpose: find a probe ordering with the highest fitness score for each clone.
- Pseudo-code
  - Randomly generate a population of probe permutations
  - Evaluate the fitness of each individual in the population
  - Repeat
    - Select best-ranking individuals to reproduce
    - Breed new generation through crossover and mutation (genetic operations) and give birth to offspring
    - Evaluate the individual fitness of the offspring
    - Replace worst-ranked part of population with offspring
  - Until termination
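The slides do not spell out the GA1 fitness function, so the following is only an assumed stand-in that rewards partial consecutive ones: for each clone, take the longest block of consecutive 1s under the candidate probe ordering, and sum these lengths over all clones. The function names and the toy matrix are hypothetical.

```python
def longest_run_of_ones(values):
    """Length of the longest block of consecutive 1s in a 0/1 sequence."""
    best = run = 0
    for v in values:
        run = run + 1 if v == 1 else 0
        best = max(best, run)
    return best

def ordering_fitness(matrix, order):
    """Assumed fitness: sum over clones of the longest block of 1s
    when the probe columns are arranged as `order`."""
    return sum(longest_run_of_ones([row[j] for j in order]) for row in matrix)

# Hypothetical clone-probe matrix (rows are clones, columns are probes).
toy = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
print(ordering_fitness(toy, [0, 1, 2, 3]))  # 4
print(ordering_fitness(toy, [0, 2, 1, 3]))  # 6
```

Unlike a strict yes/no test, a score of this kind still separates better orderings from worse ones when false positives or false negatives break some rows.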
21 The first stage of GA (GA1)
[Figure: clone-probe matrix under the probe ordering 4 1 2 3 5 8 6 9 11 12 13 14 15 17 18; the matrix entries were lost in extraction. Caption: Two building blocks that make partial consecutive ones.]
22 Crossover Operation
P1:    2 3 6 8 1 9 10 12 13 5 11 14 15 17 18
P2:    9 10 11 12 13 14 8 18 17 6 5 3 2 1 15
Child: 2 3 6 8 1 9 10 11 12 13 14 18 17 5 15
The child is built up one probe at a time:
2 3 6 8 1 9
2 3 6 8 1 9 10
2 3 6 8 1 9 10 11
2 3 6 8 1 9 10 11 12
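One rule that reproduces the child shown on this slide is: copy a prefix of P1, then append the probes not yet used in the order they appear in P2. The sketch below implements that rule; it is inferred from this single example, so the function name, the `cut` parameter, and the rule itself are assumptions rather than the authors' confirmed operator.

```python
def prefix_fill_crossover(p1, p2, cut):
    """Copy p1[:cut], then append the remaining probes in the order
    in which they occur in p2. Both parents must be permutations of
    the same probe set."""
    child = list(p1[:cut])
    used = set(child)
    child.extend(g for g in p2 if g not in used)
    return child

p1 = [2, 3, 6, 8, 1, 9, 10, 12, 13, 5, 11, 14, 15, 17, 18]
p2 = [9, 10, 11, 12, 13, 14, 8, 18, 17, 6, 5, 3, 2, 1, 15]
print(prefix_fill_crossover(p1, p2, cut=6))
# [2, 3, 6, 8, 1, 9, 10, 11, 12, 13, 14, 18, 17, 5, 15]
```

The child is always a valid permutation, and it inherits one contiguous block from P1 plus the relative order of the rest from P2, which is how building blocks from both parents can survive.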
23 Mutations
Before: 2 3 6 8 1 9 10 12 13 5 11 14 15 17 18
After:  2 3 6 8 5 9 10 12 13 1 11 14 15 17 18
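The example on this slide exchanges two probes (1 and 5) and leaves the rest of the ordering unchanged. A swap mutation of that kind can be sketched as follows; the function name and the random choice of positions are assumptions, and the ordering used is parent P1 from the crossover slide.

```python
import random

def swap_mutation(order, i=None, j=None, rng=random):
    """Return a copy of `order` with positions i and j exchanged.
    If i or j is omitted, two distinct positions are drawn at random."""
    if i is None or j is None:
        i, j = rng.sample(range(len(order)), 2)
    mutated = list(order)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

before = [2, 3, 6, 8, 1, 9, 10, 12, 13, 5, 11, 14, 15, 17, 18]
print(swap_mutation(before, 4, 9))  # exchanges probes 1 and 5
# [2, 3, 6, 8, 5, 9, 10, 12, 13, 1, 11, 14, 15, 17, 18]
```

A swap keeps the individual a valid permutation, which a naive "replace one value" mutation would not.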
24 Detection of False Negatives
[Figure: clone-probe matrix with probe columns 1 to 15; the matrix entries were lost in extraction.]
25 The first stage of GA (GA1)
- Construct the probe neighboring information according to the GA1 results.
Probe ordering result for probe segment 1: 1 2 3 5 6 8 9 10 11 12 13 14 15 17 18
Probe ordering result for probe segment 2: 5 6 7 8 9 10 11 13 14 15 16 17 18 19 20
...
Probe ordering result for probe segment 20: 83 85 86 87 88 89 90 91 92 93 95 96 97 98 99
Neighboring probe lists (probe: neighbors):
5: 3, 6 | 6: 5, 8 | 8: 6, 9 | ... | 18: 17
5: 6 | 6: 5, 7 | 7: 6, 8 | ... | 20: 19
Probe neighboring information:
5: 3, 6 | 6: 5, 7, 8 | 7: 6, 8, 9 | ... | 20: 19
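Building the probe neighboring information from the per-segment orderings can be sketched as collecting, for every probe, the probes that sit next to it in any ordering. The function name `neighbor_info` is an assumption; the two orderings are the segment 1 and segment 2 results listed on this slide.

```python
from collections import defaultdict

def neighbor_info(orderings):
    """Map each probe to the set of probes adjacent to it in any ordering."""
    neighbors = defaultdict(set)
    for order in orderings:
        for a, b in zip(order, order[1:]):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors

segment_1 = [1, 2, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18]
segment_2 = [5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20]
info = neighbor_info([segment_1, segment_2])
print(sorted(info[5]), sorted(info[6]), sorted(info[20]))
# [3, 6] [5, 7, 8] [19]
```

With only these two segments the entries for probes 5, 6, and 20 agree with the slide; entries such as probe 7 gain further neighbors once the remaining segments are merged in.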
26 The second stage of GA (GA2)
- Purpose: find the longest connecting probe sequence according to the probe neighboring information.
- Pseudo-code
  - Randomly generate a population of probe permutations
  - Evaluate the fitness of each individual in the population
  - Repeat
    - Select best-ranking individuals to reproduce
    - Breed new generation through crossover and mutation (genetic operations) and give birth to offspring
    - Evaluate the individual fitness of the offspring
    - Replace worst-ranked part of population with offspring
  - Until termination
27 The second stage of GA (GA2)
- Generate a probe ordering according to the probe neighboring information.
Probe neighboring information (probe: neighbors):
1: 2 | 2: 1, 3 | 3: 2, 4, 5 | 4: 3, 5 | 5: 3, 4, 6 | 6: 5, 7, 8 | 7: 6, 8, 9 | ... | 99: 97, 98
2 3 5 4 71 72 73 55 56 57 99 98 97 96
1 2 3 4 5 6 7 93 94 95 96 97 98 99
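One way to score a GA2 individual against the goal of a longest connecting probe sequence is to measure the longest stretch of the ordering in which every adjacent pair of probes is listed as neighbors. The slides do not give the actual GA2 fitness, so this measure and the name `longest_connected_run` are assumptions; the neighbor table contains only the entries visible on this slide.

```python
def longest_connected_run(order, neighbors):
    """Length of the longest stretch of `order` in which every adjacent
    pair of probes is listed as neighbors."""
    if not order:
        return 0
    best = run = 1
    for a, b in zip(order, order[1:]):
        run = run + 1 if b in neighbors.get(a, ()) else 1
        best = max(best, run)
    return best

# Entries visible on the slide; the elided entries are left out.
neighbors = {1: {2}, 2: {1, 3}, 3: {2, 4, 5}, 4: {3, 5},
             5: {3, 4, 6}, 6: {5, 7, 8}, 7: {6, 8, 9}}
print(longest_connected_run([2, 3, 5, 4, 71, 72, 73], neighbors))  # 4
print(longest_connected_run([1, 2, 3, 4, 5, 6, 7], neighbors))     # 7
```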