Title: Genetic Algorithms and Cladistics
Clare Bates Congdon, Department of Computer Science, Colby College
Emily F. Greenfest, Committee on Evolutionary Biology, University of Chicago
Josh Ladieu '02, Department of Computer Science, Colby College
- GA Methodology (and hard problems)
  - The GA tends to converge after many generations. That is, the solutions in the population start to look similar.
  - If the population has converged, the GA may be stuck on a local optimum.
  - For hard problems:
    - We want to discourage premature convergence.
    - We have to do multiple independent runs, which will often land on different local optima.
An overview of Genetic Algorithms
A simple example: evolve a string that contains all 1s.
Population: bit strings (1s and 0s)
Fitness function: count of the number of 1s
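The all-1s example can be sketched in a few lines of Python. The function names, the population size, and the string length below are illustrative assumptions, not values from the poster.

```python
import random

def fitness(bits):
    """Fitness of a bit string: the number of 1s it contains."""
    return bits.count("1")

def random_population(size, length, rng):
    """Generate `size` random bit strings of the given length (hypothetical helper)."""
    return ["".join(rng.choice("01") for _ in range(length)) for _ in range(size)]

rng = random.Random(0)  # seeded only so that runs are repeatable
population = random_population(size=6, length=16, rng=rng)
scores = [fitness(s) for s in population]
```

A string of sixteen 1s would have the maximum fitness of 16 under this measure.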
- Replacement: how strings die off
  - In general, the reproduction process produces a new generation of solutions and the offspring generation entirely replaces the parent generation.
  - However, this could mean that you lose your best solution.
  - In practice, it is often better to save the best individuals into the next generation.
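Saving the best individuals is often called elitism. The sketch below shows one possible way to do it; the function name, the `n_elite` parameter, and the choice to drop the weakest offspring to make room are assumptions for illustration.

```python
def next_generation(parents, offspring, fitness, n_elite=1):
    """Carry the n_elite best parents forward and fill the rest with offspring.

    Assumption: the population size stays fixed, so the weakest offspring
    are dropped to make room for the saved parents.
    """
    elite = sorted(parents, key=fitness, reverse=True)[:n_elite]
    survivors = sorted(offspring, key=fitness, reverse=True)[:len(parents) - n_elite]
    return elite + survivors

count_ones = lambda bits: bits.count("1")
new_pop = next_generation(["1111", "0000"], ["0011", "0001"], count_ones)
```

With `n_elite=0` this reduces to plain generational replacement, assuming the offspring list is at least as long as the parent list.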
- Genetic Algorithms (GAs) are an approach to problem solving that is inspired by the ideas of natural selection and evolution. The approach has demonstrated some success on complex problems (and also in working with nonlinearities).
- In GAs:
  - There is a population of possible solutions to a problem.
  - Solutions combine with other solutions to form new solutions.
  - In the act of combining, mutations may occur.
  - Selection favors better solutions to the problem.
  - Over a series of generations, the population includes better and better solutions to the problem.
- Some common features of the GA approach:
  - A population of strings -- represents potential solutions to a problem (e.g., 1010100111110111).
  - An evaluation measure -- a fitness function that assigns a value to each solution (e.g., 11 1s in the string).
  - A method for parent selection -- a stochastic means of identifying good candidates for reproduction (e.g., roulette).
  - Reproduction operators -- mechanisms for creating new solutions based on parents (e.g., crossover and mutation).
  - A replacement method -- a means of removing lower-fitness solutions from the population (e.g., discrete generations).
Continue: the reproduction cycle (select parents, reproduce, replace) continues for a preset number of generations or until the population has converged into similar solutions to the problem.
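The stopping rule can be sketched as a small driver loop. The poster does not say how convergence is measured; the 90% "identical individuals" threshold, and the names `has_converged`, `run_ga`, and `step`, are assumptions made for this sketch. `step` stands for one full select/reproduce/replace cycle.

```python
def has_converged(population, threshold=0.9):
    """Assumed test: the most common individual makes up >= threshold of the population."""
    most_common = max(set(population), key=population.count)
    return population.count(most_common) / len(population) >= threshold

def run_ga(population, step, max_generations):
    """Apply `step` until convergence or until the generation budget is used up.

    Returns the final population and the number of generations actually run.
    """
    for generation in range(max_generations):
        if has_converged(population):
            return population, generation
        population = step(population)
    return population, max_generations

# A do-nothing step, used only to show the two stopping conditions.
identity = lambda pop: list(pop)
_, gens_converged = run_ga(["11", "11", "11"], identity, max_generations=10)
_, gens_budget = run_ga(["10", "01", "11", "00"], identity, max_generations=5)
```

The first call stops immediately because the population is already uniform; the second runs out its budget of five generations.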
- Some GA References
  - L. Davis. Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York, NY, 1991.
  - D.E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA, 1989.
  - M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA, 1996.
- Some Cladistics References
  - Donoghue et al. Treebase: A database of phylogenetic knowledge. Online at http://phylogeny.harvard.edu/treebase
  - J. Felsenstein. Phylip source code and documentation. 1995. Online at http://evolution.genetics.washington.edu/phylip.html
  - P.L. Forey, C.J. Humphries, I.L. Kitching, R.W. Scotland, D.J. Siebert, and D.M. Williams. Cladistics: A Practical Course in Systematics. Number 10 in the Systematics Association. Clarendon Press, Oxford, 1993.
  - D.L. Swofford, G.J. Olsen, P.J. Waddell, and D.M. Hillis. Molecular Systematics, chapter Phylogenetic Inference, pages 407-514. Sinauer Associates, Inc., Sunderland, MA, 1996.
- Solutions are initially generated at random
- Selection: Roulette Wheel
  - Each solution has a wedge on the wheel.
  - Higher-fitness strings have a bigger wedge.
  - Repeatedly spin the wheel to select parents.
  - The process samples with replacement.
  - All strings have a chance to be a parent.
  - A string may be a parent more than once, or not at all.
- Reproduction: how new strings are produced from parents
  - Parents pass information on to offspring: higher-fitness strings are more likely to be parents and to pass on useful information.
  - Crossover: two parents swap bits with each other.
  - Mutation: random bits are flipped (from 0 to 1 or vice versa).
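A minimal sketch of roulette-wheel selection, crossover, and mutation on bit strings follows. The poster only says that parents "swap bits"; one-point crossover is assumed here as a common choice, and the function names and the mutation `rate` parameter are illustrative.

```python
import random

def spin_wheel(population, fitnesses, rng):
    """One spin: each individual owns a wedge proportional to its fitness.

    Assumes fitnesses are non-negative and not all zero.
    """
    pointer = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, f in zip(population, fitnesses):
        running += f
        if pointer < running:
            return individual
    return population[-1]  # guard against floating-point round-off

def select_parents(population, fitness, n, rng):
    """Spin n times; sampling is with replacement, so repeats are possible."""
    fitnesses = [fitness(s) for s in population]
    return [spin_wheel(population, fitnesses, rng) for _ in range(n)]

def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length parents at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def flip_bits(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(("1" if c == "0" else "0") if rng.random() < rate else c
                   for c in bits)

rng = random.Random(0)
count_ones = lambda s: s.count("1")
parents = select_parents(["1111", "0000"], count_ones, n=4, rng=rng)
child1, child2 = one_point_crossover("1111", "0000", rng)
```

Because "0000" has a zero-width wedge in this example, every spin lands on "1111".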
Cladistics (as told by a computer scientist)
Used by biologists to investigate the evolutionary relationships among species currently or formerly inhabiting the earth. Data sets contain several related species, with attribute-value information for relevant features.
Why heuristic search? The search space is large enough for most cladistics problems that an exhaustive search is impractical. (A heuristic search is a clever, but non-exhaustive, search.) For example, the number of possible unrooted trees grows explosively with the number of species.
- Variations and related approaches
  - Using Wagner parsimony, there are many different ways to represent the same tree -- a different species may be at the root as long as the relationships between species remain the same.
  - Other approaches assume, for example, that characters can be gained, but not lost, via evolution. The species at the root is significant in these approaches.
  - Molecular data (genetic: A, G, C, T) is increasingly used instead of phenotypes (physical traits); the metrics used to evaluate these trees are more complex.
Species are organized into trees or networks, called cladograms, which represent hypothesized evolutionary relationships between the species. In a cladogram, the species are the leaf nodes; the branches from interior nodes represent a divergence in one or more attribute-values. Cladists talk about "characters" and "character states" rather than attributes and values. Instead of "species", cladists would typically say "taxa".
Hydrostac 11100111011101100000011100010
Cammitric 11100111010100001100101100111
Mamiaceae 01100000010000001101000100111
Phrymacea 11100000010000010111000110011
Verbenace 01000000010000001101000100011
Hippurida 11100010011110010110101100011
Mendoncia 01101000010000000101000001011
Thunbergi 01101000010000000100000000011
Nemsoniac 11001000110001100000010100001
Acanthace 01101000110000000000000100011
Martyniac 11000000110001100000010100011
Trapemmac 11100000110010010100100010011
Pedamiace 11000000110000000000000100011 ...
Number of Species    Number of Possible Unrooted Trees
 3                   1
 4                   3
 5                   15
 6                   105
 7                   945
 8                   10,395
 9                   135,135
10                   2,027,025
11                   34,459,425
23                   1.3 x 10^25
49                   3.0 x 10^72
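The counts in this table match the standard formula for the number of unrooted binary trees on n labeled species, (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5). A short sketch (the function name is an illustrative choice):

```python
def unrooted_tree_count(n_species):
    """Number of distinct unrooted binary trees with n_species labeled leaves."""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the leading factor 1 is implicit).
    for k in range(3, 2 * n_species - 5 + 1, 2):
        count *= k
    return count

small_counts = [unrooted_tree_count(n) for n in range(3, 12)]
```

Python integers are arbitrary precision, so the same function also handles the 23- and 49-species cases exactly.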
Example data set with four species and three features:
A  0 0 0
B  1 0 0
C  1 1 0
D  1 1 1
This is a tree as drawn by a computer scientist. It grows down, from the root at the top to the leaves at the bottom. Cladists generally draw their trees growing up.
An example of a most parsimonious tree for this
data.
- The typical cladistics algorithm performs a heuristic search
  - The tree is built incrementally, adding one species at a time.
  - Searching determines where to add the next species.
  - And, with some algorithms, searching determines whether to transform the tree in specific ways, such as switching leaf nodes or moving subtrees (swapping or grafting).
Trees are evaluated with metrics such as (Wagner) parsimony: trees that require fewer evolutionary changes are considered more plausible. There are many alternative metrics, each of which embodies different assumptions about evolution and restrictions on the resulting cladograms. We chose Wagner parsimony in part as a convenient starting point: it is straightforward (fast) to calculate.
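To illustrate how a parsimony score can be counted, here is a minimal sketch using Fitch's bottom-up counting method, which for binary (0/1) characters gives the same count as Wagner parsimony. This is not Phylip's code; the nested-tuple tree representation and the function names are assumptions, and the data is the four-species example from this poster.

```python
def fitch(node, data, i):
    """Return (possible states, changes needed) for character i below `node`.

    A leaf is a species name (str); an interior node is a pair of subtrees.
    """
    if isinstance(node, str):
        return {data[node][i]}, 0
    (left_states, left_cost) = fitch(node[0], data, i)
    (right_states, right_cost) = fitch(node[1], data, i)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one evolutionary change is required at this node.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree, data):
    """Total number of changes over all characters (lower is better)."""
    n_chars = len(next(iter(data.values())))
    return sum(fitch(tree, data, i)[1] for i in range(n_chars))

data = {"A": "000", "B": "100", "C": "110", "D": "111"}
good_tree = ("A", ("B", ("C", "D")))
worse_tree = (("A", "C"), ("B", "D"))
good_score = parsimony_score(good_tree, data)    # 3 changes
worse_score = parsimony_score(worse_tree, data)  # 4 changes
```

The first tree needs one change per feature, which is the minimum possible here since every feature varies among the species.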
We are currently working with binary data, but that is not an assumption of cladistics. It's just a convenient starting point.
Crossover Operator
Results
Gaphyl = GA + Phylip
Challenge: to design a meaningful representation and GA reproduction operators for the cladistics domain.
Experiments have been run on two different datasets (both from TreeBase):
- Lamiiflorae: 23 species, 29 attributes
- Angiosperms: 49 species, 61 attributes
1. From parent1, choose a subtree at random; from parent2, choose the smallest subtree that includes all species from the first subtree.
2. Replace the subtree of parent1 with the subtree from parent2.
3. Identify duplicate species introduced by the new subtree and prune the offspring to remove the older branches that contain duplicates.
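The three steps above can be sketched on trees written as nested tuples. This is an illustration only, not Gaphyl's implementation: the representation (a leaf is a species name, an interior node is a tuple of subtrees), the helper names, and the `HOLE` placeholder are all assumptions. The sketch prunes the duplicates from the rest of parent1 before inserting the donor subtree, which gives the same offspring as replacing first and pruning afterwards.

```python
import random

def leaf_list(tree):
    """Species at the leaves, in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaf_list(child)]

def subtrees(tree):
    """Yield the tree itself and every subtree below it."""
    yield tree
    if not isinstance(tree, str):
        for child in tree:
            yield from subtrees(child)

def smallest_covering(tree, species):
    """Descend while a single child still contains all the given species."""
    node = tree
    while not isinstance(node, str):
        for child in node:
            if species <= set(leaf_list(child)):
                node = child
                break
        else:
            return node  # no single child covers the set
    return node

def replace(tree, target, new):
    """Copy of tree with the subtree equal to `target` swapped for `new`."""
    if tree == target:
        return new
    if isinstance(tree, str):
        return tree
    return tuple(replace(child, target, new) for child in tree)

def prune(tree, unwanted):
    """Remove unwanted leaves, collapsing interior nodes left with one child."""
    if isinstance(tree, str):
        return None if tree in unwanted else tree
    kept = [p for p in (prune(c, unwanted) for c in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

HOLE = "<hole>"  # hypothetical placeholder marking where the donor subtree goes

def crossover(parent1, parent2, rng):
    # Step 1: random proper subtree of parent1, smallest covering subtree of parent2.
    chosen = rng.choice([s for s in subtrees(parent1) if s != parent1])
    donor = smallest_covering(parent2, set(leaf_list(chosen)))
    # Step 3 (done early): species the donor brings that the chosen subtree lacked.
    duplicates = set(leaf_list(donor)) - set(leaf_list(chosen))
    remainder = prune(replace(parent1, chosen, HOLE), duplicates)
    # Step 2: put the donor subtree where the chosen subtree used to be.
    return replace(remainder, HOLE, donor)

p1 = ((("A", "B"), "C"), ("D", "E"))
p2 = (("A", "D"), (("B", "E"), "C"))
child = crossover(p1, p2, random.Random(1))
```

Whatever subtree is chosen, the offspring contains each species exactly once, provided both parents cover the same set of species.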
This research represents an exploration into the
utility of using the GA approach to search for
cladograms. The GA is often successful in
searching large and complex search spaces, and
provides an alternative to the usual heuristic
search methods used in cladistics. Phylip is a
commonly used cladistics package, with freely
available source code.
- Although the above explanation of the GA uses strings, some researchers use the GA approach with a tree representation.
  - For example, a standard crossover operator for a tree representation will swap subtrees in the two parents.
- However, the cladistics task provides an additional challenge:
  - Each species must be represented exactly once in the tree.
  - The offspring should inherit meaningful information from the parents (for the GA approach to work).
With 23 species, Gaphyl finds the same set of trees as Phylip in comparable run time (about 1.5 hours).
With 49 species, Gaphyl finds trees of the same best fitness, but finds more of them than Phylip does in comparable run time (about 12 hours):
- 222 distinct trees found by Gaphyl
- 40 distinct trees found by Phylip
This suggests that on larger problems, Gaphyl may be able to find trees that Phylip cannot.
In Gaphyl, we use the GA to do the search and
code from Phylip to evaluate the cladograms
- Recap on what we have so far
  - A large search space (of trees): the space is large enough that heuristic search is necessary.
  - A well-defined fitness function for the GA (Wagner parsimony, via Phylip code).
Mutation Operators (not hard for offspring to
resemble parent)
Observation of the cladistics task: in general, a good parent will have closely related species near each other in a subtree. We would like our GA operators to have a chance of passing this information on to offspring.
1. Swap two random leaf nodes (species).
2. Swap two random subtrees.
3. Prune a random species and add it back as an offspring of the root.
4. Pick a random subtree and put it into canonical form.
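Two of these operators (1 and 3) can be sketched on nested-tuple trees as follows. The representation and function names are assumptions for illustration, and "offspring of the root" is interpreted here as joining the pruned species and the rest of the tree under a new binary root.

```python
import random

def leaf_list(tree):
    """Species at the leaves; a leaf is a str, an interior node a tuple."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaf_list(child)]

def swap_leaves(tree, a, b):
    """Copy of tree with species a and b exchanged."""
    if isinstance(tree, str):
        return b if tree == a else a if tree == b else tree
    return tuple(swap_leaves(child, a, b) for child in tree)

def prune_leaf(tree, species):
    """Remove one species, collapsing the interior node it hung from."""
    if isinstance(tree, str):
        return None if tree == species else tree
    kept = [p for p in (prune_leaf(c, species) for c in tree) if p is not None]
    return kept[0] if len(kept) == 1 else tuple(kept)

def mutate_swap_leaves(tree, rng):
    """Operator 1: swap two random leaf nodes."""
    a, b = rng.sample(leaf_list(tree), 2)
    return swap_leaves(tree, a, b)

def mutate_regraft_at_root(tree, rng):
    """Operator 3: prune a random species and re-attach it at the root."""
    species = rng.choice(leaf_list(tree))
    return (prune_leaf(tree, species), species)

tree = (("A", "B"), ("C", "D"))
rng = random.Random(0)
swapped = mutate_swap_leaves(tree, rng)
regrafted = mutate_regraft_at_root(tree, rng)
```

Both operators keep every species in the tree exactly once, so the offspring stays close to its parent while remaining a valid cladogram.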
- Not explained in this poster
  - Canonical form, used to recognize duplicate trees with respect to Wagner parsimony
  - Two-step GA process (run and run again), used to combat premature convergence