Genetic Algorithms and Cladistics

Provided by: travisjame
1
Clare Bates Congdon, Department of Computer Science, Colby College
Emily F. Greenfest, Committee on Evolutionary Biology, University of Chicago
Josh Ladieu '02, Department of Computer Science, Colby College
GA Methodology (and hard problems)
  • The GA tends to converge after many generations.
    That is, the solutions in the population start to
    look similar.
  • If the population has converged, the GA may be
    stuck on a local optimum.
  • For hard problems:
    • Want to discourage premature convergence
    • Have to do multiple independent runs, which
      will often land on different local optima

An overview of Genetic Algorithms
A simple example: Evolve a string that contains all 1s
Population: bit strings (1s and 0s)
Fitness function: count of the number of 1s
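The simple example can be sketched in a few lines of Python. This is only an illustration: the string length (16), population size (10), and fixed seed are arbitrary assumptions, not values from the poster.

```python
import random

def random_bit_string(length, rng):
    """Return a random string of 0s and 1s of the given length."""
    return "".join(rng.choice("01") for _ in range(length))

def fitness(bits):
    """Fitness for the all-1s problem: the number of 1s in the string."""
    return bits.count("1")

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
population = [random_bit_string(16, rng) for _ in range(10)]
for individual in population:
    print(individual, fitness(individual))
```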
Replacement: How strings die off
  • In general, the reproduction process produces a
    new generation of solutions and the offspring
    generation entirely replaces the parent
    generation.
  • However, this could mean that you lose your best
    solution.
  • In practice, it is often better to save the best
    individuals into the next generation.
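Saving the best individuals is often called elitism. A minimal sketch follows, assuming bit-string individuals and a count-the-1s fitness. The function name `next_generation`, the parameter `n_elite`, and the choice to drop surplus offspring from the end of the list are all illustrative assumptions.

```python
def fitness(bits):
    """Count of 1s, as in the simple example."""
    return bits.count("1")

def next_generation(parents, offspring, fitness, n_elite=1):
    """Generational replacement that carries the n_elite best parents forward.

    The remaining slots are filled with offspring, so the population size
    stays equal to len(parents).
    """
    elite = sorted(parents, key=fitness, reverse=True)[:n_elite]
    return elite + offspring[:len(parents) - n_elite]

parents = ["1111", "0011", "0000"]
offspring = ["0001", "0100", "1000"]
print(next_generation(parents, offspring, fitness))
```

Without the elite slot, the best string here ("1111") would be lost, because none of the offspring match it.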
  • Genetic Algorithms (GAs) are an approach to
    problem solving that is inspired by the ideas of
    natural selection and evolution. The approach has
    demonstrated some success on complex problems
    (and also in working with nonlinearities).
  • In GAs:
    • There is a population of possible solutions to
      a problem
    • Solutions combine with other solutions to form
      new solutions
    • In the act of combining, mutations may occur
    • Selection favors better solutions to the problem
    • Over a series of generations, the population
      includes better and better solutions to the
      problem
Some common features of the GA approach
  • A population of strings -- represent potential
    solutions to a problem. (e.g.,
    1010100111110111)
  • An evaluation measure -- a fitness function
    that assigns a value to each solution. (E.g., 11
    1s in the string)
  • A method for parent selection -- a stochastic
    means of identifying good candidates for
    reproduction. (E.g., roulette)
  • Reproduction operators -- mechanisms for creating
    new solutions based on parents (E.g., crossover
    and mutation)
  • A Replacement method -- a means of removing
    lower-fitness solutions from the population
    (E.g., discrete generations)

Continue
The reproduction cycle (select parents, reproduce,
replace) continues for a preset number of
generations or until the population has converged
into similar solutions to the problem.
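The cycle can be sketched as a loop with both stopping conditions. This is a sketch under stated assumptions, not the poster's implementation: convergence is taken to mean that every string is identical, selection weights are fitness + 1 (so the weights never sum to zero), and all parameter values are arbitrary.

```python
import random

def fitness(bits):
    return bits.count("1")

def reproduce(p1, p2, rng, mutation_rate):
    """One-point crossover followed by per-bit mutation."""
    cut = rng.randrange(1, len(p1))
    child = p1[:cut] + p2[cut:]
    flip = {"0": "1", "1": "0"}
    return "".join(flip[b] if rng.random() < mutation_rate else b for b in child)

def run_ga(length=20, pop_size=30, max_generations=200, mutation_rate=0.01, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(max_generations):          # preset number of generations
        if len(set(population)) == 1:          # assumed convergence test
            break
        weights = [fitness(s) + 1 for s in population]
        offspring = []
        for _ in range(pop_size):
            # fitness-weighted sampling with replacement (roulette-style)
            p1, p2 = rng.choices(population, weights=weights, k=2)
            offspring.append(reproduce(p1, p2, rng, mutation_rate))
        population = offspring                 # discrete generations
    return population

final = run_ga()
best = max(final, key=fitness)
print(best, fitness(best))
```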
Some GA References
  • L. Davis. Handbook of Genetic Algorithms. Van
    Nostrand Reinhold, New York, NY, 1991.
  • D.E. Goldberg. Genetic Algorithms in Search,
    Optimization, and Machine Learning.
    Addison-Wesley, Reading, MA, 1989.
  • M. Mitchell. An Introduction to Genetic
    Algorithms. MIT Press, Cambridge, MA, 1996.
Some Cladistics References
  • Donoghue et al. Treebase: A database of
    phylogenetic knowledge. Online at
    http://phylogeny.harvard.edu/treebase
  • J. Felsenstein. Phylip source code and
    documentation. 1995. Online at
    http://evolution.genetics.washington.edu/phylip.html
  • P.L. Forey, C.J. Humphries, I.L. Kitching, R.W.
    Scotland, D.J. Siebert, and D.M. Williams.
    Cladistics: A Practical Course in Systematics.
    Number 10 in the Systematics Association.
    Clarendon Press, Oxford, 1993.
  • D.L. Swofford, G.J. Olsen, P.J. Waddell, and
    D.M. Hillis. Molecular Systematics, chapter
    Phylogenetic Inference, pages 407-514. Sinauer
    Associates, Inc., Sunderland, MA, 1996.
  • Solutions are initially generated at random
Selection: Roulette Wheel
  • Each solution has a wedge on the wheel
  • Higher fitness strings have a bigger wedge
  • Repeatedly spin the wheel to select parents

Reproduction: How new strings are produced from
parents
  • The process samples with replacement
  • All strings have a chance to be a parent
  • A string may be a parent more than once,
    or not at all
  • Parents pass information on to offspring; higher
    fitness strings are more likely to be parents and
    to pass on useful information
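A roulette wheel spin can be sketched as follows. The function name `roulette_select` and the example population are illustrative assumptions; the sketch also assumes that at least one fitness value is positive.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Spin the wheel once: each individual's wedge is proportional to its fitness."""
    total = sum(fitnesses)
    spin = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if spin < running:
            return individual
    return population[-1]  # guard against floating-point round-off

population = ["1111", "1100", "0000"]
fitnesses = [s.count("1") for s in population]
rng = random.Random(1)
# Repeated spins sample with replacement, so a string may be chosen
# several times or not at all.
parents = [roulette_select(population, fitnesses, rng) for _ in range(6)]
print(parents)
```

Note that in this strict form a string with fitness 0 has no wedge and is never selected; giving every string a chance requires some offset or scaling of the raw fitness.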

Crossover: Two parents swap bits with each other
Mutation: Random bits are flipped (from 0 to 1 or vice versa)
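Both operators are short on bit strings. The sketch below assumes one-point crossover, which is one common way for parents to swap bits (the poster does not say which variant was used), and a per-bit mutation rate, which is likewise an illustrative parameter.

```python
import random

def crossover(parent1, parent2, rng):
    """One-point crossover: the parents swap everything after a random cut point."""
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[b] if rng.random() < rate else b for b in bits)

rng = random.Random(0)
child1, child2 = crossover("11111111", "00000000", rng)
print(child1, child2)
print(mutate(child1, 0.1, rng))
```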
2
Cladistics (as told by a computer scientist)
Used by biologists to investigate the evolutionary
relationships among species currently or formerly
inhabiting the earth.
Data sets contain several related species, with
attribute-value information for relevant features.
Why heuristic search?
The search space is large enough for most
cladistics problems that an exhaustive search is
impractical. (A heuristic search is a clever, but
non-exhaustive, search.) For example, the number of
possible unrooted trees grows very rapidly with the
number of species.
Variations and related approaches
  • Using Wagner parsimony, there are many different
    ways to represent the same tree: a different
    species may be at the root as long as the
    relationships between species remain the same.
  • Other approaches assume, for example, that
    characters can be gained, but not lost, via
    evolution. The species at the root is significant
    in these approaches.
  • Molecular data (genetic AGCT) is increasingly
    used instead of phenotypes (physical traits); the
    metrics used to evaluate these trees are more
    complex.

Species are organized into trees or networks,
called cladograms, which represent hypothesized
evolutionary relationships between the species.
In a cladogram, the species are the leaf nodes;
the branches from interior nodes represent a
divergence in one or more attribute-values.
Cladists talk about "characters" and "character
states" rather than attributes and values. Instead
of "species", cladists would typically say "taxa".
Hydrostac 11100111011101100000011100010
Callitric 11100111010100001100101100111
Lamiaceae 01100000010000001101000100111
Phrymacea 11100000010000010111000110011
Verbenace 01000000010000001101000100011
Hippurida 11100010011110010110101100011
Mendoncia 01101000010000000101000001011
Thunbergi 01101000010000000100000000011
Nelsoniac 11001000110001100000010100001
Acanthace 01101000110000000000000100011
Martyniac 11000000110001100000010100011
Trapellac 11100000110010010100100010011
Pedaliace 11000000110000000000000100011 ...
Number of Species    Number of Possible Unrooted Trees
 3                   1
 4                   3
 5                   15
 6                   105
 7                   945
 8                   10,395
 9                   135,135
10                   2,027,025
11                   34,459,425
23                   1.3 x 10^25
49                   3.0 x 10^72
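These counts follow the standard closed form for unrooted, fully bifurcating trees with n labeled leaves: (2n - 5)!!, the product of the odd numbers 1, 3, ..., 2n - 5. A short sketch (the function name is an illustrative assumption):

```python
def unrooted_tree_count(n_species):
    """Number of unrooted bifurcating trees on n labeled species: (2n - 5)!!"""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    count = 1
    for k in range(3, 2 * n_species - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

for n in (3, 4, 5, 6, 7, 8, 9, 10, 11):
    print(n, f"{unrooted_tree_count(n):,}")
for n in (23, 49):
    print(n, f"{unrooted_tree_count(n):.1e}")
```

Each added species multiplies the count by the next odd number, which is why exhaustive search stops being practical after a handful of species.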
Example data set with four species and three
features:
  A  0 0 0
  B  1 0 0
  C  1 1 0
  D  1 1 1
This is a tree as drawn by a computer scientist.
It grows down, from the root at the top to the
leaves at the bottom. Cladists generally draw
their trees growing up.
An example of a most parsimonious tree for this
data.
  • The typical cladistics algorithm performs a
    heuristic search
  • The tree is built incrementally, adding one
    species at a time.
  • Searching determines where to add the next
    species.
  • And, with some algorithms, searching determines
    whether to transform the tree in specific ways,
    such as switching leaf nodes or moving subtrees
    (swapping or grafting)

Trees are evaluated with metrics such as (Wagner)
parsimony: fewer evolutionary changes are
considered more plausible. There are many
alternative metrics, each of which embodies
different assumptions about evolution and
restrictions on the resulting cladograms. We
chose Wagner parsimony in part as a convenient
starting point: it is straightforward (fast) to
calculate.
We are currently working with binary data, but
that is not an assumption of cladistics. It's
just a convenient starting point.
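To make the metric concrete, the sketch below counts the minimum number of character-state changes a tree requires, using Fitch's small-parsimony algorithm. This is a stand-in, not the Phylip code: for two-state characters the unordered (Fitch) and ordered (Wagner) counts coincide. Trees are written as nested tuples with species names at the leaves, which is an assumed representation, and the data is the four-species example from this poster.

```python
def fitch(tree, states):
    """Return (possible ancestral states, changes needed) for one character."""
    if isinstance(tree, str):                 # leaf: the observed state
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                # children can agree: no change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony(tree, data):
    """Total changes over all characters; lower is more parsimonious."""
    n_chars = len(next(iter(data.values())))
    total = 0
    for i in range(n_chars):
        states = {name: row[i] for name, row in data.items()}
        total += fitch(tree, states)[1]
    return total

data = {"A": "000", "B": "100", "C": "110", "D": "111"}
print(parsimony(("A", ("B", ("C", "D"))), data))   # 3
print(parsimony((("A", "C"), ("B", "D")), data))   # 4
```

Every one of the three characters varies across the species, so no tree can need fewer than 3 changes; the first tree is therefore a most parsimonious one for this data.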
Gaphyl = GA + Phylip
Challenge: To design a meaningful representation
and GA reproduction operators for the
cladistics domain.
Results
Experiments have been run on two different
datasets (both from TreeBase):
  • Lamiiflorae: 23 species, 29 attributes
  • Angiosperms: 49 species, 61 attributes
Crossover Operator
1. From parent1, choose a subtree at random; from
parent2, choose the smallest subtree that includes
all species from the first subtree.
2. Replace the subtree of parent1 with the subtree
from parent2.
3. Identify duplicate species introduced by the new
subtree and prune the offspring to remove the older
branches that contain duplicates.
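The three crossover steps can be sketched on binary trees stored as nested tuples. This is one possible reading of the steps, not Gaphyl's code: the representation, the function names, the exclusion of the root when choosing a subtree, and the way emptied branches are collapsed during pruning are all assumptions. It assumes at least three species.

```python
import random

def leaf_list(tree):
    """All species names in the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaf_list(tree[0]) + leaf_list(tree[1])

def subtrees(tree):
    """All internal nodes of the tree, root first."""
    if isinstance(tree, str):
        return []
    return [tree] + subtrees(tree[0]) + subtrees(tree[1])

def smallest_containing(tree, species):
    """Smallest subtree of `tree` whose leaves include every name in `species`."""
    if not isinstance(tree, str):
        for child in tree:
            if species <= set(leaf_list(child)):
                return smallest_containing(child, species)
    return tree

def graft(tree, target, donor, donor_species):
    """Copy `tree` with `target` replaced by `donor`; drop older copies of donor species."""
    if tree is target:
        return donor
    if isinstance(tree, str):
        return None if tree in donor_species else tree
    kept = [c for c in (graft(ch, target, donor, donor_species) for ch in tree)
            if c is not None]
    if len(kept) == 2:
        return tuple(kept)
    return kept[0] if kept else None          # collapse emptied branches

def crossover(parent1, parent2, rng):
    target = rng.choice(subtrees(parent1)[1:])                  # step 1
    donor = smallest_containing(parent2, set(leaf_list(target)))
    return graft(parent1, target, donor, set(leaf_list(donor)))  # steps 2 and 3

parent1 = ((("A", "B"), "C"), (("D", "E"), "F"))
parent2 = ((("A", "C"), ("B", "D")), ("E", "F"))
child = crossover(parent1, parent2, random.Random(0))
print(child, sorted(leaf_list(child)))
```

Because the donor subtree contains every species of the subtree it replaces, and older copies of its species are pruned from the rest of parent1, each species appears exactly once in the offspring.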
This research represents an exploration into the
utility of using the GA approach to search for
cladograms. The GA is often successful in
searching large and complex search spaces, and
provides an alternative to the usual heuristic
search methods used in cladistics. Phylip is a
commonly used cladistics package, with freely
available source code.
  • Although the above explanation of the GA uses
    strings, some researchers use the GA approach
    with a tree representation.
  • For example, a standard crossover operator
    for a tree representation will swap subtrees
    in the two parents.
  • However, the cladistics task provides an
    additional challenge:
    • Each species must be represented exactly once in
      the tree
    • The offspring should inherit meaningful
      information from the parents (for the GA approach
      to work)

With 23 species, Gaphyl finds the same set of
trees as Phylip in comparable run time (about 1.5
hours).
With 49 species, Gaphyl finds trees of the same
best fitness, but finds more of them than Phylip
does in comparable run time (about 12 hours):
  • 222 distinct trees found by Gaphyl
  • 40 distinct trees found by Phylip
This suggests that on larger problems, Gaphyl may
be able to find trees that Phylip cannot.
In Gaphyl, we use the GA to do the search and
code from Phylip to evaluate the cladograms.
Recap on what we have so far:
  • A large search space (of trees); the space is
    large enough that heuristic search is necessary
  • A well-defined fitness function for the GA
    (Wagner parsimony, via Phylip code)
Observation of cladistics task: In general, a
good parent will have closely related species
near to each other in a subtree. We would like our
GA operators to have a chance to be able to pass
this information on to offspring.
Mutation Operators (not hard for offspring to
resemble parent)
1. Swap two random leaf nodes (species)
2. Swap two random subtrees
3. Prune random species and add it back as an
offspring of the root
4. Pick a random subtree and put it into canonical
form
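Operators 1 and 3 can be sketched on nested-tuple trees. The representation and function names are assumptions, and "add it back as an offspring of the root" is interpreted here as making a new root whose two children are the pruned tree and the removed species. Operators 2 and 4 are not sketched.

```python
import random

def leaf_list(tree):
    """All species names in the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaf_list(tree[0]) + leaf_list(tree[1])

def swap_leaves(tree, a, b):
    """Copy of the tree with species a and b exchanged."""
    if isinstance(tree, str):
        return b if tree == a else a if tree == b else tree
    return tuple(swap_leaves(child, a, b) for child in tree)

def remove_leaf(tree, name):
    """Copy of the tree without species `name`; its sibling moves up one level."""
    if isinstance(tree, str):
        return None if tree == name else tree
    kept = [c for c in (remove_leaf(child, name) for child in tree) if c is not None]
    return tuple(kept) if len(kept) == 2 else kept[0]

def mutate_swap(tree, rng):
    """Operator 1: swap two random leaf nodes."""
    a, b = rng.sample(leaf_list(tree), 2)
    return swap_leaves(tree, a, b)

def mutate_move_to_root(tree, rng):
    """Operator 3: prune a random species and reattach it at the root."""
    name = rng.choice(leaf_list(tree))
    return (remove_leaf(tree, name), name)

tree = ((("A", "B"), "C"), (("D", "E"), "F"))
rng = random.Random(0)
print(mutate_swap(tree, rng))
print(mutate_move_to_root(tree, rng))
```

Both operators change only a small part of the tree and keep every species exactly once, so the offspring stays close to its parent.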
Not explained in this poster
  • Canonical form used to recognize duplicates with
    respect to Wagner parsimony
  • Two-step GA process used (run and run again) to
    combat premature convergence