Title: CHAPTER 10 EVOLUTIONARY COMPUTATION II: GENERAL METHODS AND THEORY
1 CHAPTER 10 EVOLUTIONARY COMPUTATION II: GENERAL METHODS AND THEORY
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall
- Organization of chapter in ISSO
- Introduction
- Evolution strategy and evolutionary programming; comparisons with GAs
- Schema theory for GAs
- What makes a problem hard?
- Convergence theory
- No free lunch theorems
2 Methods of EC
- Genetic algorithms (GAs), evolution strategy (ES), and evolutionary programming (EP) are the most common EC methods
- Many modern EC implementations borrow aspects from one or more EC methods
- Generally, ES is used for function optimization and EP for AI applications such as automatic programming
3 ES Algorithm with Noise-Free Loss Measurements
- Step 0 (initialization): Randomly or deterministically generate initial population of N values of $\theta \in \Theta$ and evaluate L for each of the values.
- Step 1 (offspring): Generate $\lambda$ offspring from current population of N candidate $\theta$ values such that all $\lambda$ values satisfy direct or indirect constraints on $\theta$.
- Step 2 (selection): For $(N+\lambda)$-ES, select N best values from combined population of N original values plus $\lambda$ offspring; for $(N,\lambda)$-ES, select N best values from population of $\lambda > N$ offspring only.
- Step 3 (repeat or terminate): Repeat steps 1 and 2 or terminate.
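The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the ISSO implementation: it assumes a scalar theta constrained to an interval, Gaussian mutation with a fixed standard deviation `sigma`, uniformly random parent choice, and a fixed iteration count as the termination rule. The function name and all parameters are hypothetical.

```python
import random

def evolution_strategy(loss, n_pop, n_off, bounds, sigma=0.3,
                       plus=True, iters=50, seed=0):
    """Sketch of (N+lambda)-ES (plus=True) or (N,lambda)-ES (plus=False)."""
    if not plus and n_off <= n_pop:
        raise ValueError("(N,lambda)-ES needs lambda > N")
    rng = random.Random(seed)
    lo, hi = bounds
    # Step 0: random initial population; loss is assumed noise-free.
    pop = [rng.uniform(lo, hi) for _ in range(n_pop)]
    best_history = [min(loss(t) for t in pop)]
    for _ in range(iters):
        # Step 1: mutate randomly chosen parents (assumed Gaussian mutation);
        # clipping to [lo, hi] enforces the constraint on theta.
        offspring = []
        for _ in range(n_off):
            child = rng.choice(pop) + rng.gauss(0.0, sigma)
            offspring.append(min(max(child, lo), hi))
        # Step 2: select the N best from parents plus offspring, or offspring only.
        candidates = pop + offspring if plus else offspring
        pop = sorted(candidates, key=loss)[:n_pop]
        best_history.append(loss(pop[0]))
    # Step 3: terminate after a fixed number of iterations (assumed rule).
    return pop[0], best_history

best, history = evolution_strategy(lambda t: (t - 1.0) ** 2, n_pop=5,
                                   n_off=20, bounds=(-5.0, 5.0))
print(best, history[-1])
```

With `plus=True` the best loss can never increase between iterations, because parents compete with their offspring; with `plus=False` a good parent can be discarded.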
4 Schema Theory for GAs
- Key innovation in Holland (1975) is a form of theoretical foundation for GAs based on schemas
- Represents first attempt at serious theoretical analysis
- But not entirely successful, as leap of faith required to relate schema theory to actual convergence of GA
- "GAs work by discovering, emphasizing, and recombining good building blocks of solutions in a highly parallel fashion." (Melanie Mitchell, An Introduction to Genetic Algorithms, p. 27, 1996, paraphrasing John Holland)
- Statement above more intuitive than formal
- Notion of building block is characterized via schemas
- Schemas are propagated or destroyed according to the laws of probability
5 Schema Theory for GAs
- Schema is template for chromosomes in GAs
- Example: [* 1 0 * * * * 1], where the * symbol represents a don't care (or free) element
- [1 1 0 0 1 1 0 1] is specific instance of this schema
- Schemas sometimes called building blocks of GAs
- Two fundamental results: schema theorem and implicit parallelism
- Schema theorem says that better templates dominate the population as generations proceed
- Implicit parallelism says that GA processes $\gg N$ schemas at each iteration
- Schema theory is controversial
- Not connected to algorithm performance in same direct way as usual convergence theory for iterates of algorithm
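A small sketch may make the template idea and the counting behind implicit parallelism concrete. It assumes chromosomes and schemas are written as strings over '0', '1', and '*' (with '*' as the don't-care symbol); the function names and the example schema are illustrative choices, not taken from ISSO.

```python
from itertools import product

def matches(schema, chromosome):
    """True if the bit string is an instance of the schema ('*' = don't care)."""
    return len(schema) == len(chromosome) and all(
        s == "*" or s == c for s, c in zip(schema, chromosome))

def schemas_of(chromosome):
    """All schemas that a single chromosome is an instance of (2**B of them)."""
    found = set()
    for mask in product([False, True], repeat=len(chromosome)):
        found.add("".join("*" if free else bit
                          for free, bit in zip(mask, chromosome)))
    return found

def schemas_in_population(population):
    """Distinct schemas represented by at least one population member."""
    return set().union(*(schemas_of(c) for c in population))

print(matches("*10****1", "11001101"))
print(len(schemas_of("11001101")))
print(len(schemas_in_population(["00", "11"])))
```

Each B-bit chromosome is an instance of 2^B schemas, so even a population of N = 2 two-bit chromosomes touches more than N schemas. That is the counting fact underlying implicit parallelism; it says nothing by itself about convergence.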
6 Convergence Theory via Markov Chains
- Schema theory inadequate
- Mathematics behind schema theory not fully rigorous
- Unjustified claims about implications of schema theory
- More rigorous convergence theory exists
- Pertains to noise-free loss (fitness) measurements
- Pertains to finite representation (e.g., bit coding or floating point representation on digital computer)
- Convergence theory relies on Markov chains
- Each state in chain represents possible population
- Markov transition matrix P contains all information for Markov chain analysis
7 GA Markov Chain Model
- GAs with binary bit coding can be modeled as (discrete state) Markov chains
- Recall states in chain represent possible populations
- i-th element of probability vector $p_k$ represents probability of achieving i-th population at iteration k
- Transition matrix: The (i, j) element of P represents the probability of population i producing population j through the selection, crossover, and mutation operations
- Depends on loss (fitness) function, selection method, and reproduction and mutation parameters
- Given transition matrix P, it is known that $p_{k+1}^T = p_k^T P$, and hence $p_k^T = p_0^T P^k$
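The role of the transition matrix can be illustrated by propagating a probability vector one iteration at a time, using the standard row-vector convention in which the new vector is the old vector times P. The 3-state matrix below is purely hypothetical (a real GA chain has one state per possible population), and the function names are illustrative.

```python
def step(p, P):
    """One iteration: entry j of the new vector is sum_i p[i] * P[i][j]."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def propagate(p0, P, k):
    """Probability vector after k iterations, starting from p0."""
    p = list(p0)
    for _ in range(k):
        p = step(p, P)
    return p

# Hypothetical transition matrix; each row sums to 1 and all entries are positive.
P = [[0.5, 0.4, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

print(propagate([1.0, 0.0, 0.0], P, 1))
print(propagate([1.0, 0.0, 0.0], P, 200))
print(propagate([0.0, 0.0, 1.0], P, 200))
```

Because every entry of this toy P is positive, the last two printed vectors agree to numerical precision: the limiting distribution does not depend on the starting population.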
8 Rudolph (1994) and Markov Chain Analysis for Canonical GA
- Rudolph (1994, IEEE Trans. Neural Nets.) uses Markov chain analysis to study canonical GA (CGA)
- CGA includes binary bit coding, crossover, mutation, and roulette wheel selection
- CGA is focus of seminal book, Holland (1975)
- CGA does not include elitism; lack of elitism is critical aspect of theoretical analysis
- CGA assumes mutation probability $0 < P_m < 1$ and single-point crossover probability $0 \le P_c \le 1$
- Key preliminary result: CGA is ergodic Markov chain
- Exists a unique limiting distribution for the states of chain
- Nonzero probability of being in any state regardless of initial condition
9 Rudolph (1994) and Markov Chain Analysis for CGA (cont'd)
- Ergodicity for CGA provides a negative result on convergence in Rudolph (1994)
- Let $\hat{L}_{\min,k}$ denote lowest of N (= population size) loss values within population at iteration k
- $\hat{L}_{\min,k}$ represents loss value for $\theta$ in population k that has maximum fitness value
- Main theorem: CGA satisfies $\lim_{k \to \infty} P(\hat{L}_{\min,k} = L(\theta^*)) < 1$
- (above limit on left-hand side exists by ergodicity)
- Implies CGA does not converge to the global optimum
10 Rudolph (1994) and Markov Chain Analysis for CGA (cont'd)
- Fundamental problem with CGA is that optimal solutions are found but then lost
- CGA has no mechanism for retaining optimal solution
- Rudolph discusses modification to CGA yielding positive convergence results
- Appends super individual to each population
- Super individual represents best chromosome so far
- Not eligible for GA operations (selection, crossover, mutation)
- Not same as elitism
- CGA with added super individual converges in probability
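The found-but-then-lost behavior and the super individual remedy can be sketched with a toy canonical GA. This is an illustrative simulation, not Rudolph's construction in full: the fitness function (one plus the number of 1 bits, to be maximized), the parameter values, and the function names are all assumptions. The super individual is stored outside the population and never takes part in selection, crossover, or mutation.

```python
import random

def canonical_ga(fitness, n_pop, n_bits, p_c=0.8, p_m=0.05, iters=200, seed=1):
    """Toy canonical GA. Returns two lists of fitness values per iteration:
    best within the current population, and the stored super individual."""
    if n_pop % 2 or n_bits < 2:
        raise ValueError("need even n_pop and n_bits >= 2")
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_pop)]
    super_fit = max(fitness(c) for c in pop)
    in_population, super_individual = [], []
    for _ in range(iters):
        # Roulette wheel selection: probability proportional to (positive) fitness.
        parents = rng.choices(pop, weights=[fitness(c) for c in pop], k=n_pop)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if rng.random() < p_c:  # single-point crossover
                cut = rng.randint(1, n_bits - 1)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children += [a, b]
        # Bitwise mutation with probability p_m per bit; no elitism anywhere.
        pop = [[bit ^ int(rng.random() < p_m) for bit in c] for c in children]
        best_now = max(fitness(c) for c in pop)
        super_fit = max(super_fit, best_now)  # record only; never fed back
        in_population.append(best_now)
        super_individual.append(super_fit)
    return in_population, super_individual

def ones_fitness(chromosome):
    """Hypothetical fitness: 1 + number of 1 bits (kept positive for roulette)."""
    return 1 + sum(chromosome)

in_pop, record = canonical_ga(ones_fitness, n_pop=6, n_bits=8)
print(in_pop[-10:])
print(record[-10:])
```

The stored record can only improve, whereas the best value inside the population is free to fall after a good chromosome is disrupted by crossover or mutation.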
11 Contrast of Suzuki (1995) and Rudolph (1994) in Markov Chain Analysis for GA
- Suzuki (1995, IEEE Trans. Systems, Man, and Cyber.) uses Markov chain analysis to study GA with elitism
- Same as CGA of Rudolph (1994) except for elitism
- Suzuki (1995) only considers unique states (populations)
- Rudolph (1994) includes redundant states
- With N = population size and B = no. of bits/chromosome:
- $\binom{N + 2^B - 1}{N}$ unique states in Suzuki (1995),
- $2^{NB}$ states in Rudolph (1994) (much larger than number of unique states above)
- Above affects bookkeeping; does not fundamentally change relative results of Suzuki (1995) and Rudolph (1994)
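The two ways of counting states can be compared numerically. The sketch assumes that a unique state is an unordered population, i.e., a multiset of N chromosomes drawn from the 2^B possible bit strings, which gives the standard multiset count; the redundant count treats the population as one ordered bit string of length NB. Function names are illustrative.

```python
from math import comb

def unique_states(n_pop, n_bits):
    """Unordered populations: multisets of size n_pop from 2**n_bits chromosomes."""
    return comb(n_pop + 2 ** n_bits - 1, n_pop)

def ordered_states(n_pop, n_bits):
    """Ordered populations: every bit string of length n_pop * n_bits."""
    return 2 ** (n_pop * n_bits)

for n_pop, n_bits in [(2, 2), (4, 4), (6, 6)]:
    print(n_pop, n_bits, unique_states(n_pop, n_bits), ordered_states(n_pop, n_bits))
```

For N = 2 and B = 1, the three unique populations are {0, 0}, {0, 1}, and {1, 1}, while the ordered count of four treats (0, 1) and (1, 0) as different states.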
12 Convergence Under Elitism
- In both CGA case (Rudolph, 1994) and case with elitism (Suzuki, 1995) the limit $\lim_{k \to \infty} p_k = \bar{p}$ exists
- (dimension of $\bar{p}$ differs according to definition of states, unique or nonunique as on previous slide)
- Suzuki (1995) assumes each population includes one elite element and that crossover probability $P_c = 1$
- Let $\bar{p}_j$ represent j-th element of $\bar{p}$, and J represent indices j where population j includes chromosome achieving $L(\theta^*)$
- Then from Suzuki (1995): $\sum_{j \in J} \bar{p}_j = 1$
- Implies GA with elitism converges in probability to set of optima
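The contrast between the ergodic (no elitism) case and the elitist case can be illustrated with two hypothetical 3-state chains in which state 2 stands for the populations that contain an optimal chromosome. In the elitist toy chain that state cannot be left once reached, which is a simplifying assumption meant to mimic retention of the elite element; neither matrix comes from Rudolph (1994) or Suzuki (1995).

```python
def propagate(p0, P, k):
    """Probability vector after k iterations (row vector times P, k times)."""
    p = list(p0)
    n = len(P)
    for _ in range(k):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p

def mass_on(J, p):
    """Total probability of the states indexed by J."""
    return sum(p[j] for j in J)

# Hypothetical chains; J = [2] marks the states containing an optimum.
ERGODIC = [[0.5, 0.4, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
ELITIST = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.0, 0.0, 1.0]]
J = [2]
p0 = [1.0, 0.0, 0.0]

print(mass_on(J, propagate(p0, ERGODIC, 500)))
print(mass_on(J, propagate(p0, ELITIST, 500)))
```

In the ergodic chain the limiting probability of the optimal states stays strictly below 1, while in the elitist chain all probability mass eventually accumulates on J.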
13 Calculation of Stationary Distribution
- Markov chain theory provides useful conceptual device
- Practical calculation difficult due to explosive growth of number of possible populations (states)
- Growth is in terms of factorials of N and bit string length (B)
- Practical calculation of $p_k$ usually impossible due to difficulty in getting P
- Transition matrix can be very large in practice
- E.g., if N = B = 6, P is a $10^8 \times 10^8$ matrix!
- Real problems have N and B much larger than 6
- Ongoing work attempts to severely reduce dimension by limiting states to only most important (e.g., Spears, 1999; Moey and Rowe, 2004)
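A rough storage estimate shows why even the N = B = 6 case is out of reach for direct computation. The sketch assumes states are unordered populations (counted with the multiset formula) and that P is stored densely with 8 bytes per entry; both are assumptions made for illustration, and the function name is hypothetical.

```python
from math import comb

def dense_matrix_bytes(n_pop, n_bits, bytes_per_entry=8):
    """Number of unique states and bytes for a dense square transition matrix."""
    n_states = comb(n_pop + 2 ** n_bits - 1, n_pop)
    return n_states, bytes_per_entry * n_states ** 2

n_states, n_bytes = dense_matrix_bytes(6, 6)
print(f"{n_states:.3e} states, {n_bytes:.3e} bytes")
```

Under these assumptions the matrix has roughly 10^8 rows and columns and would need on the order of 10^17 bytes, which is consistent with the motivation for reduced-state approximations.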
14 Example 10.2 from ISSO: Markov Chain Calculations for Small-Scale Implementation
- Consider a loss function $L(\theta)$ with $\theta \in \Theta = [0, 15]$
- Function has local and global minimum; plot on next slide
- Several GA implementations with very small population sizes (N) and numbers of bits (B)
- Small-scale implementations imply Markov transition matrices are computable
- But still not trivial, as matrix dimensions range from approximately $2000 \times 2000$ to $4000 \times 4000$
15 Loss Function for Example 10.2 in ISSO
- Markov chain theory provides probability of finding solution ($\theta^* = 15$) in given number of iterations
16 Example 10.2 (cont'd): Probability Calculations for Very Small-Scale GAs
17 Summary of GA Convergence Theory
- Schema theory (Holland, 1975) was most popular method for theoretical analysis until approximately mid-1990s
- Schema theory not fully rigorous and not fully connected to actual algorithm performance
- Markov chain theory provides more formal means of convergence (and convergence rate) analysis
- Rudolph (1994) used Markov chains to provide largely negative result on convergence for canonical GAs
- Canonical GA does not converge to optimum
- Suzuki (1995) considered GAs with elitism; unlike Rudolph (1994), GA is now convergent
- Challenges exist in practical calculation of Markov transition matrix
18 No Free Lunch Theorems (Reprise, Chap. 1)
- No free lunch (NFL) theorems apply to EC algorithms
- Theorems imply there can be no universally efficient EC algorithm
- Performance of one algorithm when averaged over all problems is identical to that of any other algorithm
- Suppose EC algorithm A applied to loss L
- Let $\hat{L}_{\min,n}$ denote lowest loss value from most recent N population elements after $n \ge N$ unique function evaluations
- Consider the probability that $\hat{L}_{\min,n} = \xi$ after n unique evaluations of the loss: $P(\hat{L}_{\min,n} = \xi \mid L, A)$
- NFL theorems state that the sum of above probabilities over all loss functions is independent of A
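The NFL statement can be checked by brute force on a tiny problem. The sketch below assumes a finite search space of four points, three possible loss values, and two deterministic search rules that never revisit a point (one fixed-order, one adaptive); all of these are hypothetical choices for illustration. For each rule it enumerates every possible loss function and tallies the lowest loss among the most recent N evaluations after n evaluations.

```python
from collections import Counter
from itertools import product

X = range(4)    # hypothetical finite search space
Y = (0, 1, 2)   # hypothetical finite set of loss values

def lowest_first(history):
    """Non-adaptive rule: always evaluate the lowest unvisited point."""
    visited = {x for x, _ in history}
    return min(x for x in X if x not in visited)

def adaptive(history):
    """Adaptive rule: jump to the highest unvisited point after a best-so-far
    value, otherwise take the lowest unvisited point."""
    visited = {x for x, _ in history}
    free = [x for x in X if x not in visited]
    if history and history[-1][1] == min(y for _, y in history):
        return max(free)
    return min(free)

def run(algorithm, loss_table, n, n_pop):
    """Lowest loss among the most recent n_pop of n unique evaluations."""
    history = []
    for _ in range(n):
        x = algorithm(history)
        history.append((x, loss_table[x]))
    return min(y for _, y in history[-n_pop:])

def histogram(algorithm, n=3, n_pop=2):
    """Tally of outcomes over all len(Y)**len(X) possible loss functions."""
    return Counter(run(algorithm, table, n, n_pop)
                   for table in product(Y, repeat=len(X)))

print(histogram(lowest_first))
print(histogram(adaptive))
```

The two tallies coincide even though the rules visit points in different orders. This is the sense in which the summed probabilities do not depend on the algorithm; any advantage on some loss functions is offset on others.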
19 Comparison of Algorithms for Stochastic Optimization in Chaps. 2-10 of ISSO
- Table on next slide is rough summary of relative merits of several algorithms for stochastic optimization
- Comparisons based on semi-subjective impressions from numerical experience (author and others) and theoretical or analytical evidence
- NFL theorems not generally relevant, as only considering typical problems of interest, not all possible problems
- Table does not consider root-finding per se
- Table is for basic implementation forms of algorithms
- Ratings are L (low), ML (medium-low), M (medium), MH (medium-high), and H (high)
- These scales are for stochastic optimization setting and have no meaning relative to classical deterministic methods
20 Comparison of Algorithms