CHAPTER 10 EVOLUTIONARY COMPUTATION II: GENERAL METHODS AND THEORY - PowerPoint PPT Presentation

1
CHAPTER 10: EVOLUTIONARY COMPUTATION II: GENERAL
METHODS AND THEORY
Slides for Introduction to Stochastic Search and
Optimization (ISSO) by J. C. Spall
  • Organization of chapter in ISSO
  • Introduction
  • Evolution strategy and evolutionary programming
    comparisons with GAs
  • Schema theory for GAs
  • What makes a problem hard?
  • Convergence theory
  • No free lunch theorems

2
Methods of EC
  • Genetic algorithms (GAs), evolution strategy
    (ES), and evolutionary programming (EP) are most
    common EC methods
  • Many modern EC implementations borrow aspects
    from one or more EC methods
  • Generally, ES is used for function
    optimization and EP for AI applications such as
    automatic programming

3
ES Algorithm with Noise-Free Loss Measurements
  • Step 0 (initialization) Randomly or
    deterministically generate initial population of
    N values of θ ∈ Θ and evaluate L for each of the
    values.
  • Step 1 (offspring) Generate λ offspring from
    current population of N candidate θ values such
    that all λ values satisfy direct or indirect
    constraints on θ.
  • Step 2 (selection) For (N + λ)-ES, select N best
    values from combined population of N original
    values plus λ offspring; for (N, λ)-ES, select N
    best values from population of λ > N offspring
    only.
  • Step 3 (repeat or terminate) Repeat steps 1 and 2
    or terminate.
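The three steps above can be sketched as a minimal ES for a scalar problem. This is an illustrative sketch only, not code from ISSO; the function `es_minimize` and all parameter defaults are assumptions:

```python
import random

def es_minimize(loss, lower, upper, N=10, lam=40, sigma=0.3, iters=100, plus=True):
    """Minimal ES sketch with Gaussian mutation on a scalar theta.
    plus=True gives (N + lam)-ES selection; plus=False gives (N, lam)-ES
    (which requires lam > N)."""
    # Step 0: initial population of N theta values in [lower, upper]
    pop = [random.uniform(lower, upper) for _ in range(N)]
    for _ in range(iters):
        # Step 1: lam offspring; clipping enforces the constraint theta in [lower, upper]
        offspring = [min(upper, max(lower, random.choice(pop) + random.gauss(0.0, sigma)))
                     for _ in range(lam)]
        # Step 2: select the N best values by loss
        candidates = pop + offspring if plus else offspring
        pop = sorted(candidates, key=loss)[:N]
    # Step 3: terminate after a fixed number of iterations
    return min(pop, key=loss)

random.seed(1)  # reproducibility only
best = es_minimize(lambda th: (th - 2.0) ** 2, lower=-5.0, upper=5.0)
```

With plus-selection the best value found is never discarded, so the sketch behaves like an elitist stochastic search on this toy quadratic.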

4
Schema Theory for GAs
  • Key innovation in Holland (1975) is a form of
    theoretical foundation for GAs based on schemas
  • Represents first attempt at serious theoretical
    analysis
  • But not entirely successful, as leap of faith
    required to relate schema theory to actual
    convergence of GA
  • "GAs work by discovering, emphasizing, and
    recombining good building blocks of solutions
    in a highly parallel fashion." (Melanie Mitchell,
    An Introduction to Genetic Algorithms, p. 27,
    1996, paraphrasing John Holland)
  • Statement above more intuitive than formal
  • Notion of building block is characterized via
    schemas
  • Schemas are propagated or destroyed according to
    the laws of probability

5
Schema Theory for GAs
  • Schema is template for chromosomes in GAs
  • Example: 1 * * 0 * * * 1, where the symbol *
    represents a "don't care" (or free) element
  • 1 1 0 0 1 1 0 1 is a specific instance of this
    schema
  • Schemas sometimes called building blocks of GAs
  • Two fundamental results Schema theorem and
    implicit parallelism
  • Schema theorem says that better templates
    dominate the population as generations proceed
  • Implicit parallelism says that GA processes ≫ N
    schemas at each iteration
  • Schema theory is controversial
  • Not connected to algorithm performance in same
    direct way as usual convergence theory for
    iterates of algorithm
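The schema idea can be made concrete in a few lines. The `matches` helper and the 8-bit schema below are illustrative assumptions (the schema's fixed bits are chosen to be consistent with the example instance above):

```python
from itertools import product

def matches(schema, chromosome):
    """True if the bit string matches the schema; '*' marks a don't-care position."""
    return len(schema) == len(chromosome) and all(
        s == '*' or s == c for s, c in zip(schema, chromosome))

schema = "1**0***1"               # 3 fixed positions, 5 don't-cares
assert matches(schema, "11001101")

# Of the 2**8 possible chromosomes, exactly 2**5 match this schema (one free
# binary choice per '*') -- one schema stands for many chromosomes at once
n_match = sum(matches(schema, ''.join(bits)) for bits in product('01', repeat=8))
```

Conversely, each chromosome of length B is simultaneously an instance of 2^B schemas, which is the counting behind the implicit-parallelism claim.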

6
Convergence Theory via Markov Chains
  • Schema theory inadequate
  • Mathematics behind schema theory not fully
    rigorous
  • Unjustified claims about implications of schema
    theory
  • More rigorous convergence theory exists
  • Pertains to noise-free loss (fitness)
    measurements
  • Pertains to finite representation (e.g., bit
    coding or floating point representation on
    digital computer)
  • Convergence theory relies on Markov chains
  • Each state in chain represents possible
    population
  • Markov transition matrix P contains all
    information for Markov chain analysis

7
GA Markov Chain Model
  • GAs with binary bit coding can be modeled as
    (discrete state) Markov chains
  • Recall states in chain represent possible
    populations
  • i-th element of probability vector pk represents
    probability of achieving i-th population at
    iteration k
  • Transition matrix The i, j element of P
    represents the probability of population i
    producing population j through the selection,
    crossover and mutation operations
  • Depends on loss (fitness) function, selection
    method, and reproduction and mutation parameters
  • Given transition matrix P, it is known that
    pk^T = p(k-1)^T P = p0^T P^k

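The update pk^T = p(k-1)^T P can be iterated directly on a toy two-state chain standing in for the (enormous) population-level GA chain; the matrix entries are arbitrary illustrations:

```python
# Row i of P holds the probabilities of moving from state (population) i
# to each state j; rows sum to 1.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(p, P):
    """One iteration of p_k^T = p_{k-1}^T P (row-vector convention)."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

p = [1.0, 0.0]           # start in state 0 with probability 1
for _ in range(200):     # after k steps, p holds p_0^T P^k
    p = step(p, P)
# p approaches the unique limiting (stationary) distribution [0.8, 0.2]
```

For an ergodic chain like this one, the iteration converges to the same limiting distribution from any starting vector, which is exactly the property exploited in the GA analyses that follow.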
8
Rudolph (1994) and Markov Chain Analysis for
Canonical GA
  • Rudolph (1994, IEEE Trans. Neural Nets.) uses
    Markov chain analysis to study canonical GA
    (CGA)
  • CGA includes binary bit coding, crossover,
    mutation, and roulette wheel selection
  • CGA is focus of seminal book, Holland (1975)
  • CGA does not include elitism; lack of elitism is
    a critical aspect of the theoretical analysis
  • CGA assumes mutation probability 0 < Pm < 1 and
    single-point crossover probability 0 ≤ Pc ≤ 1
  • Key preliminary result CGA is ergodic Markov
    chain
  • Exists a unique limiting distribution for the
    states of chain
  • Nonzero probability of being in any state
    regardless of initial condition

9
Rudolph (1994) and Markov Chain Analysis for CGA
(contd)
  • Ergodicity for CGA provides a negative result on
    convergence in Rudolph (1994)
  • Let Lmin(k) denote lowest of the N (= population
    size) loss values within population at iteration
    k
  • Lmin(k) represents loss value for the θ in
    population k that has maximum fitness value
  • Main theorem: CGA satisfies
    limk→∞ P(Lmin(k) = L(θ*)) < 1
  • (above limit on left-hand side exists by
    ergodicity)
  • Implies CGA does not converge to the global
    optimum

10
Rudolph (1994) and Markov Chain Analysis for CGA
(contd)
  • Fundamental problem with CGA is that optimal
    solutions are found but then lost
  • CGA has no mechanism for retaining optimal
    solution
  • Rudolph discusses modification to CGA yielding
    positive convergence results
  • Appends super individual to each population
  • Super individual represents best chromosome so
    far
  • Not eligible for GA operations (selection,
    crossover, mutation)
  • Not same as elitism
  • CGA with added super individual converges in
    probability
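Rudolph's modification can be sketched as a wrapper that carries a best-so-far "super individual" outside the GA operations. The function `with_super_individual` and the stand-in `ga_step` (deliberately forgetful random resampling, to show why retention matters) are illustrative assumptions, not the algorithm from the paper:

```python
import random

def with_super_individual(ga_step, init_pop, loss, iters=60):
    """Run ga_step repeatedly while tracking a best-so-far chromosome that
    never enters selection, crossover, or mutation."""
    pop = list(init_pop)
    super_ind = min(pop, key=loss)
    for _ in range(iters):
        pop = ga_step(pop)                 # GA operates on pop only
        cand = min(pop, key=loss)
        if loss(cand) < loss(super_ind):
            super_ind = cand               # retained even if later lost from pop
    return super_ind

random.seed(2)  # reproducibility only
# Stand-in 'GA' iteration: pure random resampling, which can lose good solutions
ga_step = lambda pop: [random.uniform(0.0, 15.0) for _ in pop]
loss = lambda th: (th - 7.0) ** 2
best = with_super_individual(ga_step, [random.uniform(0.0, 15.0) for _ in range(8)], loss)
```

The underlying search here never retains anything, yet the returned super individual still improves monotonically, which is the mechanism behind the convergence-in-probability result.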

11
Contrast of Suzuki (1995) and Rudolph (1994) in
Markov Chain Analysis for GA
  • Suzuki (1995, IEEE Trans. Systems, Man, and
    Cyber.) uses Markov chain analysis to study GA
    with elitism
  • Same as CGA of Rudolph (1994) except for elitism
  • Suzuki (1995) only considers unique states
    (populations)
  • Rudolph (1994) includes redundant states
  • With N = population size and B = no. of
    bits per chromosome:

  • C(2^B + N - 1, N) unique states in Suzuki (1995),
  • 2^(NB) states in Rudolph (1994) (much larger than
    number of unique states above)
  • Above affects bookkeeping but does not
    fundamentally change relative results of Suzuki
    (1995) and Rudolph (1994)

12
Convergence Under Elitism
  • In both CGA case (Rudolph, 1994) and case with
    elitism (Suzuki, 1995) the limit
    p∞^T = limk→∞ pk^T exists
  • (dimension of p∞ differs according to
    definition of states, unique or nonunique as on
    previous slide)
  • Suzuki (1995) assumes each population includes
    one elite element and that crossover probability
    Pc = 1
  • Let p∞,j represent j-th element of p∞, and J
    represent indices j where population j includes a
    chromosome achieving L(θ*)
  • Then from Suzuki (1995): Σj∈J p∞,j = 1
  • Implies GA with elitism converges in probability
    to set of optima

13
Calculation of Stationary Distribution
  • Markov chain theory provides useful conceptual
    device
  • Practical calculation difficult due to explosive
    growth of number of possible populations (states)
  • Growth is in terms of factorials of N and bit
    string length (B)
  • Practical calculation of pk usually impossible
    due to difficulty in getting P
  • Transition matrix can be very large in practice
  • E.g., if N = B = 6, P is a 10^8 × 10^8 matrix!
  • Real problems have N and B much larger than 6
  • Ongoing work attempts to severely reduce
    dimension by limiting states to only most
    important (e.g., Spears, 1999 Moey and Rowe,
    2004)
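The state-count explosion can be checked directly by counting unordered populations as multisets of N chromosomes drawn from the 2^B possible bit strings (the unique-state count used for Suzuki (1995) above); the helper name is an assumption:

```python
from math import comb

def n_unique_populations(N, B):
    """Number of unordered populations: multisets of size N drawn from the
    2**B possible B-bit chromosomes, i.e. C(2**B + N - 1, N)."""
    return comb(2 ** B + N - 1, N)

tiny = n_unique_populations(3, 3)   # a toy GA: only 120 states
big = n_unique_populations(6, 6)    # N = B = 6: roughly 1.2e8 states
```

Already at N = B = 6 the count exceeds 10^8, matching the 10^8 × 10^8 transition matrix cited above; realistic N and B are hopelessly beyond direct computation.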

14
Example 10.2 from ISSO Markov Chain Calculations
for Small-Scale Implementation
  • Consider scalar-valued loss L(θ) for θ ∈
    [0, 15]
  • Function has a local and a global minimum; plot
    on next slide
  • Several GA implementations with very small
    population sizes (N) and numbers of bits (B)
  • Small-scale implementations imply Markov
    transition matrices are computable
  • But still not trivial, as matrix dimensions
    range from approximately 2000 × 2000 to 4000 × 4000

15
Loss Function for Example 10.2 in ISSO. Markov
chain theory provides probability of finding
solution (θ* = 15) in given number of iterations
16
Example 10.2 (contd) Probability Calculations
for Very Small-Scale GAs
17
Summary of GA Convergence Theory
  • Schema theory (Holland, 1975) was most popular
    method for theoretical analysis until
    approximately mid-1990s
  • Schema theory not fully rigorous and not fully
    connected to actual algorithm performance
  • Markov chain theory provides more formal means of
    convergence (and convergence rate) analysis
  • Rudolph (1994) used Markov chains to provide
    largely negative result on convergence for
    canonical GAs
  • Canonical GA does not converge to optimum
  • Suzuki (1995) considered GAs with elitism; unlike
    in Rudolph (1994), the GA is now convergent
  • Challenges exist in practical calculation of
    Markov transition matrix

18
No Free Lunch Theorems (Reprise, Chap. 1)
  • No free lunch (NFL) theorems apply to EC
    algorithms
  • Theorems imply there can be no universally
    efficient EC algorithm
  • Performance of one algorithm when averaged over
    all problems is identical to that of any other
    algorithm
  • Suppose EC algorithm A applied to loss L
  • Let Lmin(n) denote lowest loss value from most
    recent N population elements after n ≥ N unique
    function evaluations
  • Consider the probability that Lmin(n) equals a
    given value after n unique evaluations of the loss

NFL theorems state that the sum of above
probabilities over all loss functions is
independent of A
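The averaging claim can be brute-force checked on a tiny search space: for deterministic, non-revisiting samplers, the average number of evaluations to first hit a global minimizer is identical for any two sampling orders once averaged over all loss functions. The domain size, value set, and the two specific orders below are arbitrary illustrations:

```python
from itertools import product

def avg_evals_to_min(order, values=(0, 1, 2)):
    """Average, over ALL loss functions f: {0..3} -> values, of the number of
    unique evaluations until a global minimizer is first sampled."""
    total = count = 0
    for f in product(values, repeat=len(order)):
        best = min(f)                 # global minimum value of this function
        for t, x in enumerate(order, start=1):
            if f[x] == best:          # first time the minimum value is seen
                total += t
                break
        count += 1
    return total / count

a = avg_evals_to_min([0, 1, 2, 3])    # algorithm A: forward scan
b = avg_evals_to_min([2, 0, 3, 1])    # algorithm B: some other fixed order
# NFL: averaged over all 3**4 functions, a and b are exactly equal
```

The equality is exact, not approximate: relabeling the domain by the sampling order is a bijection on the set of all loss functions, so both algorithms see the same multiset of value sequences.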
19
Comparison of Algorithms for Stochastic
Optimization in Chaps. 2-10 of ISSO
  • Table next slide is rough summary of relative
    merits of several algorithms for stochastic
    optimization
  • Comparisons based on semi-subjective impressions
    from numerical experience (author and others) and
    theoretical or analytical evidence
  • NFL theorems not generally relevant, as only
    typical problems of interest are considered, not
    all possible problems
  • Table does not consider root-finding per se
  • Table is for basic implementation forms of
    algorithms
  • Ratings range over L (low), ML (medium-low), M
    (medium), MH (medium-high), and H (high)
  • These scales are for stochastic optimization
    setting and have no meaning relative to classical
    deterministic methods

20
Comparison of Algorithms