CS 478 Machine Learning - PowerPoint PPT Presentation
1
CS 478 - Machine Learning
  • Genetic Algorithms (I)

2
Darwin's Origin of Species: Basic Principles (I)
  • Individuals survive based on their ability to
    adapt to the pressures of their environment
    (i.e., their fitness)
  • Fitter individuals tend to have more offspring,
    thus driving the population as a whole towards
    favorable traits
  • During reproduction, the traits found in parents
    are passed onto their offspring
  • In sexual reproduction, the chromosomes of the
    offspring are a mix of those of their parents
  • The traits of offspring are partially inherited
    from their parents and partially the result of
    new genes/traits created during the process of
    reproduction
  • Nature produces individuals with differing traits
  • Over long periods, variations can accumulate,
    producing entirely new species whose traits make
    them especially suited to particular ecological
    niches

3
Darwin's Origin of Species: Basic Principles (II)
  • Evolution is effected via two main genetic
    mechanisms
  • Crossover
  • Take 2 candidate chromosomes
  • Randomly choose 1 or 2 crossover points
  • Swap the respective components to create 2 new
    chromosomes
  • Mutation
  • Choose a single offspring
  • Randomly change some aspect of it

4
Intuition
  • Essentially a pseudo-random walk through the
    population with the aim of maximizing some
    fitness function
  • From a starting population
  • Crossover ensures exploitation
  • Mutation ensures exploration
  • GAs are based on these principles

5
Natural vs Artificial
  • Individual ↔ Candidate solution
  • Population ↔ Set of candidate solutions
  • Fitness ↔ Measure of quality of solutions
  • Chromosome ↔ Encoding of a candidate solution
  • Gene ↔ Part of the encoding of a solution
  • Crossover and mutation ↔ Search operators
  • Natural selection ↔ Re-use of good (sub-)solutions

6
Phenotype vs. Genotype
  • In Genetic Algorithms (GAs), there is a clear
    distinction between phenotype (i.e., the actual
    individual or solution) and genotype (i.e., the
    individual's encoding or chromosome). The GA, as
    in nature, acts on genotypes only. Hence, the
    natural process of growth must be implemented as
    a genotype-to-phenotype decoding.
  • The original formulation of genetic algorithms
    relied on a binary-encoding of solutions, where
    chromosomes are strings of 0s and 1s. Individuals
    can then be anything so long as there is a way of
    encoding/decoding them using binary strings.

7
Simple GA
  • One often distinguishes between two types of
    genetic algorithms, based on whether there is a
    complete or partial replacement of the population
    between generations (i.e., whether there is
    overlap or not between generations).
  • When there is complete replacement, the GA is
    said to be generational, whilst when replacement
    is only partial, the GA is said to be steady-state.
    If you look carefully at the algorithms below,
    you will notice that even the generational GA
    gives only partial replacement when cloning takes
    place (i.e., cloning causes overlap between
    generations). Moreover, if steady-state is
    performed on the whole population (rather than on
    a proportion of fittest individuals), then the GA
    is generational.
  • Hence, the distinction is more a matter of how
    reproduction takes place than a matter of overlap.

8
Generational GA
  • Randomly generate a population of chromosomes
  • While (termination condition not met)
  • Decode chromosomes into individuals
  • Evaluate fitness of all individuals
  • Select fittest individuals
  • Generate new population by cloning, crossover and
    mutation

9
Steady-state GA
  • Randomly generate a population of chromosomes
  • While (termination condition not met)
  • Decode chromosomes into individuals
  • Evaluate fitness of all individuals
  • Select fittest individuals
  • Produce offspring by crossover and mutation
  • Replace weakest individuals with offspring

10
Genetic Encoding / Decoding
  • We focus on binary encodings of solutions
  • We first look at single parameters (i.e., single
    gene chromosomes) and then vectors of parameters
    (i.e., multi-gene chromosomes)

11
Integer Parameters
  • Let p be the parameter to be encoded. There are
    three distinct cases to consider
  • p takes values from {0, 1, ..., 2^N - 1} for some N
  • Then p can be encoded directly by its equivalent
    binary representation
  • p takes values from {M, M+1, ..., M+2^N - 1} for
    some M and N
  • Then (p - M) can be encoded directly by its
    equivalent binary representation
  • p takes values from {0, 1, ..., L-1} for some L
    such that there exists no N for which L = 2^N
  • Then there are two possibilities: clipping or
    scaling

12
Clipping
  • Clipping consists of taking N = ⌊log2(L)⌋ + 1
    bits and encoding all parameter values 0 ≤ p ≤ L-2
    by their equivalent binary representation, letting
    all other N-bit strings serve as encodings of
    p = L-1
  • For example, assume p takes values in {0, 1, 2,
    3, 4, 5}, i.e., L = 6. Then N = ⌊log2(6)⌋ + 1 = 3
  • Here, not only is 101 an (expected) encoding of
    p = L-1 = 5, but so are 110 and 111
  • Advantages: easy to implement
  • Disadvantages: strong representational bias,
    i.e., all parameter values between 0 and L-2 have
    a single encoding, whilst the single parameter
    value L-1 has 2^N - L + 1 encodings
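As a minimal sketch (the function name is ours, not from the slides), clipped decoding can be written as:

```python
import math

def clip_decode(bits: str, L: int) -> int:
    """Decode an N-bit string under clipping: any code >= L-1 maps to L-1."""
    return min(int(bits, 2), L - 1)

# L = 6 needs N = floor(log2(6)) + 1 = 3 bits
N = math.floor(math.log2(6)) + 1
assert N == 3
# 101, 110 and 111 all decode to p = 5; every other value has one encoding
assert [clip_decode(format(v, "03b"), 6) for v in range(8)] == [0, 1, 2, 3, 4, 5, 5, 5]
```

The final assertion makes the representational bias concrete: the single value 5 absorbs the 2^N - L + 1 = 3 surplus codes.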

13
Scaling
  • Scaling consists of taking N = ⌊log2(L)⌋ + 1 bits
    and encoding p by the binary representation of an
    integer value e such that p = round(e(L-1)/(2^N - 1))
  • For example, assume p takes values in {0, 1, 2,
    3, 4, 5}, i.e., L = 6. Then N = ⌊log2(6)⌋ + 1 = 3
  • Here, the binary encodings are not generally
    numerically equivalent to the integer values they
    code
  • Advantages: easy to implement and smaller
    representational bias than clipping (each value
    of p has 1 or 2 encodings, with double encodings
    evenly spread over the values of p)
  • Disadvantages: more computation needed and still
    a small representational bias
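A short sketch of scaled decoding (our helper name; rounding is one plausible reading of the slide's relation between p and e):

```python
def scale_decode(e: int, L: int, N: int) -> int:
    """Decode the N-bit integer e as p = round(e * (L-1) / (2**N - 1))."""
    return round(e * (L - 1) / (2**N - 1))

# With L = 6 and N = 3, the 8 codes spread over the 6 parameter values,
# so each value of p has 1 or 2 encodings, evenly spread
decoded = [scale_decode(e, 6, 3) for e in range(8)]
assert decoded == [0, 1, 1, 2, 3, 4, 4, 5]
```

Unlike clipping, no single value hoards the surplus codes; here 1 and 4 each get two encodings.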

14
Real-valued Parameters (I)
  • Real values may be encoded as fixed point numbers
    or integers via scaling and quantization
  • If p ranges over [min, max], then p is encoded by
    the binary representation of the integer part of
    (p - min)/(max - min) × (2^N - 1)
15
Real-valued Parameters (II)
  • Real values may also be encoded using thermometer
    encoding
  • Let T be an integer greater than 1
  • Thermometer encoding of real values on T bits
    consists of normalizing all real values to the
    interval [0, 1] and converting each normalized
    value x to a bit-string of ⌊xT⌋ 1s followed by
    trailing 0s as needed
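The encoding above can be sketched in a few lines (function name is ours):

```python
import math

def thermometer(x: float, T: int) -> str:
    """Encode normalized x in [0, 1] as floor(x*T) 1s followed by trailing 0s."""
    ones = min(math.floor(x * T), T)
    return "1" * ones + "0" * (T - ones)

assert thermometer(0.0, 5) == "00000"
assert thermometer(0.5, 5) == "11000"
assert thermometer(1.0, 5) == "11111"
```

The "temperature" rises monotonically with x, which is why a single bit-flip tends to change the decoded value only slightly.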

16
Vectors of Parameters
  • Vectors of parameters are encoded on multi-gene
    chromosomes by combining the encodings of each
    individual parameter
  • Let ei = bi0, ..., biN be the encoding of the ith
    of M parameters
  • There are two possibilities for combining the
    ei's onto a chromosome
  • Concatenating: individual encodings simply
    follow each other in some pre-defined order,
    e.g., b10, ..., b1N, ..., bM0, ..., bMN
  • Interleaving: the bits of each individual
    encoding are interleaved, e.g., b10, ..., bM0,
    ..., b1N, ..., bMN
  • The order of parameters in the vector (resp.,
    genes on the chromosome) is important, especially
    for concatenated encodings
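Both combination schemes are one-liners in Python (function names are ours):

```python
def concatenate(encodings: list[str]) -> str:
    """Genes follow each other: b10..b1N, ..., bM0..bMN."""
    return "".join(encodings)

def interleave(encodings: list[str]) -> str:
    """Bits alternate across genes: b10, ..., bM0, ..., b1N, ..., bMN."""
    return "".join("".join(column) for column in zip(*encodings))

genes = ["101", "010"]  # two 3-bit gene encodings
assert concatenate(genes) == "101010"
assert interleave(genes) == "100110"
```

Interleaving keeps corresponding bits of different genes close together, which changes which groupings 1- and 2-point crossover tend to preserve.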

17
Gray Coding (I)
  • A Gray code represents each number in the
    sequence of integers {0, 1, ..., 2^N - 1} as a
    binary string of length N, such that adjacent
    integers have representations that differ in only
    one bit position
  • A number of different Gray codes exist. One
    simple algorithm to produce Gray codes starts
    with all bits set to 0 and successively flips the
    right-most bit that produces a new string
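The standard reflected Gray code (one of the codes the slide alludes to) has a particularly compact conversion:

```python
def binary_to_gray(n: int) -> int:
    """Reflected Gray code: n XOR (n >> 1)."""
    return n ^ (n >> 1)

def gray_to_binary(g: int) -> int:
    """Invert the Gray code by cumulative XOR of right-shifted copies."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Adjacent integers differ in exactly one bit position under Gray coding
for i in range(2**4 - 1):
    assert bin(binary_to_gray(i) ^ binary_to_gray(i + 1)).count("1") == 1
```

The loop checks the defining property directly: no Hamming cliffs between consecutive integers.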

18
Gray Coding (II)
19
Gray Coding (III)
  • Advantages random bit-flips (e.g., during
    mutation) are more likely to produce small
    changes (i.e., there are no Hamming cliffs since
    adjacent integers' representations differ by
    exactly one bit).
  • Disadvantages big changes are rare but bigger
    than with binary codes.
  • For example, consider the string 001. There are 3
    possible bit flips, leading to the strings 000,
    011 and 101
  • With standard binary encoding, 2 of the 3 flips
    lead to relatively large changes (from 001 (=1) to
    011 (=3) and from 001 (=1) to 101 (=5),
    respectively)
  • With Gray coding, 2 of the 3 flips produce small
    changes (from 001 (=1) to 000 (=0) and from 001
    (=1) to 011 (=2), respectively)
  • However, the less probable (1 out of 3) flip from
    001 to 101 produces a bigger change under Gray
    coding (to 6) than under standard binary encoding
    (to 5)

20
GA operators
  • We will restrict our discussion to binary strings
  • The basic GA operators are
  • Selection
  • Crossover
  • Mutation

21
Selection
  • Selection is the operation by which chromosomes
    are selected for reproduction
  • Chromosomes corresponding to individuals with a
    higher fitness have a higher probability of being
    selected
  • There are a number of possible selection schemes
    (we discuss some here)
  • Fitness-based selection makes the following
    assumptions
  • There exists a known quality measure Q for the
    solutions of the problem
  • Finding a solution can be achieved by maximizing
    Q
  • For all potential solutions (good or bad), Q is
    positive.
  • A chromosome's fitness is taken to be the quality
    measure of the individual it encodes

22
Fitness-proportionate Selection
  • This selection scheme is the most widely used in
    GAs
  • Let fi be the fitness value of individual i and
    let favg be the average population fitness
  • Then, the probability of an individual i being
    selected is given by pi = fi / Σj fj = fi / (n · favg),
    where n is the population size

23
Roulette Wheel
  • Fitness-proportionate selection (FPS) can be
    implemented with the roulette-wheel algorithm
  • A wheel is constructed with markers corresponding
    to fitness values
  • For each fitness value fi, the size of the marker
    (i.e., the proportion of the wheel's
    circumference) associated to fi is given by pi as
    defined above
  • Hence, when the wheel is spun, the probability of
    the roulette landing on fi (and thus selecting
    individual i) is given by pi, as expected

24
Vector Representation
  • A vector v of M elements from {1, ..., N} is
    constructed so that each individual i in {1, ...,
    N} has M · pi entries in v
  • A random index r from {1, ..., M} is selected and
    individual v(r) is selected
  • Example
  • 4 individuals such that f1 = f2 = 10, f3 = 15 and
    f4 = 25
  • If M = 12, then v = (1, 1, 2, 2, 3, 3, 3, 4, 4,
    4, 4, 4)
  • Generate r = 6; then individual v(6) = 3 is
    selected
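The vector construction above can be sketched as follows (our helper name; rounding handles fractional M · pi):

```python
def build_selection_vector(fitnesses: list[float], M: int) -> list[int]:
    """Individual i (1-based) gets round(M * p_i) entries, p_i = f_i / sum(f)."""
    total = sum(fitnesses)
    v = []
    for i, f in enumerate(fitnesses, start=1):
        v.extend([i] * round(M * f / total))
    return v

# Slide example: f1 = f2 = 10, f3 = 15, f4 = 25 and M = 12
v = build_selection_vector([10, 10, 15, 25], 12)
assert v == [1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4]
assert v[6 - 1] == 3  # a random index r = 6 selects individual 3
```

Selecting a uniform random index into v then implements fitness-proportionate selection up to the quantization set by M.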

25
Cumulative Distribution
  • A random real-valued number r in [0, 1] is chosen
    and individual i such that Σj<i pj < r ≤ Σj≤i pj
    is selected (if i = 1, the lower bound sum is 0)
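A minimal roulette-wheel implementation of the cumulative-distribution scheme (function name is ours):

```python
import random

def roulette_select(fitnesses: list[float]) -> int:
    """Draw r uniformly from [0, total fitness) and return the first
    individual whose cumulative fitness exceeds r."""
    total = sum(fitnesses)
    r = random.uniform(0.0, total)
    running = 0.0
    for i, f in enumerate(fitnesses):
        running += f
        if r < running:
            return i
    return len(fitnesses) - 1  # guard against floating-point edge cases

# An individual holding all the fitness is always selected
assert roulette_select([0.0, 0.0, 1.0]) == 2
```

Working with raw fitness sums rather than normalized pi's avoids a division per individual.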

26
Discussion (I)
  • The implementation based on cumulative
    distribution is effective but relatively
    inefficient, whilst the implementation based on
    vector representation is efficient but its
    effectiveness depends on M (i.e., the value of M
    determines the level of quantization of the pi's
    and thus accuracy depends on M).
  • Assume that N individuals have to be selected for
    reproduction. The expected number of copies of
    each individual i in the mating pool is
    Ni = N · pi = fi / favg
  • Hence, individuals with above-average fitness
    tend to have more than one copy in the mating
    pool, whilst individuals with below-average
    fitness tend not to be copied. This leads to
    problems with FPS.

27
Discussion (II)
  • Premature convergence
  • Assume an individual X with fi >> favg but fi <<
    fmax is produced in an early generation. As Ni >>
    1, the genes of X quickly spread all over the
    population. At that point, crossover cannot
    generate any new solutions (only mutation can)
    and favg << fmax forever
  • Stagnation
  • Assume that at the end of a run (i.e., in one of
    the later generations) all individuals have a
    relatively high and similar fitness, i.e., fi is
    almost fmax for all i. Then, Ni is almost 1 for
    all i and there is virtually no selective
    pressure
  • Both of these problems can be solved with fitness
    scaling techniques

28
Fitness Scaling
  • Essentially, fitness values are scaled down at
    the beginning and scaled up towards the end
  • There are 3 general scaling methods
  • Linear scaling
  • f is replaced by f' = a·f + b, where a and b are
    chosen such that
  • f'avg = favg (i.e., the scaled average is the
    same as the raw average)
  • f'max = c·favg (c is the number of expected
    copies desired for the best individual; usually
    c = 2)
  • The scaled fitness function may take on negative
    values if there are a few bad individuals with
    fitness much lower than favg and favg is close to
    fmax. One solution is to arbitrarily assign the
    value 0 to all negative fitness values
  • Sigma truncation
  • f is replaced by f' = f - (favg - c·σ), where σ
    is the population standard deviation, c is a
    reasonable multiple of σ (usually 1 ≤ c ≤ 3) and
    negative results are arbitrarily set to 0.
    Truncation removes the problem of scaling to
    negative values. (Note that truncated fitness
    values may also be scaled if desired)
  • Power law scaling
  • f is replaced by f' = f^k for some suitable k.
    This method is not used very often. In general, k
    is problem-dependent and may require dynamic
    change to stretch or shrink the range as needed
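Linear scaling can be sketched by solving the two constraints for a and b (the closed form below is our derivation, not from the slides):

```python
def linear_scale(fitnesses: list[float], c: float = 2.0) -> list[float]:
    """Linear scaling f' = a*f + b with f'_avg = f_avg and f'_max = c*f_avg.
    Solving the two constraints gives a = (c-1)*f_avg/(f_max - f_avg) and
    b = f_avg*(1 - a); negative scaled values are clamped to 0."""
    favg = sum(fitnesses) / len(fitnesses)
    fmax = max(fitnesses)
    if fmax == favg:  # all fitnesses equal: nothing to stretch
        return list(fitnesses)
    a = (c - 1) * favg / (fmax - favg)
    b = favg * (1 - a)
    return [max(0.0, a * f + b) for f in fitnesses]

scaled = linear_scale([1.0, 2.0, 3.0], c=2.0)
assert scaled == [0.0, 2.0, 4.0]  # average preserved (2.0), max = 2 * avg
```

Note the clamping step corresponds to the slide's fix of assigning 0 to negative scaled fitness values.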

29
Rank Selection
  • All individuals are sorted by increasing values
    of their fitness
  • Then, each individual is assigned a probability
    pi of being selected from some prior probability
    distribution
  • Typical distributions include
  • Linear: pi = a·i + b
  • Negative exponential: pi = a·e^(b·i) + c
  • Rank selection (RS) has little biological
    plausibility. However, it has the following
    desirable features
  • No premature convergence. Because of the ranking
    and the probability distribution imposed on it,
    even less fit individuals will be selected (e.g.,
    let there be 3 individuals such that f1 = 90,
    f2 = 7, f3 = 3, and pi = -0.4·i + 1.3. With FPS,
    p1 = 0.9 >> p2 = 0.07 and p3 = 0.03, so that
    individual 1 comes to saturate the population.
    With RS, p1 = 0.9, p2 = 0.5 and p3 = 0.1, so that
    individual 2 is also selected)
  • No stagnation. Even at the end, N1 > N2 > ...
    (similar argument to above)
  • Explicit fitness values not needed. To order
    individuals, only the ability of comparing pairs
    of solutions is necessary.
  • However, rank selection introduces a reordering
    overhead and makes a theoretical analysis of
    convergence difficult
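A sketch of rank selection using the slide's linear prior (function name and the renormalization of the weights are ours; the slide's a = -0.4, b = 1.3 do not sum to 1 over three ranks):

```python
import random

def rank_select(fitnesses: list[float], a: float = -0.4, b: float = 1.3) -> int:
    """Rank selection with linear prior p_i = a*i + b, where rank i = 1 is
    the fittest (as in the slide's example). Draws from the cumulative sum
    of the (renormalized) rank weights; returns an index into the list."""
    order = sorted(range(len(fitnesses)), key=lambda j: -fitnesses[j])
    weights = [max(0.0, a * (i + 1) + b) for i in range(len(order))]
    r = random.uniform(0.0, sum(weights))
    running = 0.0
    for idx, w in zip(order, weights):
        running += w
        if r < running:
            return idx
    return order[-1]
```

Only the ordering of fitness values matters here: replacing f1 = 90 by f1 = 9000 leaves the selection probabilities unchanged, which is exactly why rank selection avoids premature convergence.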

30
Tournament Selection
  • Tournament selection can be viewed as a noisy
    version of rank selection.
  • The selection process is two-stage
  • Select a group of N (≥ 2) individuals
  • Select the individual with the highest fitness
    from the group and discard all others
  • Tournament selection inherits the advantages of
    rank selection. In addition, it does not require
    global reordering and is more naturally inspired.
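The two-stage process above is a few lines of Python (function name is ours):

```python
import random

def tournament_select(fitnesses: list[float], k: int = 2) -> int:
    """Pick k distinct individuals uniformly at random; return the fittest."""
    group = random.sample(range(len(fitnesses)), k)
    return max(group, key=lambda i: fitnesses[i])

# With a tournament over the whole population, the best always wins
assert tournament_select([1.0, 9.0, 4.0], k=3) == 1
```

The tournament size k tunes selective pressure: k = 2 is a gentle, noisy version of rank selection, while larger k favors the fittest more strongly.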

31
Elitist Selection
  • The idea behind elitism is that at least one copy
    of the best individual in the population is
    always passed onto the next generation.
  • The main advantage is that convergence is
    guaranteed (i.e., if the global maximum is
    discovered, the GA converges to that maximum). By
    the same token, however, there is a risk of being
    trapped in a local maximum.
  • One alternative is to save the best individual so
    far in some kind of register and, at the end of
    each run, to designate it as the solution instead
    of using the best of the last generation.

32
1-point Crossover
  • Here, the chromosomes of the parents are cut at
    some randomly chosen common point and the
    resulting sub-chromosomes are swapped
  • For example
  • P1 = 1010101010 and P2 = 1110001110
  • Crossover point between the 6th and 7th bits
  • Then the offspring are
  • O1 = 1010101110
  • O2 = 1110001010
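The slide's example can be reproduced directly (function name is ours):

```python
def one_point_crossover(p1: str, p2: str, cut: int) -> tuple[str, str]:
    """Swap the tails of two equal-length bit-strings after position cut."""
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

# Slide example: cut between the 6th and 7th bits
o1, o2 = one_point_crossover("1010101010", "1110001110", 6)
assert (o1, o2) == ("1010101110", "1110001010")
```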

33
2-point Crossover
  • Here, the chromosomes are thought of as rings
    with the first and last gene connected (i.e.,
    wrap-around structure)
  • The rings are cut in two sites and the resulting
    sub-rings are swapped
  • For example
  • P1 = 1010101010 and P2 = 1110001110
  • Crossover points are between the 2nd and 3rd
    bits, and between the 6th and 7th bits
  • Then the offspring are
  • O1 = 1110101110
  • O2 = 1010001010
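Under the ring interpretation, swapping the sub-rings exchanges the parts outside the two cut points, which reproduces the slide's offspring (function name is ours):

```python
def two_point_crossover(p1: str, p2: str, c1: int, c2: int) -> tuple[str, str]:
    """Ring interpretation: the segment outside [c1, c2) wraps around, so
    swapping sub-rings exchanges the outer parts of the two chromosomes."""
    o1 = p2[:c1] + p1[c1:c2] + p2[c2:]
    o2 = p1[:c1] + p2[c1:c2] + p1[c2:]
    return o1, o2

# Slide example: cuts after the 2nd and 6th bits
o1, o2 = two_point_crossover("1010101010", "1110001110", 2, 6)
assert (o1, o2) == ("1110101110", "1010001010")
```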

34
Uniform Crossover
  • Here, each gene of the offspring is selected
    randomly from the corresponding genes of the
    parents
  • For example
  • P1 = 1010101010 and P2 = 1110001110
  • Then the offspring could be
  • O = 1110101110
  • Note: this version produces a single offspring
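A sketch of the single-offspring version described above (function name is ours):

```python
import random

def uniform_crossover(p1: str, p2: str) -> str:
    """Each offspring gene is chosen at random from the matching parent genes."""
    return "".join(random.choice(pair) for pair in zip(p1, p2))

child = uniform_crossover("1010101010", "1110001110")
# Every child bit comes from one of the two parents at that position
assert all(c in (a, b) for c, a, b in zip(child, "1010101010", "1110001110"))
```

A two-offspring variant would give the complementary choice at each position to a second child.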

35
Mutation
  • Mutation consists of making (usually small)
    alterations to the values of one or more genes in
    a chromosome
  • In binary chromosomes, it consists of flipping
    random bits of the genotype. For example,
    1010101010 may become 1011101010 if the 4th bit
    is flipped.
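Per-bit mutation with a fixed rate can be sketched as follows (function name and the rate parameter are ours):

```python
import random

def mutate(chrom: str, rate: float = 0.1) -> str:
    """Flip each bit of a binary chromosome independently with probability rate."""
    return "".join(("1" if b == "0" else "0") if random.random() < rate else b
                   for b in chrom)

# With rate = 1.0 every bit flips deterministically
assert mutate("1010101010", rate=1.0) == "0101010101"
```

In practice the rate is kept small (often around 1/length), so mutation supplies exploration without destroying the structure crossover has built up.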

36
Prototypical Steady-state GA
  • P ← p randomly generated hypotheses
  • For each h in P, compute fitness(h)
  • While maxh fitness(h) < threshold
  • Ps ← Select r·p individuals from P (e.g., FPS,
    RS, tournament)
  • Apply crossover to random pairs in Ps and add all
    offspring to Po
  • Select m of the individuals in Po with uniform
    probability and apply mutation (i.e., flip one of
    their bits at random)
  • Pw ← r·p weakest individuals in P
  • P ← (P - Pw) ∪ Po
  • For each h in P, compute fitness(h)
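The loop above can be sketched end to end. This is a toy instantiation, not the course's reference code: we choose the OneMax problem (fitness = number of 1-bits), tournament selection, and 1-point crossover; p, r and m play the roles named on the slide.

```python
import random

def steady_state_ga(p=20, r=0.4, m=2, n_bits=16, threshold=16, seed=0):
    """Sketch of the slide's steady-state loop on the OneMax toy problem."""
    rng = random.Random(seed)
    fitness = lambda h: h.count("1")
    # P <- p randomly generated hypotheses
    P = ["".join(rng.choice("01") for _ in range(n_bits)) for _ in range(p)]
    while max(fitness(h) for h in P) < threshold:
        # Ps <- select r*p individuals (tournament selection of size 2)
        k = int(r * p)
        Ps = [max(rng.sample(P, 2), key=fitness) for _ in range(k)]
        # Apply crossover to random pairs in Ps, adding offspring to Po
        Po = []
        for i in range(0, k - 1, 2):
            cut = rng.randrange(1, n_bits)
            Po.append(Ps[i][:cut] + Ps[i + 1][cut:])
            Po.append(Ps[i + 1][:cut] + Ps[i][cut:])
        # Mutate m offspring: flip one random bit each
        for _ in range(m):
            j, pos = rng.randrange(len(Po)), rng.randrange(n_bits)
            bit = "1" if Po[j][pos] == "0" else "0"
            Po[j] = Po[j][:pos] + bit + Po[j][pos + 1:]
        # Replace the weakest individuals in P with the offspring
        P = sorted(P, key=fitness, reverse=True)[:p - len(Po)] + Po
    return max(P, key=fitness)

best = steady_state_ga()
assert best.count("1") == 16  # loop exits only once the threshold is met
```

Because replacement keeps the fittest p - r·p survivors, the best fitness in P never decreases between generations, and mutation retains a positive probability of improvement at every step.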