Title: Evolutionary Reinforcement Learning Systems
1 Evolutionary Reinforcement Learning Systems
2 Goal
- Two main branches of reinforcement learning:
  - Search the space of value functions that assess the utility of states or state-action pairs (TD methods such as Q-learning).
  - Search the space of behaviors, i.e. decision policies, directly (evolutionary algorithms).
- Describe the evolutionary algorithm approach to reinforcement learning and compare and contrast it with the more common TD methods.
3 Sequential Decision Task
This is an example of a sequential decision task (grid figure not reproduced): for an agent that starts from a1, the optimal action sequence is R, D, R, D, D, R, R, D.
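A small stand-in makes the task concrete in the figure's absence; the 5x5 layout, the chess-style a1..e5 labels, and the top-left start are assumptions, and the sketch only traces the quoted sequence rather than deriving it:

MOVES = {"R": (1, 0), "D": (0, 1)}  # (column step, row step)

def run(sequence, start=(0, 0)):
    """Apply a string of R/D moves and return the visited cells as labels."""
    col, row = start
    path = [f"{chr(ord('a') + col)}{row + 1}"]
    for m in sequence:
        dc, dr = MOVES[m]
        col, row = col + dc, row + dr
        path.append(f"{chr(ord('a') + col)}{row + 1}")
    return path

print(run("RDRDDRRD"))  # a1 -> b1 -> b2 -> c2 -> c3 -> c4 -> d4 -> e4 -> e5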
4 Learning from Reinforcements
- The EA and TD methods have similarities:
  - Both solve difficult sequential decision tasks.
  - Both are normally model-free, whereas dynamic programming approaches learn from a complete mathematical model of the underlying system.
5 Supervised vs. Reinforcement Learning
- In supervised learning, the agent receives the correct decisions paired with specific sensory input.
- In reinforcement learning, the agent does not learn from examples of correct behavior but uses system payoffs as a guide to form effective decision policies.
- A decision maker may not receive feedback after every decision.
- Reinforcement methods are applicable in problems where significant domain knowledge is either unavailable or costly to obtain.
6 TD Reinforcement Learning
- Q-learning update scheme:
  - Q(x,a) ← Q(x,a) + α [r + γ max_b Q(y,b) − Q(x,a)]
  - where x is the current state, a the chosen action, r the immediate payoff, y the resulting state, α the learning rate, and γ the discount factor.
- In the long run, the Q-values converge to the optimal values.
- A reinforcement learning system can thus use the Q-values to evaluate each action that is possible from a given state.
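A minimal tabular sketch of this update in Python; the environment interface (reset/step), the action list, and all constants are illustrative assumptions, not part of the slides:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice of action a in state x
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(x, b)])
            y, r, done = env.step(x, a)  # assumed interface
            # Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))
            target = r + gamma * max(Q[(y, b)] for b in actions)
            Q[(x, a)] += alpha * (target - Q[(x, a)])
            x = y
    return Q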
7 Evolutionary Algorithm
- Evolutionary algorithms are global search techniques.
- They are built on Darwin's theory of evolution by natural selection.
- Numerous potential solutions are encoded in structures called chromosomes.
- During each iteration, the EA evaluates solutions and generates offspring based on the fitness of each solution in the task.
- Substructures, or genes, of the solutions are then modified through genetic operators such as mutation and recombination.
- The idea is that structures that led to good solutions in previous evaluations can be mutated or combined to form even better solutions.
8 Basic Steps in an Evolutionary Algorithm
- Procedure EA
- begin
-   t ← 0
-   initialize P(t)
-   evaluate structures in P(t)
-   while termination condition not satisfied do
-   begin
-     t ← t + 1
-     select P(t) from P(t-1)
-     alter structures in P(t)
-     evaluate structures in P(t)
-   end
- end.
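A runnable Python rendering of this skeleton, assuming a fixed-length binary encoding, fitness-proportionate selection, one-point recombination, and bit-flip mutation; the toy fitness function is a placeholder:

import random

def fitness(chromosome):
    return sum(chromosome)  # toy objective: count of 1-bits

def evolve(pop_size=20, length=16, generations=50, p_mut=0.02):
    # initialize P(0)
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):               # termination condition
        scores = [fitness(c) for c in pop]     # evaluate structures in P(t)
        total = sum(scores) or 1
        def select():                          # select P(t) from P(t-1)
            r = random.uniform(0, total)
            for c, s in zip(pop, scores):
                r -= s
                if r <= 0:
                    return c
            return pop[-1]
        next_pop = []
        while len(next_pop) < pop_size:        # alter structures in P(t)
            a, b = select(), select()
            cut = random.randrange(1, length)  # one-point recombination
            child = a[:cut] + b[cut:]
            child = [g ^ 1 if random.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

best = evolve()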
9 Similarities between EA and RL
- Both methods can learn online through direct interaction with the underlying system.
- Both are adaptive to changes.
- Online learning is often time-consuming and dangerous; therefore, researchers in both EA and TD learning typically train in simulation and then apply the learned control policy to the real system.
10 Differences between the TD and EA Approaches
- The differences can be summarized along three dimensions:
  - Policy representation
  - Credit assignment
  - Memory
11 Policy Representation
- What the methods represent:
  - TD methods form a value function Q(x,a) → v.
  - EA methods use a direct mapping from state to recommended action (e.g. Q(x) → a).
- Conclusion: TD methods try to solve a more difficult problem than the one reinforcement learning poses. In addition to choosing the best action, the value function tells us why it is best.
12 The Size of the Hypothesis Space
- The hypothesis space for a direct policy representation (e.g. Q(x) → a):
  - the total number of such functions is c^n,
  - where n is the number of states and c the number of actions.
- The hypothesis space for a value-function representation (e.g. Q(x,a) → v):
  - the total number of such functions is w^(cn),
  - where w is the number of possible values.
- NOTE: the size of the hypothesis space reflects neither the difficulty of the problem nor the efficiency of the methods that search the space.
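A quick worked example of these counts, with toy numbers that are not from the slides:

n, c, w = 10, 4, 100       # states, actions per state, distinct values

policies = c ** n          # direct mappings state -> action: c^n
value_fns = w ** (c * n)   # mappings (state, action) -> value: w^(cn)

print(policies)            # 4^10   = 1_048_576
print(value_fns)           # 100^40 = 10^80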
13 Credit Assignment
- In TD reinforcement learning, credit is chained backward. In this manner, payoffs are distributed across a sequence of actions, and a single reward value becomes associated with each individual state-decision pair.
- In EA reinforcement learning, rewards are associated only with entire sequences of decisions. Which individual decisions are most responsible for a good or poor decision policy is therefore irrelevant to the EA.
14 Issues in Credit Assignment
- In TD, each update is focused on a single state and action.
- In EA, after a recombination, the evaluation covers the entire sequence of actions.
15 Memory
- TD reinforcement learning maintains statistics for every state-action pair; it thus retains information about both good and bad state-action pairs.
- EA reinforcement learning maintains information only about good state-action pairs. Thus, memory losses can occur.
16 Example
- Unlike in TD, the information content of the EA population actually decreases during learning.
- State loss can also occur in EA reinforcement learning.
17 Issues in RL
- Exploration vs. Exploitation
- Perceptual Aliasing
- Generalization
- Unstable Environments
18 Exploration vs. Exploitation for TD
- Too much exploration can result in lower average rewards, and too little can prevent the learning system from discovering new optimal states.
- Solution 1
- Solution 2
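The slide leaves the two solutions unnamed; two standard TD answers are ε-greedy and softmax (Boltzmann) action selection. A minimal sketch of both, assuming Q-values kept in a dictionary keyed by (state, action):

import math, random

def epsilon_greedy(Q, state, actions, eps=0.1):
    if random.random() < eps:                              # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

def softmax(Q, state, actions, temperature=1.0):
    # Higher temperature -> more exploration; lower -> more exploitation.
    prefs = [math.exp(Q.get((state, a), 0.0) / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]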
19 Exploration vs. Exploitation for EA
- Too much exploration can result in lower average rewards, and too little can prevent the learning system from discovering new optimal states.
- A solution comes from the nature of the EA. Early in an evolutionary search, selection pressure is low because most policies in the population have low fitness. As the evolution proceeds, highly fit policies evolve, increasing the selective pressure within the population.
20 Perceptual Aliasing, or Hidden State
- In real-world situations, the agent often will not have access to complete information about the state of its world.
- TD methods are vulnerable to hidden state: ambiguous state information misleads the TD method.
- Since EA methods associate values with entire decision sequences, credit is based on net results, which makes them less sensitive to aliased states.
21 Example
- Example in TD reinforcement learning (figure not reproduced).
22 Generalization
- Generalization applies policy decisions from one area of the state space to another.
- Since the number of possible world states grows exponentially with the size of the task, a solution is to apply action decisions from observed states to unobserved states (e.g. via neural networks or rule bases); see the sketch below.
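A minimal sketch of the idea, using a linear function approximator in place of a table; the feature representation and step size are illustrative assumptions:

def q_value(weights, features, action):
    """Q(x,a) as a linear function of state features, not a table entry."""
    return sum(w * f for w, f in zip(weights[action], features))

def td_update(weights, features, action, target, alpha=0.05):
    # Gradient step on (target - Q)^2. Every state that shares these
    # features is affected: that is the generalization (and the noise).
    error = target - q_value(weights, features, action)
    weights[action] = [w + alpha * error * f
                       for w, f in zip(weights[action], features)]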
23 Comparison of EA and TD
- Generalization in TD: in some large-scale problems, approximating the value function works well, while it fails in some simple toy problems (though failures on toy problems may matter little in practice). With a discrete table, one update affects one state, whereas in the generalized case an update affects more than one state and can create noise.
- Generalization in EA: since the EA makes less frequent updates and bases them on more global information, it is more difficult for a single observation to distort the global decision policy.
24 Unstable Environments
- The agent must adapt its current decision policy in response to changes that occur in its world (e.g. faulty sensors, new obstructions, etc.).
- Because TD makes constant updates to the decision policy, it should respond to changes as soon as they occur.
- Since EAs do not update any policy until an individual, or a population of individuals, has been completely evaluated over several actions, their response to changes is delayed.
25 Evolutionary Reinforcement Learning Implementations
- Learning Classifier Systems
- SAMUEL
- GENITOR
- SANE
26 Learning Classifier Systems
- Messages trigger classifiers, which are symbolic if-then rules that map sensory input to actions.
- Classifiers are in competition, resolved by a bidding algorithm (see the sketch below).
- Classifiers' messages may trigger new classifiers.
- Learning classifier systems predate TD learning.
- The EA selects, mutates, and recombines the classifiers that received the most credit from the bidding algorithm.
- The results were disappointing.
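A highly simplified sketch of the competition; the rule syntax, bid fraction, and bucket-brigade-style payment are illustrative, not the exact mechanism of any particular classifier system:

import random

class Classifier:
    def __init__(self, condition, action, strength=10.0):
        self.condition = condition      # e.g. "1#0", with '#' as wildcard
        self.action = action
        self.strength = strength

    def matches(self, message):
        return all(c in ("#", m) for c, m in zip(self.condition, message))

    def bid(self, k=0.1):
        return k * self.strength        # bid a fraction of strength

def step(classifiers, message, prev_winner=None):
    matched = [c for c in classifiers if c.matches(message)]
    if not matched:
        return None
    winner = max(matched, key=lambda c: c.bid())
    winner.strength -= winner.bid()     # pay for the right to act
    if prev_winner:                     # credit the rule that set the stage
        prev_winner.strength += winner.bid()
    return winner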
27 SAMUEL
- A system that learns to solve sequential decision problems.
- SAMUEL searches the space of decision policies for sets of condition-action rules.
- Each individual is a rule set that specifies a behavior; each gene is a rule that maps states to actions (see the sketch below).
- Three major components:
  - Problem-specific module: the task environment simulation.
  - Performance module: interacts with the world and obtains payoffs.
  - Learning module: uses a GA to evolve behaviors.
28 GENITOR
- GENITOR uses neural networks to represent the decision policy, which means it generalizes over the state space.
- GENITOR relies solely on its EA to adjust the weights in the NN.
- Each NN (individual) is represented in the population as a sequence of connection weights (see the sketch below).
- Offspring produced by crossover are kept on the basis of their performance.
- Genetic operators are applied asynchronously.
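A minimal sketch of the representation; the operators shown are generic, not GENITOR's exact ones:

import random

# One individual: the flat weight vector of a fixed-topology network.
def make_individual(n_weights):
    return [random.gauss(0.0, 1.0) for _ in range(n_weights)]

def crossover(parent_a, parent_b):
    cut = random.randrange(1, len(parent_a))   # one-point crossover
    return parent_a[:cut] + parent_b[cut:]

def mutate(weights, rate=0.05, scale=0.5):
    return [w + random.gauss(0.0, scale) if random.random() < rate else w
            for w in weights]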
29 SANE
- SANE (Symbiotic, Adaptive Neuro-Evolution) was designed as a fast, efficient method for building NNs in domains where it is not possible to generate training data.
- The NN forms a direct mapping from sensors to actions and provides effective generalization over the state space.
- The individuals that are evaluated are complete NNs.
- Two separate populations (see the sketch below):
  - A population of neurons: the building blocks of NNs.
  - A population of network blueprints: combinations of the building blocks of NNs.
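A minimal sketch of the two-population scheme; the population sizes, neuron width, and credit rule are illustrative assumptions:

import random

N_NEURONS, N_BLUEPRINTS, NEURONS_PER_NET, N_WEIGHTS = 100, 50, 5, 8

# Population 1: neurons, each a weight vector for one hidden unit.
neurons = [[random.gauss(0.0, 1.0) for _ in range(N_WEIGHTS)]
           for _ in range(N_NEURONS)]
# Population 2: blueprints, each a list of indices into the neuron population.
blueprints = [[random.randrange(N_NEURONS) for _ in range(NEURONS_PER_NET)]
              for _ in range(N_BLUEPRINTS)]

def build_network(blueprint):
    """Assemble a complete hidden layer from the neurons a blueprint names."""
    return [neurons[i] for i in blueprint]

def credit_neurons(blueprint, network_score, neuron_scores):
    """A neuron's fitness aggregates the scores of the networks it served in."""
    for i in blueprint:
        neuron_scores.setdefault(i, []).append(network_score)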
30 SANE
- (Figure slide; content not reproduced.)
31 Conclusion
- EA and TD both learn through interactions with the actual system.
- EA and TD do not require a precise mathematical model of the domain.
- The main differences:
  - Policy representation
  - Credit assignment
  - Memory