Title: Reinforcement Learning
Reinforcement Learning

Overview
- Introduction
- Q-learning
- Exploration vs. Exploitation
- Evaluating RL algorithms
- On-Policy Learning: SARSA
- Model-based Q-learning
What Does Q-Learning Learn?
- Does Q-learning give the agent an optimal policy?
Exploration vs. Exploitation
- Q-learning does not explicitly tell the agent what to do
  - it just computes a Q-function Q(s,a) that allows the agent to see, for every state, which action has the highest expected reward
- Given a Q-function, there are two things the agent can do
  - Exploit the knowledge accumulated so far, and choose the action that maximizes Q(s,a) in a given state (greedy behavior)
  - Explore new actions, hoping to improve its estimate of the optimal Q-function, i.e. do not choose the action suggested by the current Q(s,a)
- When to explore and when to exploit?
  - Never exploring may lead to being stuck in a suboptimal course of actions
  - Exploring too much is a waste of the knowledge accumulated via experience
  - Must find the right compromise
Exploration Strategies
- Hard to come up with an optimal exploration policy (the problem is widely studied in statistical decision theory)
- But intuitively, any such strategy should be greedy in the limit of infinite exploration (GLIE), i.e.
  - Try each action an unbounded number of times, to avoid the possibility of missing an optimal action because of an unusually bad series of outcomes (we discussed this before)
  - Choose the predicted best action when, in the limit, it has found the optimal value function/policy
- We will look at a few exploration strategies
  - ε-greedy
  - soft-max
  - Optimism in the face of uncertainty
ε-greedy
- Choose a random action with probability ε and a best action with probability 1 - ε
- Eventually converges to an optimal policy because it ensures that the first GLIE condition (try every action an unbounded number of times) is satisfied via the ε random selection
- But it is rather slow, because it does not really become fully greedy in the limit
  - It always chooses a non-optimal action with probability ε, while ideally you would want to explore more at the beginning and become greedier as estimates become more accurate
- A possible solution is to vary ε over time (see the sketch below)
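A minimal Python sketch of ε-greedy selection with an ε that decays over time; the Q dictionary, the decay schedule, and the parameter values are illustrative assumptions rather than anything specified on the slide.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one.
    Q is assumed to be a dict mapping (state, action) -> estimated value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay=1e-3):
    """One way to 'vary epsilon over time': explore early, become greedier later."""
    return max(eps_min, eps_start / (1.0 + decay * step))
```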
Soft-Max
- Takes into account improvements in the estimates of the expected reward function Q(s,a)
- Choose action a in state s with a probability proportional to the current estimate of Q(s,a):
  P(a | s) = e^{Q(s,a)/τ} / Σ_{a'} e^{Q(s,a')/τ}
- τ in the formula above influences how randomly actions are chosen
  - if τ is high, the exponentials approach 1, the fraction approaches 1/(number of actions), and each action has approximately the same probability of being chosen (exploration or exploitation?)
  - as τ is reduced, actions with higher Q(s,a) are more likely to be chosen
  - as τ → 0, the exponential with the highest Q(s,a) dominates, and the best action is always chosen (exploration or exploitation?)
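A minimal Python sketch of soft-max (Boltzmann) action selection with temperature τ, matching the behaviour described above; the Q dictionary and parameter handling are illustrative assumptions.

```python
import math
import random

def softmax_action(Q, state, actions, tau):
    """Sample an action with probability proportional to exp(Q(s,a)/tau).
    High tau -> nearly uniform (exploration); tau near 0 -> almost always
    the greedy action (exploitation). Q maps (state, action) -> value."""
    prefs = [Q[(state, a)] / tau for a in actions]
    m = max(prefs)                                # subtract max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    r = random.random() * sum(weights)
    for a, w in zip(actions, weights):
        r -= w
        if r <= 0:
            return a
    return actions[-1]
```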
Optimism under Uncertainty
- Initialize the Q-function to values that encourage exploration
  - Amounts to giving an optimistic estimate for the initial values of Q(s,a)
  - Make the agent believe that there are wonderful rewards scattered all over, so that it explores all over
- May take a long time to converge
  - A state only gets to look bad when all its actions look bad
  - But when all actions lead to states that look good, it takes a long time to retrieve realistic Q-values via sheer exploration
  - Works fast only if the original values are a close approximation of the final values
- This strategy does not work when the dynamics of the environment change over time
  - Exploration happens only in the initial phases of learning, so it cannot keep track of changes in the environment
Optimism under Uncertainty Revised
- Another approach: favor exploration of rarely-tried actions, but stop pursuing them after enough evidence that they have low utility
- This can be done by defining an exploration function
  - f(Q(s,a), N(s,a))
  - where N(s,a) is the number of times a has been tried in s
- Determines how greed (preference for high values of Q) is traded off against curiosity (low values of N)
Exploration Function
- There are many such functions; here is a simple one (reconstructed below)
- This is the function used to select the next action to try in the current state
- R+ is an optimistic estimate of the best possible reward obtainable in any state
- Ne is a fixed parameter that forces the agent to try each action at least that many times in each state
- After that many tries, the agent stops relying on the initial overestimates and uses the potentially more accurate current Q values
- This takes care of the problem of relying on optimistic estimates throughout the whole process
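The "simple" exploration function the slide refers to is not in the extracted text; presumably it is the standard one:

$$
f(u, n) = \begin{cases} R^{+} & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}
$$

where $u = Q(s,a)$, $n = N(s,a)$, $R^{+}$ is the optimistic estimate of the best possible reward, and $N_e$ is the minimum number of tries per state-action pair.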
Modified Q-learning
- Select the next action as a ← argmax_a f(Q(s,a), N(s,a))
- Obviously, the code needs to be modified to maintain the counts N(s,a); see the sketch below
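A minimal Python sketch of Q-learning modified to keep the counts N(s,a) and to select actions through the exploration function; the parameter names and values (R_PLUS, N_E, ALPHA, GAMMA) are illustrative assumptions.

```python
from collections import defaultdict

R_PLUS, N_E = 10.0, 5      # optimistic reward estimate, minimum tries (assumed values)
ALPHA, GAMMA = 0.1, 0.9    # learning rate and discount (assumed values)

Q = defaultdict(float)     # Q[(s, a)] -> estimated value
N = defaultdict(int)       # N[(s, a)] -> number of times a was tried in s

def exploration_value(s, a):
    """f(Q(s,a), N(s,a)): optimistic until (s,a) has been tried N_E times."""
    return R_PLUS if N[(s, a)] < N_E else Q[(s, a)]

def choose_action(s, actions):
    return max(actions, key=lambda a: exploration_value(s, a))

def q_update(s, a, r, s_next, actions):
    N[(s, a)] += 1
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```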
Overview
- Introduction
- Q-learning
- Exploration vs. Exploitation
- Evaluating RL algorithms
- On-Policy Learning: SARSA
- Model-based Q-learning
Evaluating RL Algorithms
- Two possible measures
  - Quality of the learned policy
  - Reward received while looking for the policy
- If there is a lot of time for learning before the agent is deployed, then the quality of the learned policy is the measure to consider
- If the agent has to learn while being deployed, it may not get to the optimal policy for a long time
  - Reward received while learning is the measure to look at, e.g., plot cumulative reward as a function of the number of steps (see the sketch below)
  - One algorithm dominates another if its plot is consistently above the other's
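A small sketch of the cumulative-reward comparison described above, assuming per-step rewards for each algorithm have been logged as a (runs × steps) NumPy array; names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cumulative_reward(reward_traces):
    """reward_traces: dict mapping an algorithm name to an array of per-step
    rewards with shape (num_runs, num_steps). Averaging over several runs
    gives a fairer comparison than a single noisy run."""
    for name, rewards in reward_traces.items():
        cumulative = np.cumsum(rewards, axis=1).mean(axis=0)
        plt.plot(cumulative, label=name)
    plt.xlabel("number of steps")
    plt.ylabel("cumulative reward")
    plt.legend()
    plt.show()
```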
Evaluating RL Algorithms
- Plots for Example 11.7 in the textbook (p. 480), with
  - either a fixed or a variable learning rate α
  - different initial values for Q(s,a)
Evaluating RL Algorithms
- Lots of variability in each algorithm over different runs
  - For a fair comparison, run each algorithm several times and report average behavior
- Relevant statistics of the plot
  - Asymptotic slope: how good the policy is after the algorithm stabilizes
  - Plot minimum: how much reward must be sacrificed before starting to gain (the cost of learning)
  - Zero crossing: how long it takes for the algorithm to recoup its cost of learning
Overview
- Introduction
- Q-learning
- Exploration vs. Exploitation
- Evaluating RL algorithms
- On-Policy Learning: SARSA
- Model-based Q-learning
Learning Before vs. During Deployment
- As we saw earlier, there are two possible modus operandi for our learning agents
  - Act in the environment to learn how it works, i.e. to learn an optimal policy, then use this policy to act (there is a learning phase before deployment)
  - Learn as you go, that is, start operating in the environment right away and learn from actions (learning happens during deployment)
- If there is time to learn before deployment, the agent should do its best to learn as much as possible about the environment
  - even engage in locally suboptimal behaviors, because this will guarantee reaching an optimal policy in the long run
- If learning while at work, suboptimal behaviors could be too costly
Example
- Consider, for instance, our sample grid game
  - the optimal policy is to go up in S0
- But if the agent includes some exploration in its policy (e.g. selects 20% of its actions randomly), exploring in S2 could be dangerous, because it may cause hitting the -100 wall
  - No big deal if the agent is not deployed yet, but not ideal otherwise
- Q-learning would not detect this problem
  - It does off-policy learning, i.e., it focuses on the optimal policy
- On-policy learning addresses this problem
On-policy Learning: SARSA
- On-policy learning learns the value of the policy being followed
  - e.g., act greedily 80% of the time and act randomly 20% of the time
- Better to be aware of the consequences of exploration as it happens, and avoid outcomes that are too costly while acting, rather than looking for the true optimal policy
- SARSA
  - So called because it uses <state, action, reward, state, action> experiences rather than the <state, action, reward, state> experiences used by Q-learning
  - Instead of looking for the best action at every step, it evaluates the actions suggested by the current policy
  - Uses this info to revise the policy
On-policy Learning: SARSA
- Given an experience <s, a, r, s', a'>, SARSA updates Q(s,a) as follows (see the update rules below)
- What's different from Q-learning?
On-policy Learning: SARSA
- Given an experience <s, a, r, s', a'>, SARSA updates Q(s,a) as shown below
- While Q-learning was using the max of the Q-values over the possible next actions
- There is no more MAX operator in the equation; there is instead the Q-value of the action actually suggested by the policy
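The update rules referenced on these two slides are missing from the extracted text; their standard forms are:

$$
\text{SARSA:} \quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma\, Q(s',a') - Q(s,a)\big]
$$
$$
\text{Q-learning:} \quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]
$$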
On-policy Learning: SARSA
- Does SARSA remind you of any other algorithm we have seen before?
Policy Iteration
- Algorithm
  1. π ← an arbitrary initial policy, U ← a vector of utility values, initially all 0
  2. Repeat until no change in π
     (a) Compute new utilities given π and the current U (policy evaluation)
     (b) Update π as if the utilities were correct (policy improvement)
- Policy improvement step: compare the expected value of following the current π_i from s with the expected value of following another action in s; if max_a Σ_{s'} P(s'|s,a) U[s'] exceeds Σ_{s'} P(s'|s,π_i(s)) U[s'], update π_i(s) to that maximizing action
[Figures: backup diagrams for iterations k = 1 and k = 2]
- At k = 1, only immediate rewards are included in the update, as with Q-learning
- At k = 2, SARSA backs up the reward of the next action, rather than the max reward
Comparing SARSA and Q-learning
- For the little 6-state world
  - The policy learned by Q-learning (80% greedy) is to go up in s0, to reach s4 quickly and get the big +10 reward
Comparing SARSA and Q-learning
- The policy learned by SARSA (80% greedy) is to go left in s0
  - Safer, because it avoids the chance of getting the -100 reward in s2
  - but non-optimal ⇒ lower Q-values
SARSA Algorithm
- The action-selection step could be, for instance, any ε-greedy strategy: choose a random action ε of the time, and the max action the rest of the time (see the sketch below)
- If the random step is chosen here and has a bad negative reward, this will affect the value of Q(s,a). The next time the agent is in s, a may no longer be the action selected, because of its lowered Q-value
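A minimal Python sketch of the SARSA loop described on this slide, assuming a hypothetical env object with reset() -> state and step(state, action) -> (reward, next_state, done); parameter values are illustrative.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # assumed values

def epsilon_greedy(Q, s, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env, Q, actions):
    """Run one episode, updating Q with the on-policy SARSA rule."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions)
    done = False
    while not done:
        r, s_next, done = env.step(s, a)
        a_next = epsilon_greedy(Q, s_next, actions)
        # On-policy backup: uses the action actually chosen next (a'),
        # not the max over actions as in Q-learning.
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
```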
Another Example
- Gridworld with
  - deterministic actions up, down, left, right
  - start from S and arrive at G
  - reward of -1 for all transitions, except those into the region marked "Cliff"
  - falling into the cliff causes the agent to be sent back to the start, with r = -100
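A sketch of the cliff gridworld's transition and reward logic as described above; the 4×12 layout and coordinates are assumptions borrowed from the classic cliff-walking example, since the slide does not give exact dimensions.

```python
ROWS, COLS = 4, 12                      # assumed grid size
START, GOAL = (3, 0), (3, 11)           # bottom-left start, bottom-right goal
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (reward, next_state, done) for the cliff world."""
    r, c = state
    dr, dc = MOVES[action]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < COLS - 1:     # stepped into the cliff
        return -100, START, False       # sent back to the start
    if (r, c) == GOAL:
        return -1, (r, c), True
    return -1, (r, c), False
```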
Another Example
- Because of the negative reward for every step taken, the optimal policy over the four standard actions is to take the shortest path along the cliff
- But if the agent adopts an ε-greedy action-selection strategy with ε = 0.1, walking along the cliff is dangerous
- The optimal path that takes exploration into account is to go around, as far as possible from the cliff
Q-learning vs. SARSA
- Q-learning learns the optimal policy, but because it does so without taking exploration into account, it does not do so well while the agent is exploring
  - It occasionally falls into the cliff, so its reward per episode is not that great
- SARSA has better online performance (reward per episode), because it learns to stay away from the cliff while exploring
- But note that if ε → 0, SARSA and Q-learning would asymptotically converge to the optimal policy
Problems with Model-free Methods
- Q-learning and SARSA are model-free methods. What does this mean?
  - They do not need to learn the transition and/or reward model; these are implicitly taken into account via experiences
- Sounds handy, but there is a main disadvantage
  - How often does the agent get to update its Q-estimates? Only after a new experience comes in
  - Great if the agent acts very frequently, not so great if actions are sparse, because it wastes computation time
Model-based Methods
- Idea: learn the MDP and interleave acting and planning
- After each experience,
  - update the transition probabilities and the reward model,
  - do some steps of (asynchronous) value iteration to get better estimates of the state utilities U(s) given the current model and reward function
- Remember that there is the following link between Q-values and utility values (reconstructed below)
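The formula linking Q-values and utilities is missing from the extracted slide; its standard form, for a reward that may depend on s, a, and s', is:

$$
Q(s,a) = \sum_{s'} P(s'\mid s,a)\,\big(R(s,a,s') + \gamma\, U(s')\big), \qquad U(s) = \max_a Q(s,a)
$$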
VI Algorithm
Asynchronous Value Iteration
- The basic version of value iteration applies the Bellman update to all states at every iteration
- This is in fact not necessary
  - On each iteration we can apply the update only to a chosen subset of states
  - Given certain conditions on the value function used to initialize the process, asynchronous value iteration converges to an optimal policy
- Main advantage
  - One can design heuristics that allow the algorithm to concentrate on states that are likely to belong to the optimal policy
  - e.g., if I have no intention of ever doing research in AI, there is no point in exploring the resulting states
  - Much faster convergence
Asynchronous VI Algorithm
- Repeatedly apply the Bellman update U[s] ← max_a Σ_{s'} P(s'|s,a) (R(s,a,s') + γ U[s']) for some state s chosen at each step, rather than for all states
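A minimal Python sketch of asynchronous value iteration, assuming the model is stored as dictionaries P and R (illustrative names); a real implementation would pick the states to update heuristically rather than at random.

```python
import random

def async_value_iteration(states, actions, P, R, U, gamma=0.9, steps=100):
    """Apply a few asynchronous Bellman updates to randomly chosen states.
    P[(s, a)] is assumed to be a dict {s_next: probability};
    R[(s, a, s_next)] is the expected reward for that transition."""
    for _ in range(steps):
        s = random.choice(states)
        U[s] = max(
            sum(p * (R[(s, a, s2)] + gamma * U[s2])
                for s2, p in P[(s, a)].items())
            for a in actions
        )
    return U
```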
Model-based RL Algorithm
controller PrioritizedSweeping
  inputs: S, a set of states; A, a set of actions; γ, the discount; c, the prior count
  internal state: real arrays Q[S,A] and R[S,A,S']; integer array T[S,A,S']; previous state s; previous action a
- Assumes a reward function that is as general as possible, i.e. depending on all of s, a, s'
Annotations on the algorithm:
- T[s,a,s'] counts the events in which performing action a in s generated s'
- R[s,a,s'] is a TD-based estimate of R(s,a,s')
- The inner loop performs asynchronous value iteration steps
- What is the prior count c for?
- Why is the reward inside the summation?
- The transition probability P(s2 | s1, a1) is estimated from the frequency of transitions from s1 to s2 via a1
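A Python sketch of the kind of model-based learner the algorithm describes; the prior count c, the running-average reward estimate, and the count-based probability estimate follow the annotations above, but the variable names and values are illustrative assumptions.

```python
from collections import defaultdict

GAMMA, C = 0.9, 1.0       # discount and prior count (assumed values)

T = defaultdict(int)      # T[(s, a, s2)]: transition counts
R = defaultdict(float)    # R[(s, a, s2)]: running average of observed rewards
Q = defaultdict(float)    # Q[(s, a)]

def observe(s, a, r, s2):
    """Update the learned model with one experience <s, a, r, s2>."""
    T[(s, a, s2)] += 1
    n = T[(s, a, s2)]
    R[(s, a, s2)] += (r - R[(s, a, s2)]) / n   # incremental average

def q_backup(s, a, states, actions):
    """One asynchronous value-iteration step for (s, a), using counts plus
    the prior count C as the transition-probability estimate."""
    total = sum(T[(s, a, s2)] for s2 in states) + C * len(states)
    Q[(s, a)] = sum(
        (T[(s, a, s2)] + C) / total *
        (R[(s, a, s2)] + GAMMA * max(Q[(s2, a2)] for a2 in actions))
        for s2 in states
    )
```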
Discussion
- Which states to update?
  - At least the state s in which the last action was performed
  - Then either select states randomly, or
  - select states that are likely to get their Q-values changed because they can reach states whose Q-values have changed the most
- How many steps of asynchronous value iteration to perform?
  - As many as can be done before having to act again
Q-learning vs. Model-based
- Is it better to learn a model and a utility function, or an action-value function with no model?
  - Still an open question
- Model-based approaches require less data to learn well, but they can be computationally more expensive (time per iteration)
- Q-learning takes longer because it does not enforce consistency among Q-values via the model
  - Especially true when the environment becomes more complex
  - In games such as chess and backgammon, model-based approaches have been more successful than Q-learning methods
- The cost/ease of acting needs to be factored in
Overview
- Introduction
- Q-learning
- Exploration vs. Exploitation
- Evaluating RL algorithms
- On-Policy Learning: SARSA
- Model-Based Methods
- Reinforcement Learning with Features
Problem with State-based Methods
- In all the Q-learning methods that we have seen, the goal is to fill out the state-action matrix with good Q-values
- In order to do that, we need to make sure that the agent's experiences visit all the states
- Problem?
Problem with State-based Methods
- Model-based variations have been shown to handle reasonably well spaces with around 10,000 states
  - two-dimensional maze-like environments
- The real world is much more complex than that
  - Chess and backgammon have astronomically many states (estimates range from roughly 10^20 for backgammon to well over 10^40 for chess)
  - Unfeasible to visit all of them to learn how to play the game!
- Additional problem with state-based methods:
  - information about one state cannot be used by similar states
Alternative Approach
- If we have more knowledge about the world
  - Approximate the Q-function using a function of state/action features
  - Most typical is a linear function of the features
- A linear function of variables X1, ..., Xn is of the form shown below
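The formula is missing from the extracted slide; its standard form, and the corresponding linear approximation of the Q-function used on the following slides, are:

$$
f_{\bar w}(X_1,\dots,X_n) = w_0 + w_1 X_1 + \dots + w_n X_n
$$
$$
Q_{\bar w}(s,a) = w_0 + w_1 F_1(s,a) + \dots + w_n F_n(s,a)
$$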
What Are These Features?
- They are properties of the world states and actions that may be relevant to performing well in the world
- However, if and how they are relevant is not clear enough to create well-defined rules of action (policies)
- For instance, possible features in chess would be
  - approximate material value of each piece (pawn 1, knight 3, bishop 3, rook 5, queen 9)
  - king safety
  - good pawn structure
- Expert players have heuristics for using these features to play successfully, but they cannot be formalized in machine-ready ways
SARSA with Linear Function Approximation
- Suppose that F1, ..., Fn are features of states and actions in our world
- If a new experience <s, a, r, s', a'> is observed, it provides a new estimate to update Q(s,a) (see the formulas below)
SARSA with Linear Function Approximation
- We use this experience to adjust the parameters w1, ..., wn so as to minimize the squared error between the new estimate and the current Q(s,a)
- Does it remind you of anything we have already seen?
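The formulas referred to on these two slides are missing from the extracted text; in their standard form, with learning rate η, the new estimate (target), the error, and the resulting weight update are:

$$
\text{target} = r + \gamma\, Q_{\bar w}(s',a'), \qquad \delta = \text{target} - Q_{\bar w}(s,a)
$$
$$
\text{minimize } \tfrac{1}{2}\,\delta^2 \;\Rightarrow\; w_i \leftarrow w_i + \eta\, \delta\, F_i(s,a)
$$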
Gradient Descent
- So we have an expression of the error over Q(s,a) as a function of the parameters w1, ..., wn
- We want to minimize it
- Gradient descent
  - To find a minimum of a real-valued function f(x1, ..., xn)
  - assign arbitrary values to x1, ..., xn,
  - then repeat, for each xi, the update shown below
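The per-variable update rule (missing from the extraction) is, with step size η:

$$
x_i \leftarrow x_i - \eta\, \frac{\partial f}{\partial x_i}(x_1,\dots,x_n)
$$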
Gradient Descent Search
- Each set of weights defines a point on the error surface
- Given a point on the surface, look at the slope of the surface along the axis formed by each weight, i.e. the partial derivative of the error Err with respect to each weight wj
SARSA with Linear Function Approximation
Algorithm
- Annotations on the algorithm figure: the error over Q(s,a), and the parameter adjustment via gradient descent
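A compact Python sketch of the algorithm this slide annotates, under the same assumptions as the earlier sketches: features(s, a) is a hypothetical helper returning [F1(s,a), ..., Fn(s,a)], and the parameter values are illustrative.

```python
ALPHA, GAMMA = 0.01, 0.9   # assumed learning rate and discount

def q_value(w, feats):
    """Linear approximation Q_w(s,a) = sum_i w_i * F_i(s,a)."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def sarsa_fa_update(w, s, a, r, s2, a2, features):
    """Adjust the weights by gradient descent on the squared TD error."""
    feats = features(s, a)
    delta = r + GAMMA * q_value(w, features(s2, a2)) - q_value(w, feats)
    return [wi + ALPHA * delta * fi for wi, fi in zip(w, feats)]
```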
Example
- 25 grid locations
- The prize could be at one of the corners, or there is no prize
  - If there is no prize, at each time step there is a probability that a prize appears at one of the corners
  - Landing on the prize gives a reward of +10, and the prize disappears
- Monsters can appear at any time at one of the locations marked M
  - The agent gets damaged if a monster appears at the square the agent is on
  - If the agent is already damaged, it receives a reward of -10
- The agent can get repaired by visiting the repair station marked R
- 4 actions: up, down, left and right
  - These move the agent one step, usually in the direction indicated by the name, but sometimes in one of the other directions
- If the agent crashes into an outside wall or one of the interior walls (the thick lines near location R), it remains where it was and receives a reward of -1
- The state consists of 4 components <X, Y, P, D>
  - X is the X-coordinate of the agent
  - Y is the Y-coordinate of the agent
  - P is the position of the prize (P = i if there is a prize at location Pi, plus one value for "no prize")
  - D is Boolean and is true when the agent is damaged
- As the monsters are transient, there is no need to include them as part of the state
- There are thus 5 × 5 × 5 × 2 = 250 states
- The agent does not know any of the story given here
  - It just knows that there are 250 states and 4 actions, which state it is in at every time, and what reward was received at each time
- This game is difficult to learn
  - Visiting R is seemingly innocuous until the agent has determined that being damaged is bad, and that visiting R makes it not damaged
  - It needs to stumble upon this while trying to collect the prizes
  - The states where there is no prize available do not last very long
  - Moreover, it has to learn all this without being given the concept of "damaged"
Feature-based Representation
- F1(s,a) = 1 if action a would most likely take the agent from state s into a location where a monster could appear, and 0 otherwise
- F2(s,a) = 1 if action a would most likely take the agent into a wall, and 0 otherwise
- F3(s,a) = 1 if step a would most likely take the agent towards a prize
- F4(s,a) = 1 if the agent is damaged in state s and action a takes it towards the repair station
- F5(s,a) = 1 if the agent is damaged and action a would most likely take the agent into a location where a monster could appear, and 0 otherwise
  - same as F1(s,a), but only applicable when the agent is damaged
- F6(s,a) = 1 if the agent is damaged in state s, and 0 otherwise
- F7(s,a) = 1 if the agent is not damaged in state s, and 0 otherwise
- F8(s,a) = 1 if the agent is damaged and there is a prize ahead in direction a
- F9(s,a) = 1 if the agent is not damaged and there is a prize ahead in direction a
Feature-based Representation
- F10(s,a) has the value of the x-coordinate in state s if there is a prize at location P0 in state s
  - i.e. the distance from the left wall when there is a prize at P0
- F11(s,a) has the value 4 - x, where x is the horizontal position in state s, if there is a prize at location P0 in state s
  - i.e. the distance from the right wall when there is a prize at P0
- F12(s,a) to F29(s,a) are like F10 and F11 for the other combinations of prize location and distance from each of the 4 walls
  - For the case where the prize is at location P0, the y distance could take the interior wall into account
- Demo: http://www.cs.ubc.ca/spider/poole/demos/rl/sGameFA.html
Discussion
- Finding the right features is difficult
  - The author of TD-Gammon, a program that uses RL to learn to play Backgammon, took over 5 years to come up with a reasonable set of features
  - It reached the performance level of the top three players worldwide