Title: Reinforcement Learning
Reinforcement Learning

Overview
- Introduction
- Q-learning
- Exploration vs. Exploitation
- Evaluating RL algorithms
- On-Policy Learning: SARSA
- Model-based Q-learning
What Does Q-Learning Learn?
- Does Q-learning give the agent an optimal policy?
Exploration vs. Exploitation
- Q-learning does not explicitly tell the agent what to do
  - it just computes a Q-function Q(s,a) that allows the agent to see, for every state, which action has the highest expected reward
- Given a Q-function, there are two things the agent can do
  - Exploit the knowledge accumulated so far, and choose the action that maximizes Q(s,a) in a given state (greedy behavior)
  - Explore new actions, hoping to improve its estimate of the optimal Q-function, i.e. do not choose the action suggested by the current Q(s,a)
- When to explore and when to exploit?
  - Never exploring may lead to being stuck in a suboptimal course of actions
  - Exploring too much is a waste of the knowledge accumulated via experience
  - Must find the right compromise
Exploration Strategies
- Hard to come up with an optimal exploration policy (the problem is widely studied in statistical decision theory)
- But intuitively, any such strategy should be greedy in the limit of infinite exploration (GLIE), i.e.
  - Try each action an unbounded number of times, to avoid the possibility of missing an optimal action because of an unusually bad series of outcomes (we discussed this before)
  - Choose the predicted best action when, in the limit, it has found the optimal value function/policy
- We will look at a few exploration strategies
  - ε-greedy
  - soft-max
  - Optimism in the face of uncertainty
ε-greedy
- Choose a random action with probability ε and a best action with probability 1 - ε
- Eventually converges to an optimal policy because it ensures that the first GLIE condition (try every action an unbounded number of times) is satisfied via the ε random selection
- But it is rather slow, because it does not really become fully greedy in the limit
  - It always chooses a non-optimal action with probability ε, while ideally you would want to explore more at the beginning and become greedier as estimates become more accurate
- A possible solution is to vary ε over time (see the sketch below)
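A minimal Python sketch of ε-greedy selection with an ε that decays over time; the Q dictionary, the decay schedule, and the parameter values are illustrative assumptions rather than anything specified on the slide.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one.
    Q is assumed to be a dict mapping (state, action) -> estimated value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay=1e-3):
    """One way to 'vary epsilon over time': explore early, become greedier later."""
    return max(eps_min, eps_start / (1.0 + decay * step))
```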
Soft-Max
- Takes into account improvements in the estimates of the expected reward function Q(s,a)
- Choose action a in state s with a probability proportional to the current estimate of Q(s,a):
  P(a | s) = e^{Q(s,a)/τ} / Σ_{a'} e^{Q(s,a')/τ}
- τ in the formula above influences how randomly actions are chosen
  - if τ is high, the exponentials approach 1, the fraction approaches 1/(number of actions), and each action has approximately the same probability of being chosen (exploration or exploitation?)
  - as τ is reduced, actions with higher Q(s,a) are more likely to be chosen
  - as τ → 0, the exponential with the highest Q(s,a) dominates, and the best action is always chosen (exploration or exploitation?)
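A minimal Python sketch of soft-max (Boltzmann) action selection with temperature τ, matching the behaviour described above; the Q dictionary and parameter handling are illustrative assumptions.

```python
import math
import random

def softmax_action(Q, state, actions, tau):
    """Sample an action with probability proportional to exp(Q(s,a)/tau).
    High tau -> nearly uniform (exploration); tau near 0 -> almost always
    the greedy action (exploitation). Q maps (state, action) -> value."""
    prefs = [Q[(state, a)] / tau for a in actions]
    m = max(prefs)                                # subtract max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    r = random.random() * sum(weights)
    for a, w in zip(actions, weights):
        r -= w
        if r <= 0:
            return a
    return actions[-1]
```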
Optimism under Uncertainty
- Initialize the Q-function to values that encourage exploration
  - Amounts to giving an optimistic estimate for the initial values of Q(s,a)
  - Make the agent believe that there are wonderful rewards scattered all over, so that it explores all over
- May take a long time to converge
  - A state only gets to look bad when all its actions look bad
  - But when all actions lead to states that look good, it takes a long time to retrieve realistic Q-values via sheer exploration
  - Works fast only if the original values are a close approximation of the final values
- This strategy does not work when the dynamics of the environment change over time
  - Exploration happens only in the initial phases of learning, so it cannot keep track of changes in the environment
Optimism under Uncertainty Revised
- Another approach: favor exploration of rarely-tried actions, but stop pursuing them after enough evidence that they have low utility
- This can be done by defining an exploration function
  - f(Q(s,a), N(s,a))
  - where N(s,a) is the number of times a has been tried in s
- Determines how greed (preference for high values of Q) is traded off against curiosity (low values of N)
Exploration Function
- There are many such functions; here is a simple one (reconstructed below)
- This is the function used to select the next action to try in the current state
- R+ is an optimistic estimate of the best possible reward obtainable in any state
- Ne is a fixed parameter that forces the agent to try each action at least that many times in each state
- After that many tries, the agent stops relying on the initial overestimates and uses the potentially more accurate current Q values
- This takes care of the problem of relying on optimistic estimates throughout the whole process
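The "simple" exploration function the slide refers to is not in the extracted text; presumably it is the standard one:

$$
f(u, n) = \begin{cases} R^{+} & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}
$$

where $u = Q(s,a)$, $n = N(s,a)$, $R^{+}$ is the optimistic estimate of the best possible reward, and $N_e$ is the minimum number of tries per state-action pair.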
Modified Q-learning
- Select the next action as a ← argmax_a f(Q(s,a), N(s,a))
- Obviously, the code needs to be modified to maintain the counts N(s,a); see the sketch below
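A minimal Python sketch of Q-learning modified to keep the counts N(s,a) and to select actions through the exploration function; the parameter names and values (R_PLUS, N_E, ALPHA, GAMMA) are illustrative assumptions.

```python
from collections import defaultdict

R_PLUS, N_E = 10.0, 5      # optimistic reward estimate, minimum tries (assumed values)
ALPHA, GAMMA = 0.1, 0.9    # learning rate and discount (assumed values)

Q = defaultdict(float)     # Q[(s, a)] -> estimated value
N = defaultdict(int)       # N[(s, a)] -> number of times a was tried in s

def exploration_value(s, a):
    """f(Q(s,a), N(s,a)): optimistic until (s,a) has been tried N_E times."""
    return R_PLUS if N[(s, a)] < N_E else Q[(s, a)]

def choose_action(s, actions):
    return max(actions, key=lambda a: exploration_value(s, a))

def q_update(s, a, r, s_next, actions):
    N[(s, a)] += 1
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```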
Overview
- Introduction
- Q-learning
- Exploration vs. Exploitation
- Evaluating RL algorithms
- On-Policy Learning: SARSA
- Model-based Q-learning
Evaluating RL Algorithms
- Two possible measures
  - Quality of the learned policy
  - Reward received while looking for the policy
- If there is a lot of time for learning before the agent is deployed, then the quality of the learned policy is the measure to consider
- If the agent has to learn while being deployed, it may not get to the optimal policy for a long time
  - Reward received while learning is the measure to look at, e.g., plot cumulative reward as a function of the number of steps (see the sketch below)
  - One algorithm dominates another if its plot is consistently above the other's
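A small sketch of the cumulative-reward comparison described above, assuming per-step rewards for each algorithm have been logged as a (runs × steps) NumPy array; names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cumulative_reward(reward_traces):
    """reward_traces: dict mapping an algorithm name to an array of per-step
    rewards with shape (num_runs, num_steps). Averaging over several runs
    gives a fairer comparison than a single noisy run."""
    for name, rewards in reward_traces.items():
        cumulative = np.cumsum(rewards, axis=1).mean(axis=0)
        plt.plot(cumulative, label=name)
    plt.xlabel("number of steps")
    plt.ylabel("cumulative reward")
    plt.legend()
    plt.show()
```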
Evaluating RL Algorithms
- Plots for Example 11.7 in the textbook (p. 480), with
  - either a fixed or a variable learning rate α
  - different initial values for Q(s,a)
Evaluating RL Algorithms
- Lots of variability in each algorithm over different runs
  - For a fair comparison, run each algorithm several times and report average behavior
- Relevant statistics of the plot
  - Asymptotic slope: how good the policy is after the algorithm stabilizes
  - Plot minimum: how much reward must be sacrificed before starting to gain (the cost of learning)
  - Zero crossing: how long it takes for the algorithm to recoup its cost of learning
Overview
- Introduction
- Q-learning
- Exploration vs. Exploitation
- Evaluating RL algorithms
- On-Policy Learning: SARSA
- Model-based Q-learning
Learning Before vs. During Deployment
- As we saw earlier, there are two possible modus operandi for our learning agents
  - Act in the environment to learn how it works, i.e. to learn an optimal policy, then use this policy to act (there is a learning phase before deployment)
  - Learn as you go, that is, start operating in the environment right away and learn from actions (learning happens during deployment)
- If there is time to learn before deployment, the agent should do its best to learn as much as possible about the environment
  - even engage in locally suboptimal behaviors, because this will guarantee reaching an optimal policy in the long run
- If learning while at work, suboptimal behaviors could be too costly
Example
- Consider, for instance, our sample grid game
  - the optimal policy is to go up in S0
- But if the agent includes some exploration in its policy (e.g. selects 20% of its actions randomly), exploring in S2 could be dangerous, because it may cause hitting the -100 wall
  - No big deal if the agent is not deployed yet, but not ideal otherwise
- Q-learning would not detect this problem
  - It does off-policy learning, i.e., it focuses on the optimal policy
- On-policy learning addresses this problem
On-policy Learning: SARSA
- On-policy learning learns the value of the policy being followed
  - e.g., act greedily 80% of the time and act randomly 20% of the time
- Better to be aware of the consequences of exploration as it happens, and avoid outcomes that are too costly while acting, rather than looking for the true optimal policy
- SARSA
  - So called because it uses <state, action, reward, state, action> experiences rather than the <state, action, reward, state> experiences used by Q-learning
  - Instead of looking for the best action at every step, it evaluates the actions suggested by the current policy
  - Uses this info to revise the policy
On-policy Learning: SARSA
- Given an experience <s, a, r, s', a'>, SARSA updates Q(s,a) as follows (see the update rules below)
- What's different from Q-learning?
On-policy Learning: SARSA
- Given an experience <s, a, r, s', a'>, SARSA updates Q(s,a) as shown below
- While Q-learning was using the max of the Q-values over the possible next actions
- There is no more MAX operator in the equation; there is instead the Q-value of the action actually suggested by the policy
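The update rules referenced on these two slides are missing from the extracted text; their standard forms are:

$$
\text{SARSA:} \quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma\, Q(s',a') - Q(s,a)\big]
$$
$$
\text{Q-learning:} \quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]
$$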
On-policy Learning: SARSA
- Does SARSA remind you of any other algorithm we have seen before?
Policy Iteration
- Algorithm
  1. π ← an arbitrary initial policy, U ← a vector of utility values, initially all 0
  2. Repeat until no change in π
     (a) Compute new utilities given π and the current U (policy evaluation)
     (b) Update π as if the utilities were correct (policy improvement)
- Policy improvement step: compare the expected value of following the current π_i from s with the expected value of following another action in s; if max_a Σ_{s'} P(s'|s,a) U[s'] exceeds Σ_{s'} P(s'|s,π_i(s)) U[s'], update π_i(s) to that maximizing action
[Figures: backup diagrams for iterations k = 1 and k = 2]
- At k = 1, only immediate rewards are included in the update, as with Q-learning
- At k = 2, SARSA backs up the reward of the next action, rather than the max reward
Comparing SARSA and Q-learning
- For the little 6-state world
  - The policy learned by Q-learning (80% greedy) is to go up in s0, to reach s4 quickly and get the big +10 reward
Comparing SARSA and Q-learning
- The policy learned by SARSA (80% greedy) is to go left in s0
  - Safer, because it avoids the chance of getting the -100 reward in s2
  - but non-optimal ⇒ lower Q-values
SARSA Algorithm
- The action-selection step could be, for instance, any ε-greedy strategy: choose a random action ε of the time, and the max action the rest of the time (see the sketch below)
- If the random step is chosen here and has a bad negative reward, this will affect the value of Q(s,a). The next time the agent is in s, a may no longer be the action selected, because of its lowered Q-value
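A minimal Python sketch of the SARSA loop described on this slide, assuming a hypothetical env object with reset() -> state and step(state, action) -> (reward, next_state, done); parameter values are illustrative.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # assumed values

def epsilon_greedy(Q, s, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env, Q, actions):
    """Run one episode, updating Q with the on-policy SARSA rule."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions)
    done = False
    while not done:
        r, s_next, done = env.step(s, a)
        a_next = epsilon_greedy(Q, s_next, actions)
        # On-policy backup: uses the action actually chosen next (a'),
        # not the max over actions as in Q-learning.
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
```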
Another Example
- Gridworld with
  - deterministic actions up, down, left, right
  - start from S and arrive at G
  - reward of -1 for all transitions, except those into the region marked "Cliff"
  - falling into the cliff causes the agent to be sent back to the start, with r = -100
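A sketch of the cliff gridworld's transition and reward logic as described above; the 4×12 layout and coordinates are assumptions borrowed from the classic cliff-walking example, since the slide does not give exact dimensions.

```python
ROWS, COLS = 4, 12                      # assumed grid size
START, GOAL = (3, 0), (3, 11)           # bottom-left start, bottom-right goal
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (reward, next_state, done) for the cliff world."""
    r, c = state
    dr, dc = MOVES[action]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < COLS - 1:     # stepped into the cliff
        return -100, START, False       # sent back to the start
    if (r, c) == GOAL:
        return -1, (r, c), True
    return -1, (r, c), False
```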
Another Example
- Because of the negative reward for every step taken, the optimal policy over the four standard actions is to take the shortest path along the cliff
- But if the agent adopts an ε-greedy action-selection strategy with ε = 0.1, walking along the cliff is dangerous
- The optimal path that takes exploration into account is to go around, as far as possible from the cliff
Q-learning vs. SARSA
- Q-learning learns the optimal policy, but because it does so without taking exploration into account, it does not do so well while the agent is exploring
  - It occasionally falls into the cliff, so its reward per episode is not that great
- SARSA has better online performance (reward per episode), because it learns to stay away from the cliff while exploring
- But note that if ε → 0, SARSA and Q-learning would asymptotically converge to the optimal policy
Problems with Model-free Methods
- Q-learning and SARSA are model-free methods. What does this mean?
  - They do not need to learn the transition and/or reward model; these are implicitly taken into account via experiences
- Sounds handy, but there is a main disadvantage
  - How often does the agent get to update its Q-estimates? Only after a new experience comes in
  - Great if the agent acts very frequently, not so great if actions are sparse, because it wastes computation time
Model-based Methods
- Idea: learn the MDP and interleave acting and planning
- After each experience,
  - update the transition probabilities and the reward model,
  - do some steps of (asynchronous) value iteration to get better estimates of the state utilities U(s) given the current model and reward function
- Remember that there is the following link between Q-values and utility values (reconstructed below)
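The formula linking Q-values and utilities is missing from the extracted slide; its standard form, for a reward that may depend on s, a, and s', is:

$$
Q(s,a) = \sum_{s'} P(s'\mid s,a)\,\big(R(s,a,s') + \gamma\, U(s')\big), \qquad U(s) = \max_a Q(s,a)
$$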
VI Algorithm
Asynchronous Value Iteration
- The basic version of value iteration applies the Bellman update to all states at every iteration
- This is in fact not necessary
  - On each iteration we can apply the update only to a chosen subset of states
  - Given certain conditions on the value function used to initialize the process, asynchronous value iteration converges to an optimal policy
- Main advantage
  - One can design heuristics that allow the algorithm to concentrate on states that are likely to belong to the optimal policy
  - e.g., if I have no intention of ever doing research in AI, there is no point in exploring the resulting states
  - Much faster convergence
Asynchronous VI Algorithm
- Repeatedly apply the Bellman update U[s] ← max_a Σ_{s'} P(s'|s,a) (R(s,a,s') + γ U[s']) for some state s chosen at each step, rather than for all states
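A minimal Python sketch of asynchronous value iteration, assuming the model is stored as dictionaries P and R (illustrative names); a real implementation would pick the states to update heuristically rather than at random.

```python
import random

def async_value_iteration(states, actions, P, R, U, gamma=0.9, steps=100):
    """Apply a few asynchronous Bellman updates to randomly chosen states.
    P[(s, a)] is assumed to be a dict {s_next: probability};
    R[(s, a, s_next)] is the expected reward for that transition."""
    for _ in range(steps):
        s = random.choice(states)
        U[s] = max(
            sum(p * (R[(s, a, s2)] + gamma * U[s2])
                for s2, p in P[(s, a)].items())
            for a in actions
        )
    return U
```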
Model-based RL Algorithm
controller PrioritizedSweeping
  inputs: S, a set of states; A, a set of actions; γ, the discount; c, the prior count
  internal state: real arrays Q[S,A] and R[S,A,S']; integer array T[S,A,S']; previous state s; previous action a
- Assumes a reward function that is as general as possible, i.e. depending on all of s, a, s'
Annotations on the algorithm:
- T[s,a,s'] counts the events in which performing action a in s generated s'
- R[s,a,s'] is a TD-based estimate of R(s,a,s')
- The inner loop performs asynchronous value iteration steps
- What is the prior count c for?
- Why is the reward inside the summation?
- The transition probability P(s2 | s1, a1) is estimated from the frequency of transitions from s1 to s2 via a1
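A Python sketch of the kind of model-based learner the algorithm describes; the prior count c, the running-average reward estimate, and the count-based probability estimate follow the annotations above, but the variable names and values are illustrative assumptions.

```python
from collections import defaultdict

GAMMA, C = 0.9, 1.0       # discount and prior count (assumed values)

T = defaultdict(int)      # T[(s, a, s2)]: transition counts
R = defaultdict(float)    # R[(s, a, s2)]: running average of observed rewards
Q = defaultdict(float)    # Q[(s, a)]

def observe(s, a, r, s2):
    """Update the learned model with one experience <s, a, r, s2>."""
    T[(s, a, s2)] += 1
    n = T[(s, a, s2)]
    R[(s, a, s2)] += (r - R[(s, a, s2)]) / n   # incremental average

def q_backup(s, a, states, actions):
    """One asynchronous value-iteration step for (s, a), using counts plus
    the prior count C as the transition-probability estimate."""
    total = sum(T[(s, a, s2)] for s2 in states) + C * len(states)
    Q[(s, a)] = sum(
        (T[(s, a, s2)] + C) / total *
        (R[(s, a, s2)] + GAMMA * max(Q[(s2, a2)] for a2 in actions))
        for s2 in states
    )
```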
Discussion
- Which states to update?
  - At least the state s in which the last action was performed
  - Then either select states randomly, or
  - select states that are likely to get their Q-values changed because they can reach states whose Q-values have changed the most
- How many steps of asynchronous value iteration to perform?
  - As many as can be done before having to act again
Q-learning vs. Model-based
- Is it better to learn a model and a utility function, or an action-value function with no model?
  - Still an open question
- Model-based approaches require less data to learn well, but they can be computationally more expensive (time per iteration)
- Q-learning takes longer because it does not enforce consistency among Q-values via the model
  - Especially true when the environment becomes more complex
  - In games such as chess and backgammon, model-based approaches have been more successful than Q-learning methods
- The cost/ease of acting needs to be factored in
Overview
- Introduction
- Q-learning
- Exploration vs. Exploitation
- Evaluating RL algorithms
- On-Policy Learning: SARSA
- Model-Based Methods
- Reinforcement Learning with Features
Problem with State-based Methods
- In all the Q-learning methods that we have seen, the goal is to fill out the state-action matrix with good Q-values
- In order to do that, we need to make sure that the agent's experiences visit all the states
- Problem?
Problem with State-based Methods
- Model-based variations have been shown to handle reasonably well spaces with around 10,000 states
  - two-dimensional maze-like environments
- The real world is much more complex than that
  - Chess and backgammon have astronomically many states (estimates range from roughly 10^20 for backgammon to well over 10^40 for chess)
  - Unfeasible to visit all of them to learn how to play the game!
- Additional problem with state-based methods:
  - information about one state cannot be used by similar states
Alternative Approach
- If we have more knowledge about the world
  - Approximate the Q-function using a function of state/action features
  - Most typical is a linear function of the features
- A linear function of variables X1, ..., Xn is of the form shown below
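The formula is missing from the extracted slide; its standard form, and the corresponding linear approximation of the Q-function used on the following slides, are:

$$
f_{\bar w}(X_1,\dots,X_n) = w_0 + w_1 X_1 + \dots + w_n X_n
$$
$$
Q_{\bar w}(s,a) = w_0 + w_1 F_1(s,a) + \dots + w_n F_n(s,a)
$$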
What Are These Features?
- They are properties of the world states and actions that may be relevant to performing well in the world
- However, if and how they are relevant is not clear enough to create well-defined rules of action (policies)
- For instance, possible features in chess would be
  - approximate material value of each piece (pawn 1, knight 3, bishop 3, rook 5, queen 9)
  - king safety
  - good pawn structure
- Expert players have heuristics for using these features to play successfully, but they cannot be formalized in machine-ready ways
SARSA with Linear Function Approximation
- Suppose that F1, ..., Fn are features of states and actions in our world
- If a new experience <s, a, r, s', a'> is observed, it provides a new estimate to update Q(s,a) (see the formulas below)
SARSA with Linear Function Approximation
- We use this experience to adjust the parameters w1, ..., wn so as to minimize the squared error between the new estimate and the current Q(s,a)
- Does it remind you of anything we have already seen?
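The formulas referred to on these two slides are missing from the extracted text; in their standard form, with learning rate η, the new estimate (target), the error, and the resulting weight update are:

$$
\text{target} = r + \gamma\, Q_{\bar w}(s',a'), \qquad \delta = \text{target} - Q_{\bar w}(s,a)
$$
$$
\text{minimize } \tfrac{1}{2}\,\delta^2 \;\Rightarrow\; w_i \leftarrow w_i + \eta\, \delta\, F_i(s,a)
$$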
Gradient Descent
- So we have an expression of the error over Q(s,a) as a function of the parameters w1, ..., wn
- We want to minimize it
- Gradient descent
  - To find a minimum of a real-valued function f(x1, ..., xn)
  - assign arbitrary values to x1, ..., xn,
  - then repeat, for each xi, the update shown below
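The per-variable update rule (missing from the extraction) is, with step size η:

$$
x_i \leftarrow x_i - \eta\, \frac{\partial f}{\partial x_i}(x_1,\dots,x_n)
$$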
Gradient Descent Search
- Each set of weights defines a point on the error surface
- Given a point on the surface, look at the slope of the surface along the axis formed by each weight, i.e. the partial derivative of the error Err with respect to each weight wj
SARSA with Linear Function Approximation
Algorithm
- Annotations on the algorithm figure: the error over Q(s,a), and the parameter adjustment via gradient descent
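A compact Python sketch of the algorithm this slide annotates, under the same assumptions as the earlier sketches: features(s, a) is a hypothetical helper returning [F1(s,a), ..., Fn(s,a)], and the parameter values are illustrative.

```python
ALPHA, GAMMA = 0.01, 0.9   # assumed learning rate and discount

def q_value(w, feats):
    """Linear approximation Q_w(s,a) = sum_i w_i * F_i(s,a)."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def sarsa_fa_update(w, s, a, r, s2, a2, features):
    """Adjust the weights by gradient descent on the squared TD error."""
    feats = features(s, a)
    delta = r + GAMMA * q_value(w, features(s2, a2)) - q_value(w, feats)
    return [wi + ALPHA * delta * fi for wi, fi in zip(w, feats)]
```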
Example
- 25 grid locations
- The prize could be at one of the corners, or there is no prize
  - If there is no prize, at each time step there is a probability that a prize appears at one of the corners
  - Landing on the prize gives a reward of +10, and the prize disappears
- Monsters can appear at any time at one of the locations marked M
  - The agent gets damaged if a monster appears at the square the agent is on
  - If the agent is already damaged, it receives a reward of -10
- The agent can get repaired by visiting the repair station marked R
- 4 actions: up, down, left and right
  - These move the agent one step, usually in the direction indicated by the name, but sometimes in one of the other directions
- If the agent crashes into an outside wall or one of the interior walls (the thick lines near location R), it remains where it was and receives a reward of -1
- The state consists of 4 components <X, Y, P, D>
  - X is the X-coordinate of the agent
  - Y is the Y-coordinate of the agent
  - P is the position of the prize (P = i if there is a prize at location Pi, plus one value for "no prize")
  - D is Boolean and is true when the agent is damaged
- As the monsters are transient, there is no need to include them as part of the state
- There are thus 5 × 5 × 5 × 2 = 250 states
- The agent does not know any of the story given here
  - It just knows that there are 250 states and 4 actions, which state it is in at every time, and what reward was received at each time
- This game is difficult to learn
  - Visiting R is seemingly innocuous until the agent has determined that being damaged is bad, and that visiting R makes it not damaged
  - It needs to stumble upon this while trying to collect the prizes
  - The states where there is no prize available do not last very long
  - Moreover, it has to learn all this without being given the concept of "damaged"
Feature-based Representation
- F1(s,a) = 1 if action a would most likely take the agent from state s into a location where a monster could appear, and 0 otherwise
- F2(s,a) = 1 if action a would most likely take the agent into a wall, and 0 otherwise
- F3(s,a) = 1 if step a would most likely take the agent towards a prize
- F4(s,a) = 1 if the agent is damaged in state s and action a takes it towards the repair station
- F5(s,a) = 1 if the agent is damaged and action a would most likely take the agent into a location where a monster could appear, and 0 otherwise
  - same as F1(s,a), but only applicable when the agent is damaged
- F6(s,a) = 1 if the agent is damaged in state s, and 0 otherwise
- F7(s,a) = 1 if the agent is not damaged in state s, and 0 otherwise
- F8(s,a) = 1 if the agent is damaged and there is a prize ahead in direction a
- F9(s,a) = 1 if the agent is not damaged and there is a prize ahead in direction a
Feature-based Representation
- F10(s,a) has the value of the x-coordinate in state s if there is a prize at location P0 in state s
  - i.e. the distance from the left wall when there is a prize at P0
- F11(s,a) has the value 4 - x, where x is the horizontal position in state s, if there is a prize at location P0 in state s
  - i.e. the distance from the right wall when there is a prize at P0
- F12(s,a) to F29(s,a) are like F10 and F11 for the other combinations of prize location and distance from each of the 4 walls
  - For the case where the prize is at location P0, the y distance could take the interior wall into account
- Demo: http://www.cs.ubc.ca/spider/poole/demos/rl/sGameFA.html
Discussion
- Finding the right features is difficult
  - The author of TD-Gammon, a program that uses RL to learn to play Backgammon, took over 5 years to come up with a reasonable set of features
  - It reached the performance level of the top three players worldwide