Title: Chapter 5: Monte Carlo Methods
1. Chapter 5: Monte Carlo Methods
- Monte Carlo methods learn from complete sample returns
- Defined only for episodic tasks
- Monte Carlo methods learn directly from experience
- On-line: no model necessary, and still attains optimality
- Simulated: no need for a full model
2. Monte Carlo Policy Evaluation
- Goal: learn V^π(s)
- Given: some number of episodes under π which contain s
- Idea: average the returns observed after visits to s
- Every-visit MC: average returns for every time s is visited in an episode
- First-visit MC: average returns only for the first time s is visited in an episode
- Both converge asymptotically
3. First-visit Monte Carlo Policy Evaluation

Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state-value function
    Returns(s) ← empty list, for all s ∈ S
Repeat forever:
    Generate an episode using π
    For each state s appearing in the episode:
        R ← return following the first occurrence of s
        Append R to Returns(s)
        V(s) ← average(Returns(s))
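
A minimal Python sketch of this algorithm, assuming the caller supplies a generate_episode(pi) hook (a hypothetical name) that returns a list of (state, reward) pairs, where each reward follows the paired state:

    from collections import defaultdict

    def first_visit_mc_prediction(generate_episode, pi, num_episodes, gamma=1.0):
        """Estimate V^pi(s) by averaging first-visit returns."""
        returns = defaultdict(list)     # Returns(s): first-visit returns seen so far
        V = {}                          # state-value estimates
        for _ in range(num_episodes):
            episode = generate_episode(pi)   # [(s_t, r_{t+1}), ...]
            G = 0.0
            first = {}
            # Walk backwards so G is the return following each state; the
            # earliest visit overwrites later ones, giving first-visit returns.
            for s, r in reversed(episode):
                G = gamma * G + r
                first[s] = G
            for s, G in first.items():
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
        return V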
4. Blackjack Example
- Objective: have your card sum be greater than the dealer's without exceeding 21
- States (200 of them):
  - current sum (12-21)
  - dealer's showing card (ace-10)
  - do I have a usable ace?
- Reward: +1 for winning, 0 for a draw, -1 for losing
- Actions: stick (stop receiving cards), hit (receive another card)
- Policy: stick if my sum is 20 or 21, else hit (sketched below)
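
As a concrete illustration, a self-contained sketch of the threshold policy and the card mechanics (an infinite deck is assumed; all names are illustrative, not from the book):

    import random

    def draw_card():
        # Infinite deck: 1 (ace) through 9, plus four ten-valued cards.
        return min(random.randint(1, 13), 10)

    def hand_value(cards):
        """Return (sum, usable_ace): an ace counts as 11 if that doesn't bust."""
        total, has_ace = sum(cards), 1 in cards
        if has_ace and total + 10 <= 21:
            return total + 10, True
        return total, False

    def threshold_policy(player_sum, dealer_card, usable_ace):
        # Stick (0) on 20 or 21, otherwise hit (1).
        return 0 if player_sum >= 20 else 1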
5. Blackjack Value Functions
- After many MC state-visit evaluations, the state-value function is well approximated
- Dynamic programming's required parameters (transition probabilities, expected rewards) are difficult to formulate here!
- For instance, with a given hand and a decision to stick, what would the expected return be?
6. Backup Diagram for Monte Carlo
- The entire episode is included, while DP uses only one-step transitions
- Only one choice at each state (unlike DP, which uses all possible transitions in one step)
- Estimates for all states are independent, so MC does not bootstrap (build on other estimates)
- The time required to estimate one state does not depend on the total number of states
7. The Power of Monte Carlo
Example: the elastic membrane (Dirichlet problem). How do we compute the shape of a membrane or bubble attached to a fixed frame?
8. Two Approaches
- Relaxation: iterate on the grid and compute averages (like DP iterations)
- Kakutani's algorithm (1945): use many random walks and average the boundary-point values (like the MC approach; sketched below)
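
A minimal sketch of Kakutani's random-walk idea on a grid, assuming the caller supplies is_boundary and boundary_value functions (both hypothetical) describing the frame:

    import random

    def estimate_height(x, y, is_boundary, boundary_value, n_walks=1000):
        """Estimate the membrane height at (x, y) as the average boundary
        value reached by n_walks independent random walks started there."""
        total = 0.0
        for _ in range(n_walks):
            wx, wy = x, y
            while not is_boundary(wx, wy):
                dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
                wx, wy = wx + dx, wy + dy
            total += boundary_value(wx, wy)
        return total / n_walks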
9. Monte Carlo Estimation of Action Values (Q)
- Monte Carlo is most useful when a model is not available
- We want to learn Q^π
- Q^π(s,a): the average return starting from state s, taking action a, and following π thereafter (sketched after this list)
- Converges asymptotically if every state-action pair is visited
- To assure this, we must maintain exploration so that many state-action pairs are visited
- Exploring starts: every state-action pair has a nonzero probability of being the starting pair
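
The first-visit averaging from the earlier sketch carries over to state-action pairs directly; here the hypothetical generate_episode hook is assumed to return (state, action, reward) triples and to start each episode from a random (s, a) pair, satisfying exploring starts:

    from collections import defaultdict

    def mc_q_estimation(generate_episode, pi, num_episodes, gamma=1.0):
        """Estimate Q^pi(s, a) by averaging first-visit returns per pair."""
        returns = defaultdict(list)
        Q = {}
        for _ in range(num_episodes):
            episode = generate_episode(pi)      # [(s, a, r), ...]
            G = 0.0
            first = {}
            for s, a, r in reversed(episode):
                G = gamma * G + r
                first[(s, a)] = G               # earliest visit overwrites later ones
            for sa, G in first.items():
                returns[sa].append(G)
                Q[sa] = sum(returns[sa]) / len(returns[sa])
        return Q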
10. Monte Carlo Control (to approximate an optimal policy)
Generalized Policy Iteration (GPI)
- MC policy iteration: policy evaluation by approximating Q^π using MC methods, followed by policy improvement
- Policy improvement step: act greedily with respect to the Q^π (action-value) function; no model is needed to construct the greedy policy (see the one-line sketch below)
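
The improvement step needs nothing beyond the action-value estimates themselves; a one-line sketch (Q and actions are hypothetical containers):

    def greedy_policy(Q, actions, s):
        # Pick the highest-valued action; no transition model is consulted.
        return max(actions, key=lambda a: Q[(s, a)])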
11. Convergence of MC Control
- The policy improvement theorem tells us the greedy policy π_{k+1} is at least as good as π_k, since Q^{π_k}(s, π_{k+1}(s)) = max_a Q^{π_k}(s, a) ≥ Q^{π_k}(s, π_k(s)) = V^{π_k}(s)
- This assumes exploring starts and an infinite number of episodes for MC policy evaluation
- To remove the latter assumption, either:
  - update only to a given level of performance, or
  - alternate between evaluation and improvement after each episode
12. Monte Carlo Exploring Starts
- The fixed point of this process is the optimal policy π*
- A formal proof of convergence is one of the fundamental open questions of RL (a sketch of the algorithm follows)
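
A minimal sketch of the Monte Carlo ES loop, alternating evaluation and improvement after every episode; generate_episode_es is a hypothetical hook that starts from a random state-action pair and then follows the current policy (acting arbitrarily in states the policy has not covered yet):

    from collections import defaultdict

    def mc_es_control(generate_episode_es, actions, num_episodes, gamma=1.0):
        Q = defaultdict(float)
        counts = defaultdict(int)
        policy = {}                                  # greedy policy, filled in as we go
        for _ in range(num_episodes):
            episode = generate_episode_es(policy)    # [(s, a, r), ...], random start pair
            G = 0.0
            first = {}
            for s, a, r in reversed(episode):
                G = gamma * G + r
                first[(s, a)] = G
            for (s, a), G in first.items():
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]        # running average
                policy[s] = max(actions, key=lambda a2: Q[(s, a2)])  # improvement
        return policy, Q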
13. Blackjack Example, continued
- Exploring starts are easy to enforce by generating state-action pairs randomly
- Initial policy: as described before (sticks only at 20 or 21)
- Initial action-value function: zero everywhere
14. On-policy Monte Carlo Control
- On-policy: learn about or improve the policy currently being executed
- How do we get rid of exploring starts?
  - Need soft policies: π(s,a) > 0 for all s and a
  - e.g. an ε-greedy policy (one kind of ε-soft policy), with action-selection probabilities ε/|A(s)| for each non-greedy action and 1 − ε + ε/|A(s)| for the greedy action
- Similar to GPI: move the policy towards the greedy policy (i.e. ε-soft)
- Converges to the best ε-soft policy
15. On-policy MC Control
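
A minimal sketch of this slide's algorithm, under the same assumptions as before (generate_episode is a hypothetical hook that plays one episode, choosing actions with the function it is given):

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, actions, s, eps=0.1):
        # Non-greedy actions get probability eps/|A(s)|; the greedy
        # action gets the remaining 1 - eps + eps/|A(s)|.
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def on_policy_mc_control(generate_episode, actions, num_episodes,
                             gamma=1.0, eps=0.1):
        Q = defaultdict(float)
        counts = defaultdict(int)
        for _ in range(num_episodes):
            episode = generate_episode(lambda s: epsilon_greedy(Q, actions, s, eps))
            G = 0.0
            first = {}
            for s, a, r in reversed(episode):
                G = gamma * G + r
                first[(s, a)] = G
            for (s, a), G in first.items():
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
        return Q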
16. Learning about π while following another policy
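
This slide's formulas did not survive extraction; the standard importance-sampling construction from the book (a reconstruction, so treat it as a best guess at what the slide showed) weights each return R_i observed under the behavior policy π′ by how likely the estimation policy π would have been to generate that episode:

    w_i = \prod_{t} \frac{\pi(s_t, a_t)}{\pi'(s_t, a_t)},
    \qquad
    V(s) \approx \frac{\sum_{i=1}^{n} w_i R_i}{\sum_{i=1}^{n} w_i}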
17. Off-policy Monte Carlo Control
- On-policy methods estimate a policy's value while following it, so the policy that generates behavior and the policy whose result is estimated are identical
- Off-policy methods assume separate behavior and estimation policies
  - The behavior policy generates behavior in the environment; it may be randomized so as to sample all actions
  - The estimation policy is the policy being learned about; it may be deterministic (e.g. greedy)
- Off-policy methods estimate one policy while following another
- Returns from the behavior policy are averaged using importance weights; the behavior policy must select with nonzero probability every action the estimation policy might take
- May be slow to find improvements after a nongreedy action, since only the tail of an episode following the last nongreedy action is used
18. Off-policy MC Control
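
A sketch of this slide's learner, with a uniformly random behavior policy and a greedy estimation (target) policy; weights accumulate backwards from the end of each episode, so only the tail after the last nongreedy action contributes, which is exactly why improvement can be slow:

    import random
    from collections import defaultdict

    def off_policy_mc_control(generate_episode, actions, num_episodes, gamma=1.0):
        Q = defaultdict(float)
        C = defaultdict(float)                       # cumulative importance weights
        target = {}                                  # greedy estimation policy
        behave = lambda s: random.choice(actions)    # behavior policy: uniform random
        for _ in range(num_episodes):
            episode = generate_episode(behave)       # [(s, a, r), ...]
            G, W = 0.0, 1.0
            for s, a, r in reversed(episode):
                G = gamma * G + r
                C[(s, a)] += W
                Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])   # weighted average
                target[s] = max(actions, key=lambda a2: Q[(s, a2)])
                if a != target[s]:
                    break            # tail no longer consistent with the target
                W *= len(actions)    # ratio pi(a|s)/b(a|s) = 1 / (1/|A|)
        return target, Q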
19. Incremental Implementation
- MC can be implemented incrementally
  - saves memory
- Compute the weighted average of each return:
  - non-incremental: V_n = (w_1 R_1 + ... + w_n R_n) / (w_1 + ... + w_n)
  - incremental equivalent: V_{n+1} = V_n + (w_{n+1} / W_{n+1}) (R_{n+1} − V_n), where W_{n+1} = W_n + w_{n+1} and W_0 = 0
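
In code, the incremental form keeps only the current estimate and the cumulative weight rather than the whole list of returns; a sketch:

    def incremental_update(V, W, R, w):
        """One weighted-average step; returns the new (estimate, total_weight).
        Starting from V = 0.0, W = 0.0, this is algebraically identical to
        recomputing sum(w_k * R_k) / sum(w_k) from scratch each time."""
        W = W + w
        V = V + (w / W) * (R - V)
        return V, W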
20. Racetrack Exercise
- States: grid squares, plus horizontal and vertical velocity
- Rewards: -1 on the track, -5 off the track
- Only right turns allowed
- Actions: add +1, -1, or 0 to each velocity component
- 0 < velocity < 5
- Stochastic: 50% of the time the car moves 1 extra square up or right (sketched below)
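
A hedged sketch of the stochastic position update for this exercise (the track layout, collision handling, and velocity bookkeeping are left to the reader; names are illustrative):

    import random

    def move(x, y, vx, vy):
        """Advance the car by its velocity; 50% of the time the
        environment pushes it one extra square up or right."""
        x, y = x + vx, y + vy
        if random.random() < 0.5:
            if random.random() < 0.5:
                x += 1      # extra square right
            else:
                y += 1      # extra square up
        return x, y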
21. Summary
- MC has several advantages over DP:
  - can learn directly from interaction with the environment
  - no need for full models
  - no need to learn about ALL states
  - less harm from violations of the Markov property (later in the book)
- MC methods provide an alternate policy evaluation process
- One issue to watch for: maintaining sufficient exploration
  - exploring starts, soft policies
- No bootstrapping (as opposed to DP)