Chapter 5: Monte Carlo Methods

Transcript and Presenter's Notes

1
Chapter 5: Monte Carlo Methods
  • Monte Carlo methods learn from complete sample
    returns
  • Only defined for episodic tasks
  • Monte Carlo methods learn directly from
    experience
  • On-line: no model necessary, and still attains
    optimality
  • Simulated: no need for a full model

2
Monte Carlo Policy Evaluation
  • Goal: learn Vπ(s)
  • Given: some number of episodes under π which
    contain s
  • Idea: average returns observed after visits to s
  • Every-visit MC: average returns for every time s
    is visited in an episode
  • First-visit MC: average returns only for the
    first time s is visited in an episode
  • Both converge asymptotically

3
First-visit Monte Carlo Policy Evaluation
Initialize:
  π ← policy to be evaluated
  V ← an arbitrary state-value function
  Returns(s) ← empty list, for all s ∈ S

Repeat forever:
  Generate an episode using π
  For each state s appearing in the episode:
    R ← return following the first occurrence of s
    Append R to Returns(s)
    V(s) ← average(Returns(s))
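
A minimal Python sketch of this first-visit evaluation loop. The episode generator, its (state, reward) step format, and the discount factor gamma are assumptions for illustration, not taken from the slide.

from collections import defaultdict

def first_visit_mc_evaluation(policy, generate_episode, num_episodes, gamma=1.0):
    returns = defaultdict(list)   # Returns(s): observed returns following s
    V = defaultdict(float)        # state-value estimates
    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(s0, r1), (s1, r2), ...]
        states = [s for s, _ in episode]
        G = 0.0
        # Work backwards so G is the return following each visited state.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = r + gamma * G
            if s not in states[:t]:          # first visit to s only
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V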
4
Blackjack example
  • Object: have your card sum be greater than the
    dealer's without exceeding 21
  • States (200 of them):
  • current sum (12-21)
  • dealer's showing card (ace-10)
  • do I have a usable ace?
  • Reward: +1 for winning, 0 for a draw, -1 for
    losing
  • Actions: stick (stop receiving cards), hit
    (receive another card)
  • Policy: stick if my sum is 20 or 21, else hit
    (sketched in code below)
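
A small Python sketch of this state space and the fixed evaluation policy; the type and function names are illustrative assumptions.

from typing import NamedTuple

class BlackjackState(NamedTuple):
    player_sum: int        # current sum, 12-21
    dealer_showing: int    # dealer's showing card, 1 (ace) to 10
    usable_ace: bool       # do I have a usable ace?

def stick_on_20_policy(state: BlackjackState) -> str:
    # Stick if the player's sum is 20 or 21, otherwise hit.
    return "stick" if state.player_sum >= 20 else "hit"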

5
Blackjack Value Functions
  • After many MC evaluations of state visits, the
    state-value function is well approximated
  • The quantities needed by Dynamic Programming are
    difficult to formulate here!
  • For instance, with a given hand and a decision to
    stick, what would be the expected return?

6
Backup Diagram for Monte Carlo
  • The entire episode is included, whereas DP uses
    only one-step transitions
  • Only one choice at each state (unlike DP which
    uses all possible transitions in one step)
  • Estimates for all states are independent so MC
    does not bootstrap (build on other estimates)
  • Time required to estimate one state does not
    depend on the total number of states

7
The Power of Monte Carlo
Example - Elastic Membrane (Dirichlet Problem)
How do we compute the shape of the membrane or
bubble attached to a fixed frame?
8
Two Approaches
Relaxation: iterate on the grid and compute
averages (like DP iterations)
Kakutani's algorithm, 1945: use many random walks
and average the values at the boundary points they
reach (like the MC approach); a random-walk sketch
follows below
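
A hedged Python sketch of the random-walk idea, assuming a square grid and caller-supplied helpers is_boundary and boundary_value (these names are illustrative, not from the slide).

import random

def estimate_membrane_height(start, is_boundary, boundary_value, num_walks=10000):
    # Monte Carlo estimate of the membrane height at one interior grid point:
    # average the frame (boundary) values reached by many random walks.
    total = 0.0
    for _ in range(num_walks):
        x, y = start
        while not is_boundary(x, y):
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_value(x, y)   # frame value at the exit point
    return total / num_walks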
9
Monte Carlo Estimation of Action Values (Q)
  • Monte Carlo is most useful when a model is not
    available
  • We want to learn Q
  • Qπ(s,a): average return starting from state s
    and action a, then following π
  • Converges asymptotically if every state-action
    pair is visited
  • To assure this we must maintain exploration to
    visit many state-action pairs
  • Exploring starts: every state-action pair has a
    non-zero probability of being the starting pair
    (a sketch of Q estimation follows below)
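
A minimal every-visit sketch of estimating Q(s,a) by averaging returns, in Python. The exploring-starts episode generator and its (state, action, reward) step format are assumptions for illustration.

from collections import defaultdict

def mc_action_values(pi, generate_episode_es, num_episodes, gamma=1.0):
    returns = defaultdict(list)   # returns observed for each (s, a) pair
    Q = defaultdict(float)
    for _ in range(num_episodes):
        # Episode starts from a random (s, a) pair, then follows pi.
        episode = generate_episode_es(pi)   # [(s0, a0, r1), (s1, a1, r2), ...]
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G               # return following (s, a)
            returns[(s, a)].append(G)       # every-visit averaging
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q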

10
Monte Carlo Control (to approximate an optimal
policy)
Generalized Policy Iteration (GPI)
  • MC policy iteration: policy evaluation by
    approximating Qπ using MC methods, followed by
    policy improvement
  • Policy improvement step: be greedy with respect
    to the Qπ (action-value) function; no model is
    needed to construct the greedy policy (see the
    sketch below)
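
A tiny Python sketch of why no model is needed: with action values, greedy improvement is just an argmax over actions. Q and actions are assumed to exist.

def greedy_policy(Q, actions):
    # Greedy improvement from action values: no transition model required.
    return lambda s: max(actions, key=lambda a: Q[(s, a)])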

11
Convergence of MC Control
  • The policy improvement theorem tells us that each
    greedified policy is at least as good as the
    previous one
  • This assumes exploring starts and an infinite
    number of episodes for MC policy evaluation
  • To remove the latter assumption:
  • update only to a given level of performance, or
  • alternate between evaluation and improvement per
    episode

12
Monte Carlo Exploring Starts
The fixed point is the optimal policy π*. A formal
proof of convergence is one of the fundamental open
questions of RL (a sketch of the algorithm follows
below).
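
A hedged Python sketch of Monte Carlo control with exploring starts, alternating evaluation and greedy improvement after every episode. The episode generator and its step format are assumptions for illustration.

from collections import defaultdict

def monte_carlo_es(actions, generate_episode_es, num_episodes, gamma=1.0):
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {}                                  # state -> currently greedy action
    for _ in range(num_episodes):
        episode = generate_episode_es(policy)    # random (s, a) start, then policy
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G
            seen_earlier = any(s == s2 and a == a2 for s2, a2, _ in episode[:t])
            if not seen_earlier:                 # first-visit averaging
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(actions, key=lambda a2: Q[(s, a2)])   # improve
    return policy, Q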
13
Blackjack Example continued
  • Exploring starts are easy to enforce here by
    generating initial state-action pairs randomly
  • Initial policy as described before (sticks only
    at 20 or 21)
  • Initial action-value function equal to zero

14
On-policy Monte Carlo Control
  • On-policy: learn about or improve the policy
    currently being executed
  • How do we get rid of exploring starts?
  • Need soft policies: π(s,a) > 0 for all s and a
  • e.g., an ε-soft policy
  • probability of action selection: ε/|A(s)| for
    nongreedy actions, 1 - ε + ε/|A(s)| for the
    greedy action (selection sketched below)
  • Similar to GPI: move the policy towards a greedy
    policy (i.e., ε-greedy)
  • Converges to the best ε-soft policy
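
A minimal ε-greedy selection sketch in Python, one common way to realize an ε-soft policy; Q and actions are assumed to exist.

import random

def epsilon_greedy_action(Q, s, actions, eps=0.1):
    # With probability eps pick uniformly at random, otherwise act greedily,
    # so every action keeps probability at least eps / |A(s)|.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])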

15
On-policy MC Control
16
Learning about π while following another policy
17
Off-policy Monte Carlo control
  • On-policy estimates the policy's value while
    following it
  • so the policy that generates behavior and the
    policy being estimated are identical
  • Off-policy assumes separate behavior and
    estimation policies
  • The behavior policy generates behavior in the
    environment
  • may be randomized to sample all actions
  • The estimation policy is the policy being learned
    about
  • may be deterministic (greedy)
  • Off-policy estimates one policy while following
    another
  • Returns from the behavior policy are reweighted
    by the probabilities the two policies assign to
    the observed actions (importance sampling); the
    behavior policy must select every action of the
    estimation policy with nonzero probability (see
    the sketch below)
  • May be slow: improvement is learned only from the
    part of an episode after the last nongreedy
    action
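
A hedged Python sketch of off-policy evaluation with weighted importance sampling, learning V for an estimation policy pi from episodes generated by a behavior policy mu. The episode format and the probability functions pi(a, s) and mu(a, s) are assumptions for illustration.

from collections import defaultdict

def off_policy_mc_evaluation(pi, mu, episodes, gamma=1.0):
    num = defaultdict(float)     # sum of weighted returns for each state
    den = defaultdict(float)     # sum of importance weights for each state
    V = {}
    for episode in episodes:     # each episode: [(s0, a0, r1), (s1, a1, r2), ...]
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            W *= pi(a, s) / mu(a, s)   # ratio over actions from this step onward
            num[s] += W * G
            den[s] += W
            V[s] = num[s] / den[s]     # weighted importance-sampling average
    return V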

18
Off-policy MC control
19
Incremental Implementation
  • MC can be implemented incrementally
  • saves memory
  • Compute the weighted average of returns using an
    incremental update that is equivalent to the
    non-incremental (store-all-returns) form, as
    sketched below
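
A minimal Python sketch of the incremental weighted average; the symbols V, W, R, w here are my notation, assumed rather than copied from the slide. Each new return R with weight w updates the running estimate without storing past returns.

class IncrementalWeightedAverage:
    def __init__(self):
        self.V = 0.0    # current weighted-average estimate
        self.W = 0.0    # cumulative weight so far
    def update(self, R, w=1.0):
        # V <- V + (w / W) * (R - V), with W <- W + w; equivalent to
        # recomputing sum(w_k * R_k) / sum(w_k) over all returns seen so far.
        self.W += w
        self.V += (w / self.W) * (R - self.V)
        return self.V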
20
Racetrack Exercise
  • States: grid square plus horizontal and vertical
    velocity
  • Rewards: -1 on track, -5 off track
  • Only right turns allowed
  • Actions: add +1, -1, or 0 to each velocity
    component
  • 0 < velocity < 5
  • Stochastic: 50% of the time it moves 1 extra
    square up or right (dynamics sketched below)
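
A rough Python sketch of one step of these dynamics, under assumptions about coordinates and velocity clamping; track layout, rewards, and bounds checking are omitted.

import random

def racetrack_step(pos, vel, action):
    # action is (dvx, dvy), each component in {-1, 0, +1}
    vx = min(4, max(0, vel[0] + action[0]))   # clamp velocity to the stated range
    vy = min(4, max(0, vel[1] + action[1]))
    x, y = pos[0] + vx, pos[1] + vy
    if random.random() < 0.5:                 # 50% of the time: one extra square,
        if random.random() < 0.5:             # up or right (choice assumed uniform)
            x += 1
        else:
            y += 1
    return (x, y), (vx, vy)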

21
Summary
  • MC has several advantages over DP
  • Can learn directly from interaction with
    environment
  • No need for full models
  • No need to learn about ALL states
  • Less harmed by violations of the Markov property
    (discussed later in the book)
  • MC methods provide an alternate policy evaluation
    process
  • One issue to watch for: maintaining sufficient
    exploration
  • exploring starts, soft policies
  • No bootstrapping (as opposed to DP)