Title: Policy Gradient in Continuous Time
1. Policy Gradient in Continuous Time
by Remi Munos, JMLR 2006
Presented by Hui Li, Duke University Machine Learning Group, May 30, 2007
2. Outline
- Introduction
- Discretized Stochastic Processes Approximation
- Model-free Reinforcement Learning (RL) Algorithm
- Example Results
3. Introduction of the Problem
- Consider an optimal control problem with continuous state
System dynamics:
- Deterministic process
- Continuous state
- Objective: find an optimal control (u_t) that maximizes the functional
Objective function:
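A sketch of the setup in the paper's notation (assuming the fixed-horizon, terminal-reward formulation used in the paper's experiment):

    dx_t/dt = f(x_t, u_t),   x_0 given,   t ∈ [0, T]
    J(x_0; (u_t)_{0 ≤ t ≤ T}) = r(x_T)

Here f is the deterministic dynamics (unknown to the agent in the RL setting), u_t the control at time t, and r a terminal reward evaluated at the fixed horizon T.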
4. Introduction of the Problem
- Consider a class of parameterized policies π_α, with parameter α
- Find the parameter α that maximizes the performance measure
- Standard approach is to use a gradient ascent method
(computing this gradient is the object of the paper)
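In symbols (a sketch using the notation above; η is an assumed step-size symbol, and π_α(u|x) maps each state to a distribution over controls):

    V(α) := J(x_0; (u_t) generated by π_α)      (performance measure)
    α ← α + η ∇_α V(α)                          (gradient ascent)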
5. Introduction of the Problem
How to compute ∇_α V(α)?
- A direct approach requires a large number of trajectories to estimate the gradient of the performance measure.
- Pathwise estimation of the gradient: compute the gradient using only one trajectory.
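For concreteness, a finite-difference estimator (one example of a many-trajectory method, not necessarily the estimator shown on the slide) perturbs each parameter component separately:

    ∂V/∂α_i ≈ ( V(α + ε e_i) − V(α) ) / ε,   i = 1, …, m

so each gradient evaluation costs on the order of m full rollouts, and V(·) itself must be averaged over several runs when the policy is stochastic.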
6. Introduction of the Problem
Pathwise estimation of the gradient: part of the expression is known (it depends only on the chosen policy and the reward), and part is unknown (it depends on the system dynamics).
- In reinforcement learning the dynamics f is unknown. How can we approximate z_t?
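A sketch of the pathwise quantities involved (my notation, consistent with the later slides; d is the state dimension and m the number of policy parameters):

    z_t := ∇_α x_t                                     (state sensitivity, a d × m matrix)
    ∇_α V(α) = ∇_x r(x_T) · z_T
    dz_t/dt = ∇_x f̄_α(x_t) z_t + ∇_α f̄_α(x_t),   z_0 = 0
    where f̄_α(x) = Σ_u π_α(u|x) f(x, u)               (dynamics averaged under the policy)

∇_x r(x_T) can be evaluated once the terminal state is reached (the reward function is given), whereas z_T depends on the unknown dynamics f, which is why z_t must be approximated.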
7. Discretized Stochastic Processes Approximation
- A general convergence result (Theorem 3): conditions under which a discrete-time stochastic process converges to a continuous-time deterministic trajectory as the time step goes to zero.
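Roughly, results of this type say (a paraphrase of the usual consistency conditions, not the paper's exact statement of Theorem 3): if

    E[ x_{n+1} − x_n | x_n = x ] = Δ g(x) + o(Δ)   and   Var[ x_{n+1} − x_n | x_n = x ] = o(Δ),

then, as the step size Δ → 0, the discrete process converges to the solution of dx/dt = g(x).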
8. Discretization of the state
- Stochastic discrete state process
Initialization:
Jump in state:
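A minimal Euler-type sketch of such a process (the paper's construction may differ, e.g. in how long each sampled action is held):

    x_0^Δ = x_0                                  (initialization)
    u_n ~ π_α(· | x_n^Δ)                         (sample an action from the stochastic policy)
    x_{n+1}^Δ = x_n^Δ + Δ f(x_n^Δ, u_n)          (jump in state over one step of size Δ)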
9. Proof of Proposition 5
From Taylor's formula:
The average jump:
Applying Theorem 3 directly, Proposition 5 is proved.
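Under the Euler-type sketch above, the conditional mean of the state jump is

    E[ x_{n+1}^Δ − x_n^Δ | x_n^Δ = x ] = Δ Σ_u π_α(u|x) f(x, u) + o(Δ) = Δ f̄_α(x) + o(Δ),

which is the consistency with dx/dt = f̄_α(x) needed to invoke the convergence theorem.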
10. Discretization of the state gradient
- Stochastic discrete state gradient process
Initialization:
With:
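A hedged sketch of what such a gradient process can look like (the exact update in the paper may differ; ∇_α log π_α and ∇_x log π_α are the likelihood-ratio terms of the policy, known once the policy is chosen):

    z_0^Δ = 0      (the initial state does not depend on α)
    z_{n+1}^Δ = z_n^Δ + Δ ∇_x f(x_n, u_n) z_n^Δ
               + (x_{n+1} − x_n) [ ∇_α log π_α(u_n|x_n) + (z_n^Δ)ᵀ ∇_x log π_α(u_n|x_n) ]ᵀ

Taking the conditional expectation of this jump recovers Δ [ ∇_x f̄_α(x_n) z_n^Δ + ∇_α f̄_α(x_n) ] + o(Δ), i.e. the process is consistent with the continuous equation for z_t sketched earlier.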
11. Proof of Proposition 6
Since
then
Applying Theorem 3 directly, Proposition 6 is proved.
12. Model-free Reinforcement Learning Algorithm
Let
In this stochastic approximation, the state jump is observed and the policy's likelihood-ratio term is given; we only need to approximate the term involving the unknown dynamics.
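Concretely, in the sketched z-update the three kinds of terms are (again a sketch, matching the notation above):

    x_{n+1} − x_n                                    observed from the trajectory
    ∇_α log π_α(u_n|x_n), ∇_x log π_α(u_n|x_n)       given (the policy is chosen by us)
    ∇_x f(x_n, u_n) z_n^Δ                            must be approximated (f is unknown)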
13. Least-Squares Approximation of the Unknown Dynamics Term
Define
the set of past discrete times s, with t − c ≤ s ≤ t, at which action u_t has been taken.
From Taylor's formula, for each such discrete time s,
We deduce
14. where
We may derive an approximation of this quantity by solving the least-squares problem.
Then we have
Here, the barred quantities denote average values over the set of past times defined above.
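A sketch of the least-squares step, reconstructed from the Taylor-expansion argument on the previous slide (the paper's exact parameterization may differ). Writing S_t for the set of past times defined above, a first-order expansion of each observed jump gives

    (x_{s+1} − x_s)/Δ ≈ f(x_t, u_t) + ∇_x f(x_t, u_t) (x_s − x_t),   s ∈ S_t,

so estimates (b̂, Â) ≈ (f(x_t, u_t), ∇_x f(x_t, u_t)) are obtained by minimizing

    Σ_{s ∈ S_t} ‖ (x_{s+1} − x_s)/Δ − b − A (x_s − x_t) ‖²   over b ∈ R^d, A ∈ R^{d×d}.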
15. Algorithm
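Putting the pieces together, a compact Python sketch of one possible implementation of the loop described above. Everything below is illustrative rather than the paper's code: the stand-in dynamics f_true, the quadratic terminal reward, the linear-in-state Boltzmann policy, the window size, and the step sizes are all assumptions, and the z-update follows the hedged sketch from the state-gradient slide.

    import numpy as np

    rng = np.random.default_rng(0)

    # --- Illustrative stand-in environment (NOT the paper's hand/mass system) ---
    d, m_actions, T, dt = 2, 4, 1.0, 0.01
    N = int(T / dt)
    actions = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], dtype=float)
    target = np.array([0.5, 0.5])

    def f_true(x, u):
        """Dynamics, unknown to the agent: a damped integrator driven by action u."""
        return -0.5 * x + actions[u]

    def terminal_reward_grad(x):
        """Gradient of r(x_T) = -||x_T - target||^2 with respect to x_T."""
        return -2.0 * (x - target)

    # --- Boltzmann-like policy: pi_alpha(u|x) proportional to exp(alpha[u] . x) ---
    def policy_probs(alpha, x):
        logits = alpha @ x                      # one logit per action
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()

    def grad_log_pi(alpha, x, u, p):
        """Return (d/d_alpha log pi, d/dx log pi) for the sampled action u."""
        g_alpha = np.zeros_like(alpha)          # shape (m_actions, d)
        g_alpha[u] += x
        g_alpha -= np.outer(p, x)
        g_x = alpha[u] - p @ alpha              # shape (d,)
        return g_alpha, g_x

    def estimate_dynamics(history, u, x_t, window=50):
        """Least-squares fit of (f, grad_x f) at (x_t, u) from past jumps taken with
        the same action inside the window t-c <= s <= t (cf. slides 13-14)."""
        rows = [(xs, xn) for (xs, us, xn) in history[-window:] if us == u]
        if len(rows) <= d:                      # not enough data: crude one-sample fallback
            xs, _, xn = history[-1]
            return (xn - xs) / dt, np.zeros((d, d))
        X = np.array([np.concatenate(([1.0], xs - x_t)) for xs, _ in rows])
        Y = np.array([(xn - xs) / dt for xs, xn in rows])
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # coef has shape (1 + d, d)
        return coef[0], coef[1:].T              # b ~ f(x_t, u),  A ~ grad_x f(x_t, u)

    def run_episode(alpha, eta=0.05):
        x = np.zeros(d)
        z = np.zeros((d,) + alpha.shape)        # z ~ d x_t / d alpha
        history = []
        for n in range(N):
            p = policy_probs(alpha, x)
            u = rng.choice(m_actions, p=p)
            x_next = x + dt * f_true(x, u)      # the agent only sees this observed jump
            history.append((x.copy(), u, x_next.copy()))
            f_hat, A_hat = estimate_dynamics(history, u, x)   # f_hat unused: jump is observed directly
            g_alpha, g_x = grad_log_pi(alpha, x, u, p)
            lr = g_alpha + np.tensordot(g_x, z, axes=([0], [0]))   # likelihood-ratio term
            z = z + dt * np.tensordot(A_hat, z, axes=([1], [0])) \
                  + np.multiply.outer(x_next - x, lr)
            x = x_next
        grad_V = np.tensordot(terminal_reward_grad(x), z, axes=([0], [0]))
        return alpha + eta * grad_V             # one gradient-ascent step on the policy

    alpha = np.zeros((m_actions, d))
    for _ in range(200):
        alpha = run_episode(alpha)

The structural point to notice: only observed jumps and the known policy gradients enter the update, and the unknown dynamics Jacobian is replaced by its least-squares estimate, so a single trajectory per episode suffices to take a gradient step.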
16. Experimental Results
- Six continuous state variables: (x0, y0) hand position, (x, y) mass position, (vx, vy) mass velocity
- Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}
- Goal: reach a target (xG, yG) with the mass at a specific time T
Terminal reward function:
17. The system dynamics
Consider a Boltzmann-like stochastic policy, where
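The generic form of such a policy, as a sketch (the state features φ actually used in the experiment are not reproduced here):

    π_α(u | x) = exp( α_u · φ(x) ) / Σ_{u'} exp( α_{u'} · φ(x) ),   u ∈ U,

whose likelihood-ratio term ∇_α log π_α(u|x) is available in closed form, as the algorithm requires.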
18. (No transcript)
19. Conclusion
- Described a reinforcement learning method for approximating the gradient of a continuous-time deterministic problem with respect to the control parameters
- Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process