Title: Using Inaccurate Models in Reinforcement Learning


1
Using Inaccurate Models in Reinforcement Learning
  • Pieter Abbeel, Morgan Quigley and Andrew Y. Ng
  • Stanford University

2
Overview
  • Reinforcement learning in high-dimensional,
    continuous state spaces.
  • Model-based RL: difficult to build an accurate
    model.
  • Model-free RL: often requires a large number of
    real-life trials.
  • We present a hybrid algorithm that requires only
  • an approximate model, and
  • a small number of real-life trials.
  • The resulting policy is (locally) near-optimal.
  • Experiments on a flight simulator and a real RC
    car.

3
Reinforcement learning formalism
  • Markov Decision Process (MDP)
  • M = (S, A, T, H, s_0, R).
  • S ⊆ R^n (continuous state space).
  • Time-varying, deterministic dynamics:
  • T = {f_t : S × A → S}, t = 0, ..., H.
  • Goal: find a policy π : S → A that maximizes
  • U(π) = E[ Σ_{t=0}^{H} R(s_t) | π ]
    (a rollout sketch follows below).
  • Focus: the task of trajectory following.
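
A minimal sketch (not from the slides) of the utility being maximized
in this deterministic setting; f, policy, R, s0, and H are illustrative
placeholders:

    # Roll out a policy under the time-varying deterministic dynamics
    # f_t and accumulate reward: U(pi) = sum_{t=0}^{H} R(s_t).
    def utility(f, policy, R, s0, H):
        s, total = s0, R(s0)
        for t in range(H):
            a = policy(s, t)    # pi : S -> A, time-indexed
            s = f(t, s, a)      # s_{t+1} = f_t(s_t, a_t)
            total += R(s)
        return total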
4
Motivating Example
  • A student driver learning to make a 90-degree
    right turn:
  • Only a few trials needed.
  • No accurate model.
  • The student driver has access to
  • a real-life trial, and
  • a crude model.
  • Result: a good policy-gradient estimate.

5
Algorithm Idea
  • Input to the algorithm: an approximate model.
  • Start by computing the optimal policy according
    to the model.

[Figure: real-life trajectory vs. target trajectory.]
The policy is optimal according to the model, so
no improvement is possible based on the model.
6
Algorithm Idea (2)
  • Update the model such that it becomes exact for
    the current policy.

8
Algorithm Idea (2)
  • The updated model perfectly predicts the state
    sequence obtained under the current policy.
  • We can use the updated model to find an improved
    policy (a sketch of the model update follows).
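
A minimal sketch of the model update, assuming the correction is an
additive, time-indexed bias term (one natural way to make an
approximate model exact along the observed trajectory; the names below
are illustrative, not from the slides):

    # Given the approximate model f_hat and one recorded real-life
    # trajectory, add a per-time-step bias so the corrected model
    # reproduces that trajectory exactly under the current policy.
    def corrected_model(f_hat, real_states, real_actions):
        bias = [real_states[t + 1] - f_hat(t, real_states[t], real_actions[t])
                for t in range(len(real_actions))]
        def f(t, s, a):
            return f_hat(t, s, a) + bias[t]
        return f

Because the bias does not depend on s or a, the corrected model keeps
f_hat's derivatives while shifting its predictions onto the real
trajectory.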

9
Algorithm
  1. Find the (locally) optimal policy π for the
     model.
  2. Execute the current policy π and record the
     state trajectory.
  3. Update the model such that the new model is
     exact for the current policy π.
  4. Use the new model to compute the policy gradient
     ∇U(π), and update the policy: π ← π + α ∇U(π).
  5. Go back to Step 2.
  • Notes:
  • The step-size parameter α is determined by a line
    search.
  • Instead of the policy gradient, any algorithm
    that provides a local policy-improvement
    direction can be used. In our experiments we
    used differential dynamic programming. (A code
    sketch of this loop follows.)
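
A minimal sketch of the full loop, reusing corrected_model from the
earlier sketch; run_real_system, policy_gradient, and line_search are
hypothetical placeholders, and π is treated as a parameter vector (the
experiments below actually use DDP rather than a plain policy
gradient):

    # One real-life trial per iteration; all derivative computations
    # happen in the corrected model.
    def hybrid_rl(f_hat, run_real_system, policy_gradient, line_search,
                  pi, n_iters):
        for _ in range(n_iters):
            states, actions = run_real_system(pi)        # Step 2: real trial
            f = corrected_model(f_hat, states, actions)  # Step 3: exact model
            g = policy_gradient(f, pi)                   # Step 4: direction
            alpha = line_search(f, pi, g)                # step size
            pi = pi + alpha * g                          # policy update
        return pi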

10
Performance Guarantees Intuition
  • Exact policy gradient
  • Model-based policy gradient

[Equations shown as images; their annotations mark
the two error sources: evaluation of derivatives
along the wrong trajectory, and the derivative of
the approximate transition function.]

Our algorithm eliminates one (of the two) sources
of error, as the reconstruction below illustrates.
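
A reconstruction of the chain-rule argument behind this slide; the
notation (θ for the policy parameters) is assumed, since the slide's
equations survive only as their annotations:

    dU/dθ       = Σ_{t=0}^{H} (∂R/∂s)(s_t) · (ds_t/dθ)
    ds_{t+1}/dθ = (∂f_t/∂s)(s_t, a_t) · (ds_t/dθ)
                  + (∂f_t/∂a)(s_t, a_t) · (da_t/dθ)

The model-based gradient errs twice: it differentiates the approximate
f̂_t instead of the true f_t, and it evaluates those derivatives along
the model's predicted trajectory instead of the real one. The corrected
model has the same derivatives as f̂_t (the additive bias is constant
in s and a) but evaluates them along the real-life trajectory,
eliminating the wrong-trajectory error.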
11
Performance Guarantees
  • Let the local policy-improvement algorithm be the
    policy gradient.
  • Notes:
  • These assumptions are insufficient to give the
    same performance guarantees for purely
    model-based RL.
  • The constant K depends only on the dimensionality
    of the state, the action, and the policy; the
    horizon H; and an upper bound on the 1st and 2nd
    derivatives of the transition model, the policy,
    and the reward function.

12
Experiments
  • We use differential dynamic programming (DDP) to
    find control policies in the model.
  • Two systems: a flight simulator and an RC car.

13
Flight Simulator Setup
  • The flight-simulator model has 43 parameters
    (mass, inertia, drag coefficients, lift
    coefficients, etc.).
  • We generated approximate models by randomly
    perturbing the parameters (sketch below).
  • All 4 standard fixed-wing control actions:
    throttle, ailerons, elevators, and rudder.
  • Our reward function quadratically penalizes
    deviation from the desired trajectory.
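
A minimal sketch of generating approximate models this way; the
multiplicative Gaussian noise and its 20% magnitude are assumptions
(the slide says only "randomly perturbing the parameters"):

    import numpy as np

    rng = np.random.default_rng(0)
    true_params = np.ones(43)   # stand-in for the 43 simulator parameters
    # Perturb each parameter by ~20% relative Gaussian noise to obtain
    # an inaccurate-but-related model.
    approx_params = true_params * (1.0 + 0.2 * rng.standard_normal(43))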

14
Flight Simulator Movie
15
Flight Simulator Results
76% utility improvement over the purely model-based
approach.

[Plot: desired trajectory vs. model-based
controller vs. our algorithm.]
16
RC Car Setup
  • Control actions: throttle and steering.
  • Low-speed dynamics model with state variables:
  • position, velocity, heading, and heading rate.
  • Model estimated from 30 minutes of data (an
    illustrative sketch follows).
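
A hypothetical sketch of a low-speed dynamics model over these state
variables; the functional form and constants below are invented for
illustration (the real model was estimated from the 30 minutes of
driving data):

    import numpy as np

    def car_step(state, throttle, steer, dt=0.05,
                 k_thr=1.0, k_drag=0.1, k_steer=2.0):
        # state = (x, y, v, psi, psi_dot):
        # position, velocity, heading, heading rate.
        x, y, v, psi, psi_dot = state
        x += v * np.cos(psi) * dt                        # planar kinematics
        y += v * np.sin(psi) * dt
        v += (k_thr * throttle - k_drag * v) * dt        # throttle vs. drag
        psi += psi_dot * dt
        psi_dot += k_steer * (steer * v - psi_dot) * dt  # steering response
        return np.array([x, y, v, psi, psi_dot])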

17
RC Car Open-Loop Turn
18
RC Car Circle
19
RC Car Figure-8 Maneuver
20
Related Work
  • Iterative learning control:
  • Uchiyama (1978), Longman et al. (1992), Moore
    (1993), Horowitz (1993), Bien et al. (1991),
    Owens et al. (1995), Chen et al. (1997).
  • Successful robot control with a limited number of
    trials:
  • Atkeson and Schaal (1997), Morimoto and Doya
    (2001).
  • Robust control theory:
  • Zhou et al. (1995), Dullerud and Paganini (2000),
    Bagnell et al. (2001), Morimoto and Atkeson
    (2002).

21
Conclusion
  • We presented an algorithm that uses a crude model
    and a small number of real-life trials to find a
    policy that works well in real-life.
  • Our theoretical results show that, assuming a
    deterministic setting and a reasonable model,
    our algorithm returns a policy that is (locally)
    near-optimal.
  • Our experiments show that our algorithm can
    significantly improve on purely model-based RL by
    using only a small number of real-life trials,
    even when the true system is not deterministic.

22
(No Transcript)
23
Motivating Example
  • A student driver learning to make a 90-degree
    right turn:
  • Only a few trials needed.
  • No accurate model.
  • Key aspects:
  • A real-life trial shows whether the turn is too
    wide or too short.
  • Crude model: turning the steering wheel more to
    the right results in a sharper turn; turning it
    more to the left results in a wider turn.
  • Result: a good policy-gradient estimate.