Title: Apprenticeship Learning for Robotic Control
1. Apprenticeship Learning for Robotic Control
- Pieter Abbeel
- Stanford University
- Joint work with Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley
2. Motivation for apprenticeship learning
3. Outline
- Preliminaries: reinforcement learning.
- Apprenticeship learning algorithms.
- Experimental results on various robotic platforms.
4. Reinforcement learning (RL)
(Diagram: starting from state s0, the system dynamics Psa generate states s1, s2, ..., sT under actions a0, a1, ..., aT-1; each visited state st yields reward R(st).)
- Example reward function: R(s) = -||s - s*||, the negative distance to a desired state s*.
- Goal: pick actions over time so as to maximize the expected score E[R(s0) + R(s1) + ... + R(sT)].
- Solution: a policy π which specifies an action for each possible state, for all times t = 0, 1, ..., T. (A small code sketch of this objective follows.)
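As a concrete illustration of this objective (not from the talk), the following sketch estimates the expected score E[R(s0) + R(s1) + ... + R(sT)] of a fixed policy by Monte Carlo rollouts; the policy, the sampler for the dynamics Psa, and the reward function are hypothetical stand-ins supplied by the caller.

```python
import numpy as np

def expected_return(policy, sample_next_state, reward, s0, T, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of E[R(s0) + R(s1) + ... + R(sT)] under `policy`.

    policy(s, t) -> action a_t
    sample_next_state(s, a, rng) -> next state, drawn from the dynamics P_sa
    reward(s) -> scalar R(s)
    All three are problem-specific stand-ins for this sketch.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, score = s0, reward(s0)
        for t in range(T):
            a = policy(s, t)
            s = sample_next_state(s, a, rng)  # s_{t+1} ~ P_{s_t a_t}
            score += reward(s)
        total += score
    return total / n_rollouts
```

An RL algorithm then searches for the policy that maximizes this expected score.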
5. Model-based reinforcement learning
Control policy π
Run RL algorithm in simulator.
6. Reinforcement learning (RL)
- Apprenticeship learning algorithms use a demonstration to help us find:
- a good dynamics model,
- a good reward function,
- a good control policy.
7. Apprenticeship learning for the dynamics model
(Diagram: dynamics model Psa + reward function R → reinforcement learning → control policy π.)
8. Motivating example
- Goal: an accurate dynamics model Psa.
- Option 1: a textbook model built from the specification.
- Option 2: collect flight data and learn the model from the data.
- How to fly the helicopter for data collection? How to ensure that the entire flight envelope is covered by the data collection process?
9. Learning the dynamical model
- State of the art: the E3 algorithm (Kearns and Singh, 2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002).
(Flowchart: NO → explore; YES → exploit.)
10. Learning the dynamical model
- State of the art: the E3 algorithm (Kearns and Singh, 2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002).
- Exploration policies are impractical: they do not even try to perform well.
- Can we avoid explicit exploration and just exploit?
(Flowchart: NO → explore; YES → exploit.)
11. Apprenticeship learning of the model
(Diagram: the teacher, a human pilot, flies the helicopter, producing a log (a1, s1, a2, s2, a3, s3, ...); from this data we learn Psa; reinforcement learning with the reward function R then produces a control policy π; autonomous flight with that policy produces a further log (a1, s1, a2, s2, a3, s3, ...), from which Psa is learned again.)
- No explicit exploration: always try to fly as well as possible. (A code sketch of this loop follows.)
ICML 2005
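A minimal sketch of the loop in the diagram above, assuming hypothetical helpers fit_dynamics, optimize_policy (the RL step inside the learned model), and fly (one autonomous flight returning its (s, a, s') log): each iteration re-fits Psa on all data gathered so far, and the helicopter always tries to fly well rather than explore.

```python
def apprenticeship_model_learning(teacher_log, reward, fit_dynamics,
                                  optimize_policy, fly, n_iters=10):
    """Learn the dynamics model from a demonstration plus the robot's own flights.

    teacher_log: (s, a, s') transitions recorded from the human pilot.
    fit_dynamics(data) -> dynamics model Psa
    optimize_policy(Psa, reward) -> control policy
    fly(policy) -> (s, a, s') log of one autonomous flight
    All three are hypothetical stand-ins for this sketch.
    """
    data = list(teacher_log)                    # start from the pilot's flight data
    policy = None
    for _ in range(n_iters):
        P_sa = fit_dynamics(data)               # learn Psa from all data so far
        policy = optimize_policy(P_sa, reward)  # RL step inside the learned model
        data.extend(fly(policy))                # fly as well as possible, log new data
    return policy
```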
12. Theorem
- Assuming a polynomial number of teacher demonstrations, then after a polynomial number of trials, with probability at least 1 - δ:
E[sum of rewards | policy returned by the algorithm] ≥ E[sum of rewards | teacher's policy] - ε.
- Here, "polynomial" is with respect to:
- 1/ε,
- 1/δ,
- the horizon T,
- the maximum reward R,
- the size of the state space.
13. Learning the dynamics model
- Details of the algorithm for learning the dynamics model:
- Exploiting structure from physics.
- Lagged learning criterion.
NIPS 2005, 2006
14. Helicopter flight results
- First high-speed autonomous funnels.
- Speed: 5 m/s. Nominal pitch angle: 30 degrees.
15. Autonomous nose-in funnel
16. Accuracy
17. Autonomous tail-in funnel
18. Key points
- Unlike exploration methods, our algorithm concentrates on the task of interest.
- Bootstrapping off an initial teacher demonstration is sufficient to perform the task as well as the teacher.
19. (No transcript)
20. Apprenticeship learning for the reward function
(Diagram: dynamics model Psa + reward function R → reinforcement learning → control policy π.)
21. Example task: driving
22. Related work
- Previous work:
- Learn to predict the teacher's actions as a function of states.
- E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997.
- Assumes policy simplicity.
- Our approach:
- Assumes reward simplicity and is based on inverse reinforcement learning (Ng & Russell, 2000).
- Similar work since: Ratliff et al., 2006, 2007.
23. Inverse reinforcement learning
- Find R s.t. R is consistent with the teacher's policy π* being optimal.
- Find R s.t. E[R(s0) + ... + R(sT) | π*] ≥ E[R(s0) + ... + R(sT) | π] for every policy π.
- Assume R(s) = wᵀφ(s): find w s.t. wᵀμ(π*) ≥ wᵀμ(π) for every policy π, where μ(π) = E[φ(s0) + ... + φ(sT) | π] are the feature expectations.
- Linear constraints in w, quadratic objective → QP.
- Very large number of constraints (one per policy π).
24. Algorithm
- For i = 1, 2, ...
- Inverse RL step: estimate reward weights w for which the teacher maximally outperforms all previously computed policies π1, ..., πi-1.
- RL step (= constraint generation): compute the optimal policy πi for the estimated reward Rw. (A code sketch of this loop follows.)
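A sketch of this loop under the linear-reward assumption Rw(s) = wᵀφ(s) from the previous slide. The inverse RL step below is a max-margin variant: choose w (with ‖w‖ ≤ 1) so that the teacher's feature expectations μ_E beat those of every policy found so far by the largest possible margin; the RL step then adds the optimal policy for Rw as a new constraint. rl_solver and feature_expectations are hypothetical stand-ins, and the small margin problem is solved with scipy here only for concreteness.

```python
import numpy as np
from scipy.optimize import minimize

def irl_step(mu_E, mu_list):
    """Inverse RL step: find w (||w|| <= 1) maximizing the margin t such that
    w . (mu_E - mu_j) >= t for every previously found policy j."""
    mu_E = np.asarray(mu_E, dtype=float)
    d = len(mu_E)
    x0 = np.zeros(d + 1)                            # decision variables x = [w, t]
    cons = [{'type': 'ineq',
             'fun': lambda x, m=np.asarray(m, dtype=float): x[:d] @ (mu_E - m) - x[d]}
            for m in mu_list]
    cons.append({'type': 'ineq', 'fun': lambda x: 1.0 - x[:d] @ x[:d]})  # ||w|| <= 1
    res = minimize(lambda x: -x[d], x0, method='SLSQP', constraints=cons)
    return res.x[:d], res.x[d]                      # reward weights w, margin t

def apprenticeship_via_irl(mu_E, rl_solver, feature_expectations, eps=1e-2, max_iters=50):
    """For i = 1, 2, ...: inverse RL step, then RL step (= constraint generation).

    rl_solver(w) -> optimal policy for reward R_w(s) = w . phi(s)
    feature_expectations(pi) -> mu(pi)
    Both are hypothetical stand-ins for this sketch.
    """
    pi = rl_solver(np.zeros(len(mu_E)))             # arbitrary initial policy
    policies, mus = [pi], [feature_expectations(pi)]
    for _ in range(max_iters):
        w, margin = irl_step(mu_E, mus)             # inverse RL step
        if margin <= eps:                           # teacher's performance matched
            break
        pi = rl_solver(w)                           # RL step for estimated reward R_w
        policies.append(pi)
        mus.append(feature_expectations(pi))
    return w, policies
```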
25. Theoretical guarantees
- Theorem. After at most nT²/ε² iterations, our algorithm returns a policy π that performs as well as the teacher according to the teacher's unknown reward function, i.e.,
E[sum of rewards | π] ≥ E[sum of rewards | teacher's policy] - ε.
- Note: our algorithm does not necessarily recover the teacher's reward function R, which is impossible to recover.
ICML 2004
26. Performance guarantee: intuition
- Intuition by example:
- Let the reward be linear in features, R(s) = wᵀφ(s), with feature expectations μ(π) = E[φ(s0) + ... + φ(sT) | π].
- If the returned policy π matches the teacher's feature expectations, μ(π) = μ(π*),
- then no matter what the values of the individual weights in w are, the policy π performs as well as the teacher's policy π*. (A worked version follows.)
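Written out under the same linear-reward assumption (a reconstruction; the slide's concrete numbers are not in the transcript), the argument is one line of algebra: matching the teacher's feature expectations pins down the expected return for every choice of weights.

```latex
% Assume R(s) = w^\top \phi(s) and define the feature expectations of a policy:
\[
  \mu(\pi) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{T} \phi(s_t) \,\middle|\, \pi\right]
  \quad\Longrightarrow\quad
  \mathbb{E}\!\left[\sum_{t=0}^{T} R(s_t) \,\middle|\, \pi\right] \;=\; w^\top \mu(\pi).
\]
% If the returned policy matches the teacher's feature expectations to within epsilon,
\[
  \left\|\mu(\pi) - \mu(\pi_{\mathrm{teacher}})\right\|_2 \le \epsilon
  \;\text{ and }\; \|w\|_2 \le 1
  \;\;\Longrightarrow\;\;
  \left| w^\top \mu(\pi) - w^\top \mu(\pi_{\mathrm{teacher}}) \right| \le \epsilon,
\]
% i.e. the returned policy's expected sum of rewards is within epsilon of the
% teacher's, no matter what the individual weight values are.
```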
27. Case study: highway driving
- Input: driving demonstration.
- Output: learned behavior.
- The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.
28. More driving examples
- In each video, the left sub-panel shows a demonstration of a different driving style, and the right sub-panel shows the behavior learned from watching the demonstration.
29. Helicopter
- 25 features.
- Differential dynamic programming (Jacobson & Mayne, 1970; Anderson & Moore, 1989).
NIPS 2007
30. Autonomous aerobatics
- Show helicopter movie in Media Player.
31. Quadruped
32. Quadruped
- Reward function trades off:
- Height differential of terrain.
- Gradient of terrain around each foot.
- Height differential between feet.
- (25 features total for our setup; a minimal sketch follows.)
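To make the trade-off concrete, here is a minimal sketch of a linear reward over footstep features of the kind listed above. The specific feature definitions, window sizes, and helper names are hypothetical (the talk only states that there are 25 features in total); the weights w are what the inverse RL step would learn.

```python
import numpy as np

def footstep_features(terrain, feet, candidate):
    """Hypothetical terrain/footstep features of the kind listed above.

    terrain(x, y) -> height; feet -> current (x, y) foot positions;
    candidate -> proposed (x, y) foot placement.
    """
    x, y = candidate
    h = terrain(x, y)
    nearby = np.array([terrain(x + dx, y + dy)
                       for dx, dy in [(-0.05, 0), (0.05, 0), (0, -0.05), (0, 0.05)]])
    return np.array([
        abs(h - np.mean([terrain(*f) for f in feet])),   # height differential of terrain
        np.max(np.abs(nearby - h)),                      # gradient of terrain around the foot
        np.max([abs(terrain(*f) - h) for f in feet]),    # height differential between feet
    ])

def footstep_reward(w, terrain, feet, candidate):
    """Linear reward R = w . phi(candidate); the learned weights w encode the trade-off."""
    return float(w @ footstep_features(terrain, feet, candidate))
```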
33. Teacher demonstration for quadruped
- Full teacher demonstration = a sequence of footsteps.
- Much simpler to teach hierarchically:
- Specify a body path.
- Specify the best footstep in a small area.
34. Hierarchical inverse RL
- Quadratic programming problem (QP): quadratic objective, linear constraints.
- Constraint generation for path constraints (a sketch follows).
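A sketch of constraint generation as it might apply here, under the assumption (not spelled out on the slide) that each labeled footstep or body path should score at least as well under the learned reward as any alternative the planner can produce. solve_qp, plan_best, and features are hypothetical stand-ins.

```python
def learn_reward_constraint_generation(labels, features, plan_best, solve_qp,
                                       margin=1e-3, max_iters=100):
    """Constraint generation for the inverse-RL QP.

    labels: (context, demonstrated_choice) pairs (footsteps or body paths).
    features(context, choice) -> feature vector phi
    plan_best(context, w) -> the planner's best choice under weights w
    solve_qp(constraints) -> w minimizing the quadratic objective subject to
        w . phi_demo >= w . phi_alt for each collected (phi_demo, phi_alt) pair
    All three are hypothetical stand-ins for this sketch.
    """
    constraints = []
    w = solve_qp(constraints)                  # unconstrained starting point
    for _ in range(max_iters):
        added = False
        for context, demo in labels:
            alt = plan_best(context, w)        # most competitive alternative under w
            phi_demo, phi_alt = features(context, demo), features(context, alt)
            if w @ phi_demo < w @ phi_alt + margin:   # demonstration loses: add constraint
                constraints.append((phi_demo, phi_alt))
                added = True
        if not added:                          # no violated constraints remain
            return w
        w = solve_qp(constraints)              # re-solve the QP with the new constraints
    return w
```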
35. Experimental setup
- Training:
- Have the quadruped walk straight across a fairly simple board with fixed-spaced foot placements.
- Around each foot placement, label the best foot placement (about 20 labels).
- Label the best body path for the training board.
- Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.
- Test on hold-out terrains:
- Plan a path across the test board.
36. Quadruped on test board
- Show movie in Media Player.
37. (No transcript)
38. Apprenticeship learning: RL algorithm
- (Sloppy) demonstration.
- (Crude) model.
- Small number of real-life trials. (A sketch of how these combine follows.)
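One hedged reading of how these three ingredients could fit together (a sketch only; the helpers and the exact model-correction scheme are assumptions, not the talk's stated method): optimize in the crude model starting from the demonstration, run a small number of real-life trials, and after each trial shift the model so that it reproduces what the hardware actually did before re-optimizing.

```python
def policy_search_with_crude_model(demo_traj, crude_model, optimize_policy,
                                   run_real_trial, n_real_trials=5):
    """Policy search from a (sloppy) demonstration, a (crude) model, and a
    small number of real-life trials.

    crude_model.predict(s, a) -> predicted next state
    crude_model.with_bias(bias) -> corrected model adding bias[t] at step t
    optimize_policy(model, init) -> policy; run_real_trial(policy) -> (s, a, s') log
    All of these are hypothetical stand-ins for this sketch.
    """
    model = crude_model
    policy = optimize_policy(model, init=demo_traj)      # start from the demonstration
    for _ in range(n_real_trials):
        real_traj = run_real_trial(policy)               # one real-life trial
        # Correct the model so it reproduces the observed trajectory step by step.
        bias = [s_next - model.predict(s, a) for (s, a, s_next) in real_traj]
        model = crude_model.with_bias(bias)
        policy = optimize_policy(model, init=real_traj)  # re-optimize in corrected model
    return policy
```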
39. Experiments
- Two systems:
- RC car.
- Fixed-wing flight simulator.
- Control actions: throttle and steering.
40. RC Car: Circle
41. RC Car: Figure-8 Maneuver
42. Conclusion
- Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations.
- Our current work exploits teacher demonstrations to find:
- a good dynamics model,
- a good reward function,
- a good control policy.
43. Acknowledgments
- Adam Coates, Morgan Quigley, Andrew Y. Ng
- Morgan Quigley, Andrew Y. Ng
- J. Zico Kolter, Andrew Y. Ng