Title: Apprenticeship Learning for Robotic Control
1. Apprenticeship Learning for Robotic Control
- Pieter Abbeel
- Stanford University
- Joint work with Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley
2. Motivation for apprenticeship learning
3. Outline
- Preliminaries: reinforcement learning.
- Apprenticeship learning algorithms.
- Experimental results on various robotic platforms.
4. Reinforcement learning (RL)
(Diagram: starting from state s0, the system dynamics Psa generate states s1, s2, ..., sT under actions a0, a1, ..., aT-1; each visited state st yields reward R(st).)
- Example reward function: R(s) = -||s - s*||, the negative distance to a desired state s*.
- Goal: pick actions over time so as to maximize the expected score E[R(s0) + R(s1) + ... + R(sT)].
- Solution: a policy π which specifies an action for each possible state, for all times t = 0, 1, ..., T. (A small code sketch of this objective follows.)
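As a concrete illustration of this objective (not from the talk), the following sketch estimates the expected score E[R(s0) + R(s1) + ... + R(sT)] of a fixed policy by Monte Carlo rollouts; the policy, the sampler for the dynamics Psa, and the reward function are hypothetical stand-ins supplied by the caller.

```python
import numpy as np

def expected_return(policy, sample_next_state, reward, s0, T, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of E[R(s0) + R(s1) + ... + R(sT)] under `policy`.

    policy(s, t) -> action a_t
    sample_next_state(s, a, rng) -> next state, drawn from the dynamics P_sa
    reward(s) -> scalar R(s)
    All three are problem-specific stand-ins for this sketch.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, score = s0, reward(s0)
        for t in range(T):
            a = policy(s, t)
            s = sample_next_state(s, a, rng)  # s_{t+1} ~ P_{s_t a_t}
            score += reward(s)
        total += score
    return total / n_rollouts
```

An RL algorithm then searches for the policy that maximizes this expected score.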
5. Model-based reinforcement learning
Control policy π
Run RL algorithm in simulator.
6. Reinforcement learning (RL)
- Apprenticeship learning algorithms use a demonstration to help us find:
- a good dynamics model,
- a good reward function,
- a good control policy.
7. Apprenticeship learning for the dynamics model
(Diagram: dynamics model Psa + reward function R → reinforcement learning → control policy π.)
8. Motivating example
- Goal: an accurate dynamics model Psa.
- Option 1: a textbook model built from the specification.
- Option 2: collect flight data and learn the model from the data.
- How to fly the helicopter for data collection? How to ensure that the entire flight envelope is covered by the data collection process?
9. Learning the dynamical model
- State of the art: the E3 algorithm (Kearns and Singh, 2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002).
(Flowchart: NO → explore; YES → exploit.)
10. Learning the dynamical model
- State of the art: the E3 algorithm (Kearns and Singh, 2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002).
- Exploration policies are impractical: they do not even try to perform well.
- Can we avoid explicit exploration and just exploit?
(Flowchart: NO → explore; YES → exploit.)
11. Apprenticeship learning of the model
(Diagram: the teacher, a human pilot, flies the helicopter, producing a log (a1, s1, a2, s2, a3, s3, ...); from this data we learn Psa; reinforcement learning with the reward function R then produces a control policy π; autonomous flight with that policy produces a further log (a1, s1, a2, s2, a3, s3, ...), from which Psa is learned again.)
- No explicit exploration: always try to fly as well as possible. (A code sketch of this loop follows.)
ICML 2005
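A minimal sketch of the loop in the diagram above, assuming hypothetical helpers fit_dynamics, optimize_policy (the RL step inside the learned model), and fly (one autonomous flight returning its (s, a, s') log): each iteration re-fits Psa on all data gathered so far, and the helicopter always tries to fly well rather than explore.

```python
def apprenticeship_model_learning(teacher_log, reward, fit_dynamics,
                                  optimize_policy, fly, n_iters=10):
    """Learn the dynamics model from a demonstration plus the robot's own flights.

    teacher_log: (s, a, s') transitions recorded from the human pilot.
    fit_dynamics(data) -> dynamics model Psa
    optimize_policy(Psa, reward) -> control policy
    fly(policy) -> (s, a, s') log of one autonomous flight
    All three are hypothetical stand-ins for this sketch.
    """
    data = list(teacher_log)                    # start from the pilot's flight data
    policy = None
    for _ in range(n_iters):
        P_sa = fit_dynamics(data)               # learn Psa from all data so far
        policy = optimize_policy(P_sa, reward)  # RL step inside the learned model
        data.extend(fly(policy))                # fly as well as possible, log new data
    return policy
```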
12. Theorem
- Assuming a polynomial number of teacher demonstrations, then after a polynomial number of trials, with probability at least 1 - δ:
E[sum of rewards | policy returned by the algorithm] ≥ E[sum of rewards | teacher's policy] - ε.
- Here, "polynomial" is with respect to:
- 1/ε,
- 1/δ,
- the horizon T,
- the maximum reward R,
- the size of the state space.
13. Learning the dynamics model
- Details of the algorithm for learning the dynamics model:
- Exploiting structure from physics.
- Lagged learning criterion.
NIPS 2005, 2006
14. Helicopter flight results
- First high-speed autonomous funnels.
- Speed: 5 m/s. Nominal pitch angle: 30 degrees.
15. Autonomous nose-in funnel
16. Accuracy
17. Autonomous tail-in funnel
18. Key points
- Unlike exploration methods, our algorithm concentrates on the task of interest.
- Bootstrapping off an initial teacher demonstration is sufficient to perform the task as well as the teacher.
19. (No transcript)
20. Apprenticeship learning for the reward function
(Diagram: dynamics model Psa + reward function R → reinforcement learning → control policy π.)
21. Example task: driving
22. Related work
- Previous work:
- Learn to predict the teacher's actions as a function of states.
- E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997.
- Assumes policy simplicity.
- Our approach:
- Assumes reward simplicity and is based on inverse reinforcement learning (Ng & Russell, 2000).
- Similar work since: Ratliff et al., 2006, 2007.
23. Inverse reinforcement learning
- Find R s.t. R is consistent with the teacher's policy π* being optimal.
- Find R s.t. E[R(s0) + ... + R(sT) | π*] ≥ E[R(s0) + ... + R(sT) | π] for every policy π.
- Assume R(s) = wᵀφ(s): find w s.t. wᵀμ(π*) ≥ wᵀμ(π) for every policy π, where μ(π) = E[φ(s0) + ... + φ(sT) | π] are the feature expectations.
- Linear constraints in w, quadratic objective → QP.
- Very large number of constraints (one per policy π).
24. Algorithm
- For i = 1, 2, ...
- Inverse RL step: estimate reward weights w for which the teacher maximally outperforms all previously computed policies π1, ..., πi-1.
- RL step (= constraint generation): compute the optimal policy πi for the estimated reward Rw. (A code sketch of this loop follows.)
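A sketch of this loop under the linear-reward assumption Rw(s) = wᵀφ(s) from the previous slide. The inverse RL step below is a max-margin variant: choose w (with ‖w‖ ≤ 1) so that the teacher's feature expectations μ_E beat those of every policy found so far by the largest possible margin; the RL step then adds the optimal policy for Rw as a new constraint. rl_solver and feature_expectations are hypothetical stand-ins, and the small margin problem is solved with scipy here only for concreteness.

```python
import numpy as np
from scipy.optimize import minimize

def irl_step(mu_E, mu_list):
    """Inverse RL step: find w (||w|| <= 1) maximizing the margin t such that
    w . (mu_E - mu_j) >= t for every previously found policy j."""
    mu_E = np.asarray(mu_E, dtype=float)
    d = len(mu_E)
    x0 = np.zeros(d + 1)                            # decision variables x = [w, t]
    cons = [{'type': 'ineq',
             'fun': lambda x, m=np.asarray(m, dtype=float): x[:d] @ (mu_E - m) - x[d]}
            for m in mu_list]
    cons.append({'type': 'ineq', 'fun': lambda x: 1.0 - x[:d] @ x[:d]})  # ||w|| <= 1
    res = minimize(lambda x: -x[d], x0, method='SLSQP', constraints=cons)
    return res.x[:d], res.x[d]                      # reward weights w, margin t

def apprenticeship_via_irl(mu_E, rl_solver, feature_expectations, eps=1e-2, max_iters=50):
    """For i = 1, 2, ...: inverse RL step, then RL step (= constraint generation).

    rl_solver(w) -> optimal policy for reward R_w(s) = w . phi(s)
    feature_expectations(pi) -> mu(pi)
    Both are hypothetical stand-ins for this sketch.
    """
    pi = rl_solver(np.zeros(len(mu_E)))             # arbitrary initial policy
    policies, mus = [pi], [feature_expectations(pi)]
    for _ in range(max_iters):
        w, margin = irl_step(mu_E, mus)             # inverse RL step
        if margin <= eps:                           # teacher's performance matched
            break
        pi = rl_solver(w)                           # RL step for estimated reward R_w
        policies.append(pi)
        mus.append(feature_expectations(pi))
    return w, policies
```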
25. Theoretical guarantees
- Theorem. After at most nT²/ε² iterations, our algorithm returns a policy π that performs as well as the teacher according to the teacher's unknown reward function, i.e.,
E[sum of rewards | π] ≥ E[sum of rewards | teacher's policy] - ε.
- Note: our algorithm does not necessarily recover the teacher's reward function R, which is impossible to recover.
ICML 2004
26. Performance guarantee: intuition
- Intuition by example:
- Let the reward be linear in features, R(s) = wᵀφ(s), with feature expectations μ(π) = E[φ(s0) + ... + φ(sT) | π].
- If the returned policy π matches the teacher's feature expectations, μ(π) = μ(π*),
- then no matter what the values of the individual weights in w are, the policy π performs as well as the teacher's policy π*. (A worked version follows.)
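Written out under the same linear-reward assumption (a reconstruction; the slide's concrete numbers are not in the transcript), the argument is one line of algebra: matching the teacher's feature expectations pins down the expected return for every choice of weights.

```latex
% Assume R(s) = w^\top \phi(s) and define the feature expectations of a policy:
\[
  \mu(\pi) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{T} \phi(s_t) \,\middle|\, \pi\right]
  \quad\Longrightarrow\quad
  \mathbb{E}\!\left[\sum_{t=0}^{T} R(s_t) \,\middle|\, \pi\right] \;=\; w^\top \mu(\pi).
\]
% If the returned policy matches the teacher's feature expectations to within epsilon,
\[
  \left\|\mu(\pi) - \mu(\pi_{\mathrm{teacher}})\right\|_2 \le \epsilon
  \;\text{ and }\; \|w\|_2 \le 1
  \;\;\Longrightarrow\;\;
  \left| w^\top \mu(\pi) - w^\top \mu(\pi_{\mathrm{teacher}}) \right| \le \epsilon,
\]
% i.e. the returned policy's expected sum of rewards is within epsilon of the
% teacher's, no matter what the individual weight values are.
```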
27. Case study: highway driving
- Input: driving demonstration.
- Output: learned behavior.
- The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.
28. More driving examples
- In each video, the left sub-panel shows a demonstration of a different driving style, and the right sub-panel shows the behavior learned from watching the demonstration.
29. Helicopter
- 25 features.
- Differential dynamic programming (Jacobson & Mayne, 1970; Anderson & Moore, 1989).
NIPS 2007
30. Autonomous aerobatics
- Show helicopter movie in Media Player.
31. Quadruped
32. Quadruped
- Reward function trades off:
- Height differential of terrain.
- Gradient of terrain around each foot.
- Height differential between feet.
- (25 features total for our setup; a minimal sketch follows.)
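To make the trade-off concrete, here is a minimal sketch of a linear reward over footstep features of the kind listed above. The specific feature definitions, window sizes, and helper names are hypothetical (the talk only states that there are 25 features in total); the weights w are what the inverse RL step would learn.

```python
import numpy as np

def footstep_features(terrain, feet, candidate):
    """Hypothetical terrain/footstep features of the kind listed above.

    terrain(x, y) -> height; feet -> current (x, y) foot positions;
    candidate -> proposed (x, y) foot placement.
    """
    x, y = candidate
    h = terrain(x, y)
    nearby = np.array([terrain(x + dx, y + dy)
                       for dx, dy in [(-0.05, 0), (0.05, 0), (0, -0.05), (0, 0.05)]])
    return np.array([
        abs(h - np.mean([terrain(*f) for f in feet])),   # height differential of terrain
        np.max(np.abs(nearby - h)),                      # gradient of terrain around the foot
        np.max([abs(terrain(*f) - h) for f in feet]),    # height differential between feet
    ])

def footstep_reward(w, terrain, feet, candidate):
    """Linear reward R = w . phi(candidate); the learned weights w encode the trade-off."""
    return float(w @ footstep_features(terrain, feet, candidate))
```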
33. Teacher demonstration for quadruped
- Full teacher demonstration = a sequence of footsteps.
- Much simpler to teach hierarchically:
- Specify a body path.
- Specify the best footstep in a small area.
34. Hierarchical inverse RL
- Quadratic programming problem (QP): quadratic objective, linear constraints.
- Constraint generation for path constraints (a sketch follows).
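A sketch of constraint generation as it might apply here, under the assumption (not spelled out on the slide) that each labeled footstep or body path should score at least as well under the learned reward as any alternative the planner can produce. solve_qp, plan_best, and features are hypothetical stand-ins.

```python
def learn_reward_constraint_generation(labels, features, plan_best, solve_qp,
                                       margin=1e-3, max_iters=100):
    """Constraint generation for the inverse-RL QP.

    labels: (context, demonstrated_choice) pairs (footsteps or body paths).
    features(context, choice) -> feature vector phi
    plan_best(context, w) -> the planner's best choice under weights w
    solve_qp(constraints) -> w minimizing the quadratic objective subject to
        w . phi_demo >= w . phi_alt for each collected (phi_demo, phi_alt) pair
    All three are hypothetical stand-ins for this sketch.
    """
    constraints = []
    w = solve_qp(constraints)                  # unconstrained starting point
    for _ in range(max_iters):
        added = False
        for context, demo in labels:
            alt = plan_best(context, w)        # most competitive alternative under w
            phi_demo, phi_alt = features(context, demo), features(context, alt)
            if w @ phi_demo < w @ phi_alt + margin:   # demonstration loses: add constraint
                constraints.append((phi_demo, phi_alt))
                added = True
        if not added:                          # no violated constraints remain
            return w
        w = solve_qp(constraints)              # re-solve the QP with the new constraints
    return w
```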
35. Experimental setup
- Training:
- Have the quadruped walk straight across a fairly simple board with fixed-spaced foot placements.
- Around each foot placement, label the best foot placement (about 20 labels).
- Label the best body path for the training board.
- Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.
- Test on hold-out terrains:
- Plan a path across the test board.
36. Quadruped on test board
- Show movie in Media Player.
37. (No transcript)
38. Apprenticeship learning: RL algorithm
- (Sloppy) demonstration.
- (Crude) model.
- Small number of real-life trials. (A sketch of how these combine follows.)
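One hedged reading of how these three ingredients could fit together (a sketch only; the helpers and the exact model-correction scheme are assumptions, not the talk's stated method): optimize in the crude model starting from the demonstration, run a small number of real-life trials, and after each trial shift the model so that it reproduces what the hardware actually did before re-optimizing.

```python
def policy_search_with_crude_model(demo_traj, crude_model, optimize_policy,
                                   run_real_trial, n_real_trials=5):
    """Policy search from a (sloppy) demonstration, a (crude) model, and a
    small number of real-life trials.

    crude_model.predict(s, a) -> predicted next state
    crude_model.with_bias(bias) -> corrected model adding bias[t] at step t
    optimize_policy(model, init) -> policy; run_real_trial(policy) -> (s, a, s') log
    All of these are hypothetical stand-ins for this sketch.
    """
    model = crude_model
    policy = optimize_policy(model, init=demo_traj)      # start from the demonstration
    for _ in range(n_real_trials):
        real_traj = run_real_trial(policy)               # one real-life trial
        # Correct the model so it reproduces the observed trajectory step by step.
        bias = [s_next - model.predict(s, a) for (s, a, s_next) in real_traj]
        model = crude_model.with_bias(bias)
        policy = optimize_policy(model, init=real_traj)  # re-optimize in corrected model
    return policy
```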
39. Experiments
- Two systems:
- RC car.
- Fixed-wing flight simulator.
- Control actions: throttle and steering.
40. RC Car: Circle
41. RC Car: Figure-8 Maneuver
42. Conclusion
- Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations.
- Our current work exploits teacher demonstrations to find:
- a good dynamics model,
- a good reward function,
- a good control policy.
43. Acknowledgments
- Adam Coates, Morgan Quigley, Andrew Y. Ng
- Morgan Quigley, Andrew Y. Ng
- J. Zico Kolter, Andrew Y. Ng