Transcript and Presenter's Notes

Title: Apprenticeship Learning for Robotic Control


1
  • Apprenticeship Learning for Robotic Control
  • Pieter Abbeel
  • Stanford University
  • Joint work with Andrew Y. Ng, Adam Coates, J.
    Zico Kolter and Morgan Quigley

2
Motivation for apprenticeship learning
3
Outline
  • Preliminaries: reinforcement learning.
  • Apprenticeship learning algorithms.
  • Experimental results on various robotic
    platforms.

4
Reinforcement learning (RL)
[Diagram: starting from state s0, the system dynamics Psa and actions a0, a1, …, aT-1 generate the state sequence s0, s1, …, sT, collecting rewards R(s0), R(s1), …, R(sT).]
Example reward function: R(s) = -|| s - s* || (e.g., negative distance to a desired state s*).
Goal: pick actions over time so as to maximize the
expected score E[R(s0) + R(s1) + … + R(sT)].
Solution: a policy π which specifies an action for
each possible state, for all times t = 0, 1, …, T.
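As a concrete illustration of this objective, here is a minimal sketch (not from the talk; all names are illustrative) of estimating the expected score of a policy in a finite-horizon MDP by Monte Carlo rollouts:

```python
import numpy as np

# Minimal sketch: estimate E[R(s0) + R(s1) + ... + R(sT)] for a policy
# by sampling rollouts through the (stochastic) system dynamics P_sa.
def expected_score(policy, dynamics, reward, s0, T, n_rollouts=1000, seed=0):
    """policy(s) -> action; dynamics(s, a, rng) -> next state; reward(s) -> float."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, score = s0, reward(s0)
        for t in range(T):
            a = policy(s)              # action a_t chosen by the policy
            s = dynamics(s, a, rng)    # sample s_{t+1} ~ P_sa
            score += reward(s)         # accumulate R(s_{t+1})
        total += score
    return total / n_rollouts
```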
5
Model-based reinforcement learning
Control policy π
Run RL algorithm in simulator.
6
Reinforcement learning (RL)
  • Apprenticeship learning algorithms use a
    demonstration to help us find
  • a good dynamics model,
  • a good reward function,
  • a good control policy.

7
Apprenticeship learning for the dynamics model
[Diagram: dynamics model Psa + reward function R → Reinforcement Learning → control policy π]
8
Motivating example
[Diagram: two routes to an accurate dynamics model Psa: start from a textbook model / specification, or collect flight data and learn the model from the data.]
How do we fly the helicopter for data collection?
How do we ensure that the entire flight envelope is
covered by the data collection process?
9
Learning the dynamical model
  • State-of-the-art: the E³ algorithm, Kearns and
    Singh (2002). (And its variants/extensions:
    Kearns and Koller, 1999; Kakade, Kearns and
    Langford, 2003; Brafman and Tennenholtz, 2002.)

[Flowchart: at each step, E³ chooses between two modes: Explore (NO branch) and Exploit (YES branch).]
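Conceptually, E³-style algorithms track which parts of the state space the learned model already predicts well and explore elsewhere. A much-simplified sketch of that decision rule (illustrative only, not the talk's implementation):

```python
from collections import defaultdict

# Simplified E3-style explore/exploit decision: a state is treated as
# "known" once it has been visited enough times for the estimated
# dynamics there to be trusted.
class ExploreExploitDecider:
    def __init__(self, known_threshold=50):
        self.visit_counts = defaultdict(int)
        self.known_threshold = known_threshold

    def observe(self, state):
        self.visit_counts[state] += 1

    def mode(self, state):
        # YES branch: model is accurate here -> exploit (optimize the task reward).
        # NO branch: model is uncertain here -> explore (seek poorly-modeled states).
        return "exploit" if self.visit_counts[state] >= self.known_threshold else "explore"
```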
10
Learning the dynamical model
  • State-of-the-art: the E³ algorithm, Kearns and
    Singh (2002). (And its variants/extensions:
    Kearns and Koller, 1999; Kakade, Kearns and
    Langford, 2003; Brafman and Tennenholtz, 2002.)

Exploration policies are impractical: they do not
even try to perform well.
[Flowchart: Explore vs. Exploit, as on the previous slide.]
Can we avoid explicit exploration and just exploit?
11
Apprenticeship learning of the model
[Diagram: the teacher (a human pilot) flies the helicopter, producing a trajectory (a1, s1, a2, s2, a3, s3, …) from which Psa is learned; subsequent autonomous flights produce further trajectories (a1, s1, a2, s2, a3, s3, …) used to re-learn Psa. The dynamics model Psa and reward function R feed Reinforcement Learning, which outputs a control policy π.]
No explicit exploration: always try to fly as well
as possible.
ICML 2005
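A schematic of the alternation depicted above, with placeholder functions standing in for the real components (model fitting, RL in the learned model, and real flights); this is a sketch of the idea, not the ICML 2005 implementation:

```python
# Sketch of the apprenticeship model-learning loop: start from teacher
# flight data, then alternate fitting P_sa, solving for a policy in the
# learned model, and flying that policy to gather more data.
def apprenticeship_model_learning(teacher_trajectories, reward_fn,
                                  fit_dynamics, run_rl_in_model, fly_policy,
                                  n_iterations=10):
    data = list(teacher_trajectories)               # (s, a, s') triples from the human pilot
    policy = None
    for _ in range(n_iterations):
        model = fit_dynamics(data)                  # learn P_sa from all data so far
        policy = run_rl_in_model(model, reward_fn)  # RL step inside the learned model
        trajectory = fly_policy(policy)             # autonomous flight: no explicit exploration,
        data.extend(trajectory)                     #   just try to fly as well as possible
    return policy
```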
12
Theorem.
  • Assuming a polynomial number of teacher
    demonstrations, then after a polynomial number
    of trials, with probability 1 - δ:
  • E[sum of rewards | policy returned by algorithm]
    ≥ E[sum of rewards | teacher's policy] - ε.
  • Here, polynomial is with respect to:
  • 1/ε,
  • 1/δ,
  • the horizon T,
  • the maximum reward R,
  • the size of the state space.

13
Learning the dynamics model
  • Details of algorithm for learning dynamics model
  • Exploiting structure from physics
  • Lagged learning criterion

NIPS 2005, 2006
14
Helicopter flight results
  • First high-speed autonomous funnels.
  • Speed: 5 m/s. Nominal pitch angle: 30 degrees.

15
Autonomous nose-in funnel
16
Accuracy
17
Autonomous tail-in funnel
18
Key points
  • Unlike exploration methods, our algorithm
    concentrates on the task of interest.
  • Bootstrapping off an initial teacher
    demonstration is sufficient to perform the task
    as well as the teacher.

19
(No Transcript)
20
Apprenticeship learning: reward
[Diagram: dynamics model Psa + reward function R → Reinforcement Learning → control policy π]
21
Example task: driving
22
Related work
  • Previous work:
  • Learn to predict the teacher's actions as a
    function of states.
  • E.g., Pomerleau, 1989; Sammut et al., 1992;
    Kuniyoshi et al., 1994; Demiris & Hayes, 1994;
    Amit & Mataric, 2002; Atkeson & Schaal, 1997.
  • Assumes policy simplicity.
  • Our approach:
  • Assumes reward simplicity and is based on
    inverse reinforcement learning (Ng & Russell,
    2000).
  • Similar work since: Ratliff et al., 2006, 2007.

23
Inverse reinforcement learning
  • Find R s.t. R is consistent with the teacher's
    policy π* being optimal.
  • Find R s.t.
    E[R(s0) + … + R(sT) | π*] ≥ E[R(s0) + … + R(sT) | π] for all policies π.
  • For a linear reward R_w(s) = w^T φ(s): find w s.t.
    w^T μ(π*) ≥ w^T μ(π) for all π, where
    μ(π) = E[φ(s0) + … + φ(sT) | π] are the expected feature counts.
  • Linear constraints in w, quadratic objective →
    QP.
  • Very large number of constraints (one per
    policy π).

24
Algorithm
  • For i = 1, 2, …
  • Inverse RL step: estimate w (e.g., by a
    max-margin QP) such that the teacher outperforms
    all previously computed policies π1, …, πi-1
    under R_w.
  • RL step (= constraint generation):
  • Compute the optimal policy πi for the estimated
    reward R_w.
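A schematic sketch of this iteration, assuming a linear reward R_w(s) = w^T φ(s) and using cvxpy for the max-margin inverse RL step; rl_solver and compute_feature_expectations are placeholder components, not code from the talk:

```python
import numpy as np
import cvxpy as cp

# Sketch of the alternation above: a max-margin inverse RL step followed
# by an RL step, where each new policy adds one constraint.
def apprenticeship_irl(mu_teacher, rl_solver, compute_feature_expectations,
                       n_iterations=20):
    k = len(mu_teacher)
    mus = []                                   # feature expectations of policies found so far
    policy = None
    for _ in range(n_iterations):
        if mus:
            # Inverse RL step: find w (||w|| <= 1) so the teacher outperforms
            # all previously computed policies by the largest margin.
            w, margin = cp.Variable(k), cp.Variable()
            constraints = [cp.norm(w, 2) <= 1]
            constraints += [mu_teacher @ w >= mu @ w + margin for mu in mus]
            cp.Problem(cp.Maximize(margin), constraints).solve()
            if margin.value is not None and margin.value <= 1e-3:
                break                          # teacher is (nearly) matched
            w = w.value
        else:
            w = np.ones(k) / np.sqrt(k)        # arbitrary initial reward weights
        # RL step (= constraint generation): optimal policy for R_w adds a constraint.
        policy = rl_solver(w)
        mus.append(compute_feature_expectations(policy))
    return policy
```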

25
Theoretical guarantees
  • Theorem.
  • After at most nT²/ε² iterations, our algorithm
    returns a policy π that performs as well as the
    teacher according to the teacher's unknown reward
    function, i.e.,
    E[sum of rewards | π] ≥ E[sum of rewards | teacher's policy] - ε.
  • Note: our algorithm does not necessarily recover
    the teacher's reward function R --- which is
    impossible to recover.

ICML 2004
26
Performance guarantee intuition
  • Intuition by example:
  • Let R(s) = w1 φ1(s) + w2 φ2(s) for unknown
    weights w1, w2.
  • If the returned policy π matches the teacher's
    expected feature counts, i.e.,
    E[φ(s0) + … + φ(sT) | π] = E[φ(s0) + … + φ(sT) | π*],
  • then no matter what the values of w1 and w2
    are, the policy π performs as well as the
    teacher's policy π*.
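A tiny numeric check of this intuition (the numbers are made up for illustration): when the expected feature counts match, the expected score is identical for every choice of reward weights.

```python
import numpy as np

# Made-up expected feature counts; the returned policy matches the teacher's.
mu_teacher = np.array([3.2, 1.5])
mu_returned = np.array([3.2, 1.5])

# For any reward weights w, the two policies get the same expected score.
for w in np.random.default_rng(0).normal(size=(5, 2)):
    assert np.isclose(w @ mu_teacher, w @ mu_returned)
```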

27
Case study: highway driving
Input: driving demonstration
Output: learned behavior
The only input to the learning algorithm was the
driving demonstration (left panel). No reward
function was provided.
28
More driving examples
[Two videos, each with a driving-demonstration sub-panel and a learned-behavior sub-panel.]
In each video, the left sub-panel shows a
demonstration of a different driving style, and
the right sub-panel shows the behavior learned
from watching the demonstration.
29
Helicopter
25 features
Differential dynamic programming [Jacobson &
Mayne, 1970; Anderson & Moore, 1989]
NIPS 2007
30
Autonomous aerobatics
  • Show helicopter movie in Media Player.

31
Quadruped
32
Quadruped
  • Reward function trades off
  • Height differential of terrain.
  • Gradient of terrain around each foot.
  • Height differential between feet.
  • (25 features total for our setup)
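To make the feature-based reward concrete, here is a sketch of what a feature vector for one candidate footstep might look like; the heightmap representation and anything beyond the three features listed above are illustrative assumptions, not the talk's actual 25 features.

```python
import numpy as np

# Illustrative footstep features for a reward of the form R_w(s) = w . phi(s).
# `heightmap` is a 2-D array of terrain heights indexed by grid cell.
def footstep_features(heightmap, foot_cell, other_feet_cells, radius=2):
    r0, c0 = foot_cell
    patch = heightmap[max(r0 - radius, 0):r0 + radius + 1,
                      max(c0 - radius, 0):c0 + radius + 1]
    grad_r, grad_c = np.gradient(heightmap)
    return np.array([
        patch.max() - patch.min(),                        # height differential of terrain
        np.hypot(grad_r[r0, c0], grad_c[r0, c0]),         # terrain gradient around the foot
        max((abs(heightmap[r0, c0] - heightmap[r, c])     # height differential between feet
             for r, c in other_feet_cells), default=0.0),
    ])
```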

33
Teacher demonstration for quadruped
  • Full teacher demonstration: a sequence of
    footsteps.
  • Much simpler to teach hierarchically:
  • Specify a body path.
  • Specify best footstep in a small area.

34
Hierarchical inverse RL
  • Quadratic programming problem (QP): quadratic
    objective, linear constraints.
  • Constraint generation for path constraints.

35
Experimental setup
  • Training:
  • Have the quadruped walk straight across a fairly
    simple board with fixed-spaced foot placements.
  • Around each foot placement, label the best foot
    placement (about 20 labels).
  • Label the best body-path for the training board.
  • Use our hierarchical inverse RL algorithm to
    learn a reward function from the footstep and
    path labels.
  • Test on hold-out terrains:
  • Plan a path across the test-board.

36
Quadruped on test-board
  • Show movie in Media Player.

37
(No Transcript)
38
Apprenticeship learning: RL algorithm
  • (Sloppy) demonstration
  • (Crude) model
  • Small number of real-life trials

39
Experiments
  • Two systems:
  • RC car.
  • Fixed-wing flight simulator.

Control actions: throttle and steering.
40
RC Car: Circle
41
RC Car: Figure-8 Maneuver
42
Conclusion
  • Apprenticeship learning algorithms help us find
    better controllers by exploiting teacher
    demonstrations.
  • Our current work exploits teacher demonstrations
    to find
  • a good dynamics model,
  • a good reward function,
  • a good control policy.

43
Acknowledgments
  • Adam Coates, Morgan Quigley, Andrew Y. Ng
  • Andrew Y. Ng
  • Morgan Quigley, Andrew Y. Ng
  • J. Zico Kolter, Andrew Y. Ng