Apprenticeship Learning via Inverse Reinforcement Learning

Slides: 36

Provided by: OAO97

Learn more at: http://ai.stanford.edu

Transcript and Presenter's Notes

1
Apprenticeship Learning via Inverse
Reinforcement Learning

Pieter Abbeel and Andrew Y. Ng
Stanford University

2
Motivation

Reinforcement learning (RL) gives powerful tools
for solving MDPs, but it can be difficult to specify
the reward function. Example: highway driving.

3
Apprenticeship Learning

Learning from observing an expert.
Previous work:
Learn to predict the expert's actions as a function
of states.
Usually lacks strong performance guarantees.
(E.g., Pomerleau, 1989; Sammut et al., 1992;
Kuniyoshi et al., 1994; Demiris & Hayes, 1994;
Amit & Mataric, 2002; Atkeson & Schaal, 1997)
Our approach:
Based on inverse reinforcement learning (Ng &
Russell, 2000).
Returns a policy with performance as good as the
expert's, as measured according to the expert's
unknown reward function.

4
Preliminaries

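Markov Decision Process $(S, A, T, \gamma, D, R)$
$R(s) = w^T \phi(s)$,
$\phi : S \to [0,1]^k$ is a k-dimensional feature vector.
W.l.o.g. we assume $\|w\|_2 \le 1$.
Policy $\pi : S \to A$
Utility of a policy $\pi$ for reward $R = w^T \phi$:
$U_w(\pi) = E[\sum_t \gamma^t R(s_t) \mid \pi]$.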
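
As a minimal sketch of the definitions on this slide, the following computes the linear reward and the discounted return of a single sampled state sequence; the utility is the expectation of that return over sequences generated by the policy. The weight vector, the feature path and the discount value are made-up illustrative numbers, not values from the talk.

```python
def linear_reward(w, phi_s):
    """R(s) = w^T phi(s) for the feature vector phi_s of one state."""
    return sum(wi * fi for wi, fi in zip(w, phi_s))


def discounted_return(w, feature_path, gamma):
    """sum_t gamma^t R(s_t) along one sampled sequence of state features."""
    return sum((gamma ** t) * linear_reward(w, phi_s)
               for t, phi_s in enumerate(feature_path))


# Hypothetical example with k = 2 features and three visited states.
w = [0.6, -0.8]                                  # ||w||_2 = 1
path = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]      # phi(s_0), phi(s_1), phi(s_2)
ret = discounted_return(w, path, gamma=0.5)      # 0.6 - 0.4 + 0.15
```

Averaging `discounted_return` over many sampled paths would give a Monte Carlo estimate of the utility.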

5
Algorithm

For $t = 1, 2, \ldots$
Inverse RL step:
Estimate the expert's reward function $R(s) = w^T \phi(s)$
such that under $R(s)$ the expert performs better
than all previously found policies $\{\pi_i\}$.
RL step:
Compute the optimal policy $\pi_t$ for
the estimated reward $w$.
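
The alternation of the two steps can be sketched on a toy problem. This sketch makes several assumptions that are not in the talk: a hypothetical 2-state MDP in which action a moves deterministically to state a, one-hot state features, value iteration as the RL step, and the Euclidean-projection variant of the inverse RL step (described in the additional poster slides) in place of the max-margin optimization. It returns the last policy found, which is a simplification.

```python
import math

GAMMA = 0.5
N_STATES = 2
ACTIONS = [0, 1]   # hypothetical toy MDP: action a moves to state a
START = 0


def phi(s):
    """One-hot state features, so k = 2."""
    return [1.0 if j == s else 0.0 for j in range(N_STATES)]


def step(s, a):
    return a


def feature_expectations(policy, horizon=200):
    """mu(pi) = sum_t gamma^t phi(s_t), truncated at a long horizon."""
    mu, s = [0.0] * N_STATES, START
    for t in range(horizon):
        mu = [m + (GAMMA ** t) * f for m, f in zip(mu, phi(s))]
        s = step(s, policy[s])
    return mu


def rl_step(w, sweeps=200):
    """Value iteration for R(s) = w^T phi(s); returns a greedy policy."""
    reward = [sum(wj * fj for wj, fj in zip(w, phi(s))) for s in range(N_STATES)]
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [reward[s] + GAMMA * max(v[step(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return [max(ACTIONS, key=lambda a: v[step(s, a)]) for s in range(N_STATES)]


def project(mu_bar, mu_new, mu_e):
    """Orthogonal projection of mu_e onto the line through mu_bar and mu_new."""
    d = [a - b for a, b in zip(mu_new, mu_bar)]
    e = [a - b for a, b in zip(mu_e, mu_bar)]
    denom = sum(x * x for x in d)
    if denom == 0.0:
        return list(mu_bar)
    coef = sum(x * y for x, y in zip(d, e)) / denom
    return [b + coef * x for b, x in zip(mu_bar, d)]


def apprenticeship(mu_e, policy0, eps=1e-6, max_iter=20):
    mu_bar = feature_expectations(policy0)
    policy, margins = policy0, []
    for _ in range(max_iter):
        w = [a - b for a, b in zip(mu_e, mu_bar)]        # inverse RL step
        margins.append(math.sqrt(sum(x * x for x in w)))
        if margins[-1] <= eps:
            break
        policy = rl_step(w)                              # RL step
        mu_bar = project(mu_bar, feature_expectations(policy), mu_e)
    return policy, margins


mu_expert = feature_expectations([1, 1])   # expert always moves to state 1
learned_policy, margins = apprenticeship(mu_expert, policy0=[0, 0])
```

In this toy case the first reward estimate already makes the expert's behavior optimal, so the distance to the expert's feature expectations drops to zero on the second pass.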

6
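Algorithm: IRL step

$\max_{m,\, w : \|w\|_2 \le 1} m$
s.t. $U_w(\pi_E) \ge U_w(\pi_i) + m$, $i = 1, \ldots, t-1$
$m$ = margin of the expert's performance over the
performance of previously found policies.
$U_w(\pi) = E[\sum_t \gamma^t R(s_t) \mid \pi] = E[\sum_t \gamma^t w^T \phi(s_t) \mid \pi]$
$= w^T E[\sum_t \gamma^t \phi(s_t) \mid \pi]$
$= w^T \mu(\pi)$
$\mu(\pi) = E[\sum_t \gamma^t \phi(s_t) \mid \pi]$ are the feature
expectations.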
7
Feature Expectation Closeness and Performance

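If we can find a policy $\pi$ such that
$\|\mu(\pi_E) - \mu(\pi)\|_2 \le \epsilon$,
then for any underlying reward $R(s) = w^T \phi(s)$,
we have that
$|U_w(\pi_E) - U_w(\pi)| = |w^T \mu(\pi_E) - w^T \mu(\pi)|$
$\le \|w\|_2 \, \|\mu(\pi_E) - \mu(\pi)\|_2$
$\le \epsilon$.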
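
The middle step of this chain is the Cauchy-Schwarz inequality. A quick numeric check, with made-up feature expectations and a unit-norm weight vector chosen only for illustration:

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm2(u):
    return math.sqrt(dot(u, u))


# Hypothetical feature expectations and weights (not from the talk).
mu_expert = [0.50, 0.30, 0.20]
mu_policy = [0.45, 0.35, 0.20]
w = [0.6, -0.8, 0.0]                      # ||w||_2 = 1

diff = [a - b for a, b in zip(mu_expert, mu_policy)]
gap = abs(dot(w, diff))                   # performance gap under reward w
bound = norm2(w) * norm2(diff)            # Cauchy-Schwarz upper bound
```

Here the gap is 0.07 and the bound is about 0.0707, so matching feature expectations closely limits the performance gap for every admissible weight vector at once.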

8
Algorithm
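[Figure: feature-expectation plane with axes $\mu_1$ and $\mu_2$, showing $\mu(\pi_E)$, $\mu(\pi_0)$, $\mu(\pi_1)$, $\mu(\pi_2)$ and the weight vectors $w^{(1)}$, $w^{(2)}$, $w^{(3)}$; $U_w(\pi) = w^T \mu(\pi)$.]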
9
Theoretical Results: Convergence

Theorem. Let an MDP (without reward function), a
k-dimensional feature vector $\phi$ and the expert's
feature expectations $\mu(\pi_E)$ be given. Then after
at most
$k / [(1-\gamma)\,\epsilon]^2$
iterations, the algorithm outputs a policy $\pi$
that performs nearly as well as the expert, as
evaluated on the unknown reward function
$R(s) = w^T \phi(s)$, i.e.,
$U_w(\pi) \ge U_w(\pi_E) - \epsilon$.
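
To get a feel for the scale of this bound, the sketch below evaluates it for one setting, assuming the expression reads k divided by the square of (1 - gamma) times epsilon, up to constant factors. The values of k, gamma and epsilon are illustrative only.

```python
def iteration_bound(k, gamma, eps):
    """k / ((1 - gamma) * eps)^2, the assumed reading of the slide's bound."""
    return k / (((1.0 - gamma) * eps) ** 2)


# Illustrative values: 64 features, discount 0.9, accuracy 0.1.
n_iter = iteration_bound(k=64, gamma=0.9, eps=0.1)   # about 640,000
```

The bound grows linearly in the number of features and quadratically as the discount approaches 1 or the target accuracy tightens.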

10
Theoretical Results: Sampling

In practice, we have to use sampling to estimate
the feature expectations of the expert. We still
have $\epsilon$-optimal performance with high probability
if the number of observed samples is at least
$O(\mathrm{poly}(k, 1/\epsilon))$.
Note: the bound has no dependence on the
complexity of the policy.

11
Gridworld Experiments
Reward function is piecewise constant over small
regions. Features $\phi$ for IRL are these small
regions.
128x128 grid, small regions of size 16x16.
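
One way such region features could be encoded is as an indicator vector with one entry per 16x16 region, which gives 64 features on a 128x128 grid. The row-major indexing and the function name below are assumptions made for illustration.

```python
GRID = 128
REGION = 16
REGIONS_PER_SIDE = GRID // REGION        # 8 regions per side, 64 in total


def region_features(row, col):
    """Indicator vector over the 16x16 regions (assumed row-major order)."""
    phi = [0.0] * (REGIONS_PER_SIDE ** 2)
    idx = (row // REGION) * REGIONS_PER_SIDE + (col // REGION)
    phi[idx] = 1.0
    return phi


phi_corner = region_features(0, 0)       # first region is active
phi_far = region_features(127, 127)      # last region is active
```

With this encoding, a reward that is constant on each region is exactly linear in the features, with one weight per region.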
12-15
Gridworld Experiments
[Figures: gridworld experiment results.]
16
Case study: Highway driving
Input: Driving demonstration (left panel)
Output: Learned behavior (right panel)
The only input to the learning algorithm was the
driving demonstration (left panel). No reward
function was provided.
17
More driving examples
In each video, the left sub-panel shows a
demonstration of a different driving style, and
the right sub-panel shows the behavior learned
from watching the demonstration.
18
Car driving results
Style  Quantity         Collision  Left Shoulder  Left Lane  Middle Lane  Right Lane  Right Shoulder
1      $\mu$ (expert)   0          0              0.13       0.20         0.60        0.07
1      $\mu$ (learned)  0          0              0.09       0.23         0.60        0.08
1      $w$ (learned)    -0.08      -0.04          0.01       0.01         0.03        -0.01
2      $\mu$ (expert)   0.12       0              0.06       0.47         0.47        0
2      $\mu$ (learned)  0.13       0              0.10       0.32         0.58        0
2      $w$ (learned)    0.23       -0.11          0.01       0.05         0.06        -0.01
3      $\mu$ (expert)   0          0              0          0.01         0.70        0.29
3      $\mu$ (learned)  0          0              0          0            0.74        0.26
3      $w$ (learned)    -0.11      -0.01          -0.06      -0.04        0.09        0.01
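
As a rough check of how closely the learned behavior matches each demonstration, the sketch below computes the Euclidean distance between the expert and learned rows for each driving style. It treats the rounded table entries as the feature vectors, which is an assumption; the talk does not report these distances.

```python
import math

# Rounded values copied from the table, in column order:
# Collision, Left Shoulder, Left Lane, Middle Lane, Right Lane, Right Shoulder
styles = {
    1: ([0, 0, 0.13, 0.20, 0.60, 0.07], [0, 0, 0.09, 0.23, 0.60, 0.08]),
    2: ([0.12, 0, 0.06, 0.47, 0.47, 0], [0.13, 0, 0.10, 0.32, 0.58, 0]),
    3: ([0, 0, 0, 0.01, 0.70, 0.29], [0, 0, 0, 0, 0.74, 0.26]),
}


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


distances = {style: l2_distance(expert, learned)
             for style, (expert, learned) in styles.items()}
```

On these rounded values, styles 1 and 3 come out at about 0.05 and style 2 at about 0.19.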
19
Conclusions

Our algorithm returns a policy with performance
as good as the expert's, as evaluated according to
the expert's unknown reward function.
The algorithm is guaranteed to converge in
$\mathrm{poly}(k, 1/\epsilon)$ iterations.
Sample complexity: $\mathrm{poly}(k, 1/\epsilon)$.
The algorithm exploits reward simplicity (vs.
policy simplicity in previous approaches).
Poster: dual formulation; cheaper inverse RL
step without the optimization.

20
Additional slides for poster

(The slides to come are additional material not
included in the talk, in particular: the projection
(vs. QP) version of the Inverse RL step; another
formulation of the apprenticeship learning
problem, and its relation to our algorithm.)

21
Simplification of Inverse RL step: QP → Euclidean
projection

In the Inverse RL step:
set $\bar{\mu}^{(i-1)}$ = orthogonal projection of $\mu_E$ onto the
line through $\bar{\mu}^{(i-2)}$ and $\mu(\pi^{(i-1)})$
set $w^{(i)} = \mu_E - \bar{\mu}^{(i-1)}$
Note: the theoretical results on convergence and
sample complexity hold unchanged for the simpler
algorithm.
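
The projection step has a closed form, sketched below under the assumption that the line runs through the previous projected point and the feature expectations of the newest policy. The function name and the 2-dimensional numbers are illustrative only.

```python
def project_onto_line(mu_bar_prev, mu_new, mu_expert):
    """Orthogonal projection of mu_expert onto the line through
    mu_bar_prev and mu_new."""
    d = [a - b for a, b in zip(mu_new, mu_bar_prev)]      # line direction
    e = [a - b for a, b in zip(mu_expert, mu_bar_prev)]   # offset to expert
    denom = sum(x * x for x in d)
    if denom == 0.0:                                      # degenerate line
        return list(mu_bar_prev)
    coef = sum(x * y for x, y in zip(d, e)) / denom
    return [b + coef * x for b, x in zip(mu_bar_prev, d)]


# Hypothetical 2-feature example.
mu_bar_prev = [0.0, 0.0]
mu_new = [2.0, 0.0]
mu_expert = [1.0, 1.0]

mu_bar = project_onto_line(mu_bar_prev, mu_new, mu_expert)   # [1.0, 0.0]
w_next = [a - b for a, b in zip(mu_expert, mu_bar)]          # [0.0, 1.0]
```

The new weight vector is perpendicular to the line, so no optimization solver is needed for this step.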

22
Algorithm (projection version)
[Figure: axes $\mu_1$ and $\mu_2$; points $\mu_E$, $\mu(\pi_0)$, $\mu(\pi_1)$; weight vector $w^{(1)}$.]
23
Algorithm (projection version)
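[Figure: axes $\mu_1$ and $\mu_2$; points $\mu_E$, $\mu(\pi_0)$, $\mu(\pi_1)$, $\mu(\pi_2)$; projection point $\bar{\mu}^{(1)}$; weight vectors $w^{(1)}$, $w^{(2)}$.]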
24
Algorithm (projection version)
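[Figure: axes $\mu_1$ and $\mu_2$; points $\mu_E$, $\mu(\pi_0)$, $\mu(\pi_1)$, $\mu(\pi_2)$; projection points $\bar{\mu}^{(1)}$, $\bar{\mu}^{(2)}$; weight vectors $w^{(1)}$, $w^{(2)}$, $w^{(3)}$.]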
25
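Appendix: Different View

Bellman LP for solving MDPs:
$\min_V c^T V$ s.t.
$\forall s, a:\ V(s) \ge R(s,a) + \gamma \sum_{s'} P(s,a,s') V(s')$
Dual LP:
$\max_{\lambda \ge 0} \sum_{s,a} \lambda(s,a) R(s,a)$ s.t.
$\forall s:\ c(s) - \sum_a \lambda(s,a) + \gamma \sum_{s',a} P(s',a,s) \lambda(s',a) = 0$
Apprenticeship Learning as QP:
$\min_{\lambda \ge 0} \sum_i (\mu_{E,i} - \sum_{s,a} \lambda(s,a) \phi_i(s))^2$ s.t.
$\forall s:\ c(s) - \sum_a \lambda(s,a) + \gamma \sum_{s',a} P(s',a,s) \lambda(s',a) = 0$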
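
The equality constraint shared by the dual LP and the QP can be checked numerically. The sketch below reads the dual variable (called `lam` here) as the discounted state-action visitation frequencies of a fixed policy and verifies that they satisfy the constraint. The 2-state MDP, its transition probabilities and the policy are made-up numbers, and the infinite sum is truncated at a long horizon.

```python
GAMMA = 0.5
STATES = [0, 1]
ACTIONS = [0, 1]
# Hypothetical transition model: P[s][a][s2] = Pr(next = s2 | s, a).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
c = [1.0, 0.0]          # initial state distribution
policy = [1, 0]         # deterministic action taken in each state


def occupancy(horizon=200):
    """lam[s][a] = sum_t gamma^t Pr(s_t = s, a_t = a) under the policy."""
    lam = [[0.0 for _ in ACTIONS] for _ in STATES]
    dist = list(c)
    for t in range(horizon):
        for s in STATES:
            lam[s][policy[s]] += (GAMMA ** t) * dist[s]
        dist = [sum(dist[s] * P[s][policy[s]][s2] for s in STATES)
                for s2 in STATES]
    return lam


lam = occupancy()


def flow_residual(s):
    """c(s) - sum_a lam(s,a) + gamma * sum_{s2,a} P(s2,a,s) lam(s2,a)."""
    inflow = sum(P[s2][a][s] * lam[s2][a] for s2 in STATES for a in ACTIONS)
    return c[s] - sum(lam[s]) + GAMMA * inflow


residuals = [flow_residual(s) for s in STATES]
total_mass = sum(sum(row) for row in lam)      # equals 1 / (1 - gamma)
```

Under this reading, the feature expectations of the policy are the visitation frequencies weighted by the state features, which is the quantity the QP objective compares with the expert's.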

26
Different View (ctd.)

Our algorithm is equivalent to iterating two steps:
linearize the QP at the current point (Inverse RL step),
then solve the resulting LP (RL step).
Why not solve the QP directly? This is typically only
possible for very small toy problems (curse of
dimensionality). Our algorithm makes use of
existing RL solvers to deal with the curse of
dimensionality.

27
Slides that are different for poster

(slides to come are slightly different for
poster, but already appeared earlier)

28
Algorithm (QP version)
[Figure: axes $\mu_1$ and $\mu_2$; points $\mu(\pi_E)$, $\mu(\pi_0)$, $\mu(\pi_1)$; weight vector $w^{(1)}$; $U_w(\pi) = w^T \mu(\pi)$.]
29
Algorithm (QP version)
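[Figure: axes $\mu_1$ and $\mu_2$; points $\mu(\pi_E)$, $\mu(\pi_0)$, $\mu(\pi_1)$, $\mu(\pi_2)$; weight vectors $w^{(1)}$, $w^{(2)}$; $U_w(\pi) = w^T \mu(\pi)$.]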
30
Algorithm (QP version)
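[Figure: axes $\mu_1$ and $\mu_2$; points $\mu(\pi_E)$, $\mu(\pi_0)$, $\mu(\pi_1)$, $\mu(\pi_2)$; weight vectors $w^{(1)}$, $w^{(2)}$, $w^{(3)}$; $U_w(\pi) = w^T \mu(\pi)$.]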
31
Gridworld Experiments
32
Case study: Highway driving
Input: Driving demonstration
Output: Learned behavior
(Videos available.)
33
More driving examples
(Videos available.)
34
Car driving results (more detail)
Style  Quantity                Collision  Offroad Left  Left Lane  Middle Lane  Right Lane  Offroad Right
1      Feature Distr. Expert   0          0             0.1325     0.2033       0.5983      0.0658
1      Feature Distr. Learned  5.00E-05   0.0004        0.0904     0.2286       0.604       0.0764
1      Weights Learned         -0.0767    -0.0439       0.0077     0.0078       0.0318      -0.0035
2      Feature Distr. Expert   0.1167     0             0.0633     0.4667       0.47        0
2      Feature Distr. Learned  0.1332     0             0.1045     0.3196       0.5759      0
2      Weights Learned         0.234      -0.1098       0.0092     0.0487       0.0576      -0.0056
3      Feature Distr. Expert   0          0             0          0.0033       0.7058      0.2908
3      Feature Distr. Learned  0          0             0          0            0.7447      0.2554
3      Weights Learned         -0.1056    -0.0051       -0.0573    -0.0386      0.0929      0.0081
4      Feature Distr. Expert   0.06       0             0          0.0033       0.2908      0.7058
4      Feature Distr. Learned  0.0569     0             0          0            0.2666      0.7334
4      Weights Learned         0.1079     -0.0001       -0.0487    -0.0666      0.059       0.0564
5      Feature Distr. Expert   0.06       0             0          1            0           0
5      Feature Distr. Learned  0.0542     0             0          1            0           0
5      Weights Learned         0.0094     -0.0108       -0.2765    0.8126       -0.51       -0.0153
35
Apprenticeship Learning via Inverse
Reinforcement Learning

Pieter Abbeel and Andrew Y. Ng
Stanford University
