Title: Apprenticeship Learning via Inverse Reinforcement Learning
1. Apprenticeship Learning via Inverse Reinforcement Learning
- Pieter Abbeel
- Andrew Y. Ng
- Stanford University
2. Motivation
- Typical RL setting
  - Given: system model, reward function
  - Return: a policy that is optimal with respect to the given model and reward function
- The reward function might be hard to specify exactly
  - E.g. driving well on a highway requires trading off distance, speed, and lane preference
3. Apprenticeship Learning
- The task of learning from observing an expert/teacher
- Previous work
  - Mostly tries to mimic the teacher by learning the mapping from states to actions directly
  - Lacks strong performance guarantees
- Our approach
  - Returns a policy with performance as good as the expert's, as measured according to the expert's unknown reward function
  - Reduces the problem to solving the control problem with a given reward
  - Algorithm inspired by Inverse Reinforcement Learning (Ng and Russell, 2000)
4. Preliminaries
- Markov Decision Process (S, A, T, γ, D, R)
  - R(s) = w^T φ(s)
  - φ: S → [0,1]^k, a k-dimensional feature vector
- Value of a policy π
  - U_w(π) = E[∑_t γ^t R(s_t)] = E[∑_t γ^t w^T φ(s_t)]
  - = w^T E[∑_t γ^t φ(s_t)]
- Feature distribution μ(π)
  - μ(π) = E[∑_t γ^t φ(s_t)] ∈ (1/(1-γ)) · [0,1]^k
  - U_w(π) = w^T μ(π)
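As a concrete illustration of the feature distribution μ(π), here is a minimal Monte Carlo sketch (not from the slides) that estimates μ(π) = E[∑_t γ^t φ(s_t)] from sampled trajectories; the names `feature_expectations`, `phi`, and `gamma` are illustrative.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t)].

    trajectories: list of state sequences obtained by executing the policy
    phi:          maps a state to a k-dimensional feature vector in [0, 1]^k
    gamma:        discount factor
    """
    mu = None
    for states in trajectories:
        discounts = gamma ** np.arange(len(states))   # gamma^t for t = 0..T-1
        feats = np.array([phi(s) for s in states])    # shape (T, k)
        total = discounts @ feats                     # sum_t gamma^t phi(s_t)
        mu = total if mu is None else mu + total
    return mu / len(trajectories)

# U_w(pi) = w^T mu(pi) then follows directly, e.g.:
# value = w @ feature_expectations(trajectories, phi)
```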
5. Algorithm
- For t = 1, 2, ...
  - IRL step: estimate the expert's reward function R(s) = w^T φ(s) by solving the following QP:
    - max_{z,w} z  s.t.
    - U_w(π_E) - U_w(π_j) ≥ z  for j = 0, ..., t-1  (linear constraints in w)
    - ‖w‖₂ ≤ 1
  - RL step: compute an optimal policy π_t for this reward w.
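A sketch of the IRL (max-margin) step using cvxpy; this is not the authors' code, and `mu_expert` / `mu_list` are assumed to be NumPy arrays holding the feature distributions of the expert and of the policies π_0, ..., π_{t-1} computed so far.

```python
import cvxpy as cp

def irl_step(mu_expert, mu_list):
    """Max-margin IRL step: choose w with ||w||_2 <= 1 maximizing the smallest
    margin z = w^T (mu_E - mu(pi_j)) over all previously computed policies."""
    k = len(mu_expert)
    w = cp.Variable(k)
    z = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ (mu_expert - mu_j) >= z for mu_j in mu_list]
    cp.Problem(cp.Maximize(z), constraints).solve()
    return w.value, z.value
```

The optimal z is the margin by which the expert beats every π_j under the most adversarial unit-norm weights, so it also serves as a natural stopping criterion.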
6. Algorithm (2)
[Figure: geometric illustration of the algorithm, showing the feature distributions μ(π_0), μ(π_1), μ(π_2) of the computed policies approaching the expert's μ_E, with the weight vectors w^(1), w^(2) produced by successive IRL steps.]
7. Feature Distribution Closeness and Performance
- If we can find a policy π such that
  - ‖μ(π) - μ_E‖₂ ≤ ε
- then, for any underlying reward R(s) = w^T φ(s) (with ‖w‖₂ ≤ 1), we have
  - |U_w(π) - U_w(π_E)| = |w^T μ(π) - w^T μ_E|
  - ≤ ‖w‖₂ ‖μ(π) - μ_E‖₂ ≤ ε  (Cauchy-Schwarz)
8. Theoretical Results: Convergence
- Let an MDP\R and a k-dimensional feature vector φ be given. Then after at most
  - O(poly(k, 1/ε))
- iterations, the algorithm outputs a policy π that performs nearly as well as the teacher, as evaluated on the unknown reward function R(s) = w^T φ(s):
  - U_w(π) ≥ U_w(π_E) - ε.
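Putting the two steps together, a hedged sketch of the outer loop; `solve_mdp` (the RL step for reward R(s) = w^T φ(s)) and `estimate_mu` (the feature distribution of a policy, e.g. the Monte Carlo estimate sketched earlier) are assumed helpers, and the ε-based stopping test mirrors the guarantee above.

```python
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, estimate_mu, eps=0.01, max_iter=100):
    """Alternate IRL and RL steps until the max-margin z drops below eps.

    solve_mdp(w):    optimal policy for reward R(s) = w . phi(s)   (assumed helper)
    estimate_mu(pi): feature distribution mu(pi) of policy pi      (assumed helper)
    """
    pi = solve_mdp(np.random.randn(len(mu_expert)))   # arbitrary initial policy pi_0
    mus, w = [], None
    for _ in range(max_iter):
        mus.append(estimate_mu(pi))
        w, z = irl_step(mu_expert, mus)               # IRL step (see the sketch above)
        if z <= eps:                                  # every ||w||_2 <= 1 reward matched to within eps
            break
        pi = solve_mdp(w)                             # RL step
    # The paper selects the final policy among (or as a mixture of) the computed
    # pi_j; returning the last one keeps this sketch short.
    return pi, w
```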
9. Theoretical Results: Sampling
- In practice, we have to use sample estimates of the expert's feature distribution. We still get ε-optimal performance with high probability when the number of samples is
  - O(poly(k, 1/ε))
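For the sampling result, a small sketch of how μ_E could be estimated from the observed expert trajectories, truncated at a horizon chosen so that the discarded tail (of order γ^H/(1-γ)) is negligible; it reuses the `feature_expectations` helper sketched earlier.

```python
def estimate_mu_expert(expert_trajectories, phi, gamma=0.9, horizon=None):
    """Sample-based estimate of the expert's feature distribution mu_E.

    Averaging over the m observed trajectories is accurate enough for
    eps-optimal performance with high probability once m = O(poly(k, 1/eps)),
    as quoted on the slide.
    """
    truncated = [traj[:horizon] if horizon else traj for traj in expert_trajectories]
    return feature_expectations(truncated, phi, gamma)
```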
10. Experiments: Gridworld (ctd.)
- 128x128 gridworld, 4 actions (the 4 compass directions), 70% success probability (otherwise a random move among the other neighbouring squares)
- Non-overlapping regions of 16x16 cells are the features; a small number of regions have non-zero (positive) rewards
- The expert is optimal w.r.t. some weights w
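A sketch of how the gridworld's region-indicator features and a sparse positive weight vector could be encoded; only the 128x128 grid, the 16x16 regions, and the sparsity of the rewards come from the slide, while the helper names and the choice of 4 rewarded regions are illustrative.

```python
import numpy as np

GRID, REGION = 128, 16                    # 128x128 grid, non-overlapping 16x16 regions
K = (GRID // REGION) ** 2                 # 64 indicator features

def phi_gridworld(state):
    """Indicator feature: which 16x16 region the cell (x, y) falls in."""
    x, y = state
    region = (x // REGION) * (GRID // REGION) + (y // REGION)
    f = np.zeros(K)
    f[region] = 1.0
    return f

# True reward R(s) = w . phi(s) with only a few regions carrying positive weight
# (the slide says "a small number"; 4 is an arbitrary illustrative choice).
rng = np.random.default_rng(0)
w_true = np.zeros(K)
w_true[rng.choice(K, size=4, replace=False)] = rng.random(4)
```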
11. Experiments: Car Driving
12. Car Driving Results

Style | Quantity               | Collision | Offroad Left | Left Lane | Middle Lane | Right Lane | Offroad Right
1     | Feature Distr. Expert  | 0         | 0            | 0.1325    | 0.2033      | 0.5983     | 0.0658
1     | Feature Distr. Learned | 5.00E-05  | 0.0004       | 0.0904    | 0.2286      | 0.604      | 0.0764
1     | Weights Learned        | -0.0767   | -0.0439      | 0.0077    | 0.0078      | 0.0318     | -0.0035
2     | Feature Distr. Expert  | 0.1167    | 0            | 0.0633    | 0.4667      | 0.47       | 0
2     | Feature Distr. Learned | 0.1332    | 0            | 0.1045    | 0.3196      | 0.5759     | 0
2     | Weights Learned        | 0.234     | -0.1098      | 0.0092    | 0.0487      | 0.0576     | -0.0056
3     | Feature Distr. Expert  | 0         | 0            | 0         | 0.0033      | 0.7058     | 0.2908
3     | Feature Distr. Learned | 0         | 0            | 0         | 0           | 0.7447     | 0.2554
3     | Weights Learned        | -0.1056   | -0.0051      | -0.0573   | -0.0386     | 0.0929     | 0.0081
4     | Feature Distr. Expert  | 0.06      | 0            | 0         | 0.0033      | 0.2908     | 0.7058
4     | Feature Distr. Learned | 0.0569    | 0            | 0         | 0           | 0.2666     | 0.7334
4     | Weights Learned        | 0.1079    | -0.0001      | -0.0487   | -0.0666     | 0.059      | 0.0564
5     | Feature Distr. Expert  | 0.06      | 0            | 0         | 1           | 0          | 0
5     | Feature Distr. Learned | 0.0542    | 0            | 0         | 1           | 0          | 0
5     | Weights Learned        | 0.0094    | -0.0108      | -0.2765   | 0.8126      | -0.51      | -0.0153
13. Conclusion
- Our algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function
- Reduced the problem to solving the control problem with a given reward
- Algorithm guaranteed to converge in poly(k, 1/ε) iterations
- Sample complexity poly(k, 1/ε)
14. Appendix: Different View
- Bellman LP for solving MDPs
  - min_V  c^T V  s.t.
    - ∀ s,a:  V(s) ≥ R(s,a) + γ ∑_{s'} P(s,a,s') V(s')
- Dual LP
  - max_λ  ∑_{s,a} λ(s,a) R(s,a)  s.t.
    - ∀ s:  c(s) - ∑_a λ(s,a) + γ ∑_{s',a} P(s',a,s) λ(s',a) = 0,  λ(s,a) ≥ 0
- Apprenticeship learning as a QP
  - min_λ  ∑_i (μ_{E,i} - ∑_{s,a} λ(s,a) φ_i(s))²  s.t.
    - ∀ s:  c(s) - ∑_a λ(s,a) + γ ∑_{s',a} P(s',a,s) λ(s',a) = 0,  λ(s,a) ≥ 0
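A cvxpy sketch (not the authors' code) of the appendix QP over occupancy measures λ(s,a), with the dual-LP Bellman-flow constraints written in matrix form; `P`, `c`, `Phi`, and `mu_expert` are assumed to be small, dense NumPy arrays.

```python
import cvxpy as cp
import numpy as np

def apprenticeship_qp(P, c, Phi, mu_expert, gamma=0.9):
    """min_lambda sum_i (mu_E,i - sum_{s,a} lambda(s,a) phi_i(s))^2
    subject to the dual-LP (Bellman-flow) constraints and lambda >= 0.

    P:         transitions, shape (S, A, S), with P[s, a, s2] = P(s2 | s, a)
    c:         initial-state distribution, shape (S,)
    Phi:       state features, shape (S, k)
    mu_expert: expert feature distribution, shape (k,)
    """
    S, A, _ = P.shape
    lam = cp.Variable(S * A, nonneg=True)           # lambda(s, a), flattened as index s*A + a

    out_flow = np.kron(np.eye(S), np.ones((1, A)))  # row s picks out sum_a lambda(s, a)
    in_flow = gamma * P.reshape(S * A, S).T         # row s gives gamma * sum_{s',a} P(s',a,s) lambda(s',a)
    flow = [out_flow @ lam == c + in_flow @ lam]

    phi_sa = np.repeat(Phi, A, axis=0)              # phi_i(s) attached to every action at s
    objective = cp.Minimize(cp.sum_squares(mu_expert - phi_sa.T @ lam))
    cp.Problem(objective, flow).solve()
    return lam.value.reshape(S, A)
```

As the next slide notes, forming the problem explicitly like this only scales to very small state spaces.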
15. Different View (ctd.)
- Our algorithm is equivalent to iteratively
  - linearizing the QP at the current point (IRL step)
  - solving the resulting LP (RL step)
- Why not solve the QP directly? That is typically only possible for very small toy problems (curse of dimensionality).