Title: Space-Indexed Dynamic Programming: Learning to Follow Trajectories
Slide 1: Space-Indexed Dynamic Programming: Learning to Follow Trajectories
J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway
Computer Science Department, Stanford University
ICML, July 2008
Slide 2: Outline
- Reinforcement Learning and Following Trajectories
- Space-Indexed Dynamical Systems and Space-Indexed Dynamic Programming
- Experimental Results
Slide 3: Reinforcement Learning and Following Trajectories
Slide 4: Trajectory Following
- Consider the task of following a trajectory in a vehicle such as a car or helicopter
- The state space is too large to discretize, so we can't apply tabular RL / dynamic programming
Slide 5: Trajectory Following
- Dynamic programming algorithms with non-stationary policies seem well-suited to this task
- Examples: Policy Search by Dynamic Programming (Bagnell et al.), Differential Dynamic Programming (Jacobson and Mayne)
Slides 6-8: Dynamic Programming
[Figure: trajectory divided into discrete time steps t1, ..., t5]
Divide the control task into discrete time steps.
Slides 9-12: Dynamic Programming
[Figure: policies learned backwards along the trajectory, t1, ..., t5]
Proceeding backwards in time, learn policies for t = T, T-1, ..., 2, 1.
Slide 13: Dynamic Programming
[Figure: local policies at each time step, t1, ..., t5]
Key advantage: policies are local (they only need to perform well over a small portion of the state space).
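A minimal sketch of this backward pass in Python, in the PSDP style; `sample_states` and `fit_policy` are hypothetical helpers, not functions from the paper:

```python
# Sketch of a time-indexed backward DP pass (PSDP-style).
# sample_states(t) draws states the vehicle may occupy at step t;
# fit_policy(states, later) trains a local policy that performs well
# from those states given the already-learned later policies.

def time_indexed_dp(T, sample_states, fit_policy):
    policies = [None] * T
    # Proceed backwards in time: t = T-1, ..., 1, 0 (0-indexed here).
    for t in reversed(range(T)):
        states = sample_states(t)  # a small, local region of state space
        policies[t] = fit_policy(states, policies[t + 1:])
    return policies
```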
Slide 14: Problems with Dynamic Programming
Problem 1: Policies from traditional dynamic programming algorithms are time-indexed.
Slide 15: Problems with Dynamic Programming
Suppose we learned the policy assuming this distribution over states.
Slide 16: Problems with Dynamic Programming
But, due to the natural stochasticity of the environment, the car is actually here at t = 5.
Slide 17: Problems with Dynamic Programming
The resulting policy will perform very poorly.
Slide 18: Problems with Dynamic Programming
Partial solution (re-indexing): execute the policy closest to the current location, regardless of time.
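A minimal sketch of this re-indexing heuristic, assuming each learned policy is paired with a nominal trajectory point `traj[t]` (names are illustrative):

```python
import numpy as np

def reindexed_policy(position, traj, policies):
    """Pick the policy whose nominal trajectory point is closest to the
    vehicle's current position, regardless of the current time step."""
    # traj: (T, dim) array of nominal positions, one per time step.
    dists = np.linalg.norm(traj - position, axis=1)
    return policies[int(np.argmin(dists))]
```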
Slide 19: Problems with Dynamic Programming
Problem 2: Uncertainty over future states makes it hard to learn any good policy.
Slide 20: Problems with Dynamic Programming
[Figure: distribution over states at time t = 5]
Due to stochasticity, there is large uncertainty over states in the distant future.
Slide 21: Problems with Dynamic Programming
[Figure: distribution over states at time t = 5]
DP algorithms require learning a policy that performs well over the entire distribution.
Slide 22: Space-Indexed Dynamic Programming
- Basic idea of Space-Indexed Dynamic Programming (SIDP): perform DP with respect to space indices (planes tangent to the trajectory)
Slide 23: Space-Indexed Dynamical Systems and Dynamic Programming
Slide 24: Difficulty with SIDP
- There is no guarantee that taking a single action will move the vehicle to the next plane along the trajectory
- We therefore introduce the notion of a space-indexed dynamical system
Slides 25-29: Time-Indexed Dynamical System
- Creating time-indexed dynamical systems: given the current state $s_t$ and control action $u_t$, the continuous dynamics give the time derivative of the state, $\dot{s} = f(s_t, u_t)$
- Euler integration then yields the discrete-time system $s_{t+1} = s_t + \Delta t \, f(s_t, u_t)$
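In code, this step is a one-liner; a sketch with a generic dynamics function `f` (the concrete model is not specified here):

```python
import numpy as np

def time_indexed_step(s, u, f, dt):
    """One Euler step of the time-indexed system: s_{t+1} = s_t + dt * f(s_t, u_t)."""
    return s + dt * np.asarray(f(s, u))
```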
Slide 30: Space-Indexed Dynamical Systems
- Creating space-indexed dynamical systems: simulate forward until the vehicle hits the next tangent plane
[Figure: vehicle crossing from space index d to space index d+1]
Slides 31-32: Space-Indexed Dynamical Systems
- Creating space-indexed dynamical systems: solve for the time at which the Euler step crosses the next tangent plane; with plane normal $n$, a point $z$ on the plane, and vehicle position $p$, the crossing time is $\Delta t = n^\top (z - p) \,/\, n^\top \dot{p}$
- (A positive solution exists as long as the controller makes some forward progress, i.e. $n^\top \dot{p} > 0$)
Slide 33: Space-Indexed Dynamical Systems
- The result is a dynamical system indexed by a spatial-index variable d rather than by time
- Space-indexed dynamic programming runs DP directly on this system
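A minimal sketch of one space-indexed transition under these definitions; `pos` extracts the position components of the state, and all names are illustrative rather than the paper's notation:

```python
import numpy as np

def space_indexed_step(s, u, f, n, z, pos):
    """Advance the state until it crosses the next tangent plane.

    s    : current state vector
    u    : control action
    f    : dynamics function, sdot = f(s, u)
    n, z : unit normal and a point defining the next plane
    pos  : function extracting the position components of a state vector
    """
    sdot = np.asarray(f(s, u))
    progress = n @ pos(sdot)            # forward velocity toward the plane
    assert progress > 0, "controller must make some forward progress"
    dt = n @ (z - pos(s)) / progress    # Euler crossing time for the plane
    return s + dt * sdot                # lands on the plane under the Euler model
```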
Slides 34-36: Space-Indexed Dynamic Programming
[Figure: trajectory divided into discrete space planes d1, ..., d5]
Divide the trajectory into discrete space planes.
Slides 37-40: Space-Indexed Dynamic Programming
[Figure: policies learned backwards over the space planes, d1, ..., d5]
Proceeding backwards, learn policies for d = D, D-1, ..., 2, 1.
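The backward pass has the same shape as the time-indexed sketch earlier, only indexed by planes and driven by the space-indexed dynamics (helpers again hypothetical):

```python
def space_indexed_dp(D, sample_states_on_plane, fit_policy):
    # sample_states_on_plane(d) draws states lying on plane d;
    # fit_policy trains a local policy given the already-learned later policies.
    policies = [None] * D
    for d in reversed(range(D)):  # d = D-1, ..., 1, 0 (0-indexed here)
        states = sample_states_on_plane(d)
        policies[d] = fit_policy(states, policies[d + 1:])
    return policies
```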
Slide 41: Problems with Dynamic Programming
Problem 1: Policies from traditional dynamic programming algorithms are time-indexed.
Slide 42: Space-Indexed Dynamic Programming
- Space-indexed DP always executes the policy for the current spatial index
- Time-indexed DP can execute a policy learned for a different location
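At execution time the spatial index advances with the vehicle by construction; a sketch reusing `space_indexed_step` from the earlier snippet (`planes` is an assumed list of (normal, point) pairs):

```python
def follow_trajectory(s0, policies, planes, f, pos):
    """Execute space-indexed policies: at plane d, apply policy d's action
    and integrate until the vehicle crosses plane d+1."""
    s = s0
    for d, (n, z) in enumerate(planes):
        u = policies[d](s)  # policy for the current spatial index
        s = space_indexed_step(s, u, f, n, z, pos)
    return s
```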
Slide 43: Problems with Dynamic Programming
Problem 2: Uncertainty over future states makes it hard to learn any good policy.
Slides 44-45: Space-Indexed Dynamic Programming
[Figure: distribution over states at time t = 5 vs. distribution over states at index d = 5]
- Time-indexed DP: wide distribution over future states
- Space-indexed DP: much tighter distribution over future states
Slide 46: Experiments
Slide 47: Experimental Domain
- Task: following a race-track trajectory in an RC car, with randomly placed obstacles
Slide 48: Experimental Setup
- Implemented a space-indexed version of the PSDP algorithm
- The policy chooses a steering angle using an SVM classifier (constant velocity); a sketch of this policy class follows below
- Used a simple textbook model simulator of the car dynamics to learn the policy
- Evaluated time-indexed PSDP, time-indexed PSDP with re-indexing, and space-indexed PSDP
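A hedged sketch of this kind of policy class using scikit-learn; the features, labels, and steering discretization below are assumptions for illustration, not the paper's exact setup:

```python
import numpy as np
from sklearn.svm import SVC

# Assumed discretization of steering angles (radians); hypothetical values.
STEER_ANGLES = np.linspace(-0.5, 0.5, 5)

def fit_steering_policy(features, best_angle_idx):
    """Train an SVM classifier mapping state features at one space index
    to the index of the best discretized steering angle."""
    clf = SVC(kernel="rbf").fit(features, best_angle_idx)
    return lambda x: STEER_ANGLES[int(clf.predict(np.atleast_2d(x))[0])]
```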
Slide 49: Time-Indexed PSDP
Slide 50: Time-Indexed PSDP with Re-indexing
Slide 51: Space-Indexed PSDP
Slide 52: Empirical Evaluation

| Method | Cost |
| --- | --- |
| Time-indexed PSDP | Infinite (no trajectory succeeds) |
| Time-indexed PSDP with re-indexing | 59.74 |
| Space-indexed PSDP | 49.32 |
Slide 53: Additional Experiments
- The paper includes additional experiments on the Stanford Grand Challenge car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP
Slide 54: Related Work
- Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005
- Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008
- Gain Scheduling, Model Predictive Control: Leith and Leithead, 2000; Garcia et al., 1989
Slide 55: Summary
- Trajectory following calls for non-stationary policies, but traditional DP / RL algorithms suffer because they are time-indexed
- In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming
- We demonstrated the usefulness of these methods on real-world control tasks
Slide 56: Thank You!
- Videos available online at http://cs.stanford.edu/~kolter/icml08videos