Title: Space-Indexed Dynamic Programming: Learning to Follow Trajectories
Slide 1: Space-Indexed Dynamic Programming: Learning to Follow Trajectories
J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway
Computer Science Department, Stanford University
ICML, July 2008
Slide 2: Outline
- Reinforcement Learning and Following Trajectories
- Space-Indexed Dynamical Systems and Space-Indexed Dynamic Programming
- Experimental Results
Slide 3: Reinforcement Learning and Following Trajectories
Slide 4: Trajectory Following
- Consider the task of following a trajectory in a vehicle such as a car or helicopter
- The state space is too large to discretize, so we can't apply tabular RL / dynamic programming
Slide 5: Trajectory Following
- Dynamic programming algorithms with non-stationary policies seem well-suited to this task
- Examples: Policy Search by Dynamic Programming (Bagnell et al.), Differential Dynamic Programming (Jacobson and Mayne)
Slides 6-8: Dynamic Programming
[Figure: trajectory divided into discrete time steps t1, ..., t5]
Divide the control task into discrete time steps.
Slides 9-12: Dynamic Programming
[Figure: policies learned backwards along the trajectory, t1, ..., t5]
Proceeding backwards in time, learn policies for t = T, T-1, ..., 2, 1.
Slide 13: Dynamic Programming
[Figure: local policies at each time step, t1, ..., t5]
Key advantage: policies are local (they only need to perform well over a small portion of the state space).
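A minimal sketch of this backward pass in Python, in the PSDP style; `sample_states` and `fit_policy` are hypothetical helpers, not functions from the paper:

```python
# Sketch of a time-indexed backward DP pass (PSDP-style).
# sample_states(t) draws states the vehicle may occupy at step t;
# fit_policy(states, later) trains a local policy that performs well
# from those states given the already-learned later policies.

def time_indexed_dp(T, sample_states, fit_policy):
    policies = [None] * T
    # Proceed backwards in time: t = T-1, ..., 1, 0 (0-indexed here).
    for t in reversed(range(T)):
        states = sample_states(t)  # a small, local region of state space
        policies[t] = fit_policy(states, policies[t + 1:])
    return policies
```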
Slide 14: Problems with Dynamic Programming
Problem 1: Policies from traditional dynamic programming algorithms are time-indexed.
Slide 15: Problems with Dynamic Programming
Suppose we learned the policy assuming this distribution over states.
Slide 16: Problems with Dynamic Programming
But, due to the natural stochasticity of the environment, the car is actually here at t = 5.
Slide 17: Problems with Dynamic Programming
The resulting policy will perform very poorly.
Slide 18: Problems with Dynamic Programming
Partial solution (re-indexing): execute the policy closest to the current location, regardless of time.
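A minimal sketch of this re-indexing heuristic, assuming each learned policy is paired with a nominal trajectory point `traj[t]` (names are illustrative):

```python
import numpy as np

def reindexed_policy(position, traj, policies):
    """Pick the policy whose nominal trajectory point is closest to the
    vehicle's current position, regardless of the current time step."""
    # traj: (T, dim) array of nominal positions, one per time step.
    dists = np.linalg.norm(traj - position, axis=1)
    return policies[int(np.argmin(dists))]
```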
Slide 19: Problems with Dynamic Programming
Problem 2: Uncertainty over future states makes it hard to learn any good policy.
Slide 20: Problems with Dynamic Programming
[Figure: distribution over states at time t = 5]
Due to stochasticity, there is large uncertainty over states in the distant future.
Slide 21: Problems with Dynamic Programming
[Figure: distribution over states at time t = 5]
DP algorithms require learning a policy that performs well over the entire distribution.
Slide 22: Space-Indexed Dynamic Programming
- Basic idea of Space-Indexed Dynamic Programming (SIDP): perform DP with respect to space indices (planes tangent to the trajectory)
Slide 23: Space-Indexed Dynamical Systems and Dynamic Programming
Slide 24: Difficulty with SIDP
- There is no guarantee that taking a single action will move the vehicle to the next plane along the trajectory
- We therefore introduce the notion of a space-indexed dynamical system
Slides 25-29: Time-Indexed Dynamical System
- Creating time-indexed dynamical systems: given the current state $s_t$ and control action $u_t$, the continuous dynamics give the time derivative of the state, $\dot{s} = f(s_t, u_t)$
- Euler integration then yields the discrete-time system $s_{t+1} = s_t + \Delta t \, f(s_t, u_t)$
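In code, this step is a one-liner; a sketch with a generic dynamics function `f` (the concrete model is not specified here):

```python
import numpy as np

def time_indexed_step(s, u, f, dt):
    """One Euler step of the time-indexed system: s_{t+1} = s_t + dt * f(s_t, u_t)."""
    return s + dt * np.asarray(f(s, u))
```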
Slide 30: Space-Indexed Dynamical Systems
- Creating space-indexed dynamical systems: simulate forward until the vehicle hits the next tangent plane
[Figure: vehicle crossing from space index d to space index d+1]
Slides 31-32: Space-Indexed Dynamical Systems
- Creating space-indexed dynamical systems: solve for the time at which the Euler step crosses the next tangent plane; with plane normal $n$, a point $z$ on the plane, and vehicle position $p$, the crossing time is $\Delta t = n^\top (z - p) \,/\, n^\top \dot{p}$
- (A positive solution exists as long as the controller makes some forward progress, i.e. $n^\top \dot{p} > 0$)
Slide 33: Space-Indexed Dynamical Systems
- The result is a dynamical system indexed by a spatial-index variable d rather than by time
- Space-indexed dynamic programming runs DP directly on this system
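A minimal sketch of one space-indexed transition under these definitions; `pos` extracts the position components of the state, and all names are illustrative rather than the paper's notation:

```python
import numpy as np

def space_indexed_step(s, u, f, n, z, pos):
    """Advance the state until it crosses the next tangent plane.

    s    : current state vector
    u    : control action
    f    : dynamics function, sdot = f(s, u)
    n, z : unit normal and a point defining the next plane
    pos  : function extracting the position components of a state vector
    """
    sdot = np.asarray(f(s, u))
    progress = n @ pos(sdot)            # forward velocity toward the plane
    assert progress > 0, "controller must make some forward progress"
    dt = n @ (z - pos(s)) / progress    # Euler crossing time for the plane
    return s + dt * sdot                # lands on the plane under the Euler model
```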
Slides 34-36: Space-Indexed Dynamic Programming
[Figure: trajectory divided into discrete space planes d1, ..., d5]
Divide the trajectory into discrete space planes.
Slides 37-40: Space-Indexed Dynamic Programming
[Figure: policies learned backwards over the space planes, d1, ..., d5]
Proceeding backwards, learn policies for d = D, D-1, ..., 2, 1.
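The backward pass has the same shape as the time-indexed sketch earlier, only indexed by planes and driven by the space-indexed dynamics (helpers again hypothetical):

```python
def space_indexed_dp(D, sample_states_on_plane, fit_policy):
    # sample_states_on_plane(d) draws states lying on plane d;
    # fit_policy trains a local policy given the already-learned later policies.
    policies = [None] * D
    for d in reversed(range(D)):  # d = D-1, ..., 1, 0 (0-indexed here)
        states = sample_states_on_plane(d)
        policies[d] = fit_policy(states, policies[d + 1:])
    return policies
```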
Slide 41: Problems with Dynamic Programming
Problem 1: Policies from traditional dynamic programming algorithms are time-indexed.
Slide 42: Space-Indexed Dynamic Programming
- Space-indexed DP always executes the policy for the current spatial index
- Time-indexed DP can execute a policy learned for a different location
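At execution time the spatial index advances with the vehicle by construction; a sketch reusing `space_indexed_step` from the earlier snippet (`planes` is an assumed list of (normal, point) pairs):

```python
def follow_trajectory(s0, policies, planes, f, pos):
    """Execute space-indexed policies: at plane d, apply policy d's action
    and integrate until the vehicle crosses plane d+1."""
    s = s0
    for d, (n, z) in enumerate(planes):
        u = policies[d](s)  # policy for the current spatial index
        s = space_indexed_step(s, u, f, n, z, pos)
    return s
```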
Slide 43: Problems with Dynamic Programming
Problem 2: Uncertainty over future states makes it hard to learn any good policy.
Slides 44-45: Space-Indexed Dynamic Programming
[Figure: distribution over states at time t = 5 vs. distribution over states at index d = 5]
- Time-indexed DP: wide distribution over future states
- Space-indexed DP: much tighter distribution over future states
Slide 46: Experiments
Slide 47: Experimental Domain
- Task: following a race-track trajectory in an RC car, with randomly placed obstacles
Slide 48: Experimental Setup
- Implemented a space-indexed version of the PSDP algorithm
- The policy chooses a steering angle using an SVM classifier (constant velocity); a sketch of this policy class follows below
- Used a simple textbook model simulator of the car dynamics to learn the policy
- Evaluated time-indexed PSDP, time-indexed PSDP with re-indexing, and space-indexed PSDP
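A hedged sketch of this kind of policy class using scikit-learn; the features, labels, and steering discretization below are assumptions for illustration, not the paper's exact setup:

```python
import numpy as np
from sklearn.svm import SVC

# Assumed discretization of steering angles (radians); hypothetical values.
STEER_ANGLES = np.linspace(-0.5, 0.5, 5)

def fit_steering_policy(features, best_angle_idx):
    """Train an SVM classifier mapping state features at one space index
    to the index of the best discretized steering angle."""
    clf = SVC(kernel="rbf").fit(features, best_angle_idx)
    return lambda x: STEER_ANGLES[int(clf.predict(np.atleast_2d(x))[0])]
```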
Slide 49: Time-Indexed PSDP
Slide 50: Time-Indexed PSDP with Re-indexing
Slide 51: Space-Indexed PSDP
Slide 52: Empirical Evaluation

| Method | Cost |
| --- | --- |
| Time-indexed PSDP | Infinite (no trajectory succeeds) |
| Time-indexed PSDP with re-indexing | 59.74 |
| Space-indexed PSDP | 49.32 |
Slide 53: Additional Experiments
- The paper includes additional experiments on the Stanford Grand Challenge car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP
Slide 54: Related Work
- Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005
- Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008
- Gain Scheduling, Model Predictive Control: Leith and Leithead, 2000; Garcia et al., 1989
Slide 55: Summary
- Trajectory following calls for non-stationary policies, but traditional DP / RL algorithms suffer because they are time-indexed
- In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming
- We demonstrated the usefulness of these methods on real-world control tasks
Slide 56: Thank You!
- Videos available online at http://cs.stanford.edu/~kolter/icml08videos