Title: Applying Online Search Techniques to Continuous-State Reinforcement Learning
1. Applying Online Search Techniques to Continuous-State Reinforcement Learning
- Scott Davies, Andrew Ng, and Andrew Moore
- Carnegie Mellon University
2. Discrete Reinforcement Learning
- Agent can be in one of a finite number of discrete states x_i
- Agent chooses a discrete action a at each time step
- Action affects the state transition probabilities between the current and next time steps
- Environment punishes/rewards the agent depending on the current state and action
- Goal: maximize long-term reward
3. Solving Discrete RL Problems
- Well-known methods exist for solving such problems with either known or unknown state transitions and rewards
- Value iteration, policy iteration, and modified policy iteration can be used with either a priori or learned models of the problem dynamics
- Model-free methods such as Q-learning and TD(λ) can be used straightforwardly in either case
4. Value Functions
- All methods learn a value function V(x), where V(x) is the discounted total reward the agent will accrue if it acts optimally starting at state x.
- Or, equivalently (sort of), a Q-function Q(x,a), where Q(x,a) is the discounted total reward the agent will accrue if it acts optimally after taking action a in state x.
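For reference (not spelled out on the slide), these quantities satisfy the standard Bellman optimality equations; writing r(x,a) for the reward, P(x'|x,a) for the transition probabilities, and γ for the discount factor:

V(x) = \max_a \Big[ r(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V(x') \Big]

Q(x,a) = r(x,a) + \gamma \sum_{x'} P(x' \mid x, a) \max_{a'} Q(x', a'), \qquad V(x) = \max_a Q(x,a)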
5. Continuous-State RL
- Naturally, we'd also like to solve RL problems with continuous states.
- Problem: V(x) must be approximated.
[Figure: approximate value function V for the hill-car problem]
6. Approximate Value Functions
[Figure: value function approximated with simplex interpolation]
7. The Agony of Continuous State Spaces
- Learning useful value functions for continuous-state optimal control problems can be difficult!
- Accurate value functions can be very expensive to compute, even in relatively low-dimensional spaces with a perfectly accurate state transition model
- Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
8. Combining Value Functions With Online Search
- Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent's current position to compensate for value-function inaccuracies
- We examine two different types of search:
  - Local searches, in which the agent performs a finite-depth look-ahead search
  - Global searches, in which the agent searches for trajectories all the way to goal states
- We restrict our attention to discrete-action, discrete-time, noiseless environments
9. Local Search
- Simple idea borrowed from classical game-playing AI: limited-depth lookahead.
- Out of all possible d-step trajectories T, perform the first action of the trajectory T that maximizes
  - (reward accumulated over T) + V(x_end),
  - where x_end is the state at the end of T
- Can also constrain the search to trajectories T in which the action is switched at most s times. Considerably cheaper if s << d (see the sketch after this slide)
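A minimal sketch of this lookahead, assuming a deterministic simulator step(x, a) -> (next_state, reward), an approximate value function V(x), a discrete action set ACTIONS, and a discount factor gamma; these names are illustrative, not the paper's own code:

def local_search(x0, depth, max_switches, step, V, ACTIONS, gamma=1.0):
    """Return the first action of the best trajectory of length `depth`
    with at most `max_switches` action switches."""
    best_score, best_first = float("-inf"), None

    def recurse(x, d, last_a, switches, acc, disc, first_a):
        nonlocal best_score, best_first
        if d == depth:
            score = acc + disc * V(x)   # accumulated reward + value at end of trajectory
            if score > best_score:
                best_score, best_first = score, first_a
            return
        for a in ACTIONS:
            s = switches + (1 if (last_a is not None and a != last_a) else 0)
            if s > max_switches:
                continue                 # too many action switches: skip this branch
            x2, r = step(x, a)
            recurse(x2, d + 1, a, s, acc + disc * r, disc * gamma,
                    a if first_a is None else first_a)

    recurse(x0, 0, None, 0, 0.0, 1.0, None)
    return best_first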
10. Local Search Examples
[Figures: local search trajectories with s = 0, s = 1, and s = 2 action switches]
11. Global Search
- If hunting for a goal state, why not just do one big uniform-cost search from the current state all the way to the goal?
- Upside: don't need a value function
12. Global Search
Local search taken to its illogical extreme
- Downside: too many possible trajectories!
13. Global Search: Locale-Based Pruning
- Borrow a technique from robot motion planning:
  - Partition the state space into a grid
  - Only search from the least-cost trajectory entering any given partition; prune all others (see the sketch after this slide)
[Figure: start of a locale-based-pruning search]
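A minimal sketch of locale-based pruning as uniform-cost search, under the same illustrative assumptions as the local-search sketch above (deterministic step(x, a) -> (next_state, cost), an is_goal(x) test, and a cell(x) function mapping continuous states to grid partitions):

import heapq

def pruned_global_search(x0, step, is_goal, cell, ACTIONS):
    """Return the action sequence of the least-cost trajectory found
    from x0 to a goal state, or None if the (pruned) search fails."""
    frontier = [(0.0, 0, x0, [])]        # (cost so far, tiebreak, state, actions)
    best_cost = {cell(x0): 0.0}          # cheapest cost seen entering each cell
    counter = 1
    while frontier:
        cost, _, x, actions = heapq.heappop(frontier)
        if is_goal(x):
            return actions
        if cost > best_cost.get(cell(x), float("inf")):
            continue                      # a cheaper trajectory already entered this cell
        for a in ACTIONS:
            x2, c = step(x, a)
            new_cost = cost + c
            k = cell(x2)
            if new_cost < best_cost.get(k, float("inf")):
                best_cost[k] = new_cost   # prune all pricier entries into this cell
                heapq.heappush(frontier, (new_cost, counter, x2, actions + [a]))
                counter += 1
    return None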
14. Global Search: Locale-Based Pruning
- Big advantages over classical discretization-based DP methods:
  - Not all cells need be visited / represented
  - Each cell is only visited once
  - If the model is accurate, the solution will work: one continuous trajectory, no aliasing
[Figure: completed locale-based-pruning search]
15. Discretization-Based Algorithms
- Transform the continuous-state problem into a discrete-state one
- Different methods of approximating V correspond to different methods of making this transformation
- Find V for the discrete-state problem with your favorite dynamic programming method (e.g., modified policy iteration)
- If the discrete problem has a solution, DP will converge to it!
- When executing in the actual continuous-state environment, the agent maps continuous states to (possibly combinations of) discrete states and uses the discrete states' values of V
16. Simple Grid-Based Discretization
- Break the state space into a uniform grid
- Choose a representative state for each grid cell
- For each possible action from each representative state, integrate forward until you reach a different cell (see the sketch after this slide)
- Equivalent to approximating V as a constant over each grid cell
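A minimal sketch of how the resulting discrete model could be assembled, assuming illustrative helpers rep_state(cell), cell_of(x), and integrate(x, a), where integrate simulates the continuous dynamics forward until the trajectory leaves the starting cell and returns (exit_state, accumulated_cost):

def build_discrete_model(cells, ACTIONS, rep_state, integrate, cell_of):
    """Return {(cell, action): (successor_cell, cost)} for a deterministic
    grid-based discretization (V treated as constant over each cell)."""
    model = {}
    for c in cells:
        x = rep_state(c)                  # representative state of this cell
        for a in ACTIONS:
            x_exit, cost = integrate(x, a)
            model[(c, a)] = (cell_of(x_exit), cost)
    return model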
17. Multilinear Interpolation
Value at the end of the trajectory is interpolated as a weighted average of the values at the 2^d surrounding corner points.
Equivalent to a discrete problem in which each state/action pair results in 2^d possible successor states, with transition probabilities equal to the corresponding interpolation weights (see the sketch after this slide).
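A minimal sketch of the interpolation itself, assuming a d-dimensional NumPy array `values` of corner values and a query point already expressed in grid coordinates (grid spacing 1), strictly inside the grid:

import numpy as np

def multilinear_interpolate(values, point):
    """Weighted average of the 2^d surrounding corner values; the weights
    are exactly the transition probabilities of the equivalent discrete problem."""
    point = np.asarray(point, dtype=float)
    base = np.floor(point).astype(int)
    frac = point - base                     # fractional position within the cell
    d = len(point)
    total = 0.0
    for corner in range(2 ** d):            # enumerate the 2^d cell corners
        idx, weight = [], 1.0
        for j in range(d):
            bit = (corner >> j) & 1
            idx.append(base[j] + bit)
            weight *= frac[j] if bit else (1.0 - frac[j])
        total += weight * values[tuple(idx)]
    return total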
18. Simplex-Based Interpolation
- Each grid cell is implicitly decomposed into d! simplexes based on the Kuhn triangulation
- The value V at the end of the trajectory is interpolated as a weighted sum of the values at the (d+1) corner points of the enclosing simplex (see the sketch after this slide)
- Equivalent to a discrete problem in which each state/action pair results in (d+1) possible successor states, with transition probabilities equal to the corresponding interpolation weights
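A corresponding sketch for Kuhn-triangulation interpolation, with the same `values` / grid-coordinate assumptions as the multilinear sketch; only d+1 corners receive nonzero weight:

import numpy as np

def simplex_interpolate(values, point):
    """Interpolate using the Kuhn simplex of the grid cell containing `point`."""
    point = np.asarray(point, dtype=float)
    base = np.floor(point).astype(int)
    frac = point - base
    d = len(point)
    order = np.argsort(-frac)                # coordinates sorted by decreasing fraction
    sorted_frac = frac[order]

    # Barycentric weights for the d+1 simplex vertices.
    weights = np.empty(d + 1)
    weights[0] = 1.0 - sorted_frac[0]
    for k in range(1, d):
        weights[k] = sorted_frac[k - 1] - sorted_frac[k]
    weights[d] = sorted_frac[d - 1]

    # Walk from the base corner, adding one unit step per sorted coordinate.
    vertex = base.copy()
    total = weights[0] * values[tuple(vertex)]
    for k in range(d):
        vertex[order[k]] += 1
        total += weights[k + 1] * values[tuple(vertex)]
    return total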
19. Global Search
[Figure: uninformed global search example]
- Can be useful, but...
  - Still computationally expensive: lots of cells visited
  - Even with a fine partitioning of the state space, pruning the wrong trajectories can lead to suboptimal solutions or complete failure
20. Informed Global Search
- First, use fast approximators to automatically learn a relatively crude V
- Then, use V to guide a global search with locale-based pruning, much in the style of A* (see the sketch after this slide)
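A minimal sketch of the informed variant, reusing the illustrative helpers from the earlier global-search sketch plus an approximate cost-to-go Vhat(x) (the negated value function, for reward-maximization problems) to order the queue, A*-style:

import heapq

def informed_global_search(x0, step, is_goal, cell, ACTIONS, Vhat):
    """Locale-based pruning, but the frontier is ordered by
    (cost so far + estimated cost-to-go) rather than cost alone."""
    frontier = [(Vhat(x0), 0.0, 0, x0, [])]   # (priority, cost, tiebreak, state, actions)
    best_cost = {cell(x0): 0.0}
    counter = 1
    while frontier:
        _, cost, _, x, actions = heapq.heappop(frontier)
        if is_goal(x):
            return actions
        if cost > best_cost.get(cell(x), float("inf")):
            continue
        for a in ACTIONS:
            x2, c = step(x, a)
            new_cost = cost + c
            k = cell(x2)
            if new_cost < best_cost.get(k, float("inf")):
                best_cost[k] = new_cost
                heapq.heappush(frontier,
                               (new_cost + Vhat(x2), new_cost, counter, x2, actions + [a]))
                counter += 1
    return None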
21. Informed Global Search Examples
[Figures: informed global search examples]
- Very approximate V (7x7 simplex interpolation)
- More accurate V (13x13 simplex interpolation)
22. Acrobot
- Two-link planar robot acting in a vertical plane under gravity
- Underactuated: torque applied at the elbow joint, shoulder joint unactuated
- State: the two angular positions and their velocities (4-d)
- Goal: raise the tip at least one link's height above the shoulder
- Two actions: full torque clockwise / counterclockwise
- Random starting positions
- Cost: total time to goal
[Figure: acrobot, with joint angles θ1 and θ2 and the goal height marked]
23. Move-Cart-Pole
- Upright pole attached to a cart by an unactuated joint
- State: horizontal position of the cart, angle of the pole, and the associated velocities (4-d)
- Actions: accelerate left or right
- Goal configuration: cart moved, pole balanced
- Start with random x ≠ 0
- Per-step cost quadratic in distance from the goal configuration
- Big penalty if the pole falls over
[Figure: cart-pole system, with pole angle θ, cart position x, and the goal configuration marked]
24. Planar Slider
- Puck sliding on a bumpy 2-d surface
- State: two spatial variables and their velocities (4-d)
- Actions: accelerate NW, NE, SW, or SE
- Goal: in the NW corner
- Random start states
- Cost: total time to goal
25. Local Search Experiments: Move-Cart-Pole
- CPU time and solution cost vs. search depth d
- No limits imposed on the number of action switches (s = d)
- Value function: 13^4 simplex-interpolation grid
26. Local Search Experiments: Hill-Car
- CPU time and solution cost vs. search depth d
- Maximum number of action switches fixed at 2 (s = 2)
- Value function: 7^2 simplex-interpolated value function
27. Comparative Experiments: Hill-Car
- Local search: d = 6, s = 2
- Global searches:
  - Local search between grid elements: d = 20, s = 1
  - 50^2 search grid resolution
  - 7^2 simplex-interpolated value function
28. Hill-Car Results (cont'd)
- Uninformed global search prunes the wrong trajectories
- Increase the search grid to 100^2 so this doesn't happen:
  - Uninformed search does near-optimally
  - Informed search doesn't: the crude value function is not optimistic
[Figure: failed search trajectory]
29. Comparative Results: Four-Dimensional Domains
- All value functions: 13^4 simplex interpolations
- All local searches between global-search elements: depth 20, with at most 1 action switch (d = 20, s = 1)
- Acrobot:
  - Local search: depth 4, no action-switch restriction (d = 4, s = 4)
  - Global: 50^4 search grid
- Move-Cart-Pole: same as Acrobot
- Slider:
  - Local search: depth 10, at most 1 action switch (d = 10, s = 1)
  - Global: 20^4 search grid
30. Acrobot
- LS = number of local searches performed to find paths between elements of the global search grid
- Local search significantly improves solution quality, but increases CPU time by an order of magnitude
- Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning
- Informed global search finds much better solutions in relatively little time: the value function drastically reduces the search, and better pruning leads to better solutions
31. Move-Cart-Pole
- No search: the pole often falls, incurring large penalties; overall poor solution quality
- Local search improves things a bit
- Uninformed search finds better solutions than informed search:
  - Few grid cells in which pruning is required
  - Value function not optimistic, so informed-search solutions are suboptimal
- Informed search reduces costs by an order of magnitude with no increase in required CPU time
32. Planar Slider
- Local search is almost useless, and incurs massive CPU expense
- Uninformed search decreases solution cost by 50%, but at even greater CPU expense
- Informed search decreases solution cost by a factor of 4, at no increase in CPU time
33. Using Search with Learned Models
- Toy example: Hill-Car
  - 7^2 simplex-interpolated value function
  - One nearest-neighbor function approximator per possible action used to learn dx/dt (see the sketch after this slide)
  - States sufficiently far away from the nearest neighbor optimistically assumed to be absorbing, to encourage exploration
- Average costs over the first few hundred trials:
  - No search: 212
  - Local search: 127
  - Informed global search: 155
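A minimal sketch of the learned-model component described above, with illustrative names and an assumed distance threshold; one nearest-neighbor approximator is kept per action, and queries far from all stored data are optimistically treated as absorbing:

import numpy as np

class NearestNeighborModel:
    def __init__(self, max_distance):
        self.max_distance = max_distance
        self.data = {}                     # action -> list of (state, dx/dt) pairs

    def add(self, action, state, dxdt):
        self.data.setdefault(action, []).append((np.asarray(state), np.asarray(dxdt)))

    def predict(self, action, state):
        """Return (dx/dt, absorbing) based on the nearest stored sample."""
        samples = self.data.get(action, [])
        if not samples:
            return None, True              # no data at all: treat as absorbing
        state = np.asarray(state)
        dists = [np.linalg.norm(state - s) for s, _ in samples]
        i = int(np.argmin(dists))
        if dists[i] > self.max_distance:
            return None, True              # too far from known data: absorbing, to encourage exploration
        return samples[i][1], False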
34. Using Search with Learned Models
- Problems do arise when using learned models:
  - Inaccuracies in the models may cause global searches to fail; it is then unclear whether the failure should be blamed on model inaccuracies or on an insufficiently fine state-space partitioning
  - Trajectories found will be inaccurate
    - Need an adaptive closed-loop controller
    - Fortunately, we will get new data with which to increase the accuracy of our model
  - Model approximators must be fast and accurate
35. Avenues for Future Research
- Extensions to nondeterministic systems?
- Higher-dimensional problems
- Better function approximators for model learning
- Variable-resolution search grids
- Optimistic value function generation?