Title: Applying Online Search Techniques to Continuous-State Reinforcement Learning
1. Applying Online Search Techniques to Continuous-State Reinforcement Learning
- Scott Davies, Andrew Ng, and Andrew Moore
- Carnegie Mellon University
2. Discrete Reinforcement Learning
- Agent can be in one of a finite number of discrete states x_i
- Agent chooses a discrete action a at each time step
- Action affects the state transition probabilities between the current and next time steps
- Environment punishes/rewards the agent depending on the current state and action
- Goal: maximize long-term reward
3. Solving Discrete RL Problems
- Well-known methods exist for solving such problems with either known or unknown state transitions and rewards
- Value iteration, policy iteration, and modified policy iteration can be used with either a priori or learned models of the problem dynamics
- Model-free methods such as Q-learning and TD(λ) can be used straightforwardly in either case
4. Value Functions
- All methods learn a value function V(x), where V(x) is the discounted total reward the agent will accrue if it acts optimally starting at state x.
- Or, equivalently (sort of), a Q-function Q(x,a), where Q(x,a) is the discounted total reward the agent will accrue if it acts optimally after taking action a in state x.
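For reference (not spelled out on the slide), these quantities satisfy the standard Bellman optimality equations; writing r(x,a) for the reward, P(x'|x,a) for the transition probabilities, and γ for the discount factor:

V(x) = \max_a \Big[ r(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V(x') \Big]

Q(x,a) = r(x,a) + \gamma \sum_{x'} P(x' \mid x, a) \max_{a'} Q(x', a'), \qquad V(x) = \max_a Q(x,a)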
5. Continuous-State RL
- Naturally, we'd also like to solve RL problems with continuous states.
- Problem: V(x) must be approximated.
[Figure: approximate value function V for the hill-car problem]
6. Approximate Value Functions
[Figure: value function approximated with simplex interpolation]
7. The Agony of Continuous State Spaces
- Learning useful value functions for continuous-state optimal control problems can be difficult!
- Accurate value functions can be very expensive to compute, even in relatively low-dimensional spaces with a perfectly accurate state transition model
- Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
8. Combining Value Functions With Online Search
- Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent's current position to compensate for value-function inaccuracies
- We examine two different types of search:
  - Local searches, in which the agent performs a finite-depth look-ahead search
  - Global searches, in which the agent searches for trajectories all the way to goal states
- We restrict our attention to discrete-action, discrete-time, noiseless environments
9. Local Search
- Simple idea borrowed from classical game-playing AI: limited-depth lookahead.
- Out of all possible d-step trajectories T, perform the first action of the trajectory T that maximizes
  - (reward accumulated over T) + V(x_end),
  - where x_end is the state at the end of T
- Can also constrain the search to trajectories T in which the action is switched at most s times. Considerably cheaper if s << d (see the sketch after this slide)
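A minimal sketch of this lookahead, assuming a deterministic simulator step(x, a) -> (next_state, reward), an approximate value function V(x), a discrete action set ACTIONS, and a discount factor gamma; these names are illustrative, not the paper's own code:

def local_search(x0, depth, max_switches, step, V, ACTIONS, gamma=1.0):
    """Return the first action of the best trajectory of length `depth`
    with at most `max_switches` action switches."""
    best_score, best_first = float("-inf"), None

    def recurse(x, d, last_a, switches, acc, disc, first_a):
        nonlocal best_score, best_first
        if d == depth:
            score = acc + disc * V(x)   # accumulated reward + value at end of trajectory
            if score > best_score:
                best_score, best_first = score, first_a
            return
        for a in ACTIONS:
            s = switches + (1 if (last_a is not None and a != last_a) else 0)
            if s > max_switches:
                continue                 # too many action switches: skip this branch
            x2, r = step(x, a)
            recurse(x2, d + 1, a, s, acc + disc * r, disc * gamma,
                    a if first_a is None else first_a)

    recurse(x0, 0, None, 0, 0.0, 1.0, None)
    return best_first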
10. Local Search Examples
[Figures: local search trajectories with s = 0, s = 1, and s = 2 action switches]
11. Global Search
- If hunting for a goal state, why not just do one big uniform-cost search from the current state all the way to the goal?
- Upside: don't need a value function
12. Global Search
Local search taken to its illogical extreme
- Downside: too many possible trajectories!
13. Global Search: Locale-Based Pruning
- Borrow a technique from robot motion planning:
  - Partition the state space into a grid
  - Only search from the least-cost trajectory entering any given partition; prune all others (see the sketch after this slide)
[Figure: start of a locale-based-pruning search]
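A minimal sketch of locale-based pruning as uniform-cost search, under the same illustrative assumptions as the local-search sketch above (deterministic step(x, a) -> (next_state, cost), an is_goal(x) test, and a cell(x) function mapping continuous states to grid partitions):

import heapq

def pruned_global_search(x0, step, is_goal, cell, ACTIONS):
    """Return the action sequence of the least-cost trajectory found
    from x0 to a goal state, or None if the (pruned) search fails."""
    frontier = [(0.0, 0, x0, [])]        # (cost so far, tiebreak, state, actions)
    best_cost = {cell(x0): 0.0}          # cheapest cost seen entering each cell
    counter = 1
    while frontier:
        cost, _, x, actions = heapq.heappop(frontier)
        if is_goal(x):
            return actions
        if cost > best_cost.get(cell(x), float("inf")):
            continue                      # a cheaper trajectory already entered this cell
        for a in ACTIONS:
            x2, c = step(x, a)
            new_cost = cost + c
            k = cell(x2)
            if new_cost < best_cost.get(k, float("inf")):
                best_cost[k] = new_cost   # prune all pricier entries into this cell
                heapq.heappush(frontier, (new_cost, counter, x2, actions + [a]))
                counter += 1
    return None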
14. Global Search: Locale-Based Pruning
- Big advantages over classical discretization-based DP methods:
  - Not all cells need be visited / represented
  - Each cell is only visited once
  - If the model is accurate, the solution will work: one continuous trajectory, no aliasing
[Figure: completed locale-based-pruning search]
15. Discretization-Based Algorithms
- Transform the continuous-state problem into a discrete-state one
- Different methods of approximating V correspond to different methods of making this transformation
- Find V for the discrete-state problem with your favorite dynamic programming method (e.g., modified policy iteration)
- If the discrete problem has a solution, DP will converge to it!
- When executing in the actual continuous-state environment, the agent maps continuous states to (possibly combinations of) discrete states and uses the discrete states' values of V
16. Simple Grid-Based Discretization
- Break the state space into a uniform grid
- Choose a representative state for each grid cell
- For each possible action from each representative state, integrate forward until you reach a different cell (see the sketch after this slide)
- Equivalent to approximating V as a constant over each grid cell
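A minimal sketch of how the resulting discrete model could be assembled, assuming illustrative helpers rep_state(cell), cell_of(x), and integrate(x, a), where integrate simulates the continuous dynamics forward until the trajectory leaves the starting cell and returns (exit_state, accumulated_cost):

def build_discrete_model(cells, ACTIONS, rep_state, integrate, cell_of):
    """Return {(cell, action): (successor_cell, cost)} for a deterministic
    grid-based discretization (V treated as constant over each cell)."""
    model = {}
    for c in cells:
        x = rep_state(c)                  # representative state of this cell
        for a in ACTIONS:
            x_exit, cost = integrate(x, a)
            model[(c, a)] = (cell_of(x_exit), cost)
    return model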
17. Multilinear Interpolation
Value at the end of the trajectory is interpolated as a weighted average of the values at the 2^d surrounding corner points.
Equivalent to a discrete problem in which each state/action pair results in 2^d possible successor states, with transition probabilities equal to the corresponding interpolation weights (see the sketch after this slide).
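A minimal sketch of the interpolation itself, assuming a d-dimensional NumPy array `values` of corner values and a query point already expressed in grid coordinates (grid spacing 1), strictly inside the grid:

import numpy as np

def multilinear_interpolate(values, point):
    """Weighted average of the 2^d surrounding corner values; the weights
    are exactly the transition probabilities of the equivalent discrete problem."""
    point = np.asarray(point, dtype=float)
    base = np.floor(point).astype(int)
    frac = point - base                     # fractional position within the cell
    d = len(point)
    total = 0.0
    for corner in range(2 ** d):            # enumerate the 2^d cell corners
        idx, weight = [], 1.0
        for j in range(d):
            bit = (corner >> j) & 1
            idx.append(base[j] + bit)
            weight *= frac[j] if bit else (1.0 - frac[j])
        total += weight * values[tuple(idx)]
    return total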
18. Simplex-Based Interpolation
- Each grid cell is implicitly decomposed into d! simplexes based on the Kuhn triangulation
- The value V at the end of the trajectory is interpolated as a weighted sum of the values at the (d+1) corner points of the enclosing simplex (see the sketch after this slide)
- Equivalent to a discrete problem in which each state/action pair results in (d+1) possible successor states, with transition probabilities equal to the corresponding interpolation weights
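A corresponding sketch for Kuhn-triangulation interpolation, with the same `values` / grid-coordinate assumptions as the multilinear sketch; only d+1 corners receive nonzero weight:

import numpy as np

def simplex_interpolate(values, point):
    """Interpolate using the Kuhn simplex of the grid cell containing `point`."""
    point = np.asarray(point, dtype=float)
    base = np.floor(point).astype(int)
    frac = point - base
    d = len(point)
    order = np.argsort(-frac)                # coordinates sorted by decreasing fraction
    sorted_frac = frac[order]

    # Barycentric weights for the d+1 simplex vertices.
    weights = np.empty(d + 1)
    weights[0] = 1.0 - sorted_frac[0]
    for k in range(1, d):
        weights[k] = sorted_frac[k - 1] - sorted_frac[k]
    weights[d] = sorted_frac[d - 1]

    # Walk from the base corner, adding one unit step per sorted coordinate.
    vertex = base.copy()
    total = weights[0] * values[tuple(vertex)]
    for k in range(d):
        vertex[order[k]] += 1
        total += weights[k + 1] * values[tuple(vertex)]
    return total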
19. Global Search
[Figure: uninformed global search example]
- Can be useful, but...
  - Still computationally expensive: lots of cells visited
  - Even with a fine partitioning of the state space, pruning the wrong trajectories can lead to suboptimal solutions or complete failure
20. Informed Global Search
- First, use fast approximators to automatically learn a relatively crude V
- Then, use V to guide a global search with locale-based pruning, much in the style of A* (see the sketch after this slide)
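A minimal sketch of the informed variant, reusing the illustrative helpers from the earlier global-search sketch plus an approximate cost-to-go Vhat(x) (the negated value function, for reward-maximization problems) to order the queue, A*-style:

import heapq

def informed_global_search(x0, step, is_goal, cell, ACTIONS, Vhat):
    """Locale-based pruning, but the frontier is ordered by
    (cost so far + estimated cost-to-go) rather than cost alone."""
    frontier = [(Vhat(x0), 0.0, 0, x0, [])]   # (priority, cost, tiebreak, state, actions)
    best_cost = {cell(x0): 0.0}
    counter = 1
    while frontier:
        _, cost, _, x, actions = heapq.heappop(frontier)
        if is_goal(x):
            return actions
        if cost > best_cost.get(cell(x), float("inf")):
            continue
        for a in ACTIONS:
            x2, c = step(x, a)
            new_cost = cost + c
            k = cell(x2)
            if new_cost < best_cost.get(k, float("inf")):
                best_cost[k] = new_cost
                heapq.heappush(frontier,
                               (new_cost + Vhat(x2), new_cost, counter, x2, actions + [a]))
                counter += 1
    return None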
21. Informed Global Search Examples
[Figures: informed global search examples]
- Very approximate V (7x7 simplex interpolation)
- More accurate V (13x13 simplex interpolation)
22. Acrobot
- Two-link planar robot acting in a vertical plane under gravity
- Underactuated: torque applied at the elbow joint, shoulder joint unactuated
- State: the two angular positions and their velocities (4-d)
- Goal: raise the tip at least one link's height above the shoulder
- Two actions: full torque clockwise / counterclockwise
- Random starting positions
- Cost: total time to goal
[Figure: acrobot, with joint angles θ1 and θ2 and the goal height marked]
23. Move-Cart-Pole
- Upright pole attached to a cart by an unactuated joint
- State: horizontal position of the cart, angle of the pole, and the associated velocities (4-d)
- Actions: accelerate left or right
- Goal configuration: cart moved, pole balanced
- Start with random x ≠ 0
- Per-step cost quadratic in distance from the goal configuration
- Big penalty if the pole falls over
[Figure: cart-pole system, with pole angle θ, cart position x, and the goal configuration marked]
24. Planar Slider
- Puck sliding on a bumpy 2-d surface
- State: two spatial variables and their velocities (4-d)
- Actions: accelerate NW, NE, SW, or SE
- Goal: in the NW corner
- Random start states
- Cost: total time to goal
25. Local Search Experiments: Move-Cart-Pole
- CPU time and solution cost vs. search depth d
- No limits imposed on the number of action switches (s = d)
- Value function: 13^4 simplex-interpolation grid
26. Local Search Experiments: Hill-Car
- CPU time and solution cost vs. search depth d
- Maximum number of action switches fixed at 2 (s = 2)
- Value function: 7^2 simplex-interpolated value function
27. Comparative Experiments: Hill-Car
- Local search: d = 6, s = 2
- Global searches:
  - Local search between grid elements: d = 20, s = 1
  - 50^2 search grid resolution
  - 7^2 simplex-interpolated value function
28. Hill-Car Results (cont'd)
- Uninformed global search prunes the wrong trajectories
- Increase the search grid to 100^2 so this doesn't happen:
  - Uninformed search does near-optimally
  - Informed search doesn't: the crude value function is not optimistic
[Figure: failed search trajectory]
29. Comparative Results: Four-Dimensional Domains
- All value functions: 13^4 simplex interpolations
- All local searches between global-search elements: depth 20, with at most 1 action switch (d = 20, s = 1)
- Acrobot:
  - Local search: depth 4, no action-switch restriction (d = 4, s = 4)
  - Global: 50^4 search grid
- Move-Cart-Pole: same as Acrobot
- Slider:
  - Local search: depth 10, at most 1 action switch (d = 10, s = 1)
  - Global: 20^4 search grid
30. Acrobot
- LS = number of local searches performed to find paths between elements of the global search grid
- Local search significantly improves solution quality, but increases CPU time by an order of magnitude
- Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning
- Informed global search finds much better solutions in relatively little time: the value function drastically reduces the search, and better pruning leads to better solutions
31. Move-Cart-Pole
- No search: the pole often falls, incurring large penalties; overall poor solution quality
- Local search improves things a bit
- Uninformed search finds better solutions than informed search:
  - Few grid cells in which pruning is required
  - Value function not optimistic, so informed-search solutions are suboptimal
- Informed search reduces costs by an order of magnitude with no increase in required CPU time
32. Planar Slider
- Local search is almost useless, and incurs massive CPU expense
- Uninformed search decreases solution cost by 50%, but at even greater CPU expense
- Informed search decreases solution cost by a factor of 4, at no increase in CPU time
33. Using Search with Learned Models
- Toy example: Hill-Car
  - 7^2 simplex-interpolated value function
  - One nearest-neighbor function approximator per possible action used to learn dx/dt (see the sketch after this slide)
  - States sufficiently far away from the nearest neighbor optimistically assumed to be absorbing, to encourage exploration
- Average costs over the first few hundred trials:
  - No search: 212
  - Local search: 127
  - Informed global search: 155
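A minimal sketch of the learned-model component described above, with illustrative names and an assumed distance threshold; one nearest-neighbor approximator is kept per action, and queries far from all stored data are optimistically treated as absorbing:

import numpy as np

class NearestNeighborModel:
    def __init__(self, max_distance):
        self.max_distance = max_distance
        self.data = {}                     # action -> list of (state, dx/dt) pairs

    def add(self, action, state, dxdt):
        self.data.setdefault(action, []).append((np.asarray(state), np.asarray(dxdt)))

    def predict(self, action, state):
        """Return (dx/dt, absorbing) based on the nearest stored sample."""
        samples = self.data.get(action, [])
        if not samples:
            return None, True              # no data at all: treat as absorbing
        state = np.asarray(state)
        dists = [np.linalg.norm(state - s) for s, _ in samples]
        i = int(np.argmin(dists))
        if dists[i] > self.max_distance:
            return None, True              # too far from known data: absorbing, to encourage exploration
        return samples[i][1], False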
34. Using Search with Learned Models
- Problems do arise when using learned models:
  - Inaccuracies in the models may cause global searches to fail; it is then unclear whether the failure should be blamed on model inaccuracies or on an insufficiently fine state-space partitioning
  - Trajectories found will be inaccurate
    - Need an adaptive closed-loop controller
    - Fortunately, we will get new data with which to increase the accuracy of our model
  - Model approximators must be fast and accurate
35. Avenues for Future Research
- Extensions to nondeterministic systems?
- Higher-dimensional problems
- Better function approximators for model learning
- Variable-resolution search grids
- Optimistic value function generation?