Applying Online Search Techniques to Continuous-State Reinforcement Learning

1
Applying Online Search Techniques to
Continuous-State Reinforcement Learning
  • Scott Davies, Andrew Ng, and Andrew Moore
  • Carnegie Mellon University

2
Discrete Reinforcement Learning
  • Agent can be in one of a finite number of
    discrete states xi
  • Agent gets to choose a discrete action a at each
    time step
  • Action affects state transition probabilities
    between current and next time steps
  • Environment punishes/rewards agent depending on
    current state and action
  • Goal: maximize long-term reward


3
Solving Discrete RL Problems
  • Well-known methods exist for solving such
    problems with either known or unknown state
    transitions and rewards
  • Value iteration, policy iteration, and modified
    policy iteration can be used with either a priori
    or learned models of the problem dynamics
  • Model-free methods such as Q-learning and TD(λ)
    can be used straightforwardly in either case

4
Value Functions
  • All methods learn a value function V(x),
    where V(x) is the discounted total reward the
    agent will accrue if it acts optimally starting
    at state x.
  • Or, equivalently (sort of), a Q-function Q(x,a),
    where Q(x,a) is the discounted total reward the
    agent will accrue if it acts optimally after
    taking action a in state x.
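
The equation images on this slide did not survive extraction; a standard reconstruction of the definitions (the discount factor γ and transition probabilities P are assumed notation, not the slides' own):

    V(x) = \max_a \Bigl[\, r(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V(x') \Bigr]

    Q(x,a) = r(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, \max_{a'} Q(x', a'),
    \qquad V(x) = \max_a Q(x,a)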
5
Continuous-State RL
  • Naturally, we'd also like to solve RL problems
    with continuous states.
  • Problem: V(x) must be approximated

[Figure: approximate value function for the hill-car domain]
6
Approximate Value Functions
  • Constant grid
  • Simplex interpolation
7
The Agony of Continuous State Spaces
  • Learning useful value functions for
    continuous-state optimal control problems can be
    difficult!
  • Accurate value functions can be very expensive to
    compute, even in relatively low-dimensional spaces
    with a perfectly accurate state transition model
  • Small inaccuracies/inconsistencies in
    approximated value functions can cause simple
    controllers to fail miserably

8
Combining Value Functions With Online Search
  • Instead of modeling the value function accurately
    everywhere, we can perform online searches for
    good trajectories from the agent's current
    position to compensate for value function
    inaccuracies
  • We examine two different types of search:
  • Local searches in which the agent performs a
    finite-depth look-ahead search
  • Global searches in which the agent searches for
    trajectories all the way to goal states
  • We restrict our attention to discrete-action,
    discrete-time noiseless environments

9
Local Search
  • Simple idea borrowed from classical game-playing
    AI: limited-depth lookahead.
  • Out of all possible d-step trajectories T,
    perform the first action of the T that maximizes
  • (reward accumulated over T) + V(x_end),
  • where x_end is the state at the end of T
  • Can also constrain the search to trajectories T in
    which the action is switched at most s times.
    Considerably cheaper if s << d (see the sketch
    below)
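
A minimal sketch of this lookahead, matching the deterministic, discrete-action setting above. The callbacks step(x, a), reward(x, a), and value(x) are hypothetical names for the dynamics model, reward function, and approximate V; nothing here is taken verbatim from the paper:

def local_search(x0, actions, step, reward, value, d, s, gamma=1.0):
    """Return the first action of the best d-step trajectory with at
    most s action switches, scored by accumulated reward + V(x_end)."""
    best = {'score': float('-inf'), 'first': None}

    def recurse(x, depth, last_a, switches, acc, disc, first_a):
        if depth == d:                       # trajectory complete: score it
            score = acc + disc * value(x)
            if score > best['score']:
                best['score'], best['first'] = score, first_a
            return
        for a in actions:
            sw = switches + (last_a is not None and a != last_a)
            if sw > s:
                continue                     # exceeds the switch budget
            recurse(step(x, a), depth + 1, a, sw,
                    acc + disc * reward(x, a), disc * gamma,
                    a if first_a is None else first_a)

    recurse(x0, 0, None, 0, 0.0, 1.0, None)
    return best['first']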

10
Local Search Examples
[Figures: local search examples for s = 0, 1, 2]
11
Global Search
  • If hunting for a goal state, why not just do
    one big uniform-cost search from the current
    state all the way to the goal?
  • Upside: don't need a value function

12
Global Search
Local Search taken to its illogical extreme
  • Downside: too many possible trajectories!

13
Global Search: Locale-Based Pruning
  • Borrow a technique from Robot Motion Planning
  • Partition state space into a grid
  • Only search from the least-cost trajectory entering
    any given partition; prune all others (a sketch
    follows below)
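
A sketch of the pruned uniform-cost search, under the same assumed callbacks as before, plus cell(x), which maps a continuous state to its grid-partition index, and cost(x, a), the (positive) step cost:

import heapq

def pruned_global_search(x0, actions, step, cost, cell, is_goal):
    """Uniform-cost search in which only the least-cost trajectory
    entering any grid cell is ever expanded; later arrivals are pruned."""
    frontier = [(0.0, 0, x0, [])]     # (path cost, tiebreak, state, action plan)
    tiebreak = 1                      # keeps heap comparisons away from states
    expanded = set()                  # cells already expanded once
    while frontier:
        g, _, x, plan = heapq.heappop(frontier)
        if is_goal(x):
            return plan               # one continuous trajectory to the goal
        k = cell(x)
        if k in expanded:
            continue                  # a cheaper trajectory reached this cell
        expanded.add(k)
        for a in actions:
            heapq.heappush(frontier, (g + cost(x, a), tiebreak, step(x, a),
                                      plan + [a]))
            tiebreak += 1
    return None                       # no trajectory found under this partition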

[Figure: start of search example]
14
Global Search: Locale-Based Pruning
  • Big advantages over classical discretization-based
    DP methods:
  • Not all cells need be visited / represented
  • Each cell is only visited once
  • If the model is accurate, the solution will work:
    one continuous trajectory, no aliasing

[Figure: completed search example]
15
Discretization-based algorithms
  • Transform continuous-state problem into
    discrete-state one
  • Different methods of approximating V correspond
    to different methods of making this
    transformation
  • Find V for the discrete-state problem with your
    favorite dynamic programming method (e.g., modified
    policy iteration).
  • If discrete problem has a solution, DP will
    converge to it!
  • When executing in the actual continuous-state
    environment, the agent maps continuous states to
    (possibly combinations of) discrete states and
    uses the discrete states' values of V.
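
For concreteness, a minimal value-iteration sketch for the derived discrete problem. The containers P (action → transition matrix) and R (action → reward vector) and the discount gamma are illustrative assumptions, not the paper's notation:

import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """P[a]: |S| x |S| transition matrix for action a (rows: discrete
    states, entries: interpolation weights); R[a]: reward vector."""
    n = next(iter(P.values())).shape[0]
    V = np.zeros(n)
    while True:
        # Bellman backup: best action value at every discrete state
        V_new = np.stack([R[a] + gamma * P[a] @ V for a in P]).max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new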

16
Simple Grid-Based Discretization
  • Break state space into uniform grid
  • Choose a representative state for each grid
    cell
  • For each possible action from each rep. state,
    integrate forward until you reach a different
    cell
  • Equivalent to approximating V as a constant over
    each grid cell
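
A sketch of how one discrete transition might be derived under this scheme; step, reward, and cell are the same kind of hypothetical callbacks used in the earlier sketches:

def cell_transition(rep, a, step, reward, cell, max_ticks=10000):
    """From a cell's representative state, hold action a and integrate
    the continuous dynamics forward until a different cell is entered."""
    x, total = rep, 0.0
    start = cell(rep)
    for _ in range(max_ticks):
        total += reward(x, a)
        x = step(x, a)                 # one tick of the continuous dynamics
        if cell(x) != start:
            return cell(x), total      # discrete successor + accumulated reward
    return start, total                # never left the cell: self-transition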

17
Multilinear Interpolation
The value at the end of the trajectory is interpolated
as a weighted average of the 2^d surrounding corner
points' values.
Equivalent to a discrete problem in which each
state/action pair results in 2^d possible
successor states, with transition probabilities
equal to the corresponding interpolation weights.
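
A sketch of the interpolation weights (hypothetical helper names; lo and hi are the enclosing cell's lower and upper corner coordinates):

import itertools
import numpy as np

def multilinear_weights(x, lo, hi):
    """Weights over the 2^d corners of the cell [lo, hi] containing x.
    Each corner is a 0/1 tuple; the weights sum to 1 and double as the
    transition probabilities of the derived discrete problem."""
    t = (np.asarray(x, float) - lo) / (np.asarray(hi, float) - lo)
    return {c: float(np.prod([ti if ci else 1.0 - ti
                              for ci, ti in zip(c, t)]))
            for c in itertools.product((0, 1), repeat=len(t))}

def interpolate(x, lo, hi, corner_values):
    """Weighted average of corner values (corner tuple -> V at corner)."""
    return sum(w * corner_values[c]
               for c, w in multilinear_weights(x, lo, hi).items())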
18
Simplex-Based Interpolation
  • Each grid cell is implicitly decomposed into d!
    simplexes based on the Kuhn triangulation
  • The value V at the end of the trajectory is
    interpolated as a weighted sum of the values at the
    (d+1) corner points of the enclosing simplex.
  • Equivalent to a discrete problem in which each
    state/action pair results in (d+1) possible
    successor states, with transition probabilities
    equal to the corresponding interpolation weights.
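
A sketch of the Kuhn-triangulation weights, under the standard construction (sort the point's within-cell coordinates; the enclosing simplex's corners are reached by adding one coordinate at a time in that order). Names are again hypothetical:

import numpy as np

def simplex_weights(t):
    """t: fractional coordinates of the point within its cell, in [0,1]^d.
    Returns (corner, weight) pairs for the d+1 corners of the enclosing
    Kuhn simplex; the weights are barycentric and sum to 1."""
    t = np.asarray(t, float)
    d = len(t)
    order = np.argsort(-t)                    # dimensions by decreasing t
    ts = np.concatenate(([1.0], t[order], [0.0]))
    corner = np.zeros(d, dtype=int)
    pairs = []
    for k in range(d + 1):
        pairs.append((tuple(corner), float(ts[k] - ts[k + 1])))
        if k < d:
            corner[order[k]] = 1              # step to the next simplex corner
    return pairs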

19
Global Search
[Presenter's note: fix this pic to include goal state!]
  • Can be useful, but…
  • Still computationally expensive: lots of cells
    visited
  • Even with fine partitioning of state space,
    pruning the wrong trajectories can lead to
    suboptimal solutions or complete failure

20
Informed Global Search
  • First, use fast approximators to automatically
    learn a relatively crude V
  • Then, use V to guide global search with
    locale-based pruning, much in the style of A*
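
A sketch of the informed variant: the same pruned uniform-cost search as before, but with the crude value function converted into a cost-to-go estimate h(x) that orders the queue by g + h, A*-style. As the later results discuss, h must be optimistic (never overestimate) for the pruning to stay near-optimal:

import heapq

def informed_global_search(x0, actions, step, cost, cell, is_goal, h):
    """Locale-pruned search expanding cells in order of g + h, where
    h(x) is a cost-to-go estimate derived from the crude V."""
    frontier = [(h(x0), 0.0, 0, x0, [])]  # (g + h, g, tiebreak, state, plan)
    tiebreak = 1
    expanded = set()
    while frontier:
        _, g, _, x, plan = heapq.heappop(frontier)
        if is_goal(x):
            return plan
        k = cell(x)
        if k in expanded:
            continue                      # cell already expanded more cheaply
        expanded.add(k)
        for a in actions:
            x2, g2 = step(x, a), g + cost(x, a)
            heapq.heappush(frontier, (g2 + h(x2), g2, tiebreak, x2, plan + [a]))
            tiebreak += 1
    return None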

21
Informed Global Search Examples
[Presenter's note: fix this picture]
Very approximate V (7×7 simplex interpolation)
More accurate V (13×13 simplex interpolation)
22
Acrobot
  • Two-link planar robot acting in vertical plane
    under gravity
  • Underactuated: torque only at the elbow joint;
    shoulder unactuated
  • Two angular positions + their velocities (4-d)
  • Goal: raise tip at least one link's height above
    the shoulder
  • Two actions: full torque clockwise /
    counterclockwise
  • Random starting positions
  • Cost: total time to goal

[Figure: acrobot with joint angles θ1, θ2 and the goal height marked]
23
Move-Cart-Pole
  • Upright pole attached to cart by unactuated joint
  • State: horizontal position of cart, angle of
    pole, and associated velocities (4-d)
  • Actions: accelerate left or right
  • Goal configuration: cart moved, pole balanced
  • Start with random x ≠ 0
  • Per-step cost: quadratic in distance from goal
    configuration
  • Big penalty if pole falls over

[Figure: cart-pole with pole angle θ, cart position x, and the goal configuration marked]
24
Planar Slider
  • Puck sliding on bumpy 2-d surface
  • Two spatial variables + their velocities (4-d)
  • Actions: accelerate NW, NE, SW, or SE
  • Goal in NW corner
  • Random start states
  • Cost: total time to goal

25
Local Search Experiments
Move-Cart-Pole
  • CPU time and solution cost vs. search depth d
  • No limits imposed on number of action switches
    (s = d)
  • Value function: 13^4 simplex-interpolation grid

26
Local Search Experiments
Hill-car
  • CPU time and solution cost vs. search depth d
  • Max. number of action switches fixed at 2 (s = 2)
  • Value function: 7^2 simplex-interpolated grid

27
Comparative Experiments: Hill-Car
  • Local search: d = 6, s = 2
  • Global searches:
  • Local search between grid elements: d = 20, s = 1
  • 50^2 search grid resolution
  • 7^2 simplex-interpolated value function

28
Hill-Car Results, cont'd
  • Uninformed global search prunes wrong
    trajectories
  • Increase search grid to 100^2 so this doesn't
    happen
  • Uninformed does near-optimally
  • Informed doesn't: crude value function not
    optimistic

[Figure: failed search trajectory]
29
Comparative Results: Four-d Domains
  • All value functions: 13^4 simplex interpolations
  • All local searches between global search
    elements:
  • depth 20, with at most 1 action switch (d = 20,
    s = 1)
  • Acrobot
  • Local search: depth 4, no action-switch
    restriction (d = 4, s = 4)
  • Global: 50^4 search grid
  • Move-Cart-Pole: same as Acrobot
  • Slider
  • Local search: depth 10, at most 1 action switch
    (d = 10, s = 1)
  • Global: 20^4 search grid

30
Acrobot
LS = number of local searches performed to find
paths between elements of the global search grid
  • Local search significantly improves solution
    quality, but increases CPU time by an order of
    magnitude
  • Uninformed global search takes even more time;
    poor solution quality indicates suboptimal
    trajectory pruning
  • Informed global search finds much better
    solutions in relatively little time. The value
    function drastically reduces search, and better
    pruning leads to better solutions

31
Move-Cart-Pole
  • No search: pole often falls, incurring large
    penalties; overall poor solution quality
  • Local search improves things a bit
  • Uninformed search finds better solutions than
    informed:
  • Few grid cells in which pruning is required
  • Value function not optimistic, so informed-search
    solutions suboptimal
  • Informed search reduces costs by an order of
    magnitude with no increase in required CPU time

32
Planar Slider
  • Local search: almost useless, and incurs massive
    CPU expense
  • Uninformed search decreases solution cost by 50%,
    but at even greater CPU expense
  • Informed search decreases solution cost by a
    factor of 4, at no increase in CPU time

33
Using Search with Learned Models
  • Toy example: Hill-Car
  • 7^2 simplex-interpolated value function
  • One nearest-neighbor function approximator per
    possible action, used to learn dx/dt
  • States sufficiently far away from their nearest
    neighbor are optimistically assumed to be absorbing,
    to encourage exploration
  • Average costs over the first few hundred trials:
  • No search: 212
  • Local search: 127
  • Informed global search: 155

34
Using Search with Learned Models
  • Problems do arise when using learned models:
  • Inaccuracies in models may cause global searches
    to fail. It is then unclear whether failure should
    be blamed on model inaccuracies or on insufficiently
    fine state-space partitioning
  • Trajectories found will be inaccurate
  • Need an adaptive closed-loop controller
  • Fortunately, we will get new data with which to
    increase the accuracy of our model
  • Model approximators must be fast and accurate

35
Avenues for Future Research
  • Extensions to nondeterministic systems?
  • Higher-dimensional problems
  • Better function approximators for model learning
  • Variable-resolution search grids
  • Optimistic value function generation?