Hierarchical Reinforcement Learning

1
Hierarchical Reinforcement Learning
A Survey and Comparison of HRL techniques
  • Mausam

2
The Outline of the Talk
  • MDPs and Bellman's curse of dimensionality.
  • RL: simultaneous learning and planning.
  • Explore avenues to speed up RL.
  • Illustrate prominent HRL methods.
  • Compare prominent HRL methods.
  • Discuss future research.
  • Summarise.

3
Decision Making
Slide courtesy Dan Weld
4
Personal Printerbot
  • States (S): loc, has-robot-printout, user-loc, has-user-printout, map
  • Actions (A): move-n, move-s, move-e, move-w, extend-arm, grab-page, release-pages
  • Reward (R): 20 if h-u-po, else -1
  • Goal (G): all states with h-u-po true.
  • Start state: a state with h-u-po false.
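As a concrete illustration, a minimal Python sketch of the reward and goal test described above; the dictionary-based state and the helper names are illustrative assumptions, not code from the talk.

ACTIONS = ["move-n", "move-s", "move-e", "move-w",
           "extend-arm", "grab-page", "release-pages"]

def reward(state):
    # state is assumed to be a dict with a boolean field "h-u-po" (has-user-printout):
    # +20 once the user has the printout, -1 per step otherwise.
    return 20 if state["h-u-po"] else -1

def is_goal(state):
    # Goal states: all states in which the user has the printout.
    return state["h-u-po"]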

5
Episodic Markov Decision Process
Episodic MDP = MDP with absorbing goals
  • ⟨S, A, P, R, G, s0⟩
  • S: set of environment states.
  • A: set of available actions.
  • P: probabilistic transition model, P(s'|s,a).
  • R: reward model, R(s).
  • G: absorbing goal states.
  • s0: start state.
  • γ: discount factor.

Markovian assumption; the discount factor γ < 1 keeps R bounded over an infinite horizon.
6
Goal of an Episodic MDP
  • Find a policy π (S → A) that
  • maximises expected discounted reward for a
  • fully observable Episodic MDP,
  • assuming the agent is allowed to execute for an indefinite horizon.

Fully observable: noise-free, complete-information percepts.
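Written out (a hedged restatement in standard MDP notation, not text shown on the slide), the objective is

    \pi^{*} = \arg\max_{\pi} \; \mathbb{E}\Big[ \textstyle\sum_{t \ge 0} \gamma^{t} R(s_t) \;\Big|\; s_0, \pi \Big]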
7
Solution of an Episodic MDP
  • Define V(s): optimal reward starting in state s.
  • Value Iteration: start with an estimate of V(s) and successively re-estimate it until it converges to a fixed point (sketch below).
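A minimal tabular value-iteration sketch in Python, under the Episodic MDP definitions of slide 5; the dictionary-based P and R and the function name are illustrative assumptions, not the talk's code.

def value_iteration(S, A, P, R, G, gamma=0.95, eps=1e-6):
    # S: list of states, A: list of actions,
    # P[(s, a)]: dict mapping next state s2 -> probability,
    # R[s]: reward in state s, G: set of absorbing goal states.
    V = {s: 0.0 for s in S}                     # initial estimate of V(s)
    while True:
        delta = 0.0
        for s in S:
            if s in G:                          # absorbing goals keep their reward
                new_v = R[s]
            else:                               # Bellman backup
                new_v = R[s] + gamma * max(
                    sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for a in A)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:                         # converged to a (near) fixed point
            return V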

8
Complexity of Value Iteration
  • Each iteration: polynomial in |S|.
  • Number of iterations: polynomial in |S|.
  • Overall: polynomial in |S|.
  • Polynomial in |S|: is that good enough?
  • |S| is exponential in the number of features in the domain.

Bellman's curse of dimensionality
9
The Outline of the Talk
  • MDPs and Bellman's curse of dimensionality.
  • RL: simultaneous learning and planning.
  • Explore avenues to speed up RL.
  • Illustrate prominent HRL methods.
  • Compare prominent HRL methods.
  • Discuss future research.
  • Summarise.

10
Learning
(Figure: data obtained from the environment.)
11
Decision Making while Learning
(Figure: the agent receives percepts/data from the environment and sends actions back.)
Known as Reinforcement Learning
12
Reinforcement Learning
  • Unknown transition model P and reward model R.
  • Learning component: estimate P and R from data observed in the environment.
  • Planning component: decide which actions to take to maximise reward.
  • Exploration vs. exploitation:
  • GLIE (Greedy in the Limit with Infinite Exploration).

13
Planning vs. MDP vs. RL
  • MDPs model system dynamics.
  • MDP algorithms solve the optimisation equations.
  • Planning problems are modelled as MDPs.
  • Planning algorithms speed up MDP algorithms.
  • RL is modelled over MDPs.
  • RL algorithms use the MDP equations as their basis.
  • RL algorithms speed up simultaneous planning and learning.

14
Exploration vs. Exploitation
  • Exploration: choose actions that visit new states in order to obtain more data for better learning.
  • Exploitation: choose actions that maximise the reward given the currently learnt model.
  • One solution: GLIE (Greedy in the Limit with Infinite Exploration).

15
Model Based Learning
  • First learn the model.
  • Then use MDP algorithms.
  • Very slow, and uses a lot of data.
  • Optimisations proposed: DYNA, Prioritised Sweeping, etc.
  • Overall: uses less data, but is comparatively slow.

16
Model Free Learning
  • Learn the policy without learning an explicit model.
  • Do not estimate P and R explicitly.
  • E.g. Temporal Difference Learning.
  • Very popular and fast, but requires a lot of data.

17
Learning
  • Model-based learning:
  • Learn the model, then plan.
  • Requires less data, more computation.
  • Model-free learning:
  • Plan without learning an explicit model.
  • Requires a lot of data, less computation.

18
Q-Learning
  • Instead of learning P and R, learn Q directly.
  • Q(s,a): optimal reward starting in s, if the first action is a and the optimal policy is followed thereafter.
  • Q directly defines the optimal policy.

The optimal policy picks, in each state, the action with the maximum Q value.
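In symbols (a standard restatement, not text from the slide):

    \pi^{*}(s) = \arg\max_{a} Q(s,a), \qquad V^{*}(s) = \max_{a} Q(s,a)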
19
Q-Learning
  • Given an experience tuple ⟨s, a, s', r⟩, update the estimate of Q(s,a).
  • Under suitable assumptions and GLIE exploration, Q-Learning converges to the optimal Q values.

The update blends the old estimate of the Q value with a new sample-based estimate:
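Since the slide's equation image is not reproduced in this transcript, here is the standard Q-learning update it refers to:

    Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \big( r + \gamma \max_{a'} Q(s',a') \big)

The first term is the old estimate, the bracketed term is the new sample-based estimate, and α is the learning rate.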
20
Semi-MDP: when actions take time.
  • The Semi-MDP equation (reconstructed below).
  • The Semi-MDP Q-Learning equation (reconstructed below),
  • where the experience tuple is ⟨s, a, s', r, N⟩ and
  • r = accumulated discounted reward while action a was executing.
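The two equations referred to above, reconstructed in their standard form (the slide's equation images are not in the transcript):

    Q(s,a) = \mathbb{E}\big[\, r + \gamma^{N} \max_{a'} Q(s',a') \,\big]                                  (Semi-MDP equation)
    Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \big( r + \gamma^{N} \max_{a'} Q(s',a') \big)           (Semi-MDP Q-Learning)

where r is the discounted reward accumulated over the N steps during which a executed.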

21
Printerbot
  • The Paul G. Allen Center has 85,000 sq ft of space.
  • Each floor: 85,000 / 7 ≈ 12,000 sq ft.
  • Discretise a floor's locations into 12,000 parts.
  • State space (without map): 2 × 2 × 12,000 × 12,000 ≈ 5.8 × 10^8 --- very large!
  • How do humans do the
    decision making?

22
The Outline of the Talk
  • MDPs and Bellman's curse of dimensionality.
  • RL: simultaneous learning and planning.
  • Explore avenues to speed up RL.
  • Illustrate prominent HRL methods.
  • Compare prominent HRL methods.
  • Discuss future research.
  • Summarise.

23
1. The Mathematical Perspective: A Structure Paradigm
  • S: Relational MDP
  • A: Concurrent MDP
  • P: Dynamic Bayes Nets
  • R: Continuous-state MDP
  • G: Conjunction of state variables
  • V: Algebraic Decision Diagrams
  • π: Decision List (RMDP)

24
2. Modular Decision Making
25
2. Modular Decision Making
  • Go out of room
  • Walk in hallway
  • Go in the room

26
2. Modular Decision Making
  • Humans plan modularly, at different granularities of understanding.
  • Going out of one room is similar to going out of another room.
  • Navigation steps do not depend on whether we have the printout or not.

27
3. Background Knowledge
  • Classical planners using additional control knowledge can scale up to larger problems
  • (e.g. HTN planning, TLPlan).
  • What forms of control knowledge can we provide to our Printerbot?
  • First pick up the printouts, then deliver them.
  • Navigation: consider rooms and the hallway separately, etc.

28
A mechanism that exploits all three avenues: Hierarchies
  1. A way to add a special (hierarchical) structure on different parameters of an MDP.
  2. Draws on the intuition and reasoning in human decision making.
  3. A way to provide additional control knowledge to the system.

29
The Outline of the Talk
  • MDPs and Bellman's curse of dimensionality.
  • RL: simultaneous learning and planning.
  • Explore avenues to speed up RL.
  • Illustrate prominent HRL methods.
  • Compare prominent HRL methods.
  • Discuss future research.
  • Summarise.

30
Hierarchy
  • Hierarchy of: behaviours, skills, modules, subtasks, macro-actions, etc.
  • picking the pages
  • collision avoidance
  • fetch-pages phase
  • walk in hallway
  • HRL = RL with temporally extended actions

31
Hierarchical Algorithms: Gating Mechanism
  • Hierarchical learning:
  • Learning the gating function
  • Learning the individual behaviours
  • Learning both

(Figure: a gate g selects among behaviours b_i; this can be a multi-level hierarchy.)
32
Option: Move-e until end of hallway
  • Start: any state in the hallway.
  • Execute the policy as shown.
  • Terminate when s is the end of the hallway.

33
Options [Sutton, Precup, Singh '99]
  • An option is a well-defined behaviour.
  • o = ⟨I_o, π_o, β_o⟩
  • I_o: set of states (I_o ⊆ S) in which o can be initiated.
  • π_o(s): policy (S → A) followed while o is executing.
  • β_o(s): probability that o terminates in s.

Can be a policy over lower-level options.
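A small Python sketch of the option triple; the class and field names are illustrative assumptions, not from the talk.

from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    init_set: Set[Any]                  # I_o: states in which the option can be initiated
    policy: Callable[[Any], Any]        # pi_o(s): action (or lower-level option) to execute
    term_prob: Callable[[Any], float]   # beta_o(s): probability of terminating in s

# The "move east until end of hallway" option from the previous slide, sketched:
move_east = Option(
    init_set=set(),                                # placeholder: the hallway states
    policy=lambda s: "move-e",                     # always move east
    term_prob=lambda s: 1.0 if s == "end-of-hallway" else 0.0,
)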
34
Learning
  • An option is a temporally extended action with a well-defined policy.
  • The set of options (O) replaces the set of actions (A).
  • Learning occurs outside the options.
  • Learning over options: Semi-MDP Q-Learning (sketch below).
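A hedged sketch of the Semi-MDP Q-learning update over options, following slide 20; the tabular dictionary Q and the function name are illustrative assumptions.

def smdp_q_update(Q, s, o, s2, r, N, options, alpha=0.1, gamma=0.95):
    # Experience tuple (s, o, s2, r, N): option o was started in s and terminated in s2
    # after N steps; r is the discounted reward accumulated while o executed.
    best_next = max(Q.get((s2, o2), 0.0) for o2 in options)
    target = r + (gamma ** N) * best_next
    Q[(s, o)] = (1 - alpha) * Q.get((s, o), 0.0) + alpha * target
    return Q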

35
Machine: Move-e with collision avoidance
(Figure: finite-state machines M1 and M2; nodes include Move-e, Choose, Call M1, Call M2, and Return, with transitions triggered by "Obstacle" and "End of hallway".)
36
Hierarchies of Abstract Machines [Parr, Russell '97]
  • A machine is a partial policy represented by a Finite State Automaton.
  • Node types:
  • Execute a ground action.
  • Call a machine as a subroutine.
  • Choose the next node.
  • Return to the calling machine.

37
Hierarchies of Abstract Machines
  • A machine is a partial policy represented by a Finite State Automaton.
  • Node types:
  • Execute a ground action.
  • Call a machine as a subroutine.
  • Choose the next node.
  • Return to the calling machine.

38
Learning
  • Learning occurs within machines, as machines are only partially defined.
  • Flatten all machines out and consider states ⟨s, m⟩, where s is a world state and m a machine node: an MDP.
  • reduce(S ∘ M): consider only the states whose machine node is a choice node: a Semi-MDP.
  • Learning ≈ Semi-MDP Q-Learning (see below).
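The resulting update, written in the Semi-MDP form above over joint choice-point states [s, m] (a hedged restatement, not text from the slide):

    Q([s,m], a) \leftarrow (1-\alpha)\, Q([s,m], a) + \alpha \big( r_c + \gamma^{\tau} \max_{a'} Q([s',m'], a') \big)

where [s,m] and [s',m'] are successive choice points, r_c is the discounted reward accumulated between them, and τ is the number of elapsed steps.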

39
Task Hierarchy: MAXQ Decomposition [Dietterich '00]
(Figure: task graph. Root; subtasks Deliver, Fetch, Take, Give, Navigate(loc); primitive actions Extend-arm, Grab, Release, Move-e, Move-w, Move-s, Move-n. Children of a task are unordered.)
40
MAXQ Decomposition
  • Augment the state s by adding the subtask i: ⟨s, i⟩.
  • Define C(s,i,j) as the reward received in i after j finishes.
  • Q(s, Fetch, Navigate(prr)) = V(s, Navigate(prr)) + C(s, Fetch, Navigate(prr))
  • Express V in terms of C.
  • Learn C instead of learning Q (equations below).
The first term is the reward received while navigating; the second is the reward received after navigation. Observe the context-free nature of the Q value.
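Written out (a hedged restatement of the decomposition, consistent with the Fetch/Navigate example above and with Dietterich '00):

    Q(s,i,j) = V(s,j) + C(s,i,j)
    V(s,i) = \begin{cases} \max_{j} Q(s,i,j) & \text{if } i \text{ is a composite task} \\ \mathbb{E}[\, r \mid s, i \,] & \text{if } i \text{ is a primitive action} \end{cases}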
41
The Outline of the Talk
  • MDPs and Bellman's curse of dimensionality.
  • RL: simultaneous learning and planning.
  • Explore avenues to speed up RL.
  • Illustrate prominent HRL methods.
  • Compare prominent HRL methods.
  • Discuss future research.
  • Summarise.

42
1. State Abstraction
  • Abstract state: a state with fewer state variables; different world states map to the same abstract state.
  • If we can drop some state variables, we can reduce the learning time considerably!
  • We may use different abstract states for different macro-actions.

43
State Abstraction in MAXQ
  • Relevance: only some variables are relevant for the task.
  • Fetch: user-loc is irrelevant.
  • Navigate(printer-room): h-r-po, h-u-po, user-loc are irrelevant.
  • Fewer parameters for V at the lower levels.
  • Funnelling: a subtask maps many states to a smaller set of states.
  • Fetch: all states map to h-r-po = true, loc = printer-room.
  • Fewer parameters for C at the higher levels.

44
State Abstraction in Options, HAM
  • Options: learning is required only in states that are terminal states for some option.
  • HAM: the original work has no abstraction.
  • Extension: three-way value decomposition
  • Q(s,m,n) = V(s,n) + C(s,m,n) + Cex(s,m)
  • Similar abstractions are employed.

[Andre, Russell '02]
45
2. Optimality
Hierarchical Optimality vs. Recursive
Optimality
46
Optimality
  • Options: hierarchical
  • Use (A ∪ O): global
  • Interrupt options
  • HAM: hierarchical
  • MAXQ: recursive
  • Interrupt subtasks
  • Use pseudo-rewards
  • Iterate!

Equations can be defined for both optimalities, but the advantage of using macro-actions may be lost.
47
3. Language Expressiveness
  • Options:
  • Can only input a complete policy.
  • HAM:
  • Can input a complete policy.
  • Can input a task hierarchy.
  • Can represent the amount of effort.
  • Later extended to partial programs.
  • MAXQ:
  • Cannot input a policy (full or partial).

48
4. Knowledge Requirements
  • Options:
  • Require complete specification of the policy.
  • One could learn option policies given subtasks.
  • HAM:
  • Medium requirements.
  • MAXQ:
  • Minimal requirements.

49
5. Advances to the Models
  • Options: concurrency.
  • HAM: richer representations, concurrency.
  • MAXQ: continuous time, state, and actions; multi-agent settings; average reward.
  • In general, more researchers have followed MAXQ:
  • Less input knowledge
  • Value decomposition

50
6. Structure Paradigm
  • S: Options, MAXQ
  • A: All
  • P: None
  • R: MAXQ
  • G: All
  • V: MAXQ
  • π: All

51
The Outline of the Talk
  • MDPs and Bellman's curse of dimensionality.
  • RL: simultaneous learning and planning.
  • Explore avenues to speed up RL.
  • Illustrate prominent HRL methods.
  • Compare prominent HRL methods.
  • Discuss future research.
  • Summarise.

52
Directions for Future Research
  • Bidirectional state abstractions
  • Hierarchies over other RL research:
  • Model-based methods
  • Function approximators
  • Probabilistic planning
  • Hierarchical P and hierarchical R
  • Imitation learning

53
Directions for Future Research
  • Theory:
  • Bounds (goodness of a hierarchy)
  • Non-asymptotic analysis
  • Automated discovery:
  • Discovery of hierarchies
  • Discovery of state abstractions
  • Apply

54
Applications
  • Toy Robot
  • Flight Simulator
  • AGV Scheduling
  • Keepaway soccer

Images courtesy various sources
55
Thinking Big
  • "... consider maze domains. Reinforcement
    learning researchers, including this author, have
    spent countless years of research solving a
    solved problem! Navigating in grid worlds, even
    with stochastic dynamics, has been far from
    rocket science since the advent of search
    techniques such as A.
    -- David Andre
  • Use planners, theorem provers, etc. as components
    in big hierarchical solver.

56
The Outline of the Talk
  • MDPs and Bellman's curse of dimensionality.
  • RL: simultaneous learning and planning.
  • Explore avenues to speed up RL.
  • Illustrate prominent HRL methods.
  • Compare prominent HRL methods.
  • Discuss future research.
  • Summarise.

57
How to choose an appropriate hierarchy
  • Look at the available domain knowledge.
  • If some behaviours are completely specified: Options.
  • If some behaviours are partially specified: HAM.
  • If little domain knowledge is available: MAXQ.
  • We can use all three to specify different behaviours in tandem.

58
The Structure Paradigm
  • An organised way to view optimisations.
  • Assists in identifying unexploited avenues for speedup.

59
Main ideas in HRL community
  • Hierarchies speed up learning.
  • Value function decomposition.
  • State abstractions.
  • Greedy non-hierarchical execution.
  • Context-free learning and pseudo-rewards.
  • Policy improvement by re-estimation and re-learning.