Title: Hierarchical Reinforcement Learning
Slide 1: Hierarchical Reinforcement Learning
A Survey and Comparison of HRL Techniques
Slide 2: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 3: Decision Making
Slide courtesy Dan Weld
Slide 4: Personal Printerbot
- States (S): loc, has-robot-printout, user-loc, has-user-printout, map.
- Actions (A): move-n, move-s, move-e, move-w, extend-arm, grab-page, release-pages.
- Reward (R): +20 if h-u-po, else -1.
- Goal (G): all states with h-u-po true.
- Start state: a state with h-u-po false.
Slide 5: Episodic Markov Decision Process
Episodic MDP: an MDP with absorbing goal states.
- ⟨S, A, P, R, G, s0⟩
- S: set of environment states.
- A: set of available actions.
- P: transition probability model, P(s′ | s, a).
- R: reward model, R(s).
- G: absorbing goal states.
- s0: start state.
- γ: discount factor.
Markovian assumption; γ bounds the accumulated reward over an infinite horizon. (A data-structure sketch follows below.)
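For concreteness, here is a minimal Python sketch of this tuple, with the Printerbot in mind; the class name, the dictionary-based transition model, and the default γ are illustrative assumptions rather than anything from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

State = Tuple          # e.g. (loc, has_robot_printout, user_loc, has_user_printout)
Action = str           # e.g. "move-e", "grab-page", "release-pages"

@dataclass
class EpisodicMDP:
    """The tuple <S, A, P, R, G, s0> with a discount factor and absorbing goals."""
    states: List[State]
    actions: List[Action]
    P: Dict[Tuple[State, Action], List[Tuple[State, float]]]  # (s, a) -> [(s', prob)]
    R: Callable[[State], float]   # reward model R(s), e.g. +20 if h-u-po else -1
    goals: Set[State]             # absorbing goal states G
    s0: State                     # start state
    gamma: float = 0.95           # discount; gamma < 1 bounds infinite-horizon reward
```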
Slide 6: Goal of an Episodic MDP
- Find a policy π: S → A which
- maximises the expected discounted reward
- for a fully observable episodic MDP,
- when the agent is allowed to execute for an indefinite horizon.
Noise-free, complete-information perceptors.
Slide 7: Solution of an Episodic MDP
- Define V(s): the optimal expected reward starting in state s.
- Value Iteration: start with an estimate of V(s) and successively re-estimate it until it converges to a fixed point (see the sketch below).
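The fixed point being computed satisfies V(s) = R(s) + γ max_a Σ_{s′} P(s′|s,a) V(s′). A minimal sketch of the loop, assuming the EpisodicMDP structure sketched earlier (the tolerance-based stopping test is an illustrative choice):

```python
def value_iteration(mdp, tol=1e-6):
    """Start from V(s) = 0 and repeatedly apply the Bellman backup
    until the value estimates converge to a fixed point."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            if s in mdp.goals:
                continue  # absorbing goal: value stays 0, no further reward
            new_v = mdp.R(s) + mdp.gamma * max(
                sum(p * V[s2] for s2, p in mdp.P[(s, a)])
                for a in mdp.actions
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```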
Slide 8: Complexity of Value Iteration
- Each iteration: polynomial in |S|.
- Number of iterations: polynomial in |S|.
- Overall: polynomial in |S|.
- Polynomial in |S|: is that good enough?
- |S| is exponential in the number of features in the domain.
Bellman's curse of dimensionality.
Slide 9: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 10: Learning
[Diagram: a learner receiving Data from the Environment.]
Slide 11: Decision Making while Learning
[Diagram: the agent receives percepts (data) from the Environment and sends Actions back.]
Known as Reinforcement Learning
Slide 12: Reinforcement Learning
- Unknown transition model P and reward model R.
- Learning component: estimate P and R from data observed in the environment.
- Planning component: decide which actions to take to maximise reward.
- Exploration vs. exploitation:
- GLIE (Greedy in the Limit with Infinite Exploration).
Slide 13: Planning vs. MDP vs. RL
- MDPs model system dynamics.
- MDP algorithms solve the optimisation equations.
- Planning problems are modelled as MDPs.
- Planning algorithms speed up MDP algorithms.
- RL is modelled over MDPs.
- RL algorithms use the MDP equations as a basis.
- RL algorithms speed up algorithms for
simultaneous planning and learning.
Slide 14: Exploration vs. Exploitation
- Exploration: choose actions that visit new states, in order to obtain more data for better learning.
- Exploitation: choose actions that maximise reward under the currently learnt model.
- One solution: GLIE, Greedy in the Limit with Infinite Exploration (see the sketch below).
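One common way to realise GLIE (an illustrative choice, not prescribed by the talk) is ε-greedy selection with ε = 1/N(s), so that every action is tried infinitely often while the policy becomes greedy in the limit:

```python
import random
from typing import Dict, Hashable, List, Tuple

def glie_epsilon_greedy(Q: Dict[Tuple[Hashable, str], float],
                        s: Hashable,
                        actions: List[str],
                        visits: Dict[Hashable, int]) -> str:
    """Epsilon-greedy with epsilon = 1 / N(s): every action is tried infinitely
    often, yet the policy becomes greedy in the limit (GLIE)."""
    visits[s] = visits.get(s, 0) + 1
    epsilon = 1.0 / visits[s]
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))      # exploit
```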
Slide 15: Model-Based Learning
- First learn the model.
- Then use MDP algorithms.
- Very slow, and uses a lot of data.
- Proposed optimisations: DYNA, Prioritised Sweeping, etc.
- Uses less data, comparatively slow.
Slide 16: Model-Free Learning
- Learn the policy without learning an explicit model.
- Do not estimate P and R explicitly.
- E.g. Temporal Difference Learning.
- Very popular and fast, but requires a lot of data.
Slide 17: Learning
- Model-based learning
- Learn the model, and do planning
- Requires less data, more computation
- Model-free learning
- Plan without learning an explicit model
- Requires a lot of data, less computation
Slide 18: Q-Learning
- Instead of learning P and R, learn Q directly.
- Q(s,a): the optimal reward starting in s, if the first action is a and the optimal policy is followed afterwards.
- Q directly defines the optimal policy: in each state, the optimal action is the one with the maximum Q-value.
Slide 19: Q-Learning
- Given an experience tuple ⟨s, a, s′, r⟩, update
- Q(s,a) ← (1 - α) Q(s,a) + α [r + γ max_a′ Q(s′,a′)]
- mixing the old estimate of the Q-value with the new sample estimate (a code sketch follows below).
- Under suitable assumptions and GLIE exploration, Q-Learning converges to the optimal Q-values.
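A minimal tabular sketch of that backup; the learning rate α, the discount γ, and the zero default for unseen Q-values are illustrative assumptions:

```python
from typing import Dict, Hashable, List, Tuple

def q_learning_update(Q: Dict[Tuple[Hashable, str], float],
                      s: Hashable, a: str, s_next: Hashable, r: float,
                      actions: List[str],
                      alpha: float = 0.1, gamma: float = 0.95) -> float:
    """One backup for the experience tuple <s, a, s', r>."""
    old_estimate = Q.get((s, a), 0.0)
    new_estimate = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * old_estimate + alpha * new_estimate
    return Q[(s, a)]
```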
Slide 20: Semi-MDP: When Actions Take Time
- The Semi-MDP Q-Learning update, for an experience tuple ⟨s, a, s′, r, N⟩:
- Q(s,a) ← (1 - α) Q(s,a) + α [r + γ^N max_a′ Q(s′,a′)]
- where N is the number of time steps a ran, and r is the discounted reward accumulated while action a was executing (a code sketch follows below).
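Compared with the one-step backup above, only two things change: the target discounts by γ^N, and r is the reward already accumulated during the macro-action. A minimal sketch under the same assumptions:

```python
from typing import Dict, Hashable, List, Tuple

def smdp_q_update(Q: Dict[Tuple[Hashable, str], float],
                  s: Hashable, a: str, s_next: Hashable,
                  r: float, N: int,
                  actions: List[str],
                  alpha: float = 0.1, gamma: float = 0.95) -> float:
    """Semi-MDP backup for <s, a, s', r, N>: r is the discounted reward
    accumulated during the N steps the temporally extended action a ran."""
    old_estimate = Q.get((s, a), 0.0)
    new_estimate = r + (gamma ** N) * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * old_estimate + alpha * new_estimate
    return Q[(s, a)]
```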
Slide 21: Printerbot
- The Paul G. Allen Center has 85,000 sq ft of space.
- Each floor: 85,000 / 7 ≈ 12,000 sq ft.
- Discretise location on a floor into 12,000 parts.
- State space (without the map): 2 × 2 × 12,000 × 12,000 ≈ 5.8 × 10^8 states: very large!
- How do humans do the decision making?
Slide 22: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 23: 1. The Mathematical Perspective: A Structure Paradigm
- S: Relational MDP
- A: Concurrent MDP
- P: Dynamic Bayes Nets
- R: Continuous-state MDP
- G: Conjunction of state variables
- V: Algebraic Decision Diagrams
- π: Decision List (RMDP)
Slide 24: 2. Modular Decision Making
Slide 25: 2. Modular Decision Making
- Go out of the room.
- Walk in the hallway.
- Go into the room.
Slide 26: 2. Modular Decision Making
- Humans plan modularly, at different granularities of understanding.
- Going out of one room is similar to going out of another room.
- Navigation steps do not depend on whether we have the printout or not.
Slide 27: 3. Background Knowledge
- Classical planners using additional control knowledge can scale up to larger problems (e.g. HTN planning, TLPlan).
- What forms of control knowledge can we provide to our Printerbot?
- First pick up the printouts, then deliver them.
- Navigation: consider rooms and the hallway separately, etc.
Slide 28: A Mechanism That Exploits All Three Avenues: Hierarchies
- A way to add a special (hierarchical) structure over different parameters of an MDP.
- Draws on the intuition and reasoning in human decision making.
- A way to provide additional control knowledge to the system.
Slide 29: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 30: Hierarchy
- A hierarchy of behaviours, skills, modules, subtasks, macro-actions, etc.:
- picking up the pages,
- collision avoidance,
- the fetch-pages phase,
- walking in the hallway.
- HRL = RL with temporally extended actions.
Slide 31: Hierarchical Algorithms: The Gating Mechanism
- Hierarchical learning:
- learning the gating function,
- learning the individual behaviours,
- or learning both.
[Diagram: g is a gate, each b_i is a behaviour; this can be a multi-level hierarchy.]
Slide 32: Option: Move-E Until the End of the Hallway
- Start: any state in the hallway.
- Execute the policy as shown.
- Terminate when s is the end of the hallway.
Slide 33: Options [Sutton, Precup & Singh '99]
- An option is a well-defined behaviour: o = ⟨I_o, π_o, β_o⟩.
- I_o: the set of states (I_o ⊆ S) in which o can be initiated.
- π_o(s): the policy (S → A) followed while o is executing.
- β_o(s): the probability that o terminates in s.
- π_o can itself be a policy over lower-level options.
Slide 34: Learning
- An option is a temporally extended action with a well-defined policy.
- The set of options (O) replaces the set of actions (A).
- Learning occurs outside the options.
- Learning over options = Semi-MDP Q-Learning (see the sketch below).
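A minimal sketch of the option tuple and of executing an option to termination, returning exactly the ⟨s′, r, N⟩ quantities the Semi-MDP backup needs; the env.step(s, a) interface is a hypothetical stand-in for the environment:

```python
import random
from dataclasses import dataclass
from typing import Callable, Set, Tuple

@dataclass
class Option:
    """An option o = <I_o, pi_o, beta_o>: a temporally extended action."""
    initiation: Set                          # I_o: states where o may be started
    policy: Callable[[object], str]          # pi_o(s) -> action while o executes
    termination: Callable[[object], float]   # beta_o(s) -> prob. of stopping in s

def run_option(env, o: Option, s, gamma: float = 0.95) -> Tuple[object, float, int]:
    """Execute o until it terminates; return (s', accumulated discounted reward, N),
    the quantities consumed by Semi-MDP Q-learning over options."""
    assert s in o.initiation
    total_r, discount, N = 0.0, 1.0, 0
    while True:
        a = o.policy(s)
        s, r = env.step(s, a)            # hypothetical environment interface
        total_r += discount * r
        discount *= gamma
        N += 1
        if random.random() < o.termination(s):
            return s, total_r, N
```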
Slide 35: Machine: Move-E with Collision Avoidance
[Finite-state machine diagram with nodes Choose, Move-E, Call M1, Call M2, Obstacle, End of hallway, and Return; M1 and M2 are the sub-machines called.]
Slide 36: Hierarchies of Abstract Machines [Parr & Russell '97]
- A machine is a partial policy represented by a Finite State Automaton.
- Node types:
- execute a ground action,
- call a machine as a subroutine,
- choose the next node,
- return to the calling machine.
Slide 38: Learning
- Learning occurs within machines, as machines are only partially defined.
- Flatten all machines out and consider joint states (s, m), where s is a world state and m a machine node: this gives an MDP.
- reduce(S ∘ M): keep only the states whose machine node is a choice node: this gives a Semi-MDP.
- Learning ≈ Semi-MDP Q-Learning (a sketch of the machine structure follows below).
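A minimal sketch of a machine as a finite-state automaton over these node types; the class layout is an illustrative assumption, not Parr and Russell's formal notation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MachineNode:
    """One node of an abstract machine (a partial policy as an FSA)."""
    kind: str                                   # "action" | "call" | "choice" | "return"
    action: Optional[str] = None                # ground action, when kind == "action"
    submachine: Optional["Machine"] = None      # machine called as a subroutine, when kind == "call"
    successors: List["MachineNode"] = field(default_factory=list)  # candidate next nodes

@dataclass
class Machine:
    start: MachineNode

# Learning happens only at "choice" nodes: flatten machines against world states
# into joint states (s, m), keep the pairs where m is a choice node, and run
# Semi-MDP Q-learning over the choices available there.
```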
Slide 39: Task Hierarchy: MAXQ Decomposition [Dietterich '00]
[Task-hierarchy diagram: Root decomposes into Fetch and Deliver; lower-level subtasks include Take, Give, Navigate(loc), Extend-arm, Grab and Release, with the primitive moves Move-E, Move-W, Move-S, Move-N under Navigate(loc). Children of a task are unordered.]
Slide 40: MAXQ Decomposition
- Augment the state s with the subtask i: (s, i).
- Define C(s,i,j) as the reward received inside i after subtask j finishes.
- Q(s, Fetch, Navigate(pr-room)) = V(s, Navigate(pr-room)) + C(s, Fetch, Navigate(pr-room))
- i.e. the reward received while navigating, plus the reward received after the navigation.
- Express V in terms of C, and learn C instead of learning Q (see the sketch below).
- Observe the context-free nature of the Q-value.
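A minimal sketch of the decomposed Q-value and a simplified completion-function update in the spirit of MAXQ-Q (pseudo-rewards omitted); the dictionary defaults and learning rate are illustrative assumptions:

```python
from typing import Dict, Hashable, Tuple

def maxq_q(V: Dict[Tuple[Hashable, str], float],
           C: Dict[Tuple[Hashable, str, str], float],
           s: Hashable, parent: str, child: str) -> float:
    """Q(s, parent, child) = V(s, child) + C(s, parent, child):
    reward earned while the child subtask runs, plus the completion value
    earned inside the parent after the child finishes."""
    return V.get((s, child), 0.0) + C.get((s, parent, child), 0.0)

def maxq_c_update(C: Dict[Tuple[Hashable, str, str], float],
                  s: Hashable, parent: str, child: str,
                  s_next: Hashable, N: int, best_next_q: float,
                  alpha: float = 0.1, gamma: float = 0.95) -> None:
    """Update C after child ran for N steps from s and ended in s_next;
    best_next_q = max over the parent's children a' of Q(s_next, parent, a')."""
    old = C.get((s, parent, child), 0.0)
    C[(s, parent, child)] = (1 - alpha) * old + alpha * (gamma ** N) * best_next_q
```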
Slide 41: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 42: 1. State Abstraction
- Abstract state: a state with fewer state variables, so that different world states map to the same abstract state.
- If we can drop some state variables, we can reduce the learning time considerably!
- We may use different abstract states for different macro-actions.
Slide 43: State Abstraction in MAXQ
- Relevance: only some variables are relevant for a task.
- Fetch: user-loc is irrelevant.
- Navigate(printer-room): h-r-po, h-u-po, user-loc are irrelevant.
- Fewer parameters for the V of the lower levels.
- Funnelling: a subtask maps many states into a smaller set of states.
- Fetch: all states map to h-r-po = true, loc = printer-room.
- Fewer parameters for the C of the higher levels. (A projection sketch follows below.)
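A minimal sketch of the relevance idea: project the full state onto only the variables a subtask cares about, so its value tables have fewer parameters. The relevance sets below are illustrative guesses for the Printerbot, not the exact abstractions from the talk:

```python
from typing import Dict, Iterable, Tuple

def abstract_state(s: Dict[str, object], relevant: Iterable[str]) -> Tuple:
    """Project a full state (a dict of state variables) onto the variables
    relevant for one subtask; irrelevant variables are dropped."""
    return tuple(sorted((var, s[var]) for var in relevant))

# Illustrative (hypothetical) relevance sets:
RELEVANT = {
    "Navigate": {"loc"},                                   # ignores h-r-po, h-u-po, user-loc
    "Fetch":    {"loc", "h-r-po"},
    "Deliver":  {"loc", "h-r-po", "h-u-po", "user-loc"},
}

s = {"loc": "hallway-3", "h-r-po": True, "h-u-po": False, "user-loc": "room-401"}
print(abstract_state(s, RELEVANT["Navigate"]))   # (('loc', 'hallway-3'),)
```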
Slide 44: State Abstraction in Options and HAM
- Options: learning is required only in states that are terminal states for some option.
- HAM: the original work has no abstraction.
- Extension: three-way value decomposition [Andre & Russell '02],
- Q(s,m,n) = V(s,n) + C(s,m,n) + Cex(s,m),
- with similar abstractions then employed.
Slide 45: 2. Optimality
Hierarchical Optimality vs. Recursive Optimality
Slide 46: Optimality
- Options: hierarchical optimality.
- Use (A ∪ O): global optimality.
- Interrupt options.
- HAM: hierarchical optimality.
- MAXQ: recursive optimality.
- Interrupt subtasks.
- Use pseudo-rewards.
- Iterate!
Equations can be defined for both optimalities, but the advantage of using macro-actions may be lost.
Slide 47: 3. Language Expressiveness
- Options:
- can only input a complete policy.
- HAM:
- can input a complete policy,
- can input a task hierarchy,
- can represent the amount of effort,
- later extended to partial programs.
- MAXQ:
- cannot input a policy (full or partial).
Slide 48: 4. Knowledge Requirements
- Options:
- requires a complete specification of the policies,
- though one could learn option policies given subtasks.
- HAM:
- medium requirements.
- MAXQ:
- minimal requirements.
Slide 49: 5. Advanced Models
- Options: concurrency.
- HAM: richer representations, concurrency.
- MAXQ: continuous time, states and actions; multi-agent settings; average reward.
- In general, more researchers have followed MAXQ:
- less input knowledge,
- value decomposition.
Slide 50: 6. Structure Paradigm
- S: Options, MAXQ
- A: all
- P: none
- R: MAXQ
- G: all
- V: MAXQ
- π: all
Slide 51: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 52: Directions for Future Research
- Bidirectional State Abstractions
- Hierarchies over other RL research
- Model-based methods
- Function Approximators
- Probabilistic Planning
- Hierarchical P and Hierarchical R
- Imitation Learning
Slide 53: Directions for Future Research
- Theory
- Bounds (goodness of hierarchy)
- Non-asymptotic analysis
- Automated Discovery
- Discovery of Hierarchies
- Discovery of State Abstraction
- Apply
Slide 54: Applications
- Toy Robot
- Flight Simulator
- AGV Scheduling
- Keepaway soccer
Images courtesy various sources
Slide 55: Thinking Big
- "... consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*." -- David Andre
- Use planners, theorem provers, etc. as components in a big hierarchical solver.
Slide 56: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 57: How to Choose an Appropriate Hierarchy
- Look at the available domain knowledge:
- if some behaviours are completely specified: Options;
- if some behaviours are partially specified: HAM;
- if little domain knowledge is available: MAXQ.
- We can use all three to specify different behaviours in tandem.
Slide 58: The Structure Paradigm
- An organised way to view optimisations.
- Assists in figuring out unexploited avenues for speedup.
Slide 59: Main Ideas in the HRL Community
- Hierarchies speed up learning.
- Value function decomposition.
- State abstractions.
- Greedy non-hierarchical execution.
- Context-free learning and pseudo-rewards.
- Policy improvement by re-estimation and re-learning.