Title: Hierarchical Reinforcement Learning
Slide 1: Hierarchical Reinforcement Learning
A Survey and Comparison of HRL Techniques
Slide 2: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 3: Decision Making
Slide courtesy Dan Weld
Slide 4: Personal Printerbot
- States (S): loc, has-robot-printout, user-loc, has-user-printout, map.
- Actions (A): move-n, move-s, move-e, move-w, extend-arm, grab-page, release-pages.
- Reward (R): +20 if h-u-po, else -1.
- Goal (G): all states with h-u-po true.
- Start state: a state with h-u-po false.
Slide 5: Episodic Markov Decision Process
Episodic MDP: an MDP with absorbing goal states.
- ⟨S, A, P, R, G, s0⟩
- S: set of environment states.
- A: set of available actions.
- P: transition probability model, P(s′ | s, a).
- R: reward model, R(s).
- G: absorbing goal states.
- s0: start state.
- γ: discount factor.
Markovian assumption; γ bounds the accumulated reward over an infinite horizon. (A data-structure sketch follows below.)
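For concreteness, here is a minimal Python sketch of this tuple, with the Printerbot in mind; the class name, the dictionary-based transition model, and the default γ are illustrative assumptions rather than anything from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

State = Tuple          # e.g. (loc, has_robot_printout, user_loc, has_user_printout)
Action = str           # e.g. "move-e", "grab-page", "release-pages"

@dataclass
class EpisodicMDP:
    """The tuple <S, A, P, R, G, s0> with a discount factor and absorbing goals."""
    states: List[State]
    actions: List[Action]
    P: Dict[Tuple[State, Action], List[Tuple[State, float]]]  # (s, a) -> [(s', prob)]
    R: Callable[[State], float]   # reward model R(s), e.g. +20 if h-u-po else -1
    goals: Set[State]             # absorbing goal states G
    s0: State                     # start state
    gamma: float = 0.95           # discount; gamma < 1 bounds infinite-horizon reward
```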
Slide 6: Goal of an Episodic MDP
- Find a policy π: S → A which
- maximises the expected discounted reward
- for a fully observable episodic MDP,
- when the agent is allowed to execute for an indefinite horizon.
Noise-free, complete-information perceptors.
Slide 7: Solution of an Episodic MDP
- Define V(s): the optimal expected reward starting in state s.
- Value Iteration: start with an estimate of V(s) and successively re-estimate it until it converges to a fixed point (see the sketch below).
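The fixed point being computed satisfies V(s) = R(s) + γ max_a Σ_{s′} P(s′|s,a) V(s′). A minimal sketch of the loop, assuming the EpisodicMDP structure sketched earlier (the tolerance-based stopping test is an illustrative choice):

```python
def value_iteration(mdp, tol=1e-6):
    """Start from V(s) = 0 and repeatedly apply the Bellman backup
    until the value estimates converge to a fixed point."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            if s in mdp.goals:
                continue  # absorbing goal: value stays 0, no further reward
            new_v = mdp.R(s) + mdp.gamma * max(
                sum(p * V[s2] for s2, p in mdp.P[(s, a)])
                for a in mdp.actions
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```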
Slide 8: Complexity of Value Iteration
- Each iteration: polynomial in |S|.
- Number of iterations: polynomial in |S|.
- Overall: polynomial in |S|.
- Polynomial in |S|: is that good enough?
- |S| is exponential in the number of features in the domain.
Bellman's curse of dimensionality.
Slide 9: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 10: Learning
[Diagram: a learner receiving Data from the Environment.]
Slide 11: Decision Making while Learning
[Diagram: the agent receives percepts (data) from the Environment and sends Actions back.]
Known as Reinforcement Learning
Slide 12: Reinforcement Learning
- Unknown transition model P and reward model R.
- Learning component: estimate P and R from data observed in the environment.
- Planning component: decide which actions to take to maximise reward.
- Exploration vs. exploitation:
- GLIE (Greedy in the Limit with Infinite Exploration).
Slide 13: Planning vs. MDP vs. RL
- MDPs model system dynamics.
- MDP algorithms solve the optimisation equations.
- Planning problems are modelled as MDPs.
- Planning algorithms speed up MDP algorithms.
- RL is modelled over MDPs.
- RL algorithms use the MDP equations as a basis.
- RL algorithms speed up algorithms for
simultaneous planning and learning.
Slide 14: Exploration vs. Exploitation
- Exploration: choose actions that visit new states, in order to obtain more data for better learning.
- Exploitation: choose actions that maximise reward under the currently learnt model.
- One solution: GLIE, Greedy in the Limit with Infinite Exploration (see the sketch below).
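One common way to realise GLIE (an illustrative choice, not prescribed by the talk) is ε-greedy selection with ε = 1/N(s), so that every action is tried infinitely often while the policy becomes greedy in the limit:

```python
import random
from typing import Dict, Hashable, List, Tuple

def glie_epsilon_greedy(Q: Dict[Tuple[Hashable, str], float],
                        s: Hashable,
                        actions: List[str],
                        visits: Dict[Hashable, int]) -> str:
    """Epsilon-greedy with epsilon = 1 / N(s): every action is tried infinitely
    often, yet the policy becomes greedy in the limit (GLIE)."""
    visits[s] = visits.get(s, 0) + 1
    epsilon = 1.0 / visits[s]
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))      # exploit
```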
Slide 15: Model-Based Learning
- First learn the model.
- Then use MDP algorithms.
- Very slow, and uses a lot of data.
- Proposed optimisations: DYNA, Prioritised Sweeping, etc.
- Uses less data, comparatively slow.
Slide 16: Model-Free Learning
- Learn the policy without learning an explicit model.
- Do not estimate P and R explicitly.
- E.g. Temporal Difference Learning.
- Very popular and fast, but requires a lot of data.
Slide 17: Learning
- Model-based learning
- Learn the model, and do planning
- Requires less data, more computation
- Model-free learning
- Plan without learning an explicit model
- Requires a lot of data, less computation
Slide 18: Q-Learning
- Instead of learning P and R, learn Q directly.
- Q(s,a): the optimal reward starting in s, if the first action is a and the optimal policy is followed afterwards.
- Q directly defines the optimal policy: in each state, the optimal action is the one with the maximum Q-value.
Slide 19: Q-Learning
- Given an experience tuple ⟨s, a, s′, r⟩, update
- Q(s,a) ← (1 - α) Q(s,a) + α [r + γ max_a′ Q(s′,a′)]
- mixing the old estimate of the Q-value with the new sample estimate (a code sketch follows below).
- Under suitable assumptions and GLIE exploration, Q-Learning converges to the optimal Q-values.
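A minimal tabular sketch of that backup; the learning rate α, the discount γ, and the zero default for unseen Q-values are illustrative assumptions:

```python
from typing import Dict, Hashable, List, Tuple

def q_learning_update(Q: Dict[Tuple[Hashable, str], float],
                      s: Hashable, a: str, s_next: Hashable, r: float,
                      actions: List[str],
                      alpha: float = 0.1, gamma: float = 0.95) -> float:
    """One backup for the experience tuple <s, a, s', r>."""
    old_estimate = Q.get((s, a), 0.0)
    new_estimate = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * old_estimate + alpha * new_estimate
    return Q[(s, a)]
```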
Slide 20: Semi-MDP: When Actions Take Time
- The Semi-MDP Q-Learning update, for an experience tuple ⟨s, a, s′, r, N⟩:
- Q(s,a) ← (1 - α) Q(s,a) + α [r + γ^N max_a′ Q(s′,a′)]
- where N is the number of time steps a ran, and r is the discounted reward accumulated while action a was executing (a code sketch follows below).
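Compared with the one-step backup above, only two things change: the target discounts by γ^N, and r is the reward already accumulated during the macro-action. A minimal sketch under the same assumptions:

```python
from typing import Dict, Hashable, List, Tuple

def smdp_q_update(Q: Dict[Tuple[Hashable, str], float],
                  s: Hashable, a: str, s_next: Hashable,
                  r: float, N: int,
                  actions: List[str],
                  alpha: float = 0.1, gamma: float = 0.95) -> float:
    """Semi-MDP backup for <s, a, s', r, N>: r is the discounted reward
    accumulated during the N steps the temporally extended action a ran."""
    old_estimate = Q.get((s, a), 0.0)
    new_estimate = r + (gamma ** N) * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * old_estimate + alpha * new_estimate
    return Q[(s, a)]
```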
Slide 21: Printerbot
- The Paul G. Allen Center has 85,000 sq ft of space.
- Each floor: 85,000 / 7 ≈ 12,000 sq ft.
- Discretise location on a floor into 12,000 parts.
- State space (without the map): 2 × 2 × 12,000 × 12,000 ≈ 5.8 × 10^8 states: very large!
- How do humans do the decision making?
Slide 22: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 23: 1. The Mathematical Perspective: A Structure Paradigm
- S: Relational MDP
- A: Concurrent MDP
- P: Dynamic Bayes Nets
- R: Continuous-state MDP
- G: Conjunction of state variables
- V: Algebraic Decision Diagrams
- π: Decision List (RMDP)
Slide 24: 2. Modular Decision Making
Slide 25: 2. Modular Decision Making
- Go out of the room.
- Walk in the hallway.
- Go into the room.
Slide 26: 2. Modular Decision Making
- Humans plan modularly, at different granularities of understanding.
- Going out of one room is similar to going out of another room.
- Navigation steps do not depend on whether we have the printout or not.
Slide 27: 3. Background Knowledge
- Classical planners using additional control knowledge can scale up to larger problems (e.g. HTN planning, TLPlan).
- What forms of control knowledge can we provide to our Printerbot?
- First pick up the printouts, then deliver them.
- Navigation: consider rooms and the hallway separately, etc.
Slide 28: A Mechanism That Exploits All Three Avenues: Hierarchies
- A way to add a special (hierarchical) structure over different parameters of an MDP.
- Draws on the intuition and reasoning in human decision making.
- A way to provide additional control knowledge to the system.
Slide 29: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 30: Hierarchy
- A hierarchy of behaviours, skills, modules, subtasks, macro-actions, etc.:
- picking up the pages,
- collision avoidance,
- the fetch-pages phase,
- walking in the hallway.
- HRL = RL with temporally extended actions.
Slide 31: Hierarchical Algorithms: The Gating Mechanism
- Hierarchical learning:
- learning the gating function,
- learning the individual behaviours,
- or learning both.
[Diagram: g is a gate, each b_i is a behaviour; this can be a multi-level hierarchy.]
Slide 32: Option: Move-E Until the End of the Hallway
- Start: any state in the hallway.
- Execute the policy as shown.
- Terminate when s is the end of the hallway.
Slide 33: Options [Sutton, Precup & Singh '99]
- An option is a well-defined behaviour: o = ⟨I_o, π_o, β_o⟩.
- I_o: the set of states (I_o ⊆ S) in which o can be initiated.
- π_o(s): the policy (S → A) followed while o is executing.
- β_o(s): the probability that o terminates in s.
- π_o can itself be a policy over lower-level options.
Slide 34: Learning
- An option is a temporally extended action with a well-defined policy.
- The set of options (O) replaces the set of actions (A).
- Learning occurs outside the options.
- Learning over options = Semi-MDP Q-Learning (see the sketch below).
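A minimal sketch of the option tuple and of executing an option to termination, returning exactly the ⟨s′, r, N⟩ quantities the Semi-MDP backup needs; the env.step(s, a) interface is a hypothetical stand-in for the environment:

```python
import random
from dataclasses import dataclass
from typing import Callable, Set, Tuple

@dataclass
class Option:
    """An option o = <I_o, pi_o, beta_o>: a temporally extended action."""
    initiation: Set                          # I_o: states where o may be started
    policy: Callable[[object], str]          # pi_o(s) -> action while o executes
    termination: Callable[[object], float]   # beta_o(s) -> prob. of stopping in s

def run_option(env, o: Option, s, gamma: float = 0.95) -> Tuple[object, float, int]:
    """Execute o until it terminates; return (s', accumulated discounted reward, N),
    the quantities consumed by Semi-MDP Q-learning over options."""
    assert s in o.initiation
    total_r, discount, N = 0.0, 1.0, 0
    while True:
        a = o.policy(s)
        s, r = env.step(s, a)            # hypothetical environment interface
        total_r += discount * r
        discount *= gamma
        N += 1
        if random.random() < o.termination(s):
            return s, total_r, N
```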
Slide 35: Machine: Move-E with Collision Avoidance
[Finite-state machine diagram with nodes Choose, Move-E, Call M1, Call M2, Obstacle, End of hallway, and Return; M1 and M2 are the sub-machines called.]
Slide 36: Hierarchies of Abstract Machines [Parr & Russell '97]
- A machine is a partial policy represented by a Finite State Automaton.
- Node types:
- execute a ground action,
- call a machine as a subroutine,
- choose the next node,
- return to the calling machine.
Slide 38: Learning
- Learning occurs within machines, as machines are only partially defined.
- Flatten all machines out and consider joint states (s, m), where s is a world state and m a machine node: this gives an MDP.
- reduce(S ∘ M): keep only the states whose machine node is a choice node: this gives a Semi-MDP.
- Learning ≈ Semi-MDP Q-Learning (a sketch of the machine structure follows below).
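A minimal sketch of a machine as a finite-state automaton over these node types; the class layout is an illustrative assumption, not Parr and Russell's formal notation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MachineNode:
    """One node of an abstract machine (a partial policy as an FSA)."""
    kind: str                                   # "action" | "call" | "choice" | "return"
    action: Optional[str] = None                # ground action, when kind == "action"
    submachine: Optional["Machine"] = None      # machine called as a subroutine, when kind == "call"
    successors: List["MachineNode"] = field(default_factory=list)  # candidate next nodes

@dataclass
class Machine:
    start: MachineNode

# Learning happens only at "choice" nodes: flatten machines against world states
# into joint states (s, m), keep the pairs where m is a choice node, and run
# Semi-MDP Q-learning over the choices available there.
```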
Slide 39: Task Hierarchy: MAXQ Decomposition [Dietterich '00]
[Task-hierarchy diagram: Root decomposes into Fetch and Deliver; lower-level subtasks include Take, Give, Navigate(loc), Extend-arm, Grab and Release, with the primitive moves Move-E, Move-W, Move-S, Move-N under Navigate(loc). Children of a task are unordered.]
Slide 40: MAXQ Decomposition
- Augment the state s with the subtask i: (s, i).
- Define C(s,i,j) as the reward received inside i after subtask j finishes.
- Q(s, Fetch, Navigate(pr-room)) = V(s, Navigate(pr-room)) + C(s, Fetch, Navigate(pr-room))
- i.e. the reward received while navigating, plus the reward received after the navigation.
- Express V in terms of C, and learn C instead of learning Q (see the sketch below).
- Observe the context-free nature of the Q-value.
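A minimal sketch of the decomposed Q-value and a simplified completion-function update in the spirit of MAXQ-Q (pseudo-rewards omitted); the dictionary defaults and learning rate are illustrative assumptions:

```python
from typing import Dict, Hashable, Tuple

def maxq_q(V: Dict[Tuple[Hashable, str], float],
           C: Dict[Tuple[Hashable, str, str], float],
           s: Hashable, parent: str, child: str) -> float:
    """Q(s, parent, child) = V(s, child) + C(s, parent, child):
    reward earned while the child subtask runs, plus the completion value
    earned inside the parent after the child finishes."""
    return V.get((s, child), 0.0) + C.get((s, parent, child), 0.0)

def maxq_c_update(C: Dict[Tuple[Hashable, str, str], float],
                  s: Hashable, parent: str, child: str,
                  s_next: Hashable, N: int, best_next_q: float,
                  alpha: float = 0.1, gamma: float = 0.95) -> None:
    """Update C after child ran for N steps from s and ended in s_next;
    best_next_q = max over the parent's children a' of Q(s_next, parent, a')."""
    old = C.get((s, parent, child), 0.0)
    C[(s, parent, child)] = (1 - alpha) * old + alpha * (gamma ** N) * best_next_q
```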
Slide 41: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 42: 1. State Abstraction
- Abstract state: a state with fewer state variables, so that different world states map to the same abstract state.
- If we can drop some state variables, we can reduce the learning time considerably!
- We may use different abstract states for different macro-actions.
Slide 43: State Abstraction in MAXQ
- Relevance: only some variables are relevant for a task.
- Fetch: user-loc is irrelevant.
- Navigate(printer-room): h-r-po, h-u-po, user-loc are irrelevant.
- Fewer parameters for the V of the lower levels.
- Funnelling: a subtask maps many states into a smaller set of states.
- Fetch: all states map to h-r-po = true, loc = printer-room.
- Fewer parameters for the C of the higher levels. (A projection sketch follows below.)
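A minimal sketch of the relevance idea: project the full state onto only the variables a subtask cares about, so its value tables have fewer parameters. The relevance sets below are illustrative guesses for the Printerbot, not the exact abstractions from the talk:

```python
from typing import Dict, Iterable, Tuple

def abstract_state(s: Dict[str, object], relevant: Iterable[str]) -> Tuple:
    """Project a full state (a dict of state variables) onto the variables
    relevant for one subtask; irrelevant variables are dropped."""
    return tuple(sorted((var, s[var]) for var in relevant))

# Illustrative (hypothetical) relevance sets:
RELEVANT = {
    "Navigate": {"loc"},                                   # ignores h-r-po, h-u-po, user-loc
    "Fetch":    {"loc", "h-r-po"},
    "Deliver":  {"loc", "h-r-po", "h-u-po", "user-loc"},
}

s = {"loc": "hallway-3", "h-r-po": True, "h-u-po": False, "user-loc": "room-401"}
print(abstract_state(s, RELEVANT["Navigate"]))   # (('loc', 'hallway-3'),)
```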
Slide 44: State Abstraction in Options and HAM
- Options: learning is required only in states that are terminal states for some option.
- HAM: the original work has no abstraction.
- Extension: three-way value decomposition [Andre & Russell '02],
- Q(s,m,n) = V(s,n) + C(s,m,n) + Cex(s,m),
- with similar abstractions then employed.
Slide 45: 2. Optimality
Hierarchical Optimality vs. Recursive Optimality
Slide 46: Optimality
- Options: hierarchical optimality.
- Use (A ∪ O): global optimality.
- Interrupt options.
- HAM: hierarchical optimality.
- MAXQ: recursive optimality.
- Interrupt subtasks.
- Use pseudo-rewards.
- Iterate!
Equations can be defined for both optimalities, but the advantage of using macro-actions may be lost.
Slide 47: 3. Language Expressiveness
- Options:
- can only input a complete policy.
- HAM:
- can input a complete policy,
- can input a task hierarchy,
- can represent the amount of effort,
- later extended to partial programs.
- MAXQ:
- cannot input a policy (full or partial).
Slide 48: 4. Knowledge Requirements
- Options:
- requires a complete specification of the policies,
- though one could learn option policies given subtasks.
- HAM:
- medium requirements.
- MAXQ:
- minimal requirements.
Slide 49: 5. Advanced Models
- Options: concurrency.
- HAM: richer representations, concurrency.
- MAXQ: continuous time, states and actions; multi-agent settings; average reward.
- In general, more researchers have followed MAXQ:
- less input knowledge,
- value decomposition.
Slide 50: 6. Structure Paradigm
- S: Options, MAXQ
- A: all
- P: none
- R: MAXQ
- G: all
- V: MAXQ
- π: all
Slide 51: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 52: Directions for Future Research
- Bidirectional State Abstractions
- Hierarchies over other RL research
- Model-based methods
- Function Approximators
- Probabilistic Planning
- Hierarchical P and Hierarchical R
- Imitation Learning
Slide 53: Directions for Future Research
- Theory
- Bounds (goodness of hierarchy)
- Non-asymptotic analysis
- Automated Discovery
- Discovery of Hierarchies
- Discovery of State Abstraction
- Apply
Slide 54: Applications
- Toy Robot
- Flight Simulator
- AGV Scheduling
- Keepaway soccer
Images courtesy various sources
Slide 55: Thinking Big
- "... consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*." -- David Andre
- Use planners, theorem provers, etc. as components in a big hierarchical solver.
Slide 56: The Outline of the Talk
- MDPs and Bellman's curse of dimensionality.
- RL: simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise
Slide 57: How to Choose an Appropriate Hierarchy
- Look at the available domain knowledge:
- if some behaviours are completely specified: Options;
- if some behaviours are partially specified: HAM;
- if little domain knowledge is available: MAXQ.
- We can use all three to specify different behaviours in tandem.
Slide 58: The Structure Paradigm
- An organised way to view optimisations.
- Assists in figuring out unexploited avenues for speedup.
Slide 59: Main Ideas in the HRL Community
- Hierarchies speed up learning.
- Value function decomposition.
- State abstractions.
- Greedy non-hierarchical execution.
- Context-free learning and pseudo-rewards.
- Policy improvement by re-estimation and re-learning.