Title: Hierarchical POMDP Solutions
1. Hierarchical POMDP Solutions
2. Sequential Decision Making Under Uncertainty
What is the optimal policy?
3. Manufacturing Processes (Mahadevan, Theocharous, FLAIRS 98)
- Reward
  - Reward for consuming
  - Penalize for filling buffers
  - Penalize for machine breakdown
- Actions
  - Produce
  - Maintenance
- What is the optimal policy?
4. Foveated Active Vision (Minut)
- Observations
  - Local features
- Reward
  - Reward for finding the object
- Actions
  - Where to saccade next
  - What features to use
- What is the optimal policy?
5. Many More Partially Observable Problems
- Assistive technologies
  - Web searching, preference elicitation
- Sophisticated computing
  - Distributed file access, network troubleshooting
- Industrial
  - Machine maintenance, manufacturing processes
- Social
  - Education, medical diagnosis, health care policymaking
- Corporate
  - Marketing, corporate policy
- ...
6. Overview
- Learning models of partially observable problems is far from a solved problem
- Computing policies for partially observable domains is intractable
- We propose hierarchical solutions
  - Learn models using less space and time
  - Compute robust policies that cannot be computed by previous approaches
7. How? Spatial and Temporal Abstractions Reduce Uncertainty
[Figure: spatial abstraction (MIT map) and temporal abstraction]
8. Outline
- Sequential decision-making under uncertainty
- A hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near-optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and future directions
9. A Real System: Robot Navigation
10. Belief States (Probability Distributions over States)
[Figure: true state and the corresponding belief state]
11. Belief States (Probability Distributions over States)
[Figure: true state and the corresponding belief state]
12. Belief States (Probability Distributions over States)
[Figure: true state and the corresponding belief state; the update rule behind these figures is sketched below]
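The belief states shown in these slides are maintained with the standard Bayes filter. Using the transition and observation notation introduced on the next slide, the update after taking action a and observing z is (with η a normalizing constant):

$$ b'(s') \;=\; \eta\, O(z \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s) $$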
13. Learning POMDPs
- Given action and observation sequences (As and Zs), compute the transition and observation models (Ts and Os)
  - Estimate the probability distribution over hidden states
  - Count the number of times each state was visited
  - Update T and O and repeat
- This is an Expectation-Maximization (EM) algorithm (a minimal sketch follows the diagram below)
  - An iterative procedure for maximum-likelihood parameter estimation over hidden state variables
  - Converges to a local maximum
[Diagram: POMDP unrolled over three time steps, with hidden states S1, S2, S3, actions A1, A2, and observations Z1, Z2, Z3; transition parameters T(S1=i, A1=a, S2=j) and observation parameters O(O2=z, S2=i, A1=a)]
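As a rough illustration of the procedure described above (not the thesis implementation), here is a minimal one-iteration Baum-Welch sketch for a flat POMDP in numpy. The array layout and the time-indexing convention, with actions treated as known inputs, are assumptions:

```python
# Minimal sketch of one EM / Baum-Welch iteration for a flat POMDP.
# Assumed layout: T[a, i, j] = P(s'=j | s=i, a), O[a, j, z] = P(z | s'=j, a),
# where a is the action taken just before observing z (conventions vary).
import numpy as np

def em_step(T, O, actions, obs, prior):
    """One E-step + M-step on a single trajectory; actions[0] is a dummy
    initial action preceding the first observation."""
    S, N = T.shape[1], len(obs)

    # E-step: scaled forward-backward to estimate hidden-state posteriors.
    alpha = np.zeros((N, S))
    alpha[0] = prior * O[actions[0], :, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, N):
        alpha[t] = (alpha[t - 1] @ T[actions[t]]) * O[actions[t], :, obs[t]]
        alpha[t] /= alpha[t].sum()

    beta = np.ones((N, S))
    for t in range(N - 2, -1, -1):
        beta[t] = T[actions[t + 1]] @ (O[actions[t + 1], :, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()

    gamma = alpha * beta                       # expected state-visitation counts
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Expected transition counts for each action.
    xi = np.zeros_like(T)
    for t in range(N - 1):
        a = actions[t + 1]
        joint = alpha[t][:, None] * T[a] * (O[a, :, obs[t + 1]] * beta[t + 1])[None, :]
        xi[a] += joint / joint.sum()

    # M-step: renormalize expected counts into new T and O, then repeat.
    T_new = xi / np.maximum(xi.sum(axis=2, keepdims=True), 1e-12)
    O_new = np.zeros_like(O)
    for t in range(N):
        O_new[actions[t], :, obs[t]] += gamma[t]
    O_new /= np.maximum(O_new.sum(axis=2, keepdims=True), 1e-12)
    return T_new, O_new
```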
14. Planning in POMDPs
- Belief states constitute a sufficient statistic for making decisions (the Markov property holds; Astrom 1965)
- Bellman equation (a standard form is given below)
- Since the belief space is continuous (an infinite state space), the problem is computationally intractable: PSPACE-hard for the finite horizon and undecidable for the infinite horizon
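The Bellman equation referred to on this slide is the standard value recursion over belief states; written in one common form, with b^a_z denoting the belief obtained from b after taking action a and observing z:

$$ V(b) \;=\; \max_{a \in A}\Big[\, \sum_{s \in S} b(s)\,R(s,a) \;+\; \gamma \sum_{z \in Z} \Pr(z \mid b, a)\, V\!\big(b^{a}_{z}\big) \Big] $$

Because b ranges over a continuous simplex, exact dynamic programming over this equation is what makes the problem intractable.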
15. Our Solution: Spatial and Temporal Abstraction
- Learning
  - A hierarchical Baum-Welch algorithm, derived from the Baum-Welch algorithm for training HHMMs (with Rohanimanesh and Mahadevan, ICRA 2001)
  - Structure learning from weak priors (with Mahadevan, IROS 2002)
  - Inference can be done in linear time by representing H-POMDPs as Dynamic Bayesian Networks (DBNs) (with Murphy and Kaelbling, ICRA 2004)
- Planning
  - Heuristic macro-action selection (with Mahadevan, ICRA 2002)
  - Near-optimal macro-action selection (with Kaelbling, NIPS 2003)
- Structure learning and planning combined
  - Dynamic POMDP abstractions (with Mannor and Kaelbling)
16. Outline
- Sequential decision-making under uncertainty
- A hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near-optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and future directions
17. Hierarchical POMDPs
18. Hierarchical POMDPs
[Diagram: the H-POMDP state hierarchy with actions added, following the HHMM of Fine, Singer, Tishby, MLJ 98]
19. Experimental Environments
600 states
1200 states
20. The Robot Navigation Domain
- The robot Pavlov in the real MSU environment
- The Nomad 200 simulator
21. Learning Feature Detectors (Mahadevan, Theocharous, Khaleeli, MLJ 98)
- 736 hand-labeled grids
- 8-fold cross-validation
- Classification error (μ = 7.33, σ = 3.7)
22. Learning and Planning in H-POMDPs for Robot Navigation
[Flowchart: a hand-coded topological map is compiled into an initial H-POMDP; EM training against the environment yields the trained H-POMDP, which the navigation system uses for planning and execution]
23. Outline
- Sequential decision-making under uncertainty
- A hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near-optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and future directions
24. Planning in H-POMDPs (Theocharous, Mahadevan, ICRA 2002)
- Hierarchical MDP solutions (using the options framework; Sutton, Precup, Singh, AIJ)
- Heuristic POMDP solutions: MLS (sketched after the figure below)
[Figure: abstract and primitive actions over a belief state b(s); the macro-action values v(go-west) and v(go-east) determine which macro to execute from belief b]
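The MLS heuristic named above, and the QMDP heuristic it is compared against in the results slides, can each be stated in a few lines. This is a generic sketch, not code from the thesis; Q is assumed to be a table of MDP action values for the underlying fully observable problem:

```python
# Generic sketch of two flat POMDP heuristics used in the comparisons:
# MLS (Most Likely State) and QMDP. Q[s, a] is an assumed table of MDP
# Q-values computed for the underlying fully observable problem.
import numpy as np

def mls_action(belief, Q):
    """Act as if the single most probable state were the true state."""
    s_star = int(np.argmax(belief))
    return int(np.argmax(Q[s_star]))

def qmdp_action(belief, Q):
    """Weight each state's Q-values by its belief probability."""
    return int(np.argmax(belief @ Q))
```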
25. Plan Execution
26. Plan Execution
27. Plan Execution
28. Plan Execution
29. Intuition
- The probability distribution at the higher level evolves more slowly
- The agent does not have to decide on the best macro-action at every time step
- Long-term actions result in robot localization
30. F-MLS Demo
31. H-MLS Demo
32. Hierarchical is More Successful
[Table: success rate with unknown initial position, by environment, for the hierarchical and flat versions of the MLS and QMDP algorithms]
33. Hierarchical Takes Less Time to Reach the Goal
[Table: average steps to goal with unknown initial position, by environment, for the hierarchical and flat versions of the QMDP and MLS algorithms]
34. Hierarchical Plans are Computed Faster
[Table: planning time by environment for Goal 1 and Goal 2, for each algorithm]
35. Outline
- Sequential decision-making under uncertainty
- A hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near-optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and future directions
36. Near-Optimal Macro-Action Selection (Theocharous, Kaelbling, NIPS 2003)
- Agents usually do not require the entire belief space
- Macro-actions can reduce the reachable belief space even further
- Tested in large-scale robot navigation
  - Only a small part of the belief space is required
  - Learns approximate POMDP policies fast
  - High success rate
  - Better policies
  - Performs information gathering
37. Dynamic Grids
38. The Algorithm
[Figure: the true belief state b along the true trajectory is mapped to its nearest grid point g; simulation trajectories of each macro-action from g estimate the value at g; the value of b is interpolated from its neighbors; executing the chosen macro produces the resulting next true belief state. A sketch of this loop follows.]
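A rough sketch of the selection loop the figure describes: map the current belief to its nearest grid point, simulate each macro-action from there, score the resulting beliefs against the value estimates stored on the grid, and execute the best macro. All helper names and parameters here (simulate_macro, the nearest-neighbor value lookup, n_sims, gamma) are illustrative assumptions, not the NIPS 2003 implementation:

```python
# Hypothetical sketch of grid-based macro-action selection.
# simulate_macro(belief, macro) -> (discounted_reward, resulting_belief, n_steps)
# is an assumed callable supplied by the caller.
import numpy as np

def select_macro(belief, macros, simulate_macro, grid_points, grid_values,
                 n_sims=20, gamma=0.95):
    """Estimate each macro's value from the grid point nearest the current belief."""
    def value_of(b):
        # Nearest-neighbor lookup as a simple stand-in for interpolating the
        # value of b from its neighboring grid points.
        dists = [np.linalg.norm(b - p) for p in grid_points]
        return grid_values[int(np.argmin(dists))]

    # Map the true belief to its nearest grid point g.
    g = grid_points[int(np.argmin([np.linalg.norm(belief - p) for p in grid_points]))]

    best_macro, best_value = None, -np.inf
    for macro in macros:
        # Average over simulation trajectories of this macro started from g.
        returns = []
        for _ in range(n_sims):
            reward, b_next, steps = simulate_macro(g, macro)
            returns.append(reward + (gamma ** steps) * value_of(b_next))
        value = float(np.mean(returns))
        if value > best_value:
            best_macro, best_value = macro, value
    return best_macro
```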
39. Experimental Setup
40. Fewer States
41. Fewer Steps to Goal
42. More Successful
43. Information Gathering
44. Information Gathering (Scaling Up)
45. Dynamic POMDP Abstractions (Theocharous, Mannor, Kaelbling)
[Figure: a start-to-goal run in which entropy thresholds on the belief trigger localization macros]
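A minimal sketch of the entropy-threshold trigger shown in the figure: when the belief becomes too diffuse, a localization macro is preferred over progress toward the goal. The threshold value and macro names are illustrative assumptions:

```python
import numpy as np

def belief_entropy(belief):
    """Shannon entropy of the belief state (in nats); high entropy means the robot is lost."""
    p = belief[belief > 0]
    return float(-(p * np.log(p)).sum())

def choose_macro(belief, goal_macro, localization_macro, threshold=1.0):
    # Run a localization macro while the belief entropy exceeds the (assumed)
    # threshold; otherwise continue with the macro that heads for the goal.
    if belief_entropy(belief) > threshold:
        return localization_macro
    return goal_macro
```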
46. Fewer Steps to Goal
47. Outline
- Sequential decision-making under uncertainty
- A hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near-optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and future directions
48. Dynamic Bayesian Networks
[Diagram: a flat state POMDP vs. a factored DBN POMDP, comparing the number of parameters each representation requires]
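As a rough illustration of the comparison drawn on this slide (the exact figures from the slide are not recoverable): a flat transition model over |S| states and |A| actions needs on the order of |A||S|² parameters, whereas a factored DBN whose state splits into n variables, the i-th taking m_i values with parents Pa(i) in the previous time slice, needs only

$$ |A| \sum_{i=1}^{n} m_i \prod_{j \in \mathrm{Pa}(i)} m_j , $$

which is far smaller whenever each variable has only a few parents.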
49. DBN Inference
50. Representing H-POMDPs as Dynamic Bayesian Networks (Theocharous, Murphy, Kaelbling, ICRA 2004)
[Diagram: the state H-POMDP and its factored DBN representation]
51-54. Representing H-POMDPs as Dynamic Bayesian Networks (continued; the same state H-POMDP vs. factored DBN H-POMDP diagram repeats across these slides)
55. Complexity of Inference
[Comparison of inference cost across the state POMDP, DBN H-POMDP, state H-POMDP, and factored DBN H-POMDP]
56. Hierarchical Localizes Better
[Plot: localization performance of the original model, the factored DBN tied H-POMDP, the factored DBN H-POMDP, the DBN H-POMDP, and the state POMDP, compared against the models before training]
57. Hierarchical Fits Data Better
[Plot: data fit of the same models (original, factored DBN tied H-POMDP, factored DBN H-POMDP, DBN H-POMDP, state POMDP), before and after training]
58. Directions for Future Research
- In the future we will explore structure learning
  - Bayesian model selection approaches
  - Methods for learning compositional hierarchies (recurrent nets, hierarchical sparse n-grams)
  - Natural language acquisition methods
  - Identifying isomorphic processes
- Online learning
- Interactive learning
- Application to real-world problems
59. Major Contributions
- The H-POMDP model
  - Requires less training data
  - Provides better state estimation
  - Fast planning
- Macro-actions in POMDPs reduce uncertainty
  - Information gathering
- Application of the algorithms to large-scale robot navigation
  - Map learning
  - Planning and execution