Title: Optimal Policies for POMDP
1. Optimal Policies for POMDP
2. As Much Reward As Possible?
- A greedy agent takes the action with the highest immediate reward.
3. How Long Does the Agent Make Decisions?
- Finite horizon.
- Infinite horizon (with a discount factor):
  - Values will converge.
  - A good model when the number of decision steps is not given.
4. Policy
- A general plan.
- Deterministic: one action for each state.
- Stochastic: a probability distribution over the set of actions.
- Stationary: can be applied at any time.
- Non-stationary: dependent on time.
- Memoryless: no history is used.
5. Finite Horizon
- The agent has to make k decisions; the optimal policy is non-stationary.
6. Infinite Horizon
- We do not need a different policy for each time step.
- Discount factor: 0 < γ < 1.
- Infiniteness helps us to find a stationary policy: (π0, π1, ..., πt) becomes (πi, πi, ..., πi).
7. MDP
- Finite horizon: solved with dynamic programming (see the sketch below).
- Infinite horizon: |S| equations in |S| unknowns, solved with LP.
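A minimal sketch of the finite-horizon dynamic-programming backup mentioned above, assuming hypothetical arrays T[a] (transition matrices) and R[s, a] (immediate rewards); the names and layout are illustrative, not from the slides.

```python
import numpy as np

def finite_horizon_mdp(T, R, k):
    """Finite-horizon MDP solved by dynamic programming: one backup per
    remaining decision, keeping the (non-stationary) policy of each step.

    T[a][s, s2] : P(s2 | s, a)     R[s, a] : immediate reward
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)                      # value with 0 decisions left
    values, policies = [], []
    for _ in range(k):
        Q = np.stack([R[:, a] + T[a] @ V for a in range(n_actions)], axis=1)
        V = Q.max(axis=1)
        values.append(V)
        policies.append(Q.argmax(axis=1))       # a different policy per step
    return values, policies
```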
8. MDP
- Actions may be stochastic.
- Do you know which state you will end up in?
- Dealing with uncertainty in observations.
9. POMDP Model
- Finite set of states
- Finite set of actions
- Transition probabilities (as in MDP)
- Observation model
- Reinforcement
10. POMDP Model
- Immediate reward for performing action a in state i.
11. POMDP Model
- Belief state: a probability distribution over states, π = (π0, π1, ..., π|S|).
- Drawback: to compute the next belief state, a world model is needed; the update follows from Bayes' rule (see the sketch below).
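A short sketch of that Bayes-rule update. The matrix layout, T[a][s, s'] = P(s' | s, a) and O[a][s', o] = P(o | s', a), is an assumption made for illustration.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Next belief after taking action a and observing o:
    b'(s') is proportional to P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    unnormalized = O[a][:, o] * (T[a].T @ b)
    return unnormalized / unnormalized.sum()
```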
12. POMDP Model
- Control dynamics for a POMDP
13. Policies for POMDP
- Belief states are infinite, so representing value functions as tables is infeasible.
- Consider horizon length 1 first.
- No control over observations (unlike in an MDP); weigh all observations by their probability (see the sketch below).
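A sketch of what "weigh all observations" means in a one-step lookahead at a belief state; the model layout matches the belief-update sketch above, and the value estimate V of the successor beliefs is passed in as a function. All names are illustrative.

```python
import numpy as np

def lookahead_value(b, V, T, O, R, gamma=0.95):
    """Value of belief b when the agent cannot choose the observation:
    each observation's successor value is weighted by P(o | b, a)."""
    best = -np.inf
    for a in range(len(T)):
        value = b @ R[:, a]                        # expected immediate reward
        pred = T[a].T @ b                          # predicted state distribution
        for o in range(O[a].shape[1]):
            p_o = pred @ O[a][:, o]                # P(o | b, a)
            if p_o > 0:
                b_next = (O[a][:, o] * pred) / p_o # Bayes-rule update
                value += gamma * p_o * V(b_next)   # weigh this observation
        best = max(best, value)
    return best
```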
14. Value Functions for POMDPs
- The formula is complex; however, if the VF is piecewise linear (a way of representing a VF over a continuous space), it can be written as a maximum over a finite set of vectors.
15. Value Functions for POMDPs
16. Value Functions for POMDPs
- Given Vt-1, Vt can be calculated.
- Keep the action which gives rise to each specific α vector.
- To find the optimal policy at a belief state, just perform a maximization over all α vectors and take the associated action (see the sketch below).
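A minimal sketch of that maximization, assuming each α vector is stored alongside the action that generated it:

```python
import numpy as np

def best_action(b, alpha_vectors, actions):
    """V(b) = max over alpha of alpha . b; the optimal action at b is the
    one associated with the maximizing alpha vector."""
    values = [float(np.dot(alpha, b)) for alpha in alpha_vectors]
    i = int(np.argmax(values))
    return actions[i], values[i]
```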
17. Geometric Interpretation of VF
- Belief simplex.
- 2-dimensional case.
18. Geometric Interpretation of VF
19. Alternate VF Interpretation
- A decision tree can enumerate each possible policy for a k-horizon problem, if the initial belief state is given.
20. Alternate VF Interpretation
- The number of nodes in such a tree.
- The number of possible trees (|A| possible actions for each node); see the counting sketch below.
- If we somehow generate only the useful trees, the complexity will be greatly reduced.
- Previously, creating the entire VF meant generating an α vector for every belief π, too many for the algorithm to work.
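A small sketch of the counting behind the first two bullets, under the usual assumption that a k-horizon policy tree branches once per observation at every node; the closed-form node count is not stated on the slide and is an assumption here.

```python
def policy_tree_counts(n_actions, n_observations, horizon):
    """Nodes in one complete k-horizon tree and the number of distinct
    trees when every node can be labelled with any of the |A| actions."""
    nodes = (n_observations**horizon - 1) // (n_observations - 1)
    return nodes, n_actions**nodes

# For example, 3 actions and 2 observations with horizon 4:
print(policy_tree_counts(3, 2, 4))   # (15, 14348907)
```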
21. POMDP Solutions
- For the finite horizon:
  - Iterate over time steps: given Vt-1, compute Vt.
  - Retain all intermediate solutions.
- For finitely transient policies, the same idea applies to find the infinite-horizon solution:
  - Iterate until the optimal value functions are the same for two consecutive time steps.
  - Once the infinite-horizon solution is found, discard all intermediate results.
22. POMDP Solutions
- Given Vt-1, Vt can be calculated for one α from the previous formula, but with no knowledge of the region in which this vector is optimal (Sondik).
- Too many α vectors to construct the VF; one possible solution: choose random belief points (see the sketch below).
  - If the number of points is large, one hopefully does not miss any of the true vectors.
  - How many points to choose? There is no guarantee.
- Alternatively, find optimal policies by developing a systematic algorithm to explore the entire continuous space of beliefs.
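A sketch of the "choose random points" idea: sample belief points from the simplex and keep only the candidate vectors that are maximal at some sample. More points make it less likely that a true vector is missed, but, as the slide says, there is no guarantee. Names are illustrative.

```python
import numpy as np

def point_based_prune(candidates, n_states, n_points=1000, seed=0):
    """Keep only the candidate vectors that are maximal at at least one of
    the randomly sampled belief points."""
    rng = np.random.default_rng(seed)
    beliefs = rng.dirichlet(np.ones(n_states), size=n_points)  # random simplex points
    values = beliefs @ np.array(candidates).T                  # (points x vectors)
    keep = set(int(i) for i in values.argmax(axis=1))
    return [candidates[i] for i in sorted(keep)]
```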
23. Tiger Problem
- Actions: open the left door, open the right door, listen.
- Listening is not accurate.
- s0: tiger on the left; s1: tiger on the right.
- Rewards: +10 for opening the correct door, -100 for the wrong door, -1 for listening.
- Initially π = (0.5, 0.5) (see the model sketch below).
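A sketch of the tiger model in the layout used in the earlier sketches. The slides only say that listening is not accurate; the 85% listening accuracy below is the value commonly used for this problem and is an assumption here.

```python
import numpy as np

# States: 0 = tiger-left, 1 = tiger-right.
# Actions: 0 = open-left, 1 = open-right, 2 = listen.
# Observations: 0 = hear-left, 1 = hear-right.
T = [np.full((2, 2), 0.5),             # opening a door resets the tiger
     np.full((2, 2), 0.5),
     np.eye(2)]                        # listening leaves the tiger in place
O = [np.full((2, 2), 0.5),             # nothing reliable is heard after opening
     np.full((2, 2), 0.5),
     np.array([[0.85, 0.15],           # assumed 85% accurate listening
               [0.15, 0.85]])]
R = np.array([[-100.0,   10.0, -1.0],  # R[s, a] for open-left, open-right, listen
              [  10.0, -100.0, -1.0]])
b0 = np.array([0.5, 0.5])              # initial belief
```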
24. Tiger Problem
25. Tiger Problem
- First action, intuitively: opening a door is worth (-100 + 10)/2 = -45, versus -1 for listening.
- For horizon length 1, listening is therefore the best action (checked in the sketch below).
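The same horizon-1 arithmetic, checked against the R and b0 from the model sketch above:

```python
import numpy as np

b0 = np.array([0.5, 0.5])
R = np.array([[-100.0,   10.0, -1.0],
              [  10.0, -100.0, -1.0]])
print(b0 @ R)   # [-45. -45.  -1.]  -> listen is the best first action
```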
26. Tiger Problem
27. Tiger Problem
- For horizon length 4, some nice features appear:
  - A belief state, under the same action and observation, is always transformed to a single belief state.
  - The observations made precisely define the nodes in the graph that would be traversed.
28. Infinite Horizon
- The finite horizon is cumbersome: a different policy for the same belief point at each time step.
- A different set of vectors for each time step.
- Add a discount factor to the tiger problem; after 56 steps the underlying vectors are only slightly different.
29. Infinite Horizon for the Tiger Problem
- In this way the finite-horizon algorithms can be used for infinite-horizon problems.
- Advantage of the infinite horizon: keep only the last policy.
30. Policy Graphs
- A way to encode the policy without keeping the vectors and without computing dot products (see the sketch below).
- (Figure: policy graph, from a beginning state to an end state.)
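A sketch of how a policy graph is executed without vectors or dot products; the two-node graph below is a hypothetical illustration, not the optimal tiger-problem graph.

```python
def run_policy_graph(graph, node, observe, steps):
    """At each node perform the node's action, receive an observation from
    the environment, and follow the matching arc to the next node."""
    for _ in range(steps):
        action = graph[node]["action"]
        obs = observe(action)                 # supplied by the environment
        node = graph[node]["next"][obs]
    return node

# Hypothetical illustration only:
graph = {
    "n0": {"action": "listen",
           "next": {"hear-left": "n1", "hear-right": "n0"}},
    "n1": {"action": "open-right",
           "next": {"hear-left": "n0", "hear-right": "n0"}},
}
```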
31. Finite Transience
- All the belief states within a particular partition element are transformed to another element for a particular action and observation.
- For non-finitely-transient policies, exactly optimal policy graphs cannot be constructed.
32. Overview of Algorithms
- All are performed iteratively.
- All try to find the set of vectors that defines both the value function and the optimal policy at each time step.
- Two separate classes:
  - Given Vt-1, generate a superset of Vt, then reduce that set until the optimal Vt is found (Monahan and Eagle).
  - Given Vt-1, construct subsets of the optimal Vt; these subsets grow larger until the optimal Vt is found.
33. Monahan Algorithm
- Easy to implement.
- Do not expect it to solve anything but the smallest of problems.
- Provides background for understanding the other algorithms.
34. Monahan Enumeration Phase
- Generate all vectors (see the sketch below).
- Number of generated vectors: |A| * |M|^|Ω|, where M is the set of vectors from the previous step.
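A sketch of the enumeration phase under the same model layout as the earlier sketches: one candidate vector per action and per assignment of a previous-step vector to each observation, which is where the |A| * |M|^|Ω| count comes from.

```python
import itertools
import numpy as np

def monahan_enumeration(prev_vectors, T, O, R, gamma=0.95):
    """Generate every candidate vector for Vt from the set M of vectors
    defining Vt-1 (|A| * |M|**|Omega| of them)."""
    new_vectors = []
    for a in range(len(T)):
        # pieces[o][m](s) = gamma * sum_s' P(s'|s,a) P(o|s',a) * prev_vectors[m](s')
        pieces = [[gamma * (T[a] * O[a][:, o]) @ v for v in prev_vectors]
                  for o in range(O[a].shape[1])]
        for choice in itertools.product(*pieces):   # one vector per observation
            new_vectors.append(R[:, a] + sum(choice))
    return new_vectors
```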
35. Monahan Reduction Phase
- All vectors can be kept.
- Each time, maximize over all vectors.
- This is a lot of excess baggage.
- The number of vectors at the next step will be even larger.
- An LP is used to trim away useless vectors.
36. Monahan Reduction Phase
- For a vector to be useful, there must be at least one belief point at which it gives a larger value than all the other vectors (see the LP sketch below).
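A sketch of that usefulness test as a linear program (here via scipy's linprog, an implementation choice not named on the slides): maximize d over beliefs b on the simplex subject to b·(α − α') ≥ d for every other vector α'; the vector is kept only if the optimal d is strictly positive.

```python
import numpy as np
from scipy.optimize import linprog

def is_useful(alpha, others, eps=1e-9):
    """True if some belief point gives alpha a strictly larger value than
    every vector in `others`. Decision variables are (b, d)."""
    if not others:
        return True
    n = len(alpha)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                   # maximize d == minimize -d
    A_ub = np.array([np.append(o - alpha, 1.0) for o in others])
    b_ub = np.zeros(len(others))                   # b.(other - alpha) + d <= 0
    A_eq = np.array([np.append(np.ones(n), 0.0)])  # beliefs sum to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]      # b >= 0, d free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return bool(res.success and -res.fun > eps)
```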
37. Monahan Algorithm
38. Monahan's LP Complication
39. Future Work
- Eagle's Variant of Monahan's Algorithm.
- Sondik's One-Pass Algorithm.
- Cheng's Relaxed Region Algorithm.
- Cheng's Linear Support Algorithm.