Title: Optimal Policies for POMDP
1. Optimal Policies for POMDP
2. As Much Reward As Possible?
- A greedy agent takes the action with the highest immediate reward.
3. How Long Does the Agent Make Decisions?
- Finite horizon.
- Infinite horizon (with a discount factor):
  - Values will converge.
  - A good model when the number of decision steps is not given.
4. Policy
- A general plan.
- Deterministic: one action for each state.
- Stochastic: a probability distribution over the set of actions.
- Stationary: can be applied at any time.
- Non-stationary: dependent on time.
- Memoryless: no history is used.
5. Finite Horizon
- The agent has to make k decisions; the optimal policy is non-stationary.
6. Infinite Horizon
- We do not need a different policy for each time step.
- Discount factor: 0 < γ < 1.
- Infiniteness helps us to find a stationary policy: (π0, π1, ..., πt) becomes (πi, πi, ..., πi).
7. MDP
- Finite horizon: solved with dynamic programming (see the sketch below).
- Infinite horizon: |S| equations in |S| unknowns, solved with LP.
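A minimal sketch of the finite-horizon dynamic-programming backup mentioned above, assuming hypothetical arrays T[a] (transition matrices) and R[s, a] (immediate rewards); the names and layout are illustrative, not from the slides.

```python
import numpy as np

def finite_horizon_mdp(T, R, k):
    """Finite-horizon MDP solved by dynamic programming: one backup per
    remaining decision, keeping the (non-stationary) policy of each step.

    T[a][s, s2] : P(s2 | s, a)     R[s, a] : immediate reward
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)                      # value with 0 decisions left
    values, policies = [], []
    for _ in range(k):
        Q = np.stack([R[:, a] + T[a] @ V for a in range(n_actions)], axis=1)
        V = Q.max(axis=1)
        values.append(V)
        policies.append(Q.argmax(axis=1))       # a different policy per step
    return values, policies
```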
8. MDP
- Actions may be stochastic.
- Do you know which state you will end up in?
- Dealing with uncertainty in observations.
9. POMDP Model
- Finite set of states
- Finite set of actions
- Transition probabilities (as in MDP)
- Observation model
- Reinforcement
10. POMDP Model
- Immediate reward for performing action a in state i.
11. POMDP Model
- Belief state: a probability distribution over states, π = (π0, π1, ..., π|S|).
- Drawback: to compute the next belief state, a world model is needed; the update follows from Bayes' rule (see the sketch below).
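A short sketch of that Bayes-rule update. The matrix layout, T[a][s, s'] = P(s' | s, a) and O[a][s', o] = P(o | s', a), is an assumption made for illustration.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Next belief after taking action a and observing o:
    b'(s') is proportional to P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    unnormalized = O[a][:, o] * (T[a].T @ b)
    return unnormalized / unnormalized.sum()
```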
12. POMDP Model
- Control dynamics for a POMDP
13. Policies for POMDP
- Belief states are infinite, so representing value functions as tables is infeasible.
- Consider horizon length 1 first.
- No control over observations (unlike in an MDP); weigh all observations by their probability (see the sketch below).
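A sketch of what "weigh all observations" means in a one-step lookahead at a belief state; the model layout matches the belief-update sketch above, and the value estimate V of the successor beliefs is passed in as a function. All names are illustrative.

```python
import numpy as np

def lookahead_value(b, V, T, O, R, gamma=0.95):
    """Value of belief b when the agent cannot choose the observation:
    each observation's successor value is weighted by P(o | b, a)."""
    best = -np.inf
    for a in range(len(T)):
        value = b @ R[:, a]                        # expected immediate reward
        pred = T[a].T @ b                          # predicted state distribution
        for o in range(O[a].shape[1]):
            p_o = pred @ O[a][:, o]                # P(o | b, a)
            if p_o > 0:
                b_next = (O[a][:, o] * pred) / p_o # Bayes-rule update
                value += gamma * p_o * V(b_next)   # weigh this observation
        best = max(best, value)
    return best
```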
14. Value Functions for POMDPs
- The formula is complex; however, if the VF is piecewise linear (a way of representing a VF over a continuous space), it can be written as a maximum over a finite set of vectors.
15. Value Functions for POMDPs
16. Value Functions for POMDPs
- Given Vt-1, Vt can be calculated.
- Keep the action which gives rise to each specific α vector.
- To find the optimal policy at a belief state, just perform a maximization over all α vectors and take the associated action (see the sketch below).
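A minimal sketch of that maximization, assuming each α vector is stored alongside the action that generated it:

```python
import numpy as np

def best_action(b, alpha_vectors, actions):
    """V(b) = max over alpha of alpha . b; the optimal action at b is the
    one associated with the maximizing alpha vector."""
    values = [float(np.dot(alpha, b)) for alpha in alpha_vectors]
    i = int(np.argmax(values))
    return actions[i], values[i]
```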
17. Geometric Interpretation of VF
- Belief simplex.
- 2-dimensional case.
18. Geometric Interpretation of VF
19. Alternate VF Interpretation
- A decision tree can enumerate each possible policy for a k-horizon problem, if the initial belief state is given.
20. Alternate VF Interpretation
- The number of nodes in such a tree.
- The number of possible trees (|A| possible actions for each node); see the counting sketch below.
- If we somehow generate only the useful trees, the complexity will be greatly reduced.
- Previously, creating the entire VF meant generating an α vector for every belief π, too many for the algorithm to work.
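A small sketch of the counting behind the first two bullets, under the usual assumption that a k-horizon policy tree branches once per observation at every node; the closed-form node count is not stated on the slide and is an assumption here.

```python
def policy_tree_counts(n_actions, n_observations, horizon):
    """Nodes in one complete k-horizon tree and the number of distinct
    trees when every node can be labelled with any of the |A| actions."""
    nodes = (n_observations**horizon - 1) // (n_observations - 1)
    return nodes, n_actions**nodes

# For example, 3 actions and 2 observations with horizon 4:
print(policy_tree_counts(3, 2, 4))   # (15, 14348907)
```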
21. POMDP Solutions
- For the finite horizon:
  - Iterate over time steps: given Vt-1, compute Vt.
  - Retain all intermediate solutions.
- For finitely transient policies, the same idea applies to find the infinite-horizon solution:
  - Iterate until the optimal value functions are the same for two consecutive time steps.
  - Once the infinite-horizon solution is found, discard all intermediate results.
22. POMDP Solutions
- Given Vt-1, Vt can be calculated for one α from the previous formula, but with no knowledge of the region in which this vector is optimal (Sondik).
- Too many α vectors to construct the VF; one possible solution: choose random belief points (see the sketch below).
  - If the number of points is large, one hopefully does not miss any of the true vectors.
  - How many points to choose? There is no guarantee.
- Alternatively, find optimal policies by developing a systematic algorithm to explore the entire continuous space of beliefs.
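A sketch of the "choose random points" idea: sample belief points from the simplex and keep only the candidate vectors that are maximal at some sample. More points make it less likely that a true vector is missed, but, as the slide says, there is no guarantee. Names are illustrative.

```python
import numpy as np

def point_based_prune(candidates, n_states, n_points=1000, seed=0):
    """Keep only the candidate vectors that are maximal at at least one of
    the randomly sampled belief points."""
    rng = np.random.default_rng(seed)
    beliefs = rng.dirichlet(np.ones(n_states), size=n_points)  # random simplex points
    values = beliefs @ np.array(candidates).T                  # (points x vectors)
    keep = set(int(i) for i in values.argmax(axis=1))
    return [candidates[i] for i in sorted(keep)]
```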
23. Tiger Problem
- Actions: open the left door, open the right door, listen.
- Listening is not accurate.
- s0: tiger on the left; s1: tiger on the right.
- Rewards: +10 for opening the correct door, -100 for the wrong door, -1 for listening.
- Initially π = (0.5, 0.5) (see the model sketch below).
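A sketch of the tiger model in the layout used in the earlier sketches. The slides only say that listening is not accurate; the 85% listening accuracy below is the value commonly used for this problem and is an assumption here.

```python
import numpy as np

# States: 0 = tiger-left, 1 = tiger-right.
# Actions: 0 = open-left, 1 = open-right, 2 = listen.
# Observations: 0 = hear-left, 1 = hear-right.
T = [np.full((2, 2), 0.5),             # opening a door resets the tiger
     np.full((2, 2), 0.5),
     np.eye(2)]                        # listening leaves the tiger in place
O = [np.full((2, 2), 0.5),             # nothing reliable is heard after opening
     np.full((2, 2), 0.5),
     np.array([[0.85, 0.15],           # assumed 85% accurate listening
               [0.15, 0.85]])]
R = np.array([[-100.0,   10.0, -1.0],  # R[s, a] for open-left, open-right, listen
              [  10.0, -100.0, -1.0]])
b0 = np.array([0.5, 0.5])              # initial belief
```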
24. Tiger Problem
25. Tiger Problem
- First action, intuitively: opening a door is worth (-100 + 10)/2 = -45, versus -1 for listening.
- For horizon length 1, listening is therefore the best action (checked in the sketch below).
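The same horizon-1 arithmetic, checked against the R and b0 from the model sketch above:

```python
import numpy as np

b0 = np.array([0.5, 0.5])
R = np.array([[-100.0,   10.0, -1.0],
              [  10.0, -100.0, -1.0]])
print(b0 @ R)   # [-45. -45.  -1.]  -> listen is the best first action
```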
26. Tiger Problem
27. Tiger Problem
- For horizon length 4, some nice features appear:
  - A belief state, under the same action and observation, is always transformed to a single belief state.
  - The observations made precisely define the nodes in the graph that would be traversed.
28. Infinite Horizon
- The finite horizon is cumbersome: a different policy for the same belief point at each time step.
- A different set of vectors for each time step.
- Add a discount factor to the tiger problem; after 56 steps the underlying vectors are only slightly different.
29. Infinite Horizon for the Tiger Problem
- In this way the finite-horizon algorithms can be used for infinite-horizon problems.
- Advantage of the infinite horizon: keep only the last policy.
30. Policy Graphs
- A way to encode the policy without keeping the vectors and without computing dot products (see the sketch below).
- (Figure: policy graph, from a beginning state to an end state.)
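A sketch of how a policy graph is executed without vectors or dot products; the two-node graph below is a hypothetical illustration, not the optimal tiger-problem graph.

```python
def run_policy_graph(graph, node, observe, steps):
    """At each node perform the node's action, receive an observation from
    the environment, and follow the matching arc to the next node."""
    for _ in range(steps):
        action = graph[node]["action"]
        obs = observe(action)                 # supplied by the environment
        node = graph[node]["next"][obs]
    return node

# Hypothetical illustration only:
graph = {
    "n0": {"action": "listen",
           "next": {"hear-left": "n1", "hear-right": "n0"}},
    "n1": {"action": "open-right",
           "next": {"hear-left": "n0", "hear-right": "n0"}},
}
```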
31. Finite Transience
- All the belief states within a particular partition element are transformed to another element for a particular action and observation.
- For non-finitely-transient policies, exactly optimal policy graphs cannot be constructed.
32. Overview of Algorithms
- All are performed iteratively.
- All try to find the set of vectors that defines both the value function and the optimal policy at each time step.
- Two separate classes:
  - Given Vt-1, generate a superset of Vt, then reduce that set until the optimal Vt is found (Monahan and Eagle).
  - Given Vt-1, construct subsets of the optimal Vt; these subsets grow larger until the optimal Vt is found.
33. Monahan Algorithm
- Easy to implement.
- Do not expect it to solve anything but the smallest of problems.
- Provides background for understanding the other algorithms.
34. Monahan Enumeration Phase
- Generate all vectors (see the sketch below).
- Number of generated vectors: |A| * |M|^|Ω|, where M is the set of vectors from the previous step.
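A sketch of the enumeration phase under the same model layout as the earlier sketches: one candidate vector per action and per assignment of a previous-step vector to each observation, which is where the |A| * |M|^|Ω| count comes from.

```python
import itertools
import numpy as np

def monahan_enumeration(prev_vectors, T, O, R, gamma=0.95):
    """Generate every candidate vector for Vt from the set M of vectors
    defining Vt-1 (|A| * |M|**|Omega| of them)."""
    new_vectors = []
    for a in range(len(T)):
        # pieces[o][m](s) = gamma * sum_s' P(s'|s,a) P(o|s',a) * prev_vectors[m](s')
        pieces = [[gamma * (T[a] * O[a][:, o]) @ v for v in prev_vectors]
                  for o in range(O[a].shape[1])]
        for choice in itertools.product(*pieces):   # one vector per observation
            new_vectors.append(R[:, a] + sum(choice))
    return new_vectors
```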
35. Monahan Reduction Phase
- All vectors can be kept.
- Each time, maximize over all vectors.
- This is a lot of excess baggage.
- The number of vectors at the next step will be even larger.
- An LP is used to trim away useless vectors.
36. Monahan Reduction Phase
- For a vector to be useful, there must be at least one belief point at which it gives a larger value than all the other vectors (see the LP sketch below).
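A sketch of that usefulness test as a linear program (here via scipy's linprog, an implementation choice not named on the slides): maximize d over beliefs b on the simplex subject to b·(α − α') ≥ d for every other vector α'; the vector is kept only if the optimal d is strictly positive.

```python
import numpy as np
from scipy.optimize import linprog

def is_useful(alpha, others, eps=1e-9):
    """True if some belief point gives alpha a strictly larger value than
    every vector in `others`. Decision variables are (b, d)."""
    if not others:
        return True
    n = len(alpha)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                   # maximize d == minimize -d
    A_ub = np.array([np.append(o - alpha, 1.0) for o in others])
    b_ub = np.zeros(len(others))                   # b.(other - alpha) + d <= 0
    A_eq = np.array([np.append(np.ones(n), 0.0)])  # beliefs sum to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]      # b >= 0, d free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return bool(res.success and -res.fun > eps)
```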
37. Monahan Algorithm
38. Monahan's LP Complication
39. Future Work
- Eagle's Variant of Monahan's Algorithm.
- Sondik's One-Pass Algorithm.
- Cheng's Relaxed Region Algorithm.
- Cheng's Linear Support Algorithm.