Optimal Policies for POMDP

1
Optimal Policies for POMDP
  • Presented by Alp Sardag

2
As Much Reward As Possible?
Greedy Agent
3
For how long does the agent make decisions?
  • Finite Horizon
  • Infinite Horizon (discount factor)
  • Values will converge.
  • A good model if the number of decision steps is
    not given.
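A one-line justification of the convergence claim (standard, not stated on the slide): with 0 ≤ γ < 1 and rewards bounded by R_max, the discounted return is bounded:

\[
\Bigl|\sum_{t=0}^{\infty} \gamma^{t} r_{t}\Bigr| \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
\]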

4
Policy
  • General plan
  • Deterministic: one action for each state
  • Stochastic: a pdf over the set of actions
  • Stationary: can be applied at any time
  • Non-stationary: dependent on time
  • Memoryless: no history

5
Finite Horizon
  • The agent has to make k decisions; the policy is
    non-stationary.

6
Infinite Horizon
  • We do not need a different policy for each time
    step.

0 < γ < 1
Infiniteness helps us to find a stationary
policy: ⟨δ_0, δ_1, ..., δ_t⟩ → ⟨δ_i, δ_i, ..., δ_i⟩
7
MDP
  • Finite horizon: solved with dynamic programming
    (see the sketch below).
  • Infinite horizon: |S| equations in |S| unknowns,
    solved with linear programming.
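A minimal sketch of the finite-horizon dynamic program for a fully observable MDP (the array shapes and names are mine, not from the slides):

```python
import numpy as np

def finite_horizon_mdp(T, R, k):
    """Backward dynamic programming over k decision steps.

    T[a, i, j] = P(j | i, a), R[i, a] = immediate reward.
    Returns V[t, i] and the (non-stationary) greedy policy[t, i].
    """
    n_states, n_actions = R.shape
    V = np.zeros((k + 1, n_states))               # V[k] = 0: no steps left
    policy = np.zeros((k, n_states), dtype=int)
    for t in range(k - 1, -1, -1):                # work backward in time
        Q = R + np.einsum('aij,j->ia', T, V[t + 1])   # Q[i, a]
        V[t] = Q.max(axis=1)
        policy[t] = Q.argmax(axis=1)
    return V, policy
```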

8
MDP
  • Actions may be stochastic.
  • Do you know which state you end up in?
  • The remaining issue: dealing with uncertainty in
    observations.

9
POMDP Model
  • Finite set of states
  • Finite set of actions
  • Transition probabilities (as in MDP)
  • Observation model
  • Reinforcement
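A compact container for these five components (a sketch; the field names and array shapes are my own choices, not from the slides):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """The POMDP model <S, A, T, O, R> listed on this slide."""
    n_states: int        # finite set of states S
    n_actions: int       # finite set of actions A
    T: np.ndarray        # T[a, i, j] = P(j | i, a), transition probabilities (as in MDP)
    O: np.ndarray        # O[a, j, o] = P(o | a, j), observation model
    R: np.ndarray        # R[i, a]    = immediate reward (reinforcement)
    gamma: float = 0.95  # discount factor, used for the infinite horizon
```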

10
POMDP Model
  • Immediate reward for performing action a in state
    i.

11
POMDP Model
  • Belief state: a probability distribution over
    states.
  • π = ⟨π_0, π_1, ..., π_|S|⟩
  • Drawback: to compute the next belief state, a
    world model is needed. From Bayes' rule (see the
    sketch below).
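A sketch of that Bayes-rule update, using the POMDP container sketched above (belief_update is my name for it):

```python
import numpy as np

def belief_update(pomdp, belief, action, obs):
    """pi'(j) = O[a, j, o] * sum_i T[a, i, j] * pi(i) / P(o | pi, a)."""
    predicted = pomdp.T[action].T @ belief        # sum_i T[a, i, j] * pi(i)
    unnormalized = pomdp.O[action, :, obs] * predicted
    prob_obs = unnormalized.sum()                 # the Bayes-rule normalizer P(o | pi, a)
    if prob_obs == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / prob_obs
```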

12
POMDP Model
  • Control dynamics for a POMDP

13
Policies for POMDP
  • Belief states are infinite, so value functions
    stored as tables are infeasible.
  • For horizon length 1 (see the formula below).
  • No control over observations (an issue not found
    in MDPs), so all possible observations must be
    weighed.
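For horizon length 1 the value at a belief π is just the expected immediate reward (the slide's formula image did not survive the transcript; this is the standard expression):

\[
V_{1}(\pi) \;=\; \max_{a \in A} \sum_{i \in S} \pi(i)\, R(i, a)
\]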

14
Value functions for POMDPs
  • The formula is complex; however, if the VF is
    piecewise linear (a way of representing a
    continuous-space VF), it can be written compactly
    (see below).
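The piecewise-linear form referred to here (the slide's own formula was lost; this is the usual representation, with Γ_t a finite set of α vectors):

\[
V_{t}(\pi) \;=\; \max_{\alpha \in \Gamma_{t}} \sum_{i \in S} \alpha(i)\, \pi(i)
\]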

15
Value functions for POMDPs
16
Value Functions for POMDPs
  • Given Vt-1, Vt can be calculated.
  • Keep the action which gives rise to each specific
    α vector.
  • To find the optimal policy at a belief state, just
    perform the maximization over all α vectors and
    take the associated action (see the sketch below).
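A sketch of that maximization, assuming each α vector is stored together with the action that generated it (the Alpha structure is illustrative):

```python
import numpy as np
from typing import NamedTuple

class Alpha(NamedTuple):
    vector: np.ndarray   # alpha(i) for each state i
    action: int          # action that gave rise to this vector

def best_action(alphas, belief):
    """Maximize alpha . belief over all alpha vectors; return the winner's action."""
    values = [a.vector @ belief for a in alphas]
    best = int(np.argmax(values))
    return alphas[best].action, values[best]
```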

17
Geometric Interpretation of VF
  • Belief simplex
  • 2 dimensional case

18
Geometric Interpretation of VF
  • 3 dimensional case

19
Alternate VF Interpretation
  • A decision tree could enumerate each possible
    policy for the k-step horizon, if the initial
    belief state is given.

20
Alternate VF Interpretation
  • The number of nodes in each tree (counted below).
  • The number of possible trees (|A| possible actions
    for each node).
  • If we could somehow generate only the useful
    trees, the complexity would be greatly reduced.
  • Previously, to create the entire VF we would
    generate an α vector for every belief state, too
    many for the algorithm to work.
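The counts behind these two bullets (the slide's formulas were lost; these are the standard ones for a k-step tree that branches on |Ω| observations):

\[
\text{nodes per tree} \;=\; \sum_{t=0}^{k-1} |\Omega|^{t} \;=\; \frac{|\Omega|^{k}-1}{|\Omega|-1},
\qquad
\text{possible trees} \;=\; |A|^{\text{nodes per tree}}
\]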

21
POMDP Solutions
  • For the finite horizon:
  • Iterate over time steps. Given Vt-1, compute Vt.
  • Retain all intermediate solutions.
  • For finitely transient policies, the same idea
    applies to finding the infinite-horizon solution.
  • Iterate until the optimal value functions are the
    same for two consecutive time steps (see the loop
    below).
  • Once the infinite-horizon solution is found,
    discard all intermediate results.
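A sketch of that stopping rule (backup_step stands in for whichever exact backup is used, e.g. Monahan's enumerate-and-prune; it is not specified on this slide):

```python
import numpy as np

def solve_infinite_horizon(V0, backup_step, tol=1e-9, max_iters=1000):
    """Iterate exact backups until two consecutive value functions coincide.

    V0 is a list of alpha vectors (1-D numpy arrays); backup_step maps the
    vector set for Vt-1 to the vector set for Vt.
    """
    V_prev = V0
    for _ in range(max_iters):
        V_next = backup_step(V_prev)
        same = (len(V_prev) == len(V_next) and
                all(np.allclose(a, b, atol=tol)
                    for a, b in zip(sorted(V_prev, key=tuple),
                                    sorted(V_next, key=tuple))))
        if same:
            return V_next      # stationary: only this last set is kept
        V_prev = V_next
    return V_prev
```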

22
POMDP Solutions
  • Given Vt-1, Vt can be calculated for one belief
    point π from the previous formula, but with no
    knowledge about the region over which this vector
    is optimal. (Sondik)
  • There are too many belief points π to construct
    the VF this way; one possible solution:
  • Choose random points.
  • If the number of points is large, one is unlikely
    to miss any of the true vectors.
  • How many points to choose? No guarantee.
  • Instead, find optimal policies by developing a
    systematic algorithm that explores the entire
    continuous space of beliefs.

23
Tiger Problem
  • Actions: open left door, open right door, listen.
  • Listening is not accurate.
  • s0: tiger on the left; s1: tiger on the right.
  • Rewards: +10 for opening the correct door, -100
    for the wrong door, -1 for listening.
  • Initially π = (0.5, 0.5) (see the model sketch
    below).
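A sketch of the tiger model using the POMDP container from the earlier slide. The rewards and the initial belief come from the slide; the 0.85 listening accuracy and the reset-after-opening dynamics are the commonly used values and are assumptions here.

```python
import numpy as np

# Actions: 0 = listen, 1 = open left door, 2 = open right door
# States:  0 = tiger on the left (s0), 1 = tiger on the right (s1)
# Observations: 0 = hear tiger left, 1 = hear tiger right
LISTEN_ACC = 0.85                        # assumed; not stated on the slide

T = np.array([np.eye(2),                 # listen: the state does not change
              np.full((2, 2), 0.5),      # open a door: the problem resets
              np.full((2, 2), 0.5)])

O = np.array([[[LISTEN_ACC, 1 - LISTEN_ACC],    # listen in s0
               [1 - LISTEN_ACC, LISTEN_ACC]],   # listen in s1
              np.full((2, 2), 0.5),             # opening a door is uninformative
              np.full((2, 2), 0.5)])

R = np.array([[-1, -100,   10],          # s0: listen, open left, open right
              [-1,   10, -100]])         # s1

tiger = POMDP(n_states=2, n_actions=3, T=T, O=O, R=R)
belief = np.array([0.5, 0.5])            # initial belief (0.5, 0.5)
```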

24
Tiger Problem
25
Tiger Problem
  • First action, intuitively:
  • opening a door at π = (0.5, 0.5) is worth
    0.5·(-100) + 0.5·(+10) = -45, versus -1 for
    listening, so listen first.
  • For horizon length 1

26
Tiger Problem
  • For Horizon length 2

27
Tiger Problem
  • For horizon length 4, nice features:
  • Every belief state is transformed to a single
    belief state for the same action and observation.
  • The observations made precisely define the nodes
    in the graph that would be traversed.

28
Infinite Horizon
  • The finite horizon is cumbersome: a different
    policy for the same belief point at each time
    step.
  • A different set of vectors for each time step.
  • Add a discount factor to the tiger problem; after
    the 56th step the underlying vectors are only
    slightly different.

29
Infinite Horizon for Tiger Problem
  • In this way the finite-horizon algorithms can be
    used for infinite-horizon problems.
  • Advantage of the infinite horizon: only the last
    policy needs to be kept.

30
Policy Graphs
  • A way to encode the policy without keeping
    vectors; no dot products are needed (see the
    sketch below).

[Figure: policy graph, from a beginning state to an end state]
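A sketch of executing a policy graph: each node stores an action and, for each observation, the next node, so acting needs only table lookups and no dot products. The node layout and the env.step interface are illustrative, not the actual tiger-problem graph.

```python
from typing import Dict, NamedTuple

class Node(NamedTuple):
    action: int                # action to execute at this node
    next_node: Dict[int, str]  # observation -> id of the next node

def run_policy_graph(graph, start, env, steps):
    """Follow the graph edges; no alpha vectors, no dot products."""
    node_id = start            # beginning state of the graph
    for _ in range(steps):
        node = graph[node_id]
        obs = env.step(node.action)       # act and receive an observation
        node_id = node.next_node[obs]     # follow the edge for that observation
    return node_id             # end state of the graph
```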
31
Finite Transience
  • All the belief states within a particular
    partition element will be transformed to another
    element for a particular action and observation.
  • For non-finitely transient policies, exactly
    optimal policy graphs cannot be constructed.

32
Overview of Algorithms
  • All performed iteratively.
  • All try to find the set of vectors that define
    both the value function and the optimal policy at
    each time step.
  • Two separate classes:
  • Given Vt-1, generate a superset of Vt, then reduce
    that set until the optimal Vt is found (Monahan
    and Eagle).
  • Given Vt-1, construct subsets of the optimal Vt;
    these subsets grow larger until the optimal Vt is
    found.

33
Monahan Algorithm
  • Easy to implement
  • Do not expect it to solve anything but the
    smallest of problems.
  • Provides background for understanding the other
    algorithms.

34
Monahan Enumeration Phase
  • Generate all vectors (see the sketch below).
  • Number of generated vectors: |A| · M^|Ω|,
  • where M is the number of vectors in the previous
    step's value function and Ω is the set of
    observations.
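A sketch of the enumeration: for each action, pick one previous-step vector per observation and combine them, giving |A| · M^|Ω| candidates (this follows the standard Monahan construction; the POMDP container and field names come from the earlier sketches).

```python
import itertools
import numpy as np

def enumerate_vectors(pomdp, prev_alphas):
    """Generate all |A| * M^|Omega| candidate alpha vectors for Vt."""
    n_obs = pomdp.O.shape[2]
    candidates = []
    for a in range(pomdp.n_actions):
        # g[o][m](i) = sum_j T[a, i, j] * O[a, j, o] * prev_alphas[m](j)
        g = [[pomdp.T[a] @ (pomdp.O[a, :, o] * alpha) for alpha in prev_alphas]
             for o in range(n_obs)]
        # One previous vector per observation, summed, plus the immediate reward.
        for choice in itertools.product(*g):
            candidates.append(pomdp.R[:, a] + pomdp.gamma * sum(choice))
    return candidates
```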

35
Monahan Reduction Phase
  • All vectors could be kept:
  • each time, maximize over all vectors.
  • But that is a lot of excess baggage:
  • the number of vectors at the next step would be
    even larger.
  • An LP is used to trim away the useless vectors.

36
Monahan Reduction Phase
  • For a vector to be useful, there must be at least
    one belief point at which it gives a larger value
    than all the others (see the LP below).
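The usefulness test as an LP, in its standard form (the slide's own formulation was lost in the transcript): the vector α is kept only when the optimal δ is strictly positive, i.e. some belief point π strictly prefers it.

\[
\begin{aligned}
\max_{\pi,\,\delta}\quad & \delta \\
\text{s.t.}\quad & \sum_{i} \pi(i)\,\alpha(i) \;\ge\; \sum_{i} \pi(i)\,\hat{\alpha}(i) + \delta
 \quad \text{for all } \hat{\alpha} \in \Gamma,\ \hat{\alpha} \ne \alpha, \\
 & \sum_{i} \pi(i) = 1, \qquad \pi(i) \ge 0 .
\end{aligned}
\]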

37
Monahan Algorithm
38
Monahan's LP Complication
39
Future Work
  • Eagle's Variant of Monahan's Algorithm.
  • Sondik's One-Pass Algorithm.
  • Cheng's Relaxed Region Algorithm.
  • Cheng's Linear Support Algorithm.