Tutorial on Finite State Controllers and Policy Search (PowerPoint transcript)

1
Tutorial on Finite State Controllers and Policy Search
July 14, 2008, AAAI Workshop on Advancements in POMDP Solvers
Pascal Poupart, University of Waterloo, ppoupart@cs.uwaterloo.ca
2
Outline
  • Policy representations
  • Policy iteration
  • Bounded controllers
  • Bounded policy search
  • Bounded policy iteration
  • Non-convex optimization
  • Maximum likelihood
  • Synthesis

3
Policy Representations
  • π : H → A (histories to actions)
  • a0, o0, a1, o1, …, an, on → a
  • Problem: growing history
  • π : B → A (beliefs to actions)
  • Problem: we can't enumerate all beliefs
  • Alternatively, π : Γ → A (α-vectors to actions)

[Figure: α-vectors α1, α2, α3 plotted over the belief space, each labelled with the action (a1, a2, a3) to take where that vector is maximal]
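A minimal sketch of how the α-vector representation above can be executed (all names and numbers here are illustrative, not from the tutorial): the policy returns the action attached to whichever α-vector is maximal at the current belief.

```python
import numpy as np

# Hypothetical alpha-vector policy: each alpha-vector is a value vector over states,
# paired with the action to take wherever that vector is maximal.
alpha_vectors = np.array([[1.0, 0.2],   # alpha_1
                          [0.6, 0.6],   # alpha_2
                          [0.2, 1.0]])  # alpha_3
actions = ["a1", "a2", "a3"]            # action attached to each alpha-vector

def act(belief):
    """pi : Gamma -> A: return the action of the alpha-vector maximal at this belief."""
    values = alpha_vectors @ belief     # dot product of each alpha-vector with the belief
    return actions[int(np.argmax(values))]

print(act(np.array([0.9, 0.1])))        # near the left corner of belief space -> "a1"
```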
4
Policy Iteration
  • [Sondik 71, 78] (description from [Hansen 97])
  • Think of the POMDP as a continuous belief-state MDP
  • Apply policy iteration for MDPs

Policy improvement: π(b) ← argmax_a R(b,a) + γ Σ_o Pr(o|b,a) V(τ(b,a,o))
How?
Policy evaluation: compute V_π(b)
5
Finite State Controller
  • Nodes → actions: ψ(n) = a
  • Edges → observations: β(n,o) = n'
  • Policy π = ⟨ψ, β⟩

[Figure: example finite state controller with action-labelled nodes (a1, a2, a3) and observation-labelled edges (o1, o2)]
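A small sketch of the controller structure just described, under an assumed Python encoding: a deterministic finite state controller is simply an action map ψ(n) and a successor map β(n,o), and executing it needs only the current node, never a belief.

```python
# A deterministic finite state controller <psi, beta> (illustrative encoding):
# psi maps a node to the action it emits, beta maps (node, observation) to the
# successor node. Execution tracks only the current node, not a belief.
psi = {0: "a1", 1: "a2", 2: "a3"}
beta = {(0, "o1"): 1, (0, "o2"): 2,
        (1, "o1"): 0, (1, "o2"): 2,
        (2, "o1"): 2, (2, "o2"): 0}

def run(observations, start_node=0):
    """Execute the controller on an observation sequence, yielding one action per step."""
    node = start_node
    for obs in observations:
        yield psi[node]            # act according to the current node
        node = beta[(node, obs)]   # follow the edge labelled by the observation

print(list(run(["o1", "o2", "o1"])))   # -> ['a1', 'a2', 'a3']
```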
6
Policy Evaluation
  • Solve linear system

V_n(b) = R(b, ψ(n)) + γ Σ_o Pr(o | b, ψ(n)) V_{β(n,o)}(τ(b, ψ(n), o))

[Figure: node value functions, each linear in the belief, plotted over the belief space]
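The evaluation equations above are linear in the unknown node values, so they can be solved directly. A rough sketch, assuming tabular arrays T[a] (transition probabilities), Z[a] (observation probabilities) and R (rewards) and a deterministic controller; the function name and array conventions are illustrative, not from the tutorial.

```python
import numpy as np

def evaluate_controller(T, Z, R, psi, beta, gamma):
    """Solve V_n(s) = R(s, psi[n]) + gamma * sum_{s',o} T[a][s,s'] Z[a][s',o] V_{beta[n][o]}(s')
    for a deterministic controller.  Assumed shapes: T[a] is S x S (Pr(s'|s,a)),
    Z[a] is S x O (Pr(o|s',a)), R is S x A, psi[n] is the action of node n,
    beta[n][o] is the successor node.  Returns V as an N x S array."""
    N, S = len(psi), R.shape[0]
    M = np.zeros((N * S, N * S))          # coefficient matrix of the stacked linear system
    r = np.zeros(N * S)
    for n in range(N):
        a = psi[n]
        r[n * S:(n + 1) * S] = R[:, a]
        for o in range(Z[a].shape[1]):
            n2 = beta[n][o]
            # weight of successor node n2's values: Pr(s'|s,a) * Pr(o|s',a)
            M[n * S:(n + 1) * S, n2 * S:(n2 + 1) * S] += T[a] * Z[a][:, o][None, :]
    V = np.linalg.solve(np.eye(N * S) - gamma * M, r)
    return V.reshape(N, S)
```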
7
Policy Iteration
  • [Sondik 71, 78] (description from [Hansen 97])

Policy improvement: π(b) ← argmax_a R(b,a) + γ Σ_o Pr(o|b,a) V(τ(b,a,o))
Policy evaluation: V_n(b) = R(b, ψ(n)) + γ Σ_o Pr(o | b, ψ(n)) V_{β(n,o)}(τ(b, ψ(n), o))
8
Improved Policy Iteration
  • [Hansen 97]

Policy improvement: π(b) ← argmax_a R(b,a) + γ Σ_o Pr(o|b,a) V(τ(b,a,o))
Policy evaluation: V_n(b) = R(b, ψ(n)) + γ Σ_o Pr(o | b, ψ(n)) V_{β(n,o)}(τ(b, ψ(n), o))
9
Policy Improvement (Hansen)
  • Create new nodes for all possible ψ and β
  • Total of |A||N|^|O| new nodes

[Figure: candidate nodes created by the dynamic-programming backup, with their action labels (a1, a2), observation edges (o1, o2), and value vectors over the belief space]
10
Policy Improvement (Hansen)
  • Retain only the blue dominating nodes
  • i.e., those whose vectors form part of the upper surface

[Figure: the same candidate nodes; only the dominating ones, whose vectors lie on the upper surface, are retained]
11
Policy Improvement (Hansen)
  • Prune pointwise-dominated black nodes
  • i.e., those dominated by a single other node

[Figure: the remaining nodes after pruning the pointwise-dominated ones]
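A rough sketch of the enumeration-and-pruning step shown on the last three slides, under the same assumed array conventions as the earlier sketch: generate one candidate node per (action, successor mapping), compute its backed-up value vector, and drop candidates that are pointwise dominated by a single other vector. The additional upper-surface (LP-based) pruning that Hansen's algorithm applies is omitted here.

```python
import numpy as np
from itertools import product

def improvement_candidates(T, Z, R, V, gamma):
    """Enumerate all |A| * N^|O| candidate nodes (a, successor map) of the DP backup
    and their backed-up value vectors.  Same assumed shapes as before; V is N x S."""
    N, S = V.shape
    A, O = R.shape[1], Z[0].shape[1]
    candidates = []
    for a, succ in product(range(A), product(range(N), repeat=O)):
        # alpha(s) = R(s,a) + gamma * sum_{o,s'} Pr(s'|s,a) Pr(o|s',a) V_{succ[o]}(s')
        alpha = R[:, a].astype(float).copy()
        for o in range(O):
            alpha += gamma * (T[a] * Z[a][:, o][None, :]) @ V[succ[o]]
        candidates.append((a, succ, alpha))
    return candidates

def prune_pointwise_dominated(candidates, eps=1e-9):
    """Keep only candidates whose vector is not strictly pointwise dominated by a single
    other candidate's vector; upper-surface (LP) pruning would remove even more."""
    kept = []
    for i, (a, succ, alpha) in enumerate(candidates):
        dominated = any(np.all(c[2] >= alpha + eps) for j, c in enumerate(candidates) if j != i)
        if not dominated:
            kept.append((a, succ, alpha))
    return kept
```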
12
Exponential Growth
  • Problem: controllers tend to grow exponentially!
  • At each iteration, up to |A||N|^|O| nodes may be added
  • Solution: bounded controllers

13
Policy Search for Bounded Controllers
  • Gradient ascent [Meuleau et al. 99, Aberdeen & Baxter 02]
  • Branch and bound [Meuleau et al. 99]
  • Stochastic Local Search [Braziunas, Boutilier 04]
  • Bounded policy iteration [Poupart 03]
  • Non-convex optimization [Amato et al. 07]
  • Maximum likelihood [Toussaint et al. 06]

14
Stochastic Controllers
  • Policy search often done with stochastic
    controllers
  • ψ(n) = Pr(a|n)
  • β(n,o) = Pr(n'|o,n)
  • Why?
  • Continuous parameterization
  • More expressive policy space

15
Bounded Policy Improvement
  • Improve each node in turn [Poupart, Boutilier 03]
  • Replace with dominating stochastic node

17
Node Improvement
  • Linear Programming
  • O(SAO) constraints
  • O(AON) variables

Objective: max ε
Variables: Pr(a, n'|n, o)
Constraints:
  V_n + ε ≤ Σ_{a,n'} Pr(a, n'|n, o_k) R^a + γ Σ_{a,o,n'} T^{a,o} Pr(a, n'|n, o) V_{n'}
  Σ_{n'} Pr(a, n'|n, o_k) = Σ_{n'} Pr(a, n'|n, o)   ∀ a, o
  (componentwise over states; o_k is a fixed reference observation, and the distribution is normalized: Σ_{a,n'} Pr(a, n'|n, o_k) = 1)
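A sketch of this node-improvement LP using scipy's linprog, with the same assumed array conventions as the earlier sketches (T^{a,o} is composed from T[a] and Z[a]) and the normalization constraint written out explicitly; an illustration of the formulation above, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def improve_node(n, T, Z, R, V, gamma, o_k=0):
    """Node-improvement LP for node n (illustrative, same assumed shapes as before).
    Variables: epsilon and x[a,o,n'] = Pr(a,n'|n,o); Pr(a|n) is the marginal of x at a
    fixed reference observation o_k.  V is the N x S array of current node values."""
    N, S = V.shape
    A, O = R.shape[1], Z[0].shape[1]
    nx = A * O * N
    idx = lambda a, o, n2: 1 + (a * O + o) * N + n2     # position of x[a,o,n'] in the LP vector

    c = np.zeros(1 + nx)
    c[0] = -1.0                                         # maximize epsilon <=> minimize -epsilon

    # Value constraints (one per state s):
    #   V_n(s) + eps <= sum_{a,n'} x[a,o_k,n'] R(s,a)
    #                   + gamma * sum_{a,o,n'} x[a,o,n'] sum_{s'} T[a][s,s'] Z[a][s',o] V[n'][s']
    A_ub = np.zeros((S, 1 + nx))
    b_ub = -V[n].astype(float)
    A_ub[:, 0] = 1.0
    for a in range(A):
        for o in range(O):
            backup = gamma * (T[a] * Z[a][:, o][None, :]) @ V.T    # S x N
            for n2 in range(N):
                A_ub[:, idx(a, o, n2)] = -backup[:, n2]
                if o == o_k:
                    A_ub[:, idx(a, o, n2)] -= R[:, a]

    # Consistency: sum_{n'} x[a,o,n'] = sum_{n'} x[a,o_k,n'] for every a and o != o_k,
    # plus normalization of the action distribution at o_k.
    rows = []
    for a in range(A):
        for o in range(O):
            if o == o_k:
                continue
            row = np.zeros(1 + nx)
            for n2 in range(N):
                row[idx(a, o, n2)] += 1.0
                row[idx(a, o_k, n2)] -= 1.0
            rows.append(row)
    norm = np.zeros(1 + nx)
    for a in range(A):
        for n2 in range(N):
            norm[idx(a, o_k, n2)] = 1.0
    A_eq = np.vstack(rows + [norm])
    b_eq = np.zeros(len(rows) + 1)
    b_eq[-1] = 1.0

    bounds = [(None, None)] + [(0, None)] * nx          # epsilon free, probabilities >= 0
    return linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
```

If the optimum has ε > 0 (res.x[0]), node n can be replaced by the stochastic node read off res.x[1:], which raises its value at every state.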
18
Synthetic Network Management
  • [Poupart, Boutilier 04]
  • 3legs25: 33,554,432 states, 51 actions, 2 obs.

19
Sparse Node Improvement
  • [Hansen 08]
  • Observation:
  • Controllers are mostly deterministic
  • Few non-zero parameters
  • Proposal:
  • Column generation
  • Solve several reduced LPs
  • O(|O|) variables (instead of O(|A||O||N|))
  • Can be several orders of magnitude faster

20
Non-convex optimization
  • [Amato et al. 07]
  • Quadratically constrained optimization problem
  • |N|× more variables and constraints than the LPs in BPI

Objective: max b_0 · V_{n_0}
Variables: Pr(a, n'|n, o), V_n
Constraints:
  V_n = Σ_{a,n'} Pr(a, n'|n, o_k) R^a + γ Σ_{a,o,n'} T^{a,o} Pr(a, n'|n, o) V_{n'}   ∀ n
  Σ_{n'} Pr(a, n'|n, o_k) = Σ_{n'} Pr(a, n'|n, o)   ∀ a, n, o
21
Alternating Optimization
  • Bounded policy iteration
  • Policy evaluation: fix Pr(a, n'|n, o) and optimize V_n
  • Policy improvement: fix V_n on the right-hand side and optimize Pr(a, n'|n, o) and V_n on the left-hand side

Objective: max b_0 · V_{n_0}
Variables: Pr(a, n'|n, o), V_n
Constraints:
  V_n = Σ_{a,n'} Pr(a, n'|n, o_k) R^a + γ Σ_{a,o,n'} T^{a,o} Pr(a, n'|n, o) V_{n'}   ∀ n
  Σ_{n'} Pr(a, n'|n, o_k) = Σ_{n'} Pr(a, n'|n, o)   ∀ a, n, o
22
Graphical Model
  • [Meuleau et al. 99]
  • Influence diagram that includes controller

[Figure: two-slice influence diagram including the controller nodes n, observations o, action a, and world states s]
23
Likelihood Maximization
  • [Toussaint et al. 06]
  • Mixture of DBNs with normalized terminal reward
  • Maximize reward likelihood
  • Expectation-Maximization

[Figure: the discounted POMDP written as a mixture of finite-horizon DBNs, one per horizon k with mixture weight γ^k(1-γ); each component chains nodes n_t, states s_t, observations o_t and actions a_t, and emits the reward r_k only at its final slice]
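A rough sketch of the objective behind this view, under the same assumed array conventions as the earlier sketches: the discounted value is rewritten as a mixture over horizons t with weight (1-γ)γ^t, the reward is rescaled to [0,1] so it can act as the probability of a terminal "reward" event, and the resulting likelihood is evaluated by forward propagation of the joint state-node distribution. The EM algorithm of [Toussaint et al. 06] maximizes this quantity by alternating message passing with parameter re-estimation; only the objective is shown here.

```python
import numpy as np

def reward_likelihood(T, Z, Rhat, pi_a, pi_n, b0, n0, gamma, horizon=200):
    """Sketch of the objective: the discounted value as a mixture over horizons t with
    weight (1-gamma)*gamma**t, where Rhat in [0,1] is the rescaled reward.  Assumed shapes:
    T[a]: S x S, Z[a]: S x O, Rhat: S x A, pi_a: N x A = Pr(a|n), pi_n: N x O x N = Pr(n'|n,o)."""
    S, A = Rhat.shape
    N, O = pi_a.shape[0], Z[0].shape[1]
    joint = np.outer(b0, np.eye(N)[n0])        # p_0(s, n): belief b0 paired with start node n0
    likelihood = 0.0
    for t in range(horizon):                   # truncate the mixture over horizons
        expected_r = np.einsum("sn,na,sa->", joint, pi_a, Rhat)
        likelihood += (1 - gamma) * gamma ** t * expected_r
        new = np.zeros((S, N))                 # propagate the joint state-node distribution
        for a in range(A):
            w = joint * pi_a[:, a][None, :]                       # p(s,n) * Pr(a|n)
            for o in range(O):
                sp = (w.T @ T[a]) * Z[a][:, o][None, :]           # sp[n,s'] = sum_s w[s,n] T Z
                new += np.einsum("ns,nm->sm", sp, pi_n[:, o, :])  # route mass to next node n'
        joint = new
    return likelihood   # proportional to expected discounted reward; EM increases it monotonically
```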
24
Local Optima Analysis
  • Non-convex optimization problem
  • All existing algorithms get trapped in local
    optima
  • What do we know about local optima?

25
Local Optima Analysis
  • Theorem: BPI is in a local optimum ⟺ each node's value function is tangent to the backed-up value function
  • Corollary: GA (gradient ascent) is in a local optimum ⟺ each node reachable from the initial belief state has value tangent to the backed-up value function

[Figure: a node's value function touching the backed-up value function at the tangent points]
26
Escape Technique for BPI
  • Idea: create new nodes to improve the belief states reachable in one step from the tangent belief states
  • Theorem: if there is no improvement at the belief states reachable in one step from the tangent belief states, then the policy is optimal at the tangent belief states

[Figure: a tangent belief b mapped by T^{a,o} to the beliefs b' reachable in one step]
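A rough sketch of the escape idea, again under the assumed array conventions of the earlier sketches: for each belief reachable in one step from a tangent belief, compare a greedy one-step backup against the controller's current value there, and propose a new node whenever the backup wins. How the tangent beliefs themselves are obtained (e.g., from the dual solution of the node-improvement LP) is not shown.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """tau(b,a,o): Bayesian belief update (same assumed arrays as the earlier sketches)."""
    bp = (b @ T[a]) * Z[a][:, o]
    total = bp.sum()
    return bp / total if total > 0 else None

def one_step_backup(b, T, Z, R, V, gamma):
    """Greedy one-step backup at belief b against the current node values V (N x S).
    Returns the backed-up value and the (action, successor map) of the greedy new node."""
    A, O = R.shape[1], Z[0].shape[1]
    best_val, best_node = -np.inf, None
    for a in range(A):
        val, succ = float(b @ R[:, a]), []
        for o in range(O):
            pr_o = (b @ T[a]) @ Z[a][:, o]           # Pr(o | b, a)
            bp = belief_update(b, a, o, T, Z)
            n2 = int(np.argmax(V @ bp)) if bp is not None else 0
            succ.append(n2)
            if bp is not None:
                val += gamma * pr_o * float(V[n2] @ bp)
        if val > best_val:
            best_val, best_node = val, (a, tuple(succ))
    return best_val, best_node

def escape(tangent_beliefs, T, Z, R, V, gamma, tol=1e-8):
    """Propose new nodes at beliefs reachable in one step from the tangent beliefs whenever
    a one-step backup beats the current controller's value there."""
    new_nodes = []
    for b in tangent_beliefs:
        for a in range(R.shape[1]):
            for o in range(Z[0].shape[1]):
                bp = belief_update(b, a, o, T, Z)
                if bp is None:
                    continue
                val, node = one_step_backup(bp, T, Z, R, V, gamma)
                if val > np.max(V @ bp) + tol:       # backup improves on the controller here
                    new_nodes.append(node)
    return new_nodes
```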
27
Summary
  • Bounded Controller Advantages:
  • Easily interpretable policy
  • No need for belief monitoring
  • Real-time policy execution
  • Policy search as optimization [Amato et al. 07]
  • Wide range of optimization techniques
  • Policy search as likelihood maximization [Toussaint 06]
  • Wide range of inference techniques
  • Bounded Controller Drawback:
  • Local optima

28
Other Policy Search Techniques
  • Policy search via density estimation [Ng et al. 99]
  • PEGASUS [Ng, Jordan 00]
  • Gradient-based policy search [Baxter, Bartlett 00]
  • Natural policy gradient [Kakade 02]
  • Covariant policy search [Bagnell, Schneider 03]
  • Policy search by dynamic programming [Bagnell et al. 04]
  • Point-based policy iteration [Ji, Parr et al. 07]