Title: Markov Decision Processes: Reactive Planning to Maximize Reward
1 Markov Decision Processes: Reactive Planning to Maximize Reward
Brian C. Williams, 16.410, November 8th, 2004
Slides adapted from Manuela Veloso, Reid Simmons, and Tom Mitchell, CMU
2 Reading and Assignments
- Markov Decision Processes
  - Read AIMA Chapter 17, Sections 1-3.
- This lecture is based on the development in Machine Learning by Tom Mitchell, Chapter 13: Reinforcement Learning.
3 How Might a Mouse Search a Maze for Cheese?
Cheese
- State Space Search?
- As a Constraint Satisfaction Problem?
- Goal-directed Planning?
- Linear Programming?
- What is missing?
4 Ideas in this lecture
- The problem is to accumulate rewards, rather than to achieve goal states.
- The approach is to generate reactive policies for how to act in all situations, rather than plans for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.
5 MDP Examples: TD-Gammon (Tesauro, 1995), Learning Through Reinforcement
- Learns to play Backgammon
- States
  - Board configurations (roughly 10^20)
- Actions
  - Moves
- Rewards
  - +100 if win
  - -100 if lose
  - 0 for all other states
- Trained by playing 1.5 million games against itself.
- Currently roughly equal to the best human player.
6 MDP Examples: Aerial Robotics (Feron et al.), Computing a Solution from a Continuous Model
7 Markov Decision Processes
- Motivation
- What are Markov Decision Processes (MDPs)?
- Models
- Lifetime Reward
- Policies
- Computing Policies From a Model
- Summary
8 MDP Problem
(Figure: agent-environment loop. The agent observes the state and reward from the environment and chooses an action; the process starts in state s0.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
9 MDP Problem: Model
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
10 Markov Decision Processes (MDPs)
- Model
  - Finite set of states, S
  - Finite set of actions, A
  - (Probabilistic) state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
- Process
  - Observe state s_t in S
  - Choose action a_t in A
  - Receive immediate reward r_t
  - State changes to s_{t+1}
(Figure: example state-transition graph with states s0, s1, ..., reward r0, and goal state G; legal transitions are shown, and the reward on unlabeled transitions is 0.)
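For concreteness, a deterministic model like the one above can be written down as two lookup tables. The sketch below is illustrative only; the state names s0, s1, G and actions a, b are hypothetical placeholders, not the exact example in the slides. Later sketches in these notes reuse the names states, actions, delta, and r.
```python
# A deterministic MDP as plain lookup tables (hypothetical toy example).
states = ["s0", "s1", "G"]
actions = ["a", "b"]

# delta[(s, a)] gives the next state; r[(s, a)] gives the immediate reward.
delta = {
    ("s0", "a"): "s1", ("s0", "b"): "s0",
    ("s1", "a"): "G",  ("s1", "b"): "s0",
    ("G", "a"): "G",   ("G", "b"): "G",
}
r = {
    ("s0", "a"): 0,   ("s0", "b"): 0,
    ("s1", "a"): 100, ("s1", "b"): 0,   # entering the goal G pays 100
    ("G", "a"): 0,    ("G", "b"): 0,    # unlabeled transitions pay 0
}
```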
11 MDP Environment Assumptions
- Markov assumption: the next state and reward are a function only of the current state and action.
  - s_{t+1} = δ(s_t, a_t)
  - r_t = r(s_t, a_t)
- Uncertain and unknown environment
  - δ and r may be nondeterministic and unknown.
12 MDP Nondeterministic Example
Today we only consider the deterministic case.
(Figure: nondeterministic example; R = Research, D = Development.)
13 MDP Problem: Model
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
14 MDP Problem: Lifetime Reward
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
15 Lifetime Reward
- Finite horizon
  - Rewards accumulate for a fixed period.
  - 100K + 100K + 100K = 300K
- Infinite horizon
  - Assume reward accumulates forever.
  - 100K + 100K + ... = infinity
- Discounting
  - Future rewards are not worth as much (a bird in the hand ...).
  - Introduce a discount factor γ: 100K + γ 100K + γ² 100K + ... converges.
  - Will make the math work.
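As a quick numeric check on discounting, the sketch below sums a discounted reward stream in Python; the reward values and γ = 0.9 are illustrative.
```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    return sum((gamma ** i) * reward for i, reward in enumerate(rewards))

# A constant stream of 100 per step converges toward 100 / (1 - 0.9) = 1000:
print(discounted_return([100] * 1000))  # ~1000.0
```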
16 MDP Problem: Lifetime Reward
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward:
V = r_0 + γ r_1 + γ² r_2 + ...
17 MDP Problem: Policy
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward:
V = r_0 + γ r_1 + γ² r_2 + ...
18 Assume a deterministic world
- Policy π : S → A
  - Selects an action for each state.
- Optimal policy π* : S → A
  - Selects the action for each state that maximizes lifetime reward.
19
- There are many policies; not all are necessarily optimal.
- There may be several optimal policies.
20 Markov Decision Processes
- Motivation
- What are Markov Decision Processes (MDPs)?
- Models
- Lifetime Reward
- Policies
- Computing Policies From a Model
- Summary
21 Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
- Value Functions
- Mapping Value Functions to Policies
- Computing Value Functions through Value Iteration
- An Alternative: Policy Iteration (appendix)
- Summary
22 Value Function V^π for a Given Policy π
- V^π(s_t) is the accumulated lifetime reward resulting from starting in state s_t and repeatedly executing policy π:
  - V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ...
  - V^π(s_t) = Σ_i γ^i r_{t+i}, where r_t, r_{t+1}, r_{t+2}, ... are generated by following π, starting at s_t.
(Figure: grid-world example showing the V^π value of each state, assuming γ = 0.9.)
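One concrete way to read V^π: follow the (deterministic) policy forward and accumulate discounted rewards. A minimal sketch, assuming the delta/r tables from the earlier model sketch and a policy stored as a dict; the finite horizon approximates the infinite sum.
```python
def rollout_value(s, policy, delta, r, gamma=0.9, horizon=200):
    """Approximate V^pi(s) by executing policy from s and summing discounted rewards."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]                  # action chosen by the policy in state s
        total += discount * r[(s, a)]
        discount *= gamma
        s = delta[(s, a)]              # deterministic transition
    return total
```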
23 An Optimal Policy π* Given Value Function V*
- Idea: given state s,
  - examine all possible actions a_i in state s;
  - select the action a_i with the greatest lifetime reward.
- The lifetime reward Q(s, a_i) is
  - the immediate reward for taking the action, r(s, a),
  - plus the lifetime reward starting in the target state, V*(δ(s, a)),
  - discounted by γ.
- π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
- Must know:
  - Value function
  - Environment model:
    - δ : S × A → S
    - r : S × A → ℝ
(Figure: grid-world example with goal G, showing the V* value of each state and the corresponding optimal actions π*.)
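The one-step lookahead above translates directly into code. A minimal sketch, assuming the same delta/r tables and a value function V stored as a dict from state to value:
```python
def greedy_policy(states, actions, delta, r, V, gamma=0.9):
    """pi(s) = argmax_a [ r(s, a) + gamma * V(delta(s, a)) ] for every state."""
    return {
        s: max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
        for s in states
    }
```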
24 Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
(Figure: the model, with rewards of 100 on the transitions into the goal G, and the value function V*; γ = 0.9.)
25 Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
(Figure: from the current state, action a leads to a state with V* = 100 and action b to a state with V* = 81; γ = 0.9.)
- a: 0 + 0.9 × 100 = 90
- b: 0 + 0.9 × 81 = 72.9
- Select a.
26 Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
(Figure: from the next state, action a enters the goal G with reward 100 and action b leads to a state with V* = 90; γ = 0.9.)
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 90 = 81
- Select a.
(Figure: the partially constructed policy π*.)
27 Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
(Figure: the model, V*, and the resulting policy π* for all states; γ = 0.9.)
28 Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
- Value Functions
- Mapping Value Functions to Policies
- Computing Value Functions through Value Iteration
- An Alternative: Policy Iteration
- Summary
29 Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*_1(s) = max_a r(s, a)
30 Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*_1(s) = max_a r(s, a)
- Optimal value function for a two-step horizon:
  - V*_2(s) = max_a [ r(s, a) + γ V*_1(δ(s, a)) ]
(Figure: from state s, action A with reward R_A leads to state S_A with value V*_1(S_A); action B leads to S_B with value V*_1(S_B); the two-step value discounts these by γ.)
- Instance of the dynamic programming principle:
  - Reuse shared sub-results.
  - Exponential saving.
31 Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*_1(s) = max_a r(s, a)
- Optimal value function for a two-step horizon:
  - V*_2(s) = max_a [ r(s, a) + γ V*_1(δ(s, a)) ]
- Optimal value function for an n-step horizon:
  - V*_n(s) = max_a [ r(s, a) + γ V*_{n-1}(δ(s, a)) ]
32 Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*_1(s) = max_a r(s, a)
- Optimal value function for a two-step horizon:
  - V*_2(s) = max_a [ r(s, a) + γ V*_1(δ(s, a)) ]
- Optimal value function for an n-step horizon:
  - V*_n(s) = max_a [ r(s, a) + γ V*_{n-1}(δ(s, a)) ]
- Optimal value function for an infinite horizon:
  - V*(s) = max_a [ r(s, a) + γ V*(δ(s, a)) ]
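The n-step recursion is a short dynamic program: start from V*_0 = 0 and apply the update n times. A sketch under the same deterministic-model assumptions as the earlier code:
```python
def finite_horizon_values(states, actions, delta, r, gamma=0.9, n=3):
    """Compute V*_n via V*_k(s) = max_a [ r(s, a) + gamma * V*_{k-1}(delta(s, a)) ]."""
    V = {s: 0.0 for s in states}  # V*_0 = 0 for every state
    for _ in range(n):
        V = {
            s: max(r[(s, a)] + gamma * V[delta[(s, a)]] for a in actions)
            for s in states
        }
    return V
```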
33 Solving MDPs by Value Iteration
- Insight: we can calculate optimal values iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]
  - Terminate when values are close enough:
    - |V_{t+1}(s) - V_t(s)| < ε
- Agent selects the optimal action by one-step lookahead on V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
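Putting the pieces together, the value-iteration loop with the termination test might look like the sketch below (same assumed model tables; epsilon is a tunable tolerance). Feeding the result to the earlier greedy_policy sketch recovers π*.
```python
def value_iteration(states, actions, delta, r, gamma=0.9, epsilon=1e-3):
    """Iterate the Bellman update until the largest per-state change is below epsilon."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(r[(s, a)] + gamma * V[delta[(s, a)]] for a in actions)
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < epsilon:
            return V_new
        V = V_new

# One-step lookahead on the converged values gives the policy:
# pi_star = greedy_policy(states, actions, delta, r, value_iteration(states, actions, delta, r))
```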
34 Convergence of Value Iteration
- If we terminate when values are close enough,
  - |V_{t+1}(s) - V_t(s)| < ε,
- then
  - max_{s in S} |V_{t+1}(s) - V*(s)| < 2εγ / (1 - γ).
- Converges in polynomial time.
- Convergence is guaranteed even if updates are performed infinitely often, but asynchronously and in any order.
35 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: grid world with goal G; immediate rewards of 100 are given on the transitions into G, V_t is 0 everywhere, and the update for one state considers actions a and b.)
- a: 0 + 0.9 × 0 = 0
- b: 0 + 0.9 × 0 = 0
- Max: 0
36 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: the same grid; the update for the state next to G considers actions a, b, and c.)
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 0 = 0
- c: 0 + 0.9 × 0 = 0
- Max: 100
37 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: V_{t+1} is filled in for the next state.)
38 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: V_{t+1} is filled in for the next state.)
39 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: V_{t+1} is filled in for the next state.)
40 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: after the first sweep, states whose actions enter G have value 100; all other states are still 0.)
41 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: the next sweep propagates values back from G; 90 = 0.9 × 100 appears.)
42 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: a further sweep produces 81 = 0.9 × 90.)
43 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: the values have converged to V*.)
44 Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
- Value Functions
- Mapping Value Functions to Policies
- Computing Value Functions through Value Iteration
- An Alternative: Policy Iteration (appendix)
- Summary
45 Appendix: Policy Iteration
- Idea: iteratively improve the policy.
  - Policy evaluation: given a policy π_i, calculate V_i = V^{π_i}, the utility of each state if π_i were to be executed.
  - Policy improvement: calculate a new maximum expected utility policy π_{i+1} using one-step lookahead based on V_i.
- π_i improves at every step, converging when π_i = π_{i+1}.
- Computing V_i is simpler than for value iteration (no max):
  - V_{t+1}(s) ← r(s, π_i(s)) + γ V_t(δ(s, π_i(s)))
  - Solve the linear equations in O(N³), or
  - solve iteratively, similar to value iteration.
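A sketch of this loop, using the iterative evaluation variant rather than the O(N³) linear solve; it assumes the same deterministic-model tables as the earlier sketches:
```python
def policy_iteration(states, actions, delta, r, gamma=0.9, epsilon=1e-3):
    """Alternate policy evaluation (no max) with greedy one-step policy improvement."""
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        # Policy evaluation: V(s) <- r(s, pi_i(s)) + gamma * V(delta(s, pi_i(s)))
        V = {s: 0.0 for s in states}
        while True:
            V_new = {s: r[(s, policy[s])] + gamma * V[delta[(s, policy[s])]]
                     for s in states}
            converged = max(abs(V_new[s] - V[s]) for s in states) < epsilon
            V = V_new
            if converged:
                break
        # Policy improvement: one-step lookahead on V_i
        improved = {
            s: max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
            for s in states
        }
        if improved == policy:                        # converged when pi_{i+1} = pi_i
            return policy, V
        policy = improved
```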
46 Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing policies from a model
- Value Iteration
- Policy Iteration
- Summary
47 Markov Decision Processes (MDPs)
- Model
  - Finite set of states, S
  - Finite set of actions, A
  - Probabilistic state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
(Figure: deterministic example.)
48 Crib Sheet: MDPs by Value Iteration
- Insight: we can calculate optimal values iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]
  - Terminate when values are close enough:
    - |V_{t+1}(s) - V_t(s)| < ε
- Agent selects the optimal action by one-step lookahead on V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
49 Ideas in this lecture
- The objective is to accumulate rewards, rather than to achieve goal states.
- Objectives are achieved along the way, rather than at the end.
- The task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.
50 How Might a Mouse Search a Maze for Cheese?
Cheese
- By Value Iteration?
- What is missing?