Title: Markov Decision Processes: Reactive Planning to Maximize Reward
1 Markov Decision Processes: Reactive Planning to Maximize Reward
Brian C. Williams, 16.410, November 8th, 2004
Slides adapted from Manuela Veloso, Reid Simmons, and Tom Mitchell, CMU
2 Reading and Assignments
- Markov Decision Processes
  - Read AIMA Chapter 17, Sections 1-3.
- This lecture is based on the development in Machine Learning by Tom Mitchell, Chapter 13: Reinforcement Learning.
3 How Might a Mouse Search a Maze for Cheese?
Cheese
- State Space Search?
- As a Constraint Satisfaction Problem?
- Goal-directed Planning?
- Linear Programming?
- What is missing?
4 Ideas in this lecture
- The problem is to accumulate rewards, rather than to achieve goal states.
- The approach is to generate reactive policies for how to act in all situations, rather than plans for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.
5 MDP Examples: TD-Gammon (Tesauro, 1995), Learning Through Reinforcement
- Learns to play Backgammon
- States
  - Board configurations (roughly 10^20)
- Actions
  - Moves
- Rewards
  - +100 if win
  - -100 if lose
  - 0 for all other states
- Trained by playing 1.5 million games against itself.
- Currently roughly equal to the best human player.
6 MDP Examples: Aerial Robotics (Feron et al.), Computing a Solution from a Continuous Model
7 Markov Decision Processes
- Motivation
- What are Markov Decision Processes (MDPs)?
- Models
- Lifetime Reward
- Policies
- Computing Policies From a Model
- Summary
8 MDP Problem
(Figure: agent-environment loop. The agent observes the state and reward from the environment and chooses an action; the process starts in state s0.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
9 MDP Problem: Model
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
10 Markov Decision Processes (MDPs)
- Model
  - Finite set of states, S
  - Finite set of actions, A
  - (Probabilistic) state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
- Process
  - Observe state s_t in S
  - Choose action a_t in A
  - Receive immediate reward r_t
  - State changes to s_{t+1}
(Figure: example state-transition graph with states s0, s1, ..., reward r0, and goal state G; legal transitions are shown, and the reward on unlabeled transitions is 0.)
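For concreteness, a deterministic model like the one above can be written down as two lookup tables. The sketch below is illustrative only; the state names s0, s1, G and actions a, b are hypothetical placeholders, not the exact example in the slides. Later sketches in these notes reuse the names states, actions, delta, and r.
```python
# A deterministic MDP as plain lookup tables (hypothetical toy example).
states = ["s0", "s1", "G"]
actions = ["a", "b"]

# delta[(s, a)] gives the next state; r[(s, a)] gives the immediate reward.
delta = {
    ("s0", "a"): "s1", ("s0", "b"): "s0",
    ("s1", "a"): "G",  ("s1", "b"): "s0",
    ("G", "a"): "G",   ("G", "b"): "G",
}
r = {
    ("s0", "a"): 0,   ("s0", "b"): 0,
    ("s1", "a"): 100, ("s1", "b"): 0,   # entering the goal G pays 100
    ("G", "a"): 0,    ("G", "b"): 0,    # unlabeled transitions pay 0
}
```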
11 MDP Environment Assumptions
- Markov assumption: the next state and reward are a function only of the current state and action.
  - s_{t+1} = δ(s_t, a_t)
  - r_t = r(s_t, a_t)
- Uncertain and unknown environment
  - δ and r may be nondeterministic and unknown.
12 MDP Nondeterministic Example
Today we only consider the deterministic case.
(Figure: nondeterministic example; R = Research, D = Development.)
13 MDP Problem: Model
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
14 MDP Problem: Lifetime Reward
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
15 Lifetime Reward
- Finite horizon
  - Rewards accumulate for a fixed period.
  - 100K + 100K + 100K = 300K
- Infinite horizon
  - Assume reward accumulates forever.
  - 100K + 100K + ... = infinity
- Discounting
  - Future rewards are not worth as much (a bird in the hand ...).
  - Introduce a discount factor γ: 100K + γ 100K + γ² 100K + ... converges.
  - Will make the math work.
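As a quick numeric check on discounting, the sketch below sums a discounted reward stream in Python; the reward values and γ = 0.9 are illustrative.
```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    return sum((gamma ** i) * reward for i, reward in enumerate(rewards))

# A constant stream of 100 per step converges toward 100 / (1 - 0.9) = 1000:
print(discounted_return([100] * 1000))  # ~1000.0
```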
16 MDP Problem: Lifetime Reward
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward:
V = r_0 + γ r_1 + γ² r_2 + ...
17 MDP Problem: Policy
(Figure: the same agent-environment loop.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward:
V = r_0 + γ r_1 + γ² r_2 + ...
18 Assume a deterministic world
- Policy π : S → A
  - Selects an action for each state.
- Optimal policy π* : S → A
  - Selects the action for each state that maximizes lifetime reward.
19
- There are many policies; not all are necessarily optimal.
- There may be several optimal policies.
20 Markov Decision Processes
- Motivation
- What are Markov Decision Processes (MDPs)?
- Models
- Lifetime Reward
- Policies
- Computing Policies From a Model
- Summary
21 Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
- Value Functions
- Mapping Value Functions to Policies
- Computing Value Functions through Value Iteration
- An Alternative: Policy Iteration (appendix)
- Summary
22 Value Function V^π for a Given Policy π
- V^π(s_t) is the accumulated lifetime reward resulting from starting in state s_t and repeatedly executing policy π:
  - V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ...
  - V^π(s_t) = Σ_i γ^i r_{t+i}, where r_t, r_{t+1}, r_{t+2}, ... are generated by following π, starting at s_t.
(Figure: grid-world example showing the V^π value of each state, assuming γ = 0.9.)
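One concrete way to read V^π: follow the (deterministic) policy forward and accumulate discounted rewards. A minimal sketch, assuming the delta/r tables from the earlier model sketch and a policy stored as a dict; the finite horizon approximates the infinite sum.
```python
def rollout_value(s, policy, delta, r, gamma=0.9, horizon=200):
    """Approximate V^pi(s) by executing policy from s and summing discounted rewards."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]                  # action chosen by the policy in state s
        total += discount * r[(s, a)]
        discount *= gamma
        s = delta[(s, a)]              # deterministic transition
    return total
```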
23 An Optimal Policy π* Given Value Function V*
- Idea: given state s,
  - examine all possible actions a_i in state s;
  - select the action a_i with the greatest lifetime reward.
- The lifetime reward Q(s, a_i) is
  - the immediate reward for taking the action, r(s, a),
  - plus the lifetime reward starting in the target state, V*(δ(s, a)),
  - discounted by γ.
- π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
- Must know:
  - Value function
  - Environment model:
    - δ : S × A → S
    - r : S × A → ℝ
(Figure: grid-world example with goal G, showing the V* value of each state and the corresponding optimal actions π*.)
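The one-step lookahead above translates directly into code. A minimal sketch, assuming the same delta/r tables and a value function V stored as a dict from state to value:
```python
def greedy_policy(states, actions, delta, r, V, gamma=0.9):
    """pi(s) = argmax_a [ r(s, a) + gamma * V(delta(s, a)) ] for every state."""
    return {
        s: max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
        for s in states
    }
```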
24 Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
(Figure: the model, with rewards of 100 on the transitions into the goal G, and the value function V*; γ = 0.9.)
25 Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
(Figure: from the current state, action a leads to a state with V* = 100 and action b to a state with V* = 81; γ = 0.9.)
- a: 0 + 0.9 × 100 = 90
- b: 0 + 0.9 × 81 = 72.9
- Select a.
26 Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
(Figure: from the next state, action a enters the goal G with reward 100 and action b leads to a state with V* = 90; γ = 0.9.)
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 90 = 81
- Select a.
(Figure: the partially constructed policy π*.)
27 Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
(Figure: the model, V*, and the resulting policy π* for all states; γ = 0.9.)
28 Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
- Value Functions
- Mapping Value Functions to Policies
- Computing Value Functions through Value Iteration
- An Alternative: Policy Iteration
- Summary
29 Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*_1(s) = max_a r(s, a)
30 Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*_1(s) = max_a r(s, a)
- Optimal value function for a two-step horizon:
  - V*_2(s) = max_a [ r(s, a) + γ V*_1(δ(s, a)) ]
(Figure: from state s, action A with reward R_A leads to state S_A with value V*_1(S_A); action B leads to S_B with value V*_1(S_B); the two-step value discounts these by γ.)
- Instance of the dynamic programming principle:
  - Reuse shared sub-results.
  - Exponential saving.
31 Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*_1(s) = max_a r(s, a)
- Optimal value function for a two-step horizon:
  - V*_2(s) = max_a [ r(s, a) + γ V*_1(δ(s, a)) ]
- Optimal value function for an n-step horizon:
  - V*_n(s) = max_a [ r(s, a) + γ V*_{n-1}(δ(s, a)) ]
32 Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*_1(s) = max_a r(s, a)
- Optimal value function for a two-step horizon:
  - V*_2(s) = max_a [ r(s, a) + γ V*_1(δ(s, a)) ]
- Optimal value function for an n-step horizon:
  - V*_n(s) = max_a [ r(s, a) + γ V*_{n-1}(δ(s, a)) ]
- Optimal value function for an infinite horizon:
  - V*(s) = max_a [ r(s, a) + γ V*(δ(s, a)) ]
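The n-step recursion is a short dynamic program: start from V*_0 = 0 and apply the update n times. A sketch under the same deterministic-model assumptions as the earlier code:
```python
def finite_horizon_values(states, actions, delta, r, gamma=0.9, n=3):
    """Compute V*_n via V*_k(s) = max_a [ r(s, a) + gamma * V*_{k-1}(delta(s, a)) ]."""
    V = {s: 0.0 for s in states}  # V*_0 = 0 for every state
    for _ in range(n):
        V = {
            s: max(r[(s, a)] + gamma * V[delta[(s, a)]] for a in actions)
            for s in states
        }
    return V
```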
33 Solving MDPs by Value Iteration
- Insight: we can calculate optimal values iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]
  - Terminate when values are close enough:
    - |V_{t+1}(s) - V_t(s)| < ε
- Agent selects the optimal action by one-step lookahead on V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
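Putting the pieces together, the value-iteration loop with the termination test might look like the sketch below (same assumed model tables; epsilon is a tunable tolerance). Feeding the result to the earlier greedy_policy sketch recovers π*.
```python
def value_iteration(states, actions, delta, r, gamma=0.9, epsilon=1e-3):
    """Iterate the Bellman update until the largest per-state change is below epsilon."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(r[(s, a)] + gamma * V[delta[(s, a)]] for a in actions)
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < epsilon:
            return V_new
        V = V_new

# One-step lookahead on the converged values gives the policy:
# pi_star = greedy_policy(states, actions, delta, r, value_iteration(states, actions, delta, r))
```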
34 Convergence of Value Iteration
- If we terminate when values are close enough,
  - |V_{t+1}(s) - V_t(s)| < ε,
- then
  - max_{s in S} |V_{t+1}(s) - V*(s)| < 2εγ / (1 - γ).
- Converges in polynomial time.
- Convergence is guaranteed even if updates are performed infinitely often, but asynchronously and in any order.
35 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: grid world with goal G; immediate rewards of 100 are given on the transitions into G, V_t is 0 everywhere, and the update for one state considers actions a and b.)
- a: 0 + 0.9 × 0 = 0
- b: 0 + 0.9 × 0 = 0
- Max: 0
36 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: the same grid; the update for the state next to G considers actions a, b, and c.)
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 0 = 0
- c: 0 + 0.9 × 0 = 0
- Max: 100
37 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: V_{t+1} is filled in for the next state.)
38 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: V_{t+1} is filled in for the next state.)
39 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: V_{t+1} is filled in for the next state.)
40 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: after the first sweep, states whose actions enter G have value 100; all other states are still 0.)
41 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: the next sweep propagates values back from G; 90 = 0.9 × 100 appears.)
42 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: a further sweep produces 81 = 0.9 × 90.)
43 Example of Value Iteration
- V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ], with γ = 0.9
(Figure: the values have converged to V*.)
44 Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
- Value Functions
- Mapping Value Functions to Policies
- Computing Value Functions through Value Iteration
- An Alternative: Policy Iteration (appendix)
- Summary
45 Appendix: Policy Iteration
- Idea: iteratively improve the policy.
  - Policy evaluation: given a policy π_i, calculate V_i = V^{π_i}, the utility of each state if π_i were to be executed.
  - Policy improvement: calculate a new maximum expected utility policy π_{i+1} using one-step lookahead based on V_i.
- π_i improves at every step, converging when π_i = π_{i+1}.
- Computing V_i is simpler than for value iteration (no max):
  - V_{t+1}(s) ← r(s, π_i(s)) + γ V_t(δ(s, π_i(s)))
  - Solve the linear equations in O(N³), or
  - solve iteratively, similar to value iteration.
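A sketch of this loop, using the iterative evaluation variant rather than the O(N³) linear solve; it assumes the same deterministic-model tables as the earlier sketches:
```python
def policy_iteration(states, actions, delta, r, gamma=0.9, epsilon=1e-3):
    """Alternate policy evaluation (no max) with greedy one-step policy improvement."""
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        # Policy evaluation: V(s) <- r(s, pi_i(s)) + gamma * V(delta(s, pi_i(s)))
        V = {s: 0.0 for s in states}
        while True:
            V_new = {s: r[(s, policy[s])] + gamma * V[delta[(s, policy[s])]]
                     for s in states}
            converged = max(abs(V_new[s] - V[s]) for s in states) < epsilon
            V = V_new
            if converged:
                break
        # Policy improvement: one-step lookahead on V_i
        improved = {
            s: max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
            for s in states
        }
        if improved == policy:                        # converged when pi_{i+1} = pi_i
            return policy, V
        policy = improved
```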
46 Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing policies from a model
- Value Iteration
- Policy Iteration
- Summary
47 Markov Decision Processes (MDPs)
- Model
  - Finite set of states, S
  - Finite set of actions, A
  - Probabilistic state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
(Figure: deterministic example.)
48 Crib Sheet: MDPs by Value Iteration
- Insight: we can calculate optimal values iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]
  - Terminate when values are close enough:
    - |V_{t+1}(s) - V_t(s)| < ε
- Agent selects the optimal action by one-step lookahead on V*:
  - π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
49 Ideas in this lecture
- The objective is to accumulate rewards, rather than to achieve goal states.
- Objectives are achieved along the way, rather than at the end.
- The task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.
50 How Might a Mouse Search a Maze for Cheese?
Cheese
- By Value Iteration?
- What is missing?