Markov Decision Processes: Reactive Planning to Maximize Reward

1
Markov Decision Processes: Reactive Planning to Maximize Reward
Brian C. Williams, 16.410, November 8th, 2004
Slides adapted from Manuela Veloso, Reid
Simmons, Tom Mitchell, CMU
2
Reading and Assignments
  • Markov Decision Processes
  • Read AIMA Chapter 17, Sections 1-3.
  • This lecture is based on the development in Machine
    Learning by Tom Mitchell, Chapter 13:
    Reinforcement Learning.

3
How Might a Mouse Search a Maze for Cheese?
(Figure: a maze with cheese as the goal.)
  • State Space Search?
  • As a Constraint Satisfaction Problem?
  • Goal-directed Planning?
  • Linear Programming?
  • What is missing?

4
Ideas in this lecture
  • Problem is to accumulate rewards, rather than to
    achieve goal states.
  • Approach is to generate reactive policies for how
    to act in all situations, rather than plans for a
    single starting situation.
  • Policies fall out of value functions, which
    describe the greatest lifetime reward achievable
    at every state.
  • Value functions are iteratively approximated.

5
MDP Examples: TD-Gammon [Tesauro, 1995]: Learning Through Reinforcement
  • Learns to play Backgammon
  • States
  • Board configurations (~10^20)
  • Actions
  • Moves
  • Rewards
  • +100 if win
  • -100 if lose
  • 0 for all other states
  • Trained by playing 1.5 million games against
    itself.
  • Currently roughly equal to the best human player.

6
MDP Examples: Aerial Robotics [Feron et al.]: Computing a Solution from a Continuous Model
7
Markov Decision Processes
  • Motivation
  • What are Markov Decision Processes (MDPs)?
  • Models
  • Lifetime Reward
  • Policies
  • Computing Policies From a Model
  • Summary

8
MDP Problem
(Figure: agent-environment loop: the agent observes the state and a reward, and chooses actions that affect the environment, starting from state s0.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
9
MDP Problem: Model
(Figure: the same agent-environment loop, with the environment model highlighted.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
10
Markov Decision Processes (MDPs)
  • Model
  • Finite set of states, S
  • Finite set of actions, A
  • (Probabilistic) state transitions, δ(s,a)
  • Reward for each state and action, R(s,a)
  • Process
  • Observe state s_t in S
  • Choose action a_t in A
  • Receive immediate reward r_t
  • State changes to s_{t+1}

(Figure: example state graph with states s0, s1, ..., reward r0, and goal G. Legal transitions are shown; reward on unlabeled transitions is 0. A concrete version of such a model is sketched in code below.)
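A minimal sketch of how such a deterministic model could be written down in Python. The states, actions, transition table delta, and reward table r are hypothetical, loosely in the spirit of the small grid examples used in these slides.

    # A hypothetical 3-state deterministic MDP with an absorbing goal state "G".
    # delta maps (state, action) -> next state; r maps (state, action) -> reward.
    states = ["s0", "s1", "G"]
    actions = ["right", "up"]

    delta = {
        ("s0", "right"): "s1",
        ("s0", "up"):    "s0",   # bumping into a wall leaves the state unchanged
        ("s1", "right"): "G",
        ("s1", "up"):    "s1",
        ("G",  "right"): "G",    # G is absorbing
        ("G",  "up"):    "G",
    }

    r = {(s, a): 0 for s in states for a in actions}
    r[("s1", "right")] = 100     # reward for entering the goal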
11
MDP Environment Assumptions
  • Markov Assumption: the next state and reward are a
    function only of the current state and action
  • s_{t+1} = δ(s_t, a_t)
  • r_t = r(s_t, a_t)
  • Uncertain and Unknown Environment
  • δ and r may be nondeterministic and unknown

12
MDP Nondeterministic Example
Today we only consider the deterministic case.
(Figure: two-state nondeterministic example; R = Research, D = Development.)
13
MDP Problem: Model
(Figure: the agent-environment loop, with the environment model highlighted.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
14
MDP Problem: Lifetime Reward
(Figure: the agent-environment loop, with the reward highlighted.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
15
Lifetime Reward
  • Finite horizon
  • Rewards accumulate for a fixed period.
  • 100K + 100K + 100K = 300K
  • Infinite horizon
  • Assume rewards accumulate forever.
  • 100K + 100K + ... = infinity
  • Discounting
  • Future rewards are not worth as much (a bird in the
    hand...)
  • Introduce a discount factor γ: 100K + γ·100K + γ²·100K
    + ... converges (see the numeric check below).
  • This will make the math work.
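As a quick numeric check of why discounting converges (a minimal Python sketch; the reward of 100 and γ = 0.9 are just illustrative), the partial sums of r + γr + γ²r + ... approach r / (1 - γ) whenever 0 ≤ γ < 1:

    gamma, reward = 0.9, 100

    # Truncated series vs. the closed form r / (1 - gamma)
    total = sum(reward * gamma**i for i in range(1000))
    print(total)                  # ~1000.0
    print(reward / (1 - gamma))   # closed form: 100 / (1 - 0.9) = 1000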

16
MDP Problem: Lifetime Reward
(Figure: the agent-environment loop, with the reward highlighted.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime
reward V = r_0 + γ r_1 + γ² r_2 + ...
17
MDP Problem: Policy
(Figure: the agent-environment loop, with the agent's policy highlighted.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime
reward V = r_0 + γ r_1 + γ² r_2 + ...
18
Assume a deterministic world
  • Policy π: S → A
  • Selects an action for each state.
  • Optimal policy π*: S → A
  • Selects the action for each state that maximizes
    lifetime reward.

19
  • There are many policies, not all are necessarily
    optimal.
  • There may be several optimal policies.

20
Markov Decision Processes
  • Motivation
  • What are Markov Decision Processes (MDPs)?
  • Models
  • Lifetime Reward
  • Policies
  • Computing Policies From a Model
  • Summary

21
Markov Decision Processes
  • Motivation
  • Markov Decision Processes
  • Computing Policies From a Model
  • Value Functions
  • Mapping Value Functions to Policies
  • Computing Value Functions through Value Iteration
  • An Alternative: Policy Iteration (appendix)
  • Summary

22
Value Function V^π for a Given Policy π
  • V^π(s_t) is the accumulated lifetime reward
    resulting from starting in state s_t and
    repeatedly executing policy π.
  • V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ...
  • V^π(s_t) = Σ_i γ^i r_{t+i}, where r_t, r_{t+1},
    r_{t+2}, ... are generated by following π,
    starting at s_t (a rollout computation of this sum
    is sketched below).
(Figure: grid world annotated with V^π values 9, 9, 10, 10, 10, and 0, assuming γ = 0.9.)
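For a deterministic model and policy, V^π(s) can be approximated by simply rolling the policy forward and discounting, as in this minimal sketch (delta, r, and the policy dictionary pi are hypothetical, in the style of the model sketch on slide 10):

    def policy_value(s, pi, delta, r, gamma=0.9, horizon=200):
        """Approximate V^pi(s) by rolling out the deterministic policy pi."""
        total, discount = 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite sum
            a = pi[s]
            total += discount * r[(s, a)]
            discount *= gamma
            s = delta[(s, a)]
        return total

    # With the hypothetical model above and the policy "always move right":
    # pi = {"s0": "right", "s1": "right", "G": "right"}
    # policy_value("s0", pi, delta, r)  ->  0 + 0.9*100 + 0 + ... = 90.0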
23
An Optimal Policy π* Given Value Function V*
  • Idea: given state s,
  • examine all possible actions a_i in state s;
  • select the action a_i with the greatest lifetime reward.
  • The lifetime reward Q(s, a_i) is
  • the immediate reward for taking the action, r(s, a),
  • plus the lifetime reward starting in the target state,
    V*(δ(s, a)),
  • discounted by γ.
  • π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
    (sketched in code below)
  • Must know:
  • the value function, and
  • the environment model:
  • δ: S × A → S
  • r: S × A → ℝ
(Figure: grid world showing the V* values and the resulting policy π*, with goal G.)
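A minimal sketch of this one-step lookahead in Python, again assuming the hypothetical delta and r tables and a value function V stored as a dictionary:

    def greedy_policy(states, actions, delta, r, V, gamma=0.9):
        """Extract pi*(s) = argmax_a [ r(s,a) + gamma * V(delta(s,a)) ]."""
        pi = {}
        for s in states:
            q = {a: r[(s, a)] + gamma * V[delta[(s, a)]] for a in actions}
            pi[s] = max(q, key=q.get)   # action with the greatest lifetime reward
        return pi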
24
Example: Mapping Value Function to Policy
  • Agent selects the optimal action from V*
  • π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

(Figure: the model and its value function V*, γ = 0.9; rewards of 100 are shown on the transitions into the goal.)
25
Example: Mapping Value Function to Policy
  • Agent selects the optimal action from V*
  • π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

(Figure: the model and V*, γ = 0.9. From the current state, action a leads to the state with V* = 100 and action b leads to the state with V* = 81; both immediate rewards are 0.)
  • a: 0 + 0.9 × 100 = 90
  • b: 0 + 0.9 × 81 = 72.9
  • Select a.
26
Example: Mapping Value Function to Policy
  • Agent selects the optimal action from V*
  • π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

(Figure: the model and V*, γ = 0.9. From the current state, action a enters the goal G (immediate reward 100, V* = 0) and action b leads to the state with V* = 90; the partial policy π* built so far is also shown.)
  • a: 100 + 0.9 × 0 = 100
  • b: 0 + 0.9 × 90 = 81
  • Select a.
27
Example: Mapping Value Function to Policy
  • Agent selects the optimal action from V*
  • π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

(Figure: the model and V*, γ = 0.9; the next state under consideration has three available actions, a, b, and c, and the partial policy π* is shown.)
  • a = ?
  • b = ?
  • c = ?
  • Select ?
28
Markov Decision Processes
  • Motivation
  • Markov Decision Processes
  • Computing Policies From a Model
  • Value Functions
  • Mapping Value Functions to Policies
  • Computing Value Functions through Value Iteration
  • An Alternative: Policy Iteration
  • Summary

29
Value Function V* for an Optimal Policy π*:
Example
  • Optimal value function for a one-step horizon

V*_1(s) = max_{a_i} r(s, a_i)
30
Value Function V* for an Optimal Policy π*:
Example
  • Optimal value function for a one-step horizon
  • V*_1(s) = max_{a_i} r(s, a_i)
  • Optimal value function for a two-step horizon

V*_2(s) = max_{a_i} [ r(s, a_i) + γ V*_1(δ(s, a_i)) ]
(Figure: one-step lookahead tree from state s: action A with reward R_A leads to state S_A with value V*_1(S_A); action B leads to S_B with value V*_1(S_B).)
  • Instance of the Dynamic Programming Principle
  • Reuse shared sub-results
  • Exponential savings (see the memoized sketch below)

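A minimal sketch of this reuse in Python: the n-step values V*_n(s) are computed recursively, and memoizing on (s, n) is what lets shared sub-results be reused instead of re-expanding an exponential lookahead tree (delta and r are again the hypothetical tables from the earlier sketch):

    from functools import lru_cache

    def make_V_n(actions, delta, r, gamma=0.9):
        @lru_cache(maxsize=None)
        def V(s, n):
            """Optimal n-step-horizon value V*_n(s)."""
            if n == 0:
                return 0.0
            # Shared sub-results V(s', n-1) are computed once and reused.
            return max(r[(s, a)] + gamma * V(delta[(s, a)], n - 1)
                       for a in actions)
        return V

    # V = make_V_n(actions, delta, r)
    # V("s0", 2)  ->  0 + 0.9 * 100 = 90.0 for the hypothetical model above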
31
Value Function V* for an Optimal Policy π*:
Example
  • Optimal value function for a one-step horizon
  • V*_1(s) = max_{a_i} r(s, a_i)
  • Optimal value function for a two-step horizon
  • V*_2(s) = max_{a_i} [ r(s, a_i) + γ V*_1(δ(s, a_i)) ]
  • Optimal value function for an n-step horizon

V*_n(s) = max_{a_i} [ r(s, a_i) + γ V*_{n-1}(δ(s, a_i)) ]
32
Value Function V* for an Optimal Policy π*:
Example
  • Optimal value function for a one-step horizon
  • V*_1(s) = max_{a_i} r(s, a_i)
  • Optimal value function for a two-step horizon
  • V*_2(s) = max_{a_i} [ r(s, a_i) + γ V*_1(δ(s, a_i)) ]
  • Optimal value function for an n-step horizon
  • V*_n(s) = max_{a_i} [ r(s, a_i) + γ V*_{n-1}(δ(s, a_i)) ]
  • Optimal value function for an infinite horizon

V*(s) = max_{a_i} [ r(s, a_i) + γ V*(δ(s, a_i)) ]
33
Solving MDPs by Value Iteration
  • Insight: optimal values can be calculated
    iteratively using dynamic programming.
  • Algorithm
  • Iteratively update values using Bellman's equation:
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]
  • Terminate when values are close enough:
  • |V_{t+1}(s) - V_t(s)| < ε
  • Agent selects the optimal action by one-step
    lookahead on V*:
  • π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
    (a Python sketch follows below)
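A minimal value-iteration sketch under the same assumptions as the earlier snippets (deterministic model stored in the hypothetical delta and r dictionaries):

    def value_iteration(states, actions, delta, r, gamma=0.9, epsilon=1e-6):
        """Iterate V_{t+1}(s) <- max_a [ r(s,a) + gamma * V_t(delta(s,a)) ]."""
        V = {s: 0.0 for s in states}
        while True:
            V_new = {
                s: max(r[(s, a)] + gamma * V[delta[(s, a)]] for a in actions)
                for s in states
            }
            if max(abs(V_new[s] - V[s]) for s in states) < epsilon:
                return V_new
            V = V_new

    # With the hypothetical model above:
    # value_iteration(states, actions, delta, r) -> {"s0": 90.0, "s1": 100.0, "G": 0.0}
    # greedy_policy(states, actions, delta, r, V) then recovers pi*.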

34
Convergence of Value Iteration
  • If we terminate when values are close enough:
  • |V_{t+1}(s) - V_t(s)| < ε
  • then:
  • max_{s in S} |V_{t+1}(s) - V*(s)| < 2εγ / (1 - γ)
  • Converges in polynomial time.
  • Convergence is guaranteed even if updates are
    performed infinitely often, but asynchronously
    and in any order.
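As a concrete instance of this bound (the numbers are purely illustrative): with γ = 0.9 and a termination threshold of ε = 1, the guarantee is max_s |V_{t+1}(s) - V*(s)| < 2 · 1 · 0.9 / (1 - 0.9) = 18.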

35
Example of Value Iteration
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]

γ = 0.9
(Figure: the V_t and V_{t+1} grids; all values start at 0, and rewards of 100 label the transitions into the goal G. The state being updated has actions a and b.)
  • a: 0 + 0.9 × 0 = 0
  • b: 0 + 0.9 × 0 = 0
  • Max: 0

36
Example of Value Iteration
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]

γ = 0.9
(Figure: the V_t and V_{t+1} grids; the state adjacent to the goal G has actions a (into G), b, and c.)
  • a: 100 + 0.9 × 0 = 100
  • b: 0 + 0.9 × 0 = 0
  • c: 0 + 0.9 × 0 = 0
  • Max: 100

37
Example of Value Iteration
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]

γ = 0.9
(Figure: the V_t and V_{t+1} grids after the state entering G has been updated to 100; the state being updated here has action a.)
  • a: 0 + 0.9 × 0 = 0
  • Max: 0

38
Example of Value Iteration
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]

γ = 0.9
(Figure: the V_t and V_{t+1} grids after this sweep of updates.)
39
Example of Value Iteration
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]

γ = 0.9
(Figure: the V_t and V_{t+1} grids after this sweep of updates.)
40
Example of Value Iteration
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]

γ = 0.9
(Figure: the V_t and V_{t+1} grids as the value of 100 propagates through the next sweep.)
41
Example of Value Iteration
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]

γ = 0.9
(Figure: the V_t and V_{t+1} grids; values of 100 and 90 appear around the goal G.)
42
Example of Value Iteration
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]

γ = 0.9
(Figure: the V_t and V_{t+1} grids; values of 100, 90, and 81 appear around the goal G.)
43
Example of Value Iteration
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]

γ = 0.9
(Figure: the V_t and V_{t+1} grids after convergence: values of 100, 90, and 81, with 0 at the goal G.)
44
Markov Decision Processes
  • Motivation
  • Markov Decision Processes
  • Computing Policies From a Model
  • Value Functions
  • Mapping Value Functions to Policies
  • Computing Value Functions through Value Iteration
  • An Alternative: Policy Iteration (appendix)
  • Summary

45
Appendix: Policy Iteration
  • Idea: iteratively improve the policy.
  • Policy evaluation: given a policy π_i, calculate
    V_i = V^{π_i}, the utility of each state if π_i were
    to be executed.
  • Policy improvement: calculate a new maximum
    expected utility policy π_{i+1} using one-step
    lookahead based on V_i.
  • π_i improves at every step, converging when π_i =
    π_{i+1}.
  • Computing V_i is simpler than in value iteration
    (no max):
  • V_{t+1}(s) ← r(s, π_i(s)) + γ V_t(δ(s, π_i(s)))
  • Solve the linear equations in O(N³), or
  • solve iteratively, similar to value iteration
    (a sketch follows below).
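A minimal policy-iteration sketch under the same assumptions as before (deterministic hypothetical model; iterative policy evaluation rather than the O(N³) linear solve):

    def policy_iteration(states, actions, delta, r, gamma=0.9, eval_sweeps=100):
        pi = {s: actions[0] for s in states}          # arbitrary initial policy
        while True:
            # Policy evaluation:
            # V_{t+1}(s) <- r(s, pi(s)) + gamma * V_t(delta(s, pi(s)))
            V = {s: 0.0 for s in states}
            for _ in range(eval_sweeps):
                V = {s: r[(s, pi[s])] + gamma * V[delta[(s, pi[s])]]
                     for s in states}
            # Policy improvement: one-step lookahead on V
            pi_new = {
                s: max(actions,
                       key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
                for s in states
            }
            if pi_new == pi:                          # converged: pi_{i+1} = pi_i
                return pi, V
            pi = pi_new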

46
Markov Decision Processes
  • Motivation
  • Markov Decision Processes
  • Computing policies from a model
  • Value Iteration
  • Policy Iteration
  • Summary

47
Markov Decision Processes (MDPs)
  • Model
  • Finite set of states, S
  • Finite set of actions, A
  • Probabilistic state transitions, δ(s,a)
  • Reward for each state and action, R(s,a)

(Figure: deterministic example.)
48
Crib Sheet: MDPs by Value Iteration
  • Insight: optimal values can be calculated
    iteratively using dynamic programming.
  • Algorithm
  • Iteratively update values using Bellman's equation:
  • V_{t+1}(s) ← max_a [ r(s, a) + γ V_t(δ(s, a)) ]
  • Terminate when values are close enough:
  • |V_{t+1}(s) - V_t(s)| < ε
  • Agent selects the optimal action by one-step
    lookahead on V*:
  • π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

49
Ideas in this lecture
  • The objective is to accumulate rewards, rather than
    to achieve goal states.
  • Objectives are achieved along the way, rather
    than at the end.
  • The task is to generate policies for how to act in
    all situations, rather than a plan for a single
    starting situation.
  • Policies fall out of value functions, which
    describe the greatest lifetime reward achievable
    at every state.
  • Value functions are iteratively approximated.

50
How Might a Mouse Search a Maze for Cheese?
(Figure: the maze with cheese from the opening slide.)
  • By Value Iteration?
  • What is missing?