Title: Dynamic Programming as Sequential Decision Making
1 Dynamic Programming as Sequential Decision Making
- Lecture 20: Dynamic Programming as Sequential Decision Making
  - Sequential Decision Making
  - Example: Inventory Control
  - Value of Information
  - Dynamic Programming Algorithm
  - Principle of Optimality
  - Deterministic Finite State Systems
  - Shortest Paths
- Lecture 18: Introduction to Dynamic Programming
  - Structure of Dynamic Programs
  - Calculating Binomial Coefficient
  - Pascal's Triangle
  - Coins Revisited
- Lecture 19: DNA Sequence Alignment
  - Review Primer of Genome Science Handout
  - Needleman/Wunsch example
  - Review Assignment 5
Reference: Dynamic Programming and Optimal Control, D. Bertsekas
2 Sequential Decision Making
- Given a discrete-time system (difference equation) $x_{k+1} = f_k(x_k, u_k, w_k)$, $k = 0, 1, \dots, N-1$
- $w_k$ are disturbances over which we have no control
  - Model the impact of external influences by independent, identically distributed random variables (think of a roll of the dice)
- $u_k$ are the control variables, which we can choose from an admissible set
Is this sequence well defined?
4 Sequential Decision Making
- Since $w_k$ is random, what does it mean to choose $u_k$ to optimize the value of the cost function?
Does this make sense?
5 Sequential Decision Making
- Given the same discrete-time system, with disturbances $w_k$ and admissible controls $u_k$
- Choose $u_k$ to optimize an additive cost function, for example the expected total cost $E\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \}$
6 Example: Inventory Management
- Inventory management is one primary function of Enterprise Resource Planning (ERP) software
- Model the system:
  - $x_k$ is the stock available at the beginning of period $k$
  - $u_k$ is the stock ordered (and immediately delivered) at the beginning of the $k$th period, $u_k \ge 0$
  - $w_k$ is the demand during the $k$th period, with known probability distribution
  - Excess demand (resulting in negative values of $x$) is backlogged and filled as soon as inventory is available
  - The stock therefore evolves as $x_{k+1} = x_k + u_k - w_k$
7 Example: Inventory Management
- Define the cost function:
  - Pay $r(x_k)$ for either storing excess inventory ($x_k > 0$) or for shortage costs ($x_k < 0$)
  - Pay purchasing cost $c u_k$, where $c$ is the cost per unit ordered
  - Pay an end-of-season cost, $R(x_N)$, for inventory left after $N$ periods
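The model and cost terms above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the linear form of `r`, the rates `holding` and `shortage`, the unit cost `c`, and the default zero end-of-season cost are all assumptions chosen to keep the arithmetic easy to follow.

```python
def r(x, holding=1.0, shortage=3.0):
    """Storage cost for x > 0, shortage (backlog) cost for x < 0 (assumed linear)."""
    return holding * x if x >= 0 else shortage * (-x)


def total_cost(x0, orders, demands, c=2.0, end_of_season=lambda x: 0.0):
    """Cost of one realized demand sequence; returns (cost, final stock x_N)."""
    x, cost = x0, 0.0
    for u, w in zip(orders, demands):
        cost += r(x) + c * u      # pay r(x_k) + c * u_k in period k
        x = x + u - w             # x_{k+1} = x_k + u_k - w_k (negative means backlog)
    return cost + end_of_season(x), x


# One hypothetical season: order one unit per period against demands 0, 1, 2.
cost, x_final = total_cost(x0=0, orders=[1, 1, 1], demands=[0, 1, 2])
```

Because the demands are random, a single run like this gives only one sample of the cost; the quantity to minimize is its expectation over all demand sequences.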
8 Example: Inventory Management
- Want to choose $(u_0, u_1, \dots, u_{N-1})$ to minimize the total expected cost. Two important ways of doing this:
  - Open-loop control: decide $(u_0, u_1, \dots, u_{N-1})$ all at once, before any demand is observed
  - Closed-loop control: use $x_k$ to improve each decision. Need to find a function at each time $k$ that maps $x_k$ to $u_k$, $u_k = \mu_k(x_k)$. Decide the functions $(\mu_0, \mu_1, \dots, \mu_{N-1})$ all at once, not the values $u_k$
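The difference between the two approaches can be checked numerically on a tiny instance. The sketch below is an illustration under stated assumptions: the demand distribution `DEMAND`, the admissible orders `ORDERS`, the horizon `N`, the unit cost `C`, the linear `r`, and the end-of-season cost $R(x) = r(x)$ are all made up for the example. It enumerates every open-loop order sequence and compares the best one with the closed-loop cost obtained by letting each order depend on the observed stock.

```python
from itertools import product

DEMAND = {0: 0.1, 1: 0.7, 2: 0.2}   # assumed P(w_k = w)
ORDERS = (0, 1, 2)                  # assumed admissible orders u_k
N, C = 2, 1.0                       # assumed horizon and unit purchase cost


def r(x):
    """Assumed storage/shortage cost: 1 per unit stored, 3 per unit backlogged."""
    return 1.0 * x if x >= 0 else 3.0 * (-x)


def open_loop_cost(x0, orders):
    """Expected cost when all orders are fixed before any demand is seen."""
    expected = 0.0
    for demands in product(DEMAND, repeat=N):
        prob, x, cost = 1.0, x0, 0.0
        for u, w in zip(orders, demands):
            prob *= DEMAND[w]
            cost += r(x) + C * u
            x = x + u - w
        expected += prob * (cost + r(x))   # end-of-season cost R(x_N) = r(x_N), assumed
    return expected


def closed_loop_cost(x, k=0):
    """Expected cost when u_k may depend on the observed stock x_k."""
    if k == N:
        return r(x)
    return r(x) + min(
        C * u + sum(p * closed_loop_cost(x + u - w, k + 1) for w, p in DEMAND.items())
        for u in ORDERS
    )


best_open = min(open_loop_cost(0, orders) for orders in product(ORDERS, repeat=N))
best_closed = closed_loop_cost(0)
value_of_information = best_open - best_closed   # never negative
```

Any open-loop sequence is also a (constant) closed-loop policy, so the closed-loop cost can never be worse; the gap between the two is one way to quantify the value of observing $x_k$.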
9 Example: Inventory Management
- Want to choose $(u_0, u_1, \dots, u_{N-1})$ to minimize the total expected cost $E\{ R(x_N) + \sum_{k=0}^{N-1} ( r(x_k) + c u_k ) \}$
10 Value of Information
11 Dynamic Programming Algorithm
- Principle of Optimality (why DP works)
  - Let $\pi = \{\mu_0, \mu_1, \dots, \mu_{N-1}\}$ be an optimal policy for the basic problem
  - Suppose that when using $\pi$, a state $x_i$ occurs at time $i$ with positive probability
  - Consider the subproblem starting from $x_i$ at time $i$, minimizing the cost-to-go from $i$ to $N$
  - Then the truncated policy $\{\mu_i, \mu_{i+1}, \dots, \mu_{N-1}\}$ is optimal for the subproblem
Why?
12 Optimality in Driving
Shortest route from AF to Provo passes through Orem, so ...
13 Optimality in Driving
Shortest route from AF to Orem follows the shortest route from AF to Provo.
14 Dynamic Programming Algorithm
- Principle of Optimality (why DP works): if $\pi = \{\mu_0, \mu_1, \dots, \mu_{N-1}\}$ is an optimal policy and a state $x_i$ occurs at time $i$ with positive probability, then the truncated policy $\{\mu_i, \mu_{i+1}, \dots, \mu_{N-1}\}$ is optimal for the subproblem of minimizing the cost-to-go from $i$ to $N$ starting from $x_i$
Prove it!
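15 Dynamic Programming Algorithm
- Reconsider the Inventory Management example
- Use the Principle of Optimality, working backwards in time
- Period $N-1$: assume $x_{N-1}$ is given
  - The expected cost is $E\{R(x_N)\} + r(x_{N-1}) + c u_{N-1}$
  - $r(x_{N-1})$ is fixed, regardless of the choice for $u_{N-1}$
  - Choose $u_{N-1} \ge 0$ to minimize:
    $J_{N-1}(x_{N-1}) = r(x_{N-1}) + \min_{u_{N-1} \ge 0} [ c u_{N-1} + E\{R(x_N)\} ] = r(x_{N-1}) + \min_{u_{N-1} \ge 0} [ c u_{N-1} + E\{R(x_{N-1} + u_{N-1} - w_{N-1})\} ]$
  - Need to compute $J_{N-1}$ for all values of $x_{N-1}$; the minimizing order gives $\mu_{N-1}(x_{N-1})$
- Period $N-2$: assume $x_{N-2}$ is given
  - Choose $u_{N-2} \ge 0$ to minimize:
    $J_{N-2}(x_{N-2}) = r(x_{N-2}) + \min_{u_{N-2} \ge 0} [ c u_{N-2} + E\{J_{N-1}(x_{N-2} + u_{N-2} - w_{N-2})\} ]$, which gives $\mu_{N-2}(x_{N-2})$
- Period $k$: $J_k(x_k) = r(x_k) + \min_{u_k \ge 0} [ c u_k + E\{J_{k+1}(x_k + u_k - w_k)\} ]$, which gives $\mu_k(x_k)$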
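The backward recursion for the inventory example can be tabulated directly. The following is a sketch under assumptions that are not in the slides: demand and orders are restricted to the small sets `DEMAND` and `ORDERS`, the costs `r` and `R` are simple linear functions, and `J_k` is tabulated only for the stock levels reachable from the initial stock `x0`.

```python
DEMAND = {0: 0.1, 1: 0.7, 2: 0.2}   # assumed P(w_k = w)
ORDERS = (0, 1, 2)                  # assumed admissible orders u_k
C = 1.0                             # assumed unit purchase cost
MAX_STEP = 2                        # stock changes by at most 2 per period here


def r(x):
    """Assumed storage/shortage cost."""
    return 1.0 * x if x >= 0 else 3.0 * (-x)


def R(x):
    """Assumed end-of-season cost."""
    return r(x)


def backward_dp(N, x0=0):
    """Return (J_0, [mu_0, ..., mu_{N-1}]) as dictionaries keyed by stock level."""
    reachable = lambda k: range(x0 - MAX_STEP * k, x0 + MAX_STEP * k + 1)
    J = {x: R(x) for x in reachable(N)}          # J_N = R
    policy = []
    for k in range(N - 1, -1, -1):               # periods N-1, N-2, ..., 0
        J_k, mu_k = {}, {}
        for x in reachable(k):
            # c*u + E[J_{k+1}(x + u - w)] for each admissible order u
            cost = {u: C * u + sum(p * J[x + u - w] for w, p in DEMAND.items())
                    for u in ORDERS}
            mu_k[x] = min(cost, key=cost.get)
            J_k[x] = r(x) + cost[mu_k[x]]
        J, policy = J_k, [mu_k] + policy
    return J, policy


J0, policy = backward_dp(N=2)
```

Here `policy[k][x]` plays the role of $\mu_k(x_k)$: it is a lookup table from observed stock to order quantity, computed once for every reachable state rather than for a single demand scenario.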
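16 Dynamic Programming Algorithm
- Theorem: For every initial state $x_0$, the optimal cost $J^*(x_0)$ of the basic problem is equal to $J_0(x_0)$, where the function $J_0$ is given by the last step of the following algorithm, which proceeds backward in time from period $N-1$ to period 0:
  $J_N(x_N) = g_N(x_N)$
  $J_k(x_k) = \min_{u_k} E_{w_k}\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \}$, $k = 0, 1, \dots, N-1$, with the minimum taken over the admissible controls
- The $u_k$ that minimizes the right-hand side, given $x_k$, is a function $\mu_k(x_k)$ for each $k$, and the policy $\pi = \{\mu_0, \mu_1, \dots, \mu_{N-1}\}$ is optimal.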
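For a finite state space, the algorithm in the theorem can be written once and reused. The sketch below is one possible generic solver, not a reference implementation: the function name `solve_dp` and its argument names are hypothetical, and the usage example is a small capacity-limited inventory instance (stock capped at 2, unmet demand lost rather than backlogged, quadratic storage/shortage cost) whose numbers are assumptions chosen so that the state set stays finite.

```python
def solve_dp(N, states, controls, disturbance, f, g, g_N):
    """Backward recursion J_k(x) = min_u E_w[g(k,x,u,w) + J_{k+1}(f(k,x,u,w))].

    disturbance maps each value w to its probability; controls(k, x) yields the
    admissible controls in state x at time k. Returns (J_0, [mu_0, ..., mu_{N-1}]).
    """
    J = {x: g_N(x) for x in states}              # J_N = terminal cost
    policy = [None] * N
    for k in reversed(range(N)):
        J_k, mu_k = {}, {}
        for x in states:
            expected = {
                u: sum(p * (g(k, x, u, w) + J[f(k, x, u, w)])
                       for w, p in disturbance.items())
                for u in controls(k, x)
            }
            mu_k[x] = min(expected, key=expected.get)
            J_k[x] = expected[mu_k[x]]
        J, policy[k] = J_k, mu_k
    return J, policy


# Hypothetical instance: stock in {0, 1, 2}, three periods, no terminal cost.
J0, policy = solve_dp(
    N=3,
    states=range(3),
    controls=lambda k, x: range(3 - x),          # keep x + u <= 2
    disturbance={0: 0.1, 1: 0.7, 2: 0.2},
    f=lambda k, x, u, w: max(0, x + u - w),      # unmet demand is lost
    g=lambda k, x, u, w: u + (x + u - w) ** 2,   # order cost plus quadratic penalty
    g_N=lambda x: 0.0,
)
```

The work grows with (number of periods) x (number of states) x (number of controls) x (number of disturbance values), which is why the backward recursion is far cheaper than enumerating every policy.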
17 Special Case: Deterministic Finite-State Systems
- Suppose $w_k$ is fixed to take on only one value
  - There is no value of information: with $u_k = \mu_k(x_k)$, $x_k$ is determined from $x_0$ and the previous controls
  - No need for feedback
  - Choose $u_0, u_1, \dots, u_{N-1}$ directly, instead of $\mu_0, \mu_1, \dots, \mu_{N-1}$
- Suppose the state space is finite, i.e. $x_k$ is chosen from a finite set for every $k$
  - Given a state $x_k$, a control $u_k$ is associated with the transition $f_k(x_k, u_k)$ and a cost $g_k(x_k, u_k)$
- Equivalently represented as a graph
  - Nodes are states
  - Edges are transitions
  - Every edge has a cost associated with it
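The graph view can be made concrete with a small staged graph. This is a sketch with made-up data: the node names, control names, and edge costs are hypothetical, and `shortest_path_dp` is an illustrative helper rather than anything defined in the lecture. Each stage maps a state to its outgoing edges, keyed by control, with the successor state and edge cost.

```python
def shortest_path_dp(stages, terminal_cost):
    """Backward DP on a staged graph.

    stages[k][x] maps each control u to (next_state, edge_cost).
    Returns (J_0, [mu_0, ..., mu_{N-1}]).
    """
    J = dict(terminal_cost)
    plan = []
    for transitions in reversed(stages):
        J_k, mu_k = {}, {}
        for x, edges in transitions.items():
            # cost of taking edge u now, then continuing optimally
            u = min(edges, key=lambda u: edges[u][1] + J[edges[u][0]])
            mu_k[x] = u
            J_k[x] = edges[u][1] + J[edges[u][0]]
        J, plan = J_k, [mu_k] + plan
    return J, plan


# Hypothetical two-stage graph.
stages = [
    {"A": {"up": ("B", 1), "down": ("C", 4)}},
    {"B": {"up": ("D", 5), "down": ("E", 1)},
     "C": {"up": ("D", 1), "down": ("E", 3)}},
]
terminal_cost = {"D": 0, "E": 2}
J0, plan = shortest_path_dp(stages, terminal_cost)

# With no randomness, the policy collapses to a fixed control sequence:
x, controls = "A", []
for k, transitions in enumerate(stages):
    u = plan[k][x]
    controls.append(u)
    x = transitions[x][u][0]
```

Rolling the policy forward from the initial state yields the whole control sequence in advance, which is the sense in which feedback is unnecessary in the deterministic case.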
18 Special Case: Deterministic Finite-State Systems
DP = shortest path!
19 Special Case: Deterministic Finite-State Systems
[Figure: coin-change problem drawn as a staged graph running from an initial state at Stage 0 through Stage 1, Stage 2, ..., Stage N-1 to an artificial terminal node at Stage N; labels in the figure include the coins Senine, Seon, Shum, and Limnah]
Amount to make change: N = 7
Edge weights are 1 except when transitioning from zero to zero, in which case they're zero.
Shortest path is optimal!
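The coin-change graph can be solved with the same backward recursion. The sketch below assumes the state is the amount still owed, that paying one coin is an edge of weight 1, and that staying at zero is an edge of weight 0, as the slide describes. The coin denominations passed in the usage lines are assumptions for illustration; the figure's actual values are not recoverable from the text.

```python
import math


def min_coins(amount, coins):
    """Fewest coins summing to `amount`, via backward DP on a staged graph."""
    stages = amount                     # enough stages when the smallest coin is 1
    # Terminal cost: only an owed amount of zero is acceptable at the last stage.
    J = {a: (0 if a == 0 else math.inf) for a in range(amount + 1)}
    for _ in range(stages):
        J_next, J = J, {}
        for a in range(amount + 1):
            if a == 0:
                J[a] = 0 + J_next[0]    # zero-to-zero edge has weight 0
            else:
                # pay one coin c (edge weight 1) and continue from a - c
                J[a] = min((1 + J_next[a - c] for c in coins if c <= a),
                           default=math.inf)
    return J[amount]


# Hypothetical denominations.
with_large_coin = min_coins(7, (1, 2, 4, 7))
without_large_coin = min_coins(7, (1, 2, 4))
```

The zero-weight edges at the zero state pad every path to the same number of stages, so paths that use different numbers of coins can be compared within one staged graph.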
20 Dynamic Programming as Sequential Decision Making
- Life can only be understood going backwards,
- But it must be lived going forwards.
- Kierkegaard