Title: Online Sampling for Markov Decision Processes
1. Online Sampling for Markov Decision Processes
- Bob Givan
- Joint work w/ E. K. P. Chong, H. Chang, G. Wu
Electrical and Computer Engineering, Purdue University
2. Markov Decision Process (MDP)
- Ingredients
- System state x in state space X
- Control action a in A(x)
- Reward R(x,a)
- State-transition probability P(x,y,a)
- Find a control policy to maximize the objective function
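A minimal sketch of these ingredients in Python; the two-state space, action set, reward, and transition probabilities below are illustrative placeholders, not the networking domains studied later in the talk.

import random

# Toy MDP: states, admissible actions A(x), reward R(x,a), and a sampler
# for the transition law P(x,y,a). All numbers are illustrative.
STATES = ["x0", "x1"]

def actions(x):
    # A(x): admissible actions at state x (here the same two everywhere)
    return ["a0", "a1"]

def reward(x, a):
    # R(x,a): immediate reward for taking action a in state x
    return 1.0 if (x, a) == ("x0", "a0") else 0.0

def transition(x, a):
    # Sample a next state y according to P(x, . , a)
    p_stay = 0.8 if a == "a0" else 0.4
    return x if random.random() < p_stay else random.choice(STATES)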
3. Optimal Policies
- Policy: mapping from state and time to actions
- Stationary policy: mapping from state to actions
- Goal: a policy maximizing the objective function
- V_H(x_0) = max_u Obj[ R(x_0,a_0), ..., R(x_{H-1},a_{H-1}) ],
  where the max is over all policies u = (u_0, ..., u_{H-1})
- For large H, a_0 is independent of H (with an ergodicity assumption)
- Stationary optimal action a_0 for H → ∞ via receding-horizon control
4. Q-Values
- Fix a large H, focus on finite-horizon reward
- Define Q(x,a) = R(x,a) + E[ V_{H-1}(y) ]
- Utility of action a at state x
- Name: Q-value of action a at state x
- Key identities (Bellman's equations)
- V_H(x) = max_a Q(x,a)
- u_0(x) = argmax_a Q(x,a)
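To make the identities concrete, here is a small exact finite-horizon Bellman backup in Python; the two-state reward and transition tables are assumed toy values, not part of the talk's domains.

# Exact finite-horizon dynamic program for V_H, Q, and u_0 on a tiny
# explicit MDP, taking V_0 = 0.
STATES = [0, 1]
ACTIONS = [0, 1]
R = {(x, a): float(x == a) for x in STATES for a in ACTIONS}              # R(x,a)
P = {(x, a): {y: 0.5 for y in STATES} for x in STATES for a in ACTIONS}   # P(x,y,a)

def bellman(H):
    V = {x: 0.0 for x in STATES}                                          # V_0(x) = 0
    for _ in range(H):
        # Q(x,a) = R(x,a) + E[ V_{h-1}(y) ]
        Q = {(x, a): R[x, a] + sum(P[x, a][y] * V[y] for y in STATES)
             for x in STATES for a in ACTIONS}
        # V_h(x) = max_a Q(x,a)
        V = {x: max(Q[x, a] for a in ACTIONS) for x in STATES}
    # u_0(x) = argmax_a Q(x,a) at the full horizon
    u0 = {x: max(ACTIONS, key=lambda a: Q[x, a]) for x in STATES}
    return V, Q, u0

V, Q, u0 = bellman(H=10)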
5. Solution Methods
- Recall
- u_0(x) = argmax_a Q(x,a)
- Q(x,a) = R(x,a) + E[ V_{H-1}(y) ]
- Problem: the Q-value depends on the optimal policy.
- State space is extremely large (often continuous)
- Two-pronged solution approach
- Apply a receding-horizon method
- Estimate Q-values via simulation/sampling
6. Methods for Q-Value Estimation
- Previous work by other authors
- Unbiased sampling (exact Q-value) [Kearns et al., IJCAI-99]
- Policy rollout (lower bound) [Bertsekas and Castanon, 1999]
- Our techniques
- Hindsight optimization (upper bound)
- Parallel rollout (lower bound)
7. Expectimax Tree for V
8. Unbiased Sampling
9. Unbiased Sampling (Cont'd)
- For a given desired accuracy, how large should the sampling width and depth be?
- Answered by Kearns, Mansour, and Ng (1999)
- Requires prohibitive sampling width and depth
- e.g., C ≈ 10^8, H_s > 60 to distinguish the best and worst policies in our scheduling domain
- We evaluate with smaller width and depth
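A sketch of this style of estimator, reusing the toy actions, reward, and transition functions from the earlier MDP sketch; width plays the role of C and depth the role of the sampling horizon, and the cost growing as (C·|A|)^depth is what makes the required values prohibitive.

# Recursive sparse-sampling estimate of Q(x,a) in the style of Kearns,
# Mansour, and Ng (1999): C sampled next states per (state, action) pair,
# recursing to the given depth.
def sampled_q(x, a, depth, width):
    if depth <= 1:
        return reward(x, a)                                # Q_1(x,a) = R(x,a), since V_0 = 0
    total = 0.0
    for _ in range(width):                                 # C next-state samples
        y = transition(x, a)
        total += max(sampled_q(y, b, depth - 1, width) for b in actions(y))
    return reward(x, a) + total / width                    # R(x,a) + estimated E[ V_{depth-1}(y) ]

# Greedy action at state x from the sampled Q-values:
# best = max(actions(x), key=lambda a: sampled_q(x, a, depth=4, width=8))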
10. How to Look Deeper?
11. Policy Rollout
12. Policy Rollout in Equations
- Write V_H^u(y) for the value of following policy u for H steps from state y
- Recall Q(x,a) = R(x,a) + E[ V_{H-1}(y) ]
                = R(x,a) + E[ max_u V_{H-1}^u(y) ]
- Given a base policy u, use
- R(x,a) + E[ V_{H-1}^u(y) ]
- as a lower-bound estimate of the Q-value
- The resulting policy is PI(u), i.e., one step of policy improvement on u, given infinite sampling
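A Monte Carlo sketch of this estimate, again using the toy MDP primitives from above; base_policy stands in for a domain heuristic (such as CM in the scheduling domain), and the trajectory count is an arbitrary choice.

# Rollout estimate of Q(x,a): take action a once, then follow the base
# policy u for the rest of the horizon, and average the sampled returns.
# With enough trajectories this approaches R(x,a) + E[ V_{H-1}^u(y) ],
# a lower bound on the true Q-value.
def rollout_q(x, a, base_policy, horizon, num_traj=32):
    total = 0.0
    for _ in range(num_traj):
        ret = reward(x, a)
        y = transition(x, a)
        for _ in range(horizon - 1):
            b = base_policy(y)                 # action chosen by the base policy u
            ret += reward(y, b)
            y = transition(y, b)
        total += ret
    return total / num_traj

# The rollout policy acts greedily with respect to this estimate:
# act = max(actions(x), key=lambda a: rollout_q(x, a, base_policy, horizon=20))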
13. Policy Rollout (Cont'd)
14. Parallel Policy Rollout
- Generalization of policy rollout, due to Chang, Givan, and Chong, 2000
- Given a set U of base policies, use
- R(x,a) + E[ max_{u∈U} V_{H-1}^u(y) ]
- as an estimate of the Q-value
- More accurate estimate than policy rollout
- Still gives a lower bound on the true Q-value
- Still gives a policy no worse than any in U
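A sketch of the parallel-rollout estimate under the same toy primitives; the max over base policies sits inside the expectation (taken per sampled next state), which is what makes the estimate at least as tight as any single rollout. In practice one would reuse common random numbers across the policies; this sketch samples each rollout independently.

# Simulated H-step return of a fixed policy starting from state y
def policy_return(y, policy, horizon):
    ret = 0.0
    for _ in range(horizon):
        b = policy(y)
        ret += reward(y, b)
        y = transition(y, b)
    return ret

# Parallel-rollout estimate of Q(x,a):
# R(x,a) + E[ max_{u in U} V_{H-1}^u(y) ], with the max taken per sample.
def parallel_rollout_q(x, a, base_policies, horizon, num_traj=32):
    total = 0.0
    for _ in range(num_traj):
        y = transition(x, a)
        best = max(policy_return(y, u, horizon - 1) for u in base_policies)
        total += reward(x, a) + best
    return total / num_traj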
15. Hindsight Optimization Tree View
16. Hindsight Optimization Equations
- Swap Max and Exp in the expectimax tree.
- Solve each offline optimization problem
- O(k·C·f(H)) time, where f(H) is the offline problem complexity
- Jensen's inequality implies upper bounds
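A sketch of the resulting estimator; sample_future and solve_offline are assumed, problem-specific routines (e.g., sampling an arrival trace from the traffic model and solving the offline weighted-loss scheduling problem on it), not functions defined in the talk.

# Hindsight-optimization estimate of Q(x,a): sample complete futures
# (realizations of all randomness over the horizon), solve each resulting
# deterministic offline problem optimally with the first action fixed to a,
# and average the optimal hindsight values. Swapping max and expectation
# this way upper-bounds the true Q-value, by Jensen's inequality.
def hindsight_q(x, a, horizon, num_futures=32):
    total = 0.0
    for _ in range(num_futures):
        future = sample_future(x, horizon)       # assumed helper: one sampled realization
        total += solve_offline(x, a, future)     # assumed helper: optimal H-step reward in hindsight
    return total / num_futures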
17. Hindsight Optimization (Cont'd)
18. Application to Example Problems
- Apply unbiased sampling, policy rollout, parallel rollout, and hindsight optimization to:
- Multi-class deadline scheduling
- Random early dropping
- Congestion control
19. Basic Approach
- Traffic model provides a stochastic description of possible future outcomes
- Method
- Formulate network decision problems as POMDPs by incorporating the traffic model
- Solve the belief-state MDP online using sampling (choose the time scale to allow for computation time)
20. Domain 1: Deadline Scheduling
Objective: minimize weighted loss
21. Domain 2: Random Early Dropping
Objective: minimize delay without sacrificing throughput
22. Domain 3: Congestion Control
23. Traffic Modeling
- A Hidden Markov Model (HMM) for each source
- Note: the state is hidden, so the model is partially observed
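A minimal sketch of such a source in Python; the two hidden states, transition probabilities, and per-state arrival probabilities are illustrative assumptions, not the fitted traffic models used in the experiments.

import random

# Two-state hidden Markov traffic source: the hidden state ("idle" vs.
# "bursty") evolves as a Markov chain and is never observed directly;
# only the per-slot packet arrivals it emits are observed.
TRANS = {"idle":   {"idle": 0.9, "bursty": 0.1},
         "bursty": {"idle": 0.3, "bursty": 0.7}}
ARRIVAL_PROB = {"idle": 0.1, "bursty": 0.8}      # P(arrival | hidden state)

def simulate_source(num_slots, state="idle"):
    arrivals = []
    for _ in range(num_slots):
        arrivals.append(1 if random.random() < ARRIVAL_PROB[state] else 0)
        state = "idle" if random.random() < TRANS[state]["idle"] else "bursty"
    return arrivals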
24. Deadline Scheduling Results
- Non-sampling policies
- EDF: earliest deadline first.
- Deadline sensitive, class insensitive.
- SP: static priority.
- Deadline insensitive, class sensitive.
- CM: current minloss [Givan et al., 2000]
- Deadline and class sensitive.
- Minimizes weighted loss for the current packets.
25. Deadline Scheduling Results
- Objective: minimize weighted loss
- Comparison
- Non-sampling policies
- Unbiased sampling (Kearns et al.)
- Hindsight optimization
- Rollout with CM as base policy
- Parallel rollout
- Results due to H. S. Chang
26-28. Deadline Scheduling Results (result plots)
29. Random Early Dropping Results
- Objective: minimize delay subject to a throughput loss-tolerance
- Comparison
- Candidate policies: RED and buffer-k
- KMN-sampling
- Rollout of buffer-k
- Parallel rollout
- Hindsight optimization
- Results due to H. S. Chang.
30-31. Random Early Dropping Results (result plots)
32. Congestion Control Results
- MDP objective: minimize a weighted sum of throughput, delay, and loss rate
- Fairness is hard-wired
- Comparisons
- PD-k (proportional-derivative control with target queue length k)
- Hindsight optimization
- Rollout of PD-k; parallel rollout
- Results due to G. Wu, in progress
33-36. Congestion Control Results (result plots)
37. Results Summary
- Unbiased sampling cannot cope
- Parallel rollout wins in 2 domains
- Not always equal to simple rollout of one base policy
- Hindsight optimization wins in 1 domain
- Simple policy rollout is the cheapest method
- Poor in domain 1
- Strong in domain 2 with the best base policy, but how to find this policy?
- So-so in domain 3 with any base policy
38. Talk Summary
- Case study of MDP sampling methods
- New methods offering practical improvements
- Parallel policy rollout
- Hindsight optimization
- Systematic methods for using traffic models to help make network control decisions
- Feasibility of real-time implementation depends on the problem timescale
39. Ongoing Research
- Apply to other control problems (different timescales)
- Admission/access control
- QoS routing
- Link bandwidth allotment
- Multiclass connection management
- Problems arising in proxy-services
- Diagnosis and recovery
40. Ongoing Research (Cont'd)
- Alternative traffic models
- Multi-timescale models
- Long-range dependent models
- Closed-loop traffic
- Fluid models
- Learning traffic model online
- Adaptation to changing traffic conditions
41. Congestion Control (Cont'd)
42. Congestion Control Results
43. Hindsight Optimization (Cont'd)
44. Policy Rollout (Cont'd)
[Figure: policy performance vs. base policy]
45. Receding-Horizon Control
- For a large horizon H, the policy is stationary.
- At each time, if the state is x, then apply action
- u(x) = argmax_a Q(x,a)
       = argmax_a [ R(x,a) + E[ V_{H-1}(y) ] ]
- Compute an estimate of the Q-value at each time.
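A sketch of the resulting control loop, with q_estimator standing for any of the sampling estimators above (rollout, parallel rollout, hindsight optimization, or sparse sampling) and the toy actions/transition primitives standing in for the real system.

# Receding-horizon control: at every decision epoch, estimate Q(x,a) for
# each admissible action at the current state, apply the greedy action,
# observe the next state, and repeat with the horizon held fixed at H.
def receding_horizon_control(x0, horizon, num_steps, q_estimator):
    x = x0
    for _ in range(num_steps):
        a = max(actions(x), key=lambda act: q_estimator(x, act, horizon))  # u(x) = argmax_a Q(x,a)
        x = transition(x, a)      # in a real system: apply a, then observe the next state
    return x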
46. Congestion Control (Cont'd)
47. Domain 3: Congestion Control
[Figure: high-priority and best-effort traffic sharing a bottleneck node]
- Resources: bandwidth and buffer
- Objective: optimize throughput, delay, loss, and fairness
- High-priority traffic
- Open-loop controlled
- Low-priority traffic
- Closed-loop controlled
48-51. Congestion Control Results (result plots)