Title: Distributed Planning in Hierarchical Factored MDPs
1. Distributed Planning in Hierarchical Factored MDPs
- Carlos Guestrin
- Stanford University
- Geoffrey Gordon
- Carnegie Mellon University
2. Multiagent Coordination Examples
- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control
- Access only local information
- Distributed Control
- Distributed Planning
3. Hierarchical Decomposition
Figure: part-of hierarchy over subsystems (Engine, Chassis, Cylinders, Injection, Steering, Exhaust).
- Subsystems can share variables
- Each subsystem only observes its local variables
- Parallel decomposition → exponential state space
4. Outline
- Object-based Representation
- Hierarchical Factored MDPs
- Distributed planning
- Message passing algorithm based on LP decomposition
- Hierarchical action selection mechanism
- Limited observability and communication
- Reusing plans and computation
- Exploit classes of objects
5. Basic Subsystem MDP
- Subsystem j's variables:
- Internal variables Xj
- External variables Yj
- Actions Aj
- Subsystem model (sketched in code below):
- Rewards: Rj(Xj, Yj, Aj)
- Transitions: Pj(Xj' | Xj, Yj, Aj)
- Subsystem can be modeled with any representation
Figure: speed-control subsystem, showing its internal variables, external variables, and actions.
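As a concrete illustration, here is a minimal sketch of how such a subsystem might be represented in code; the class and field names (Subsystem, internal_vars, and so on) are illustrative choices, not notation from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Subsystem:
    """One node of the subsystem tree (illustrative structure).

    Rewards and transitions depend only on the subsystem's own
    internal variables Xj, external variables Yj, and actions Aj.
    """
    name: str
    internal_vars: List[str]   # Xj: observed and controlled locally
    external_vars: List[str]   # Yj: shared with neighboring subsystems
    actions: List[str]         # Aj: local action choices
    # Rj(xj, yj, aj) -> immediate reward
    reward: Callable[[Tuple, Tuple, str], float]
    # Pj(xj' | xj, yj, aj) -> {next internal state: probability}
    transition: Callable[[Tuple, Tuple, str], Dict[Tuple, float]]
```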
6. Hierarchical Subsystem Tree
- Subsystem tree
- Nodes are subsystems
- Hierarchical decomposition
- Tree reward = sum of the subsystem rewards
- Consistent subsystem tree
- Running intersection property (see the sketch after the figure)
- Consistent dynamics
- Lemma: a consistent subsystem tree yields a well-defined global MDP
Figure: subsystem tree with nodes M2 (Speed control) and M3 (Cooling) and their separator sets SepSet[M2] and SepSet[M3].
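A consistent subsystem tree requires the running intersection property: every variable shared by two subsystems must appear in every subsystem on the tree path between them. A minimal sketch of that check, assuming the tree is given as parent pointers and each subsystem's variable scope as a set (all names here are hypothetical):

```python
from collections import defaultdict, deque

def satisfies_running_intersection(parent, scopes):
    """parent: {node: parent node, or None for the root}
       scopes: {node: set of variable names it mentions}
    For each variable, the nodes whose scope contains it must form a
    connected subtree (running intersection property)."""
    # Build an undirected adjacency list from the parent pointers.
    adj = defaultdict(set)
    for node, par in parent.items():
        if par is not None:
            adj[node].add(par)
            adj[par].add(node)
    all_vars = set().union(*scopes.values())
    for var in all_vars:
        nodes = {n for n, scope in scopes.items() if var in scope}
        # BFS restricted to `nodes`; all of them must be reachable from one.
        start = next(iter(nodes))
        seen, frontier = {start}, deque([start])
        while frontier:
            n = frontier.popleft()
            for m in adj[n]:
                if m in nodes and m not in seen:
                    seen.add(m)
                    frontier.append(m)
        if seen != nodes:
            return False
    return True
```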
7. Relationship to Factored MDPs
Hierarchical Factored MDP vs. Multiagent Factored MDP [Guestrin et al. 01]
- Representational power is equivalent
- A hierarchical factored MDP ≡ a multiagent factored MDP with a particular choice of basis functions
- New capabilities
- Fully distributed planning algorithm
- Reuse for knowledge representation
- Reuse of computation
- MDP counterpart to Object-Oriented Bayes Nets
(OOBNs) Koller and Pfeffer 97
8. Planning for Hierarchical Factored MDPs
- Action space: joint action a = (a1, ..., an) over all subsystems
- State space: joint state x of the entire system
- Reward function: total reward r
- Action and state spaces are exponential in the number of subsystems
- Exploit hierarchical structure
- Efficient, distributed approximate planning algorithm
- Simple message-passing approach
- Each subsystem accesses only its local model
- Each local model solved by any standard MDP algorithm
9. Solving MDPs as LPs
- Bellman constraint: if action a takes x to y with reward r, then
- V(x) ≥ V(y) + r  (= Q(a, x))
- Similarly for stochastic transitions
- Optimal V satisfies all Bellman constraints, and is componentwise smallest
min V(x) + V(y) + V(z) + V(g)
s.t.  V(x) ≥ V(y) + 1    V(y) ≥ V(g) + 3
      V(x) ≥ V(z) + 2    V(z) ≥ V(g) + 1
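For concreteness, this tiny LP can be solved directly; a minimal sketch assuming SciPy is available, and pinning V(g) = 0 (treating g as a zero-value goal state, which the slide leaves implicit) so that the minimization is bounded:

```python
import numpy as np
from scipy.optimize import linprog

# Variable order: [V(x), V(y), V(z), V(g)]
c = np.ones(4)                      # minimize V(x)+V(y)+V(z)+V(g)

# Each Bellman constraint V(s) >= V(s') + r becomes
#   -V(s) + V(s') <= -r            (linprog uses A_ub @ v <= b_ub)
A_ub = np.array([
    [-1,  1,  0,  0],   # V(x) >= V(y) + 1
    [ 0, -1,  0,  1],   # V(y) >= V(g) + 3
    [-1,  0,  1,  0],   # V(x) >= V(z) + 2
    [ 0,  0, -1,  1],   # V(z) >= V(g) + 1
])
b_ub = np.array([-1, -3, -2, -1])

bounds = [(None, None)] * 3 + [(0, 0)]   # pin V(g) = 0 (goal state)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(dict(zip("xyzg", res.x)))          # -> V(x)=4, V(y)=3, V(z)=1, V(g)=0
```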
10. Decomposable Value Functions
Linear combination of restricted-domain functions: V(x) ≈ Σi wi hi(x)
[Bellman et al. 63; Schweitzer & Seidmann 85; Tsitsiklis & Van Roy 96; Koller & Parr 99, 00; Guestrin et al. 01]
- Each hi is the status of a small part(s) of a complex system
- e.g., status of a machine and its neighbors
- e.g., load on a machine
- Must find w giving a good approximate value function
- Well-designed hi → exponentially fewer parameters
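In code, the decomposable value function is just a weighted sum of restricted-domain basis functions; a minimal illustrative sketch (the representation of states and basis functions here is an assumption, not the paper's):

```python
def approx_value(state, basis, weights):
    """V(x) ~= sum_i w_i * h_i(x), where each h_i looks only at a small
    subset of the state variables (e.g. one machine and its neighbors).

    state:   dict mapping variable name -> value for the full system
    basis:   list of (scope, h) pairs; h takes only the scoped variables
    weights: list of floats w_i, found by the approximate LP
    """
    total = 0.0
    for (scope, h), w in zip(basis, weights):
        local = {v: state[v] for v in scope}   # restrict to h_i's small domain
        total += w * h(local)
    return total
```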
11. Approximate Linear Programming
- To solve the subsystem tree MDP as an LP
- Overall state is the cross-product of the subsystem states
- Bellman LP has exponentially many constraints and variables
- → we need to approximate
- Write V(x) ≈ V1(X1) + V2(X2) + ...
- Minimize V1(X1) + V2(X2) + ... s.t.
- V1(X1) + V2(X2) + ... ≥ R1 + R2 + ... + V1(Y1) + V2(Y2) + ...  (for each transition x → y)
- One variable Vi(Xi) for each state of each subsystem → polynomially many variables
- One constraint for every joint state and action → still exponentially many
- But Vi, Qi depend only on small sets of variables/actions
- Generates polynomially-sized LPs for factored MDPs [Guestrin et al. 01]
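The sketch below writes out the approximate LP over basis-function weights by brute force, enumerating every joint state and action; it illustrates the constraints above but deliberately omits the factored-LP construction of Guestrin et al. 01 that makes the LP polynomially sized. Here V(x) = Σi wi hi(x) as on the previous slide; choosing one indicator basis function per local subsystem state recovers the V1(X1) + V2(X2) + ... form. SciPy is assumed; all names and signatures are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def approx_lp_weights(states, actions, R, P, basis, gamma=0.95):
    """Find weights w minimizing sum_x V_w(x) subject to the Bellman
    inequalities  V_w(x) >= R(x, a) + gamma * sum_x' P(x'|x,a) V_w(x'),
    where V_w(x) = sum_i w_i * h_i(x).  Brute-force sketch: only for
    tiny systems, since it enumerates all joint states and actions."""
    H = np.array([[h(x) for h in basis] for x in states])   # |S| x k
    c = H.sum(axis=0)                                        # objective coefficients
    A_ub, b_ub = [], []
    for si, x in enumerate(states):
        for a in actions:
            # expected next-state feature vector under (x, a)
            exp_next = sum(P(x, a, xn) * H[sj]
                           for sj, xn in enumerate(states))
            # constraint H[si] @ w >= R(x,a) + gamma * exp_next @ w,
            # rewritten for linprog as (gamma*exp_next - H[si]) @ w <= -R
            A_ub.append(gamma * exp_next - H[si])
            b_ub.append(-R(x, a))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * len(basis))
    return res.x
```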
12. Overview of Algorithm
- Each subsystem solves a local (stand-alone) MDP
- Each subsystem computes messages by solving a simple local LP
- Sends a constraint message to its parent
- Sends reward messages to its children
- Repeat until convergence
Figure: subsystem chain Ml, Mj, Mk; reward messages flow down to children and constraint messages flow up to parents.
13. Stand-alone MDPs and Reward Messages
Reward messages:
- Sj received from the parent
- Sk sent to the children
Subsystem MDP:
- State: (Xj, Yj)
- Actions: Aj
- Rewards: Rj(Xj, Yj, Aj)
- Transitions: Pj(Xj' | Xj, Yj, Aj)
Stand-alone MDP:
- State: Xj
- Actions: (Aj, Yj)
- Rewards: Rj(Xj, Yj, Aj) + Sj - Σk Sk
- Transitions: Pj(Xj' | Xj, Yj, Aj)
- Reward messages are defined over SepSets
- Solve the stand-alone MDP using any algorithm
- Obtain the visitation frequencies of the resulting policy
- μj = discounted frequency of visits to each state-action pair
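Following the reward decomposition above (receive Sj from the parent, pay Sk to each child, so the message terms cancel when summed over the tree), a minimal sketch of constructing the stand-alone reward; the sign convention and the functional form of the messages are assumptions consistent with the slide, not taken verbatim from the paper.

```python
def standalone_reward(Rj, Sj_from_parent, Sk_to_children):
    """Stand-alone reward for subsystem j (illustrative sketch).

    Rj(xj, yj, aj)       : the subsystem's own reward
    Sj_from_parent(sep)  : reward message received from the parent
    Sk_to_children       : list of reward messages handed to the children
    Each message depends only on the variables in the corresponding SepSet;
    `sep_parent` / `sep_children` stand for those restricted assignments.
    """
    def reward(xj, yj, aj, sep_parent, sep_children):
        r = Rj(xj, yj, aj) + Sj_from_parent(sep_parent)
        for Sk, sep_k in zip(Sk_to_children, sep_children):
            r -= Sk(sep_k)   # paid to the children, so tree totals telescope
        return r
    return reward
```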
14. Visitation Frequencies
Dual LP (its variables are the visitation frequencies μ)
- μ(x, a) = discounted frequency of visits to each state-action pair
- Subsystems must agree on the frequencies of shared variables → reward messages
- Approximation → relaxed enforcement of these agreement constraints
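A minimal sketch of solving this dual LP directly for a small stand-alone MDP, assuming SciPy; the optimal dual variables μ(s, a) are the discounted visitation frequencies referred to above (function and argument names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def visitation_frequencies(states, actions, P, R, alpha, gamma=0.95):
    """Solve the dual Bellman LP:
         max   sum_{s,a} mu(s,a) R(s,a)
         s.t.  sum_a mu(s',a) - gamma * sum_{s,a} P(s'|s,a) mu(s,a) = alpha(s')
               mu >= 0
    The optimal mu are the discounted state-action visitation frequencies
    of an optimal policy."""
    n, m = len(states), len(actions)
    idx = lambda si, ai: si * m + ai
    c = np.array([-R(states[si], actions[ai])        # linprog minimizes
                  for si in range(n) for ai in range(m)])
    A_eq = np.zeros((n, n * m))
    for sj in range(n):                              # one flow constraint per state
        for si in range(n):
            for ai in range(m):
                if si == sj:
                    A_eq[sj, idx(si, ai)] += 1.0
                A_eq[sj, idx(si, ai)] -= gamma * P(states[si], actions[ai], states[sj])
    b_eq = np.array([alpha(s) for s in states])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (n * m))
    return res.x.reshape(n, m)                       # mu[s, a]
```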
15. Overview of Algorithm (Detailed)
- Each subsystem solves a local (stand-alone) MDP
- Compute local visitation frequencies μj
- Add a constraint to the reward-message LP
- Each subsystem computes messages by solving a simple local LP
- Sends a constraint message to its parent: the visitation frequencies for the SepSet variables
- Sends reward messages to its children
- Repeat until convergence (see the skeleton sketch below)
Figure: subsystem chain Ml, Mj, Mk exchanging messages.
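A skeleton of the overall message-passing loop, to make the control flow concrete; `solve_standalone` and `update_reward_messages` are stand-ins for the local MDP solver and the local reward-message LP described above, so this is an illustrative outline rather than the paper's algorithm.

```python
def hierarchical_planning(subsystems, solve_standalone, update_reward_messages,
                          max_iters=100):
    """Illustrative outer loop of the distributed planning algorithm.

    subsystems             : the nodes of the subsystem tree
    solve_standalone(m, S) : solve m's stand-alone MDP given the current
                             reward messages S; returns visitation frequencies
    update_reward_messages(m, mu, S): add mu as a new constraint in m's
                             reward-message LP, re-solve it, and return the
                             updated messages plus a flag saying whether
                             anything changed
    """
    S = {m.name: None for m in subsystems}       # reward messages, initially empty
    for _ in range(max_iters):
        any_change = False
        for m in subsystems:
            mu = solve_standalone(m, S)           # 1. local stand-alone MDP
            S, changed = update_reward_messages(m, mu, S)   # 2. local LP
            any_change = any_change or changed    # 3. messages passed via S
        if not any_change:                        # repeat until convergence
            break
    return S
```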
16. Reward Message LP
- The LP yields the reward messages Sk for the children
- Its dual yields mixing weights pj, pk that enforce consistent frequencies
17. Computing Reward Messages
- Rows of Φjj and Lj hold the visitation frequencies and the value of each policy found so far by Mj
- Rows of Φjk are those frequencies marginalized to SepSet[Mk]
Messages:
- The dual of the reward-message LP generates mixed policies
- pj and pk are mixing parameters that force parents and children to agree on the visitation of the SepSet
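For one parent-child edge, the master ("reward message") LP that mixes the stored policies can be written down directly; a minimal sketch assuming SciPy's HiGHS solver, where the duals of the agreement constraints act as reward-message candidates. The matrix and function names (Phi_par, mixing_lp, ...) are illustrative, not the paper's notation.

```python
import numpy as np
from scipy.optimize import linprog

def mixing_lp(L_par, Phi_par, L_chi, Phi_chi):
    """Illustrative master LP for one parent-child edge.

    L_par[t]   : value of the t-th policy found so far by the parent
    Phi_par[t] : its visitation frequencies marginalized to the SepSet
    L_chi, Phi_chi : the same quantities for the child

    Finds mixing weights over the stored policies that maximize total value
    while forcing parent and child to agree on the SepSet frequencies."""
    Tp, S = Phi_par.shape
    Tc, _ = Phi_chi.shape
    # variables: [p_par (Tp), p_chi (Tc)]; linprog minimizes, so negate values
    c = -np.concatenate([L_par, L_chi])
    # agreement: Phi_par^T p_par - Phi_chi^T p_chi = 0 (one row per SepSet state),
    # plus each mixing distribution sums to 1
    A_eq = np.block([[Phi_par.T, -Phi_chi.T],
                     [np.ones((1, Tp)), np.zeros((1, Tc))],
                     [np.zeros((1, Tp)), np.ones((1, Tc))]])
    b_eq = np.concatenate([np.zeros(S), [1.0, 1.0]])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (Tp + Tc), method="highs")
    p_par, p_chi = res.x[:Tp], res.x[Tp:]
    duals = res.eqlin.marginals[:S]   # duals of the agreement constraints (HiGHS)
    return p_par, p_chi, duals
```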
18. Convergence Result
In a finite number of iterations, the algorithm produces the best possible value function (i.e., the same as a centralized planner)
- The planning algorithm is a special case of nested Benders decomposition
- One Benders split for each internal node N of the subsystem tree
- One subproblem is N itself
- The remaining subproblems are the subtrees rooted at N's children (decompose these recursively)
- The master problem is to determine the reward messages
- The result follows from the correctness of Benders decomposition
19. Hierarchical Action Selection
- Distributed planning obtains the value function
- Distributed message passing obtains the action choice (policy); see the worked example below
- Sends the value of its conditional policy to its parent
- Sends its action choice to its children
- Limited observability
- Limited communication
Figure: subsystem chain Ml, Mj, Mk; values of conditional policies flow up to parents, action choices flow down to children.
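A tiny worked example of the action-selection message passing for a two-subsystem tree that shares one variable; the numbers and names are made up purely to show the upward (conditional value) and downward (action choice) passes.

```python
# Child's local Q-values: Q_child[(sep, a_child)], where 'sep' is the single
# variable shared with the parent (their SepSet).
Q_child = {(0, "slow"): 1.0, (0, "fast"): 0.5,
           (1, "slow"): 0.2, (1, "fast"): 2.0}
# Parent's local Q-values: Q_parent[(sep, a_parent)]
Q_parent = {(0, "idle"): 1.5, (0, "run"): 0.8,
            (1, "idle"): 0.3, (1, "run"): 1.1}

# Upward pass: child sends the value of its best action for each SepSet value.
msg_up = {sep: max(v for (s, _), v in Q_child.items() if s == sep)
          for sep in {s for (s, _) in Q_child}}

# Root decision: parent maximizes its own Q plus the child's message.
best_sep, best_parent_action = max(
    Q_parent, key=lambda sa: Q_parent[sa] + msg_up[sa[0]])

# Downward pass: child is told best_sep and commits to its maximizing action.
best_child_action = max(
    (a for (s, a) in Q_child if s == best_sep),
    key=lambda a: Q_child[(best_sep, a)])

print(best_sep, best_parent_action, best_child_action)
```

Note that each subsystem only needs its own Q-values plus one short message, matching the limited observability and communication above.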
20. Reusing Models and Computation
- Classes of objects
- Basic subsystems with the same rewards and transitions
- Reuse in knowledge representation
- Library of subsystems
- Reusing computation
- Compute the policy (visitation frequencies) for one subsystem, reuse it in all subsystems of the same class
- Compute messages for one subtree, reuse them in all equivalent subtrees
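A minimal sketch of this computational reuse: cache the solution per subsystem class and hand it out to every instance. As the slide notes, reuse is only valid when the reward messages match too (equivalent subtrees), so the cache key includes them; all names here are illustrative.

```python
# Illustrative reuse of computation across subsystems of the same class.
policy_cache = {}

def solve_with_reuse(subsystem, reward_messages, solve_standalone):
    """subsystem.cls identifies its class (same rewards and transitions);
    solve_standalone is a stand-in for the local MDP solver.  The cache key
    includes the reward messages (assumed hashable here), since the
    stand-alone MDP also depends on them."""
    key = (subsystem.cls, reward_messages)
    if key not in policy_cache:
        policy_cache[key] = solve_standalone(subsystem, reward_messages)
    return policy_cache[key]
```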
21. Related Work
- Serial decompositions
- one subsystem active at a time
- Kushner & Chen 74 (rooms in a maze)
- Dean & Lin, IJCAI-95 (combines with abstraction)
- hierarchical RL is similar (MAXQ, HAM, etc.)
- Parallel decompositions
- more expressive (exponentially larger state space)
- Singh & Cohn, NIPS-98 (enumerates states)
- Meuleau et al., AAAI-98 (heuristic for resources)
22. Related Work
- Dantzig-Wolfe or Benders decomposition
- Dantzig 65
- first used for MDPs by Kushner & Chen 74
- we are the first to apply it to parallel subsystems
- Variable elimination
- well known from Bayes nets
- Guestrin, Koller & Parr, NIPS-01
23. Summary: Hierarchical Factored MDPs
- Parallel decomposition → exponential state space
- Efficient distributed planning algorithm
- Solve local stand-alone MDPs with any algorithm
- Reward sharing coordinates the subsystem plans
- A simple message-passing algorithm computes the shared rewards
- Hierarchical action selection
- Limited communication
- Limited observability
- Reuse for knowledge representation and computation
- General approach for modeling and planning in large stochastic systems