Distributed Planning in Hierarchical Factored MDPs




1
Distributed Planning in Hierarchical Factored MDPs

Carlos Guestrin
Stanford University
Geoffrey Gordon
Carnegie Mellon University

2
Multiagent Coordination Examples

Search and rescue
Factory management
Supply chain
Firefighting
Network routing
Air traffic control

Access only local information
Distributed Control
Distributed Planning

3
Hierarchical Decomposition
[Figure: part-of hierarchy with nodes Engine, Chassis, Cylinders, Injection, Exhaust, and Steering]

Subsystems can share variables
Each subsystem only observes its local variables
Parallel decomposition ⇒ exponential state space

4
Outline

Object-based Representation
Hierarchical Factored MDPs
Distributed planning
Message passing algorithm based on LP decomposition
Hierarchical action selection mechanism
Limited observability and communication
Reusing plans and computation
Exploit classes of objects

5
Basic Subsystem MDP

Subsystem j decomposed:
Internal variables Xj
External variables Yj
Actions Aj
Subsystem model:
Rewards: Rj(Xj, Yj, Aj)
Transitions: Pj(Xj' | Xj, Yj, Aj)
Subsystem can be modeled with any representation

[Figure: Speed control subsystem, showing its actions, external variables, and internal variables]

6
Hierarchical Subsystem Tree

Subsystem tree
Nodes are subsystems
Hierarchical decomposition
Tree reward = sum of subsystem rewards
Consistent subsystem tree:
Running intersection property
Consistent dynamics
Lemma: a consistent subsystem tree yields a well-defined global MDP
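
The running intersection property can be checked mechanically. The sketch below is only an illustration: it assumes each subsystem is described by the set of variables it mentions (internal plus external) and that the tree is given as a child-to-parent map. The function name `has_running_intersection`, the root `M1`, and all variable names are hypothetical.

```python
def has_running_intersection(scopes, parent):
    """Return True if, for every variable, the subsystems that mention it
    form a connected subtree of the subsystem tree."""
    variables = set().union(*scopes.values())
    for var in variables:
        holders = {m for m, scope in scopes.items() if var in scope}
        # Tree edges (child -> parent) with both endpoints mentioning var.
        internal_edges = sum(1 for m in holders if parent.get(m) in holders)
        # A set of tree nodes is connected iff it spans len(holders) - 1 edges.
        if internal_edges != len(holders) - 1:
            return False
    return True


# Hypothetical three-node tree: M1 is the root, M2 and M3 are its children.
parent = {"M2": "M1", "M3": "M1"}
scopes = {
    "M1": {"speed", "temperature", "fuel"},
    "M2": {"speed", "throttle"},
    "M3": {"temperature", "fan"},
}
consistent = has_running_intersection(scopes, parent)

# Sharing "throttle" between M2 and M3 without listing it in M1 breaks the
# property: the path connecting M2 and M3 passes through M1, which omits it.
broken_scopes = dict(scopes, M3={"temperature", "fan", "throttle"})
broken = has_running_intersection(broken_scopes, parent)
```

The check covers only the first condition on the slide; consistent dynamics would need a separate test on the transition models.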

[Figure: subsystem tree with nodes M2 (Speed control) and M3 (Cooling), annotated with SepSet[M2] and SepSet[M3]]

7
Relationship to Factored MDPs
Hierarchical Factored MDP
Multiagent Factored MDP [Guestrin et al. '01]

Representational power equivalent:
Hierarchical factored MDP ⇔ multiagent factored MDP with particular choice of basis functions
New capabilities:
Fully distributed planning algorithm
Reuse for knowledge representation
Reuse of computation
MDP counterpart to Object-Oriented Bayes Nets (OOBNs) [Koller and Pfeffer '97]

8
Planning for Hierarchical Factored MDPs

Action space: joint action a = (a1, ..., an) for all subsystems
State space: joint state x of entire system
Reward function: total reward r
Action and state spaces are exponential in the number of subsystems
Exploit hierarchical structure:
Efficient, distributed approximate planning algorithm
Simple message passing approach
Each subsystem accesses only its local model
Each local model solved by any standard MDP algorithm

9
Solving MDPs as LPs

Bellman constraint: if x → y under action a with reward r, then
V(x) ≥ V(y) + r = Q(a, x)
Similarly for stochastic transitions
Optimal V satisfies all Bellman constraints, and is componentwise smallest

min V(x) + V(y) + V(z) + V(g)
s.t. V(x) ≥ V(y) + 1
     V(y) ≥ V(g) + 3
     V(x) ≥ V(z) + 2
     V(z) ≥ V(g) + 1
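
For the four-state example above, the componentwise-smallest feasible V can be found without an LP solver by repeating Bellman backups. This is a minimal sketch under two assumptions that the slide does not state: the goal state g is terminal with V(g) fixed at 0, and there is no discounting. The names `edges` and `smallest_feasible_values` are illustrative.

```python
# Deterministic transitions from the example: state -> [(next_state, reward)].
edges = {
    "x": [("y", 1), ("z", 2)],
    "y": [("g", 3)],
    "z": [("g", 1)],
    "g": [],  # assumed terminal, with V(g) fixed at 0
}


def smallest_feasible_values(edges, sweeps=10):
    """Repeat V(s) = max over outgoing edges of (V(next) + r)."""
    values = {s: 0.0 for s in edges}
    for _ in range(sweeps):
        for s, out in edges.items():
            if out:
                values[s] = max(values[nxt] + r for nxt, r in out)
    return values


V = smallest_feasible_values(edges)

# Every Bellman constraint V(s) >= V(next) + r holds ...
feasible = all(
    V[s] >= V[nxt] + r for s, out in edges.items() for nxt, r in out
)
# ... and each non-terminal state has a tight constraint, so with V(g) fixed
# no value can be lowered without violating one of them.
tight = all(
    any(V[s] == V[nxt] + r for nxt, r in out)
    for s, out in edges.items()
    if out
)
```

Under these assumptions the result is V(x) = 4, V(y) = 3, V(z) = 1, V(g) = 0, which is the point the LP objective selects among all feasible value functions.
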
10
Decomposable Value Functions
Linear combination of restricted domain functions:
V(x) ≈ Σi wi hi(x)
[Bellman et al. '63], [Schweitzer & Seidmann '85], [Tsitsiklis & Van Roy '96], [Koller & Parr '99, '00], [Guestrin et al. '01]

Each hi is the status of a small part (or parts) of a complex system:
Status of a machine and neighbors
Load on machine
Must find w giving good approximate value function
Well-designed hi ⇒ exponentially fewer parameters
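
To make the parameter savings concrete, here is a small sketch of a value function built from restricted-domain basis functions. Everything in it is an assumption for illustration: the ring of ten binary machine-status variables, the basis function `h`, and the fixed weights stand in for whatever a planner would actually produce.

```python
# Hypothetical system of n binary machine-status variables (1 = working).
n = 10

# Each basis function h_i looks only at machine i and its right-hand
# neighbour, so its domain has 4 assignments instead of 2**n.
scopes = [(i, (i + 1) % n) for i in range(n)]


def h(local_state):
    """Illustrative basis function: number of working machines in scope."""
    return sum(local_state)


weights = [1.0] * n  # the w a planner would have to find; fixed here


def value(x):
    """V(x) = sum_i w_i * h_i(x restricted to scope i)."""
    return sum(
        w * h(tuple(x[v] for v in scope)) for w, scope in zip(weights, scopes)
    )


tabular_parameters = 2 ** n         # one value per joint state
factored_parameters = len(weights)  # one weight per basis function

all_working = (1,) * n
v_all = value(all_working)
```

With ten machines the tabular representation needs 1024 values while the factored one needs ten weights, and the gap doubles with every machine added.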

11
Approximate Linear Programming

To solve subsystem tree MDP as LP:
Overall state is cross-product of subsystem states
Bellman LP has exponentially many constraints and variables
⇒ we need to approximate
Write V(x) = V1(X1) + V2(X2) + ...
Minimize V1(X1) + V2(X2) + ... s.t.
V1(X1) + V2(X2) + ... ≥ V1(Y1) + V2(Y2) + ... + R1 + R2 + ...
One variable Vi(Xi) for each state of each subsystem
One constraint for every state and action
Vi, Qi depend on small sets of variables/actions
Generates polynomially-sized LPs for factored MDPs [Guestrin et al. '01]
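
The size argument can be illustrated with a few lines of arithmetic. This sketch assumes, purely for illustration, that every subsystem has the same number of local states and actions and that subsystems do not share variables; the function names are hypothetical.

```python
def exact_lp_size(num_subsystems, states_each, actions_each):
    """Variables and constraints of the exact Bellman LP over joint states."""
    joint_states = states_each ** num_subsystems
    joint_actions = actions_each ** num_subsystems
    return joint_states, joint_states * joint_actions


def additive_lp_variables(num_subsystems, states_each):
    """Variables when V(x) = V1(X1) + V2(X2) + ...: one per local state."""
    return num_subsystems * states_each


exact_vars, exact_cons = exact_lp_size(10, 4, 2)
approx_vars = additive_lp_variables(10, 4)
```

The additive form shrinks the variable count from 4^10 to 40, but the naive constraint count is unchanged. Removing that second blow-up is where the locality of Vi and Qi comes in.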

12
Overview of Algorithm

Each subsystem solves a local (stand-alone) MDP
Each subsystem computes messages by solving a simple local LP
Sends constraint message to its parent
Sends reward messages to its children
Repeat until convergence

[Figure: subsystems Ml, Mj, Mk exchanging reward messages and constraint messages]

13
Stand-alone MDPs and Reward Messages
Reward messages:
Sj from parent
Sk to children

Subsystem MDP:
State: (Xj, Yj)
Actions: Aj
Rewards: Rj(Xj, Yj, Aj)
Transitions: Pj(Xj' | Xj, Yj, Aj)

Stand-alone MDP:
State: Xj
Actions: (Aj, Yj)
Rewards: Rj(Xj, Yj, Aj) + Sj − Σk Sk
Transitions: Pj(Xj' | Xj, Yj, Aj)

Reward messages are over SepSets
Solve stand-alone MDP using any algorithm
Obtain visitation frequencies of resulting policy
?j: discounted frequency of visits to each state-action pair
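
One way to read the stand-alone reward is that each message is added by the subsystem that receives it and subtracted by the parent that sends it, so the messages cancel when rewards are summed over the tree. The sketch below checks that bookkeeping on a toy tree. The sign convention Rj + Sj − Σk Sk is an assumption here (the operators are not legible in the transcript), and the tree, the root `M1`, and all numbers are invented.

```python
# Hypothetical tree: M1 is the root; M2 and M3 are its children.
children = {"M1": ["M2", "M3"], "M2": [], "M3": []}

# Local rewards Rj and reward messages, all evaluated at one fixed joint
# state and action. S[j] is the message subsystem j receives from its
# parent; the root receives none. The numbers are arbitrary.
R = {"M1": 5.0, "M2": 2.0, "M3": -1.0}
S = {"M2": 0.75, "M3": -1.5}


def stand_alone_reward(j):
    """Rj + Sj - sum of Sk over children k (sign convention assumed)."""
    from_parent = S.get(j, 0.0)
    to_children = sum(S[k] for k in children[j])
    return R[j] + from_parent - to_children


total_stand_alone = sum(stand_alone_reward(j) for j in R)
total_original = sum(R.values())
```

Both totals come out equal, so under this convention the messages redistribute reward between subsystems without changing the tree reward.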

14
Visitation Frequencies
Dual

Discounted frequency of visits to each state-action pair
Subsystems must agree on the frequency for shared variables ⇒ reward messages
Approximation ⇒ relaxed enforcement of constraints
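
Discounted visitation frequencies of a fixed policy can be computed by pushing the state distribution forward one step at a time. The following is a minimal sketch, not the authors' procedure: the two-state machine, its transition probabilities, the policy, and the truncation horizon are all assumptions chosen for illustration.

```python
def visitation_frequencies(transition, policy, start, gamma, horizon=2000):
    """Discounted state-action visitation frequencies of a fixed policy.

    transition[s][a] maps next_state -> probability, policy[s] is the action
    taken in s, and start maps state -> initial probability.
    """
    freq = {}
    dist = dict(start)
    discount = 1.0
    for _ in range(horizon):
        nxt = {}
        for s, p in dist.items():
            a = policy[s]
            freq[(s, a)] = freq.get((s, a), 0.0) + discount * p
            for s2, q in transition[s][a].items():
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
        discount *= gamma
    return freq


# Hypothetical two-state machine: "up" may fail, "down" can be repaired.
transition = {
    "up": {"run": {"up": 0.9, "down": 0.1}},
    "down": {"repair": {"up": 0.5, "down": 0.5}, "wait": {"down": 1.0}},
}
policy = {"up": "run", "down": "repair"}
gamma = 0.9

freq = visitation_frequencies(transition, policy, {"up": 1.0}, gamma)
total = sum(freq.values())  # approaches 1 / (1 - gamma) as horizon grows
```

Multiplying each frequency by the reward of its state-action pair and summing gives the discounted value of the policy, which is why these quantities appear as the dual variables of the Bellman LP.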

15
Overview of Algorithm: Detailed

Each subsystem solves a local (stand-alone) MDP
Compute local visitation frequencies ?j
Add constraint to reward message LP
Each subsystem computes messages by solving a simple local LP
Sends constraint message to its parent: visitation frequencies for SepSet variables
Sends reward messages to its children
Repeat until convergence

[Figure: subsystems Ml, Mj, Mk]

16
Reward Message LP
Dual

LP yields reward messages Sk for children
Dual yields mixing weights pj, pk ⇒ enforce consistent frequencies

17
Computing Reward Messages
Rows of ?jj and Lj correspond to the visitation frequencies and value of each policy visited by Mj
Rows of ?jk are frequencies marginalized to SepSet[Mk]
Messages

Dual of reward message LP generates mixed policies
pj and pk are mixing parameters; they force parents and children to agree on visitation of SepSet

18
Convergence Result
In a finite number of iterations, the algorithm produces the best possible value function (i.e., the same as a centralized planner)

Planning algorithm is a special case of nested Benders decomposition
One Benders split for each internal node N of subsystem tree
One subproblem is N itself
Remaining subproblems are subtrees for N's children (decompose these recursively)
Master problem is to determine reward messages
Result follows from correctness of Benders decomposition

19
Hierarchical Action Selection

Distributed planning obtains value function
Distributed message passing obtains action choice (policy)
Sends conditional value to its parent
Sends action choice to its children
Limited observability
Limited communication
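
The two message directions can be sketched as ordinary dynamic programming on a tree. This is a generic illustration rather than the mechanism from the talk: it assumes each subsystem's local value depends only on its own choice and its parent's choice, and the table `q`, the binary choices, and all numbers are invented.

```python
# Hypothetical chain Ml -> Mj -> Mk (parent to child). Each subsystem picks
# one local choice from {0, 1}; q[m][(parent_choice, own_choice)] is its
# local value. The root Ml has no parent, so its parent_choice is None.
children = {"Ml": ["Mj"], "Mj": ["Mk"], "Mk": []}
choices = (0, 1)
q = {
    "Ml": {(None, 0): 1.0, (None, 1): 0.0},
    "Mj": {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 5.0, (1, 1): 0.0},
    "Mk": {(0, 0): 3.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5},
}


def conditional_value(m, parent_choice):
    """Upward message: best subtree value of m given its parent's choice."""
    return max(
        q[m][(parent_choice, c)]
        + sum(conditional_value(k, c) for k in children[m])
        for c in choices
    )


def select_actions(m, parent_choice, plan):
    """Downward message: fix m's choice, then pass it to the children."""
    best = max(
        choices,
        key=lambda c: q[m][(parent_choice, c)]
        + sum(conditional_value(k, c) for k in children[m]),
    )
    plan[m] = best
    for k in children[m]:
        select_actions(k, best, plan)
    return plan


plan = select_actions("Ml", None, {})
best_value = conditional_value("Ml", None)
```

In this toy instance the root gives up its own better local value (choice 0) because the conditional values reported from below make choice 1 better overall. Each subsystem consults only its own table and its children's messages, which matches the limited-observability setting. A real implementation would cache the upward messages instead of recomputing them.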

[Figure: subsystems Ml, Mj, Mk exchanging action choices and values of conditional policies]

20
Reusing Models and Computation

Classes of objects
Basic subsystems with same rewards and transitions
Reuse in knowledge representation
Library of subsystems
Reusing computation
Compute policy (visitation frequencies) for one subsystem, use it in all subsystems of the same class
Compute messages for one subtree, use them in all equivalent subtrees
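
Reuse of computation across a class amounts to caching results by class rather than by instance. A minimal sketch, assuming instances of a class are interchangeable (same model and, implicitly, same incoming messages); `solve_stand_alone` is a hypothetical placeholder for an expensive solve, and the subsystem and class names are invented.

```python
solve_calls = 0


def solve_stand_alone(class_name):
    """Placeholder for an expensive per-subsystem solve (hypothetical)."""
    global solve_calls
    solve_calls += 1
    return f"policy-for-{class_name}"


# Hypothetical system: four subsystems drawn from two classes.
subsystem_class = {
    "cylinder1": "Cylinder",
    "cylinder2": "Cylinder",
    "cylinder3": "Cylinder",
    "exhaust": "Exhaust",
}

policy_by_class = {}
policy = {}
for name, cls in subsystem_class.items():
    if cls not in policy_by_class:  # solve once per class
        policy_by_class[cls] = solve_stand_alone(cls)
    policy[name] = policy_by_class[cls]  # reuse for every instance
```

Four subsystems trigger only two solves here. The same caching pattern would apply to messages computed for equivalent subtrees.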

21
Related Work