Title: Multiagent Planning with Factored MDPs
Slide 1: Multiagent Planning with Factored MDPs
- Carlos Guestrin
- Stanford University
Slide 2: Collaborative Multiagent Planning
Long-term goals
Multiple agents
Coordinated decisions
- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control
Slide 3: Exploiting Structure
- Real-world problems have:
  - Hundreds of objects
  - Googols of states
- Real-world problems have structure!
Approach: Exploit structured representation to obtain an efficient approximate solution
Slide 4: Real-time Strategy Game
[Screenshot: peasant, footman, and building units]
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen
Slide 5: Joint Decision Space
Markov Decision Process (MDP) Representation
- State space
- Joint state x of the entire system
- Action space
- Joint action a = (a1, …, an) for all agents
- Reward function
- Total reward R(x, a)
- Transition model
- Dynamics of the entire system P(x' | x, a)
Slide 6: Policy
At state x, action a for all agents
Policy: π(x) = a
Slide 7: Value of Policy
Expected long-term reward starting from x:
V_π(x) = E[ R(x0, π(x0)) + γ R(x1, π(x1)) + γ² R(x2, π(x2)) + … | x0 = x ]
Slide 8: Optimal Long-term Plan
Optimal value function V*(x)
Optimal policy π*(x)
Bellman equations
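For reference, the Bellman optimality equations these slides refer to, in standard form (γ is the discount factor):

```latex
% Bellman optimality equation and greedy optimal policy (standard form)
V^{*}(x) \;=\; \max_{a}\Big[\, R(x,a) \;+\; \gamma \sum_{x'} P(x' \mid x,a)\, V^{*}(x') \,\Big],
\qquad
\pi^{*}(x) \;=\; \operatorname*{arg\,max}_{a}\Big[\, R(x,a) \;+\; \gamma \sum_{x'} P(x' \mid x,a)\, V^{*}(x') \,\Big]
```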
Slide 9: Solving an MDP
Solve Bellman equation
Optimal value V(x)
Optimal policy ?(x)
Many algorithms solve the Bellman equations
- Policy iteration [Howard '60; Bellman '57]
- Value iteration [Bellman '57]
- Linear programming [Manne '60]
Slide 10: LP Solution to MDP
[Manne '60]
- Value computed by linear programming:
  minimize:   Σ_x V(x)
  subject to: V(x) ≥ Q(x, a)  for all x, a
  where  Q(x, a) = R(x, a) + γ Σ_{x'} P(x' | x, a) V(x')
- One variable V(x) for each state
- One constraint for each state x and action a
- Polynomial time solution
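As a concrete illustration (not from the talk), a minimal sketch of this LP on a made-up 2-state, 2-action MDP using scipy.optimize.linprog; all transition probabilities and rewards below are hypothetical:

```python
# Sketch of the LP formulation of an MDP (Manne '60) on a tiny hypothetical MDP:
#   minimize sum_x V(x)   s.t.   V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[a, x, x'] (made-up numbers)
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                    # R[x, a] (made-up numbers)
              [0.0, 2.0]])

c = np.ones(n_states)                        # objective: minimize sum_x V(x)
A_ub, b_ub = [], []
for a in range(n_actions):
    for x in range(n_states):
        # V(x) - gamma * P(.|x,a) @ V >= R(x,a), rewritten as row @ V <= -R(x,a)
        A_ub.append(gamma * P[a, x] - np.eye(n_states)[x])
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(None, None))
V = res.x                                    # optimal value function
policy = [int(np.argmax([R[x, a] + gamma * P[a, x] @ V for a in range(n_actions)]))
          for x in range(n_states)]
print("V* =", V, "greedy policy:", policy)
```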
Slide 11: Planning under Bellman's Curse
- Planning is polynomial in the # of states and actions
- # of states is exponential in the number of variables
- # of actions is exponential in the number of agents
Efficient approximation by exploiting structure!
Slide 12: Structure in Representation: Factored MDP
[Boutilier et al. '95]
P(F' | F, G, A_B, A_F)
- State
- Dynamics
- Decisions
- Rewards
Complexity of representation: exponential in # of parents (worst case)
Slide 13: Structured Value Function?
Factored MDP ⇒ structure in V?
Not exactly, but a structured V yields a good approximate value function
Slide 14: Structured Value Functions
Linear combination of restricted-domain functions
[Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]
V(x) = Σ_i w_i h_i(x)
- Each h_i is the status of small part(s) of a complex system:
  - State of footman and enemy
  - Status of barracks
  - Status of barracks and state of footman
- Structured V ⇒ structured Q
- Must find w giving a good approximate value function
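A minimal sketch (not from the talk) of such a linear value function; the state variables, basis functions, and weights below are hypothetical, with the weights standing in for what the factored LP (next slide) would compute:

```python
# V(x) = sum_i w_i * h_i(x), where each basis function h_i looks only at a
# small part of the joint state (restricted-domain functions).
state = {"footman_health": 3, "enemy_health": 1, "barracks_ok": True}

basis = {
    "footman_vs_enemy": lambda x: x["footman_health"] - x["enemy_health"],
    "barracks":         lambda x: 1.0 if x["barracks_ok"] else 0.0,
}
weights = {"footman_vs_enemy": 0.7, "barracks": 2.5}   # hypothetical w_i

def value(x):
    """Linear value function: weighted sum of restricted-domain basis functions."""
    return sum(weights[name] * h(x) for name, h in basis.items())

print(value(state))   # 0.7 * (3 - 1) + 2.5 * 1.0 = 3.9
```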
Slide 15: Approximate LP Solution
[Schweitzer and Seidmann '85]
- One variable w_i for each basis function
- Polynomial number of LP variables
- One constraint for every state and action
- Exponentially many LP constraints
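In the standard formulation (consistent with the exact LP of slide 10), the approximate LP substitutes V(x) = Σ_i w_i h_i(x) into the objective and the constraints; state-relevance weights are omitted here for simplicity:

```latex
% Approximate LP over the weights w (Schweitzer & Seidmann '85)
\min_{w}\ \sum_{x}\sum_{i} w_i\, h_i(x)
\quad \text{s.t.} \quad
\sum_{i} w_i\, h_i(x) \;\ge\; R(x,a) + \gamma \sum_{x'} P(x' \mid x, a) \sum_{i} w_i\, h_i(x')
\qquad \forall\, x, a
```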
Slide 16: Representing Exponentially Many Constraints
[Guestrin, Koller, Parr '01]
Exponentially many linear constraints ⇒ one nonlinear constraint
Maximization over an exponentially large space
Slide 17: Variable Elimination
Structured value function:
- Variable elimination to maximize over the state space [Bertele & Brioschi '72]
Here we need only 23, instead of 63, sum operations
- Maximization only exponential in the largest factor
- Tree-width characterizes complexity
- Graph-theoretic measure of connectedness
- Arises in many settings: integer programming, Bayes nets, computational geometry, …
Slide 18: Variable Elimination (continued)
Structured value function:
- Some terms involve only a small # of A_i's and X_j's
- Others involve only a small # of X_j's
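A minimal sketch (not from the talk) of variable elimination for maximization over a sum of local factors; the factor tables, domains, and elimination order in the example are made up:

```python
# Maximize a sum of local factors by eliminating one variable at a time.
# A factor is (scope, table): `scope` is a tuple of variable names and `table`
# maps an assignment tuple (ordered as in scope) to a number. The cost is
# exponential only in the size of the largest intermediate scope (tree-width).
from itertools import product

def max_out(factors, var, domains):
    """Eliminate `var` by maximizing the sum of all factors that mention it."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = sorted({v for s, _ in touching for v in s} - {var})
    table = {}
    for assign in product(*(domains[v] for v in scope)):
        ctx = dict(zip(scope, assign))
        table[assign] = max(
            sum(t[tuple({**ctx, var: val}[v] for v in s)] for s, t in touching)
            for val in domains[var])
    return rest + [(tuple(scope), table)]

def maximize(factors, order, domains):
    """Return the max over all variables of the sum of the given factors."""
    for var in order:
        factors = max_out(factors, var, domains)
    return sum(t[()] for _, t in factors)    # every remaining scope is empty

# Example: max over x1, x2, x3 of [x1 + 2*x2] + [3*x2*x3], each variable in {0, 1}
doms = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}
f1 = (("x1", "x2"), {(a, b): a + 2 * b for a, b in product([0, 1], repeat=2)})
f2 = (("x2", "x3"), {(b, c): 3 * b * c for b, c in product([0, 1], repeat=2)})
print(maximize([f1, f2], ["x1", "x3", "x2"], doms))   # -> 6
```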
Slide 19: Representing the Constraints
- Use Variable Elimination to represent constraints
Number of constraints exponentially smaller!
Slide 20: Understanding Scaling Properties
Number of LP constraints (k = tree-width):
Explicit LP: 2^n     Factored LP: (n+1−k)·2^k
Slide 21: Network Management Problem
- Computer status: good, dead, faulty
- Dead neighbors increase the probability of dying
- Computer runs processes
- Reward for successful processes
- Each SysAdmin takes a local action: reboot or not reboot
Problem with n machines ⇒ 9^n states, 2^n actions
Ring
Ring of Rings
Star
k-grid
Slide 22: Running Time
k = tree-width
Slide 23: Summary of Algorithm
- Pick local basis functions h_i
- Factored LP computes value function
- Policy is argmax_a of Q
Slide 24: Large-scale Multiagent Coordination
- Efficient algorithm computes V
- Action at state x is argmax_a Q(x, a)
- But: # of actions is exponential
- Complete observability?
- Full communication?
Slide 25: Distributed Q Function
[Guestrin, Koller, Parr '02]
Distributed Q function: each agent maintains a part of the Q function
Q(A1, …, A4, X1, …, X4) is a sum of local terms:
Q2(A1, A2, X1,X2)
Q1(A1, A4, X1,X4)
Q4(A3, A4, X3,X4)
Q3(A2, A3, X2,X3)
Slide 26: Multiagent Action Selection
Instantiate current state x
Maximal action: argmax_a of the distributed Q function:
Q2(A1, A2, X1,X2)
Q1(A1, A4, X1,X4)
Q3(A2, A3, X2,X3)
Q4(A3, A4, X3,X4)
Slide 27: Instantiate Current State x
Instantiate current state x
Limited observability ⇒ agent i only observes variables in Q_i
Q2(A1, A2, X1,X2)
Q2(A1, A2)
Q1(A1, A4, X1,X4)
Q1(A1, A4)
Q3(A2, A3, X2,X3)
Q3(A2, A3)
Q4(A3, A4, X3,X4)
Q4(A3, A4)
Slide 28: Multiagent Action Selection
Instantiate current state x
Maximal action: argmax_a of the distributed Q function:
Q2(A1, A2)
Q1(A1, A4)
Q3(A2, A3)
Q4(A3, A4)
Slide 29: Coordination Graph
max_a [Q1 + Q2 + Q3 + Q4]
- Use variable elimination for maximization
Q2(A1, A2)
Q1(A1, A4)
Q3(A2, A3)
A2      A4      Value of optimal A3 action
Attack  Attack  5
Attack  Defend  6
Defend  Attack  8
Defend  Defend  12
- Limited communication for optimal action choice
- Communication bandwidth = tree-width of the coordination graph
Q4(A3, A4)
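A minimal sketch (not from the talk) of this action-selection step for the four-agent example: each local Q_i is first conditioned on the observed state variables, and the joint action is then chosen by maximizing their sum. The maximization is done by brute force here for clarity; the talk does it with variable elimination on the coordination graph. All Q_i definitions, state values, and numbers are made up:

```python
from itertools import product

ACTIONS = [0, 1]                              # e.g. 0 = "defend", 1 = "attack"
x = {"x1": 1, "x2": 0, "x3": 1, "x4": 0}      # current (observed) state

# Local Q functions, each over two agents' actions and two state variables:
def Q1(a1, a4, x1, x4): return (x1 + x4) * a1 - 0.1 * a1
def Q2(a1, a2, x1, x2): return x2 * a2 + 0.5 * a1 * a2
def Q3(a2, a3, x2, x3): return x3 * a3 - 0.2 * a2
def Q4(a3, a4, x3, x4): return (x3 + x4) * a4 + 0.3 * a3 * a4

def joint_Q(a1, a2, a3, a4):
    """Q(x, a) = sum of local Q_i, each conditioned on the state it observes."""
    return (Q1(a1, a4, x["x1"], x["x4"]) + Q2(a1, a2, x["x1"], x["x2"]) +
            Q3(a2, a3, x["x2"], x["x3"]) + Q4(a3, a4, x["x3"], x["x4"]))

best = max(product(ACTIONS, repeat=4), key=lambda a: joint_Q(*a))
print("joint action:", best, "value:", joint_Q(*best))
```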
Slide 30: Coordination Graph Example
[Figure: coordination graph over agents A1, …, A5]
- Trees don't increase communication requirements
- Cycles require graph triangulation
Slide 31: Unified View: Function Approximation ⇔ Multiagent Coordination
Factored MDP and value function representations
induce communication, coordination
Q1(A1, A4, X1,X4) Q2(A1, A2, X1,X2) Q3(A2,
A3, X2,X3) Q4(A3, A4, X3,X4)
Slide 32: How good are the policies?
- SysAdmin problem
- Power grid problem [Schneider et al. '99]
Slide 33: SysAdmin Ring - Quality of Policies
Slide 34: Power Grid: Factored Multiagent
[Guestrin, Lagoudakis, Parr '02]
Lower is better!
Slide 35: Summary of Algorithm
- Pick local basis functions h_i
- Factored LP computes value function
- Coordination graph computes argmax_a of Q
Slide 36: Planning Complex Environments
- When faced with a complex problem, exploit structure:
  - For planning
  - For action selection
Slide 37: Generalizing to New Problems
Many problems are similar
Solve Problem 1, Solve Problem 2, …, Solve Problem n ⇒ Good solution to Problem n+1
MDPs are different! Different sets of states, actions, rewards, transitions, …
Slide 38: Generalization with Relational MDPs
[Guestrin, Koller, Gearhart, Kanodia '03]
Similar domains have similar types of objects ⇒ Relational MDP
Exploit similarities by computing generalizable value functions
Generalization: avoid the need to replan, tackle larger problems
Slide 39: Relational Models and MDPs
- Classes
- Peasant, Gold, Wood, Barracks, Footman, Enemy
- Relations
- Collects, Builds, Trains, Attacks
- Instances
- Peasant1, Peasant2, Footman1, Enemy1
Slide 40: Relational MDPs
- Class-level transition probabilities depend on:
  - Attributes, actions, and attributes of related objects
- Class-level reward function
Very compact representation! Does not depend on # of objects
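A minimal sketch (not from the talk) of what a relational MDP schema might look like in code: the class names follow the slides, but the attributes, the transition-model shape, and all probabilities are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Enemy:
    health: int                      # attribute

@dataclass
class Footman:
    health: int                      # attribute
    target: Enemy                    # relation (Attacks)

def footman_health_transition(footman: Footman, action: str) -> dict:
    """Class-level P(F.H' | F.H, A_F, attributes of related objects)."""
    # Made-up dynamics: damage chance grows with the linked enemy's health,
    # and shrinks if this footman's action is "retreat".
    p_hit = min(0.9, 0.1 + 0.2 * footman.target.health)
    if action == "retreat":
        p_hit *= 0.5
    dist = {footman.health: 1.0 - p_hit}
    hurt = max(footman.health - 1, 0)
    dist[hurt] = dist.get(hurt, 0.0) + p_hit
    return dist                      # distribution over next health values
```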
Slide 41: Tactical Freecraft: Relational Schema
[Schema figure: Footman and Enemy classes, each with a Health attribute; Count = # of footmen attacking]
- Enemy's health depends on # of footmen attacking
- Footman's health depends on Enemy's health
Slide 42: World is a Large Factored MDP
Relational MDP + # of objects + links between objects ⇒ factored MDP
- Instantiation (world):
  - # of instances of each class
  - Links between instances
- Well-defined factored MDP
Slide 43: World with 2 Footmen and 2 Enemies
Slide 44: World is a Large Factored MDP
Relational MDP + # of objects + links between objects ⇒ factored MDP
- Instantiate world
- Well-defined factored MDP
- Use factored LP for planning
- We have gained nothing!
Slide 45: Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) ≈ V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H)
Units are interchangeable!
V_F1 and V_F2 ⇒ class-level V_F;  V_E1 and V_E2 ⇒ class-level V_E
At state x, each footman has a different contribution to V
Given V_C, we can instantiate the value function for any world
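A minimal sketch (not from the talk) of instantiating class-level value functions for worlds of different sizes; the particular functions and numbers are hypothetical:

```python
# One value function per class (V_F for Footman, V_E for Enemy); the value of a
# world is the sum of the class-level functions applied to each object instance.
def V_F(footman_health, enemy_health):    # class-level function for Footman
    return footman_health - 0.5 * enemy_health

def V_E(enemy_health):                    # class-level function for Enemy
    return -float(enemy_health)

def world_value(footmen, enemies):
    """V(x) ~= sum_i V_F(F_i.H, E_i.H) + sum_j V_E(E_j.H), for any # of objects.

    Footman i is linked to enemy i (the links come from the instantiated world).
    """
    return (sum(V_F(f_h, e_h) for f_h, e_h in zip(footmen, enemies)) +
            sum(V_E(e_h) for e_h in enemies))

# The same class-level functions apply to a 2-vs-2 world and a 4-vs-4 world:
print(world_value([3, 2], [1, 2]))
print(world_value([3, 2, 3, 1], [1, 2, 0, 2]))
```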
Slide 46: Computing Class-level V_C
- Constraints for each world represented by the factored LP
- Number of worlds is exponential or infinite
Slide 47: Sampling Worlds
Sampling: instead of constraints for all worlds ω and all x, a, keep constraints only for sampled worlds ω ∈ I and all x, a
- Many worlds are similar
- Sample a set I of worlds
Slide 48: Theorem
- Exponentially (infinitely) many worlds!
- Do we need exponentially many samples? NO!
Value function within ε of the class-level solution optimized for all worlds, with probability at least 1−δ
R_max is the maximum class reward. Proof method related to [de Farias & Van Roy '02]
Slide 49: Learning Classes of Objects
Find regularities between worlds
Objects with similar values belong to same class
Plan for sampled worlds separately
Used decision tree regression in experiments
Slide 50: Summary of Algorithm
- Model domain as a Relational MDP
- Sample a set of worlds
- Factored LP computes class-level value function for sampled worlds
- Reuse class-level value function in new world
- Coordination graph computes argmax_a of Q
Slide 51: Experimental Results
Slide 52: Generalizing to New Problems
Slide 53: Learning Classes of Objects
Slide 54: Classes of Objects Discovered
Slide 55: Strategic
- World
- 2 Peasants, 2 Footmen,
- 1 Enemy, Gold, Wood, Barracks
- Reward for dead enemy
- About 1 million state/action pairs
- Algorithm
- Solve with Factored LP
- Coordination graph for action selection
Slide 56: Strategic
- World
- 9 Peasants, 3 Footmen,
- 1 Enemy, Gold, Wood, Barracks
- Reward for dead enemy
- About 3 trillion state/action pairs
- Algorithm
- Solve with factored LP
- Coordination graph for action selection
Grows exponentially in the # of agents!
Slide 57: Strategic
- World
- 9 Peasants, 3 Footmen,
- 1 Enemy, Gold, Wood, Barracks
- Reward for dead enemy
- About 3 trillion state/action pairs
- Algorithm
- Use generalized class-based value function
- Coordination graph for action selection
Instantiated Q-functions grow polynomially in the # of agents
Slide 58: Tactical
3 vs. 3
4 vs. 4
Generalize
- Planned in 3 Footmen versus 3 Enemies
- Generalized to 4 Footmen versus 4 Enemies
Slide 59: Contributions
- Efficient planning with LP decomposition [Guestrin, Koller & Parr '01]
- Multiagent action selection [Guestrin, Koller & Parr '02]
- Generalization to new environments [Guestrin, Koller, Gearhart & Kanodia '03]
- Variable coordination structure [Guestrin, Venkataraman & Koller '02]
- Multiagent reinforcement learning [Guestrin, Lagoudakis & Parr '02; Guestrin, Patrascu & Schuurmans '02]
- Hierarchical decomposition [Guestrin & Gordon '02]
Slide 60: Open Issues
- High tree-width problems
- Basis function selection
- Variable relational structure
- Partial observability
Slide 61: Thank You!
- Daphne Koller
- Committee: Leslie Kaelbling, Yoav Shoham, Claire Tomlin, Ben Van Roy
- Co-authors
- DAGS members
- Kristina and Friends
- My Family
M.S. Apaydin, D. Brutlag, F. Cozman, C.
Gearhart, G. Gordon, D. Hsu, N. Kanodia, D.
Koller, E. Krotkov, M. Lagoudakis, J.C. Latombe,
D. Ormoneit, R. Parr, R. Patrascu, D.
Schuurmans, C. Varma, S. Venkataraman.
Slide 62: Conclusions
Complex multiagent planning task
[The slide displays the exact number of states: a figure hundreds of digits long]
Formal framework for multiagent planning that scales to very, very large problems
Slide 63: Network Management Problem
- Computer runs processes
- Computer status: good, dead, faulty
- Dead neighbors increase the probability of dying
- Reward for successful processes
- Each SysAdmin takes a local action: reboot or not reboot
Ring
Ring of Rings
Star
k-grid
Slide 64: Multiagent Policy Quality
- Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]
Slide 65: Multiagent Policy Quality
- Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]
Distributed reward
Distributed value
Slide 66: Multiagent Policy Quality
- Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]
LP pair basis
LP single basis
Distributed reward
Distributed value
Slide 67: Comparing to Apricodd [Boutilier et al.]
- Apricodd
- Exploits context-specific independence (CSI)
- Factored LP
- Exploits CSI and linear independence
Slide 68: Apricodd
Ring
Star