Title: Multi-Agent Planning in Complex Uncertain Environments
1Multi-Agent Planning in Complex Uncertain Environments
- Daphne Koller
- Stanford University
Joint work with Carlos Guestrin (CMU) and Ronald Parr (Duke)
2Collaborative Multiagent Planning
Long-term goals
Multiple agents
Coordinated decisions
- Search and rescue, firefighting
- Factory management
- Multi-robot tasks (Robosoccer)
- Network routing
- Air traffic control
- Computer game playing
3Joint Planning Space
- Joint action space
- Each agent i takes action ai at each step
- Joint action a = (a1, …, an) for all agents
- Joint state space
- Assignment (x1, …, xn) to some set of variables X1, …, Xn
- Joint state x = (x1, …, xn) of the entire system
- Joint system: payoffs and state dynamics depend on the joint state and joint action
- Cooperative agents: want to maximize the total payoff
4Exploiting Structure
- Real-world problems have
- Hundreds of objects
- Googols of states
- Real-world problems have structure!
Approach: exploit the structured representation to obtain an efficient approximate solution
5Outline
- Action Coordination
- Factored Value Functions
- Coordination Graphs
- Context-Specific Coordination
- Joint Planning
- Multi-Agent Markov Decision Processes
- Efficient Linear Programming Solution
- Decentralized Market-Based Solution
- Generalizing to New Environments
- Relational MDPs
- Generalizing Value Functions
6One-Shot Optimization Task
- Q-function Q(x,a) encodes the agents' payoff for joint action a in joint state x
- Agents' task: compute argmaxa Q(x,a)
- Problem: # of joint actions is exponential in the number of agents
- Problem: requires complete state observability
- Problem: requires full agent communication
7Factored Payoff Function
- Approximate the Q function as a sum of Q sub-functions (see the sketch below)
- Each sub-function depends on a local part of the system, e.g.:
- Two interacting agents
- An agent and an important resource
- Two inter-dependent pieces of machinery
Q(A1, …, A4, X1, …, X4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
[K. & Parr '99, '00; Guestrin, K. & Parr '01]
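To make the decomposition concrete, here is a minimal Python sketch (not from the talk; the payoff values and function names are illustrative assumptions) of a factored Q-function for the four-agent example above: the global payoff is the sum of local sub-functions, each touching only two agents and two state variables.

```python
# Illustrative sketch of a factored Q-function (toy payoffs, not the talk's model).
# Each sub-function Qi depends only on a local part of the system.

def q1(a1, a4, x1, x4): return 1.0 if a1 == a4 else 0.0
def q2(a1, a2, x1, x2): return 2.0 if (x1 and a1 == a2) else 0.0
def q3(a2, a3, x2, x3): return 1.0 if (x2 or x3) and a2 != a3 else 0.0
def q4(a3, a4, x3, x4): return -1.0 if (a3 == a4 == 0) else 0.5

def q_total(a, x):
    """Q(a1..a4, x1..x4) is approximated as Q1 + Q2 + Q3 + Q4."""
    a1, a2, a3, a4 = a
    x1, x2, x3, x4 = x
    return (q1(a1, a4, x1, x4) + q2(a1, a2, x1, x2) +
            q3(a2, a3, x2, x3) + q4(a3, a4, x3, x4))
```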
8Distributed Q Function
- Q sub-functions assigned to relevant agents
Q(A1, …, A4, X1, …, X4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
[Guestrin, K. & Parr '01]
9Multiagent Action Selection
Instantiate the current state x
Maximal joint action: argmaxa Σi Qi
Distributed Q function:
Q2(A1, A2, X1,X2)
Q1(A1, A4, X1,X4)
Q3(A2, A3, X2,X3)
Q4(A3, A4, X3,X4)
10Instantiating State x
Limited observability: agent i only observes the variables in Qi
Q2(A1, A2, X1,X2)
Q1(A1, A4, X1,X4)
Q3(A2, A3, X2,X3)
Q4(A3, A4, X3,X4)
11Choosing Action at State x
Instantiate current state x
Q1(A1, A4, X1, X4) → Q1(A1, A4)
Q2(A1, A2, X1, X2) → Q2(A1, A2)
Q3(A2, A3, X2, X3) → Q3(A2, A3)
Q4(A3, A4, X3, X4) → Q4(A3, A4)
12Variable Elimination
- Use variable elimination to compute maxa [Q1(A1, A4) + Q2(A1, A2) + Q3(A2, A3) + Q4(A3, A4)]
- Limited communication suffices for the optimal action choice
- Communication bandwidth = tree-width of the coordination graph
13Choosing Action at State x
14Choosing Action at State x
Eliminate A3: g1(A2, A4) = maxA3 [Q3(A2, A3) + Q4(A3, A4)]
Remaining factors: Q1(A1, A4), Q2(A1, A2), g1(A2, A4) (a generic sketch of this elimination follows)
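The elimination step above can be written generically. Below is a minimal Python sketch (illustrative, not the authors' implementation); the binary action space and the factor tables are assumptions made for brevity.

```python
from itertools import product

ACTIONS = [0, 1]  # toy binary action space

def eliminate(factors, agent):
    """Max out `agent`: combine all factors that mention it into one new factor."""
    touching = [f for f in factors if agent in f[0]]
    rest = [f for f in factors if agent not in f[0]]
    new_scope = tuple(sorted({v for scope, _ in touching for v in scope} - {agent}))
    new_table = {}
    for assignment in product(ACTIONS, repeat=len(new_scope)):
        ctx = dict(zip(new_scope, assignment))
        new_table[assignment] = max(
            sum(t[tuple({**ctx, agent: a}[v] for v in scope)] for scope, t in touching)
            for a in ACTIONS)
    return rest + [(new_scope, new_table)]

def max_joint_value(factors, order):
    """Compute the max over all joint actions of the sum of the factors."""
    for agent in order:
        factors = eliminate(factors, agent)
    return sum(t[()] for _, t in factors)  # only empty-scope constants remain

# Example: the four-agent ring from the slides, with made-up payoff tables.
def table(scope, fn):
    return (scope, {a: fn(*a) for a in product(ACTIONS, repeat=len(scope))})

Q1 = table(("A1", "A4"), lambda a1, a4: 1.0 if a1 == a4 else 0.0)
Q2 = table(("A1", "A2"), lambda a1, a2: 2.0 * a1 * a2)
Q3 = table(("A2", "A3"), lambda a2, a3: 1.0 if a2 != a3 else 0.0)
Q4 = table(("A3", "A4"), lambda a3, a4: 0.5 * (a3 + a4))

# Eliminating A3 first produces g1(A2, A4) = max_A3 [Q3 + Q4], as on the slide.
print(max_joint_value([Q1, Q2, Q3, Q4], order=["A3", "A4", "A2", "A1"]))
```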
15Coordination Graphs
- Communication follows a triangulated graph
- Computation grows exponentially in the tree width
- Tree width: a graph-theoretic measure of connectedness
- Arises in BNs, CSPs, ...
- Cost is exponential in the worst case, but fairly low for many real graphs
[Figure: example coordination graph over agents A1, ..., A11]
16Context-Specific Interactions
- Payoff structure can vary by context
- Example: agents A1 and A2 both trying to pass through the same narrow corridor
- Can use context-specific value rules, e.g.:
- ⟨ At(X,A1) ∧ At(X,A2) ∧ A1 = fwd ∧ A2 = fwd : -100 ⟩
- Hope: context-specific payoffs will induce context-specific coordination (see the sketch below)
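As an illustration, a context-specific value rule like the corridor rule above can be represented as a (context, value) pair that contributes its value only when its context holds in the current state and joint action. This is a hypothetical sketch; the predicate and attribute names are assumptions, not the talk's code.

```python
# Minimal sketch of a context-specific value rule (names are illustrative).
from dataclasses import dataclass

@dataclass
class ValueRule:
    context: dict   # e.g. {"At(X,A1)": True, "At(X,A2)": True, "A1": "fwd", "A2": "fwd"}
    value: float    # e.g. -100.0

    def applies(self, state_and_action: dict) -> bool:
        # The rule fires only if every context assignment matches.
        return all(state_and_action.get(k) == v for k, v in self.context.items())

corridor_rule = ValueRule(
    context={"At(X,A1)": True, "At(X,A2)": True, "A1": "fwd", "A2": "fwd"},
    value=-100.0)

def rule_based_q(rules, state_and_action):
    """Sum the values of all rules whose context matches."""
    return sum(r.value for r in rules if r.applies(state_and_action))
```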
17Context-Specific Coordination
[Figure: coordination graph over agents A1-A6]
Instantiate the current state x
18Context-Specific Coordination
Coordination structure varies based on context
[Figure: coordination graph over agents A1-A6 in the instantiated context]
19Context-Specific Coordination
Coordination structure varies based on communication
[Figure: coordination graph over agents A1-A6]
Maximizing out A1: rule-based variable elimination [Zhang & Poole '99]
20Context-Specific Coordination
Coordination structure varies based on agent decisions
[Figure: coordination graph with A1 eliminated]
Eliminate A1 from the graph: rule-based variable elimination [Zhang & Poole '99]
21Robot Soccer
Kok, Vlassis & Groen, University of Amsterdam
- UvA Trilearn 2002 won the German Open 2002, but placed fourth in RoboCup-2002.
- The improvements introduced in UvA Trilearn 2003 include an extension of the intercept skill, improved passing behavior, and especially the usage of coordination graphs to specify the coordination requirements between the different agents.
22RoboSoccer Value Rules
- Coordination graph rules include conditions on player role and aspects of the global system state
- Example rules for player i, in the role of passer:
- Rule value depends on the distance of player j to the goal after the move
23UvA Trilearn 2003 Results
- UvA Trilearn won
- German Open 2003
- US Open 2003
- RoboCup 2003
- German Open 2004
24Outline
- Action Coordination
- Factored Value Functions
- Coordination Graphs
- Context-Specific Coordination
- Joint Planning
- Multi-Agent Markov Decision Processes
- Efficient Linear Programming Solution
- Decentralized Market-Based Solution
- Generalizing to New Environments
- Relational MDPs
- Generalizing Value Functions
25Real-Time Strategy Game
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen
[Figure: game screenshot with a peasant, a footman, and a building labeled]
26Planning Over Time
Markov Decision Process (MDP) representation:
- Action space: joint agent actions a = (a1, …, an)
- State space: joint state descriptions x = (x1, …, xn)
- Momentary reward function R(x,a)
- Probabilistic system dynamics P(x' | x, a)
27Policy
At state x, an action a for all agents
Policy: π(x) = a
28Value of Policy
Expected long-term reward starting from x, following π(x0), π(x1), ...
Value: Vπ(x)
29Optimal Long-term Plan
Optimal Q-function: Q*(x,a)
Optimal policy: π*(x) = argmaxa Q*(x,a)
Bellman equations (standard form shown below)
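The Bellman equations referenced here were an image in the original slides; for reference, the standard discounted-reward form (with discount factor γ) is:

```latex
\begin{align*}
Q^*(\mathbf{x},\mathbf{a}) &= R(\mathbf{x},\mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a})\, V^*(\mathbf{x}'), \\
V^*(\mathbf{x}) &= \max_{\mathbf{a}} Q^*(\mathbf{x},\mathbf{a}), \qquad
\pi^*(\mathbf{x}) = \arg\max_{\mathbf{a}} Q^*(\mathbf{x},\mathbf{a}).
\end{align*}
```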
30Solving an MDP
Solve the Bellman equation → optimal value V*(x) → optimal policy π*(x)
Many algorithms solve the Bellman equations:
- Policy iteration [Howard '60; Bellman '57]
- Value iteration [Bellman '57]
- Linear programming [Manne '60]
31LP Solution to MDP
- One variable V(x) for each state x
- One constraint for each state x and action a
- Polynomial-time solution in the size of the state and action spaces (see the LP below)
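For reference, the classical LP formulation [Manne '60] can be written as follows, where α(x) > 0 is a state-relevance weighting (the particular weighting used in the talk is not shown here):

```latex
\begin{align*}
\min_{V} \;\; & \sum_{\mathbf{x}} \alpha(\mathbf{x})\, V(\mathbf{x}) \\
\text{s.t.}\;\; & V(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a})\, V(\mathbf{x}')
  \quad \forall\, \mathbf{x},\mathbf{a}.
\end{align*}
```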
32Are We Done?
- Planning is polynomial in the number of states and actions
- # of states is exponential in the number of variables
- # of actions is exponential in the number of agents
Efficient approximation by exploiting structure!
33Structured Representation
Factored MDP [Boutilier et al. '95]
[Figure: dynamic Bayesian network for the transition model, e.g., P(F' | F, G, AB, AF)]
- State
- Dynamics
- Decisions
- Rewards
Complexity of representation: exponential in the number of parents (worst case)
34Structured Value Function?
Does a factored MDP imply structure in V? In general, no: the exact value function of a factored MDP need not be factored.
However, a factored V often provides a good approximate value function.
35Structured Value Functions
[Bellman et al. '63; Tsitsiklis & Van Roy '96; K. & Parr '99, '00]
- Approximate V* as a factored value function V ≈ Σi wi hi
- In the rule-based case:
- hi is a rule concerning a small part of the system
- wi is the value (weight) associated with the rule
- Goal: find w giving a good approximation V to V*
- The resulting Q function is also factored: Q ≈ Σi Qi
- Can use the coordination graph
36Approximate LP Solution
- One variable wi for each basis function: only a polynomial number of LP variables
- One "≥" constraint for every state and action: exponentially many LP constraints
- (The approximate LP is written out below.)
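Substituting the factored form V(x) = Σi wi hi(x) into the exact LP gives the approximate LP below (the standard form of approximate linear programming; the objective weights αi are induced by the state-relevance weights):

```latex
\begin{align*}
\min_{w} \;\; & \sum_{i} \alpha_i\, w_i \\
\text{s.t.}\;\; & \sum_i w_i\, h_i(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \sum_i w_i\, h_i(\mathbf{x}')
  \quad \forall\, \mathbf{x},\mathbf{a}.
\end{align*}
```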
37So What Now?
[Guestrin, K. & Parr '01]
Exponentially many linear constraints ≡ one nonlinear constraint (see below)
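The exponentially many constraints of the approximate LP can be folded into a single nonlinear constraint, which the factored LP then handles with variable elimination (next slide):

```latex
\[
0 \;\ge\; \max_{\mathbf{x},\mathbf{a}} \Big[\, R(\mathbf{x},\mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \sum_i w_i\, h_i(\mathbf{x}')
  \;-\; \sum_i w_i\, h_i(\mathbf{x}) \,\Big].
\]
```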
38Variable Elimination Revisited
[Guestrin, K. & Parr '01]
- Use Variable Elimination to represent constraints
Exponentially fewer constraints
Polynomial-size LP for finding a good factored approximation to V*
39Network Management Problem
- Each computer runs processes
- Computer status ∈ {good, faulty, dead}
- Dead neighbors increase the probability of dying
- Reward for successful processes
- Each SysAdmin takes a local action ∈ {reboot, not reboot}
Network topologies: ring, ring of rings, star, k-grid
40Scaling of Factored LP
41Multiagent Running Time
[Plot: running time vs. problem size for ring of rings, star with pair basis, and star with single basis]
42Strategic 2x2
- Factored MDP model: 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (1 million state/action pairs)
- Factored LP computes the value function Q
- Coordination graph computes argmaxa Q(x,a) to act in the world
43Demo Strategic 2x2
[Guestrin, Koller, Gearhart & Kanodia]
44Limited Interaction MDPs
[Guestrin & Gordon '02]
- Some MDPs have additional structure
- Agents are largely autonomous
- Interact in limited ways
- e.g., competing for resources
- Can decompose the MDP as a set of agent-based MDPs with a limited interface
45Limited Interaction MDPs
[Guestrin & Gordon '02]
- In such MDPs, our LP matrix is highly structured
- Can use Dantzig-Wolfe LP decomposition to solve the LP optimally, in a decentralized way
- Gives rise to a market-like algorithm with multiple agents and a centralized auctioneer
46Auction-style planning
Set pricing based on conflicts [Guestrin & Gordon '02]
- Each agent solves its local (stand-alone) MDP
- Agents send constraint messages to the auctioneer
- Agents must agree on the policy for shared variables
- The auctioneer sends pricing messages to the agents
- Pricing reflects penalties for constraint violations and influences the agents' rewards in their MDPs
- (A simplified sketch of this loop follows.)
47Fuel Allocation Problem
[Figure: map with UAV start locations and targets]
- UAVs share a pot of fuel
- Targets have varying priority
- Ignore target interference
[Bererton, Gordon, Thrun & Khosla]
48Fuel Allocation Problem
[Bererton, Gordon, Thrun & Khosla '03]
49High-Speed Robot Paintball
[Bererton, Gordon & Thrun]
50High-Speed Robot Paintball
[Figures: game variants 1 and 2, with the coordination point, sensor placement, start location (x), and goal location marked]
51High-Speed Robot Paintball
[Bererton, Gordon & Thrun]
52Outline
- Action Coordination
- Factored Value Functions
- Coordination Graphs
- Context-Specific Coordination
- Joint Planning
- Multi-Agent Markov Decision Processes
- Efficient Linear Programming Solution
- Decentralized Market-Based Solution
- Generalizing to New Environments
- Relational MDPs
- Generalizing Value Functions
53Generalizing to New Problems
Many problems are similar: after solving Problems 1, 2, ..., n, we would like a good solution to Problem n+1 without planning from scratch
But the MDPs are different! Different sets of states, actions, rewards, transitions, ...
54Generalizing with Relational MDPs
Similar domains have similar types of objects → Relational MDP
Exploit similarities by computing generalizable value functions
Generalization: avoid the need to replan, and tackle larger problems
55Relational Models and MDPs
[Guestrin, K., Gearhart & Kanodia '03]
- Classes: Peasant, Footman, Gold, Barracks, Enemy
- Relations: Collects, Builds, Trains, Attacks
- Instances: Peasant1, Peasant2, Footman1, Enemy1
- Builds on Probabilistic Relational Models [K. & Pfeffer '98]
56Relational MDPs
[Guestrin, K., Gearhart & Kanodia '03]
[Figure: class-level transition model for the Footman and Enemy classes]
- Class-level transition probabilities depend on: attributes, actions, and attributes of related objects
- Class-level reward function
- Very compact representation! Does not depend on the # of objects
57World is a Large Factored MDP
Relational MDP + # of objects + links between objects → Factored MDP
- Instantiation (world): # of instances of each class, and links between instances
- Yields a well-defined factored MDP
58MDP with 2 Footmen and 2 Enemies
59World is a Large Factored MDP
Relational MDP + # of objects + links between objects → Factored MDP
- Instantiate the world → a well-defined factored MDP
- Use the factored LP for planning
- But we have gained nothing yet!
60Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) ≈ VF1(F1.H, E1.H) + VE1(E1.H) + VF2(F2.H, E2.H) + VE2(E2.H)
Units are interchangeable!
VF1 = VF2 = VF, and VE1 = VE2 = VE
At state x, each footman has a different contribution to V
Given the class-level weights wC, we can instantiate the value function for any world (see below)
61Factored LP-based Generalization
Sample a set I of worlds → class-level factored LP computes VF, VE → generalize to new worlds
How many samples are needed?
62Sampling Complexity
Exponentially many worlds → do we need exponentially many samples?
The # of objects in a world is unbounded → must we sample very large worlds?
NO!
63Theorem
Sample m small worlds of up to O(ln 1/ε) objects.
Then the resulting value function is within O(ε) of the class-level value function optimized for all worlds, with probability at least 1-δ.
(RCmax, the maximum class reward, enters the bound on the number of samples m.)
64Strategic 2x2
- Relational MDP model: 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (1 million state/action pairs)
- Factored LP computes the value function Q
- Coordination graph computes argmaxa Q(x,a) to act in the world
65Strategic 9x3
- Relational MDP model: 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (3 trillion state/action pairs, growing exponentially in the # of agents)
- Factored LP computes the value function Q
- Coordination graph computes argmaxa Q(x,a) to act in the world
66Strategic Generalization
- Relational MDP model: 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (1 million state/action pairs)
- Class-level factored LP computes class-level value-function weights wC; the instantiated Q-functions grow only polynomially in the # of agents
- Coordination graph computes argmaxa Q(x,a) to act in the world
67Demo Generalized 9x3
[Guestrin, Koller, Gearhart & Kanodia]
68Tactical Generalization
[Figure: generalizing from the 3 vs. 3 task to the 4 vs. 4 task]
- Planned in 3 Footmen versus 3 Enemies
- Generalized to 4 Footmen versus 4 Enemies
69Demo Planned Tactical 3x3
[Guestrin, Koller, Gearhart & Kanodia]
70Demo Generalized Tactical 4x4
[Guestrin, K., Gearhart & Kanodia '03]
71Summary
Effective planning under uncertainty
Distributed coordinated action selection
Generalization to new problems
Structured Multi-Agent MDPs
72Important Questions
Continuous spaces
Partial observability
Complex actions
Learning to act
How far can we go??
73Thank You!
http://robotics.stanford.edu/koller
Carlos Guestrin, Ronald Parr
Chris Gearhart, Neal Kanodia, Shobha Venkataraman
Curt Bererton, Geoff Gordon, Sebastian Thrun
Jelle Kok, Matthijs Spaan, Nikos Vlassis