Title: Multiagent Planning with Factored MDPs
Slide 1: Multiagent Planning with Factored MDPs
- Carlos Guestrin
- Stanford University
Slide 2: Collaborative Multiagent Planning
Long-term goals
Multiple agents
Coordinated decisions
- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control
Slide 3: Exploiting Structure
- Real-world problems have:
  - Hundreds of objects
  - Googols of states
- Real-world problems have structure!
Approach: Exploit structured representation to obtain an efficient approximate solution
Slide 4: Real-time Strategy Game
[Screenshot: peasant, footman, and building units]
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen
Slide 5: Joint Decision Space
Markov Decision Process (MDP) Representation
- State space
- Joint state x of the entire system
- Action space
- Joint action a = (a1, …, an) for all agents
- Reward function
- Total reward R(x, a)
- Transition model
- Dynamics of the entire system P(x' | x, a)
Slide 6: Policy
At state x, action a for all agents
Policy: π(x) = a
Slide 7: Value of Policy
Expected long-term reward starting from x:
V_π(x) = E[ R(x0, π(x0)) + γ R(x1, π(x1)) + γ² R(x2, π(x2)) + … | x0 = x ]
Slide 8: Optimal Long-term Plan
Optimal value function V*(x)
Optimal policy π*(x)
Bellman equations
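For reference, the Bellman optimality equations these slides refer to, in standard form (γ is the discount factor):

```latex
% Bellman optimality equation and greedy optimal policy (standard form)
V^{*}(x) \;=\; \max_{a}\Big[\, R(x,a) \;+\; \gamma \sum_{x'} P(x' \mid x,a)\, V^{*}(x') \,\Big],
\qquad
\pi^{*}(x) \;=\; \operatorname*{arg\,max}_{a}\Big[\, R(x,a) \;+\; \gamma \sum_{x'} P(x' \mid x,a)\, V^{*}(x') \,\Big]
```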
Slide 9: Solving an MDP
Solve Bellman equation
Optimal value V(x)
Optimal policy ?(x)
Many algorithms solve the Bellman equations
- Policy iteration [Howard '60; Bellman '57]
- Value iteration [Bellman '57]
- Linear programming [Manne '60]
Slide 10: LP Solution to MDP
[Manne '60]
- Value computed by linear programming:
  minimize:   Σ_x V(x)
  subject to: V(x) ≥ Q(x, a)  for all x, a
  where  Q(x, a) = R(x, a) + γ Σ_{x'} P(x' | x, a) V(x')
- One variable V(x) for each state
- One constraint for each state x and action a
- Polynomial time solution
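As a concrete illustration (not from the talk), a minimal sketch of this LP on a made-up 2-state, 2-action MDP using scipy.optimize.linprog; all transition probabilities and rewards below are hypothetical:

```python
# Sketch of the LP formulation of an MDP (Manne '60) on a tiny hypothetical MDP:
#   minimize sum_x V(x)   s.t.   V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[a, x, x'] (made-up numbers)
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                    # R[x, a] (made-up numbers)
              [0.0, 2.0]])

c = np.ones(n_states)                        # objective: minimize sum_x V(x)
A_ub, b_ub = [], []
for a in range(n_actions):
    for x in range(n_states):
        # V(x) - gamma * P(.|x,a) @ V >= R(x,a), rewritten as row @ V <= -R(x,a)
        A_ub.append(gamma * P[a, x] - np.eye(n_states)[x])
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(None, None))
V = res.x                                    # optimal value function
policy = [int(np.argmax([R[x, a] + gamma * P[a, x] @ V for a in range(n_actions)]))
          for x in range(n_states)]
print("V* =", V, "greedy policy:", policy)
```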
Slide 11: Planning under Bellman's Curse
- Planning is polynomial in the # of states and actions
- # of states is exponential in the number of variables
- # of actions is exponential in the number of agents
Efficient approximation by exploiting structure!
Slide 12: Structure in Representation: Factored MDP
[Boutilier et al. '95]
P(F' | F, G, A_B, A_F)
- State
- Dynamics
- Decisions
- Rewards
Complexity of representation: exponential in # of parents (worst case)
Slide 13: Structured Value Function?
Factored MDP ⇒ structure in V?
Not exactly, but a structured V yields a good approximate value function
Slide 14: Structured Value Functions
Linear combination of restricted-domain functions
[Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]
V(x) = Σ_i w_i h_i(x)
- Each h_i is the status of small part(s) of a complex system:
  - State of footman and enemy
  - Status of barracks
  - Status of barracks and state of footman
- Structured V ⇒ structured Q
- Must find w giving a good approximate value function
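A minimal sketch (not from the talk) of such a linear value function; the state variables, basis functions, and weights below are hypothetical, with the weights standing in for what the factored LP (next slide) would compute:

```python
# V(x) = sum_i w_i * h_i(x), where each basis function h_i looks only at a
# small part of the joint state (restricted-domain functions).
state = {"footman_health": 3, "enemy_health": 1, "barracks_ok": True}

basis = {
    "footman_vs_enemy": lambda x: x["footman_health"] - x["enemy_health"],
    "barracks":         lambda x: 1.0 if x["barracks_ok"] else 0.0,
}
weights = {"footman_vs_enemy": 0.7, "barracks": 2.5}   # hypothetical w_i

def value(x):
    """Linear value function: weighted sum of restricted-domain basis functions."""
    return sum(weights[name] * h(x) for name, h in basis.items())

print(value(state))   # 0.7 * (3 - 1) + 2.5 * 1.0 = 3.9
```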
Slide 15: Approximate LP Solution
[Schweitzer and Seidmann '85]
- One variable w_i for each basis function
- Polynomial number of LP variables
- One constraint for every state and action
- Exponentially many LP constraints
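In the standard formulation (consistent with the exact LP of slide 10), the approximate LP substitutes V(x) = Σ_i w_i h_i(x) into the objective and the constraints; state-relevance weights are omitted here for simplicity:

```latex
% Approximate LP over the weights w (Schweitzer & Seidmann '85)
\min_{w}\ \sum_{x}\sum_{i} w_i\, h_i(x)
\quad \text{s.t.} \quad
\sum_{i} w_i\, h_i(x) \;\ge\; R(x,a) + \gamma \sum_{x'} P(x' \mid x, a) \sum_{i} w_i\, h_i(x')
\qquad \forall\, x, a
```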
Slide 16: Representing Exponentially Many Constraints
[Guestrin, Koller, Parr '01]
Exponentially many linear constraints ⇒ one nonlinear constraint
Maximization over an exponentially large space
Slide 17: Variable Elimination
Structured value function:
- Variable elimination to maximize over the state space [Bertele & Brioschi '72]
Here we need only 23, instead of 63, sum operations
- Maximization only exponential in the largest factor
- Tree-width characterizes complexity
- Graph-theoretic measure of connectedness
- Arises in many settings: integer programming, Bayes nets, computational geometry, …
Slide 18: Variable Elimination (continued)
Structured value function:
- Some terms involve only a small # of A_i's and X_j's
- Others involve only a small # of X_j's
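A minimal sketch (not from the talk) of variable elimination for maximization over a sum of local factors; the factor tables, domains, and elimination order in the example are made up:

```python
# Maximize a sum of local factors by eliminating one variable at a time.
# A factor is (scope, table): `scope` is a tuple of variable names and `table`
# maps an assignment tuple (ordered as in scope) to a number. The cost is
# exponential only in the size of the largest intermediate scope (tree-width).
from itertools import product

def max_out(factors, var, domains):
    """Eliminate `var` by maximizing the sum of all factors that mention it."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = sorted({v for s, _ in touching for v in s} - {var})
    table = {}
    for assign in product(*(domains[v] for v in scope)):
        ctx = dict(zip(scope, assign))
        table[assign] = max(
            sum(t[tuple({**ctx, var: val}[v] for v in s)] for s, t in touching)
            for val in domains[var])
    return rest + [(tuple(scope), table)]

def maximize(factors, order, domains):
    """Return the max over all variables of the sum of the given factors."""
    for var in order:
        factors = max_out(factors, var, domains)
    return sum(t[()] for _, t in factors)    # every remaining scope is empty

# Example: max over x1, x2, x3 of [x1 + 2*x2] + [3*x2*x3], each variable in {0, 1}
doms = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}
f1 = (("x1", "x2"), {(a, b): a + 2 * b for a, b in product([0, 1], repeat=2)})
f2 = (("x2", "x3"), {(b, c): 3 * b * c for b, c in product([0, 1], repeat=2)})
print(maximize([f1, f2], ["x1", "x3", "x2"], doms))   # -> 6
```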
Slide 19: Representing the Constraints
- Use Variable Elimination to represent constraints
Number of constraints exponentially smaller!
Slide 20: Understanding Scaling Properties
Number of LP constraints (k = tree-width):
Explicit LP: 2^n     Factored LP: (n+1−k)·2^k
Slide 21: Network Management Problem
- Computer status: good, dead, faulty
- Dead neighbors increase the probability of dying
- Computer runs processes
- Reward for successful processes
- Each SysAdmin takes a local action: reboot or not reboot
Problem with n machines ⇒ 9^n states, 2^n actions
Ring
Ring of Rings
Star
k-grid
Slide 22: Running Time
k = tree-width
Slide 23: Summary of Algorithm
- Pick local basis functions h_i
- Factored LP computes value function
- Policy is argmax_a of Q
Slide 24: Large-scale Multiagent Coordination
- Efficient algorithm computes V
- Action at state x is argmax_a Q(x, a)
- But: # of actions is exponential
- Complete observability?
- Full communication?
Slide 25: Distributed Q Function
[Guestrin, Koller, Parr '02]
Distributed Q function: each agent maintains a part of the Q function
Q(A1, …, A4, X1, …, X4) is a sum of local terms:
Q2(A1, A2, X1,X2)
Q1(A1, A4, X1,X4)
Q4(A3, A4, X3,X4)
Q3(A2, A3, X2,X3)
Slide 26: Multiagent Action Selection
Instantiate current state x
Maximal action: argmax_a of the distributed Q function:
Q2(A1, A2, X1,X2)
Q1(A1, A4, X1,X4)
Q3(A2, A3, X2,X3)
Q4(A3, A4, X3,X4)
Slide 27: Instantiate Current State x
Instantiate current state x
Limited observability ⇒ agent i only observes variables in Q_i
Q2(A1, A2, X1,X2)
Q2(A1, A2)
Q1(A1, A4, X1,X4)
Q1(A1, A4)
Q3(A2, A3, X2,X3)
Q3(A2, A3)
Q4(A3, A4, X3,X4)
Q4(A3, A4)
Slide 28: Multiagent Action Selection
Instantiate current state x
Maximal action: argmax_a of the distributed Q function:
Q2(A1, A2)
Q1(A1, A4)
Q3(A2, A3)
Q4(A3, A4)
Slide 29: Coordination Graph
max_a [Q1 + Q2 + Q3 + Q4]
- Use variable elimination for maximization
Q2(A1, A2)
Q1(A1, A4)
Q3(A2, A3)
A2      A4      Value of optimal A3 action
Attack  Attack  5
Attack  Defend  6
Defend  Attack  8
Defend  Defend  12
- Limited communication for optimal action choice
- Communication bandwidth = tree-width of the coordination graph
Q4(A3, A4)
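A minimal sketch (not from the talk) of this action-selection step for the four-agent example: each local Q_i is first conditioned on the observed state variables, and the joint action is then chosen by maximizing their sum. The maximization is done by brute force here for clarity; the talk does it with variable elimination on the coordination graph. All Q_i definitions, state values, and numbers are made up:

```python
from itertools import product

ACTIONS = [0, 1]                              # e.g. 0 = "defend", 1 = "attack"
x = {"x1": 1, "x2": 0, "x3": 1, "x4": 0}      # current (observed) state

# Local Q functions, each over two agents' actions and two state variables:
def Q1(a1, a4, x1, x4): return (x1 + x4) * a1 - 0.1 * a1
def Q2(a1, a2, x1, x2): return x2 * a2 + 0.5 * a1 * a2
def Q3(a2, a3, x2, x3): return x3 * a3 - 0.2 * a2
def Q4(a3, a4, x3, x4): return (x3 + x4) * a4 + 0.3 * a3 * a4

def joint_Q(a1, a2, a3, a4):
    """Q(x, a) = sum of local Q_i, each conditioned on the state it observes."""
    return (Q1(a1, a4, x["x1"], x["x4"]) + Q2(a1, a2, x["x1"], x["x2"]) +
            Q3(a2, a3, x["x2"], x["x3"]) + Q4(a3, a4, x["x3"], x["x4"]))

best = max(product(ACTIONS, repeat=4), key=lambda a: joint_Q(*a))
print("joint action:", best, "value:", joint_Q(*best))
```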
Slide 30: Coordination Graph Example
[Figure: coordination graph over agents A1, …, A5]
- Trees don't increase communication requirements
- Cycles require graph triangulation
Slide 31: Unified View: Function Approximation ⇔ Multiagent Coordination
Factored MDP and value function representations
induce communication, coordination
Q1(A1, A4, X1,X4) Q2(A1, A2, X1,X2) Q3(A2,
A3, X2,X3) Q4(A3, A4, X3,X4)
Slide 32: How good are the policies?
- SysAdmin problem
- Power grid problem [Schneider et al. '99]
Slide 33: SysAdmin Ring - Quality of Policies
Slide 34: Power Grid: Factored Multiagent
[Guestrin, Lagoudakis, Parr '02]
Lower is better!
Slide 35: Summary of Algorithm
- Pick local basis functions h_i
- Factored LP computes value function
- Coordination graph computes argmax_a of Q
Slide 36: Planning Complex Environments
- When faced with a complex problem, exploit structure:
  - For planning
  - For action selection
Slide 37: Generalizing to New Problems
Many problems are similar
Solve Problem 1, Solve Problem 2, …, Solve Problem n ⇒ Good solution to Problem n+1
MDPs are different! Different sets of states, actions, rewards, transitions, …
Slide 38: Generalization with Relational MDPs
[Guestrin, Koller, Gearhart, Kanodia '03]
Similar domains have similar types of objects ⇒ Relational MDP
Exploit similarities by computing generalizable value functions
Generalization: avoid the need to replan, tackle larger problems
Slide 39: Relational Models and MDPs
- Classes
- Peasant, Gold, Wood, Barracks, Footman, Enemy
- Relations
- Collects, Builds, Trains, Attacks
- Instances
- Peasant1, Peasant2, Footman1, Enemy1
Slide 40: Relational MDPs
- Class-level transition probabilities depend on:
  - Attributes, actions, and attributes of related objects
- Class-level reward function
Very compact representation! Does not depend on # of objects
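A minimal sketch (not from the talk) of what a relational MDP schema might look like in code: the class names follow the slides, but the attributes, the transition-model shape, and all probabilities are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Enemy:
    health: int                      # attribute

@dataclass
class Footman:
    health: int                      # attribute
    target: Enemy                    # relation (Attacks)

def footman_health_transition(footman: Footman, action: str) -> dict:
    """Class-level P(F.H' | F.H, A_F, attributes of related objects)."""
    # Made-up dynamics: damage chance grows with the linked enemy's health,
    # and shrinks if this footman's action is "retreat".
    p_hit = min(0.9, 0.1 + 0.2 * footman.target.health)
    if action == "retreat":
        p_hit *= 0.5
    dist = {footman.health: 1.0 - p_hit}
    hurt = max(footman.health - 1, 0)
    dist[hurt] = dist.get(hurt, 0.0) + p_hit
    return dist                      # distribution over next health values
```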
Slide 41: Tactical Freecraft: Relational Schema
[Schema figure: Footman and Enemy classes, each with a Health attribute; Count = # of footmen attacking]
- Enemy's health depends on # of footmen attacking
- Footman's health depends on Enemy's health
Slide 42: World is a Large Factored MDP
Relational MDP + # of objects + links between objects ⇒ factored MDP
- Instantiation (world):
  - # of instances of each class
  - Links between instances
- Well-defined factored MDP
Slide 43: World with 2 Footmen and 2 Enemies
Slide 44: World is a Large Factored MDP
Relational MDP + # of objects + links between objects ⇒ factored MDP
- Instantiate world
- Well-defined factored MDP
- Use factored LP for planning
- We have gained nothing!
Slide 45: Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) ≈ V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H)
Units are interchangeable!
V_F1 and V_F2 ⇒ class-level V_F;  V_E1 and V_E2 ⇒ class-level V_E
At state x, each footman has a different contribution to V
Given V_C, we can instantiate the value function for any world
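A minimal sketch (not from the talk) of instantiating class-level value functions for worlds of different sizes; the particular functions and numbers are hypothetical:

```python
# One value function per class (V_F for Footman, V_E for Enemy); the value of a
# world is the sum of the class-level functions applied to each object instance.
def V_F(footman_health, enemy_health):    # class-level function for Footman
    return footman_health - 0.5 * enemy_health

def V_E(enemy_health):                    # class-level function for Enemy
    return -float(enemy_health)

def world_value(footmen, enemies):
    """V(x) ~= sum_i V_F(F_i.H, E_i.H) + sum_j V_E(E_j.H), for any # of objects.

    Footman i is linked to enemy i (the links come from the instantiated world).
    """
    return (sum(V_F(f_h, e_h) for f_h, e_h in zip(footmen, enemies)) +
            sum(V_E(e_h) for e_h in enemies))

# The same class-level functions apply to a 2-vs-2 world and a 4-vs-4 world:
print(world_value([3, 2], [1, 2]))
print(world_value([3, 2, 3, 1], [1, 2, 0, 2]))
```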
Slide 46: Computing Class-level V_C
- Constraints for each world represented by the factored LP
- Number of worlds is exponential or infinite
Slide 47: Sampling Worlds
Sampling: instead of constraints for all worlds ω and all x, a, keep constraints only for sampled worlds ω ∈ I and all x, a
- Many worlds are similar
- Sample a set I of worlds
Slide 48: Theorem
- Exponentially (infinitely) many worlds!
- Do we need exponentially many samples? NO!
Value function within ε of the class-level solution optimized for all worlds, with probability at least 1−δ
R_max is the maximum class reward. Proof method related to [de Farias & Van Roy '02]
Slide 49: Learning Classes of Objects
Find regularities between worlds
Objects with similar values belong to same class
Plan for sampled worlds separately
Used decision tree regression in experiments
Slide 50: Summary of Algorithm
- Model domain as a Relational MDP
- Sample a set of worlds
- Factored LP computes class-level value function for sampled worlds
- Reuse class-level value function in new world
- Coordination graph computes argmax_a of Q
Slide 51: Experimental Results
Slide 52: Generalizing to New Problems
Slide 53: Learning Classes of Objects
Slide 54: Classes of Objects Discovered
Slide 55: Strategic
- World
- 2 Peasants, 2 Footmen,
- 1 Enemy, Gold, Wood, Barracks
- Reward for dead enemy
- About 1 million state/action pairs
- Algorithm
- Solve with Factored LP
- Coordination graph for action selection
Slide 56: Strategic
- World
- 9 Peasants, 3 Footmen,
- 1 Enemy, Gold, Wood, Barracks
- Reward for dead enemy
- About 3 trillion state/action pairs
- Algorithm
- Solve with factored LP
- Coordination graph for action selection
Grows exponentially in the # of agents!
Slide 57: Strategic
- World
- 9 Peasants, 3 Footmen,
- 1 Enemy, Gold, Wood, Barracks
- Reward for dead enemy
- About 3 trillion state/action pairs
- Algorithm
- Use generalized class-based value function
- Coordination graph for action selection
Instantiated Q-functions grow polynomially in the # of agents
Slide 58: Tactical
3 vs. 3
4 vs. 4
Generalize
- Planned in 3 Footmen versus 3 Enemies
- Generalized to 4 Footmen versus 4 Enemies
Slide 59: Contributions
- Efficient planning with LP decomposition [Guestrin, Koller & Parr '01]
- Multiagent action selection [Guestrin, Koller & Parr '02]
- Generalization to new environments [Guestrin, Koller, Gearhart & Kanodia '03]
- Variable coordination structure [Guestrin, Venkataraman & Koller '02]
- Multiagent reinforcement learning [Guestrin, Lagoudakis & Parr '02; Guestrin, Patrascu & Schuurmans '02]
- Hierarchical decomposition [Guestrin & Gordon '02]
Slide 60: Open Issues
- High tree-width problems
- Basis function selection
- Variable relational structure
- Partial observability
Slide 61: Thank You!
- Daphne Koller
- Committee: Leslie Kaelbling, Yoav Shoham, Claire Tomlin, Ben Van Roy
- Co-authors
- DAGS members
- Kristina and Friends
- My Family
M.S. Apaydin, D. Brutlag, F. Cozman, C.
Gearhart, G. Gordon, D. Hsu, N. Kanodia, D.
Koller, E. Krotkov, M. Lagoudakis, J.C. Latombe,
D. Ormoneit, R. Parr, R. Patrascu, D.
Schuurmans, C. Varma, S. Venkataraman.
Slide 62: Conclusions
Complex multiagent planning task
[The slide displays the exact number of states: a figure hundreds of digits long]
Formal framework for multiagent planning that scales to very, very large problems
Slide 63: Network Management Problem
- Computer runs processes
- Computer status: good, dead, faulty
- Dead neighbors increase the probability of dying
- Reward for successful processes
- Each SysAdmin takes a local action: reboot or not reboot
Ring
Ring of Rings
Star
k-grid
Slide 64: Multiagent Policy Quality
- Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]
Slide 65: Multiagent Policy Quality
- Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]
Distributed reward
Distributed value
Slide 66: Multiagent Policy Quality
- Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]
LP pair basis
LP single basis
Distributed reward
Distributed value
Slide 67: Comparing to Apricodd [Boutilier et al.]
- Apricodd
- Exploits context-specific independence (CSI)
- Factored LP
- Exploits CSI and linear independence
Slide 68: Apricodd
Ring
Star