Title: Graphical Models for Online Solutions to Interactive POMDPs
1. Graphical Models for Online Solutions to Interactive POMDPs
International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007)
- Prashant Doshi (University of Georgia, USA)
- Yifeng Zeng (Aalborg University, Denmark)
- Qiongyu Chen (National Univ. of Singapore)
2. Decision-Making in Multiagent Settings

[Figure: agents i and j act on a shared state S; agent i has actions Ai, observations Oi, and a belief over the state and a model of j; agent j has actions Aj, observations Oj, and a belief over the state and a model of i]

Each agent acts to optimize its preferences given its beliefs.
3. Finitely Nested I-POMDP (Gmytrasiewicz & Doshi, 2005)

- A finitely nested I-POMDP of agent i with strategy level l (written compactly below)
- Interactive states:
  - Beliefs about the physical environment
  - Beliefs about other agents in terms of their preferences, capabilities, and beliefs (their type)
- A: joint actions
- Ωi: possible observations
- Ti: transition function, S × A × S → [0,1]
- Oi: observation function, S × A × Ωi → [0,1]
- Ri: reward function, S × A → ℝ
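In the notation of Gmytrasiewicz & Doshi (2005), the definition can be stated compactly as follows; the tuple form is reconstructed from the cited paper rather than shown on the slide:

```latex
% Finitely nested I-POMDP of agent i at strategy level l
% (reconstructed from Gmytrasiewicz & Doshi, 2005):
\mathrm{I\text{-}POMDP}_{i,l} = \left\langle IS_{i,l},\, A,\, \Omega_i,\, T_i,\, O_i,\, R_i \right\rangle,
\qquad IS_{i,l} = S \times M_{j,l-1},
\qquad A = A_i \times A_j
```

Here M_{j,l-1} denotes the set of level l-1 models of the other agent j.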
4. Belief Update
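The update equation on this slide appears to have been an image that did not survive extraction. For completeness, here is a sketch of the level-l interactive belief update from the cited paper (two-agent case, notation lightly simplified; this reconstruction, not the slide, is the source of the exact form):

```latex
% beta is a normalizing constant; SE denotes j's belief update
% (state estimation); delta is the Kronecker/Dirac delta.
b_{i,l}^{t}(is^{t}) = \beta \sum_{is^{t-1}} b_{i,l}^{t-1}(is^{t-1})
  \sum_{a_j^{t-1}} \Pr\!\left(a_j^{t-1} \mid m_{j,l-1}^{t-1}\right)
  T_i\!\left(s^{t-1}, a^{t-1}, s^{t}\right)
  O_i\!\left(s^{t}, a^{t-1}, o_i^{t}\right)
  \sum_{o_j^{t}} O_j\!\left(s^{t}, a^{t-1}, o_j^{t}\right)
  \delta\!\left(b_{j}^{t} - SE\!\left(b_{j}^{t-1}, a_j^{t-1}, o_j^{t}\right)\right)
```

The nesting is what makes this expensive: updating i's belief requires updating j's beliefs for every action-observation combination of j.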
5. Forget It!

- A different approach: use the language of Influence Diagrams (IDs) to represent the problem more transparently
- Belief update and solution: use standard ID algorithms
6. Challenges

- Representation of nested models of other agents: the influence diagram is a single-agent oriented language
- Updating beliefs over the models of other agents
- New models of other agents: over time, agents revise their beliefs over the models of others as they receive observations
7. Related Work

- Multiagent Influence Diagrams (MAIDs) (Koller & Milch, 2001)
  - Use IDs to represent incomplete-information games
  - Compute Nash equilibrium solutions efficiently by exploiting conditional independence
- Networks of Influence Diagrams (NIDs) (Gal & Pfeffer, 2003)
  - Allow uncertainty over the game
  - Allow multiple models of an individual agent
  - Solution involves collapsing the models into a MAID or ID
- Both model static, single-play games
  - They do not consider agent interactions over time (sequential decision-making)
8. Introduce Model Node and Policy Link

- A generic level l Interactive-ID (I-ID) for agent i situated with one other agent j
- Model node Mj,l-1: the models of agent j at level l-1
- Policy link (dashed line): a distribution over the other agent's actions given its models (formalized below)
- Beliefs on Mj,l-1: P(Mj,l-1 | s)
- Update?

[Figure: the I-ID with chance nodes S and Oi, model node Mj,l-1 with a policy link to chance node Aj, decision node Ai, and utility node Ri]
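One way to state the policy link semantics (my paraphrase; the slide shows only the diagram): the distribution that the chance node Aj inherits is the mixture of the candidate models' solutions under i's belief over those models,

```latex
\Pr(a_j \mid s) \;=\; \sum_{m_{j,l-1} \in M_{j,l-1}}
  P(m_{j,l-1} \mid s)\, \Pr(a_j \mid m_{j,l-1})
```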
9. Details of the Model Node

- Members of the model node: the different chance nodes (Aj1, Aj2) are the solutions of the models mj,l-1^1 and mj,l-1^2
- Mod[Mj] represents the different models of agent j
- The CPT of the chance node Aj is a multiplexer: it assumes the distribution of each of the action nodes (Aj1, Aj2) depending on the value of Mod[Mj] (see the sketch below)
- mj,l-1^1 and mj,l-1^2 could themselves be I-IDs or IDs

[Figure: the model node Mj,l-1 expanded: S feeds Mod[Mj]; the models mj,l-1^1 and mj,l-1^2 supply action nodes Aj1 and Aj2, which feed the multiplexer node Aj]
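To make the multiplexer concrete, here is a minimal Python sketch. The model names (mj1, mj2) and their action distributions are hypothetical, standing in for the solved policies of j's candidate models; this is illustrative code, not the authors' implementation.

```python
# Solutions of j's two candidate models (hypothetical distributions),
# as would be produced by solving each lower-level ID or I-ID.
action_nodes = {
    "mj1": {"L": 0.8, "OL": 0.1, "OR": 0.1},  # Aj1: solution of mj,l-1^1
    "mj2": {"L": 0.1, "OL": 0.8, "OR": 0.1},  # Aj2: solution of mj,l-1^2
}

def aj_cpt(mod_mj):
    """Multiplexer CPT of chance node Aj: assume the distribution of
    the action node picked out by the value of Mod[Mj]."""
    return action_nodes[mod_mj]

def policy_link(p_models):
    """Marginal over j's actions induced by the policy link: mix the
    selected action distributions with i's belief P(Mj,l-1 | s)."""
    marginal = {}
    for mj, p in p_models.items():
        for aj, q in aj_cpt(mj).items():
            marginal[aj] = marginal.get(aj, 0.0) + p * q
    return marginal

# Example: i assigns probability 0.7 to model mj1.
print(policy_link({"mj1": 0.7, "mj2": 0.3}))
# -> roughly {'L': 0.59, 'OL': 0.31, 'OR': 0.1}
```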
10. Whole I-ID

[Figure: the complete level l I-ID: chance nodes S and Oi, decision node Ai, utility node Ri, and the expanded model node with Mod[Mj], candidate models mj,l-1^1 and mj,l-1^2 (which could be I-IDs or IDs), and their action nodes Aj1 and Aj2 feeding Aj]
11. Interactive Dynamic Influence Diagrams (I-DIDs)

[Figure: a time slice of the I-DID with nodes St, Oit, Ait, Ajt, Ri, and model node Mj,l-1t; the model update link connects the model nodes across time steps]
12. Semantics of Model Update Link

[Figure: model nodes Mj,l-1t and Mj,l-1t+1 expanded between states st and st+1; the two models at time t (mj,l-1t,1 and mj,l-1t,2, with action nodes Aj1 and Aj2 and observation nodes Oj1 and Oj2) branch into four updated models at time t+1 (mj,l-1t+1,1 through mj,l-1t+1,4, with action nodes Aj1 through Aj4) feeding Ajt+1 via Mod[Mj]t+1]

- These models differ in their initial beliefs, each of which is the result of j updating its beliefs due to its actions and possible observations
13. Notes

- The updated set of models at time step t+1 will have at most |Mjt| × |Aj| × |Ωj| models, where |Mjt| is the number of models at time step t, |Aj| is the largest space of actions, and |Ωj| is the largest space of observations (e.g., 2 models, 2 actions, and 2 observations yield at most 8 updated models)
- The new distribution over the updated models (see the sketch below) uses:
  - the original distribution over the models,
  - the probability of the other agent performing the action, and
  - the probability of receiving the observation that led to the updated model
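As an illustration, a small Python sketch of this bookkeeping. The callables p_action, p_obs, and update are hypothetical placeholders for the solved model's policy, j's observation function, and j's belief update; they are not part of the slides.

```python
from itertools import product

def update_models(p_models, actions, observations, p_action, p_obs, update):
    """Sketch of the model update link: each model of j at time t
    branches into one updated model per (action, observation) pair,
    so at most |M| * |A| * |Omega| models result.

    p_models: dict mapping model id -> probability at time t
    p_action(m, a): Pr(a | m), from the solved model m
    p_obs(m, a, o): Pr(o | m, a), from j's observation function
    update(m, a, o): id of the model with j's updated belief
    """
    updated = {}
    for (m, p_m), a, o in product(p_models.items(), actions, observations):
        m_next = update(m, a, o)
        # Weight combines the original model probability, the action
        # probability under that model, and the observation likelihood.
        w = p_m * p_action(m, a) * p_obs(m, a, o)
        updated[m_next] = updated.get(m_next, 0.0) + w
    total = sum(updated.values())
    return {m: w / total for m, w in updated.items()}
```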
14. [Figure only: the model update shown once more: models mj,l-1t,1 and mj,l-1t,2 expand into updated models mj,l-1t+1,1 through mj,l-1t+1,4 with action nodes Aj1 through Aj4]
15. Example Applications: Emergence of Social Behaviors

- Followership and leadership in the persistent multiagent tiger problem
- Altruism and reciprocity in the public good problem with punishment
- Strategies in a simple version of two-player poker
16. Followership and Leadership in Multiagent Persistent Tiger

- Experimental setup:
  - Agent j has a better hearing capability (95% accurate) compared to i's (65% accuracy)
  - Agent i does not have initial information about the tiger's location
  - Agent i considers two models of agent j, which differ in j's level-0 initial beliefs:
    - Agent j likely thinks that the tiger is behind the left door
    - Agent j likely thinks that the tiger is behind the right door
- Solve the corresponding level 1 I-DID expanded over three time steps and obtain the normative behavioral policy of agent i
17. Level 1 I-ID in the Tiger Problem

[Figure: the level 1 I-ID expanded over three time steps; the decision nodes of j's models are mapped to chance nodes]
18. Policy Tree 1: Agent i Has Hearing Accuracy of 65%

[Figure: three-step policy tree for agent i; nodes are i's actions (L = listen, OL/OR = open left/right door) and edges are observation combinations of growls (GL, GR) with creaks or silence (CL, CR, S); i opens a door only when its own growl observation agrees with the creak it hears from j]

Conditional Followership
19. Policy Tree 2: Agent i Loses Hearing Ability (Accuracy Is 0.5)

[Figure: policy tree for agent i; growl observations (now uninformative) are ignored and branches depend only on creaks or silence (CR, S, CL): i opens a door (OR, OL) after hearing a creak and otherwise keeps listening (L)]

Unconditional (Blind) Followership
20. Example 2: Altruism and Reciprocity in the Public Good Problem

- Public good game:
  - Two agents, each initially endowed with XT amount of resources
  - Each agent may choose to:
    - contribute (C) a fixed amount of the resources to a public pot, or
    - not contribute, i.e., defect (D)
  - Agents' actions and the pot are not observable, but each agent receives an observation symbolizing the state of the public pot:
    - plenty (PY)
    - meager (MR)
  - The value of the resources in the public pot is discounted by ci for agent i, where ci is the marginal private return
  - To encourage contributions, the contributing agents punish free riders (penalty P) but incur a small cost cp for administering the punishment (a sketch of the implied stage payoff follows)
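A hedged reading of the stage payoff these rules imply; the fixed contribution amount x_c and the indicator notation are my assumptions, as the slide does not give a functional form:

```latex
% One-step payoff of agent i (sketch, not from the slides):
u_i(a_i, a_j) =
  \underbrace{\big(X_T - x_c\,\mathbb{1}[a_i{=}C]\big)}_{\text{kept resources}}
  + \underbrace{c_i\, x_c \big(\mathbb{1}[a_i{=}C] + \mathbb{1}[a_j{=}C]\big)}_{\text{discounted pot}}
  - P\,\mathbb{1}[a_i{=}D,\ a_j{=}C]
  - c_p\,\mathbb{1}[a_i{=}C,\ a_j{=}D]
```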
21. Agent Types

- Altruistic and non-altruistic types:
  - An altruistic agent has a high marginal private return (ci is close to 1) and does not punish others who defect
- Optimal behavior:
  - One action remaining: both types of agents choose to contribute, to avoid being punished
  - Two actions to go: the altruistic type chooses to contribute, while the other defects. Why?
  - Three steps to go: the altruistic agent contributes to avoid punishment, and the non-altruistic type defects
  - More than three steps: the altruistic agent continues to contribute to the public pot depending on how close its marginal return is to 1; the non-altruistic type prescribes defection
22. Level 1 I-ID in the Public Good Game

[Figure: the level 1 I-ID expanded over three time steps]
23. Policy Tree 1: Altruism in PG

[Figure: policy tree in which agent i contributes (C) at every step]

- If agent i (altruistic type) believes with probability 1 that j is altruistic, i chooses to contribute at each of the three steps
- This behavior persists when i is unaware of whether j is altruistic, and even when i assigns a high probability to j being the non-altruistic type
24. Policy Tree 2: Reciprocal Agents

- Reciprocal type:
  - The reciprocal type's marginal private return is lower, and it obtains a greater payoff when its action matches that of the other agent
- Experimental setup:
  - Consider the case where the reciprocal agent i is unsure whether j is altruistic and believes that the public pot is likely to be half full
- Optimal behavior:
  - From this prior belief, i chooses to defect
  - On receiving an observation of plenty (PY), i decides to contribute, while an observation of meager (MR) makes it defect
  - With one action to go, i believes that j contributes, so it chooses to contribute too, to avoid punishment, regardless of its observations

[Figure: policy tree in which i defects (D) initially, contributes (C) after observing PY, defects (D) after observing MR, and contributes (C) at the final step]
25. Conclusion and Future Work

- I-DIDs: a general ID-based formalism for sequential decision-making in multiagent settings
  - Online counterparts of I-POMDPs
- Solving I-DIDs approximately for computational efficiency (see the AAAI 07 paper on model clustering)
- Apply I-DIDs to other application domains
- Visit our poster on I-DIDs today for more information
26. Thank You!