Title: Reasoning in Uncertain Adversarial Environments in Agent/Multiagent Systems
1. Reasoning in Uncertain Adversarial Environments in Agent/Multiagent Systems
- Praveen Paruchuri
- University of Southern California
- Guidance Committee: Milind Tambe, Leana Golubchik, Gaurav S. Sukhatme, Sarit Kraus, Stacy Marsella, Fernando Ordonez
2. Motivation: The Prediction Game
- A UAV (Unmanned Aerial Vehicle)
- Flies between the 4 regions
- Can you predict the UAV's flight patterns?
- Pattern 1
- 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
- Pattern 2
- 1, 4, 3, 1, 1, 4, 2, 4, 2, 3, 4, 3, ... (as generated by a 4-sided die)
- Can you predict pattern 2 even if the first 100 numbers are given?
- Randomization decreases predictability
- Increases security
3. Problem Definition
- Problem: Provide security for an agent/agent-team acting in uncertain, adversarial environments.
- Assumptions for the agent/agent-team:
- Acting in uncertain, adversarial environments
- Adversary is unobservable
- Adversary's actions, capabilities, or payoffs are unknown or difficult to model explicitly
- Assumptions for the adversary:
- Can see the agent's state or belief state
- Knows the agent's plan/policy
- Exploits the action predictability
4. Solution Technique
- Technique developed:
- Intentional policy randomization for security
- MDP/POMDP framework to handle sequential decision making under environment uncertainty
- MDP: Markov Decision Process
- POMDP: Partially Observable MDP
- Increase security → solve a multi-criteria problem for the agents:
- Maximize action unpredictability (policy randomization)
- Maintain reward above a threshold (quality constraints)
5. Domains
- Scheduled activities at airports, such as security checks, refueling, etc.
- Observed by terrorists
- Randomization of schedules is helpful
- UAV/UAV-team patrolling a humanitarian mission
- Adversary monitors the UAV schedule to disrupt the mission
- Can disrupt food supplies, harm refugees, shoot down UAVs, etc.
- Randomize the UAV patrol policy (more domains in the report)
6. My Contributions
- Two main contributions
- Single-agent case:
- Formulate as a non-linear program with an entropy-based metric
- Convert to a linear program called BRLP (Binary Search for Randomization LP)
- Randomize single-agent policies with reward > threshold
- Multi-agent case: RDR (Rolling Down Randomization)
- Randomized policies for decentralized POMDPs
- Threshold on team reward
7. Related Work
- Randomized policies in literature
- Stochastic Games
- Rock-Paper-Scissors game, Littman 1994
- CMDPs
- Constrained MDPs, Altman 1999
- Privacy
- Act on information while maintaining privacy, Otterloo 2005
- Security via randomization
- Patrol units for security, Carroll 2005
- Randomized patrol sentry vehicles, Lewis 2005
- Randomized Police Patrol, Billante 2003
8. Plan for Talk
- Achieved contribution versus expected contribution: increase security for an agent/agent team acting in uncertain, adversarial domains
- [Roadmap figure spanning no adversary model vs. partial adversary model and no communication vs. limited communication, covering the MDP-based single agent, the Dec-POMDP-based agent team, and the Dec-MDP-based agent team]
9. MDP-Based Single-Agent Case
- An MDP is a tuple ⟨S, A, P, R⟩
- S: set of states
- A: set of actions
- P: transition function
- R: reward function
- Basic terms used:
- x(s,a): expected number of times action a is taken in state s
- Policy (as a function of the MDP flows)
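As a brief sketch, assuming the standard linear-programming view of MDP flows, the policy induced by the flow variables x(s,a) can be written as:

```latex
% Policy induced by the flow variables x(s,a): the probability of taking action a
% in state s is that action's share of the total flow out of s.
\pi(a \mid s) \;=\; \frac{x(s,a)}{\sum_{a' \in A} x(s,a')}
```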
10. Entropy: Measure of Randomness
- Randomness or information content quantified using entropy (Shannon 1948)
- Entropy for an MDP:
- Additive entropy: add the entropies of each state
- Weighted entropy: weigh each state by its contribution to the total flow, where α_j is the initial flow of the system
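A plausible reconstruction of the two measures in terms of the flow variables x(s,a) and the initial flows α_j (the exact normalization used on the slide may differ):

```latex
% Additive entropy: sum over states of each state's action entropy
H_A(x) = -\sum_{s \in S} \sum_{a \in A}
          \frac{x(s,a)}{\sum_{a'} x(s,a')} \,
          \log \frac{x(s,a)}{\sum_{a'} x(s,a')}

% Weighted entropy: each state's entropy weighted by its share of the total flow,
% normalized by the total initial flow \sum_j \alpha_j
H_W(x) = -\sum_{s \in S} \frac{\sum_{a} x(s,a)}{\sum_{j} \alpha_j}
          \sum_{a \in A} \frac{x(s,a)}{\sum_{a'} x(s,a')} \,
          \log \frac{x(s,a)}{\sum_{a'} x(s,a')}
```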
11. Tradeoff: Reward vs. Entropy
- Non-linear program: maximize entropy with reward above a threshold
- Objective (entropy) is non-linear
- BRLP (Binary Search for Randomization LP)
- Linear program
- No explicit entropy calculation; entropy is controlled as a function of the flows
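A sketch of the multi-criteria problem in flow form, assuming a discounted MDP with initial flows α_j and a reward threshold E_min (the symbols and constraint form are illustrative):

```latex
\begin{aligned}
\max_{x \ge 0}\quad & H_W(x) && \text{(non-linear entropy objective)}\\
\text{s.t.}\quad & \sum_{a} x(j,a) - \gamma \sum_{s}\sum_{a} P(s,a,j)\, x(s,a) = \alpha_j
  && \forall j \in S \quad \text{(flow conservation)}\\
& \sum_{s}\sum_{a} R(s,a)\, x(s,a) \;\ge\; E_{\min} && \text{(reward threshold)}
\end{aligned}
```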
12. BRLP
- Given an input high-entropy policy and a target reward:
- Poly-time convergence to within a tolerance ε of the target reward
- Monotonicity: entropy decreases or stays constant with increasing reward
- Control through β:
- The input can be any high-entropy policy
- One such input is the uniform policy
- Equal probability for all actions out of all states
- The input policy controls the final policy structure
13. LP for Binary Search
- Policy expressed as a function of β and the input high-entropy flows
- Linear program solved at each step of the binary search
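A minimal, self-contained sketch of the BRLP idea, assuming a small discounted MDP given as NumPy arrays and using scipy's linprog as the LP solver; the constraint x(s,a) ≥ β·x̂(s,a) against the uniform-policy flows x̂, and the helper names, are illustrative choices rather than the thesis implementation:

```python
# Sketch of BRLP: binary search over beta in [0, 1]. For each beta, solve a
# max-reward LP whose flows are forced to stay >= beta times the flows of a
# high-entropy (uniform) input policy. beta = 0 recovers the deterministic
# max-reward policy; beta = 1 recovers the uniform input policy.
import numpy as np
from scipy.optimize import linprog

def uniform_policy_flows(P, alpha, gamma):
    """Flows x_hat(s,a) induced by the uniform policy (the high-entropy input).
    P: (S, A, S) transition array, alpha: (S,) initial flows (assumed > 0)."""
    S, A = P.shape[0], P.shape[1]
    P_pi = P.mean(axis=1)                       # P_pi[s, s'] = (1/A) * sum_a P[s, a, s']
    y = np.linalg.solve(np.eye(S) - gamma * P_pi.T, alpha)   # y = alpha + gamma * P_pi^T y
    return np.outer(y, np.ones(A)) / A          # split each state's flow evenly over actions

def solve_reward_lp(P, R, alpha, gamma, x_min):
    """Maximize expected reward over flows x(s,a) >= x_min(s,a), subject to
    the flow-conservation constraints of the discounted MDP."""
    S, A = P.shape[0], P.shape[1]
    c = -R.reshape(-1)                          # linprog minimizes, so negate the reward
    A_eq = np.zeros((S, S * A))
    for j in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[j, s * A + a] = (1.0 if s == j else 0.0) - gamma * P[s, a, j]
    bounds = [(lo, None) for lo in x_min.reshape(-1)]
    res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=bounds, method="highs")
    x = res.x.reshape(S, A)
    return x, float(R.reshape(-1) @ res.x)

def brlp(P, R, alpha, gamma, target_reward, tol=1e-3, iters=30):
    """Binary search on beta: keep the largest beta (most randomization) whose
    expected reward is still within tol of the target reward."""
    x_hat = uniform_policy_flows(P, alpha, gamma)
    x_best, _ = solve_reward_lp(P, R, alpha, gamma, np.zeros_like(x_hat))
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        beta = 0.5 * (lo + hi)
        x, reward = solve_reward_lp(P, R, alpha, gamma, beta * x_hat)
        if reward >= target_reward - tol:
            lo, x_best = beta, x                # reward still acceptable: randomize more
        else:
            hi = beta                           # too much reward lost: back off
    return x_best / x_best.sum(axis=1, keepdims=True)   # flows -> randomized policy
```

For example, `brlp(P, R, alpha, gamma=0.95, target_reward=0.9 * max_reward)` would return a randomized policy that gives up at most roughly 10% of the maximum expected reward, under the assumptions above.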
14. BRLP in Action
- [Figure: binary search over β on [0, 1]; β = 1 corresponds to the max-entropy input policy, β = 0 to the deterministic max-reward policy, and the search (β = 0.5 shown) converges on the β whose reward meets the target reward]
15. Results (Averaged over 10 MDPs)
- For a given reward threshold:
- Highest entropy: the Expected Entropy method, with a 10% average gain over BRLP
- Fastest: BRLP, with a 7-fold average speedup over the Expected Entropy method
- These results are statistically significant (t-tests performed)
16. Multi-Agent Case: Problem
- Maximize entropy for agent teams subject to a reward threshold
- For the agent team:
- Decentralized POMDP (Dec-POMDP) framework used
- Agents know the initial joint belief state
- No communication possible between agents
- For the adversary:
- Can calculate the agents' belief state
- Knows the agents' policy
- Exploits the action predictability
- For a Dec-POMDP:
- A deterministic policy maps each observation history to an action
- A randomized policy maps each observation history to a probability distribution over actions (see the sketch below)
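As a brief sketch, writing Ω_i* for agent i's observation histories and A_i for its actions (the notation is illustrative):

```latex
% Deterministic Dec-POMDP policy: each observation history is mapped to one action
\pi_i^{\text{det}} : \Omega_i^{*} \rightarrow A_i

% Randomized Dec-POMDP policy: each observation history is mapped to a
% probability distribution over actions
\pi_i^{\text{rand}} : \Omega_i^{*} \times A_i \rightarrow [0,1],
\qquad \sum_{a \in A_i} \pi_i^{\text{rand}}(\vec{\omega}, a) = 1
```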
17. Policy Trees: Deterministic vs. Randomized
- [Figure: a deterministic policy tree shown beside a randomized policy tree]
18. RDR: Rolling Down Randomization
- Input:
- Best (local or global) deterministic policy
- Allowed percentage of reward loss
- d parameter: number of turns each agent gets
- Example: d = 0.5 → number of steps = 1/d = 2
- Each agent gets one turn (for the 2-agent case)
- A single-agent MDP problem is solved at each step
- For agent 1's turn:
- Fix the policies of the other agents (agent 2)
- Find a randomized policy that
- Maximizes the joint entropy (w1 · Entropy(agent 1) + w2 · Entropy(agent 2))
- Maintains the joint reward above the threshold
19. RDR with d = 0.5
- [Figure: the reward threshold rolls down from the max reward in two turns]
- Agent 1's turn: maximize joint entropy subject to joint reward ≥ 90% of the max reward
- Agent 2's turn: maximize joint entropy subject to joint reward ≥ 80% of the max reward
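One way to read the schedule, reconstructed from the 90%/80% example with ρ denoting the total allowed fractional reward loss (the symbols are illustrative):

```latex
% Reward threshold after the k-th turn, rolling down from the max joint reward R_max
T_k \;=\; R_{\max}\bigl(1 - k\,d\,\rho\bigr), \qquad k = 1, \dots, 1/d
% Example: d = 0.5,\ \rho = 0.2 \;\Rightarrow\; T_1 = 0.9\,R_{\max},\ T_2 = 0.8\,R_{\max}
```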
20. RDR Details
- For a single agent, the belief state is a sufficient statistic
- Not sufficient for the multi-agent case
- For the multi-agent case:
- With deterministic policies (other agents' policies fixed): reason about the current world state and the observation histories of the other agents
- With randomized policies: reason about the current world state and the action and observation histories of the other agents
- Define an extended state that pairs the world state with the other agents' histories
- The joint belief state is a probability distribution over extended states
21. New Transition and Observation Functions
- Transition function for the deterministic case
- Transition function for RDR (randomized case)
- The observation function remains the same
22. Belief Update Rule
- Original belief update rule
- New belief update rule (over extended states)
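For reference, a sketch of the standard single-agent POMDP belief update that the original rule corresponds to; the new rule follows the same pattern with world states replaced by extended states and the other agents' fixed policies folded into the transition and observation terms (the exact notation is an assumption):

```latex
% Standard POMDP belief update after taking action a and observing \omega
b'(s') \;=\; \eta \; O(s', a, \omega) \sum_{s \in S} P(s' \mid s, a)\, b(s),
\qquad \eta = \frac{1}{\Pr(\omega \mid a, b)}
```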
23. Experimental Results: Reward Threshold vs. Weighted Entropy (Averaged over 10 Instances)
24. Future Work: Dec-MDPs with Bandwidth Constraints
- Agent teams can communicate
- Limited bandwidth assumed
- Bandwidth modeled as a shared resource constraint
- Typical policy randomization formulation
25. Dec-POMDPs with Bandwidth Constraints
- RDR assumes no communication
- Communication in Dec-POMDPs:
- With deterministic policies (Nair et al., 2004):
- Communicate observation histories
- Compress belief states and increase expected rewards
- With randomized policies:
- Can communicate observation histories
- Or communicate the action taken
- Optimization problem: find the optimal allocation of bandwidth between observation histories and actions such that expected reward is maximized
26. Incorporating Adversary Models
- Real-world situation:
- Known: the adversary cannot target the leftmost region of the mission
- No UAV patrolling needed there
- Such prior knowledge should be incorporated while randomizing the policy
- Partial adversary models: a standard framework needs to be developed
- Plan optimally for the modeled part of the adversary
- Plan for the unmodeled part as well
27. Timeline
- Proposal: March 2006
- Dec-MDP/POMDP with communication constraints: April 2006
- Experimental evaluation: June 2006
- Incorporating adversarial models: August 2006
- Experimental evaluation: November 2006
- Thesis writing: February 2007
- Defense: March 2007
28. Other Work
- Self-interest vs Team-interest
- Electric Elves Domain
- Teamwork with resource constraints
- Developed the EMTDP framework
- Maximize expected team reward while bounding expected resource consumption
- CRLP (Convex Combination for Randomization) algorithm
- Heuristic algorithm for single-agent policy randomization
- Communication issues in Dec-MDPs
- Communication increases security
29. Summary
- Intentional randomization as main focus
- Single agent case
- BRLP algorithm introduced
- Multi agent case
- RDR algorithm introduced
- Multi-criteria problem solved that
- Maximizes entropy
- Maintains reward > threshold
30. Thank You
- Any comments/questions?