Title: Reasoning in Uncertain Adversarial Environments in Agent/Multiagent Systems
1. Reasoning in Uncertain Adversarial Environments in Agent/Multiagent Systems
- Praveen Paruchuri
- University of Southern California
- Guidance Committee: Milind Tambe, Leana Golubchik, Gaurav S. Sukhatme, Sarit Kraus, Stacy Marsella, Fernando Ordonez
2. Motivation: The Prediction Game
- A UAV (Unmanned Aerial Vehicle)
- Flies between the 4 regions
- Can you predict the UAV's flight patterns?
- Pattern 1
- 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
- Pattern 2
- 1, 4, 3, 1, 1, 4, 2, 4, 2, 3, 4, 3, ... (as generated by a 4-sided die)
- Can you predict pattern 2 even if the first 100 numbers are given?
- Randomization decreases predictability
- Increases security
3. Problem Definition
- Problem: Provide security for an agent/agent-team acting in uncertain, adversarial environments.
- Assumptions for the agent/agent-team:
- Acting in uncertain, adversarial environments
- Adversary is unobservable
- Adversary's actions, capabilities, or payoffs are unknown or difficult to model explicitly
- Assumptions for the adversary:
- Can see the agent's state or belief state
- Knows the agent's plan/policy
- Exploits the action predictability
4. Solution Technique
- Technique developed:
- Intentional policy randomization for security
- MDP/POMDP framework to handle sequential decision making under environment uncertainty
- MDP: Markov Decision Process
- POMDP: Partially Observable MDP
- Increase security → solve a multi-criteria problem for the agents:
- Maximize action unpredictability (policy randomization)
- Maintain reward above a threshold (quality constraints)
5. Domains
- Scheduled activities at airports, such as security checks, refueling, etc.
- Observed by terrorists
- Randomization of schedules is helpful
- UAV/UAV-team patrolling a humanitarian mission
- Adversary monitors the UAV schedule to disrupt the mission
- Can disrupt food supplies, harm refugees, shoot down UAVs, etc.
- Randomize the UAV patrol policy (more domains in the report)
6. My Contributions
- Two main contributions
- Single-agent case:
- Formulate as a non-linear program with an entropy-based metric
- Convert to a linear program called BRLP (Binary Search for Randomization LP)
- Randomize single-agent policies with reward > threshold
- Multi-agent case: RDR (Rolling Down Randomization)
- Randomized policies for decentralized POMDPs
- Threshold on team reward
7. Related Work
- Randomized policies in literature
- Stochastic Games
- Rock-Paper-Scissors game, Littman 1994
- CMDPs
- Constrained MDPs, Altman 1999
- Privacy
- Act on information while maintaining privacy, Otterloo 2005
- Security via randomization
- Patrol units for security, Carroll 2005
- Randomized patrol sentry vehicles, Lewis 2005
- Randomized Police Patrol, Billante 2003
8. Plan for Talk
- Achieved contribution versus expected contribution: increase security for an agent/agent team acting in uncertain, adversarial domains
- [Roadmap figure spanning no adversary model vs. partial adversary model and no communication vs. limited communication, covering the MDP-based single agent, the Dec-POMDP-based agent team, and the Dec-MDP-based agent team]
9. MDP-Based Single-Agent Case
- An MDP is a tuple ⟨S, A, P, R⟩
- S: set of states
- A: set of actions
- P: transition function
- R: reward function
- Basic terms used:
- x(s,a): expected number of times action a is taken in state s
- Policy (as a function of the MDP flows)
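As a brief sketch, assuming the standard linear-programming view of MDP flows, the policy induced by the flow variables x(s,a) can be written as:

```latex
% Policy induced by the flow variables x(s,a): the probability of taking action a
% in state s is that action's share of the total flow out of s.
\pi(a \mid s) \;=\; \frac{x(s,a)}{\sum_{a' \in A} x(s,a')}
```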
10. Entropy: Measure of Randomness
- Randomness or information content quantified using entropy (Shannon 1948)
- Entropy for an MDP:
- Additive entropy: add the entropies of each state
- Weighted entropy: weigh each state by its contribution to the total flow, where α_j is the initial flow of the system
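A plausible reconstruction of the two measures in terms of the flow variables x(s,a) and the initial flows α_j (the exact normalization used on the slide may differ):

```latex
% Additive entropy: sum over states of each state's action entropy
H_A(x) = -\sum_{s \in S} \sum_{a \in A}
          \frac{x(s,a)}{\sum_{a'} x(s,a')} \,
          \log \frac{x(s,a)}{\sum_{a'} x(s,a')}

% Weighted entropy: each state's entropy weighted by its share of the total flow,
% normalized by the total initial flow \sum_j \alpha_j
H_W(x) = -\sum_{s \in S} \frac{\sum_{a} x(s,a)}{\sum_{j} \alpha_j}
          \sum_{a \in A} \frac{x(s,a)}{\sum_{a'} x(s,a')} \,
          \log \frac{x(s,a)}{\sum_{a'} x(s,a')}
```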
11. Tradeoff: Reward vs. Entropy
- Non-linear program: maximize entropy with reward above a threshold
- Objective (entropy) is non-linear
- BRLP (Binary Search for Randomization LP)
- Linear program
- No explicit entropy calculation; entropy is controlled as a function of the flows
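A sketch of the multi-criteria problem in flow form, assuming a discounted MDP with initial flows α_j and a reward threshold E_min (the symbols and constraint form are illustrative):

```latex
\begin{aligned}
\max_{x \ge 0}\quad & H_W(x) && \text{(non-linear entropy objective)}\\
\text{s.t.}\quad & \sum_{a} x(j,a) - \gamma \sum_{s}\sum_{a} P(s,a,j)\, x(s,a) = \alpha_j
  && \forall j \in S \quad \text{(flow conservation)}\\
& \sum_{s}\sum_{a} R(s,a)\, x(s,a) \;\ge\; E_{\min} && \text{(reward threshold)}
\end{aligned}
```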
12. BRLP
- Given an input high-entropy policy and a target reward:
- Poly-time convergence to within a tolerance ε of the target reward
- Monotonicity: entropy decreases or stays constant with increasing reward
- Control through β:
- The input can be any high-entropy policy
- One such input is the uniform policy
- Equal probability for all actions out of all states
- The input policy controls the final policy structure
13. LP for Binary Search
- Policy expressed as a function of β and the input high-entropy flows
- Linear program solved at each step of the binary search
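A minimal, self-contained sketch of the BRLP idea, assuming a small discounted MDP given as NumPy arrays and using scipy's linprog as the LP solver; the constraint x(s,a) ≥ β·x̂(s,a) against the uniform-policy flows x̂, and the helper names, are illustrative choices rather than the thesis implementation:

```python
# Sketch of BRLP: binary search over beta in [0, 1]. For each beta, solve a
# max-reward LP whose flows are forced to stay >= beta times the flows of a
# high-entropy (uniform) input policy. beta = 0 recovers the deterministic
# max-reward policy; beta = 1 recovers the uniform input policy.
import numpy as np
from scipy.optimize import linprog

def uniform_policy_flows(P, alpha, gamma):
    """Flows x_hat(s,a) induced by the uniform policy (the high-entropy input).
    P: (S, A, S) transition array, alpha: (S,) initial flows (assumed > 0)."""
    S, A = P.shape[0], P.shape[1]
    P_pi = P.mean(axis=1)                       # P_pi[s, s'] = (1/A) * sum_a P[s, a, s']
    y = np.linalg.solve(np.eye(S) - gamma * P_pi.T, alpha)   # y = alpha + gamma * P_pi^T y
    return np.outer(y, np.ones(A)) / A          # split each state's flow evenly over actions

def solve_reward_lp(P, R, alpha, gamma, x_min):
    """Maximize expected reward over flows x(s,a) >= x_min(s,a), subject to
    the flow-conservation constraints of the discounted MDP."""
    S, A = P.shape[0], P.shape[1]
    c = -R.reshape(-1)                          # linprog minimizes, so negate the reward
    A_eq = np.zeros((S, S * A))
    for j in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[j, s * A + a] = (1.0 if s == j else 0.0) - gamma * P[s, a, j]
    bounds = [(lo, None) for lo in x_min.reshape(-1)]
    res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=bounds, method="highs")
    x = res.x.reshape(S, A)
    return x, float(R.reshape(-1) @ res.x)

def brlp(P, R, alpha, gamma, target_reward, tol=1e-3, iters=30):
    """Binary search on beta: keep the largest beta (most randomization) whose
    expected reward is still within tol of the target reward."""
    x_hat = uniform_policy_flows(P, alpha, gamma)
    x_best, _ = solve_reward_lp(P, R, alpha, gamma, np.zeros_like(x_hat))
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        beta = 0.5 * (lo + hi)
        x, reward = solve_reward_lp(P, R, alpha, gamma, beta * x_hat)
        if reward >= target_reward - tol:
            lo, x_best = beta, x                # reward still acceptable: randomize more
        else:
            hi = beta                           # too much reward lost: back off
    return x_best / x_best.sum(axis=1, keepdims=True)   # flows -> randomized policy
```

For example, `brlp(P, R, alpha, gamma=0.95, target_reward=0.9 * max_reward)` would return a randomized policy that gives up at most roughly 10% of the maximum expected reward, under the assumptions above.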
14. BRLP in Action
- [Figure: binary search over β on [0, 1]; β = 1 corresponds to the max-entropy input policy, β = 0 to the deterministic max-reward policy, and the search (β = 0.5 shown) converges on the β whose reward meets the target reward]
15. Results (Averaged over 10 MDPs)
- For a given reward threshold:
- Highest entropy: the Expected Entropy method, with a 10% average gain over BRLP
- Fastest: BRLP, with a 7-fold average speedup over the Expected Entropy method
- These results are statistically significant (t-tests performed)
16. Multi-Agent Case: Problem
- Maximize entropy for agent teams subject to a reward threshold
- For the agent team:
- Decentralized POMDP (Dec-POMDP) framework used
- Agents know the initial joint belief state
- No communication possible between agents
- For the adversary:
- Can calculate the agents' belief state
- Knows the agents' policy
- Exploits the action predictability
- For a Dec-POMDP:
- A deterministic policy maps each observation history to an action
- A randomized policy maps each observation history to a probability distribution over actions (see the sketch below)
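As a brief sketch, writing Ω_i* for agent i's observation histories and A_i for its actions (the notation is illustrative):

```latex
% Deterministic Dec-POMDP policy: each observation history is mapped to one action
\pi_i^{\text{det}} : \Omega_i^{*} \rightarrow A_i

% Randomized Dec-POMDP policy: each observation history is mapped to a
% probability distribution over actions
\pi_i^{\text{rand}} : \Omega_i^{*} \times A_i \rightarrow [0,1],
\qquad \sum_{a \in A_i} \pi_i^{\text{rand}}(\vec{\omega}, a) = 1
```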
17. Policy Trees: Deterministic vs. Randomized
- [Figure: a deterministic policy tree shown beside a randomized policy tree]
18. RDR: Rolling Down Randomization
- Input:
- Best (local or global) deterministic policy
- Allowed percentage of reward loss
- d parameter: number of turns each agent gets
- Example: d = 0.5 → number of steps = 1/d = 2
- Each agent gets one turn (for the 2-agent case)
- A single-agent MDP problem is solved at each step
- For agent 1's turn:
- Fix the policies of the other agents (agent 2)
- Find a randomized policy that
- Maximizes the joint entropy (w1 · Entropy(agent 1) + w2 · Entropy(agent 2))
- Maintains the joint reward above the threshold
19. RDR with d = 0.5
- [Figure: the reward threshold rolls down from the max reward in two turns]
- Agent 1's turn: maximize joint entropy subject to joint reward ≥ 90% of the max reward
- Agent 2's turn: maximize joint entropy subject to joint reward ≥ 80% of the max reward
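One way to read the schedule, reconstructed from the 90%/80% example with ρ denoting the total allowed fractional reward loss (the symbols are illustrative):

```latex
% Reward threshold after the k-th turn, rolling down from the max joint reward R_max
T_k \;=\; R_{\max}\bigl(1 - k\,d\,\rho\bigr), \qquad k = 1, \dots, 1/d
% Example: d = 0.5,\ \rho = 0.2 \;\Rightarrow\; T_1 = 0.9\,R_{\max},\ T_2 = 0.8\,R_{\max}
```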
20. RDR Details
- For a single agent, the belief state is a sufficient statistic
- Not sufficient for the multi-agent case
- For the multi-agent case:
- With deterministic policies (other agents' policies fixed): reason about the current world state and the observation histories of the other agents
- With randomized policies: reason about the current world state and the action and observation histories of the other agents
- Define an extended state that pairs the world state with the other agents' histories
- The joint belief state is a probability distribution over extended states
21. New Transition and Observation Functions
- Transition function for the deterministic case
- Transition function for RDR (randomized case)
- The observation function remains the same
22. Belief Update Rule
- Original belief update rule
- New belief update rule (over extended states)
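For reference, a sketch of the standard single-agent POMDP belief update that the original rule corresponds to; the new rule follows the same pattern with world states replaced by extended states and the other agents' fixed policies folded into the transition and observation terms (the exact notation is an assumption):

```latex
% Standard POMDP belief update after taking action a and observing \omega
b'(s') \;=\; \eta \; O(s', a, \omega) \sum_{s \in S} P(s' \mid s, a)\, b(s),
\qquad \eta = \frac{1}{\Pr(\omega \mid a, b)}
```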
23. Experimental Results: Reward Threshold vs. Weighted Entropy (Averaged over 10 Instances)
24. Future Work: Dec-MDPs with Bandwidth Constraints
- Agent teams can communicate
- Limited bandwidth assumed
- Bandwidth modeled as a shared resource constraint
- Typical policy randomization formulation
25. Dec-POMDPs with Bandwidth Constraints
- RDR assumes no communication
- Communication in Dec-POMDPs:
- With deterministic policies (Nair et al., 2004):
- Communicate observation histories
- Compress belief states and increase expected rewards
- With randomized policies:
- Can communicate observation histories
- Or communicate the action taken
- Optimization problem: find the optimal allocation of bandwidth between observation histories and actions such that expected reward is maximized
26. Incorporating Adversary Models
- Real-world situation:
- Known: the adversary cannot target the leftmost region of the mission
- No UAV patrolling needed there
- Such prior knowledge should be incorporated while randomizing the policy
- Partial adversary models: a standard framework needs to be developed
- Plan optimally for the modeled part of the adversary
- Plan for the unmodeled part as well
27. Timeline
- Proposal: March 2006
- Dec-MDP/POMDP with communication constraints: April 2006
- Experimental evaluation: June 2006
- Incorporating adversarial models: August 2006
- Experimental evaluation: November 2006
- Thesis writing: February 2007
- Defense: March 2007
28. Other Work
- Self-interest vs Team-interest
- Electric Elves Domain
- Teamwork with resource constraints
- Developed the EMTDP framework
- Maximize expected team reward while bounding expected resource consumption
- CRLP (Convex Combination for Randomization) algorithm
- Heuristic algorithm for single-agent policy randomization
- Communication issues in Dec-MDPs
- Communication increases security
29. Summary
- Intentional randomization as main focus
- Single agent case
- BRLP algorithm introduced
- Multi agent case
- RDR algorithm introduced
- Multi-criteria problem solved that
- Maximizes entropy
- Maintains reward > threshold
30. Thank You
- Any comments/questions?