Title: A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games
1 A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games
- Enrique Munoz de Cote
- Michael L. Littman
2 Main Result
Concretely, we address the following computational problem:
- Given a repeated stochastic game, return a strategy profile that is a Nash equilibrium (specifically, one whose payoffs match the egalitarian point) of the average-payoff repeated stochastic game, in polynomial time.
[Figure: convex hull of the average payoffs, with the egalitarian line and the disagreement point (v1, v2)]
3 Framework
- Single agent, single state: decision theory, planning
- Single agent, multiple states: MDPs
- Multiple agents, single state: matrix games
- Multiple agents, multiple states: stochastic games
4 Stochastic Games (SG)
backgrounds
- Superset of MDPs and normal-form games (NFGs)
- S is the set of states
- T is the transition function, such that T(s, a1, a2) gives a probability distribution over next states (a sketch of the full tuple follows below)
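A minimal sketch of the two-player SG tuple this slide describes, assuming the usual formulation ⟨S, A1, A2, T, R1, R2⟩; the action sets and reward functions are not spelled out on the slide, so their names here are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = int
Action = int

@dataclass
class StochasticGame:
    """Two-player stochastic game: a superset of MDPs and NFGs."""
    states: List[State]                                    # S: set of states
    actions1: List[Action]                                 # A1: player 1's actions
    actions2: List[Action]                                 # A2: player 2's actions
    # T(s, a1, a2) -> probability distribution over next states
    transition: Callable[[State, Action, Action], Dict[State, float]]
    # R_i(s, a1, a2) -> immediate reward to player i
    reward1: Callable[[State, Action, Action], float]
    reward2: Callable[[State, Action, Action], float]
```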
5 A Computational Example: SG version of chicken
backgrounds
- actions: U, D, R, L, X
- coin flip on collision
- semi-walls (passable 50% of the time)
- collision: -5
- step cost: -1
- goal: +100
- discount factor: 0.95
- both players can reach the goal.
[Figure: grid-world SG of chicken, from Hu & Wellman, 2003]
6 Strategies on the SG of chicken
backgrounds
- Average expected rewards of various strategy profiles (discount factor 0.95):
- (88.3, 43.7)
- (43.7, 88.3)
- (66, 66)
- (43.7, 43.7)
- (38.7, 38.7)
- (83.6, 83.6)
7 Equilibrium values
backgrounds
- Average total reward at equilibrium
- Nash
- (88.3,43.7) very imbalanced, inefficient
- (43.7,88.3) very imbalanced, inefficient
- (53.6,53.6) ½ mix, still inefficient
- Correlated
- (43.7,88.3,43.7,88.3)
- Minimax
- (43.7,43.7)
- Friend
- (38.7,38.7)
Nash equilibria are computationally difficult to find in general
8 Repeated Games
backgrounds
What if players are allowed to play multiple times?
- Many more equilibrium alternatives (folk theorems)
- Equilibrium strategies
- Can depend on past interactions
- Can be randomized
- A Nash equilibrium still exists.
[Figure: convex hull of the average payoffs in the (v1, v2) plane]
9 Nash equilibrium of the repeated game
- Folk theorems. For any vector of average payoffs that is
- strictly enforceable and
- feasible,
- there exist equilibrium strategy profiles that achieve these payoffs
- Mutual advantage: strategies whose payoffs lie up and to the right of the disagreement point v = (v1, v2) (the two conditions are restated below)
- Threats: attack strategies against deviations
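A compact restatement of the two folk-theorem conditions named above, assuming the standard definitions (x is an average-payoff vector, v = (v1, v2) the disagreement point); this is a reading of the slide, not an addition to it:

```latex
\[
\underbrace{x \in \mathrm{conv}\{\text{average payoff vectors achievable in the SG}\}}_{\text{feasible}}
\quad\text{and}\quad
\underbrace{x_i > v_i \ \text{for each player } i}_{\text{strictly enforceable}}
\]
```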
10 Egalitarian equilibrium point
- Folk theorems' conceptual drawback: infinitely many feasible and enforceable strategies
- Egalitarian line: the line where the players' payoffs are equally high above v
- Egalitarian point P: maximizes the minimum advantage of the players' rewards (formalized briefly below)
[Figure: convex hull of the average payoffs, showing the egalitarian line through v = (v1, v2) and the egalitarian point P]
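A minimal formalization of the egalitarian point described above, assuming x ranges over the feasible and enforceable average-payoff vectors and v is the disagreement point:

```latex
\[
P \;=\; \arg\max_{x\ \text{feasible and enforceable}} \;\min\bigl(x_1 - v_1,\; x_2 - v_2\bigr)
\]
```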
11 How? (the short story version)
Repeated SG → Nash algorithm → result
- Compute attack and defense strategies.
- Solve two linear programming problems.
- The algorithm searches for the point P with the highest egalitarian value.
[Figure: convex hull of a hypothetical SG, with the egalitarian line and the point P]
12 Game representation
- Folk theorems can be interpreted computationally
- Matrix form [Littman & Stone, 2005]
- Stochastic game form [Munoz de Cote & Littman, 2008]
- Define a weighted combination value s_w
- A strategy profile π that achieves s_w(π) can be found by modeling an MDP (a plausible reading of s_w follows below)
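A plausible reading of the weighted combination value, assuming it mixes the two players' average payoffs U1 and U2 with a weight w in [0, 1] (the slide names s_w but does not spell out the formula):

```latex
\[
s_w(\pi) \;=\; w\,U_1(\pi) \;+\; (1-w)\,U_2(\pi),
\qquad
\pi_w \;=\; \arg\max_{\pi}\; s_w(\pi)
\]
```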
13 Markov Decision Processes
- We use MDPs to model the 2 players as a meta-player
- Return the joint strategy profile that maximizes a weighted combination of the players' payoffs (a construction sketch follows below)
- Friend solutions
- (R0, π1) ← MDP(1)
- (L0, π2) ← MDP(0)
- A weighted solution
- (P, π) ← MDP(w)
[Figure: convex hull with the points L0, P, R0 and the disagreement point v]
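A minimal sketch of the meta-player construction, assuming MDP(w) is the single-agent MDP over joint actions whose reward is the weighted combination w·R1 + (1 − w)·R2; it reuses the StochasticGame fields from the earlier sketch, and the function name is illustrative:

```python
from itertools import product

def joint_mdp(sg, w):
    """Fold a two-player stochastic game into a single-agent 'meta-player' MDP.

    The meta-player chooses joint actions (a1, a2); its reward is the
    weighted combination w * R1 + (1 - w) * R2, so solving this MDP yields
    a joint profile maximizing s_w (MDP(w) in the slides).
    """
    joint_actions = list(product(sg.actions1, sg.actions2))

    def transition(s, joint_a):
        a1, a2 = joint_a
        return sg.transition(s, a1, a2)          # same dynamics as the SG

    def reward(s, joint_a):
        a1, a2 = joint_a
        return w * sg.reward1(s, a1, a2) + (1 - w) * sg.reward2(s, a1, a2)

    return sg.states, joint_actions, transition, reward

# MDP(1) and MDP(0) recover the two 'friend' solutions (R0, pi1) and (L0, pi2);
# an intermediate w gives a weighted solution (P, pi).
```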
14 The algorithm
Repeated SG → Nash algorithm → result
FolkEgal(U1, U2, ε)
- Compute
- attack1, attack2,
- defense1, defense2, and
- Rfriend1, Lfriend2
- Find the egalitarian point and its strategy profile
- If R is left of the egalitarian line: P ← R
- Else if L is right of the egalitarian line: P ← L
- Else: P ← EgalSearch(L, R, T)
(A short sketch of this case analysis follows below.)
[Figure: convex hull of a hypothetical SG, with the egalitarian line and the candidate points L, PL, PR, R]
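A minimal sketch of the case analysis above, working directly on 2D payoff points; the geometric tests and the egal_search argument are illustrative stand-ins for the routines the slides only name:

```python
def folk_egal_cases(R, L, v, egal_search):
    """Outer case analysis of FolkEgal, following the slide.

    R, L, v are 2D payoff points: the two 'friend' solutions (from MDP(1)
    and MDP(0)) and the disagreement point. egal_search is the subroutine
    of the next slide. All names here are illustrative.
    """
    def advantage(x):
        # (player 1 advantage, player 2 advantage) over the disagreement point
        return x[0] - v[0], x[1] - v[1]

    a1_R, a2_R = advantage(R)
    a1_L, a2_L = advantage(L)

    # The egalitarian line is where both advantages are equal.
    if a2_R >= a1_R:            # R lies on/left of the egalitarian line: take it
        return R
    if a1_L >= a2_L:            # L lies on/right of the egalitarian line: take it
        return L
    return egal_search(L, R)    # otherwise the egalitarian point lies between L and R
```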
15 The key subroutine
EgalSearch(L, R, T)
- Finds the intersection between the convex hull of feasible payoffs and the egalitarian line
- Close to a binary search
- Input
- Point L (to the left of the egalitarian line)
- Point R (to the right of the egalitarian line)
- A bound T on the number of iterations
- Return
- The egalitarian point P (with accuracy ε)
- Each iteration solves an MDP(w), i.e., finds a strategy profile maximizing the weighted value s_w (a sketch follows below)
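A minimal sketch of the binary-search idea, assuming solve_mdp(w) returns the average payoff pair (u1, u2) of the joint policy maximizing s_w (as on slide 12); the names and the bookkeeping are illustrative, not the paper's exact routine:

```python
def egal_search(solve_mdp, v, T):
    """Binary search over the mixing weight w for the egalitarian point.

    solve_mdp(w) -> (u1, u2): average payoffs of the MDP(w) solution.
    v is the disagreement point. Each iteration solves one MDP(w) and
    narrows the bracket [lo, hi] around the weight whose solution crosses
    the egalitarian line (equal advantages over v).
    """
    lo, hi = 0.0, 1.0                    # w = 0 favors player 2 (L), w = 1 favors player 1 (R)
    best, best_adv = None, float("-inf")
    for _ in range(T):
        w = (lo + hi) / 2.0
        u1, u2 = solve_mdp(w)            # payoff pair of the MDP(w) solution
        adv1, adv2 = u1 - v[0], u2 - v[1]
        if min(adv1, adv2) > best_adv:   # keep the most egalitarian point seen so far
            best, best_adv = (u1, u2), min(adv1, adv2)
        if adv1 < adv2:                  # point lies left of the line: push w toward 1
            lo = w
        else:                            # point lies right of the line: push w toward 0
            hi = w
    return best
```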
16 Complexity
Repeated SG → Nash algorithm → result
- Disagreement point (accuracy ε): computed in time polynomial in 1 / (1 - γ), 1 / ε, and Umax
- MDPs are solved in polynomial time [Puterman, 1994]
- The algorithm is polynomial iff T is bounded by a polynomial.
Result
Running time: polynomial in the discount-factor term 1 / (1 - γ), the approximation factor 1 / ε, and the magnitude of the largest utility Umax.
17 SG version of the PD game
experiments
[Figure: grid-world SG version of the Prisoner's Dilemma, players A and B]
18 Compromise game
experiments
[Figure: grid-world compromise game, players A and B]
19 Asymmetric game
experiments
[Figure: grid-world asymmetric game, players A and B]
20 Thanks for your attention!