Title: Introduction to Reinforcement Learning
1. Introduction to Reinforcement Learning
- Gerry Tesauro
- IBM T.J. Watson Research Center
- http://www.research.ibm.com/infoecon
- http://www.research.ibm.com/massdist
2. Outline
- Statement of the problem
- What RL is all about
- How it's different from supervised learning
- Mathematical foundations
- Markov Decision Problem (MDP) framework
- Dynamic Programming: value iteration, ...
- Temporal Difference (TD) and Q-Learning
- Applications: combining RL and function approximation
3. Acknowledgement
- Lecture material shamelessly adapted from R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998
- Available on the web at RichSutton.com
- Many slides shamelessly stolen from the web site
4. Basic RL Framework
- 1. Learning with evaluative feedback
- Learner's output is scored by a scalar signal (reward or payoff function) saying how well it did
- Supervised learning: the learner is told the correct answer!
- May need to try different outputs just to see how well they score (exploration)
7. Basic RL Framework
- 2. Learning to act: learning to manipulate the environment
- Supervised learning is passive: the learner doesn't affect the distribution of exemplars or the class labels
9. Basic RL Framework
- Learner has to figure out which action is best, and which actions lead to which states. Might have to try all actions!
- Exploration vs. Exploitation: when to try a "wrong" action vs. sticking to the best action
10. Basic RL Framework
- 3. Learning through time
- Reward is delayed (act now, reap the reward later)
- Agent may take a long sequence of actions before receiving reward
- Temporal Credit Assignment Problem: given a sequence of actions and rewards, how to assign credit/blame to each action?
14.
- Agent's objective is to maximize the expected value of the return R_t, the sum of (discounted) future rewards
- γ is a discount parameter (0 ≤ γ ≤ 1)
- Example: cart-pole balancing problem
- reward = -1 at failure, else 0
- expected return = -γ^k for k steps to failure
- return is maximized by making k → ∞
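The return formula itself did not survive the slide-to-text conversion; the standard definition of the discounted return (Sutton & Barto notation) is:

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

For the cart-pole example, a single reward of -1 received k steps in the future contributes -γ^k (up to the indexing convention), so maximizing the return means postponing failure as long as possible.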
15.
- We consider non-deterministic environments
- Action a_t in state s_t leads to:
- a probability distribution of rewards r_{t+1}
- a probability distribution of new states s_{t+1}
- Some environments have a nice property: the distributions are history-independent and stationary. These are called Markov environments, and the agent's task is a Markov Decision Problem (MDP)
16.
- An MDP specification consists of:
- a list of states s ∈ S
- a list of legal actions A(s) for every s
- a set of transition probabilities for every (s, a, s')
- a set of expected rewards for every (s, a, s')
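As an illustration of the four ingredients above, here is a minimal Python sketch of an MDP specification; the class and field names are hypothetical, not from the lecture.

    # Minimal illustrative encoding of an MDP specification (names are hypothetical).
    from typing import Dict, List, Tuple

    class MDP:
        def __init__(self,
                     states: List[str],
                     actions: Dict[str, List[str]],                 # A(s): legal actions per state
                     P: Dict[Tuple[str, str, str], float],          # P[(s, a, s')] = transition probability
                     R: Dict[Tuple[str, str, str], float],          # R[(s, a, s')] = expected reward
                     gamma: float = 0.9):                           # discount parameter
            self.states, self.actions, self.P, self.R, self.gamma = states, actions, P, R, gamma

    # Example: a trivial 2-state MDP
    mdp = MDP(
        states=["s0", "s1"],
        actions={"s0": ["stay", "go"], "s1": ["stay"]},
        P={("s0", "stay", "s0"): 1.0, ("s0", "go", "s1"): 1.0, ("s1", "stay", "s1"): 1.0},
        R={("s0", "stay", "s0"): 0.0, ("s0", "go", "s1"): 1.0, ("s1", "stay", "s1"): 0.0},
    )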
17.
- Given an MDP specification, the agent learns a policy π
- deterministic policy π(s): the action to take in state s
- non-deterministic policy π(s,a): the probability of choosing action a in state s
- Agent's objective is to learn the policy that maximizes the expected value of the return R_t
- The value function associated with a policy tells us how good the policy is. Two types of value functions ...
18.
- State-Value Function V^π(s): expected return starting in state s and following policy π
- Action-Value Function Q^π(s,a): expected return starting from action a in state s, and then following policy π
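The defining equations were images on the original slides; the standard definitions are:

    V^{\pi}(s)   = E_{\pi}\left[ R_t \mid s_t = s \right] = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]
    Q^{\pi}(s,a) = E_{\pi}\left[ R_t \mid s_t = s,\, a_t = a \right]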
19. Bellman Equation for a Policy π
- The basic idea (standard form reconstructed below)
- Apply the expectation for state s under policy π
- A linear system of equations for V^π, with a unique solution
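The equation itself is missing from the converted slides; the standard Bellman equation for V^π, with transition probabilities and expected rewards as in slide 16, is:

    V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]

One such equation per state s gives the linear system mentioned above.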
22. Why V and Q are useful
- Any policy π that is greedy w.r.t. V* or Q* is an optimal policy π*.
- One-step lookahead using V*
- Zero-step lookahead using Q*
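A reconstruction of the two lookahead rules above, in standard notation (the original formulas were images):

    \pi^{*}(s) = \arg\max_{a} \sum_{s'} P^{a}_{ss'}\left[ R^{a}_{ss'} + \gamma V^{*}(s') \right]   % one-step lookahead with V*
    \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)                                                           % zero-step lookahead with Q*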
23. Two methods to solve for V*, Q*
- Policy Improvement: given a policy π, find a better policy π'.
- Policy Iteration: keep repeating the above and ultimately you will get to π*.
- Value Iteration: directly solve Bellman's optimality equation, without explicitly writing down the policy.
24. Policy Improvement
- Evaluate the policy: given π, compute V^π(s) and Q^π(s,a) (from the linear Bellman equations).
- For every state s, construct a new policy: do the best initial action, and then follow policy π thereafter.
- The new policy π' is greedy w.r.t. Q^π(s,a) and V^π(s)
- ⇒ V^π'(s) ≥ V^π(s)
- ⇒ π' ≥ π in our partial ordering.
25. Policy Improvement, cont'd.
- What if the new policy has the same value as the old policy? (V^π'(s) = V^π(s) for all s)
- But this is the Bellman Optimality equation: if V^π solves it, then it must be the optimal value function V*.
27. Value Iteration
- Use the Bellman Optimality equation to define an iterative bootstrap calculation (sketched below)
- This is guaranteed to converge to a unique V* (the backup is a contraction mapping)
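A minimal Python sketch of the value-iteration backup described above, reusing the illustrative MDP encoding from the earlier sketch; it is an assumption-laden illustration, not the lecture's own code.

    # Illustrative value iteration: repeatedly apply the Bellman optimality backup.
    def value_iteration(mdp, theta=1e-6):
        V = {s: 0.0 for s in mdp.states}
        while True:
            delta = 0.0
            for s in mdp.states:
                # max over actions of expected immediate reward + discounted next-state value
                q_values = [sum(mdp.P.get((s, a, s2), 0.0) *
                                (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
                                for s2 in mdp.states)
                            for a in mdp.actions[s]]
                v_new = max(q_values)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:          # stop when the backup changes no state by more than theta
                return V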
28. Summary of DP methods
- Guaranteed to converge to π* in polynomial time (in the size of the state space); in practice often faster than linear
- The method of choice if you can do it.
- Why it might not be doable:
- your problem is not an MDP
- the transition probabilities and rewards are unknown or too hard to specify
- Bellman's "curse of dimensionality": the state space is too big (>> O(10^6) states)
- RL may be useful in these cases
29. Monte Carlo Methods
- Estimate V^π(s) by sampling
- perform a trial: run the policy starting from s until a termination state is reached; measure the actual return R_t
- N trials: the average R_t is accurate to ~1/sqrt(N)
- no bootstrapping: not using V(s') to estimate V(s)
- Two important advantages of Monte Carlo:
- Can learn online, without a model of the environment
- Can learn in a simulated environment
31. Temporal Difference Learning
- Error signal: the difference between the current estimate and an improved estimate drives the change of the current estimate
- Supervised learning error:
- error(x) = target_output(x) - learner_output(x)
- Bellman error (DP):
- 1-step full-width lookahead - 0-step lookahead
- Monte Carlo error:
- error(s) = <R_t> - V(s)
- many-step sample lookahead - 0-step lookahead
32. TD error signal
- Temporal Difference Error Signal: take one step using the current policy, observe r and s', then form the error
- 1-step sample lookahead - 0-step lookahead
- In particular, for undiscounted sequences with no intermediate rewards, the error is simply the difference between successive predictions (see below)
- Self-consistent prediction goal: predicted returns should be self-consistent from one time step to the next (true of both TD and DP)
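The error-signal equations were images on the original slides; the standard one-step TD error is:

    \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

which, for undiscounted sequences with no intermediate rewards, reduces to \delta_t = V(s_{t+1}) - V(s_t).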
33.
- Learning using the error signal: we could just do a reassignment of V(s) to the improved estimate
- But it's often a good idea to learn incrementally (update sketched below)
- where α is a small learning-rate parameter (either constant, or decreasing with time)
- the above algorithm is known as TD(0); convergence to be discussed later...
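A minimal sketch of one episode of the incremental TD(0) update described above; the env and policy interfaces are hypothetical, and V is assumed to be a dict from states to value estimates.

    # Illustrative TD(0) update over one episode.
    def td0_episode(env, policy, V, alpha=0.1, gamma=1.0):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]
            s = s_next
        return V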
34. Advantages of TD Learning
- Combines the bootstrapping (1-step self-consistency) idea of DP with the sampling idea of MC; maybe the best of both worlds
- Like MC, doesn't need a model of the environment, only experience
- TD, but not MC, can be fully incremental
- you can learn before knowing the final outcome
- you can learn without the final outcome (from incomplete sequences)
- Bootstrapping ⇒ TD has reduced variance compared to Monte Carlo, but possibly greater bias
39. The point of the λ parameter
- (My view) λ in TD(λ) is a knob to twiddle: it provides a smooth interpolation between λ = 0 (pure TD) and λ = 1 (pure MC)
- For many toy grid-world type problems, one can show that intermediate values of λ work best.
- For real-world problems, the best λ will be highly problem-dependent.
40. Convergence of TD(λ)
- TD(λ) converges to the correct value function V^π(s) with probability 1, for all λ. Requires:
- lookup table representation (V(s) is a table),
- must visit all states an infinite number of times,
- a certain schedule for decreasing α(t) (usually α(t) ~ 1/t)
- BUT: TD(λ) converges only for a fixed policy. What if we want to learn π as well as V? We still have more work to do ...
41. Q-Learning: the TD Idea Used to Learn π*
- Q-Learning (Watkins, 1989): a one-step sample backup to learn the action-value function Q(s,a). The most important RL algorithm in use today.
- Uses a one-step error to define an incremental learning algorithm (sketched below)
- where α(t) follows the same schedule as in the TD algorithm.
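A minimal sketch of tabular Q-learning with ε-greedy exploration, matching the one-step sample backup described above; Q is assumed to be a collections.defaultdict(float) keyed by (state, action), and env / actions are hypothetical interfaces.

    import random

    # Illustrative tabular Q-learning over one episode.
    def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (any sufficiently exploratory policy works)
            if random.random() < eps:
                a = random.choice(actions(s))
            else:
                a = max(actions(s), key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # one-step sample backup toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
        return Q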
42. Nice properties of Q-learning
- Q is guaranteed to converge to Q* with probability 1.
- The greedy policy w.r.t. Q is then guaranteed to converge to π*.
- But (amazingly), you don't need to follow a fixed policy, or the greedy policy, during learning! Virtually any policy will do, as long as all (s,a) pairs are visited infinitely often.
- As with TD, you don't need a model; Q-learning can learn online, and it both bootstraps and samples.
43. RL and Function Approximation
- DP is infeasible for many real applications due to the curse of dimensionality: |S| is too big.
- FA may provide a way to lift the curse:
- the complexity D of the FA needed to capture the regularity in the environment may be << |S|.
- no need to sweep through the entire state space: train on N plausible samples and then generalize to similar samples drawn from the same distribution.
- PAC learning tells us the generalization error ~ D/N ⇒ N need only scale linearly with D.
44. RL Gradient Parameter Training
- Recall the incremental training of lookup tables
- If instead V(s) = V_w(s), adjust the parameters w to reduce the MSE (R - V(s))^2 by gradient descent (update sketched below)
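The update rule itself is missing from the converted slides; the standard gradient-descent form of the rule described above is:

    w \leftarrow w + \alpha \left[ R_t - V_w(s) \right] \nabla_w V_w(s)

i.e., follow the negative gradient of the squared error (R_t - V_w(s))^2 with respect to the parameters w.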
45.
- Example: TD(λ) training of neural networks (episodic, γ = 1 and intermediate r = 0)
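The TD(λ) weight-update equation did not survive conversion; a standard form for this episodic, γ = 1 setting (consistent with the TD-Gammon papers, given here as a hedged reconstruction) is:

    w_{t+1} - w_t = \alpha \left( V_{t+1} - V_t \right) \sum_{k=1}^{t} \lambda^{\,t-k} \nabla_w V_k

with the final target V_{f+1} replaced by the terminal reward z.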
46. Case-Study Applications
- Several commonalities:
- Problems are more-or-less MDPs
- S is enormous ⇒ can't do DP
- State-space representation is critical: use features based on domain knowledge
- FA is reasonably simple (linear or NN)
- Train in a simulator! Need lots of experience, but still << |S|
- Only visit plausible states; only generalize to plausible states
48. Learning backgammon using TD(λ)
- Neural net observes a sequence of input patterns x1, x2, x3, ..., xf: the sequence of board positions occurring during a game
- Representation: raw board description (number of White or Black checkers at each location) using a simple truncated unary encoding (hand-crafted features were added in later versions)
- At the final position xf, a reward signal z is given:
- z = 1 if White wins
- z = 0 if Black wins
- Train the neural net using the gradient version of TD(λ)
- The trained NN output V_t = V(x_t, w) should estimate prob(White wins | x_t)
50. Q: Who makes the moves?
- A: Let the neural net make the moves itself, using its current evaluator: score all legal moves, and pick max V_t for White, or min V_t for Black.
- Hopelessly non-theoretical and crazy:
- Training V^π while π is non-stationary (no convergence proof)
- Training V^π using a nonlinear function approximator (no convergence proof)
- Random initial weights ⇒ random initial play! Extremely long sequences of random moves and random outcomes ⇒ learning seems hopeless to a human observer
- But what the heck, let's just try and see what happens...
51.
- TD-Gammon can teach itself by playing games against itself and learning from the outcome
- Works even starting from random initial play and zero initial expert knowledge (surprising!) ⇒ achieves strong intermediate play
- add hand-crafted features: advanced level of play (1991)
- 2-ply search: strong master play (1993)
- 3-ply search: superhuman play (1998)
- TD-Leaf: n-step TD backups in 2-player games (Beal; Baxter et al.): great results for checkers and chess
52. RL Success Stories/Videos
- U. Michigan RL wiki page
- keep-away in the RoboCup simulator
- Aibo fast walking gait; ball acquisition
- Humanoid robot air hockey
- Helicopter aerobatics (Ng et al.):
- Human flies the helicopter for 10-20 mins
- Perform system identification: learn a model of the helicopter dynamics
- Using the model, train an RL policy in simulation
53. Cell-phone channel allocation
- S. Singh and D. Bertsekas, NIPS-96
- Dynamic resource allocation: assign channels to calls in a cell; channels can't interfere with neighboring cells
- The problem is a real-time discrete-event MDP with a huge state space: ~70^49 states
- Objective: maximize ...
54. Modified Bellman optimality equation
- Modify the equation to handle continuous time and discrete events
- where s = configuration, e = random event (arrival, handoff, departure), a = action, Δt = random time to the next event, and c(s, a, Δt) = effective immediate payoff
55.
- represent s → x using 2 features for each cell:
- Availability: number of free channels in a cell
- Cell-channel packing: number of times the channel is used within a 4-cell radius
- represent V using a linear FA: V = θᵀx
- train in a simulator using the gradient version of TD(0)
56. RL training results (BDCL = best previous algorithm)
66. RL for Spoken Dialogue Systems
- Singh, Litman, Kearns, Walker (JAIR 2002)
- Sequence of human-computer speech interactions
- Used in a DB-query system: NJFun, a database of leisure activities in NJ, organized by (type, location, time)
- Humans aren't MDPs, but pretend they are: devise an MDP representation of the system-human interaction
67.
- Severely restrict the state space: 7 state variables and 42 choice-state combinations
68.
- Severely restrict the policy: 2 actions possible in each choice-state ⇒ 2^42 possible policies; train using random exploration
- Actions are spoken requests to the user, classified as:
- system initiative: "Please state the type of activity you are interested in."
- user initiative: "How may I help you?"
- mixed initiative: "Please say the location you are interested in. You can also tell me the time."
- confirmation of an attribute: "Did you say you are interested in going to a museum?"
- Train on a corpus of 311 dialogues (using AT&T volunteers); test the trained system on 124 test dialogues. Reward after each dialogue is both objective (was the specific task completed exactly or partially) as well as subjective (good, bad, or so-so performance) from the human
- Small MDP, but we don't have a model! ⇒ Do Q-learning using sample trajectories from the above random-exploration policy
69.
- Results: the learned policy is much better than random exploration
70.
- Results: the learned policy is much better than standard policies
72. RL Mashups
- RL + semi-supervised learning
- RL + active learning
- RL + metric learning
- RL + dimensionality reduction
- Bayesian RL
- RL + SVMs/kernel methods
- RL + semi-definite programming
- RL + Gaussian process models
- etc., etc.
- NIPS 2006 workshop "Towards a New Reinforcement Learning": www.jan-peters.net/Research/NIPS2006
73. Final remarks on RL
- Can solve MDPs on-line, in the real environment, without knowing the underlying MDP
- Function approximators can avoid the curse of dimensionality
- Beyond MDPs: active research in RL for
- high-level planning,
- structured (e.g. factored, hierarchical) MDPs,
- partially observable MDPs (POMDPs),
- history-dependent problems,
- non-stationary problems,
- multi-agent problems
- For more info, go to RichSutton.com
74. Game Theory and Multi-Agent Learning
75. Outline
- Description of the problem
- Tools and concepts from RL and game theory
- Naive approaches to multi-agent learning
- ordinary single-agent RL
- evolutionary game theory
- Sophisticated approaches
- minimax-Q, FriendOrFoe-Q (Littman)
- tinkering with learning rates: WoLF (Bowling), strategic teaching (Camerer)
- Challenges and opportunities
76. Normal single-agent learning
- Assume that the environment has observable states, characterizable expected rewards and state transitions, and that all of the above is stationary (MDP-ish)
- Non-learning, theoretical solution to the fully specified problem: the DP formalism
- Learning: solve by trial and error without a full specification: RL, exploration, Monte Carlo, ...
77. Multi-Agent Learning Problem
- Agent tries to solve its learning problem, while other agents in the environment are also trying to solve their own learning problems ⇒ challenging non-stationarity.
- Main scenarios: (1) cooperative, (2) self-interested (many deep issues swept under the rug)
- Agent may know very little about the other agents:
- payoffs may be unknown
- learning algorithms unknown
- Traditional method of solution: game theory (uses several questionable assumptions)
78. MAL needs foundational principles!
- A precise problem formulation is still lacking! See "If Multi-Agent Learning is the Answer, What is the Question?", Shoham et al., 2006
- Some (debatable) MAL objectives:
- Learning should converge to a stationary strategy
- In self-play learning (all agents use the same learning algorithm), learners should jointly converge to an equilibrium strategy
- Learning should achieve payoffs as good as a best-response to the other agents' strategies
- (Worst-case bound) Learning should guarantee a minimum payoff (security payment, no-regret property)
79. Game Theory
- Provides essential theoretical/conceptual background for tackling multi-agent learning
- Wikipedia definition:
- "Game theory is most often described as a branch of applied mathematics and economics that studies situations where players choose different actions in an attempt to maximize their returns. The essential feature, however, is that it provides a formal modelling approach to social situations in which decision makers interact with other minds."
- Today, widely used in politics, business, economics, biology, psychology, computer science, etc.
80. Fundamental Postulate of Game Theory: Rationality
- A rational player/agent will make decisions that maximize her individual expected utility (= expected payoff, for simplicity) given her understanding/beliefs about the problem. She is also perfectly indifferent to the payoffs received by other players.
81. Basics of game theory
- A game is specified by players (1..N), actions, and (expected) payoff matrices (functions of joint actions)
- (payoff bimatrix: rows indexed by A's action, columns by B's action; each entry lists A's payoff and B's payoff)
- If the payoff matrices are identical, A and B are cooperative; otherwise the game is non-cooperative (zero-sum = purely competitive)
82. Basic lingo (2)
- Games with no states: (bi)matrix games
- Games with states: stochastic games, Markov games (state transitions are functions of joint actions)
- Games with simultaneous moves: normal form
- Games with alternating turns: extensive form
- Number of rounds = 1: one-shot game
- Number of rounds > 1: repeated game
- deterministic action policy: pure strategy
- non-deterministic action policy: mixed strategy, e.g. Prob(R,P,S) = (½, ¼, ¼)
83. Stochastic vs. Matrix Games
- A stochastic game (a.k.a. Markov game) generalizes MDPs to multiple agents:
- finite state space S
- joint action set
- stationary reward distributions
- stationary transition probabilities
- A matrix game has no state information, only joint actions and payoffs (|S| = 1)
84. Basic Analysis
- Agent i's mixed strategy x_i is a best-response to the others' x_{-i} if it maximizes i's payoff given x_{-i}
- x_i is a dominant strategy if it maximizes i's payoff regardless of what the others do
- A joint strategy x is an equilibrium if each agent's strategy is simultaneously a best-response to everyone else's strategy, i.e. no one has an incentive to deviate. Nash equilibrium is the main one, but there are others (e.g. correlated equilibrium)
- A Nash equilibrium always exists, but there may be exponentially many of them, and they can be very hard to compute
- equilibrium coordination (players agreeing on which equilibrium to choose) is a big problem
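In symbols (a standard formulation, not taken from the slides), with u_i denoting agent i's expected payoff:

    x_i \in BR_i(x_{-i}) \iff u_i(x_i, x_{-i}) \ge u_i(x_i', x_{-i}) \quad \forall x_i'
    x^{*} \text{ is a Nash equilibrium} \iff x_i^{*} \in BR_i(x_{-i}^{*}) \ \text{for every agent } i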
85. What about imperfect-information games?
- A Nash equilibrium requires full observability of all game information. For imperfect-information games (e.g. each player has private information), the corresponding concept is the Bayes-Nash equilibrium (Nash plus Bayesian inference over the hidden information). Even more intractable than regular Nash.
86. Pros and Cons of game theory
- Game theory provides a basic conceptual/theoretical framework for thinking about multi-agent learning.
- Game theory is appropriate provided that:
- the game is stationary and fully specified [X]
- there is enough computer power to compute an equilibrium [X]
- we can assume the other agents are also game theorists [X]
- we can solve the equilibrium coordination problem [X]
- The above conditions rarely hold in real applications
- Multi-agent learning is not only a fascinating problem, it may be the only viable option.
87. Real-Life vs. Game Theory Games
- Real-life games:
- NFL playoffs
- World Series of Poker
- World of Warcraft
- Buying a house
- Salary negotiations
- Competitive pricing
- Best Buy vs. Circuit City
- Airline fare wars
- OPEC production cuts
- NASDAQ, NYSE, ...
- FCC spectrum auctions
- Game-theory games:
- Matching Pennies
- Rock-Paper-Scissors
- Prisoner's Dilemma
- Battle-of-the-Sexes
- Chicken
- Ultimatum
88. Assumptions in Normal-Form Games
- Game specification is fully known: actions and payoffs are fully observable by all players
- Players act simultaneously, i.e. without observing the actions of others (not scalable!)
- Assume no communication between players, or that it doesn't affect play (communication is "cheap talk")
- Basic analysis assumes the game is only played once (called "one-shot")
89. Rock-Paper-Scissors Payoffs in a Bimatrix
- This is a zero-sum game, since for each pair of joint actions the players' payoffs add up to zero.
- This is a symmetric game: invariant under swapping of the player labels
- This game has a unique mixed-strategy Nash equilibrium: both players play the uniform random strategy prob(R,P,S) = (1/3, 1/3, 1/3)
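The bimatrix itself was an image on the slide; the standard Rock-Paper-Scissors payoffs (win = +1, loss = -1, tie = 0), written as (row player's payoff, column player's payoff), are:

                  R          P          S
        R     ( 0,  0)   (-1, +1)   (+1, -1)
        P     (+1, -1)   ( 0,  0)   (-1, +1)
        S     (-1, +1)   (+1, -1)   ( 0,  0)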
90. Prisoner's Dilemma Game
91. Prisoner's Dilemma Game
Whatever Prisoner 2 does, the best that Prisoner 1 can do is Confess.
92. Prisoner's Dilemma Game
Whatever Prisoner 1 does, the best that Prisoner 2 can do is Confess.
93. Prisoner's Dilemma Game
A strategy is a dominant strategy if it is a player's strictly best response to any strategies the other players might pick. A dominant-strategy equilibrium is a strategy combination consisting of each player's dominant strategy.
Each player has a dominant strategy: Confess. The dominant-strategy equilibrium is (Confess, Confess).
94. Prisoner's Dilemma Game
The payoff in the dominant-strategy equilibrium, (-8,-8), is worse for both players than (-1,-1), the payoff in the case that both players hold out. Thus, the Prisoner's Dilemma is a game of social conflict.
Opportunity for multi-agent learning: through repeated play, the Pareto-optimal solution (-1,-1) can emerge as a result of learning (it can also arise in evolutionary game theory).
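For reference, a Prisoner's Dilemma bimatrix consistent with the payoffs quoted above; only the diagonal entries (-8,-8) and (-1,-1) appear on the slides, so the off-diagonal values (0, -10) are illustrative assumptions chosen to make Confess dominant. Entries are (Prisoner 1's payoff, Prisoner 2's payoff):

                                Prisoner 2: Hold out    Prisoner 2: Confess
        Prisoner 1: Hold out         (-1,  -1)               (-10,   0)
        Prisoner 1: Confess          ( 0, -10)               ( -8,  -8)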
95. Battle of the Sexes
96. Battle of the Sexes
- This game has:
- no (iterated) dominant-strategy equilibrium
97. Battle of the Sexes
- This game has:
- no (iterated) dominant-strategy equilibrium
98. Battle of the Sexes
- This game has:
- no (iterated) dominant-strategy equilibrium
- two Nash equilibria: (Prize Fight, Prize Fight) and (Ballet, Ballet)
99. Battle of the Sexes
This game has two Nash equilibria. How can these two players coordinate?
100. Multiagent Q-learning desiderata
- performs well vs. arbitrarily adapting other agents
- exact best-response is probably impossible
- Doesn't need a correct model of the other agents' learning algorithms
- But modeling is fair game
- Doesn't need to know the other agents' payoffs
- Estimate the other agents' strategies from observation
- do not assume game-theoretic play
- No assumption of a stationary outcome: the population may never reach equilibrium; agents may never stop adapting
- Self-play convergence to repeated Nash would be nice but is not necessary (it is unreasonable to seek convergence to a one-shot Nash)
101. Naive Approaches to Multi-Agent Learning
- Basic idea: the agent adapts, ignoring the non-stationarity of the other agents' strategies
- 1. Evolutionary game theory / Replicator Dynamics: models a large population of agents using different strategies; the fittest agents breed more copies.
- Let x = population strategy vector, and x_k = fraction of the population playing strategy k. The growth rate is then given by the replicator equation (standard form below)
- The same equation can also be derived from an imitation model
- Nash equilibria are fixed points of the equation, but not necessarily attractors (they may be unstable or only neutrally stable)
102. Many possible dynamic behaviors...
- limit cycles, attractors, unstable fixed points
- Also saddle points, chaotic orbits, ...
103. Replicator dynamics: auction bidding strategies
104. More Naive Approaches
- 2. Iterated Gradient Ascent (Singh, Kearns and Mansour): again does a myopic adaptation to the other players' current strategy.
- Coupled system of linear equations: the payoff u is linear in x_i and x_{-i}
- Analysis for two-player, two-action games: either converges to a Nash fixed point on the boundary (at least one pure strategy), or gets limit cycles
105. Further Naive Approaches
- 3. "Dumb" Single-Agent Learning: use a single-agent algorithm in a multi-agent problem and hope that it works
- No-regret learning by pricebots (Greenwald & Kephart)
- Simultaneous Q-learning by pricebots (Tesauro & Kephart)
- In many cases, this actually works: the learners converge either exactly or approximately to self-consistent optimal strategies
- Naive approaches are "rational," i.e. they converge to a best response against a stationary opponent
- but they generally don't converge to a Nash equilibrium
106. A Fancier Approach
- 4. No-regret learning (Hart & Mas-Colell, Freund & Schapire, many others): define the regret for playing a sequence s_i instead of the constant action a_j for t time steps (see below)
- Then choose the next action with probability proportional to its regret:
- prob(action j) ∝ regret(j)
- This has a worst-case guarantee that the asymptotic regret per time step → 0, i.e. the learner will do as well as the best (constant) action choice
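The regret definition and action rule were images on the slide; a standard formulation (roughly the Hart & Mas-Colell regret-matching rule, given here as a hedged reconstruction) is:

    \text{regret}_t(j) = \frac{1}{t} \sum_{\tau=1}^{t} \left[ u(a_j, s^{-i}_{\tau}) - u(s^{i}_{\tau}, s^{-i}_{\tau}) \right]
    \text{prob}(\text{action } j) \propto \max\left( \text{regret}_t(j),\, 0 \right)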
107. Sophisticated approaches
- These take into account the possibility that the other agents' strategies might change.
- 4. Equilibrium Q-learners
- Minimax-Q (Littman): converges to the Nash equilibrium for two-player zero-sum stochastic games
- FriendOrFoe-Q (Littman): a convergent algorithm for games where every other player can be identified as a "friend" (same payoffs as me) or "foe" (payoffs are zero-sum)
- These algorithms converge to a Nash equilibrium but aren't "rational," since they don't best-respond to a fixed opponent
108. More sophisticated approaches...
- 5. Varying learning rates
- WoLF ("Win or Learn Fast," Bowling): the agent reduces its learning rate when performing well, and increases it when doing badly. Improves convergence of IGA and policy hill-climbing
- GIGA-WoLF (Bowling): combines the IGA algorithm with the WoLF idea. Provably convergent and no-regret.
109. More sophisticated approaches...
- 6. Strategic Teaching: recognizes that the other players' strategies are adaptive
- A strategic teacher may play a strategy which is not myopically optimal (such as cooperating in Prisoner's Dilemma) in the hope that it induces adaptive players to expect that strategy in the future, which triggers a best-response that benefits the teacher. (Camerer, Ho and Chong)
110. Theoretical Research Challenges
- Proper theoretical formulation?
- "No short-cut" hypothesis: massive on-line search a la Deep Blue to maximize expected long-term reward
- (Bayesian) Model and predict the behavior of other players, including how they learn based on my actions (beware of infinite model recursion)
- trial-and-error exploration
- continual Bayesian inference using all evidence over all uncertainties (Boutilier's Bayesian exploration)
- When can you get away with simpler methods?
111. Real-World Opportunities
- Multi-agent systems where you can't do game theory (covers everything :-))
- Electronic marketplaces
- Mobile networks
- Self-managing computer systems
- Teams of robots
- Video games
- Military/counter-terrorism applications
112. Backup Slides