Title: Reinforcement Learning
1. Reinforcement Learning
- Yishay Mansour
- Tel-Aviv University
2. Outline
- Goal of Reinforcement Learning
- Mathematical Model (MDP)
- Planning
3. Goal of Reinforcement Learning
Goal-oriented learning through interaction: control of large-scale stochastic environments with partial knowledge.
Supervised / Unsupervised Learning: learn from labeled / unlabeled examples.
4. Reinforcement Learning - Origins
- Artificial Intelligence
- Control Theory
- Operations Research
- Cognitive Science / Psychology
Solid foundations; well-established research.
5. Typical Applications
- Robotics
- Elevator control [CB]
- Robo-soccer [SV]
- Board games
- Backgammon [T]
- Checkers [S]
- Chess [B]
- Scheduling
- Dynamic channel allocation [SB]
- Inventory problems
6. Contrast with Supervised Learning
The system has a state.
The algorithm influences the state distribution.
Inherent tradeoff: exploration versus exploitation.
7. Mathematical Model - Motivation
Model of uncertainty: environment, actions, our knowledge.
Focus on decision making.
Maximize long-term reward.
Markov Decision Process (MDP)
8. Mathematical Model - MDP
Markov Decision Process:
S - set of states
A - set of actions
δ - transition probability
R - reward function
Similar to a DFA!
9. MDP Model - States and Actions
[Figure: environment states; taking action a from a state leads to different next states, e.g., with probabilities 0.7 and 0.3. Actions induce stochastic transitions.]
10. MDP Model - Rewards
R(s,a): the reward at state s for doing action a (a random variable).
Example: R(s,a) = -1 with probability 0.5
                   10 with probability 0.35
                   20 with probability 0.15
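For this example, the expected immediate reward is E[R(s,a)] = 0.5·(-1) + 0.35·10 + 0.15·20 = 6.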
11. MDP Model - Trajectories
A trajectory is a sequence s_0, a_0, r_0, s_1, a_1, r_1, ... of states, actions, and immediate rewards generated by following a policy.
12. MDP - Return Function
Combining all the immediate rewards into a single value.
Modeling issues:
Are early rewards more valuable than later rewards?
Is the system terminating or continuous?
Usually the return is linear in the immediate rewards.
13. MDP Model - Return Functions
Finite horizon - parameter H.
Infinite horizon:
- discounted - parameter γ < 1.
- undiscounted.
Terminating MDP.
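For reference, the standard forms of these returns, with immediate rewards r_t, are:
finite horizon: r_0 + r_1 + ... + r_H;
discounted: Σ_{t≥0} γ^t r_t with γ < 1;
undiscounted: the long-run average lim_{T→∞} (1/T) Σ_{t<T} r_t.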
14. MDP Model - Action Selection
AIM: Maximize the expected return.
Fully observable - we can see the entire state.
Policy - a mapping from states to actions.
Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
15. Contrast with Supervised Learning
Supervised learning: fixed distribution on examples.
Reinforcement learning: the state distribution is policy dependent!
A small local change in the policy can make a huge global change in the return.
16. MDP Model - Summary
- S = {s_1, ..., s_n} - set of states.
- A = {a_1, ..., a_k} - set of k actions.
- δ(s,a) - transition function.
- R(s,a) - immediate reward function.
- π : S → A - policy.
- V^π - discounted cumulative return.
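For concreteness, a minimal sketch of how these ingredients might be held in code, assuming a tabular representation (the class name and array layout are illustrative, not from the slides):

```python
import numpy as np

# A minimal tabular MDP container: n states, k actions,
# transition probabilities delta[s, a, s'] and expected rewards R[s, a].
class TabularMDP:
    def __init__(self, n_states, n_actions, delta, rewards, gamma):
        assert delta.shape == (n_states, n_actions, n_states)
        assert np.allclose(delta.sum(axis=2), 1.0)  # each (s, a) row is a distribution
        self.n_states = n_states
        self.n_actions = n_actions
        self.delta = delta          # transition function delta(s, a, s')
        self.rewards = rewards      # expected immediate reward R(s, a)
        self.gamma = gamma          # discount factor
```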
17. Simple Example: N-Armed Bandit
Single state s with actions a1, a2, a3, ...
Goal: maximize the sum of immediate rewards.
Given the model: take the greedy action.
Difficulty: unknown model.
18. N-Armed Bandit: Highlights
- Algorithms (near greedy)
- Exponential weights
- G_i = sum of rewards of action a_i
- w_i = e^{G_i}
- Follow the leader
- Results
- For any sequence of T rewards:
- E[online] ≥ max_i G_i - sqrt(T log N)
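A small sketch of the exponential-weights idea, assuming full-information feedback (every G_i can be updated each round), rewards in [0, 1], and an added learning-rate parameter eta that the slide does not specify:

```python
import numpy as np

def exponential_weights(reward_fn, n_actions, T, eta=0.1, seed=0):
    """Sketch of exponential weights over N actions.

    Assumes reward_fn(t) returns the length-n_actions reward vector
    for round t (full information), with rewards in [0, 1]."""
    rng = np.random.default_rng(seed)
    G = np.zeros(n_actions)              # G_i: cumulative reward of action a_i
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (G - G.max()))  # w_i proportional to e^{eta * G_i}, shifted for stability
        p = w / w.sum()
        action = rng.choice(n_actions, p=p)
        r = reward_fn(t)                 # reward vector for this round
        total += r[action]
        G += r                           # update all cumulative rewards
    return total, G
```

Algorithms of this kind achieve the sqrt(T log N) gap to max_i G_i quoted above.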
19. Planning - Basic Problems
Given a complete MDP model:
Policy evaluation - given a policy π, estimate its return.
Optimal control - find an optimal policy π* (maximizing the return from any start state).
20. Planning - Value Functions
V^π(s): the expected return starting at state s and following π.
Q^π(s,a): the expected return starting at state s, taking action a, and then following π.
V*(s) and Q*(s,a) are defined using an optimal policy π*:
V*(s) = max_π V^π(s)
21. Planning - Policy Evaluation
Discounted infinite horizon (Bellman equation):
V^π(s) = E[R(s, π(s))] + γ E_{s' ~ δ(s, π(s))}[V^π(s')]
A linear system of equations.
22. Algorithms - Policy Evaluation Example
A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random.
∀a: R(s_i, a) = i.
[Figure: four states s_0, s_1, s_2, s_3 on a cycle with rewards 0, 1, 2, 3.]
V^π(s_0) = 0 + γ [π(s_0, +1) V^π(s_1) + π(s_0, -1) V^π(s_3)]
23. Algorithms - Policy Evaluation Example
V^π(s_0) = 5/3, V^π(s_1) = 7/3, V^π(s_2) = 11/3, V^π(s_3) = 13/3
A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random.
∀a: R(s_i, a) = i.
[Figure: the same four-state cycle s_0, s_1, s_2, s_3 with rewards 0, 1, 2, 3.]
V^π(s_0) = 0 + (V^π(s_1) + V^π(s_3)) / 4
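As a sanity check, a short sketch that solves this policy-evaluation linear system numerically for the four-state cycle (the setup follows the example; variable names are illustrative):

```python
import numpy as np

# Four states on a cycle, actions +1/-1, random policy, gamma = 1/2,
# reward R(s_i, a) = i. Policy evaluation solves (I - gamma * P) V = r.
gamma = 0.5
n = 4
P = np.zeros((n, n))            # transition matrix under the random policy
for i in range(n):
    P[i, (i + 1) % n] += 0.5    # action +1 taken with probability 1/2
    P[i, (i - 1) % n] += 0.5    # action -1 taken with probability 1/2
r = np.arange(n, dtype=float)   # expected immediate reward at s_i is i

V = np.linalg.solve(np.eye(n) - gamma * P, r)
print(V)  # approximately [5/3, 7/3, 11/3, 13/3]
```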
24. Algorithms - Optimal Control
State-action value function:
Q^π(s,a) = E[R(s,a)] + γ E_{s' ~ δ(s,a)}[V^π(s')]
Note: for a deterministic policy π, V^π(s) = Q^π(s, π(s)).
25. Algorithms - Optimal Control Example
Q^π(s_0, +1) = 7/6, Q^π(s_0, -1) = 13/6
A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random.
∀a: R(s_i, a) = i.
[Figure: the same four-state cycle s_0, s_1, s_2, s_3 with rewards 0, 1, 2, 3.]
Q^π(s_0, +1) = 0 + γ V^π(s_1)
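Plugging in the values from the previous slide: Q^π(s_0, +1) = (1/2)(7/3) = 7/6 and Q^π(s_0, -1) = (1/2)(13/3) = 13/6.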
26. Algorithms - Optimal Control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a Q^π(s,a)   (Bellman equation)
PROOF: Assume there is a state s and an action a such that
V^π(s) < Q^π(s,a).
Then the strategy of performing a at state s (the first time) is better than π.
This is true each time we visit s, so the policy that always performs action a at state s is better than π.
27. Algorithms - Optimal Control Example
A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random.
∀a: R(s_i, a) = i.
[Figure: the same four-state cycle s_0, s_1, s_2, s_3 with rewards 0, 1, 2, 3.]
Changing the policy using the state-action value function.
28. Algorithms - Optimal Control
The greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a Q^π(s,a)
The ε-greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a Q^π(s,a) with probability 1 - ε, and
π(s) = a random action with probability ε.
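A minimal sketch of greedy and ε-greedy action selection from a tabular state-action value array (names and shapes are assumptions, not from the slides):

```python
import numpy as np

def greedy(Q, s):
    """Greedy policy: pi(s) = argmax_a Q[s, a] for a tabular Q of shape (n_states, n_actions)."""
    return int(np.argmax(Q[s]))

def epsilon_greedy(Q, s, epsilon, rng=None):
    """epsilon-greedy: greedy with probability 1 - epsilon, a random action with probability epsilon."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # uniformly random action
    return int(np.argmax(Q[s]))
```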
29. MDP - Computing the Optimal Policy
1. Linear programming
2. Value iteration
3. Policy iteration
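As an illustration of the value iteration method, a compact sketch on a tabular MDP (the transition tensor delta[s, a, s'] and reward matrix R[s, a] are assumed inputs, not taken from the slides):

```python
import numpy as np

def value_iteration(delta, R, gamma, tol=1e-8):
    """Value iteration: repeatedly apply the Bellman optimality operator
    V(s) <- max_a [ R(s, a) + gamma * sum_s' delta(s, a, s') V(s') ]
    until the change is below tol. Returns V* and a greedy policy."""
    n_states, n_actions, _ = delta.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * delta @ V       # Q[s, a] = R[s, a] + gamma * E[V(s')]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```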
30. Convergence
- Value iteration
- Drop in the distance from optimal:
- max_s |V*(s) - V_t(s)|
- Policy iteration
- The policy can only improve:
- ∀s: V_{t+1}(s) ≥ V_t(s)
- Fewer iterations than value iteration, but more expensive iterations.
31. Relations to Board Games
- State: current board.
- Action: what we can play.
- Opponent's action: part of the environment.
- Value function: probability of winning.
- Q-function: modified policy.
- Hidden assumption: the game is Markovian.
32. Planning versus Learning
Tightly coupled in reinforcement learning.
Goal: maximize return while learning.
33. Example - Elevator Control
Learning (alone): model the arrival process well.
Planning (alone): given the arrival model, build a schedule.
Real objective: construct a schedule while updating the model.
34. Partially Observable MDP
Rather than observing the state, we observe some function of the state.
Ob - observation function, a random variable for each state.
Examples: (1) Ob(s) = s + noise. (2) Ob(s) = first bit of s.
Problem: different states may look similar.
The optimal strategy is history dependent!
35. POMDP - Belief State Algorithm
Given a history of actions and observations, we compute a posterior distribution over the state we are in (the belief state).
The belief-state MDP:
States: distributions over S (the states of the POMDP).
Actions: as in the POMDP.
Transition: the posterior distribution (given the observation).
We can perform planning and learning on the belief-state MDP.
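A rough sketch of one belief-state update, assuming a tabular POMDP with transition tensor delta[s, a, s'] and observation likelihoods obs_prob[s, o] (both names are illustrative):

```python
import numpy as np

def belief_update(belief, action, observation, delta, obs_prob):
    """One Bayesian belief-state update for a tabular POMDP.

    belief[s]        - current distribution over states
    delta[s, a, s']  - transition probabilities
    obs_prob[s, o]   - probability of observing o in state s
    Returns the posterior distribution over next states."""
    predicted = belief @ delta[:, action, :]          # predict: sum_s b(s) * delta(s, a, s')
    posterior = predicted * obs_prob[:, observation]  # correct: weight by observation likelihood
    return posterior / posterior.sum()                # normalize
```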
36. POMDP - Hard Computational Problems
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (NP-complete) [PT, L].
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (PSPACE-complete) [PT, L].
Computing an infinite (polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].
37. Resources
- Reinforcement Learning: An Introduction - Sutton & Barto
- Markov Decision Processes - Puterman
- Dynamic Programming and Optimal Control - Bertsekas
- Neuro-Dynamic Programming - Bertsekas & Tsitsiklis
- Ph.D. thesis - Michael Littman