Reinforcement Learning

1
Reinforcement Learning
  • Yishay Mansour
  • Tel-Aviv University

2
Outline
  • Goal of Reinforcement Learning
  • Mathematical Model (MDP)
  • Planning

3
Goal of Reinforcement Learning
Goal-oriented learning through interaction: control of large-scale stochastic environments with partial knowledge.
Supervised / Unsupervised Learning: learn from labeled / unlabeled examples.
4
Reinforcement Learning - origins
Artificial Intelligence, Control Theory, Operations Research, Cognitive Science, Psychology.
Solid foundations; well-established research.
5
Typical Applications
  • Robotics
  • Elevator control [CB]
  • Robo-soccer [SV]
  • Board games
  • backgammon [T],
  • checkers [S],
  • chess [B]
  • Scheduling
  • Dynamic channel allocation [SB]
  • Inventory problems.

6
Contrast with Supervised Learning
The system has a state.
The algorithm influences the state distribution.
Inherent tradeoff: exploration versus exploitation.
7
Mathematical Model - Motivation
Model of uncertainty: environment, actions, our knowledge.
Focus on decision making.
Maximize long-term reward.
Markov Decision Process (MDP)
8
Mathematical Model - MDP
Markov decision process:
S - set of states
A - set of actions
δ - transition probability
R - reward function
Similar to DFA!
9
MDP model - states and actions
Environment: states. Actions: transitions.
[Figure: from a state, action a leads to two next states with probabilities 0.7 and 0.3]
10
MDP model - rewards
R(s,a) = reward at state s for doing action a (a random variable).
Example: R(s,a) = -1 with probability 0.5
                  10 with probability 0.35
                  20 with probability 0.15
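For this example distribution, the expected immediate reward works out to
E[R(s,a)] = (-1)(0.5) + (10)(0.35) + (20)(0.15) = -0.5 + 3.5 + 3.0 = 6.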
11
MDP model - trajectories
12
MDP - Return function.
Combining all the immediate rewards into a single value.
Modeling issues:
Are early rewards more valuable than later rewards?
Is the system terminating or continuous?
Usually the return is linear in the immediate rewards.
13
MDP model - return functions
Finite horizon - parameter H
Infinite horizon:
discounted - parameter γ < 1
undiscounted
Terminating MDP
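In standard notation, with r_t denoting the immediate reward received at step t, these return functions are:

    Finite horizon (parameter H):          R = sum_{t=0}^{H} r_t
    Infinite horizon, discounted (γ < 1):  R = sum_{t=0}^{∞} γ^t r_t
    Infinite horizon, undiscounted:        R = lim_{N→∞} (1/N) sum_{t=0}^{N-1} r_t
    Terminating MDP:                       R = sum_{t=0}^{T} r_t, where T is the (random) time a terminal state is reached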
14
MDP model - action selection
AIM: Maximize the expected return.
Fully observable - can see the entire state.
Policy - mapping from states to actions.
Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
15
Contrast with Supervised Learning
Supervised Learning: fixed distribution on examples.
Reinforcement Learning: the state distribution is policy dependent!
A small local change in the policy can make a huge global change in the return.
16
MDP model - summary
S - set of states, |S| = n.
A - set of k actions, |A| = k.
δ - transition function.
R(s,a) - immediate reward function.
π - policy.
V^π - discounted cumulative return.
17
Simple example: N-armed bandit
Single state s; actions a1, a2, a3, ...
Goal: Maximize the sum of immediate rewards.
Given the model: greedy action.
Difficulty: unknown model.
18
N-Armed Bandit Highlights
  • Algorithms (near greedy)
  • Exponential weights
  • G_i = sum of rewards of action a_i
  • w_i = e^{G_i}
  • Follow the leader
  • Results
  • For any sequence of T rewards:
  • E[online] ≥ max_i G_i - sqrt(T log N)
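A minimal sketch of the exponential-weights rule above, assuming full-information rewards (every action's reward is observed each round) and a learning rate eta; the slide's w_i = e^{G_i} corresponds to eta = 1:

    import math

    def exponential_weights(reward_rows, eta=1.0):
        """Full-information exponential weights over N actions.

        reward_rows: list of length-N reward vectors, one per round.
        eta: learning rate (the slide implicitly uses eta = 1).
        Returns the algorithm's total expected reward.
        """
        n = len(reward_rows[0])
        G = [0.0] * n                             # cumulative reward G_i of each action
        total = 0.0
        for rewards in reward_rows:
            m = max(G)                            # subtract max(G) for numerical stability
            weights = [math.exp(eta * (g - m)) for g in G]
            z = sum(weights)
            total += sum(w / z * r for w, r in zip(weights, rewards))
            G = [g + r for g, r in zip(G, rewards)]
        return total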

19
Planning - Basic Problems.
Given a complete MDP model.
Policy evaluation - given a policy π, estimate its return.
Optimal control - find an optimal policy π* (maximizes the return from any start state).
20
Planning - Value Functions
V^π(s): the expected return starting at state s and following π.
Q^π(s,a): the expected return starting at state s with action a and then following π.
V*(s) and Q*(s,a) are defined using an optimal policy π*:
V*(s) = max_π V^π(s)
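In the discounted setting these definitions read, with the expectation taken over trajectories generated by π:

    V^π(s)   = E[ sum_{t=0}^{∞} γ^t R(s_t, π(s_t)) | s_0 = s ]
    Q^π(s,a) = E[ sum_{t=0}^{∞} γ^t R(s_t, a_t) | s_0 = s, a_0 = a, a_t = π(s_t) for t ≥ 1 ]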
21
Planning - Policy Evaluation
Discounted infinite horizon (Bellman Eq.)
V^π(s) = E[R(s,π(s))] + γ E_{s' ~ δ(s,π(s))}[V^π(s')]
Linear system of equations.
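A minimal sketch of exact policy evaluation via this linear system, assuming the policy-induced dynamics are packaged as a matrix P (P[s, s'] = probability of s → s' under π) and a vector r (r[s] = E[R(s, π(s))]); these container names are illustrative:

    import numpy as np

    def evaluate_policy(P, r, gamma):
        # Solve V = r + gamma * P V, i.e. (I - gamma * P) V = r.
        n = len(r)
        return np.linalg.solve(np.eye(n) - gamma * np.asarray(P), np.asarray(r, dtype=float))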
22
Algorithms - Policy Evaluation Example
A = {+1,-1}, γ = 1/2, δ(s_i,a) = s_{(i+a) mod 4}, π random
∀a: R(s_i,a) = i
[Figure: four states s0, s1, s2, s3 arranged in a cycle with rewards 0, 1, 2, 3]
V^π(s0) = 0 + γ (π(s0,+1) V^π(s1) + π(s0,-1) V^π(s3))
23
Algorithms -Policy Evaluation Example
V^π(s0) = 5/3, V^π(s1) = 7/3, V^π(s2) = 11/3, V^π(s3) = 13/3
A = {+1,-1}, γ = 1/2, δ(s_i,a) = s_{(i+a) mod 4}, π random
∀a: R(s_i,a) = i
[Figure: the same four-state cycle s0, s1, s2, s3 with rewards 0, 1, 2, 3]
V^π(s0) = 0 + (V^π(s1) + V^π(s3))/4
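A quick numerical check of these values, solving the same linear system directly (the random policy moves from s_i to s_{i+1} or s_{i-1} with probability 1/2 each):

    import numpy as np

    P = np.zeros((4, 4))                       # policy-induced transition matrix
    for i in range(4):
        P[i, (i + 1) % 4] = 0.5
        P[i, (i - 1) % 4] = 0.5
    r = np.array([0.0, 1.0, 2.0, 3.0])         # R(s_i, a) = i for every action

    V = np.linalg.solve(np.eye(4) - 0.5 * P, r)
    print(V)                                   # approximately [5/3, 7/3, 11/3, 13/3]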
24
Algorithms - optimal control
State-action value function:
Q^π(s,a) = E[R(s,a)] + γ E_{s' ~ δ(s,a)}[V^π(s')]
Note: for a deterministic policy π, V^π(s) = Q^π(s,π(s)).
25
Algorithms -Optimal control Example
Q^π(s0,+1) = 7/6, Q^π(s0,-1) = 13/6
A = {+1,-1}, γ = 1/2, δ(s_i,a) = s_{(i+a) mod 4}, π random
∀a: R(s_i,a) = i
[Figure: the same four-state cycle s0, s1, s2, s3 with rewards 0, 1, 2, 3]
Q^π(s0,+1) = 0 + γ V^π(s1)
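Continuing the numerical check, the state-action values at s0 follow directly from the definition, using the V^π values computed above:

    gamma = 0.5
    V = [5/3, 7/3, 11/3, 13/3]
    Q_s0_plus  = 0 + gamma * V[1]    # R(s0,+1) + γ V^π(s1) = 7/6
    Q_s0_minus = 0 + gamma * V[3]    # R(s0,-1) + γ V^π(s3) = 13/6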
26
Algorithms - optimal control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a Q^π(s,a)   (Bellman Eq.)
PROOF: Assume there is a state s and action a s.t.
V^π(s) < Q^π(s,a).
Then the strategy of performing a at state s (the first time) is better than π.
This is true each time we visit s, so the policy that performs action a at state s is better than π.
27
Algorithms -optimal control Example
A = {+1,-1}, γ = 1/2, δ(s_i,a) = s_{(i+a) mod 4}, π random
∀a: R(s_i,a) = i
[Figure: the same four-state cycle s0, s1, s2, s3 with rewards 0, 1, 2, 3]
Changing the policy using the state-action value function.
28
Algorithms - optimal control
The greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a Q^π(s,a).
The ε-greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a Q^π(s,a) with probability 1-ε, and
π(s) = a random action with probability ε.
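A minimal sketch of both selection rules, assuming Q is stored as a dictionary keyed by (state, action) and actions is the finite action list (names are illustrative):

    import random

    def greedy_action(Q, s, actions):
        # π(s) = argmax_a Q(s, a)
        return max(actions, key=lambda a: Q[(s, a)])

    def epsilon_greedy_action(Q, s, actions, eps=0.1):
        # greedy with probability 1 - ε, a uniformly random action with probability ε
        if random.random() < eps:
            return random.choice(actions)
        return greedy_action(Q, s, actions)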
29
MDP - computing optimal policy
1. Linear Programming
2. Value Iteration method
3. Policy Iteration method
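A minimal value-iteration sketch for the finite MDP model above, assuming the model is given as delta[s][a] = list of (probability, next_state) pairs and r[s][a] = expected immediate reward (hypothetical container names):

    def value_iteration(states, actions, delta, r, gamma, iters=1000, tol=1e-8):
        # Iterate V_{t+1}(s) = max_a [ r[s][a] + gamma * sum_{s'} P(s'|s,a) V_t(s') ].
        V = {s: 0.0 for s in states}
        for _ in range(iters):
            newV = {
                s: max(r[s][a] + gamma * sum(p * V[s2] for p, s2 in delta[s][a])
                       for a in actions)
                for s in states
            }
            done = max(abs(newV[s] - V[s]) for s in states) < tol
            V = newV
            if done:
                break
        # Greedy policy with respect to the final value function.
        policy = {
            s: max(actions, key=lambda a: r[s][a] + gamma * sum(p * V[s2] for p, s2 in delta[s][a]))
            for s in states
        }
        return V, policy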
30
Convergence
  • Value Iteration
  • Drop in distance from optimal:
  • max_s |V*(s) - V_t(s)|
  • Policy Iteration (sketched after this list)
  • Policy can only improve:
  • ∀s: V_{t+1}(s) ≥ V_t(s)
  • Fewer iterations than Value Iteration, but
  • more expensive iterations.
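A matching policy-iteration sketch under the same assumed delta/r containers as the value-iteration sketch above; here policy evaluation is done by simple iterative sweeps rather than a linear solve:

    def policy_iteration(states, actions, delta, r, gamma, eval_sweeps=200):
        policy = {s: actions[0] for s in states}      # arbitrary initial policy
        while True:
            # Evaluate the current policy by repeated Bellman backups.
            V = {s: 0.0 for s in states}
            for _ in range(eval_sweeps):
                V = {s: r[s][policy[s]] +
                        gamma * sum(p * V[s2] for p, s2 in delta[s][policy[s]])
                     for s in states}
            # Greedy improvement step.
            new_policy = {
                s: max(actions, key=lambda a: r[s][a] + gamma * sum(p * V[s2] for p, s2 in delta[s][a]))
                for s in states
            }
            if new_policy == policy:                  # policy stable: stop
                return V, policy
            policy = new_policy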

31
Relations to Board Games
  • state: the current board
  • action: what we can play
  • opponent's action: part of the environment
  • value function: probability of winning
  • Q-function: modified policy
  • Hidden assumption: the game is Markovian

32
Planning versus Learning
Tightly coupled in Reinforcement Learning.
Goal: maximize return while learning.
33
Example - Elevator Control
Learning (alone): model the arrival process well.
Planning (alone): given the arrival model, build a schedule.
Real objective: construct a schedule while updating the model.
34
Partially Observable MDP
Rather than observing the state, we observe some function of the state.
Ob - observation function: a random variable for each state.
Example: (1) Ob(s) = s + noise. (2) Ob(s) = first bit of s.
Problem: different states may look similar.
The optimal strategy is history dependent!
35
POMDP - Belief State Algorithm
Given a history of actions and observed values, we compute a posterior distribution over the state we are in (belief state).
The belief-state MDP:
States: distributions over S (the states of the POMDP).
Actions: as in the POMDP.
Transition: the posterior distribution (given the observation).
We can perform the planning and learning on the belief-state MDP.
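A minimal sketch of the Bayes update behind this construction, assuming the POMDP is given by transition probabilities T[s][a][s2] and observation probabilities O[s2][o] (hypothetical names):

    def update_belief(belief, action, obs, states, T, O):
        # b'(s') is proportional to O[s'][obs] * sum_s T[s][action][s'] * b(s)
        new_belief = {
            s2: O[s2][obs] * sum(T[s][action][s2] * belief[s] for s in states)
            for s2 in states
        }
        total = sum(new_belief.values())
        if total == 0:
            raise ValueError("observation has zero probability under this belief")
        return {s2: p / total for s2, p in new_belief.items()}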
36
POMDP - Hard computational problems.
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (NP-complete) [PT, L].
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (PSPACE-complete) [PT, L].
Computing an infinite (polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].
37
Resources
  • Reinforcement Learning: An Introduction - Sutton & Barto
  • Markov Decision Processes - Puterman
  • Dynamic Programming and Optimal Control - Bertsekas
  • Neuro-Dynamic Programming - Bertsekas & Tsitsiklis
  • Ph.D. thesis - Michael Littman