Title: Reinforcement Learning
1. Reinforcement Learning
- Yishay Mansour
- Tel-Aviv University
2. Outline
- Goal of Reinforcement Learning
- Mathematical Model (MDP)
- Planning
3. Goal of Reinforcement Learning
Goal-oriented learning through interaction: control of large-scale stochastic environments with partial knowledge.
Supervised / Unsupervised Learning: learn from labeled / unlabeled examples.
4. Reinforcement Learning - Origins
- Artificial Intelligence
- Control Theory
- Operations Research
- Cognitive Science / Psychology
Solid foundations; well-established research.
5. Typical Applications
- Robotics
- Elevator control [CB]
- Robo-soccer [SV]
- Board games
- Backgammon [T]
- Checkers [S]
- Chess [B]
- Scheduling
- Dynamic channel allocation [SB]
- Inventory problems
6. Contrast with Supervised Learning
The system has a state.
The algorithm influences the state distribution.
Inherent tradeoff: exploration versus exploitation.
7. Mathematical Model - Motivation
Model of uncertainty: environment, actions, our knowledge.
Focus on decision making.
Maximize long-term reward.
Markov Decision Process (MDP)
8. Mathematical Model - MDP
Markov Decision Process:
S - set of states
A - set of actions
δ - transition probability
R - reward function
Similar to a DFA!
9. MDP Model - States and Actions
[Figure: environment states; taking action a from a state leads to different next states, e.g., with probabilities 0.7 and 0.3. Actions induce stochastic transitions.]
10. MDP Model - Rewards
R(s,a): the reward at state s for doing action a (a random variable).
Example: R(s,a) = -1 with probability 0.5
                   10 with probability 0.35
                   20 with probability 0.15
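For this example, the expected immediate reward is E[R(s,a)] = 0.5·(-1) + 0.35·10 + 0.15·20 = 6.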
11. MDP Model - Trajectories
A trajectory is a sequence s_0, a_0, r_0, s_1, a_1, r_1, ... of states, actions, and immediate rewards generated by following a policy.
12. MDP - Return Function
Combining all the immediate rewards into a single value.
Modeling issues:
Are early rewards more valuable than later rewards?
Is the system terminating or continuous?
Usually the return is linear in the immediate rewards.
13. MDP Model - Return Functions
Finite horizon - parameter H.
Infinite horizon:
- discounted - parameter γ < 1.
- undiscounted.
Terminating MDP.
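For reference, the standard forms of these returns, with immediate rewards r_t, are:
finite horizon: r_0 + r_1 + ... + r_H;
discounted: Σ_{t≥0} γ^t r_t with γ < 1;
undiscounted: the long-run average lim_{T→∞} (1/T) Σ_{t<T} r_t.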
14. MDP Model - Action Selection
AIM: Maximize the expected return.
Fully observable - we can see the entire state.
Policy - a mapping from states to actions.
Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
15. Contrast with Supervised Learning
Supervised learning: fixed distribution on examples.
Reinforcement learning: the state distribution is policy dependent!
A small local change in the policy can make a huge global change in the return.
16. MDP Model - Summary
- S = {s_1, ..., s_n} - set of states.
- A = {a_1, ..., a_k} - set of k actions.
- δ(s,a) - transition function.
- R(s,a) - immediate reward function.
- π : S → A - policy.
- V^π - discounted cumulative return.
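For concreteness, a minimal sketch of how these ingredients might be held in code, assuming a tabular representation (the class name and array layout are illustrative, not from the slides):

```python
import numpy as np

# A minimal tabular MDP container: n states, k actions,
# transition probabilities delta[s, a, s'] and expected rewards R[s, a].
class TabularMDP:
    def __init__(self, n_states, n_actions, delta, rewards, gamma):
        assert delta.shape == (n_states, n_actions, n_states)
        assert np.allclose(delta.sum(axis=2), 1.0)  # each (s, a) row is a distribution
        self.n_states = n_states
        self.n_actions = n_actions
        self.delta = delta          # transition function delta(s, a, s')
        self.rewards = rewards      # expected immediate reward R(s, a)
        self.gamma = gamma          # discount factor
```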
17. Simple Example: N-Armed Bandit
Single state s with actions a1, a2, a3, ...
Goal: maximize the sum of immediate rewards.
Given the model: take the greedy action.
Difficulty: unknown model.
18. N-Armed Bandit: Highlights
- Algorithms (near greedy)
- Exponential weights
- G_i = sum of rewards of action a_i
- w_i = e^{G_i}
- Follow the leader
- Results
- For any sequence of T rewards:
- E[online] ≥ max_i G_i - sqrt(T log N)
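A small sketch of the exponential-weights idea, assuming full-information feedback (every G_i can be updated each round), rewards in [0, 1], and an added learning-rate parameter eta that the slide does not specify:

```python
import numpy as np

def exponential_weights(reward_fn, n_actions, T, eta=0.1, seed=0):
    """Sketch of exponential weights over N actions.

    Assumes reward_fn(t) returns the length-n_actions reward vector
    for round t (full information), with rewards in [0, 1]."""
    rng = np.random.default_rng(seed)
    G = np.zeros(n_actions)              # G_i: cumulative reward of action a_i
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (G - G.max()))  # w_i proportional to e^{eta * G_i}, shifted for stability
        p = w / w.sum()
        action = rng.choice(n_actions, p=p)
        r = reward_fn(t)                 # reward vector for this round
        total += r[action]
        G += r                           # update all cumulative rewards
    return total, G
```

Algorithms of this kind achieve the sqrt(T log N) gap to max_i G_i quoted above.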
19. Planning - Basic Problems
Given a complete MDP model:
Policy evaluation - given a policy π, estimate its return.
Optimal control - find an optimal policy π* (maximizing the return from any start state).
20. Planning - Value Functions
V^π(s): the expected return starting at state s and following π.
Q^π(s,a): the expected return starting at state s, taking action a, and then following π.
V*(s) and Q*(s,a) are defined using an optimal policy π*:
V*(s) = max_π V^π(s)
21. Planning - Policy Evaluation
Discounted infinite horizon (Bellman equation):
V^π(s) = E[R(s, π(s))] + γ E_{s' ~ δ(s, π(s))}[V^π(s')]
A linear system of equations.
22. Algorithms - Policy Evaluation Example
A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random.
∀a: R(s_i, a) = i.
[Figure: four states s_0, s_1, s_2, s_3 on a cycle with rewards 0, 1, 2, 3.]
V^π(s_0) = 0 + γ [π(s_0, +1) V^π(s_1) + π(s_0, -1) V^π(s_3)]
23. Algorithms - Policy Evaluation Example
V^π(s_0) = 5/3, V^π(s_1) = 7/3, V^π(s_2) = 11/3, V^π(s_3) = 13/3
A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random.
∀a: R(s_i, a) = i.
[Figure: the same four-state cycle s_0, s_1, s_2, s_3 with rewards 0, 1, 2, 3.]
V^π(s_0) = 0 + (V^π(s_1) + V^π(s_3)) / 4
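As a sanity check, a short sketch that solves this policy-evaluation linear system numerically for the four-state cycle (the setup follows the example; variable names are illustrative):

```python
import numpy as np

# Four states on a cycle, actions +1/-1, random policy, gamma = 1/2,
# reward R(s_i, a) = i. Policy evaluation solves (I - gamma * P) V = r.
gamma = 0.5
n = 4
P = np.zeros((n, n))            # transition matrix under the random policy
for i in range(n):
    P[i, (i + 1) % n] += 0.5    # action +1 taken with probability 1/2
    P[i, (i - 1) % n] += 0.5    # action -1 taken with probability 1/2
r = np.arange(n, dtype=float)   # expected immediate reward at s_i is i

V = np.linalg.solve(np.eye(n) - gamma * P, r)
print(V)  # approximately [5/3, 7/3, 11/3, 13/3]
```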
24. Algorithms - Optimal Control
State-action value function:
Q^π(s,a) = E[R(s,a)] + γ E_{s' ~ δ(s,a)}[V^π(s')]
Note: for a deterministic policy π, V^π(s) = Q^π(s, π(s)).
25. Algorithms - Optimal Control Example
Q^π(s_0, +1) = 7/6, Q^π(s_0, -1) = 13/6
A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random.
∀a: R(s_i, a) = i.
[Figure: the same four-state cycle s_0, s_1, s_2, s_3 with rewards 0, 1, 2, 3.]
Q^π(s_0, +1) = 0 + γ V^π(s_1)
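Plugging in the values from the previous slide: Q^π(s_0, +1) = (1/2)(7/3) = 7/6 and Q^π(s_0, -1) = (1/2)(13/3) = 13/6.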
26. Algorithms - Optimal Control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a Q^π(s,a)   (Bellman equation)
PROOF: Assume there is a state s and an action a such that
V^π(s) < Q^π(s,a).
Then the strategy of performing a at state s (the first time) is better than π.
This is true each time we visit s, so the policy that always performs action a at state s is better than π.
27. Algorithms - Optimal Control Example
A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random.
∀a: R(s_i, a) = i.
[Figure: the same four-state cycle s_0, s_1, s_2, s_3 with rewards 0, 1, 2, 3.]
Changing the policy using the state-action value function.
28. Algorithms - Optimal Control
The greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a Q^π(s,a)
The ε-greedy policy with respect to Q^π(s,a) is
π(s) = argmax_a Q^π(s,a) with probability 1 - ε, and
π(s) = a random action with probability ε.
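A minimal sketch of greedy and ε-greedy action selection from a tabular state-action value array (names and shapes are assumptions, not from the slides):

```python
import numpy as np

def greedy(Q, s):
    """Greedy policy: pi(s) = argmax_a Q[s, a] for a tabular Q of shape (n_states, n_actions)."""
    return int(np.argmax(Q[s]))

def epsilon_greedy(Q, s, epsilon, rng=None):
    """epsilon-greedy: greedy with probability 1 - epsilon, a random action with probability epsilon."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # uniformly random action
    return int(np.argmax(Q[s]))
```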
29. MDP - Computing the Optimal Policy
1. Linear programming
2. Value iteration
3. Policy iteration
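As an illustration of the value iteration method, a compact sketch on a tabular MDP (the transition tensor delta[s, a, s'] and reward matrix R[s, a] are assumed inputs, not taken from the slides):

```python
import numpy as np

def value_iteration(delta, R, gamma, tol=1e-8):
    """Value iteration: repeatedly apply the Bellman optimality operator
    V(s) <- max_a [ R(s, a) + gamma * sum_s' delta(s, a, s') V(s') ]
    until the change is below tol. Returns V* and a greedy policy."""
    n_states, n_actions, _ = delta.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * delta @ V       # Q[s, a] = R[s, a] + gamma * E[V(s')]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```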
30. Convergence
- Value iteration
- Drop in the distance from optimal:
- max_s |V*(s) - V_t(s)|
- Policy iteration
- The policy can only improve:
- ∀s: V_{t+1}(s) ≥ V_t(s)
- Fewer iterations than value iteration, but more expensive iterations.
31. Relations to Board Games
- State: current board.
- Action: what we can play.
- Opponent's action: part of the environment.
- Value function: probability of winning.
- Q-function: modified policy.
- Hidden assumption: the game is Markovian.
32. Planning versus Learning
Tightly coupled in reinforcement learning.
Goal: maximize return while learning.
33. Example - Elevator Control
Learning (alone): model the arrival process well.
Planning (alone): given the arrival model, build a schedule.
Real objective: construct a schedule while updating the model.
34. Partially Observable MDP
Rather than observing the state, we observe some function of the state.
Ob - observation function, a random variable for each state.
Examples: (1) Ob(s) = s + noise. (2) Ob(s) = first bit of s.
Problem: different states may look similar.
The optimal strategy is history dependent!
35. POMDP - Belief State Algorithm
Given a history of actions and observations, we compute a posterior distribution over the state we are in (the belief state).
The belief-state MDP:
States: distributions over S (the states of the POMDP).
Actions: as in the POMDP.
Transition: the posterior distribution (given the observation).
We can perform planning and learning on the belief-state MDP.
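A rough sketch of one belief-state update, assuming a tabular POMDP with transition tensor delta[s, a, s'] and observation likelihoods obs_prob[s, o] (both names are illustrative):

```python
import numpy as np

def belief_update(belief, action, observation, delta, obs_prob):
    """One Bayesian belief-state update for a tabular POMDP.

    belief[s]        - current distribution over states
    delta[s, a, s']  - transition probabilities
    obs_prob[s, o]   - probability of observing o in state s
    Returns the posterior distribution over next states."""
    predicted = belief @ delta[:, action, :]          # predict: sum_s b(s) * delta(s, a, s')
    posterior = predicted * obs_prob[:, observation]  # correct: weight by observation likelihood
    return posterior / posterior.sum()                # normalize
```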
36. POMDP - Hard Computational Problems
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (NP-complete) [PT, L].
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (PSPACE-complete) [PT, L].
Computing an infinite (polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].
37. Resources
- Reinforcement Learning: An Introduction - Sutton & Barto
- Markov Decision Processes - Puterman
- Dynamic Programming and Optimal Control - Bertsekas
- Neuro-Dynamic Programming - Bertsekas & Tsitsiklis
- Ph.D. thesis - Michael Littman