Chapter 8: Generalization and Function Approximation

Slides: 31

Provided by: AndyB203

Transcript and Presenter's Notes

1
Chapter 8: Generalization and Function
Approximation
Objectives of this chapter:

Look at how experience with a limited part of the
state set can be used to produce good behavior over a
much larger part.
Overview of function approximation (FA) methods
and how they can be adapted to RL

2
Value Prediction with FA
As usual: Policy Evaluation (the prediction
problem): for a given policy π, compute
the state-value function V^π.
In earlier chapters, value functions were stored
in lookup tables.
3
Adapt Supervised Learning Algorithms
Training info = desired (target) outputs
Inputs → Supervised Learning System → Outputs
Training example = {input, target output}
Error = (target output − actual output)
4
Backups as Training Examples
As a training example:
input: a description of the state being backed up
target output: the backed-up value (the backup target)
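A minimal sketch of this idea, assuming a TD(0) backup is the one being converted. The function name `td0_training_example`, the integer state encoding, and the `estimates` table are hypothetical and used only for illustration.

```python
from typing import Callable, Tuple

State = int  # assumption: states are plain integers in this sketch


def td0_training_example(
    state: State,
    reward: float,
    next_state: State,
    value: Callable[[State], float],
    gamma: float = 1.0,
) -> Tuple[State, float]:
    """Return (input, target output) for one TD(0) backup."""
    # The backed-up value r + gamma * V(s') becomes the supervised target.
    target = reward + gamma * value(next_state)
    return state, target


# Current estimates, kept in a table here only to keep the example small.
estimates = {0: 0.0, 1: 0.5, 2: 1.0}
example = td0_training_example(
    0, reward=1.0, next_state=1, value=lambda s: estimates[s], gamma=0.9
)
print(example)  # (input state, target output)
```

Any supervised learner that accepts such (input, target) pairs one at a time could, in principle, be trained on them.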
5
Any Function Approximation Method?

In principle, yes
artificial neural networks
decision trees
multivariate regression methods
etc.
But RL has some special requirements
usually want to learn while interacting
ability to handle nonstationarity
other?

6
Gradient Descent Methods
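Parameter vector: $\vec\theta_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T$, where $T$ denotes transpose.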
7
Performance Measures

Many are applicable, but
a common and simple one is the mean-squared error
(MSE) over a distribution P that weights the various
errors:
$\mathrm{MSE}(\vec\theta_t) = \sum_{s \in S} P(s)\,[V^\pi(s) - V_t(s)]^2$
where V_t is the approximate value function
determined by the parameter vector at time t.
Why P?
Why minimize MSE?
Let us assume that P is always the distribution
of states at which backups are done.
The on-policy distribution: the distribution
created while following the policy being
evaluated. Stronger results are available for
this distribution.

8
Gradient Descent
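Iteratively move down the gradient of a function $f$ of the parameters:
$\vec\theta_{t+1} = \vec\theta_t - \alpha \nabla_{\vec\theta_t} f(\vec\theta_t)$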
9
Gradient Descent Cont.
For the MSE given above and using the chain rule
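The slide's equation did not survive in this transcript. A reconstruction, assuming the MSE is the P-weighted squared error between $V^\pi$ and the approximation $V_t$, and that the step is scaled by $\tfrac{1}{2}\alpha$ so the factor of 2 cancels:

```latex
\begin{aligned}
\vec\theta_{t+1}
  &= \vec\theta_t - \tfrac{1}{2}\alpha \,\nabla_{\vec\theta_t}
     \sum_{s \in S} P(s)\,\bigl[V^\pi(s) - V_t(s)\bigr]^2 \\
  &= \vec\theta_t + \alpha \sum_{s \in S} P(s)\,
     \bigl[V^\pi(s) - V_t(s)\bigr]\,\nabla_{\vec\theta_t} V_t(s)
\end{aligned}
```

The sign flips in the second line because differentiating $[V^\pi(s) - V_t(s)]^2$ with respect to the parameters brings down $-2\,\nabla_{\vec\theta_t} V_t(s)$.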
10
Gradient Descent Cont.
Assume that states appear with distribution P.
Use just the sample gradient instead.
Since each sample gradient is an unbiased
estimate of the true gradient, this converges to
a local minimum of the MSE if α decreases
appropriately with t.
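The sample-gradient update can be sketched on a toy problem. Everything here is an assumption made for illustration: the three states, the two-component feature vectors, the known target values standing in for V^π, the linear approximator, and the step-size schedule α_t = α_0 / (1 + t/1000).

```python
import random

# Hypothetical toy problem: three states, two features each, known targets.
FEATURES = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
TRUE_VALUE = {0: 1.0, 1: 2.0, 2: 3.0}  # stands in for V^pi(s)
P = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}     # distribution over states


def predict(theta, s):
    """Linear approximation: V(s) = theta . features(s)."""
    return sum(w * x for w, x in zip(theta, FEATURES[s]))


def sample_gradient_descent(steps=5000, alpha0=0.2, seed=0):
    rng = random.Random(seed)
    states = list(P)
    weights = [P[s] for s in states]
    theta = [0.0, 0.0]
    for t in range(steps):
        s = rng.choices(states, weights=weights)[0]  # state drawn from P
        alpha = alpha0 / (1.0 + t / 1000.0)          # decreasing step size
        error = TRUE_VALUE[s] - predict(theta, s)
        # For a linear approximator the gradient of V(s) is the feature vector.
        theta = [w + alpha * error * x for w, x in zip(theta, FEATURES[s])]
    return theta


theta = sample_gradient_descent()
print(theta)
```

In this toy problem the targets are exactly representable, so the weights should approach [1, 2]; with noisy targets the decreasing step size is what averages the noise out.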
11
But We Don't Have These Targets
12
What about TD(λ) Targets?
13
On-Line Gradient-Descent TD(λ)
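The algorithm box from this slide is not in the transcript. The following is a minimal sketch of on-line gradient-descent TD(λ) with accumulating eligibility traces, under several assumptions: a linear approximator with one-hot features, a five-state random walk as the task (reward 1 for exiting on the right, 0 otherwise), and a slowly decreasing step size. The function name and all constants are illustrative.

```python
import random


def td_lambda_random_walk(episodes=10000, lam=0.8, gamma=1.0, seed=0):
    """On-line gradient-descent TD(lambda), linear FA, 5-state random walk."""
    rng = random.Random(seed)
    n = 5                      # non-terminal states 0..4
    theta = [0.5] * n          # one weight per (one-hot) feature

    def features(s):
        return [1.0 if i == s else 0.0 for i in range(n)]

    def value(s):
        return sum(w * x for w, x in zip(theta, features(s)))

    for ep in range(episodes):
        alpha = 0.1 / (1.0 + ep / 100.0)   # decreasing step size
        trace = [0.0] * n                  # eligibility trace per weight
        s = n // 2                         # start in the middle state
        while True:
            s_next = s + rng.choice((-1, 1))
            terminal = s_next < 0 or s_next >= n
            reward = 1.0 if s_next >= n else 0.0
            v_next = 0.0 if terminal else value(s_next)
            delta = reward + gamma * v_next - value(s)   # TD error
            grad = features(s)   # gradient of a linear V is the feature vector
            trace = [gamma * lam * e + g for e, g in zip(trace, grad)]
            for i in range(n):
                theta[i] += alpha * delta * trace[i]
            if terminal:
                break
            s = s_next
    return theta


weights = td_lambda_random_walk()
print(weights)
```

For this particular task the true values are 1/6, 2/6, ..., 5/6, so the learned weights can be checked against them.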
14
Linear Methods
15
Nice Properties of Linear FA Methods

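The gradient is very simple: for a linear approximator
it is just the feature vector of the state.
For MSE, the error surface is a simple quadratic
surface with a single minimum.
Linear gradient-descent TD(λ) converges if:
Step size decreases appropriately
On-line sampling (states sampled from the
on-policy distribution)
Converges to a parameter vector $\vec\theta_\infty$ with the
property
$\mathrm{MSE}(\vec\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\,\mathrm{MSE}(\vec\theta^*)$
where $\vec\theta^*$ is the best parameter vector
(Tsitsiklis & Van Roy, 1997)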
16
Coarse Coding
17
Learning and Coarse Coding
18
Tile Coding

Binary feature for each tile
Number of features present at any one time is
constant
Binary features means weighted sum easy to
compute
Easy to compute indices of the features present
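These properties can be seen in a small sketch of tile coding for a single scalar input. The layout is an assumption, not the slide's: uniformly offset tilings over [lo, hi], one extra tile per tiling to cover the offset, and the hypothetical function names `active_tiles` and `tile_value`.

```python
def active_tiles(x, lo=0.0, hi=1.0, num_tilings=4, tiles_per_tiling=8):
    """Return one active feature index per tiling for a scalar input x."""
    width = (hi - lo) / tiles_per_tiling
    indices = []
    for k in range(num_tilings):
        offset = k * width / num_tilings        # shift tiling k by a fraction of a tile
        tile = int((x - lo + offset) / width)   # tile index inside tiling k
        tile = min(tile, tiles_per_tiling)      # the offset needs one extra tile
        indices.append(k * (tiles_per_tiling + 1) + tile)
    return indices


def tile_value(theta, x):
    """With binary features the weighted sum is the sum of the active weights."""
    return sum(theta[i] for i in active_tiles(x))


theta = [1.0] * (4 * 9)       # 4 tilings, 9 tiles each in this layout
print(active_tiles(0.3))
print(tile_value(theta, 0.5))
```

Exactly one tile is active per tiling, so the number of active features equals the number of tilings regardless of the input.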

19
Tile Coding Cont.
Irregular tilings
Hashing
CMAC: Cerebellar Model Arithmetic Computer (Albus, 1971)
20
Radial Basis Functions (RBFs)
e.g., Gaussians
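A minimal sketch of Gaussian RBF features for a scalar state, assuming a shared width `sigma` and hand-picked `centers` (both hypothetical choices). Unlike binary tiles, each feature takes a graded value in (0, 1].

```python
import math


def rbf_features(x, centers, sigma):
    """Gaussian radial basis features: exp(-(x - c)^2 / (2 sigma^2))."""
    return [math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]


centers = [0.0, 0.5, 1.0]
phi = rbf_features(0.5, centers, sigma=0.25)
print(phi)  # the middle feature is largest because x sits on its center
```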
21
Can you beat the curse of dimensionality?

Can you keep the number of features from going up
exponentially with the dimension?
Function complexity, not dimensionality, is the
problem.
Kanerva coding
Select a bunch of binary prototypes
Use Hamming distance as the distance measure
Dimensionality is no longer a problem, only
complexity
Lazy learning schemes
Remember all the data
To get new value, find nearest neighbors and
interpolate
e.g., locally-weighted regression
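The Kanerva coding idea above can be sketched as follows. The slide does not say how prototypes are chosen or when a prototype counts as active, so random prototypes and the activation threshold `radius` are assumptions, and all function names are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def make_prototypes(num_prototypes, num_bits, seed=0):
    """Select a bunch of random binary prototypes."""
    rng = random.Random(seed)
    return [
        tuple(rng.randint(0, 1) for _ in range(num_bits))
        for _ in range(num_prototypes)
    ]


def kanerva_features(state_bits, prototypes, radius):
    """Binary feature per prototype: 1 if within `radius` bits of the state."""
    return [1 if hamming(state_bits, p) <= radius else 0 for p in prototypes]


prototypes = make_prototypes(num_prototypes=50, num_bits=16)
feats = kanerva_features(prototypes[0], prototypes, radius=4)
print(sum(feats), "of", len(feats), "prototypes active")
```

The number of features is set by the number of prototypes, not by the number of bits in the state description.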

22
Control with Function Approximation

Learning state-action values
The general gradient-descent rule
Gradient-descent Sarsa(λ) (backward view)
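The equations for these two items are missing from the transcript. A reconstruction in the standard form, assuming $v_t$ denotes the backup target for the pair $(s_t, a_t)$ and $\vec e_t$ the eligibility-trace vector:

```latex
\begin{aligned}
\text{General rule:}\quad
\vec\theta_{t+1} &= \vec\theta_t
  + \alpha\,\bigl[v_t - Q_t(s_t, a_t)\bigr]\,
    \nabla_{\vec\theta_t} Q_t(s_t, a_t) \\[4pt]
\text{Sarsa}(\lambda)\text{:}\quad
\vec\theta_{t+1} &= \vec\theta_t + \alpha\,\delta_t\,\vec e_t \\
\delta_t &= r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \\
\vec e_t &= \gamma\lambda\,\vec e_{t-1} + \nabla_{\vec\theta_t} Q_t(s_t, a_t)
\end{aligned}
```

These mirror the state-value updates, with $Q_t(s, a)$ in place of $V_t(s)$.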

23
GPI with Linear Gradient-Descent Sarsa(λ)
24
GPI with Linear Gradient-Descent Watkins's Q(λ)
25
Mountain-Car Task
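The slide's task description is not in the transcript. The sketch below is an assumption based on the commonly used textbook formulation of mountain car: position in [-1.2, 0.5], velocity in [-0.07, 0.07], three actions (-1, 0, +1), and a reward of -1 per step until the goal on the right is reached. The function name `mountain_car_step` is hypothetical.

```python
import math


def mountain_car_step(position, velocity, action):
    """One step of the assumed dynamics; action is -1, 0, or +1.

    Returns (position, velocity, reward, done).
    """
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))   # keep velocity in bounds
    position += velocity
    if position < -1.2:                          # hit the left wall: stop there
        position, velocity = -1.2, 0.0
    done = position >= 0.5                       # goal at the right edge
    return position, velocity, -1.0, done


print(mountain_car_step(-0.5, 0.0, 0))
```

The state is two continuous variables, which is what makes the task a natural test bed for tile coding with Sarsa(λ).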
26
Mountain-Car Results
27
Baird's Counterexample
Reward is always zero, so the true value is zero
for all s. The approximate state-value function
is shown in each state. Updating is unstable
from some initial conditions.
28
Baird's Counterexample Cont.
29
Should We Bootstrap?
30
Summary

Generalization
Adapting supervised-learning function
approximation methods
Gradient-descent methods
Linear gradient-descent methods
Radial basis functions
Tile coding
Kanerva coding
Nonlinear gradient-descent methods?
Backpropagation?
Subtleties involving function approximation,
bootstrapping and the on-policy/off-policy
distinction