Introduction to Radial Basis Function Networks

Slides: 138

Provided by: taiwe

1
Introduction to Radial Basis Function Networks


2
Content

Overview
The Models of Function Approximator
The Radial Basis Function Networks
RBFNs for Function Approximation
The Projection Matrix
Learning the Kernels
Bias-Variance Dilemma
The Effective Number of Parameters
Model Selection
Incremental Operations

3
Introduction to Radial Basis Function Networks

Overview

4
Typical Applications of NN

Pattern Classification
Function Approximation
Time-Series Forecasting

5
Function Approximation
Unknown
Approximator
6
Supervised Learning
Unknown Function
Neural Network
7
Neural Networks as Universal Approximators

Feedforward neural networks with a single hidden
layer of sigmoidal units are capable of
approximating uniformly any continuous
multivariate function, to any desired degree of
accuracy.
Hornik, K., Stinchcombe, M., and White, H.
(1989). "Multilayer Feedforward Networks are
Universal Approximators," Neural Networks, 2(5),
359-366.
Like feedforward neural networks with a single
hidden layer of sigmoidal units, it can be shown
that RBF networks are universal approximators.
Park, J. and Sandberg, I. W. (1991). "Universal
Approximation Using Radial-Basis-Function
Networks," Neural Computation, 3(2), 246-257.
Park, J. and Sandberg, I. W. (1993).
"Approximation and Radial-Basis-Function
Networks," Neural Computation, 5(2), 305-316.

8
Statistics vs. Neural Networks
Statistics              Neural Networks
model                   network
estimation              learning
regression              supervised learning
interpolation           generalization
observations            training set
parameters              (synaptic) weights
independent variables   inputs
dependent variables     outputs
ridge regression        weight decay
9
Introduction to Radial Basis Function Networks

The Model of
Function Approximator

10
Linear Models
Weights
Fixed Basis Functions
11
Linear Models
Linearly weighted output
Output Units

Decomposition
Feature Extraction
Transformation

Hidden Units
Inputs
Feature Vectors
12
Linear Models
Can you name some bases?
y
Linearly weighted output
Output Units
w2
w1
wm

Decomposition
Feature Extraction
Transformation

Hidden Units
φ1
φ2
φm
Inputs
Feature Vectors
x1
x2
xn
x
13
Example Linear Models
Are they orthogonal bases?

Polynomial
Fourier Series
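
A minimal sketch of the linear model f(x) = Σ wi φi(x) with the two fixed bases named on this slide. The function names and the ordering of the Fourier terms are illustrative assumptions, not taken from the slides. On the orthogonality question: sines and cosines are orthogonal over a full period, while the plain monomials 1, x, x², ... are not orthogonal on an interval such as [-1, 1].

```python
import math

def polynomial_basis(x, m):
    """Return [1, x, x^2, ..., x^(m-1)] for a scalar input x."""
    return [x ** i for i in range(m)]

def fourier_basis(x, m):
    """Return [1, cos(x), sin(x), cos(2x), sin(2x), ...] truncated to m terms."""
    feats = [1.0]
    k = 1
    while len(feats) < m:
        feats.append(math.cos(k * x))
        if len(feats) < m:
            feats.append(math.sin(k * x))
        k += 1
    return feats

def linear_model(x, weights, basis):
    """Linearly weighted output: f(x) = sum_i w_i * phi_i(x)."""
    feats = basis(x, len(weights))
    return sum(w * phi for w, phi in zip(weights, feats))

# f(x) = 1 + 2x + 3x^2 evaluated at x = 2
y_poly = linear_model(2.0, [1.0, 2.0, 3.0], polynomial_basis)
# f(x) = 0.5 + cos(x) evaluated at x = 0
y_four = linear_model(0.0, [0.5, 1.0, 0.0], fourier_basis)
```

Only the weights are free parameters here; the basis functions stay fixed, which is what keeps the model linear.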

14
Single-Layer Perceptrons as Universal
Approximators
With sufficient number of sigmoidal units, it can
be a universal approximator.
Hidden Units
15
Radial Basis Function Networks as Universal
Approximators
With sufficient number of radial-basis-function
units, it can also be a universal approximator.
Hidden Units
16
Non-Linear Models
Weights
Adjusted by the Learning process
17
Introduction to Radial Basis Function Networks

The Radial Basis Function Networks

18
Radial Basis Functions
Three parameters for a radial function
φi(x) = φ(‖x − xi‖)

Center: xi
Distance Measure: r = ‖x − xi‖
Shape: φ
19
Typical Radial Functions

Gaussian
Hardy Multiquadric
Inverse Multiquadric
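
The formulas for these functions did not survive in the transcript. The sketch below uses commonly quoted forms, with r the distance to the center; the exact normalization on the original slide (for example a division by c) is unknown, so the parameter names `sigma` and `c` and the scaling are assumptions.

```python
import math

def gaussian(r, sigma=1.0):
    """Gaussian: exp(-r^2 / (2 sigma^2)); largest at r = 0, decays with distance."""
    return math.exp(-r * r / (2.0 * sigma * sigma))

def multiquadric(r, c=1.0):
    """Hardy multiquadric: sqrt(r^2 + c^2); grows with distance."""
    return math.sqrt(r * r + c * c)

def inverse_multiquadric(r, c=1.0):
    """Inverse multiquadric: 1 / sqrt(r^2 + c^2); decays with distance."""
    return 1.0 / math.sqrt(r * r + c * c)

def radial_unit(x, center, phi, **shape):
    """Evaluate phi at the Euclidean distance between x and the center."""
    r = math.dist(x, center)
    return phi(r, **shape)

value_at_center = radial_unit([1.0, 2.0], [1.0, 2.0], gaussian, sigma=0.5)
value_far = radial_unit([4.0, 6.0], [1.0, 2.0], gaussian, sigma=0.5)
```

The Gaussian and the inverse multiquadric respond most strongly near their center, while the multiquadric is a non-local function whose response increases with distance.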

20
Gaussian Basis Function (σ = 0.5, 1.0, 1.5)
21
Inverse Multiquadric
[Figure: inverse multiquadric curves for c = 1, 2, 3, 4, 5]
22
Most General RBF
Basis {φi : i = 1, 2, …} is "near" orthogonal.
23
Properties of RBFs

On-Center, Off Surround
Analogies with localized receptive fields found
in several biological structures, e.g.,
visual cortex
ganglion cells

24
The Topology of RBF
As a function approximator
Output Units
Interpolation
Hidden Units
Projection
Feature Vectors
Inputs
25
The Topology of RBF
As a pattern classifier.
Output Units
Classes
Hidden Units
Subclasses
Feature Vectors
Inputs
26
Introduction to Radial Basis Function Networks

RBFNs for
Function Approximation

27
The idea
y
x
28
The idea
y
x
29
The idea
y
x
30
The idea
y
x
31
The idea
y
x
32
Radial Basis Function Networks as Universal
Approximators
Training set
Goal
for all k
33
Learn the Optimal Weight Vector
Training set
Goal
for all k
34
Regularization
Training set
If regularization is unneeded, set λ = 0.
Goal
for all k
35
Learn the Optimal Weight Vector
Minimize
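
The expression being minimized was lost in extraction. A plausible reconstruction, assuming the regularized sum-squared-error cost used in Orr's tutorial (cited in the references), with p training pairs, m basis functions, and one penalty λi per weight (all symbol names here are assumptions):

```latex
C(\mathbf{w}) \;=\; \sum_{k=1}^{p} \bigl(y_k - f(\mathbf{x}_k)\bigr)^2 \;+\; \sum_{i=1}^{m} \lambda_i w_i^2,
\qquad
f(\mathbf{x}) \;=\; \sum_{i=1}^{m} w_i\,\varphi_i(\mathbf{x}).
```

Setting the gradient with respect to w to zero gives a linear system; with Φ the p × m matrix of basis outputs, Φ_{ki} = φi(xk), and Λ = diag(λ1, …, λm):

```latex
\hat{\mathbf{w}} \;=\; \bigl(\Phi^{\top}\Phi + \Lambda\bigr)^{-1}\Phi^{\top}\mathbf{y}.
```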
36
Learn the Optimal Weight Vector
Define
37
Learn the Optimal Weight Vector
Define
38
Learn the Optimal Weight Vector
39
Learn the Optimal Weight Vector
Design Matrix
Variance Matrix
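
A minimal sketch of the weight computation, assuming Gaussian units with a shared width, a single ridge parameter, and the closed form w = A⁻¹ Φᵀ y with A = ΦᵀΦ + Λ. Here Φ plays the role of the design matrix and A⁻¹ the variance matrix, following the naming in Orr's tutorial; the data, centers, and parameter values are made up for illustration.

```python
import numpy as np

def design_matrix(X, centers, sigma):
    """Phi[k, i] = exp(-||x_k - c_i||^2 / (2 sigma^2)) for Gaussian units."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def fit_weights(Phi, y, lam):
    """Return (w, A_inv) with A = Phi^T Phi + diag(lam) and w = A^{-1} Phi^T y."""
    m = Phi.shape[1]
    # lam may be a scalar (one shared penalty) or a length-m vector
    A = Phi.T @ Phi + np.diag(np.broadcast_to(lam, (m,)))
    A_inv = np.linalg.inv(A)
    return A_inv @ Phi.T @ y, A_inv

# Illustrative data: a noisy sine sampled at 40 points
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 40)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(40)

centers = np.linspace(0.0, 1.0, 8)[:, None]   # 8 hidden units
Phi = design_matrix(X, centers, sigma=0.15)
w, A_inv = fit_weights(Phi, y, lam=1e-3)
y_hat = Phi @ w                               # network output on the training set
```

The explicit inverse is kept because A⁻¹ is reused by the model selection criteria later in the deck; when only the weights are needed, `np.linalg.solve` is the more stable choice.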
40
Summary
Training set
41
Introduction to Radial Basis Function Networks

The Projection Matrix

42
The Empirical-Error Vector
43
The Empirical-Error Vector
Error Vector
44
Sum-Squared-Error
If λ = 0, the RBFN's learning algorithm is to
minimize SSE (MSE).
Error Vector
45
The Projection Matrix
Error Vector
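
A sketch of the projection matrix, assuming the form P = I − Φ A⁻¹ Φᵀ with A = ΦᵀΦ + λI from Orr's tutorial. It checks numerically that the error vector equals P y and that the sum-squared-error equals yᵀP²y. The random design matrix is a stand-in, not data from the slides.

```python
import numpy as np

def projection_matrix(Phi, lam):
    """P = I - Phi (Phi^T Phi + lam I)^{-1} Phi^T."""
    p, m = Phi.shape
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
    return np.eye(p) - Phi @ A_inv @ Phi.T

rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 4))   # stand-in design matrix (p = 20, m = 4)
y = rng.standard_normal(20)

P = projection_matrix(Phi, lam=0.5)
w = np.linalg.solve(Phi.T @ Phi + 0.5 * np.eye(4), Phi.T @ y)

error_direct = y - Phi @ w           # e = y - f, computed from the fitted weights
error_via_P = P @ y                  # e = P y, computed without the weights
sse = float(y @ P @ P @ y)           # SSE = y^T P^2 y

# Without regularization P is a true (idempotent) projection
P0 = projection_matrix(Phi, lam=0.0)
```

With λ > 0 the matrix P is no longer idempotent, which is why the SSE is written with P² rather than P.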
46
Introduction to Radial Basis Function Networks

Learning the Kernels

47
RBFNs as Universal Approximators
Training set
Kernels
48
What to Learn?

Weights wij's
Centers μj's of φj's
Widths σj's of φj's
Number of φj's → Model Selection

49
One-Stage Learning
50
One-Stage Learning
The simultaneous updates of all three sets of
parameters may be suitable for non-stationary
environments or on-line setting.
51
Two-Stage Training
Step 1

Determines
Centers μj's of φj's.
Widths σj's of φj's.
Number of φj's.

Step 2
Determines wij's.
E.g., using batch-learning.

52
Train the Kernels
53
Unsupervised Training
54
Methods

Subset Selection
Random Subset Selection
Forward Selection
Backward Elimination
Clustering Algorithms
K-means
LVQ
Mixture Models
GMM

55
Subset Selection
56
Random Subset Selection

Randomly choosing a subset of points from
training set
Sensitive to the initially chosen points.
Using some adaptive techniques to tune
Centers
Widths
Number of points
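
A small sketch of random subset selection: centers are drawn without replacement from the training inputs. The function name and the toy data are illustrative assumptions.

```python
import numpy as np

def random_centers(X, m, seed=None):
    """Pick m distinct training points to serve as RBF centers."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx]

X = np.arange(20.0).reshape(10, 2)           # 10 toy training points in 2-D
centers_a = random_centers(X, m=3, seed=0)
centers_b = random_centers(X, m=3, seed=1)   # a different draw, hence the sensitivity
```

Because the resulting network depends on which points happen to be drawn, a later tuning pass over centers and widths is commonly applied.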

57
Clustering Algorithms
Partition the data points into K clusters.
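
A minimal K-means sketch for placing the centers, assuming plain Lloyd iterations with random initialization. Setting each width to the root-mean-square distance of a cluster's members from its center is one simple heuristic, included here as an assumption rather than the method of the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:              # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):  # assignments are stable: converged
            break
        centers = new_centers
    return centers, labels

def cluster_widths(X, centers, labels):
    """RMS distance of each cluster's members from its center (a heuristic)."""
    widths = np.zeros(len(centers))
    for j in range(len(centers)):
        members = X[labels == j]
        if len(members) > 0:
            widths[j] = np.sqrt(((members - centers[j]) ** 2).sum(axis=1).mean())
    return widths

# Two well-separated 1-D blobs around 0 and 5
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.1, 30), rng.normal(5.0, 0.1, 30)])[:, None]
centers, labels = kmeans(X, k=2)
widths = cluster_widths(X, centers, labels)
```

K-means only guarantees a local optimum, so the partition can depend on the initial centers.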
58
Clustering Algorithms
Is such a partition satisfactory?
59
Clustering Algorithms
How about this?
60
Clustering Algorithms
[Figure: data partitioned into four clusters, labeled 1 to 4]
61
Introduction to Radial Basis Function Networks

Bias-Variance Dilemma

62
Goal Revisit

Ultimate Goal: Generalization

Minimize Prediction Error

Goal of Our Learning Procedure

Minimize Empirical Error
63
Badness of Fit

Underfitting
A model (e.g., network) that is not sufficiently
complex can fail to detect fully the signal in a
complicated data set, leading to underfitting.
Produces excessive bias in the outputs.
Overfitting
A model (e.g., network) that is too complex may
fit the noise, not just the signal, leading to
overfitting.
Produces excessive variance in the outputs.

64
Underfitting/Overfitting Avoidance

Model selection
Jittering
Early stopping
Weight decay
Regularization
Ridge Regression
Bayesian learning
Combining networks

65
Best Way to Avoid Overfitting

Use lots of training data, e.g.,
30 times as many training cases as there are
weights in the network.
for noise-free data, 5 times as many training
cases as weights may be sufficient.
Don't arbitrarily reduce the number of weights
for fear of underfitting.

66
Badness of Fit
Underfit
Overfit
67
Badness of Fit
Underfit
Overfit
68
Bias-Variance Dilemma
However, it's not really a dilemma.
Underfit
Overfit
Large bias
Small bias
Small variance
Large variance
69
Bias-Variance Dilemma

More on overfitting
Easily leads to predictions that are far beyond
the range of the training data.
Can produce wild predictions in multilayer
perceptrons even with noise-free data.

70
Bias-Variance Dilemma
However, it's not really a dilemma.
71
Bias-Variance Dilemma
The mean of the bias?
The variance of the bias?
The true model
bias
bias
bias
E.g., depends on the hidden nodes used.
72
Bias-Variance Dilemma
The mean of the bias?
The variance of the bias?
Variance
The true model
E.g., depends on the hidden nodes used.
73
Model Selection
Reduce the effective number of parameters.
Reduce the number of hidden nodes.
Variance
The true model
E.g., depends on the hidden nodes used.
74
Bias-Variance Dilemma
Goal
The true model
E.g., depends on the hidden nodes used.
75
Bias-Variance Dilemma
Goal
Goal
0
constant
76
Bias-Variance Dilemma
0
77
Bias-Variance Dilemma
Goal
bias² + variance: minimize both bias² and variance
noise: cannot be minimized
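
The equations on these slides were lost in extraction. The standard decomposition they appear to build toward is sketched here, assuming y = f(x) + ε with zero-mean noise of variance σ² that is independent of the training set D, and writing f̂ for the model trained on D (the notation is an assumption):

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat{f}(\mathbf{x})\bigr)^2\right]
= \underbrace{\sigma^2}_{\text{noise}}
+ \underbrace{\bigl(\mathbb{E}_D[\hat{f}(\mathbf{x})] - f(\mathbf{x})\bigr)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\bigl(\hat{f}(\mathbf{x}) - \mathbb{E}_D[\hat{f}(\mathbf{x})]\bigr)^2\right]}_{\text{variance}}
```

The cross terms vanish because ε has zero mean and is independent of D, and because f̂(x) − E_D[f̂(x)] has zero mean over D.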
78
Model Complexity vs. Bias-Variance
Goal
bias²
variance
noise
Model Complexity (Capacity)
79
Bias-Variance Dilemma
Goal
bias²
variance
noise
80
Example (Polynomial Fits)
81
Example (Polynomial Fits)
82
Example (Polynomial Fits)
Degree 1
Degree 5
Degree 10
Degree 15
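
The plots for this example are not in the transcript. The sketch below reproduces the idea under assumed settings (a sine target, 20 noisy training points, Chebyshev least-squares fits for numerical stability); none of these choices come from the slides.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(0)

def target(x):
    """Assumed 'true model' for this illustration."""
    return np.sin(np.pi * x)

x_train = np.linspace(-1.0, 1.0, 20)
y_train = target(x_train) + 0.2 * rng.standard_normal(x_train.size)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = target(x_test)               # noise-free values for measuring fit quality

train_mse, test_mse = {}, {}
for degree in (1, 5, 10, 15):
    fit = Chebyshev.fit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse[degree] = float(np.mean((fit(x_train) - y_train) ** 2))
    test_mse[degree] = float(np.mean((fit(x_test) - y_test) ** 2))
```

The training error can only go down as the degree grows, because each model contains the previous one. The error against the true function typically falls at first (less bias) and then rises again for the highest degrees (more variance).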
83
Introduction to Radial Basis Function Networks

The Effective Number of Parameters

84
Variance Estimation
Mean
Variance
85
Variance Estimation
Mean
Variance
Loses 1 degree of freedom
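
A short worked example of the two variance estimates, using made-up samples. Estimating the mean from the same data uses up one degree of freedom, so the unbiased estimate divides by p − 1 instead of p.

```python
import statistics

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
p = len(samples)
mean = sum(samples) / p
sum_sq_dev = sum((s - mean) ** 2 for s in samples)

biased_var = sum_sq_dev / p           # divides by p (treats the mean as known)
unbiased_var = sum_sq_dev / (p - 1)   # divides by p - 1 (mean estimated from the data)

# The standard library provides both estimators
stdlib_biased = statistics.pvariance(samples)
stdlib_unbiased = statistics.variance(samples)
```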
86
Simple Linear Regression
87
Simple Linear Regression
Minimize
88
Mean Squared Error (MSE)
Minimize
Loses 2 degrees of freedom
89
Variance Estimation
Loses m degrees of freedom
m parameters of the model
90
The Number of Parameters
m
degrees of freedom
91
The Effective Number of Parameters (γ)
The projection Matrix
92
The Effective Number of Parameters (γ)
Facts
Pf)
The projection Matrix
93
Regularization
The effective number of parameters
Penalize models with large weights
SSE
94
Regularization
The effective number of parameters
Without penalty (λi = 0), there are m degrees of
freedom to minimize SSE (Cost). The effective
number of parameters γ = m.
Penalize models with large weights
SSE
95
Regularization
The effective number of parameters
With penalty (λi > 0), the liberty to minimize SSE
will be reduced. The effective number of
parameters γ < m.
Penalize models with large weights
SSE
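
A numerical check of the two statements above, assuming the definition γ = p − trace(P) from Orr's tutorial and, for simplicity, one shared penalty λ instead of a separate λi per weight. The random design matrix is a stand-in.

```python
import numpy as np

def effective_parameters(Phi, lam):
    """gamma = p - trace(P), with P = I - Phi (Phi^T Phi + lam I)^{-1} Phi^T."""
    p, m = Phi.shape
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
    P = np.eye(p) - Phi @ A_inv @ Phi.T
    return p - np.trace(P)

rng = np.random.default_rng(2)
Phi = rng.standard_normal((30, 5))    # p = 30 patterns, m = 5 basis functions

gamma_no_penalty = effective_parameters(Phi, lam=0.0)    # equals m for full-rank Phi
gamma_penalized = effective_parameters(Phi, lam=10.0)    # strictly between 0 and m
```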
96
Variance Estimation
The effective number of parameters
Loses γ degrees of freedom
97
Variance Estimation
The effective number of parameters
98
Introduction to Radial Basis Function Networks

Model Selection

99
Model Selection

Goal
Choose the fittest model
Criteria
Least prediction error
Main Tools (Estimate Model Fitness)
Cross validation
Projection matrix
Methods
Weight decay (Ridge regression)
Pruning and Growing RBFNs

100
Empirical Error vs. Model Fitness

Ultimate Goal: Generalization

Minimize Prediction Error

Goal of Our Learning Procedure

Minimize Empirical Error
(MSE)
Minimize Prediction Error
101
Estimating Prediction Error

When you have plenty of data use independent test
sets
E.g., use the same training set to train
different models, and choose the best model by
comparing on the test set.
When data is scarce, use
Cross-Validation
Bootstrap

102
Cross Validation

Simplest and most widely used method for
estimating prediction error.
Partition the original set in several different
ways and compute an average score over the
different partitions, e.g.,
K-fold Cross-Validation
Leave-One-Out Cross-Validation
Generalized Cross-Validation

103
K-Fold CV

Split the set, say, D of available input-output
patterns into k mutually exclusive subsets, say
D1, D2, …, Dk.
Train and test the learning algorithm k times,
each time it is trained on D\Di and tested on Di.
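
The procedure can be sketched as follows. The `fit` and `predict` callables are placeholders for any learning algorithm; the straight-line model in the usage part is only there to make the example runnable.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k mutually exclusive index sets D_1..D_k."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def k_fold_mse(X, y, k, fit, predict):
    """Train on D \\ D_i, test on D_i, and average the k test errors."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        residual = predict(model, X[test_idx]) - y[test_idx]
        scores.append(np.mean(residual ** 2))
    return float(np.mean(scores))

# Usage with an ordinary least-squares line as the learner
def fit_line(X, y):
    return np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]

def predict_line(coef, X):
    return np.c_[np.ones(len(X)), X] @ coef

X = np.linspace(0.0, 1.0, 25)[:, None]
y = 3.0 * X[:, 0] + 1.0               # noise-free line, so the CV error is near zero
folds = k_fold_indices(len(X), 5)
cv_mse = k_fold_mse(X, y, k=5, fit=fit_line, predict=predict_line)
```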

104
Leave-One-Out CV
A special case of k-fold CV.

Split the p available input-output patterns into
a training set of size p − 1 and a test set of size
1.
Average the squared error on the left-out pattern
over the p possible ways of partitioning.

105
Error Variance Predicted by LOO
A special case of k-fold CV.
The estimate for the variance of prediction error
using LOO
Error-square for the left-out element.
106
Error Variance Predicted by LOO
A special case of k-fold CV.
Given a model, the function with least empirical
error for Di.
As an index of the model's fitness. We want to find a
model that also minimizes this.
The estimate for the variance of prediction error
using LOO
Error-square for the left-out element.
107
Error Variance Predicted by LOO
A special case of k-fold CV.
Are there any efficient ways?
How to estimate?
The estimate for the variance of prediction error
using LOO
Error-square for the left-out element.
108
Error Variance Predicted by LOO
Error-square for the left-out element.
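
The formula on these slides is missing from the transcript. For a linear model with a fixed ridge parameter, the leave-one-out residual of pattern i can be obtained from a single fit as (P y)i / Pii, which gives the closed form (1/p) yᵀ P (diag P)⁻² P y quoted in Orr's tutorial. The sketch below compares that shortcut with p explicit refits; the data are random stand-ins.

```python
import numpy as np

def ridge_projection(Phi, lam):
    """P = I - Phi (Phi^T Phi + lam I)^{-1} Phi^T."""
    p, m = Phi.shape
    return np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T

def loo_closed_form(Phi, y, lam):
    """LOO error variance from one fit: mean of ((P y)_i / P_ii)^2."""
    P = ridge_projection(Phi, lam)
    loo_residuals = (P @ y) / np.diag(P)
    return float(np.mean(loo_residuals ** 2))

def loo_brute_force(Phi, y, lam):
    """LOO error variance by refitting p times, leaving one pattern out each time."""
    p, m = Phi.shape
    sq_errors = []
    for i in range(p):
        keep = np.arange(p) != i
        A = Phi[keep].T @ Phi[keep] + lam * np.eye(m)
        w = np.linalg.solve(A, Phi[keep].T @ y[keep])
        sq_errors.append((y[i] - Phi[i] @ w) ** 2)
    return float(np.mean(sq_errors))

rng = np.random.default_rng(4)
Phi = rng.standard_normal((15, 3))
y = rng.standard_normal(15)
closed = loo_closed_form(Phi, y, lam=0.3)
brute = loo_brute_force(Phi, y, lam=0.3)
```

The two values agree up to rounding, which is the efficient route the slides ask about: no retraining is needed once P is available.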
109
Generalized Cross-Validation
110
More Criteria Based on CV
GCV (Generalized CV)
Akaike's Information Criterion
UEV (Unbiased estimate of variance)
FPE (Final Prediction Error)
BIC (Bayesian Information Criterion)
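
The formulas for these criteria are not in the transcript. The sketch below uses the forms given in Orr's tutorial as best recalled, written in terms of SSE = yᵀP²y, the number of patterns p, and the effective number of parameters γ = p − trace(P); treat the exact expressions as assumptions to be checked against that reference. Akaike's criterion is omitted because its form in the slides is unknown.

```python
import numpy as np

def selection_criteria(Phi, y, lam):
    """Estimates of prediction error variance for a ridge-regularized linear model."""
    p, m = Phi.shape
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
    P = np.eye(p) - Phi @ A_inv @ Phi.T
    sse = float(y @ P @ P @ y)        # y^T P^2 y
    gamma = p - np.trace(P)           # effective number of parameters
    return {
        "UEV": sse / (p - gamma),
        "FPE": (p + gamma) * sse / (p * (p - gamma)),
        "GCV": p * sse / (p - gamma) ** 2,
        "BIC": (p + (np.log(p) - 1.0) * gamma) * sse / (p * (p - gamma)),
    }

rng = np.random.default_rng(5)
Phi = rng.standard_normal((25, 4))    # stand-in design matrix
y = rng.standard_normal(25)
criteria = selection_criteria(Phi, y, lam=0.1)
```

All four rescale the training SSE by a factor that grows with γ, so a model is charged for the parameters it effectively uses.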
111
More Criteria Based on CV
112
More Criteria Based on CV
113
Regularization
Standard Ridge Regression,
Penalize models with large weights
SSE
114
Regularization
Standard Ridge Regression,
Penalize models with large weights
SSE
115
Solution Review
Used to compute model selection criteria
116
Example
Width of RBF r = 0.5
117
Example
Width of RBF r = 0.5
118
Example
Width of RBF r = 0.5
How to determine the optimal regularization
parameter effectively?
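
One simple baseline, which is not the re-estimation formula of the slides, is to evaluate GCV on a grid of candidate values and keep the minimizer. The sketch assumes GCV = p · yᵀP²y / trace(P)², Gaussian units, and made-up data and grid values.

```python
import numpy as np

def gcv_score(Phi, y, lam):
    """GCV = p * y^T P^2 y / trace(P)^2 for ridge parameter lam."""
    p, m = Phi.shape
    P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T
    return p * float(y @ P @ P @ y) / np.trace(P) ** 2

def best_lambda(Phi, y, candidates):
    """Return the candidate with the lowest GCV score, plus all scores."""
    scores = [gcv_score(Phi, y, lam) for lam in candidates]
    return candidates[int(np.argmin(scores))], scores

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + 0.2 * rng.standard_normal(50)
centers = np.linspace(0.0, 1.0, 10)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * 0.2 ** 2))

candidates = np.logspace(-6, 2, 9)    # 1e-6, 1e-5, ..., 1e2
lam_best, scores = best_lambda(Phi, y, candidates)
```

A grid search costs one matrix inversion per candidate and only resolves λ to the grid spacing, which is the motivation for an iterative update.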
119
Optimizing the Regularization Parameter
Re-Estimation Formula
120
Local Ridge Regression
Re-Estimation Formula
121
Example
Width of RBF
122
Example
Width of RBF
123
Example
Width of RBF
There are two local minima.
Using the above re-estimation formula, it will be
stuck at the nearest local minimum.
That is, the solution depends on the initial
setting.
124
Example
Width of RBF
There are two local minima.
125
Example
Width of RBF
There are two local minima.
126
Example
Width of RBF
RMSE: Root Mean Squared Error. In real cases, it is
not available.
127
Example
Width of RBF
RMSE: Root Mean Squared Error. In real cases, it is
not available.
128
Local Ridge Regression
Standard Ridge Regression
Local Ridge Regression
129
Local Ridge Regression
Standard Ridge Regression
λj → ∞ implies that φj(·) can be removed.
Local Ridge Regression
130
The Solutions
Used to compute model selection criteria
131
Optimizing the Regularization Parameters
Incremental Operation
P: the current projection matrix.
Pj: the projection matrix obtained by removing
φj(·).
132
Optimizing the Regularization Parameters
Solve
Subject to
133
Optimizing the Regularization Parameters
Solve
Subject to
134
Optimizing the Regularization Parameters
Remove φj(·)
Solve
Subject to
135
The Algorithm

Initialize λi's.
e.g., performing standard ridge regression.
Repeat the following until GCV converges
Randomly select j and compute
Perform local ridge regression
If GCV is reduced, remove φj(·)

136
References
Mark J. L. Orr (April 1996), Introduction to
Radial Basis Function Networks,
http://www.anc.ed.ac.uk/mjo/intro/intro.html.
Kohavi, R. (1995), "A study of cross-validation
and bootstrap for accuracy estimation and model
selection," International Joint Conference on
Artificial Intelligence (IJCAI).
137
Introduction to Radial Basis Function Networks