unit 6
Neural Networks and Pattern Recognition
Giansalvo EXIN Cirrincione
Radial basis functions
Exact interpolation
The exact interpolation problem requires every
input vector to be mapped exactly onto the
corresponding target vector.
Problem: given a mapping $\mathbf{x} \in \mathbb{R}^d \rightarrow t \in \mathbb{R}$ and a TS
of N points, find a function h(x) such that
$h(\mathbf{x}^n) = t^n, \qquad n = 1, \dots, N$
generalized linear discriminant function
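A reconstruction of the standard exact-interpolation form referred to above (one kernel per data point; the kernel symbol $\phi$ and the matrix $\Phi$ are the usual notation, not taken from the slide):
$h(\mathbf{x}) = \sum_{n=1}^{N} w_n\, \phi\!\left(\|\mathbf{x} - \mathbf{x}^n\|\right)$
The interpolation conditions then reduce to the square linear system
$\Phi \mathbf{w} = \mathbf{t}, \qquad \Phi_{nn'} = \phi\!\left(\|\mathbf{x}^n - \mathbf{x}^{n'}\|\right),$
so if $\Phi$ is non-singular the weights are $\mathbf{w} = \Phi^{-1}\mathbf{t}$.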
Radial basis functions
Exact interpolation
Many properties of the interpolating function are
relatively insensitive to the precise form of the
non-linear kernel.
thin-plate spline function: $\phi(x) = x^2 \ln x$
multi-quadric function: $\phi(x) = (x^2 + \sigma^2)^{\beta}$ for $\beta = 1/2$
non-linear in the components of x
Radial basis functions
Exact interpolation
Gaussian basis functions with $\sigma = 0.067$ (roughly
twice the spacing of the data points):
highly oscillatory
Radial basis functions
Exact interpolation
RBF
RBFs provide a smooth interpolating function in
which the number of basis functions is determined
by the complexity of the mapping to be
represented rather than by the size of the data
set.
The number M of basis functions is less than N.
The centres of the basis functions are determined
by the training process.
Each basis function has its own width $\sigma_j$, which is
adaptive.
Bias parameters are included in the linear sum
and compensate for the difference between the
average value over the TS of the basis function
activations and the corresponding average value
of the targets.
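A sketch of the resulting network mapping in standard notation (Gaussian kernels assumed, with centres $\boldsymbol{\mu}_j$ and widths $\sigma_j$; $w_{k0}$ is the bias of output $k$):
$y_k(\mathbf{x}) = \sum_{j=1}^{M} w_{kj}\,\phi_j(\mathbf{x}) + w_{k0},
\qquad \phi_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$
The bias can be absorbed into the sum by defining an extra basis function $\phi_0(\mathbf{x}) = 1$.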
RBF
[RBF network architecture diagram: d inputs, M basis functions]
RBF
There is a trade-off between using a smaller
number of basis functions with many adjustable parameters
and a larger number of less flexible functions.
RBF
Suppose that all of the Gaussian basis functions
in the network share a common covariance matrix
$\Sigma$. Show that the mapping represented by such a
network is equivalent to that of a network of
spherical Gaussian basis functions with common
variance $\sigma^2 = 1$, provided the input vector x is
first transformed by an appropriate linear
transformation. Find expressions relating the
transformed input vector and kernel centres to
the corresponding original vectors.
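A sketch of the expected answer, assuming the eigendecomposition $\Sigma = U \Lambda U^{T}$:
$\tilde{\mathbf{x}} = \Lambda^{-1/2} U^{T} \mathbf{x},
\qquad \tilde{\boldsymbol{\mu}}_j = \Lambda^{-1/2} U^{T} \boldsymbol{\mu}_j,$
so that $(\mathbf{x}-\boldsymbol{\mu}_j)^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_j) = \|\tilde{\mathbf{x}}-\tilde{\boldsymbol{\mu}}_j\|^{2}$, i.e. the kernels become spherical Gaussians with unit variance in the transformed space.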
RBF training
decoupling
fast learning
RBF training
RBF training
- M = 5
- centres: random subset of the TS
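A minimal Python sketch of this two-stage training scheme (the function names and the value of sigma are illustrative assumptions; Gaussian kernels, centres taken as a random subset of the TS, second-layer weights found by linear least squares via the pseudo-inverse):

import numpy as np

def rbf_design_matrix(X, centres, sigma):
    # Gaussian basis activations plus a bias column
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([Phi, np.ones((X.shape[0], 1))])

def train_rbf(X, T, M=5, sigma=0.4, seed=0):
    rng = np.random.default_rng(seed)
    # stage 1: unsupervised choice of centres (random subset of the TS)
    centres = X[rng.choice(len(X), size=M, replace=False)]
    # stage 2: supervised linear fit of the second-layer weights
    Phi = rbf_design_matrix(X, centres, sigma)
    W = np.linalg.pinv(Phi) @ T        # least-squares solution
    return centres, sigma, W

def rbf_predict(X, centres, sigma, W):
    return rbf_design_matrix(X, centres, sigma) @ W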
Regularization theory
Set the functional derivative w.r.t. y(x) to zero:
Euler-Lagrange equations.
If P is translationally and rotationally
invariant, G depends only on the distance between
x and x' (radial). The solution is given by an
RBF expansion.
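A reconstruction of the quantities referred to above, in the standard regularization-network notation ($\nu$ is the regularization coefficient, $P$ a differential operator, $\hat{P}$ its adjoint):
$E = \tfrac{1}{2}\sum_{n}\{y(\mathbf{x}^n)-t^n\}^2 + \tfrac{\nu}{2}\int \|P y\|^2\, d\mathbf{x}$
Euler-Lagrange: $\hat{P}P\,y(\mathbf{x}) = \tfrac{1}{\nu}\sum_n \{t^n - y(\mathbf{x}^n)\}\,\delta(\mathbf{x}-\mathbf{x}^n)$
Solution: $y(\mathbf{x}) = \sum_n w_n\, G(\mathbf{x},\mathbf{x}^n)$, where $G$ is the Green's function of $\hat{P}P$, i.e. $\hat{P}P\,G(\mathbf{x},\mathbf{x}') = \delta(\mathbf{x}-\mathbf{x}')$.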
Regularization theory
Integrate over a small region around $\mathbf{x}^n$.
Using the solution in terms of the Green's
functions.
The Green's function is a Gaussian with width
parameter $\sigma$ if the operator P is chosen as
$\nu = 0$ implies RBF exact interpolation.
Regularization theory
- $\sigma = 0.067$
- $\nu = 40$
- RBFs centred on each data point
$\nu = 0$ implies RBF exact interpolation.
homework
Consider the functional derivative of the
regularization functional (w.r.t. y(x)) given by
By using successive integration by parts, and
making use of the identities
show that the operator
is given by
It should be assumed that boundary terms arising
from the integration by parts can be neglected.
Now find the radial Green's function of this
operator as follows. First introduce the
multidimensional Fourier transform of G in the
form
By using the last two formulae and using the
following formula for the Fourier transform of
the Dirac delta function
where d is the dimensionality of x and s, show
that the Fourier transform of the Green's
function is given by
Now substitute this result into the inverse
Fourier transform of G and then show that the
Green's function is given by
Regularization theory
Regularization can also be applied to general
RBFs. Also, regularization terms can be
considered for which the kernels are not
necessarily the Green's functions. For example
penalizes mappings which have large curvatures.
This regularizer leads to second-layer weights
which are found by the solution of
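The weight equations referred to above take, in the standard treatment, a regularized least-squares form; a sketch, assuming a sum-of-squares error with regularization coefficient $\nu$, design matrix $\Phi$, target matrix $T$, and a matrix $M$ built from the regularizer acting on the basis functions:
$(\Phi^{T}\Phi + \nu M)\, W^{T} = \Phi^{T} T$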
Relation to kernel regression
Technique for estimating regression functions
from noisy data, based on methods of kernel
density estimation.
Consider a mapping $\mathbf{x} \rightarrow \mathbf{y}$ and a corresponding
TS; a complete description of the statistical
properties of the generator of the data is given
by the probability density p(x, t) in the joint
input-target space.
Relation to kernel regression
The optimal mapping is given by forming the
regression, or conditional average $\langle t \mid \mathbf{x} \rangle$, of the
target data, conditioned on the input variables.
Relation to kernel regression
Nadaraya-Watson estimator
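A reconstruction of the estimator in its standard form (a generic kernel K with bandwidth h is assumed, since the slide's own kernel choice is not reproduced here):
$y(\mathbf{x}) = \frac{\sum_{n} t^{n}\, K\!\left((\mathbf{x}-\mathbf{x}^{n})/h\right)}{\sum_{n'} K\!\left((\mathbf{x}-\mathbf{x}^{n'})/h\right)}$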
Relation to kernel regression
Extension: replace the kernel estimator with an
adaptive mixture model.
Nadaraya-Watson estimator
RBFs for classification
Model the class distributions by local kernel
functions
RBFs for classification
The outputs represent approximations to the
posterior probabilities
RBF
RBFs for classification
Use a common pool of M basis functions, labelled
by an index j, to represent all of the
class-conditional densities
RBFs for classification
The activations of the basis functions can be
interpreted as the posterior probabilities of the
presence of corresponding features in the input
space, and the weights can similarly be
interpreted as the posterior probabilities of
class membership, given the presence of the
features. The outputs represent the posterior
probabilities of class membership.
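A sketch of the decomposition being described, in the standard mixture-based notation (P(j) are the priors of the basis functions/features):
$P(C_k \mid \mathbf{x}) = \sum_{j} P(C_k \mid j)\, P(j \mid \mathbf{x}) = \sum_{j} w_{kj}\, \phi_j(\mathbf{x})$
with
$\phi_j(\mathbf{x}) = P(j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid j)\,P(j)}{\sum_{j'} p(\mathbf{x} \mid j')\,P(j')},
\qquad w_{kj} = P(C_k \mid j)$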
Comparison with the multilayer perceptron
The hidden unit activation in an MLP is constant
on parallel (d-1)-dimensional hyperplanes.
The hidden unit (RBF) activation in an RBF network is
constant on concentric (d-1)-dimensional
hyperspheres (more generally, hyperellipsoids).
In an MLP a hidden unit has a constant activation
for input vectors which lie on a
hyperplanar surface in input space given by
$\mathbf{w}^T\mathbf{x} + w_0 = \text{const}$, while for a spherical RBF a hidden
unit has constant activation on a hyperspherical
surface defined by $\|\mathbf{x} - \boldsymbol{\mu}\|^2 = \text{const}$. Show that,
for suitable choices of the parameters, these
surfaces coincide if the input vectors are
normalized to unit length. Illustrate this
equivalence geometrically for vectors in a
three-dimensional input space.
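One way to see this (a hint, not the full solution): for $\|\mathbf{x}\| = 1$,
$\|\mathbf{x}-\boldsymbol{\mu}\|^2 = \|\mathbf{x}\|^2 - 2\boldsymbol{\mu}^T\mathbf{x} + \|\boldsymbol{\mu}\|^2 = 1 + \|\boldsymbol{\mu}\|^2 - 2\boldsymbol{\mu}^T\mathbf{x},$
so a surface of constant $\|\mathbf{x}-\boldsymbol{\mu}\|^2$ is also a surface of constant $\boldsymbol{\mu}^T\mathbf{x}$, i.e. a hyperplane with $\mathbf{w} \propto \boldsymbol{\mu}$; on the unit sphere in three dimensions both constraints cut out the same circle.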
Comparison with the multilayer perceptron
The MLP forms a distributed representation in the
space of activation values for the hidden units
(problems: local minima, flat regions).
The RBF network forms a representation in the space of
activation values for the hidden units which is
local w.r.t. the input space.
All of the parameters in an MLP are usually
determined at the same time as part of a single
global supervised training strategy.
An RBF network is trained in two stages (kernels determined
first by unsupervised methods, weights then found
by fast linear supervised methods).
basis function optimization
ignore any target information
The basis function parameters should be chosen to
form a representation of the pdf of the input
data.
$\mu_j$'s as prototypes of the inputs
Problem: if the basis function centres are used
to fill out a compact d-dimensional region of the
input space, then the number of kernel centres
will be an exponential function of d.
Problem: input variables which have significant
variance but play little role in determining the
appropriate output variables (irrelevant inputs).
basis function optimization
y independent of x2
basis function optimization
An MLP can learn to ignore irrelevant inputs and
obtain accurate results with a small number of
hidden units.
basis function optimization
The optimal choice of basis function
parameters for input density estimation need not
be optimal for representing the mapping to the
output variables.
basis function optimization
subsets of data points
basis function centres
set them equal to a random subset of the TS data (use
as initial conditions)
start with all data points as kernel centres and
then selectively remove centres in such a way as
to have minimum disruption to the system
performance
- very fast
- significantly sub-optimal
basis function optimization
K-means (basic ISODATA) clustering algorithm
Suppose there are N data points $\mathbf{x}^n$ in total; it
tries to find a set of K representative vectors
$\boldsymbol{\mu}_j$, where $j = 1, \dots, K$.
It seeks to partition the data points into K
disjoint subsets $S_j$ containing $N_j$ data points in
such a way as to minimize the sum-of-squares
clustering function given by
$J = \sum_{j=1}^{K} \sum_{n \in S_j} \|\mathbf{x}^n - \boldsymbol{\mu}_j\|^2$
where $\boldsymbol{\mu}_j$ is the mean of the data points in set $S_j$
and is given by
$\boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{n \in S_j} \mathbf{x}^n$
basis function optimization
K-means (basic ISODATA) clustering algorithm
batch (Lloyd, 1982)
It begins by assigning the points at random to K
sets and then computing the mean vectors of the
points in each set. Next, each point is
re-assigned to a new set according to which is
the nearest mean vector. The means of the sets
are then recomputed. This procedure is repeated
until there is no further change in the grouping
of the data points.
At each such iteration, the value of J will not
increase.
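A minimal Python sketch of the batch procedure just described (the function name is illustrative; it assumes the data as an (N, d) array and does not handle the corner case of an empty cluster):

import numpy as np

def kmeans_batch(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # begin by assigning the points at random to K sets
    labels = rng.integers(0, K, size=len(X))
    for _ in range(n_iter):
        # compute the mean vector of the points in each set
        mu = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        # re-assign each point to the set with the nearest mean
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no further change in grouping
            break
        labels = new_labels
    return mu, labels

Both steps (recomputing the means and re-assigning the points) can only decrease J or leave it unchanged, which is why the iteration terminates.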
basis function optimization
K-means (basic ISODATA) clustering algorithm
batch (Lloyd, 1982)
basis function optimization
Robbins-Monro procedure for finding the root of a
regression function given by the derivative of J
w.r.t. $\boldsymbol{\mu}_j$
K-means (basic ISODATA) clustering algorithm
sequential (MacQueen, 1967)
The initial centres are randomly chosen from the
data points, and as each data point $\mathbf{x}^n$ is
presented, the nearest $\boldsymbol{\mu}_j$ is updated using
$\Delta\boldsymbol{\mu}_j = \eta\,(\mathbf{x}^n - \boldsymbol{\mu}_j)$
where $\eta$ is the learning rate.
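A short Python sketch of this on-line update (illustrative function name; mu is the (K, d) array of current centres and eta the learning rate):

def kmeans_online_step(mu, x, eta):
    # find the nearest centre and move it a fraction eta towards x
    j = ((mu - x) ** 2).sum(axis=1).argmin()
    mu[j] += eta * (x - mu[j])
    return mu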
Once the centres of the kernels have been found,
the covariance matrices of the kernels can be set
to the covariances of the points assigned to the
corresponding clusters.
homework
Write a numerical implementation of the K-means
clustering algorithm using both the batch and
on-line versions. Illustrate the operation of the
algorithm by generating data sets in two
dimensions from a mixture of Gaussian
distributions, and plotting the data points
together with the trajectories of the estimated
means during the course of the algorithm.
Investigate how the results depend on the value
of K in relation to the number of Gaussian
distributions, and how they depend on the
variances of the distributions in relation to
their separation. Study the performance of the
on-line version of the algorithm for different
values of the learning rate parameter $\eta$ and
compare the algorithm with the batch version.
Gaussian mixture models
The basis functions of the RBF can be regarded as
the components of a mixture density model whose
parameters are to be optimized by ML.
Once the mixture model has been optimized, the
mixing coefficients P(j) can be discarded, and
the basis functions then used in the RBF in which
the second-layer weights are found by supervised
training.
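A sketch of the mixture model and its EM updates, assuming spherical Gaussian components (standard forms; the slide's own parameterization is not shown):
$p(\mathbf{x}) = \sum_{j=1}^{M} P(j)\, p(\mathbf{x} \mid j),
\qquad p(\mathbf{x} \mid j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left(-\frac{\|\mathbf{x}-\boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$
E-step: $P(j \mid \mathbf{x}^n) = \dfrac{P(j)\, p(\mathbf{x}^n \mid j)}{\sum_{j'} P(j')\, p(\mathbf{x}^n \mid j')}$
M-step: $\boldsymbol{\mu}_j^{\text{new}} = \dfrac{\sum_n P(j \mid \mathbf{x}^n)\, \mathbf{x}^n}{\sum_n P(j \mid \mathbf{x}^n)}$, with analogous re-estimation of $\sigma_j$ and $P(j)$.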
K-means as a particular limit of the EM
optimization of a Gaussian mixture model
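A one-line sketch of the connection (assuming components with a common width $\sigma$): as $\sigma \rightarrow 0$ the posterior $P(j \mid \mathbf{x}^n)$ tends to 1 for the nearest centre and 0 for all others, the EM mean update reduces to recomputing the cluster means, and maximizing the likelihood becomes equivalent to minimizing the sum-of-squares clustering function J.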
basis function optimization (supervised training)
The basis function parameters for regression can
be found by treating the kernel centres and
widths, along with the second-layer weights, as
adaptive parameters to be determined by
minimization of an error function.
- sum-of-squares error
- spherical Gaussian basis functions
a non-linear, computationally intensive optimization
problem
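A sketch of the error derivatives used in this joint optimization, assuming a sum-of-squares error and spherical Gaussian basis functions (standard forms):
$E = \tfrac{1}{2}\sum_{n}\sum_{k}\{y_k(\mathbf{x}^n)-t_k^n\}^2$
$\dfrac{\partial E}{\partial \mu_{ji}} = \sum_{n}\sum_{k}\{y_k(\mathbf{x}^n)-t_k^n\}\, w_{kj}\,\phi_j(\mathbf{x}^n)\,\dfrac{x_i^n-\mu_{ji}}{\sigma_j^2}$
$\dfrac{\partial E}{\partial \sigma_j} = \sum_{n}\sum_{k}\{y_k(\mathbf{x}^n)-t_k^n\}\, w_{kj}\,\phi_j(\mathbf{x}^n)\,\dfrac{\|\mathbf{x}^n-\boldsymbol{\mu}_j\|^2}{\sigma_j^3}$
together with the usual $\partial E/\partial w_{kj} = \sum_n \{y_k(\mathbf{x}^n)-t_k^n\}\,\phi_j(\mathbf{x}^n)$.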
basis function optimization (supervised training)
RBF training
well localized kernels
basis function optimization (supervised training)
RBF training
basis function optimization (supervised training)
RBF training
training procedures can be speeded up
significantly by identifying the relevant kernels
and therefore avoiding unnecessary computation
basis function optimization (supervised training)
RBF training
coarse (unsupervised) to fine (supervised)
no guarantee the basis functions will remain localized!
END