Title: WK4
WK4: Radial Basis Function Networks
CS 476 Networks of Neural Computation, WK4
Dr. Stathis Kasderidis, Dept. of Computer Science, University of Crete
Spring Semester, 2009
Contents
- Introduction to Time Series Analysis
- Prediction Problem
- Predicting Time Series with Neural Networks
- Radial Basis Function Network
- Conclusions
Introduction to Time Series Analysis
- There are two major classes of statistical problems:
  - Classification problems (given an input x, decide to which of K known classes it belongs)
  - Regression problems (build a functional relationship between independent and regressed variables; the former are the causes, the latter the effects).
- Regression problems arise from the need for:
  - Explanation
  - Prediction
  - Control
Introduction to Time Series Analysis II
- In a regression problem there are two high-level issues to determine:
  - The nature of the mechanism that generates the data (stochastic or deterministic). This affects which class of models we will use.
  - A modelling procedure.
Introduction to Time Series Analysis III
- A modelling procedure usually includes the following steps:
  - Specification of a model:
    - Whether it describes a function or a probability distribution
    - Whether it is linear or non-linear
    - Whether it is parametric or non-parametric
    - Whether it is a mixture or a single function
    - Whether it includes time explicitly or not
    - Whether it includes memory or not.
Introduction to Time Series Analysis IV
- Preparation of the data:
  - Noise reduction
  - Scaling
  - Appropriate representation for the target problem
  - Transformations
  - De-correlation (cleaning up spatial or temporal correlation structure)
  - Feature extraction
  - Handling missing values
Introduction to Time Series Analysis V
- An estimation procedure (i.e. a framework to estimate the model parameters), for example (a least-squares sketch follows below):
  - Maximum Likelihood estimation
  - Bayesian estimation
  - (Ordinary) Least Squares
- Numerical techniques used in the estimation framework include:
  - Optimisation
  - Integration
  - Graph-theoretic methods
  - etc.
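As a concrete illustration of the estimation step, the sketch below fits a straight line by Ordinary Least Squares with NumPy; the data and parameter names are made up for illustration.

```python
import numpy as np

# Hypothetical data: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([x, np.ones_like(x)])

# Ordinary Least Squares estimate of [slope, intercept].
params, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated slope and intercept:", params)
```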
Introduction to Time Series Analysis VI
- Availability of data:
  - Enough in number
  - Quality
  - Resolution.
- The estimators produced by the framework must be:
  - Unbiased (i.e. they do not systematically differ from the true model in a statistical sense)
  - Consistent (i.e. as the number of data points grows the estimator approaches the true model with probability 1).
Introduction to Time Series Analysis VII
- A model selection procedure (i.e. to select the best model). Factors include:
  - Goodness of fit (i.e. how well the model fits the given data)
  - Generalisation (i.e. how well it approximates the underlying data generation mechanism)
  - Confidence intervals.
Introduction to Time Series Analysis VIII
- Testing a model:
  - Test the model on out-of-sample data
  - Re-iterate the modelling procedure until we produce a model with which we are satisfied
  - Compare different classes of models in order to find the best one
  - Usually we select the simplest class which describes the data well
  - A comparison framework among different classes of models is not always available.
- Neural networks are semi-parametric, non-linear statistical modelling techniques.
The Prediction Problem
- Def: A time series, Xt, is a family of real-valued random variables indexed by t. The index t can take values in ℝ or ℤ.
- When the family of variables is defined at all points in time the series is called continuous, otherwise it is called discrete.
- In practice we always have a discrete series, due to discrete sampling times of a continuous series or due to digitization.
- The length of a series is the time elapsed between the recorded start and finish of the series.
The Prediction Problem II
- Def: A time series, Xt, is called (strictly) stationary if, for any t1, t2, ..., tn ∈ I, any k ∈ I and n = 1, 2, ...,
  P(X(t1), ..., X(tn)) = P(X(t1+k), ..., X(tn+k))
  where P denotes the joint distribution function of the set of random variables which appear as suffices and I is an appropriate indexing set.
- Broadly speaking, a time series is stationary if there is no systematic change in mean, no systematic change in variance, and strictly periodic variations have been removed.
The Prediction Problem III
- In classical time series analysis we decompose a time series into the following components:
  - A trend (a long-term movement)
  - Fluctuations about the trend of greater or lesser regularity
  - A seasonal component
  - A residual (irregular or random effect).
- Typically, probability theory of time series examines stationary series and investigates the residuals for further structure. However, in other cases we may be interested in capturing the trend (i.e. function approximation).
The Prediction Problem IV
- It is assumed that if the residuals do not contain any further structure, then they behave like an IID (independent and identically distributed) process, which is usually assumed to be normal. Such a stochastic process cannot be modelled further, so the analysis of the time series terminates.
- If, on the other hand, the series contains more structure, we re-iterate the analysis until the residuals do not contain any structure.
- Tests for checking the normality of the residuals include:
  - Kolmogorov-Smirnov test (a sketch follows below)
  - BDS test, etc.
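As an illustration of this residual check, the sketch below runs a Kolmogorov-Smirnov normality test on a made-up residual array using SciPy; a small p-value would suggest the residuals still deviate from normality, i.e. there may be structure left to model.

```python
import numpy as np
from scipy import stats

# Hypothetical residuals left after fitting a model.
rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Standardise, then compare against the standard normal distribution.
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
statistic, p_value = stats.kstest(z, "norm")
print(statistic, p_value)
```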
The Prediction Problem V
- If the structure of the series is linear then we fit a linear model such as ARMA or, if the series is non-stationary, an ARIMA model.
- For non-linear structure we use ARCH, GARCH and neural network models. Typically we first fit the linear component with a linear model and then fit the residuals with a non-linear model.
The Prediction Problem VI
- Usually a time series does not have all the desirable statistical properties, so before we start the analysis we transform it in order to achieve better results. Typical transformations include:
  - Stabilising the variance
  - Making seasonal effects additive
  - Making the data normally distributed
  - Filtering (FFT, moving averages, exponential smoothing, low- and high-pass filters, etc.)
  - Differencing (the preferred method for de-trending; we apply differencing until the time series becomes stationary; see the sketch after this list).
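A minimal sketch of the differencing transform with NumPy; the trending series is made up for illustration, and in practice one would test for stationarity after each differencing step.

```python
import numpy as np

# Hypothetical series with a linear trend plus noise.
rng = np.random.default_rng(2)
t = np.arange(200)
series = 0.5 * t + rng.normal(scale=1.0, size=t.shape)

# First-order differencing removes a linear trend;
# np.diff can be applied repeatedly until the series looks stationary.
diff1 = np.diff(series, n=1)
print("variance before and after differencing:", series.var(), diff1.var())
```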
The Prediction Problem VII
- Restating the prediction problem:
  - We want to construct a model, with an appropriate technique, which once estimated can give 'good' forecasts on new data. The new data are commonly some future values of the series.
  - We want the model to predict as accurately as possible the future values of the time series, given as input some previous values of the series.
The Prediction Problem VIII
- There are three main approaches used to model the series prediction problem (a sketch of approach A follows below):
  - A. Assume a functional relationship as a generating mechanism, e.g. X(t+1) = F(Xt), where Xt is an appropriate vector of past values and F is the generating mechanism.
  - B. Assume that the map F has multiple branches. Then the returned output represents the probability of obtaining X(t+1) in any one of the branches of F.
  - C. Divide the input into a set of classes and try to learn the map from input to classes, i.e. a classification problem.
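A minimal sketch of approach A's input construction: past values are arranged into input vectors Xt and next-step targets with a sliding window; the function name and lag length are illustrative.

```python
import numpy as np

def make_windows(series: np.ndarray, p: int):
    """Build inputs Xt = (x[t-p+1], ..., x[t]) and targets x[t+1]."""
    X = np.stack([series[i:i + p] for i in range(len(series) - p)])
    y = series[p:]
    return X, y

# Example usage with a made-up series and lag p = 3.
series = np.arange(10, dtype=float)
X, y = make_windows(series, p=3)
print(X.shape, y.shape)  # (7, 3) (7,)
```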
Time Series Prediction using Neural Networks
- To apply a neural network model to time series prediction we have to make choices on the following issues.
- Preparing the data:
  - Transforming the data (see above)
  - Handling missing values
  - Smoothing the data (if needed)
  - Scaling the data (almost always a good idea! A sketch follows below.)
  - Dimensionality reduction (principal component analysis, factor analysis)
  - De-correlating the data
  - Extracting features (i.e. combinations of variables)
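A minimal sketch of the scaling step (standardisation to zero mean and unit variance); the key assumption made explicit here is that the scaling statistics come from the training portion only, and the data are made up for illustration.

```python
import numpy as np

def standardise(train: np.ndarray, test: np.ndarray):
    """Scale both sets using statistics computed on the training data only."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

# Example usage with made-up feature matrices.
rng = np.random.default_rng(3)
train, test = rng.normal(size=(80, 4)), rng.normal(size=(20, 4))
train_s, test_s = standardise(train, test)
print(train_s.mean(axis=0).round(3), train_s.std(axis=0).round(3))
```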
Time Series Prediction using Neural Networks II
- Representing variables:
  - Continuous or discrete
  - Semantics of the variables (i.e. probabilities, categories, data points, etc.)
  - Distributed or atomic representation
  - Variables with little information content can be harmful to generalisation
  - In Bayesian estimation the method of Automatic Relevance Determination can be used for selecting variables
- Selecting features
- Capturing causal relations
Time Series Prediction using Neural Networks III
- Discovering memory in the generating process (a sketch follows below):
  - Trial and error
  - Partial auto-correlation function (linear)
  - Mutual information function (non-linear)
  - Methods from dynamical systems theory
  - Determination of past values by fitting a model (e.g. linear) and eliminating past values with small contribution based on sensitivity.
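A minimal sketch of using the partial auto-correlation function to suggest how many past values carry linear memory; it assumes the statsmodels package is available, and the AR(2)-like series is made up for illustration.

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

# Hypothetical AR(2)-like series.
rng = np.random.default_rng(6)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

# Lags with clearly non-zero partial auto-correlation are candidates
# for inclusion as past-value inputs to the model.
print(pacf(x, nlags=10))
```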
Time Series Prediction using Neural Networks IV
- Selecting an architecture:
  - Type of training
  - Family of models
  - Transfer function
  - Memory
  - Network topology
  - Other parameters in the network specification.
- Model selection:
  - See discussion in WK3.
Time Series Prediction using Neural Networks V
- Determination of confidence intervals (a percentile-interval sketch follows below):
  - Jackknife method (a linear approximation of the bootstrap)
  - Bootstrap
  - Moving blocks bootstrap
  - Bootstrap t-interval
  - Bootstrap percentile interval
  - Bias-corrected and accelerated bootstrap.
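A minimal sketch of a bootstrap percentile interval for a forecast statistic; here the statistic is simply the mean of some made-up prediction errors, and the 95% level and resample count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
errors = rng.normal(loc=0.1, scale=1.0, size=100)  # hypothetical forecast errors

# Resample with replacement and collect the statistic of interest.
boot_means = np.array([
    rng.choice(errors, size=errors.size, replace=True).mean()
    for _ in range(2000)
])

# 95% percentile interval for the mean error.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)
```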
Time Series Prediction using Neural Networks VI
- Additional literature:
  - Masters T. (1995). Neural, Novel and Hybrid Algorithms for Time Series Prediction. Wiley.
  - Pawitan Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press.
  - Chatfield C. (1989). The Analysis of Time Series: An Introduction, 4th Ed. Chapman and Hall.
  - Harvey A. (1993). Time Series Models. Harvester Wheatsheaf.
  - Efron B., Tibshirani R. (1993). An Introduction to the Bootstrap. Chapman and Hall.
Radial Basis Function Model
- There are only three layers: input, hidden and output. There is only one hidden layer.
Radial Basis Function Model II
- The hidden layer provides a non-linear transformation of the input space to the hidden space, which is usually assumed to be of high enough dimension.
- The output layer combines the activations of the hidden layer in a linear way.
- Note: the RBF model owes its development to ideas of fitting hyper-surfaces to data points in a high-dimensional space.
- In numerical analysis, radial-basis functions were introduced for the solution of real multivariate interpolation problems.
Radial Basis Function Model III
- In the RBF model the hidden units provide a set of functions that constitute an arbitrary basis for the input patterns when they are expanded into the hidden space.
- The inspiration for the RBF model is Cover's theorem (1965) on the separability of patterns:
  - A complex pattern-classification problem cast non-linearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
- This leads us to consider the multivariate interpolation problem in high-dimensional space.
Radial Basis Function Model IV
- Given a set of N different points xi ∈ R^m0, i = 1, 2, ..., N, and a corresponding set of N real numbers di ∈ R^1, i = 1, 2, ..., N, find a function F: R^m0 → R^1 that satisfies the interpolation condition
  F(xi) = di,  i = 1, 2, ..., N
- For strict interpolation the interpolating surface, i.e. F, is constrained to pass through all the data points.
- The radial-basis function (RBF) technique consists of choosing a function F that has the following form:
  F(x) = Σ_{i=1..N} wi φ(||x - xi||)
Radial Basis Function Model V
- where φ(||x - xi||), i = 1, 2, ..., N, is a set of N arbitrary functions, known as radial-basis functions, and ||·|| denotes a norm, which is usually the Euclidean norm. The data points xi ∈ R^m0 are taken to be the centres of the radial-basis functions.
- Assume that d denotes the desired response vector and w the linear weight vector. N is the size of the training set. Let Φ denote the N x N matrix with elements
  Φji = φ(||xj - xi||),  (j, i) = 1, 2, ..., N
- Φ is called the interpolation matrix.
Radial Basis Function Model VI
- Thus, with the above definitions, the interpolation conditions can be written in matrix form as
  Φ w = d
- The solution for the weight vector is (a numerical sketch follows below)
  w = Φ^(-1) d
  assuming that Φ is non-singular. Micchelli's theorem provides assurance that a suitable set of functions creates a non-singular matrix Φ:
  - Let {xi}, i = 1, ..., N, be a set of distinct points in R^m0. Then the N x N interpolation matrix Φ is non-singular.
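A minimal sketch of strict RBF interpolation with a Gaussian basis on made-up 1-D data; a practical implementation would also guard against ill-conditioning of Φ (e.g. with regularisation).

```python
import numpy as np

def gaussian(r, sigma=1.0):
    """Gaussian radial-basis function phi(r) = exp(-r^2 / (2 sigma^2))."""
    return np.exp(-r**2 / (2.0 * sigma**2))

# Hypothetical training points and targets.
x = np.linspace(0.0, 2.0 * np.pi, 10).reshape(-1, 1)
d = np.sin(x).ravel()

# Interpolation matrix Phi[j, i] = phi(||x_j - x_i||); centres are the data points.
r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
Phi = gaussian(r)

# Solve Phi w = d (preferring solve over an explicit inverse).
w = np.linalg.solve(Phi, d)

# Interpolated value at a new point, compared with the true function.
x_new = np.array([[1.0]])
f_new = gaussian(np.linalg.norm(x_new - x, axis=-1)) @ w
print(f_new, np.sin(1.0))
```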
Radial Basis Function Model VII
- Functions covered by Micchelli's theorem include:
  - Multiquadrics: φ(r) = (r^2 + c^2)^(1/2), c > 0, r ∈ R
  - Inverse multiquadrics: φ(r) = 1/(r^2 + c^2)^(1/2), c > 0, r ∈ R
  - Gaussian functions: φ(r) = exp(-r^2 / (2σ^2)), σ > 0, r ∈ R
- All that is required for a non-singular Φ is that the points xi be distinct.
Radial Basis Function Model VIII
- Universal Approximation Theorem for RBF networks:
  - For any continuous input-output mapping function f(x) there is an RBF network with a set of centres {ti}, i = 1, ..., m1, and a common width σ > 0, such that the input-output mapping function F(x) realized by the RBF network is close to f(x) in the Lp norm, p ∈ [1, ∞].
- The RBF network consists of functions F: R^m0 → R represented by (a forward-pass sketch follows below)
  F(x) = Σ_{i=1..m1} wi exp(-||x - ti||^2 / (2σ^2))
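A minimal sketch of the mapping F(x) realized by an RBF network with m1 Gaussian units and a common width; the centres and weights below are arbitrary illustrative values rather than trained parameters.

```python
import numpy as np

def rbf_forward(x, centres, w, sigma):
    """F(x) = sum_i w_i * exp(-||x - t_i||^2 / (2 sigma^2))."""
    dists = np.linalg.norm(x[None, :] - centres, axis=1)
    return np.exp(-dists**2 / (2.0 * sigma**2)) @ w

# Illustrative network: m1 = 4 centres in R^2 and arbitrary output weights.
centres = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.array([0.5, -0.2, 0.3, 0.1])
print(rbf_forward(np.array([0.2, 0.7]), centres, w, sigma=0.8))
```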
Radial Basis Function Model IX
- Results on sample complexity, computational complexity and generalisation performance for RBF networks:
  - The generalisation error converges to zero only if the number of hidden units, m1, increases more slowly than the size N of the training sample.
  - For a given size N of the training sample, the optimum number of hidden units, m1, behaves as
    m1 ∝ N^(1/3)
  - The RBF network exhibits a rate of approximation O(1/m1) that is similar to that of an MLP with sigmoid activation functions.
Radial Basis Function Model X
- Comparison of MLP and RBF networks:
  - An RBF network has a single hidden layer. An MLP has one or more hidden layers.
  - Typically the nodes of an MLP in a hidden or output layer share the same neuronal model. On the other hand, the nodes of an RBF network in the hidden layer play a different role from those in the output layer.
  - The hidden layer of an RBF network is non-linear and the output layer is linear. Typically in an MLP both layers are non-linear.
Radial Basis Function Model XI
- An RBF network computes, as the argument of its activation function, the Euclidean distance between the input vector and the centre of the unit. In an MLP the activation function computes the inner product of the input vector and the weight vector of the node.
- MLPs are global approximators; RBFs are local approximators, due to the localised, decaying Gaussian (or other) function.
Learning Law for Radial Basis Networks
- To develop a learning law for RBF networks we assume that the error function has the following form:
  E = (1/2) Σ_{j=1..N} ej^2
  where N is the size of the training sample used to do the learning, and ej is the error signal defined by
  ej = dj - F(xj) = dj - Σ_{i=1..m1} wi G(||xj - ti||_Ci)
Learning Law for Radial Basis Networks II
- We need to find the free parameters wi, ti and Σi^(-1) (the latter related to the norm-weighting matrix Ci) so as to minimise E. Ci is a norm-weighting matrix, i.e.
  ||x||_Ci^2 = (Ci x)^T (Ci x) = x^T Ci^T Ci x
- We use a weighted norm matrix when the individual elements of x belong to different classes.
- To calculate the update equations we use gradient descent on the instantaneous error function E. We get the following update rules for the free parameters:
Learning Law for Radial Basis Networks III
- Linear weights (output layer):
  wi(n+1) = wi(n) - η1 ∂E(n)/∂wi(n),  i = 1, 2, ..., m1
- Positions of centres (hidden layer):
  ti(n+1) = ti(n) - η2 ∂E(n)/∂ti(n),  i = 1, 2, ..., m1
Learning Law for Radial Basis Networks IV
- Spreads of centres (hidden layer):
  Σi^(-1)(n+1) = Σi^(-1)(n) - η3 ∂E(n)/∂Σi^(-1)(n),  i = 1, 2, ..., m1
- Note that three different learning rates η1, η2, η3 are used in the gradient descent equations (a training sketch follows below).
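A minimal sketch of this gradient-descent training for a Gaussian RBF network, simplified to a common scalar width instead of full norm-weighting matrices; the data, network size and learning rates are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up 1-D regression data.
X = np.linspace(-3.0, 3.0, 60).reshape(-1, 1)
d = np.tanh(X).ravel()

m1, sigma = 5, 1.0                               # hidden units and common width
t = rng.choice(X.ravel(), m1).reshape(-1, 1)     # centres initialised from the data
w = rng.normal(scale=0.1, size=m1)               # linear output weights
eta_w, eta_t = 0.05, 0.01                        # learning rates eta1, eta2

for epoch in range(200):
    for x, target in zip(X, d):
        diff = x - t                                              # (m1, 1)
        g = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))     # hidden activations
        e = target - g @ w                                        # error signal e_j
        # Gradient-descent updates for the linear weights and the centre positions.
        w += eta_w * e * g
        t += eta_t * e * (w * g / sigma**2).reshape(-1, 1) * diff

mse = np.mean((d - np.exp(-np.sum((X[:, None, :] - t[None, :, :])**2, axis=2)
                          / (2 * sigma**2)) @ w)**2)
print("final mean squared error:", mse)
```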
Conclusions
- In time series modelling we seek to extract the maximum possible structure we can find in the series.
- We terminate the analysis of a series when the residuals do not contain any more structure, i.e. they are IID.
- Neural networks can be used as models in time series prediction.
- RBF networks are a second paradigm of multi-layer feedforward networks, alongside MLPs.
- They are inspired by interpolation theory (numerical analysis).
- They can be trained with the gradient descent method, as in the MLP case.