Multilayer Perceptrons

About This Presentation

Title:

Multilayer Perceptrons

Description:

Multilayer Perceptrons CS/CMPE 333 Neural Networks – PowerPoint PPT presentation

Number of Views:272

Avg rating:3.0/5.0

Slides: 49

Provided by: Asim150

Category:

more less

Transcript and Presenter's Notes

Title: Multilayer Perceptrons

1
Multilayer Perceptrons

CS/CMPE 333 Neural Networks

2
Introduction

Multilayer perceptrons (MLPs) are multilayer
feedforward networks with continuously
differentiable nonlinear activation functions
MLPs can be considered as an extension of the
simple perceptron to multiple layers of neurons
(hence the name multilayer perceptron)
MLPs are trained by the popular supervised
learning algorithm known as the error
back-propagation algorithm
MLPs training involves both forward and backward
information flow (hence the name BP for the
algorithm)

3
Historical Note

The idea of having multiple layers of computation
nodes date from the early days of neural networks
(1960s). However, at that time no algorithm was
available to train such networks
The emergence of the BP algorithm in the mid
1980s made it possible to train MLPs, which
opened the way for their widespread use
Since the mid 80s there has been great interest
in neural networks in general and MLPs in
particular with many contributions made to the
theory and applications of neural networks

4
Distinguishing Characteristics of MLPs

Smooth nonlinear activation function
Biological basis
Ensures that the input-output mapping of the
network is nonlinear
The smoothness is essential for the BP algorithm
One or more hidden layers
These layers enhance mapping capability of the
network
Each additional layer can be viewed as adding
more complexity to the feature detection
capability of the network
High degree of connectivity
Highly distributed information processing
Fault tolerance

5
An MLP
6
Information Flow

Function signals input signal that moves forward
from layer to layer
Error signal error signal originates at the
output and propagates back layer-by-layer

7
The Back-Propagation (BP) Algorithm

The BP algorithm is a generalization of the LMS
algorithm. While the LMS algorithm applies
correction to one layer of weights, the BP
algorithm provides a mechanism for applying
correction to multiple layers of weights.
The BP algorithm apportion errors at output layer
to errors at hidden layers. Once the error at a
neuron has been determined the correction is
distributed according to the delta rule (as in
LMS algorithm).

8
Some Notations

Indices i, j, k identifies a neuron, such that
the layer in which neuron k lies follows the
layer in which neuron j and i lie.
wji weight associated with the link connecting
neuron (or source node) i with neuron j
yj(n), dj(n), ej(n), vj(n) actual response,
target response, error, and net activity level of
neuron j when presented with pattern n
xi(n), oi(n) ith element of input vector n, and
network output vector when presented with input
vector x(n)
Neuron j has a bias input equal to -1 with weight
wj0 Tj

9
(No Transcript)
10
BP Algorithm (1)

Output of neuron j at iteration n (nth training
pattern)
ej(n) dj(n) yj(n)
Instantaneous sum of square errors of the network
?(n) ½Sj?C ej2(n)
C set of all neurons in output layer
For N training patterns, the cost function is
?av (1/N) Sn1 n ?(n)
? is a function of the free parameters (weights
and thresholds). The goal is to find the
weights/threshdolds that minimize ?av
Weights are updated on a pattern-by-pattern basis

11
BP Algorithm (2)

Considering neuron j in output layer
vj(n) Si0 p wji(n)yi(n)
And
yj(n) f(vj(n))
Instantaneous error-correction learning
wji(n1) wji(n) ? d?(n)/dwji(n)
Using the chain rule, the gradient can be written
as
d?(n)/dwji(n) d?(n)/dej(n)dej(n)/dyj(n)dy
j(n)/dvj(n)dvj(n)/dwji(n)

12
BP Algorithm (3)

Computing the partial derivatives
d?(n)/dej(n) ej(n)
dej(n)/dyj(n) -1
dyj(n)/dvj(n) fj(vj(n))
dvj(n)/dwji(n) yi(n)
The gradient at iteration n wrt wji
d?(n)/dwji(n) - ej(n) fj(vj(n))yi(n)
The weight change at iteration n
?wji(n) ? d?(n)/dwji(n) ? ej(n)
fj(vj(n))yi(n)
? dj(n)yi(n)
This is known as the delta rule

13
BP Algorithm (4)

Local gradient dj(n)
dj(n) - d?(n)/dej(n)dej(n)/dyj(n)dyj(n)/dv
j(n)
ej(n) fj(vj(n))
Credit-assignment problem
How to computer ej(n) ?
How to penalize and reward neuron j (weights
associated with neuron j) for ej(n) ?
For output layer ?
For hidden layer(s) ?

14
Output Layer

If neuron j lies in the output layer, the desired
response dj(n) is known and the error ej(n) can
be computed
Hence, the local gradient dj(n) can be computed,
and the weights updated using the delta rule (as
given by the equations on the preceding slides)

15
Hidden Layer (1)
16
Hidden Layer (2)

If neuron j lies in a hidden layer, we dont have
a desired response to calculate the error signal
The local gradient has to be computed recursively
by considering all neurons to which neuron j
feeds.
dj(n) - d?(n)/dyj(n)dyj(n)/dvj(n)
- d?(n)/dyj(n)fj(vj(n))
We need to calculate the value of the partial
derivative d?(n)/dyj(n)

17
Hidden Layer (3)

Cost function is
?(n) ½Sk?C ek2(n)
Partial derivatives
d?(n)/dyj(n) Sk ek(n)dek(n)/dvk(n)dvk(n)/dyj
(n)
dek(n)/dvk(n) ddk(n) yk(n)/dvk(n)
ddk(n) fk(vk(n))/dvk(n)
fk(vk(n)
dvk(n)/dyj(n) dSj0 q wkj(n)yj(n)/ dyj(n)
wkj(n)

18
Hidden Layer (4)

Thus, we have
d?(n)/dyj(n) - Sk ek(n) fk(vk(n)) wkj(n)
- Sk dk(n)wkj(n)
And
dj(n) fj(vj(n)) Sk dk(n)wkj(n)

19
BP Algorithm Summary

Delta rule
wji(n1) wji(n) ?wji(n)
where
?wji(n) ?dj(n)yi(n)
And, dj(n) is given by
If neuron j lies in the output layer
dj(n) fj(n)ej(n)
If neuron j lies in a hidden layer
dj(n) fj(vj(n)) Sk dk(n)wkj(n)

20
Sigmoidal Nonlinearity

For the BP algorithm to work, we need the
activation function f to be continuous and
differentiable
For the logistic function
yj(n) fj(vj(n))
1 exp(-vj(n))-1
dyj(n)/ dvj(n) fj(vj(n))
exp(-vj(n)/1 exp(-vj(n))2
yj(n)1 - yj(n)

21
Learning Rate

Learning rate parameter controls the rate and
stability of the convergence of the BP algorithm
Smaller ? -gt slower, smoother convergence
Larger ? -gt faster, unstable (oscillatory)
convergence
Momentum
?wji(n) a?wji(n-1) ?dj(n)yi(n)
a usually positive constant called momentum
term
Improves stability of convergence by adding a
momentum from the previous change in weights
May prevent convergence to a shallow (local)
minimum
This update rule is known as the generalized
delta rule

22
Modes of Training

N number of training examples (input-output
patterns or training set)
One complete presentation of the training set is
called an epoch
It is good practice to randomize the order of
presentation of patterns from one epoch to
another so as to enhance the search in weight
space

23
Pattern Mode of Training

Weights are updated after the presentation of
each pattern
This is the manner in which the BP algorithm was
derived in the preceding slides
Average weight change after N updates
?wji 1/N Sn1 N ?wji(n)
- ?/N Sn1 N d?(n)/dwji(n)
- ?/N Sn1 N ej(n) dej(n)/dwji(n)

24
Batch Mode of Training

Weights are updated after the presentation of all
patterns in the epoch
Average cost function
?av 1/2N Sn1NSj?C ej2(n)
?wji - ?d?av/dwji
- ?/N Sn1 N ej(n)dej(n)/ dwji
Comparison
The weight updates after a complete epoch is
different for both modes
Pattern mode is an estimate for the batch mode
Pattern mode is suitable for on-line
implementation, requires less storage, and
provides better search (because it is stochastic)

25
Stopping Criteria

The BP algorithm is considered to have converged
when Euclidean norm of the gradient vector
reaches a sufficiently small gradient threshold
g(w) lt t
The BP algorithm is considered to have converged
when the absolute rate of change in the average
squared error per epoch is sufficiently small
?av(w(n1)) - ?av(w(n))/?av(w(n-1) lt t
The BP algorithm is terminated at the weight
vector wfinal when g(wfinal) lt t1 , or
?av(wfinal) lt t2
The BP algorithm is stopped when the networks
generalization properties are adequapte

26
Initialization

The initial values assigned to the weights
affects the performance of the network
If prior information is available, then it should
be used to assign appropriate initial values to
the weights
However, prior information is usually not known.
Moreover, even when prior information of the
problem is known, it is not possible to assign
weights since the behavior of a MLP is complex
and not understood completely.
If prior information is not available, the
weights are initialized to uniform random values
within a range 0, 1
Premature saturation

27
Premature Saturation

When the output of a neuron approaches the limits
of the sigmoidal function, little change occurs
in the weight. That is, the neuron is saturated,
and learning and adaptation is hampered.
How to avoid premature saturation?
Initialize weights from uniform distribution
within a small range
Use as few hidden neurons as possible
Premature saturation is least likely when neurons
operature in the linear range (middle of
sigmoidal function

28
The XOR Problem (1)
29
The XOR Problem (2)
30
Hyperbolic Tangent Activation Function

The hyperbolic tangent is an asymmetric sigmoidal
function
Experience has indicated that using an asymmetric
activation function can speed up learning (i.e.
it requires fewer training iterations)
The tangent hyperbolic function varies from -a to
a (as opposed to 0, a for the logistic
function) (or -1 to 1 for a 1)
f(v) a tanh(bv)
a1 exp(-bv)1 exp(-bv)-1
2a1 exp(-bv)-1 a
Suggested values for a and b a 1.716 b 2/3

31
Some Implementation Tips (1)

Normalize the desired (target) responses dj to
lie within the limits of the activation function
If the activation function values range from
-1.716 to 1.716, we can limit the desired
responses to -1, 1
The weights and thresholds should be uniformly
distributed within a small range to prevent
saturation of the neurons
All neurons should desirably learn at the same
rate. To achieve this, the learning-rate
parameter can be set larger for layers further
away from the output layer

32
Some Implementation Tips (2)

The order in which the training examples are
presented to the network should be randomized
(shuffled) from one epoch to another. This
enhances search for a better local minima on the
error surface.
Whenever prior information is available, include
that in the learning process

33
Pattern Classification

Since outputs of MLPs are continuous we need to
define decision rules for classification
In general, classification into m classes
requires m output neurons
What decision rules should be used?
A pattern x is classified to class k, if output
neuron k fires (i.e. its output is greater than
a threshold)
The problem with this decision rule is that it is
unambiguous more than one neurons may fire
A pattern x is classified to class k if the
output of neuron k is greater than all other
neurons
yk gt yj for all j not equal to k

34
Example (1)

Classify between two overlapping
two-dimensional, Gaussian-distributed patterns
Conditional probability density function for the
two classes
f(x C1) 1/2ps12 exp-1/2s12 x µ12
µ1 mean 0 0T and s12 variance 1
f(x C2) 1/2ps22 exp-1/2s22 x µ22
µ2 mean 2 0T and s22 variance 4
x x1 x2T two dimensional input
C1 and C2 class labels

35
Example (2)
36
Example (3)
37
Example (4)

Consider a two-input, four hidden neurons, and
two-output MLP
Decision rule an input x is classified to C1 if
y1 gt y2
The training set is generated from the
probability distribution functions
Using BP algorithm, the network is trained for
minimum mean-square-error
The testing set is generated from the probability
distribution functions
The trained network is tested for correct
classification
For other implementation details, see the Matlab
code

38
Example (5)
39
Experimental Design

Number of hidden layers
Use the minimum number of hidden layers that
gives the best performance (least
mean-square-error or best generalization)
In general, more than 2 hidden layers is rarely
necessary
Number of hidden layer neurons
Use the minimum number of hidden layer neurons
(gt 2) that gives the best performance
Learning-rate and momentum parameters
The parameters that, on average, yield
convergence to a local minimum in least number of
epochs
The parameters that, on average or in worst-case,
yield convergence to the global minimum in least
number of epochs
The parameters that, on average, yield a network
with best generalization

40
Generalization (1)

A neural network is trained to learn the
input-output patterns presented to it by
minimizing an error function (e.g.
mean-square-error)
In other words, the neural network tries to learn
the given input-output mapping or association as
accurately as possible
But can it generalize properly ? And, what is
generalization?
A network is said to generalize well when the
input-output relationship computed by the network
is correct (or nearly so) for input-output
pattern (test data) never used in creating and
training the network
In other words, a network generalizes well when
it learns the input-output mapping of the system
from which the training data is obtained

41
Generalization (2)

Properly fitted data good generalization

42
Generalization (3)

Over fitted data bad generalization

43
Generalization (4)

How to achieve good generalization?
In general, good generalization implies a smooth
nonlinear input-output mapping
Rigorous mathematical criterion presented by
Poggio and Girosi (1990)
In general, the simplest function that maps the
input-output patterns would be smoother
In a neural network
use the simplest architecture possible with as
few hidden neurons as needed for the mapping
use a training set that is consistent with the
complexity of the architecture (i.e. more
patterns for more complex networks)

44
Cross-Validation

The design of a neural network is experimental.
We select the best network parameters based on
a criterion
From statistics, cross-validation provides a
systematic way of experimenting
Randomly partition data into training and testing
samples
Further partition training sample into an
estimation and an evaluation (cross-validation)
sample
Find the best model by training with estimation
sample and validating with evaluation sample
Once the model has been found, train on entire
training sample and test generalization using the
testing sample

45
Universal Approximator

A neural network can be thought of as a mapping
from a p-dimensional Euclidean space to a
q-dimensional Euclidean space. In other words, a
neural network can learn a function s Rp -gt Rq
A multilayer feedforward network with nonlinear,
bounded, monotone-increasing activation functions
is a universal approximator
Universal approximation to learn a continuous
nonlinear mapping to any degree of accuracy
This is an existence theorem. That is, it does
not say anything about practical optimality
(complexity, computation time, etc)

46
MLP and BP Remarks (1)

The BP algorithm is the most popular algorithm
for supervised training of MLPs. It is popular
because
It is simple to compute locally
It performs stochastic gradient descent in weight
space (for pattern-by-pattern mode of training)
The BP algorithm does not have a biological
basis. Many of its operations are not found in
biological neural network. Nevertheless, it is of
great engineering importance.
The hidden layers act as feature
extractors/detectors
An MLP can be trained to learn an identity
mapping (i.e. map a pattern to itself), in which
case the hidden layers act as feature extractors

47
MLP and BP Remarks (2)

A MLP with BP is an universal approximator in the
sense that it can approximate any continuous
multivariate function to any desired degree of
accuracy, provided that sufficiently many hidden
neurons are available
The BP algorithm is a first-order approximation
of the method of steepest descent. Consequently,
it converges slowly.
The BP algorithm is a hill-climbing technique and
therefore suffers from the possibility of getting
trapped in a local minimum of the cost surface
The MLP scales poorly because of full connectivity

48
Accelerating Convergence

Four heuristics
Every adjustable network parameter of the cost
function should have its own individual learning
rate parameter
Every learning rate parameter should be allowed
to vary from one iteration to the next
When the derivative of the cost function with
respect to a synaptic weight has the same
algebraic sign for several consecutive iterations
of the algorithm, the learning rate parameter for
that particular weight should be increased
When the algebraic sign of the derivative of the
cost function with respect to a particular
synaptic weight alternates for several
consecutive iterations of the algorithm, the
learning rate parameter for that weight should be
decreased