Title: Multilayer Perceptron
1. Multilayer Perceptron
2. Limitations of the Single-Layer Perceptron
- The nonlinearity used in the perceptron (the sign function) is not differentiable → cannot be applied to multilayer networks
- Solves only linearly separable cases
- Only simple problems can be solved
- Not all Boolean logic functions can be implemented by a single perceptron
- AND, OR, NAND, and NOR are fine, but not XOR
3. Multilayer Perceptron (MLP)
- Feed-forward network with one or more hidden layers
- The network consists of
- an input layer of source neurons
- hidden layer(s) of computational neurons
- an output layer of computational neurons
- Input signals propagate toward the output nodes
- Can be used for arbitrarily complex function mappings
- all functions of Boolean logic, combinations of linear functions
- Differentiable non-linear activation function with a relatively simple training algorithm: the error back-propagation algorithm
4. Expressive Power of MLP
- Can every decision boundary be implemented by a three-layer network?
- Yes. Any continuous function from input to output can be implemented with a sufficient number of hidden neurons
- Kolmogorov's theorem, Fourier's theorem
5. Expressive Power of MLP
6. Expressive Power of MLP
7. MLP Distinctive Characteristics
- Non-linear activation function
- differentiable
- mostly a sigmoidal function
- nonlinearity prevents reduction to a single-layer perceptron
- One or more layers of hidden neurons
- progressively extract more meaningful features from the input patterns
- High degree of connectivity
- Nonlinearity and the high degree of connectivity make theoretical analysis difficult
- the learning process is hard to visualize
- BP is a landmark in NN: computationally efficient training
8. Error Back-Propagation Algorithm
- Supervised, error-correction learning algorithm based on the delta rule
- Two computations in training
- Forward pass
- computation of the function signal
- the input vector is applied to the input nodes
- its effect propagates through the network layer by layer
- with fixed synaptic weights
- Backward pass
- synaptic weights are adjusted to reduce the error signal
- computation of an estimate of the gradient vector
- gradient of the error surface with respect to the weights
- the error signal propagates backward, layer by layer
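The two passes can be sketched for a single hidden layer with sigmoid units (a minimal NumPy illustration of the idea, not the slides' own code; network sizes and the learning rate are illustrative choices):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_step(x, d, W1, W2, eta=0.1):
    """One pattern-by-pattern BP update for a one-hidden-layer MLP."""
    # Forward pass: function signals computed with fixed weights
    y1 = sigmoid(W1 @ x)        # hidden-layer outputs
    y2 = sigmoid(W2 @ y1)       # network outputs
    # Backward pass: local gradients via the delta rule
    e = d - y2                                  # error signal
    delta2 = e * y2 * (1 - y2)                  # output-layer gradient
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)    # back-propagated gradient
    # Weight adjustments proportional to local gradient x input signal
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
    return W1, W2, 0.5 * float(e @ e)           # instantaneous error energy
```

Repeating `train_step` over all patterns for many epochs drives the total error energy down.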
9. Notation: Three-Layer Back-Propagation Neural Network
[Figure: input signals x_1 … x_i … x_n enter the input layer; hidden-layer neurons j receive weights w_ji, and output-layer neurons k receive weights w_kj, producing outputs z_1 … z_k … z_l. Function signals flow forward through input, hidden, and output layers; error signals flow backward.]
10. Back-Propagation Algorithm
- Error signal for neuron j at iteration n
- Total error energy
- C is the set of output nodes
- Average squared error energy
- average over all training samples
- cost function as a measure of learning performance
- Objective of the learning process
- adjust NN parameters (synaptic weights) to minimize E_av
- Weights are updated pattern by pattern until one epoch is complete
- one epoch = a complete presentation of the entire training set
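The quantities named on this slide have standard forms in Haykin's notation (a reconstruction, since the slide's formulas did not survive extraction):

```latex
e_j(n) = d_j(n) - y_j(n), \qquad
\mathcal{E}(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n), \qquad
\mathcal{E}_{\mathrm{av}} = \frac{1}{N}\sum_{n=1}^{N} \mathcal{E}(n)
```

Here d_j(n) is the desired response, y_j(n) the actual output, and N the number of training patterns.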
11. Notation
12. Back-Propagation Algorithm
- Gradient descent
- For notational simplicity, we drop the time index n
13. BPA Update Rule for j→k (Output Node)
- Gradient
- determines the direction of search in weight space
- Sensitivity
- describes how the overall error changes with the unit's net activation
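In standard BP notation (a reconstruction; the slide's equations are missing), the output-layer sensitivity and weight update combine into the delta rule, where y_j is the output of hidden neuron j and z_k the network output:

```latex
\delta_k = -\frac{\partial E}{\partial \mathrm{net}_k}
         = (d_k - z_k)\, f'(\mathrm{net}_k), \qquad
\Delta w_{kj} = \eta\, \delta_k\, y_j
```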
14. BPA Update Rule for i→j (Hidden Node)
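For a hidden unit the local gradient is obtained by back-propagating the output-layer sensitivities through the weights w_kj (again a standard reconstruction of the missing slide equations):

```latex
\delta_j = f'(\mathrm{net}_j) \sum_{k} \delta_k\, w_{kj}, \qquad
\Delta w_{ji} = \eta\, \delta_j\, x_i
```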
15. BP Summary
- forward pass
- backward pass
- recursively compute the local gradient δ
- from the output layer toward the input layer
- synaptic weight change by the delta rule
16. With Activation Functions
- Sigmoid function
- Hyperbolic tangent function
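Both functions, and the derivatives BP needs, can be written compactly. A small sketch; the slope parameter a for the sigmoid is illustrative, and the tanh constants are LeCun's commonly recommended values, not stated on the slide:

```python
import numpy as np

def sigmoid(v, a=1.0):
    """Logistic sigmoid f(v) = 1 / (1 + exp(-a v))."""
    return 1.0 / (1.0 + np.exp(-a * v))

def sigmoid_prime(v, a=1.0):
    """Derivative via the output: f'(v) = a f(v) (1 - f(v))."""
    y = sigmoid(v, a)
    return a * y * (1.0 - y)

def tanh_act(v, a=1.7159, b=2.0 / 3.0):
    """Hyperbolic tangent f(v) = a tanh(b v)."""
    return a * np.tanh(b * v)

def tanh_prime(v, a=1.7159, b=2.0 / 3.0):
    """Derivative via the output: f'(v) = (b/a) (a - f(v)) (a + f(v))."""
    y = tanh_act(v, a, b)
    return (b / a) * (a - y) * (a + y)
```

Expressing each derivative through the function's own output is what makes the backward pass cheap: the forward-pass activations are reused directly.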
17. Outputs as Probabilities
- Modeling posteriors
- 0-1 target values
- With infinite training data, the outputs will approach probabilities
- The sum of the outputs should be 1
- Exponential activation function
- Normalize the outputs to sum to 1.0
- SOFTMAX: a soft winner-takes-all
- the max value is transformed toward 1, the others toward 0
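The exponential-plus-normalization step can be sketched as follows (the max subtraction is the usual numerical-stability trick, not mentioned on the slide):

```python
import numpy as np

def softmax(net):
    """Exponentiate the net activations and normalize so the outputs
    sum to 1 (soft winner-takes-all over the output units)."""
    z = np.exp(net - np.max(net))  # shift by max for numerical stability
    return z / z.sum()
```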
18. Feature Detection
- Hidden neurons act as feature detectors
- As learning progresses, hidden neurons gradually discover salient features that characterize the training data
- Nonlinear transformation of the input data into feature space
- Close resemblance to Fisher's linear discriminant
19. Approximation of Functions
- Non-linear input-output mapping
- from an m_0-dimensional input space to an m_L-dimensional output space
- What is the minimum number of hidden layers in an MLP that can approximate any continuous mapping?
- Universal Approximation Theorem
- existence of an approximation of an arbitrary continuous function
- a single hidden layer is sufficient for an MLP to compute a uniform ε-approximation to a given training set
- it does not say a single layer is optimum in the sense of training time, ease of implementation, or generalization
- Bound on the approximation error of a single-hidden-layer NN
- the larger the number of hidden nodes, the more accurate the approximation
- the smaller the number of hidden nodes, the more accurate the empirical fit
20. Training Set Size for Generalization
- Generalization
- the input-output mapping is correct for data never seen before
- Overfitting / overtraining
- memorizes the training data, not the essence of the training data
- learns idiosyncrasies and noise
- Occam's razor
- find the simplest function among those that satisfy the given conditions
- Generalization is influenced by
- the size of the training set
- the architecture of the neural network
- Given the architecture, determine the size of the training set for good generalization
- Given the set of training samples, determine the best architecture for good generalization
- VC dimension: the theoretical basis
21. Cross-Validation
- Validate the learned model on a different set to assess generalization performance
- guards against overfitting
- Partition the training set into
- an estimation subset
- a validation subset
- Cross-validation is used for
- best model selection
- determining when to stop training
22. Model Selection
Practical Techniques in Improving BP
- Choosing the MLP with the best number of weights given N training samples
- The issue is to choose r, the fraction held out for validation
- to minimize the classification error of the model trained on the estimation set when tested on the validation set
- Kearns (1996): qualitative properties of the optimum r
- analysis with the VC dimension
- for small-complexity problems (desired response small compared to N), the performance of cross-validation is insensitive to r
- a single fixed r is nearly optimal for a wide range of target functions
- suggests r ≈ 0.2
- 80% of the training set is the estimation set
23. Training Protocol
- Stochastic training
- select samples randomly
- Batch training
- epoch: a single presentation of all training samples
- weights are updated once per epoch
- robust to outliers
- Sequential mode
- synaptic weights are updated after each training sample
- requires less storage
- converges much faster, particularly when the training data is redundant
- risky: less controllable
- random presentation order makes trapping in a local minimum less likely
- Online training when
- training data is abundant
- memory cost is high, storing is impossible
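The difference between batch and sequential updating is simply where the weight update is applied. A schematic sketch, where `grad` is a hypothetical per-sample gradient function (not from the slides):

```python
import numpy as np

def batch_epoch(w, samples, grad, eta=0.1):
    """Accumulate gradients over the whole epoch, then update once."""
    g = sum(grad(w, x, d) for x, d in samples)
    return w - eta * g

def sequential_epoch(w, samples, grad, eta=0.1, rng=None):
    """Update after every sample, optionally in random order
    (less storage; random order helps avoid local-minimum trapping)."""
    order = list(range(len(samples)))
    if rng is not None:
        rng.shuffle(order)
    for i in order:
        x, d = samples[i]
        w = w - eta * grad(w, x, d)
    return w
```

With redundant data, the sequential version often makes useful progress from the first few samples of an epoch, before the batch version has taken a single step.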
24. Practical Techniques in Improving BP
- Selection of the activation function
- Parameters for the sigmoid
- Scaling inputs
- Target values
- Training with noise
- Manufacturing data
- Number of hidden units
- Number of hidden layers
- Initializing weights
- Learning rates
- Momentum
- Weight decay
- Learning with hints
- Stopping training
- Other criterion functions
- Speeding up the learning
25. Selection of Activation Function
Practical Techniques in Improving BP
- If there are good reasons to select a particular activation function, then do so
- mixture of Gaussians → Gaussian activation function
- Properties of a good activation function
- non-linear
- saturates at some max and min value
- continuity and smoothness
- monotonicity is nonessential
- The sigmoid function has all the good properties
- Distributed representation vs. local representation
- whether an input yields activity throughout several hidden units or not
26. Parameters of the Sigmoid
Practical Techniques in Improving BP
- Centered at zero
- Anti-symmetric
- f(-net) = -f(net)
- faster learning
- Overall range and slope are not important
- Avoid the derivative f'(·) becoming zero (saturation)
- network paralysis
27. Scaling Inputs / Target Values
Practical Techniques in Improving BP
- Standardize
- with large scale differences, the error depends mostly on the large-scale features
- shift to zero mean, unit variance
- needs the full data set
- Target values
- if targets sit at the saturated extremes of the output, the output never reaches them during training
- so full training never terminates
- (+1 for the target category, -1 for non-target categories) is suggested
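Standardization over the full training set might look like this minimal sketch; applying the stored mean and deviation to later test data is an assumption of standard practice, not spelled out on the slide:

```python
import numpy as np

def standardize(X):
    """Shift each feature to zero mean and scale to unit variance.

    Requires the full training set; reuse mu and sigma on test data."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X - mu) / sigma, mu, sigma
```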
28. Training with Noise / Manufacturing Data
Practical Techniques in Improving BP
- Training with noise
- generate virtual or surrogate training patterns
- e.g., d-dimensional Gaussian random noise
- variance of the added noise < 1 (e.g., 0.1)
- Manufacturing data
- if we know the source of variation, we can manufacture data
- e.g., rotation for OCR, image processing to simulate bold-face characters
- memory requirement is large
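Generating surrogate patterns with Gaussian noise can be sketched as below; the variance 0.1 follows the slide's suggestion, while the function name and the number of copies are illustrative:

```python
import numpy as np

def add_noise_patterns(X, D, copies=3, var=0.1, rng=None):
    """Create virtual training patterns by jittering (standardized) inputs
    with d-dimensional Gaussian noise; labels carry over unchanged."""
    rng = rng or np.random.default_rng()
    noisy = [X + rng.normal(scale=np.sqrt(var), size=X.shape)
             for _ in range(copies)]
    X_aug = np.vstack([X] + noisy)          # original + noisy copies
    D_aug = np.concatenate([D] * (copies + 1))
    return X_aug, D_aug
```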
29. Number of Hidden Units
Practical Techniques in Improving BP
- The number of hidden units governs the expressive power of the net and the complexity of the decision boundary
- well-separated classes → fewer hidden nodes
- complicated, highly interspersed densities → many hidden nodes
- Heuristic rules of thumb
- more training data yields better results
- (# weights) < (# training samples)
- (# weights) ≈ (# training samples) / 10
- adjust (# weights) in response to the training data
- start with a large number of hidden nodes, then decay, prune weights
30. Number of Hidden Layers
Practical Techniques in Improving BP
- Three, four, or more layers are OK with differentiable activation functions
- but three layers are sufficient
- more layers → more chance of local minima
- Single hidden layer vs. double (multiple) hidden layers
- a single-hidden-layer NN is good for any approximation of a continuous function
- a double-hidden-layer NN may be better in some cases
- Double (multiple) hidden layers
- first hidden layer: local feature detection
- second hidden layer: global feature detection
- Problem-specific reasons for more layers
- each layer learns different aspects
- e.g., the neocognitron: translation, rotation, …
31. Initializing Weights
Practical Techniques in Improving BP
- Do not set weights to zero: no learning takes place
- Select a good seed for fast and uniform learning
- all weights reach their final equilibrium values at about the same time
- For standardized data
- choose randomly from a single distribution
- give positive and negative values equally: -w̃ < w < +w̃
- if w̃ is too small, net activations are small and the model is effectively linear
- if w̃ is too large, hidden units saturate before learning begins
- For a network with d input units
- input weights: uniform in (-1/√d, +1/√d)
- hidden-to-output weights: uniform in (-1/√n_H, +1/√n_H), for n_H hidden units
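The fan-in-based ranges above (the standard Duda, Hart & Stork heuristic, which this slide appears to follow) can be sketched as:

```python
import numpy as np

def init_weights(d, n_hidden, n_out, rng=None):
    """Uniform init in (-1/sqrt(fan_in), +1/sqrt(fan_in)):
    small enough to start near the linear region, large enough that
    all units learn at roughly the same rate."""
    rng = rng or np.random.default_rng()
    w1 = rng.uniform(-1, 1, (n_hidden, d)) / np.sqrt(d)        # input weights
    w2 = rng.uniform(-1, 1, (n_out, n_hidden)) / np.sqrt(n_hidden)  # hidden-to-output
    return w1, w2
```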
32. Momentum Term
Practical Techniques in Improving BP
- Benefit: prevents the learning process from terminating in a shallow local minimum
- α is the momentum constant
- converges if 0 ≤ |α| < 1; typical value 0.9
- if the partial derivative has the same sign on consecutive iterations, the momentum term grows in magnitude: accelerated descent
- opposite signs: the term shrinks, a stabilizing effect
33. Learning Rate η
Practical Techniques in Improving BP
- A smaller learning-rate parameter makes a smoother path in weight space
- increase the rate of learning while avoiding the danger of instability
- First choice: η ≈ 0.1
- The η of the last layer should be assigned a smaller value
- the last layer has a large local gradient (by the limiting effect), so it learns fast
- LeCun's suggestion: the learning rate should be inversely proportional to the square root of the number of synaptic connections (η ∝ m^(-1/2))
- may change during training
34. Heuristics for Acceleration with the Learning-Rate Parameter
- Every adjustable weight should have its own learning-rate parameter
- Learning-rate parameters should be allowed to vary from iteration to iteration
- If the sign of the derivative is the same for several iterations, the learning-rate parameter should be increased
- apply the momentum idea to the learning-rate parameters as well
- If the sign of the derivative alternates for several iterations, the learning-rate parameter should be decreased
35. Weight Decay
Practical Techniques in Improving BP
- Heuristic: keep the weights small
- in order to simplify the network and avoid overfitting
- Start with many weights and decay them during training: simple!
- Small weights are then eliminated
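Weight decay just shrinks every weight a little on each update. A sketch, with an illustrative decay constant:

```python
import numpy as np

def decay_update(w, grad, eta=0.1, decay=1e-4):
    """Gradient step followed by multiplicative shrinkage w <- (1 - decay) w.
    Weights not reinforced by the error gradient drift toward zero
    and can later be pruned."""
    w = w - eta * grad
    return (1.0 - decay) * w
```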
36. Weight Sharing (Tying)
- A set of cells in one layer uses the same incoming weights
- this leads to all cells detecting the same feature, though at different positions in the image (receptive fields)
- Reduces the number of parameters
- better generalization
- Effect: a convolution with a kernel defined by the weights
37. Network Pruning
Practical Techniques in Improving BP
- Minimizing the network improves generalization
- less likely to learn idiosyncrasies or noise
- Network pruning
- eliminate synaptic weights with small magnitude
- Complexity regularization
- trade-off between the reliability of the training data and the goodness of the model
- supervised learning by minimizing the risk function R(w) = E_av(w) + λ E_c(w)
- where E_c(w) is a complexity penalty and λ the regularization parameter
38. Wald Statistics
Practical Techniques in Improving BP
- Estimate the importance of each parameter in the model, then eliminate based on the estimate
- Hessian-based network pruning
- Optimal Brain Surgeon
- Optimal Brain Damage
- identify parameters whose deletion will cause the least increase in E_av
- via a Taylor-series expansion of the error surface
39. Optimal Brain Surgeon
Practical Techniques in Improving BP
- Solve the optimization problem
- Saliency of w_i
- represents the increase in the mean-squared error from deleting w_i
- OBS procedure
- the weight with the smallest saliency is deleted
- computation of the inverse of the Hessian
- updating rule after pruning
- Optimal Brain Damage
- OBS with the assumption that the Hessian matrix is diagonal
- computationally simple
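Hassibi and Stork's saliency and weight-update formulas, reconstructed in their standard form since the slide's equations are missing (e_i is the unit vector selecting weight i):

```latex
S_i = \frac{w_i^2}{2\,[\mathbf{H}^{-1}]_{ii}}, \qquad
\Delta \mathbf{w} = -\frac{w_i}{[\mathbf{H}^{-1}]_{ii}}\,\mathbf{H}^{-1}\mathbf{e}_i
```

Under OBD's diagonal-Hessian assumption the saliency reduces to S_i = H_ii w_i^2 / 2, which needs no matrix inversion.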
40. Hints
Practical Techniques in Improving BP
- Add output units for addressing an ancillary problem
- a different but related problem
- trained on the original classification problem and the ancillary one simultaneously
- after training, the hint units are discarded
- Benefits
- feature selection
- improved hidden-unit representation
41. Stopping Criteria
Practical Techniques in Improving BP
- No well-defined stopping criterion exists
- Terminate when the gradient vector g(w) ≈ 0
- located at a local or global minimum
- Terminate when the error measure is stationary
- Terminate if the NN's generalization performance is adequate
- excessive training leads to poor generalization
- Training progresses from small initial weights
- at the beginning: the network is nearly linear
- as training progresses: non-linearity is picked up
- therefore, early termination of training behaves like weight decay
42. Stopping with a Separate Validation Set
Practical Techniques in Improving BP
- Early-stopping method
- after some training, compute the validation error with fixed synaptic weights
- resume training after computing the validation error
43. Stopping Method
Practical Techniques in Improving BP
- Amari (1996)
- for N < W (fewer samples than weights)
- early stopping improves generalization
- for N < 30W
- overfitting occurs
- example: W = 100, r ≈ 0.07
- 93% for estimation, 7% for validation
- for N > 30W
- the improvement from early stopping is small
- Leave-one-out method for large W
44. NeoCognitron
45. Speeding Up the Learning
- Use 2nd-order analysis
- Hessian matrix
- Newton's method
- Quickprop
- Conjugate gradient descent