Multiple Layer Perceptron - PowerPoint PPT Presentation

Provided by: aiKai

Transcript and Presenter's Notes


1
Multiple Layer Perceptron
  • 2004
  • KAIST

2
Limitations of Single Layer Perceptron
  • The nonlinearity used in the perceptron (sign
    function) is not differentiable, so it cannot be
    applied to multilayer networks
  • Solves only linearly separable cases
  • Only simple problems can be solved
  • Not all Boolean logic functions can be
    implemented by a single perceptron
  • AND, OR, NAND, and NOR are possible, but XOR is not
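
The claim about XOR can be illustrated with a small search. This is a minimal sketch, not a proof: it tries every (w1, w2, b) on a coarse grid (the grid itself is an assumption chosen for illustration) and reports whether a single threshold unit reproduces each truth table.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def perceptron(w1, w2, b, x1, x2):
    """Single threshold unit: 1 if the weighted sum is positive, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def realizable(targets, grid):
    """True if some (w1, w2, b) from the grid reproduces all four targets."""
    for w1, w2, b in product(grid, repeat=3):
        outputs = [perceptron(w1, w2, b, x1, x2) for x1, x2 in INPUTS]
        if outputs == targets:
            return True
    return False

# Hypothetical search grid: -4.0, -3.5, ..., 4.0 for each parameter.
GRID = [i / 2 for i in range(-8, 9)]

TABLES = {
    "AND": [0, 0, 0, 1],
    "OR": [0, 1, 1, 1],
    "NAND": [1, 1, 1, 0],
    "NOR": [1, 0, 0, 0],
    "XOR": [0, 1, 1, 0],
}

for name, targets in TABLES.items():
    print(name, realizable(targets, GRID))
```

Only XOR fails, because no single line separates its positive and negative inputs; a finer grid would not change that.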

3
Multi Layer Perceptron (MLP)
  • Feed-forward network with one or more hidden
    layers
  • The network consists of
  • An input layer of source neurons
  • Hidden layer(s) of computational neurons
  • Output layer of computational neurons
  • Input signals propagate forward toward the output nodes
  • Can be used for arbitrarily complex function
    mapping
  • all functions of Boolean logic, combinations of
    linear functions
  • Differentiable non-linear activation function with a
    relatively simple training algorithm: the error
    back-propagation algorithm
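
As an illustration of the feed-forward structure, the sketch below computes XOR with one hidden layer of two sigmoid units. The weights are hand-picked for this example (they are not from the slides and are not learned): one hidden unit behaves like OR, the other like NAND, and the output unit combines them like AND.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def layer(inputs, weights, biases):
    """One layer of computational neurons: weighted sum plus bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hand-set illustrative weights (assumed values, not trained).
W_HIDDEN = [[20.0, 20.0], [-20.0, -20.0]]  # unit 1 ~ OR, unit 2 ~ NAND
B_HIDDEN = [-10.0, 30.0]
W_OUT = [[20.0, 20.0]]                     # output ~ AND of the hidden units
B_OUT = [-30.0]

def mlp_xor(x1, x2):
    """Forward pass: input layer -> hidden layer -> output layer."""
    hidden = layer([x1, x2], W_HIDDEN, B_HIDDEN)
    return layer(hidden, W_OUT, B_OUT)[0]

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(mlp_xor(x1, x2)))
```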

4
Expressive power of MLP
  • Can every decision be implemented by a three-layer
    network?
  • Yes. Any continuous function from input to output
    can be implemented with a sufficient number of
    hidden neurons
  • Kolmogorov's theorem, Fourier's theorem

5
Expressive power of MLP
6
Expressive power of MLP
7
MLP Distinctive Characteristics
  • Non-linear activation function
  • differentiable
  • Mostly sigmoidal function
  • nonlinearity prevents reduction to a single-layer
    perceptron
  • One or more layers of hidden neurons
  • progressively extracting more meaningful features
    from input patterns
  • High degree of connectivity
  • Nonlinearity and high degree of connectivity
    makes theoretical analysis difficult
  • Learning process is hard to visualize
  • BP is a landmark in NN: computationally efficient
    training

8
Error back-propagation algorithm
  • Supervised, error-correction learning algorithm
    which is based on delta rule
  • Two computations in Training
  • Forward pass
  • computation of function signal
  • input vector is applied to input nodes
  • its effects propagate through the network
    layer-by-layer
  • with fixed synaptic weights
  • backward pass
  • synaptic weights are adjusted to reduce error
    signal
  • computation of an estimate of gradient vector
  • gradient of error surface with respect to the
    weights
  • error signal propagates backward, layer-by-layer
    fashion

9
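Notation: Three-layer back-propagation neural network
[Figure: three-layer network. Input layer: x_1, x_2, ..., x_i, ..., x_n (index i). Hidden layer: units 1, 2, ..., j, ..., m (index j). Output layer: z_1, z_2, ..., z_k, ..., z_l (index k). Weights w_ji connect input i to hidden unit j; weights w_kj connect hidden unit j to output unit k. Input signals flow forward from the input layer to the output layer; error signals flow backward.]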
10
Back-Propagation Algorithm
  • Error signal for neuron j at iteration n
  • Total error energy
  • C is set of the output nodes
  • Average squared error energy
  • average over all training sample
  • cost function as a measure of learning
    performance
  • Objective of Learning process
  • adjust NN parameters (synaptic weights) to
    minimize Eav
  • Weights are updated on a pattern-by-pattern basis
    until one epoch is complete
  • epoch: one complete presentation of the entire
    training set
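
The formulas for this slide did not survive extraction. The sketch below assumes the usual definitions: error signal e_j = d_j - y_j (desired minus actual output), total error energy E = 1/2 * sum of e_j^2 over the output nodes in C, and Eav as the mean of E over all training samples. The function names are illustrative.

```python
def error_signals(desired, actual):
    """e_j = d_j - y_j for every output node j."""
    return [d - y for d, y in zip(desired, actual)]

def error_energy(desired, actual):
    """Total error energy for one pattern: half the sum of squared error signals."""
    return 0.5 * sum(e * e for e in error_signals(desired, actual))

def average_error_energy(desired_set, actual_set):
    """Eav: mean of the per-pattern error energy over the training set."""
    energies = [error_energy(d, y) for d, y in zip(desired_set, actual_set)]
    return sum(energies) / len(energies)

desired_set = [[1.0, 0.0], [0.0, 1.0]]
actual_set = [[0.5, 0.5], [0.0, 1.0]]
print(average_error_energy(desired_set, actual_set))
```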

11
Notation
12
Back Propagation Algorithm
  • Gradient Descent
  • For notational simplicity, we will drop time
    index n

13
BPA update rule for j→k (output node)
  • Gradient
  • determines the direction of search in weight space
  • Sensitivity
  • Describes how the overall error changes with the
    unit's net activation

14
BPA update rule for i→j (hidden node)
  • Sensitivity

15
BP Summary
  • forward pass
  • backward pass
  • recursively compute local gradient δ
  • from output layer toward input layer
  • synaptic weight change by delta rule
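
The summary can be sketched for a network with one hidden layer. This is a minimal illustration under stated assumptions: sigmoid units, no bias terms, squared-error energy, and a single training pattern; the sizes, seed, and learning rate are arbitrary. A finite-difference check compares one back-propagated derivative with a numerical estimate.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_hidden, w_out):
    """Forward pass with fixed weights: hidden outputs y, network outputs z."""
    y = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    z = [sigmoid(sum(w * yj for w, yj in zip(row, y))) for row in w_out]
    return y, z

def energy(x, t, w_hidden, w_out):
    _, z = forward(x, w_hidden, w_out)
    return 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))

def backward(x, t, w_hidden, w_out):
    """Backward pass: local gradients from the output layer toward the input layer."""
    y, z = forward(x, w_hidden, w_out)
    # Output units: delta_k = (t_k - z_k) * f'(net_k), with f' = z * (1 - z).
    delta_out = [(tk - zk) * zk * (1.0 - zk) for tk, zk in zip(t, z)]
    # Hidden units: delta_j = f'(net_j) * sum_k w_kj * delta_k.
    delta_hid = [
        yj * (1.0 - yj) * sum(w_out[k][j] * delta_out[k] for k in range(len(w_out)))
        for j, yj in enumerate(y)
    ]
    # dE/dw = -delta * (input to that weight).
    grad_out = [[-dk * yj for yj in y] for dk in delta_out]
    grad_hid = [[-dj * xi for xi in x] for dj in delta_hid]
    return grad_hid, grad_out

def train_step(x, t, w_hidden, w_out, eta=0.5):
    """Delta rule: move each weight against its error gradient (in place)."""
    grad_hid, grad_out = backward(x, t, w_hidden, w_out)
    for weights, grads in ((w_hidden, grad_hid), (w_out, grad_out)):
        for row, grad_row in zip(weights, grads):
            for i, g in enumerate(grad_row):
                row[i] -= eta * g

def numeric_grad(x, t, w_hidden, w_out, row, col, eps=1e-6):
    """Central finite difference of E with respect to w_out[row][col]."""
    original = w_out[row][col]
    w_out[row][col] = original + eps
    up = energy(x, t, w_hidden, w_out)
    w_out[row][col] = original - eps
    down = energy(x, t, w_hidden, w_out)
    w_out[row][col] = original
    return (up - down) / (2 * eps)

rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
w_out = [[rng.uniform(-1, 1) for _ in range(3)]]
x, t = [0.5, -0.2], [1.0]

analytic = backward(x, t, w_hidden, w_out)[1][0][1]
numeric = numeric_grad(x, t, w_hidden, w_out, 0, 1)
energy_before = energy(x, t, w_hidden, w_out)
for _ in range(100):
    train_step(x, t, w_hidden, w_out)
energy_after = energy(x, t, w_hidden, w_out)
print(analytic, numeric, energy_before, energy_after)
```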

16
With Activation Functions
  • Sigmoid function
  • Hyperbolic tangent function
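
The two activation functions and their derivatives, which back-propagation needs for the local gradient, can be written as below. This sketch uses the plain forms; any scaling constants the original slide may have shown are not recoverable and are not assumed here.

```python
import math

def sigmoid(v):
    """Logistic sigmoid, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    """Derivative expressed through the output: y * (1 - y)."""
    y = sigmoid(v)
    return y * (1.0 - y)

def tanh_prime(v):
    """Derivative of the hyperbolic tangent: 1 - tanh(v)^2."""
    return 1.0 - math.tanh(v) ** 2

print(sigmoid(0.0), sigmoid_prime(0.0), math.tanh(0.0), tanh_prime(0.0))
```

Both derivatives shrink toward zero for large |v|, which is the saturation effect mentioned elsewhere in the slides.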

17
Output as Probabilities
  • Modeling posteriors
  • 0-1 Target value
  • With infinite training data, the outputs approximate
    the posterior probabilities
  • Sum of outputs should be 1
  • Exponential activation function
  • Normalize outputs to sum to 1.0
  • Softmax: a smoothed version of winner-takes-all
  • Winner-takes-all: the maximum value is transformed to 1, the others to 0
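
A minimal sketch of the normalization described above: exponentiate the net activations, divide by their sum, and compare with the hard winner-takes-all rule. Subtracting the maximum before exponentiating is a standard numerical safeguard added here, not something stated on the slide.

```python
import math

def softmax(nets):
    """Exponential activation followed by normalization so the outputs sum to 1."""
    peak = max(nets)  # shift for numerical stability; does not change the result
    exps = [math.exp(v - peak) for v in nets]
    total = sum(exps)
    return [e / total for e in exps]

def winner_takes_all(nets):
    """Hard version: the maximum becomes 1, all others 0 (first maximum wins ties)."""
    winner = nets.index(max(nets))
    return [1.0 if i == winner else 0.0 for i in range(len(nets))]

print(softmax([1.0, 3.0, 2.0]))
print(winner_takes_all([1.0, 3.0, 2.0]))
```

Scaling the net activations up pushes the softmax output toward the winner-takes-all vector.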

18
Feature Detection
  • Hidden neurons act as feature detectors
  • As learning progresses, hidden neurons gradually
    discover salient features that characterize the
    training data
  • Nonlinear transformation of input data to feature
    space
  • Close resemblance to Fisher's linear discriminant

19
Approximation of Functions
  • Non-linear input-output mapping
  • From an m0-dimensional input space to an
    mL-dimensional output space
  • What is the minimum number of hidden layers in an
    MLP that can approximate any continuous
    mapping?
  • Universal Approximation Theorem
  • existence of an approximation of an arbitrary
    continuous function
  • a single hidden layer is sufficient for an MLP to
    compute a uniform ε approximation to a given
    training set
  • it does not say that a single hidden layer is optimal
    in the sense of training time, ease of
    implementation, or generalization
  • Bound on approximation errors of a single-hidden-layer
    NN
  • the larger the number of hidden nodes, the more
    accurate the approximation
  • the smaller the ratio of hidden nodes to training
    samples, the more accurate the empirical fit

20
Training Set Size for Generalization
  • Generalization
  • Input-output mapping is correct for data never
    seen before
  • Overfitting - Overtraining
  • memorize training data, not the essence of the
    training data
  • learns idiosyncrasy and noise
  • Occam's Razor
  • find the simplest function among those which
    satisfy the given conditions
  • Generalization is influenced by
  • size of training set
  • architecture of Neural Network
  • Given architecture, determine the size of
    training set for good generalization
  • Given set of training samples, determine the best
    architecture for good generalization
  • VC dimension - theoretical basis

21
Cross-Validation
  • Validate learned model on different set to assess
    the generalization performance
  • guarding against overfitting
  • Partition Training set into
  • Estimation subset
  • validation subset
  • cross-validation for
  • best model selection
  • determine when to stop training

22
Model selection
Practical Techniques in Improving BP
  • Choosing MLP with the best number of weights with
    given N training samples
  • Issue is to choose r, the fraction of the N training
    samples assigned to the validation subset
  • to minimize the classification error of the model
    trained on the estimation set when it is tested on
    the validation set
  • Kearns (1996): qualitative properties of the optimum
    r
  • Analysis with VC dimension
  • for a small-complexity problem (complexity of the
    desired response is small compared to N), performance
    of cross-validation is insensitive to r
  • a single fixed r is nearly optimal for a wide range of
    target functions
  • suggests r = 0.2
  • 80% of the training set is the estimation set
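
The split with r = 0.2 can be sketched as follows. The function name, the shuffling, and the fixed seed are assumptions made for this illustration.

```python
import random

def split_estimation_validation(samples, r=0.2, seed=0):
    """Shuffle, then assign a fraction r to validation and the rest to estimation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(r * len(shuffled))
    return shuffled[n_validation:], shuffled[:n_validation]

estimation, validation = split_estimation_validation(range(100), r=0.2)
print(len(estimation), len(validation))
```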

23
Training Protocol
  • Stochastic training
  • Select samples randomly
  • Batch training
  • Epoch: a single presentation of all training
    samples
  • Weights are updated once per epoch
  • Robust with outliers
  • Sequential mode
  • for each training sample, synaptic weights are
    updated
  • require less storage
  • converges much faster, particularly when the training
    data are redundant
  • Risky - Less controllable
  • random order makes trapping at local minimum less
    likely
  • Online training when
  • Training Data is abundant
  • Memory cost is high, storing impossible

24
Practical Techniques in Improving BP
  • Selection of activation function
  • Parameters for the sigmoid
  • Scaling input
  • Target values
  • Training with noise
  • Manufacturing data
  • Number of hidden units
  • Number of hidden layers
  • Initializing weights
  • Learning rates
  • Momentum
  • Weight decay
  • Learning with hints
  • Stopping training
  • Other criterion function
  • Speeding up the learning

25
Selection of Activation function
  • If there are good reasons to select a particular
    activation function, then do it
  • Mixture of Gaussians → Gaussian activation
    function
  • Properties of activation function
  • Non-linear
  • Saturates at some maximum and minimum value
  • Continuous and smooth
  • Monotonicity is nonessential
  • Sigmoid function has all the good properties
  • Distributed representation vs local
    representation
  • Whether or not an input yields activity across
    several hidden units

26
Parameters of Sigmoid
  • Centered at zero
  • Anti-symmetric
  • f(-net) = -f(net)
  • Faster learning
  • Overall range and slope are not important
  • Avoid f'(.) becoming zero
  • Network paralysis

27
Scaling Input / Target value
  • Standardize
  • Large scale difference
  • error depends mostly on large scale feature
  • Shifted to Zero mean, unit variance
  • Need full data set
  • Target value
  • If the targets are the saturation values of the output
  • during training, the output never reaches the
    saturated value
  • so full training never terminates
  • (+1 for the target category, -1 for non-target
    categories) is suggested
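
Standardizing each feature to zero mean and unit variance can be sketched as below. The statistics are computed from the full data set, as the slide notes; the function names and the guard for constant features are assumptions of this example.

```python
import statistics

def fit_standardizer(rows):
    """Per-feature mean and population standard deviation over the full data set."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # A constant feature has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Shift each feature to zero mean and scale it to unit variance."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Two features on very different scales.
rows = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
means, stds = fit_standardizer(rows)
scaled = standardize(rows, means, stds)
print(scaled)
```

After scaling, both features contribute to the error on a comparable scale instead of the large-scale feature dominating.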

28
Training with Noise / Manufacturing Data
  • Training with Noise
  • Generate virtual or surrogate training patterns
  • e.g., d-dimensional Gaussian random noise
  • Variance of the added noise < 1 (e.g., 0.1)
  • Manufacturing Data
  • If we know the source of variation, we can
    manufacture data
  • e.g. rotation for OCR, image processing for
    simulation of bold face character
  • Memory requirement is large

29
Number of hidden units
  • (# of hidden units) governs the expressive power of
    the net: the complexity of the decision boundary
  • Well-separated → fewer hidden nodes
  • From complicated density, highly interspersed →
    many hidden nodes
  • Heuristics: rule of thumb
  • More training data yields better results
  • (# of weights) < (# of training data)
  • (# of weights) ≈ (# of training data)/10
  • Adjust (# of weights) in response to the training
    data
  • Start with a large number of hidden nodes, then
    decay and prune weights

30
Number of Hidden Layers
  • Three, four or more layers are OK with a
    differentiable activation function
  • But three layers are sufficient
  • More layers → more chance of local minima
  • Single hidden layer vs double (multiple) hidden
    layers
  • a single-hidden-layer NN suffices to approximate any
    continuous function
  • a double-hidden-layer NN may sometimes work better
  • double (multiple) hidden layers
  • first hidden layer - local feature detection
  • second hidden layer - global feature detection
  • Problem-specific reasons for more layers
  • Each layer learns different aspects
  • e.g., the neocognitron case: translation, rotation,

31
Initializing Weights
  • Do not set the weights to zero: no learning takes place
  • Select a good seed for fast and uniform
    learning
  • All weights reach their final equilibrium values at
    about the same time
  • For standardized data
  • Choose randomly from a single distribution
  • Give positive and negative values equally: -w̃ < w
    < +w̃
  • If w̃ is too small, the net activation is small:
    linear model
  • If w̃ is too large, hidden units will saturate
    before learning begins
  • For d input unit network,
  • Input weights
  • Hidden to output weights
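
The ranges for the two weight groups were lost in extraction. The sketch below assumes a commonly cited textbook heuristic for standardized inputs: input-to-hidden weights drawn uniformly from (-1/sqrt(d), +1/sqrt(d)) and hidden-to-output weights from (-1/sqrt(n_hidden), +1/sqrt(n_hidden)). Treat these bounds, and the function name, as assumptions rather than the slide's own values.

```python
import math
import random

def init_weights(d, n_hidden, n_out, seed=0):
    """Uniform random weights, symmetric about zero, scaled by each layer's fan-in."""
    rng = random.Random(seed)
    bound_in = 1.0 / math.sqrt(d)          # assumed bound for input weights
    bound_out = 1.0 / math.sqrt(n_hidden)  # assumed bound for hidden-to-output weights
    w_ji = [[rng.uniform(-bound_in, bound_in) for _ in range(d)]
            for _ in range(n_hidden)]
    w_kj = [[rng.uniform(-bound_out, bound_out) for _ in range(n_hidden)]
            for _ in range(n_out)]
    return w_ji, w_kj

w_ji, w_kj = init_weights(d=16, n_hidden=4, n_out=3)
print(len(w_ji), len(w_ji[0]), len(w_kj), len(w_kj[0]))
```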

32
Momentum term
  • benefit of preventing the learning process from
    terminating in a shallow local minimum
  • where α is the momentum constant
  • converges if 0 ≤ |α| < 1, typical value 0.9
  • when the partial derivative has the same sign on
    consecutive iterations, the weight change grows in
    magnitude - accelerates descent
  • opposite sign - the weight change shrinks:
    stabilizing effect
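
The update formula itself is missing from the transcript. The sketch below assumes the usual form, delta_w(n) = alpha * delta_w(n-1) - eta * dE/dw, and feeds it two artificial gradient sequences to show the accelerating and stabilizing effects. The values of eta and the gradient sequences are illustrative.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """One weight update with a momentum term; returns the new weight and its change."""
    delta = alpha * prev_delta - eta * grad
    return w + delta, delta

def final_delta(grads, eta=0.1, alpha=0.9):
    """Apply the update over a sequence of gradients and return the last weight change."""
    w, delta = 0.0, 0.0
    for g in grads:
        w, delta = momentum_step(w, g, delta, eta, alpha)
    return delta

same_sign = final_delta([1.0] * 200)                       # gradient keeps its sign
alternating = final_delta([(-1.0) ** n for n in range(200)])  # gradient flips each step
print(same_sign, alternating)
```

With a constant gradient the step grows toward eta / (1 - alpha) times the gradient, ten times the plain step here; with an alternating gradient the step stays below the plain step of 0.1.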

33
Learning Rate η
  • Smaller learning-rate parameter makes smoother
    path
  • increase the rate of learning yet avoid the danger of
    instability
  • First choice: η ≈ 0.1
  • η of the last layer should be assigned a smaller value
  • the last layer has a larger local gradient (by limiting
    effect) and learns fast
  • LeCun's suggestion: the learning rate is inversely
    proportional to the square root of the number of
    synaptic connections (m^(-1/2))
  • May change during training

34
Heuristics of Acceleration with learning rate
parameter
  • Each adjustable weight should have its own learning
    rate parameter
  • Learning rate parameters should be allowed to
    vary from iteration to iteration
  • If the sign of the derivative is the same for several
    iterations, the learning rate parameter should be
    increased
  • Apply the momentum idea even to learning rate
    parameters
  • If the sign of the derivative alternates for
    several iterations, the learning rate parameter should
    be decreased
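
These heuristics can be sketched for a single weight. The rule below is in the spirit of the delta-bar-delta idea: a smoothed average of past derivatives (the momentum idea applied to the rate) is compared with the current derivative. The constants kappa, beta, and xi are hypothetical values picked for the illustration.

```python
def adapt_rate(eta, grad, bar, kappa=0.01, beta=0.2, xi=0.7):
    """Update one weight's learning rate from the sign agreement of derivatives.

    bar is an exponentially smoothed average of earlier derivatives.
    """
    if bar * grad > 0:        # same sign as the recent trend: increase additively
        eta += kappa
    elif bar * grad < 0:      # sign flipped against the trend: decrease multiplicatively
        eta *= 1.0 - beta
    new_bar = (1.0 - xi) * grad + xi * bar
    return eta, new_bar

def final_rate(grads, eta=0.1):
    """Run the rule over a sequence of derivatives and return the final rate."""
    bar = 0.0
    for g in grads:
        eta, bar = adapt_rate(eta, g, bar)
    return eta

print(final_rate([1.0] * 5))                      # consistent sign
print(final_rate([1.0, -1.0, 1.0, -1.0, 1.0]))    # alternating sign
```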

35
Weight Decay
  • Heuristic: keep the weights small
  • in order to simplify the network and avoid
    overfitting
  • Start with many weights and decay them during
    training: simple!
  • Small weights are eliminated
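
A toy sketch of the idea, assuming the simple decay rule w_new = w_old * (1 - epsilon) applied after every update. One weight receives a made-up gradient step that pulls it toward 1.0 (standing in for a weight the problem needs); the other receives no gradient and decays until it can be eliminated. All constants are illustrative.

```python
def decay(weights, epsilon=0.01):
    """Shrink every weight by the factor (1 - epsilon)."""
    return [w * (1.0 - epsilon) for w in weights]

def prune(weights, threshold=1e-3):
    """Eliminate weights whose magnitude has fallen below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def train_with_decay(steps=1000, eta=0.1, epsilon=0.01):
    w_needed, w_unneeded = 0.5, 0.5
    for _ in range(steps):
        w_needed += eta * (1.0 - w_needed)  # toy error-reducing step toward 1.0
        # w_unneeded gets no gradient: nothing counteracts its decay
        w_needed, w_unneeded = decay([w_needed, w_unneeded], epsilon)
    return [w_needed, w_unneeded]

final = prune(train_with_decay())
print(final)
```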

36
Weight Sharing (tying)
  • A set of cells in one layer uses the same
    incoming weights
  • This leads to all cells detecting the same feature,
    though at different positions in the image
    (receptive fields)
  • Reducing number of parameters
  • Better generalization
  • Effect of convolution with a kernel defined by
    the weights
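
The convolution effect can be sketched in one dimension. Every cell applies the same kernel to its own receptive field, so the layer as a whole slides the kernel over the input (strictly a cross-correlation, since the kernel is not flipped). The signal and kernel values are made up for the example.

```python
def shared_weight_layer(signal, kernel, bias=0.0):
    """Each output cell applies the same weights to a different window of the input."""
    width = len(kernel)
    return [
        sum(w * signal[pos + i] for i, w in enumerate(kernel)) + bias
        for pos in range(len(signal) - width + 1)
    ]

signal = [0.0, 0.0, 1.0, 1.0, 0.0]
edge_kernel = [-1.0, 1.0]  # responds to a rising edge wherever it occurs
outputs = shared_weight_layer(signal, edge_kernel)
print(outputs)

# Parameter count: one shared kernel versus separate weights per cell.
shared_params = len(edge_kernel)
unshared_params = len(outputs) * len(edge_kernel)
print(shared_params, unshared_params)
```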

37
Network Pruning
  • Minimizing network size improves generalization
  • less likely to learn idiosyncrasies or noise
  • Network pruning
  • eliminate synaptic weights with small magnitude
  • Complexity-regularization
  • tradeoff between reliability of training data and
    goodness of the model
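  • supervised learning by minimizing the risk
    function R(w) = Es(w) + λ Ec(w)
  • where Es(w) is the standard performance measure,
    Ec(w) is the complexity penalty, and λ is the
    regularization parameter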

38
Wald Statistics
  • Estimate the importance of parameter in a model,
    then Eliminate based on the estimation
  • Hessian-based Network Pruning
  • Optimal Brain Surgeon
  • Optimal Brain Damage
  • Identify parameters whose deletion will cause the
    least increase in Eav
  • by Taylor series

39
Optimal Brain Surgeon
  • Solve the optimization problem
  • Saliency of wi
  • represents the increase in the mean-squared error
    from deleting wi
  • OBS procedure
  • weights of small saliency will be deleted
  • computation of the inverse of the Hessian
  • updating rule after pruning
  • Optimal Brain Damage
  • OBS with the assumption that the Hessian matrix is
    diagonal
  • Computationally simple
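
The saliency formulas are missing from the transcript. The sketch below assumes the diagonal-Hessian (Optimal Brain Damage) form, saliency_i = 1/2 * h_ii * w_i^2, which follows from the second-order Taylor term when off-diagonal entries are ignored. The weights and Hessian diagonal are made-up numbers.

```python
def obd_saliencies(weights, hessian_diag):
    """Estimated increase in error from deleting each weight (diagonal Hessian)."""
    return [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]

def least_salient(weights, hessian_diag):
    """Index of the weight whose deletion is estimated to cost the least."""
    saliencies = obd_saliencies(weights, hessian_diag)
    return min(range(len(saliencies)), key=saliencies.__getitem__)

weights = [0.1, 2.0, -0.5]
hessian_diag = [100.0, 0.01, 1.0]  # hypothetical curvature of the error per weight
print(obd_saliencies(weights, hessian_diag))
print(least_salient(weights, hessian_diag))
```

In this example the largest weight is the one selected for deletion, because the error surface is nearly flat along it; pruning by magnitude alone would have removed the smallest weight instead.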

40
Hints
  • Add output units for addressing an ancillary problem
  • Different but related problem
  • Trained on the original classification problem and
    the ancillary one simultaneously
  • After training, hint units are discarded
  • Benefit
  • feature selection
  • Improve hidden unit representation

41
Stopping Criteria
  • No well-defined stopping criteria
  • Terminate when the gradient vector g(W) = 0
  • located at a local or global minimum
  • Terminate when the error measure is stationary
  • Terminate if the NN's generalization performance is
    adequate
  • Excessive training leads to poor generalization
  • Training progresses from small initial weights
  • At the beginning: nearly linear
  • As training progresses: nonlinearity is picked up
  • Therefore, premature termination of training
    behaves like weight decay

42
Stopping w/ Separate validation set
  • Early stopping method
  • after some training, with fixed synaptic weights
    compute validation error
  • resume training after computing validation error
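
The loop described above can be sketched with caller-supplied functions. The names train_one_epoch, validation_error, and snapshot are placeholders, and the patience rule (stop after a few epochs without improvement) is one common way to decide when to stop, assumed here rather than taken from the slide. The toy validation curve falls and then rises to imitate overfitting.

```python
def early_stopping(train_one_epoch, validation_error, snapshot,
                   max_epochs=100, patience=3):
    """Train, check validation error with fixed weights, keep the best weights seen."""
    best_error, best_weights, best_epoch = float("inf"), None, 0
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()   # weights are held fixed during this check
        if error < best_error:
            best_error, best_weights, best_epoch = error, snapshot(), epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                # validation error has stopped improving
    return best_epoch, best_error, best_weights

# Toy stand-in for a network: the only "weight" is the number of epochs trained.
state = {"epoch": 0}

def train_one_epoch():
    state["epoch"] += 1

def validation_error():
    return (state["epoch"] - 4) ** 2 + 1   # minimum at epoch 4, then rises

def snapshot():
    return dict(state)

best_epoch, best_error, best_weights = early_stopping(
    train_one_epoch, validation_error, snapshot)
print(best_epoch, best_error, best_weights, state["epoch"])
```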

43
Stopping method
  • Amari (1996)
  • for N < W
  • early stopping improves generalization
  • for N < 30W
  • overfitting occurs
  • example (optimum r for large W): W = 100, r = 0.07
  • 93% for estimation, 7% for validation
  • for N > 30W
  • early stopping improvement is small
  • Leave-one-out method

44
NeoCognitron
45
Speeding up the learning
  • Use 2nd-order analysis
  • Hessian Matrix
  • Newton's method
  • Quick-prop
  • Conjugate Gradient Descent