Multilayer Perceptrons - PowerPoint PPT Presentation

Provided by: Asim150

1
Multilayer Perceptrons
  • CS/CMPE 333 Neural Networks

2
Introduction
  • Multilayer perceptrons (MLPs) are multilayer
    feedforward networks with continuously
    differentiable nonlinear activation functions
  • MLPs can be considered as an extension of the
    simple perceptron to multiple layers of neurons
    (hence the name multilayer perceptron)
  • MLPs are trained by the popular supervised
    learning algorithm known as the error
    back-propagation algorithm
  • MLP training involves both forward and backward
    information flow (hence the name back-propagation,
    or BP, for the algorithm)

3
Historical Note
  • The idea of having multiple layers of computation
    nodes dates from the early days of neural networks
    (1960s). However, at that time no algorithm was
    available to train such networks
  • The emergence of the BP algorithm in the
    mid-1980s made it possible to train MLPs, which
    opened the way for their widespread use
  • Since the mid-1980s there has been great interest
    in neural networks in general and MLPs in
    particular, with many contributions made to the
    theory and applications of neural networks

4
Distinguishing Characteristics of MLPs
  • Smooth nonlinear activation function
      • Biological basis
      • Ensures that the input-output mapping of the
        network is nonlinear
      • The smoothness is essential for the BP algorithm
  • One or more hidden layers
      • These layers enhance the mapping capability of
        the network
      • Each additional layer can be viewed as adding
        more complexity to the feature detection
        capability of the network
  • High degree of connectivity
      • Highly distributed information processing
      • Fault tolerance

5
An MLP
6
Information Flow
  • Function signal: an input signal that propagates
    forward through the network, layer by layer
  • Error signal: a signal that originates at the
    output and propagates backward, layer by layer

7
The Back-Propagation (BP) Algorithm
  • The BP algorithm is a generalization of the
    least-mean-square (LMS) algorithm. While the LMS
    algorithm applies corrections to one layer of
    weights, the BP algorithm provides a mechanism
    for applying corrections to multiple layers of
    weights.
  • The BP algorithm apportions the error at the
    output layer to the neurons in the hidden layers.
    Once the error at a neuron has been determined,
    the correction is distributed over its weights
    according to the delta rule (as in the LMS
    algorithm).

8
Some Notations
  • Indices i, j, k identify neurons, such that
    neuron j lies in the layer following that of
    neuron i, and neuron k lies in the layer
    following that of neuron j
  • $w_{ji}$: weight of the link connecting neuron
    (or source node) i to neuron j
  • $y_j(n)$, $d_j(n)$, $e_j(n)$, $v_j(n)$: actual
    response, target (desired) response, error, and
    net activity level of neuron j when presented
    with pattern n
  • $x_i(n)$, $o_i(n)$: ith element of the input
    vector x(n), and ith element of the network
    output vector when presented with x(n)
  • Neuron j has a bias input equal to -1 with weight
    $w_{j0} = T_j$ (its threshold)

10
BP Algorithm (1)
  • Error signal at the output of neuron j at
    iteration n (nth training pattern)
      • $e_j(n) = d_j(n) - y_j(n)$
  • Instantaneous sum of squared errors of the network
      • $\mathcal{E}(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)$
      • C: set of all neurons in the output layer
  • For N training patterns, the cost function is
      • $\mathcal{E}_{av} = \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}(n)$
  • $\mathcal{E}_{av}$ is a function of the free
    parameters (weights and thresholds). The goal is
    to find the weights and thresholds that minimize
    $\mathcal{E}_{av}$
  • Weights are updated on a pattern-by-pattern basis

11
BP Algorithm (2)
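  • Considering neuron j in the output layer
      • $v_j(n) = \sum_{i=0}^{p} w_{ji}(n)\, y_i(n)$
  • And
      • $y_j(n) = f_j(v_j(n))$
  • Instantaneous error-correction learning, with
    learning-rate parameter $\eta$
      • $w_{ji}(n+1) = w_{ji}(n) - \eta\, \partial \mathcal{E}(n) / \partial w_{ji}(n)$
  • Using the chain rule, the gradient can be written
    as
      • $\frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)} = \frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}$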

12
BP Algorithm (3)
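  • Computing the partial derivatives
      • $\partial \mathcal{E}(n) / \partial e_j(n) = e_j(n)$
      • $\partial e_j(n) / \partial y_j(n) = -1$
      • $\partial y_j(n) / \partial v_j(n) = f_j'(v_j(n))$
      • $\partial v_j(n) / \partial w_{ji}(n) = y_i(n)$
  • The gradient at iteration n with respect to $w_{ji}$
      • $\partial \mathcal{E}(n) / \partial w_{ji}(n) = -e_j(n)\, f_j'(v_j(n))\, y_i(n)$
  • The weight change at iteration n
      • $\Delta w_{ji}(n) = -\eta\, \partial \mathcal{E}(n) / \partial w_{ji}(n) = \eta\, e_j(n)\, f_j'(v_j(n))\, y_i(n) = \eta\, \delta_j(n)\, y_i(n)$
  • This is known as the delta rule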

13
BP Algorithm (4)
  • Local gradient $\delta_j(n)$
      • $\delta_j(n) = -\frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n)\, f_j'(v_j(n))$
  • Credit-assignment problem
      • How to compute $e_j(n)$?
      • How to penalize and reward neuron j (the weights
        associated with neuron j) for $e_j(n)$?
      • For the output layer?
      • For the hidden layer(s)?

14
Output Layer
  • If neuron j lies in the output layer, the desired
    response $d_j(n)$ is known and the error $e_j(n)$
    can be computed
  • Hence, the local gradient $\delta_j(n)$ can be
    computed, and the weights updated using the delta
    rule (as given by the equations on the preceding
    slides)
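
The output-layer case can be sketched in a few lines. This is a minimal illustration rather than the course's own code: it assumes a logistic activation (so that f'(v) = y(1 - y)) and the fixed -1 threshold input from the notation slide; the name `update_output_neuron` and all numeric values are made up for the example.

```python
import math

def logistic(v):
    """Logistic activation f(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def update_output_neuron(w, y_in, d, eta):
    """One delta-rule step for a single output neuron j (hypothetical helper).

    w    : weights w_ji, with w[0] the threshold weight
    y_in : signals y_i feeding the neuron, with y_in[0] the fixed -1 input
    d    : desired response d_j
    eta  : learning-rate parameter
    Returns the updated weights and the error measured before the update.
    """
    v = sum(wi * yi for wi, yi in zip(w, y_in))   # net activity level v_j
    y = logistic(v)                               # actual response y_j
    e = d - y                                     # error e_j = d_j - y_j
    delta = e * y * (1.0 - y)                     # local gradient e_j * f'(v_j)
    new_w = [wi + eta * delta * yi for wi, yi in zip(w, y_in)]
    return new_w, e

# Illustrative values only
w = [0.1, 0.4, -0.3]
y_in = [-1.0, 0.5, 0.8]
w1, e0 = update_output_neuron(w, y_in, d=1.0, eta=0.5)
_, e1 = update_output_neuron(w1, y_in, d=1.0, eta=0.5)
```

Presenting the same pattern again after one update gives a smaller error (`abs(e1) < abs(e0)`), which is what a step against the gradient should do for a small enough learning rate.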

15
Hidden Layer (1)
16
Hidden Layer (2)
  • If neuron j lies in a hidden layer, we don't have
    a desired response from which to calculate the
    error signal
  • The local gradient has to be computed recursively
    by considering all neurons to which neuron j
    feeds.
      • $\delta_j(n) = -\frac{\partial \mathcal{E}(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial \mathcal{E}(n)}{\partial y_j(n)}\, f_j'(v_j(n))$
  • We need to calculate the value of the partial
    derivative $\partial \mathcal{E}(n) / \partial y_j(n)$

17
Hidden Layer (3)
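  • Cost function is
      • $\mathcal{E}(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n)$
  • Partial derivatives
      • $\frac{\partial \mathcal{E}(n)}{\partial y_j(n)} = \sum_k e_k(n)\, \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}$
      • $\frac{\partial e_k(n)}{\partial v_k(n)} = \frac{\partial [d_k(n) - y_k(n)]}{\partial v_k(n)} = \frac{\partial [d_k(n) - f_k(v_k(n))]}{\partial v_k(n)} = -f_k'(v_k(n))$
      • $\frac{\partial v_k(n)}{\partial y_j(n)} = \frac{\partial}{\partial y_j(n)} \sum_{j=0}^{q} w_{kj}(n)\, y_j(n) = w_{kj}(n)$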

18
Hidden Layer (4)
  • Thus, we have
      • $\partial \mathcal{E}(n) / \partial y_j(n) = -\sum_k e_k(n)\, f_k'(v_k(n))\, w_{kj}(n) = -\sum_k \delta_k(n)\, w_{kj}(n)$
  • And
      • $\delta_j(n) = f_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)$

19
BP Algorithm Summary
  • Delta rule
      • $w_{ji}(n+1) = w_{ji}(n) + \Delta w_{ji}(n)$
  • where
      • $\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$
  • And $\delta_j(n)$ is given by
      • If neuron j lies in the output layer:
        $\delta_j(n) = f_j'(v_j(n))\, e_j(n)$
      • If neuron j lies in a hidden layer:
        $\delta_j(n) = f_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)$
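
The summary can be turned into a small working sketch of one pattern-mode update. The code below is an illustration under stated assumptions, not the presenter's implementation: it assumes logistic activations, one hidden layer, and a fixed -1 threshold input at index 0 of every layer; the function names and the weight values are hypothetical.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(W_h, W_o, x):
    """Forward pass; each layer receives a fixed -1 input at index 0."""
    y0 = [-1.0] + list(x)
    y_h = [-1.0] + [logistic(sum(w * y for w, y in zip(row, y0))) for row in W_h]
    y_o = [logistic(sum(w * y for w, y in zip(row, y_h))) for row in W_o]
    return y0, y_h, y_o

def backprop_step(W_h, W_o, x, d, eta):
    """One pattern-mode BP update; returns new weights and local gradients."""
    y0, y_h, y_o = forward(W_h, W_o, x)
    # Output layer: delta_k = f'(v_k) * e_k, with f'(v) = y (1 - y)
    delta_o = [y * (1 - y) * (dk - y) for y, dk in zip(y_o, d)]
    # Hidden layer: delta_j = f'(v_j) * sum_k delta_k * w_kj
    delta_h = []
    for j in range(1, len(y_h)):              # index 0 is the fixed -1 input
        back = sum(delta_o[k] * W_o[k][j] for k in range(len(W_o)))
        delta_h.append(y_h[j] * (1 - y_h[j]) * back)
    # Delta rule: w_ji <- w_ji + eta * delta_j * y_i
    new_W_o = [[w + eta * delta_o[k] * y for w, y in zip(W_o[k], y_h)]
               for k in range(len(W_o))]
    new_W_h = [[w + eta * delta_h[j] * y for w, y in zip(W_h[j], y0)]
               for j in range(len(W_h))]
    return new_W_h, new_W_o, delta_h, delta_o

def cost(W_h, W_o, x, d):
    """Instantaneous sum of squared errors, (1/2) * sum_k e_k^2."""
    _, _, y_o = forward(W_h, W_o, x)
    return 0.5 * sum((dk - y) ** 2 for dk, y in zip(d, y_o))

# Illustrative 2-2-1 network; first column of each row is the threshold weight
W_h = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]]
W_o = [[0.3, -0.5, 0.2]]
x, d = [1.0, 0.0], [1.0]
new_W_h, new_W_o, delta_h, delta_o = backprop_step(W_h, W_o, x, d, eta=0.5)
```

A useful sanity check for any hand-written BP code is to compare `-delta_j * y_i` with a finite-difference estimate of the derivative of `cost` with respect to the same weight; the two should agree to several decimal places.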

20
Sigmoidal Nonlinearity
  • For the BP algorithm to work, we need the
    activation function f to be continuous and
    differentiable
  • For the logistic function
      • $y_j(n) = f_j(v_j(n)) = [1 + \exp(-v_j(n))]^{-1}$
      • $\partial y_j(n) / \partial v_j(n) = f_j'(v_j(n)) = \exp(-v_j(n)) / [1 + \exp(-v_j(n))]^2 = y_j(n)\,[1 - y_j(n)]$
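
The practical benefit of the last identity is that the derivative can be obtained from the neuron's output alone, without re-evaluating the exponential. A short sketch (function names are illustrative) that computes the derivative both ways:

```python
import math

def logistic(v):
    """f(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def logistic_prime(v):
    """Derivative from the output alone: f'(v) = y (1 - y)."""
    y = logistic(v)
    return y * (1.0 - y)

def logistic_prime_explicit(v):
    """Same derivative in the explicit form exp(-v) / (1 + exp(-v))^2."""
    return math.exp(-v) / (1.0 + math.exp(-v)) ** 2
```

The slope is largest at v = 0, where `logistic_prime(0.0)` equals 0.25, and it shrinks toward zero as |v| grows.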

21
Learning Rate
  • The learning-rate parameter $\eta$ controls the
    rate and stability of convergence of the BP
    algorithm
      • Smaller $\eta$: slower, smoother convergence
      • Larger $\eta$: faster, but unstable
        (oscillatory) convergence
  • Momentum
      • $\Delta w_{ji}(n) = \alpha\, \Delta w_{ji}(n-1) + \eta\, \delta_j(n)\, y_i(n)$
      • $\alpha$: a usually positive constant called the
        momentum term
      • Improves stability of convergence by adding
        momentum from the previous change in weights
      • May prevent convergence to a shallow (local)
        minimum
      • This update rule is known as the generalized
        delta rule
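
A minimal sketch of the generalized delta rule for a single weight follows. The helper name `momentum_step` and the constant inputs are assumptions chosen only to make the effect of the momentum term visible.

```python
def momentum_step(w, prev_dw, delta_j, y_i, eta, alpha):
    """Generalized delta rule for one weight (hypothetical helper).

    dw(n) = alpha * dw(n-1) + eta * delta_j(n) * y_i(n)
    Returns the updated weight and the weight change just applied.
    """
    dw = alpha * prev_dw + eta * delta_j * y_i
    return w + dw, dw

# Illustrative run: the local gradient keeps the same sign for 5 iterations
w, dw = 0.0, 0.0
history = []
for _ in range(5):
    w, dw = momentum_step(w, dw, delta_j=1.0, y_i=1.0, eta=0.1, alpha=0.9)
    history.append(dw)
```

With a gradient of constant sign the steps in `history` grow from eta toward eta / (1 - alpha), so momentum accelerates movement along consistent directions. When the sign alternates, successive contributions partly cancel, which damps oscillation.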

22
Modes of Training
  • N: number of training examples (input-output
    patterns) in the training set
  • One complete presentation of the training set is
    called an epoch
  • It is good practice to randomize the order of
    presentation of patterns from one epoch to
    another so as to enhance the search in weight
    space

23
Pattern Mode of Training
  • Weights are updated after the presentation of
    each pattern
  • This is the manner in which the BP algorithm was
    derived in the preceding slides
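  • Average weight change after N updates
      • $\Delta w_{ji} = \frac{1}{N} \sum_{n=1}^{N} \Delta w_{ji}(n) = -\frac{\eta}{N} \sum_{n=1}^{N} \frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)} = -\frac{\eta}{N} \sum_{n=1}^{N} e_j(n)\, \frac{\partial e_j(n)}{\partial w_{ji}(n)}$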

24
Batch Mode of Training
  • Weights are updated after the presentation of all
    patterns in the epoch
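  • Average cost function
      • $\mathcal{E}_{av} = \frac{1}{2N} \sum_{n=1}^{N} \sum_{j \in C} e_j^2(n)$
      • $\Delta w_{ji} = -\eta\, \partial \mathcal{E}_{av} / \partial w_{ji} = -\frac{\eta}{N} \sum_{n=1}^{N} e_j(n)\, \frac{\partial e_j(n)}{\partial w_{ji}}$
  • Comparison
      • The net weight change after a complete epoch
        differs between the two modes
      • Pattern mode provides an estimate of the
        batch-mode weight change
      • Pattern mode is suitable for on-line
        implementation, requires less storage, and
        provides a better search of weight space
        (because it is stochastic)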
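
To make the difference between the two modes concrete, here is a small sketch for the simplest possible case. It assumes a single linear neuron with one weight (f(v) = v, so f'(v) = 1), which is the LMS special case rather than a full MLP; the data and function names are invented for the illustration.

```python
def pattern_epoch(w, data, eta):
    """Pattern mode: update the weight after every pattern (y = w * x)."""
    for x, d in data:
        e = d - w * x
        w = w + eta * e * x            # delta rule with f'(v) = 1
    return w

def batch_epoch(w, data, eta):
    """Batch mode: accumulate over the epoch, then apply one averaged update."""
    grad_sum = sum((d - w * x) * x for x, d in data)
    return w + eta * grad_sum / len(data)

# Illustrative training set generated by d = 2 x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_pattern = pattern_epoch(0.0, data, eta=0.05)
w_batch = batch_epoch(0.0, data, eta=0.05)
```

Starting from the same weight, one epoch leaves `w_pattern` at about 1.164 and `w_batch` at about 0.467. Both move toward the target value 2, but along different paths, because in pattern mode every error is computed with weights already changed by the earlier patterns.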

25
Stopping Criteria
  • The BP algorithm is considered to have converged
    when the Euclidean norm of the gradient vector
    reaches a sufficiently small gradient threshold
      • $\|g(w)\| < \tau$
  • The BP algorithm is considered to have converged
    when the absolute rate of change in the average
    squared error per epoch is sufficiently small
      • $|\mathcal{E}_{av}(w(n+1)) - \mathcal{E}_{av}(w(n))| / \mathcal{E}_{av}(w(n-1)) < \tau$
  • The BP algorithm is terminated at the weight
    vector $w_{final}$ when $\|g(w_{final})\| < \tau_1$,
    or $\mathcal{E}_{av}(w_{final}) < \tau_2$
  • The BP algorithm is stopped when the network's
    generalization properties are adequate
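
The first two criteria could be checked once per epoch with a helper along these lines. This is a sketch only: the function name, the default thresholds, and the choice to normalize the change in error by the previous epoch's error are all assumptions made for the example, not values given in the slides.

```python
import math

def should_stop(grad, cost_prev, cost_now, tau_grad=1e-4, tau_cost=1e-3):
    """Return True if either convergence test is met (hypothetical helper).

    grad      : gradient vector g(w) as a flat list of floats
    cost_prev : average squared error after the previous epoch
    cost_now  : average squared error after the current epoch
    tau_grad, tau_cost : thresholds (illustrative defaults)
    """
    # Test 1: Euclidean norm of the gradient is small enough
    grad_norm = math.sqrt(sum(g * g for g in grad))
    if grad_norm < tau_grad:
        return True
    # Test 2: relative change of the average error per epoch is small enough
    if cost_prev <= 0.0:
        return True                      # error already zero, nothing to gain
    return abs(cost_now - cost_prev) / cost_prev < tau_cost
```

For instance, `should_stop([1.0], 1.0, 0.5)` is False because the error is still halving per epoch, while `should_stop([1.0], 1.0, 0.9995)` is True because the error has nearly stopped changing.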

26
Initialization
  • The initial values assigned to the weights affect
    the performance of the network
  • If prior information is available, then it should
    be used to assign appropriate initial values to
    the weights
  • However, prior information is usually not
    available. Moreover, even when prior information
    about the problem is known, it is usually not
    possible to translate it into weight values,
    since the behavior of an MLP is complex and not
    completely understood.
  • If prior information is not available, the
    weights are initialized to uniform random values
    within the range [0, 1]
  • Poor initial values can lead to premature
    saturation

27
Premature Saturation
  • When the output of a neuron approaches the limits
    of the sigmoidal function, little change occurs
    in the weights. That is, the neuron is saturated,
    and learning and adaptation are hampered.
  • How to avoid premature saturation?
      • Initialize weights from a uniform distribution
        within a small range
      • Use as few hidden neurons as possible
      • Premature saturation is least likely when
        neurons operate in the linear range (the middle
        of the sigmoidal function)

28
The XOR Problem (1)
29
The XOR Problem (2)
30
Hyperbolic Tangent Activation Function
  • The hyperbolic tangent is an antisymmetric
    sigmoidal function, i.e. $f(-v) = -f(v)$
  • Experience has indicated that using an
    antisymmetric activation function can speed up
    learning (i.e. it requires fewer training
    iterations)
  • The hyperbolic tangent function varies from -a to
    a (as opposed to [0, a] for the logistic
    function), or from -1 to 1 for a = 1
      • $f(v) = a \tanh(bv) = a\,[1 - \exp(-2bv)]\,[1 + \exp(-2bv)]^{-1} = 2a\,[1 + \exp(-2bv)]^{-1} - a$
  • Suggested values for a and b: a = 1.716, b = 2/3
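
A short sketch of this activation with the suggested constants is given below. The exponential form uses the standard identity tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x)); the function names are illustrative, and the derivative expressed through the output is a standard result added here for convenience rather than something stated on the slide.

```python
import math

A, B = 1.716, 2.0 / 3.0   # suggested values for a and b

def tanh_activation(v, a=A, b=B):
    """Antisymmetric sigmoid f(v) = a * tanh(b * v)."""
    return a * math.tanh(b * v)

def tanh_activation_exp(v, a=A, b=B):
    """Equivalent exponential form 2a / (1 + exp(-2 b v)) - a."""
    return 2.0 * a / (1.0 + math.exp(-2.0 * b * v)) - a

def tanh_activation_prime(v, a=A, b=B):
    """Derivative through the output y = f(v): f'(v) = (b / a) (a - y)(a + y)."""
    y = tanh_activation(v, a, b)
    return (b / a) * (a - y) * (a + y)
```

With these constants the function passes very close to f(1) = 1 and f(-1) = -1, so targets of -1 and +1 sit inside the range of the activation rather than at its saturation limits of -1.716 and 1.716.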

31
Some Implementation Tips (1)
  • Normalize the desired (target) responses $d_j$ to
    lie within the limits of the activation function
      • If the activation function values range from
        -1.716 to 1.716, we can limit the desired
        responses to [-1, 1]
  • The weights and thresholds should be uniformly
    distributed within a small range to prevent
    saturation of the neurons
  • Ideally, all neurons should learn at the same
    rate. To achieve this, the learning-rate
    parameter can be set larger for layers further
    away from the output layer

32
Some Implementation Tips (2)
  • The order in which the training examples are
    presented to the network should be randomized
    (shuffled) from one epoch to another. This
    improves the chance of finding a better local
    minimum on the error surface.
  • Whenever prior information is available, include
    that in the learning process

33
Pattern Classification
  • Since the outputs of MLPs are continuous, we need
    to define decision rules for classification
  • In general, classification into m classes
    requires m output neurons
  • What decision rules should be used?
      • A pattern x is classified to class k if output
        neuron k fires (i.e. its output is greater than
        a threshold)
      • The problem with this decision rule is that it
        is ambiguous: more than one neuron may fire
      • A pattern x is classified to class k if the
        output of neuron k is greater than that of all
        other neurons
      • $y_k > y_j$ for all $j \neq k$
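
The two decision rules can be compared directly in code. In this sketch the class indices start at 0, and the threshold of 0.5 and the output values are assumptions chosen to show the ambiguity of the first rule.

```python
def classify_threshold(y, threshold=0.5):
    """Rule 1: every output neuron above the threshold 'fires'.

    May return zero, one, or several class indices, hence the ambiguity.
    """
    return [k for k, yk in enumerate(y) if yk > threshold]

def classify_max(y):
    """Rule 2: pick the class whose output neuron has the largest output."""
    return max(range(len(y)), key=lambda k: y[k])

# Illustrative network outputs for one pattern, three classes
outputs = [0.62, 0.81, 0.10]
fired = classify_threshold(outputs)
winner = classify_max(outputs)
```

Here two neurons exceed the threshold, so the first rule cannot decide between classes 0 and 1, while the second rule always returns a single class (here class 1).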

34
Example (1)
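  • Classify between two overlapping
    two-dimensional, Gaussian-distributed patterns
  • Conditional probability density functions for the
    two classes
      • $f(x \mid C_1) = \frac{1}{2\pi\sigma_1^2} \exp\left(-\frac{1}{2\sigma_1^2} \|x - \mu_1\|^2\right)$
      • $\mu_1$ = mean = $[0\ 0]^T$ and $\sigma_1^2$ = variance = 1
      • $f(x \mid C_2) = \frac{1}{2\pi\sigma_2^2} \exp\left(-\frac{1}{2\sigma_2^2} \|x - \mu_2\|^2\right)$
      • $\mu_2$ = mean = $[2\ 0]^T$ and $\sigma_2^2$ = variance = 4
      • $x = [x_1\ x_2]^T$: two-dimensional input
      • $C_1$ and $C_2$: class labels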
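
A data set for this example could be generated roughly as follows. This is a sketch, not the Matlab code mentioned in the deck: it assumes equal numbers of patterns per class, and the function names and the seed are illustrative.

```python
import math
import random

MU1, VAR1 = (0.0, 0.0), 1.0   # class C1: mean [0 0]^T, variance 1
MU2, VAR2 = (2.0, 0.0), 4.0   # class C2: mean [2 0]^T, variance 4

def density(x, mu, var):
    """Isotropic 2-D Gaussian: exp(-||x - mu||^2 / (2 var)) / (2 pi var)."""
    sq_dist = (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2
    return math.exp(-sq_dist / (2.0 * var)) / (2.0 * math.pi * var)

def sample(mu, var, rng):
    """Draw one pattern from the class with the given mean and variance."""
    s = math.sqrt(var)
    return (rng.gauss(mu[0], s), rng.gauss(mu[1], s))

def make_dataset(n_per_class, seed=0):
    """Return shuffled (x, label) pairs; label 1 for C1 and 2 for C2."""
    rng = random.Random(seed)
    data = [(sample(MU1, VAR1, rng), 1) for _ in range(n_per_class)]
    data += [(sample(MU2, VAR2, rng), 2) for _ in range(n_per_class)]
    rng.shuffle(data)
    return data

train = make_dataset(500, seed=0)
test = make_dataset(500, seed=1)
```

Because the two densities overlap, no classifier can reach zero error on this problem; comparing `density(x, MU1, VAR1)` with `density(x, MU2, VAR2)` gives the best achievable decision for equally likely classes, which is a useful reference when judging the trained network.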

35
Example (2)
36
Example (3)
37
Example (4)
  • Consider an MLP with two inputs, four hidden
    neurons, and two outputs
  • Decision rule: an input x is classified to $C_1$
    if $y_1 > y_2$
  • The training set is generated from the
    probability distribution functions
  • Using the BP algorithm, the network is trained for
    minimum mean-square error
  • The testing set is generated from the probability
    distribution functions
  • The trained network is tested for correct
    classification
  • For other implementation details, see the Matlab
    code

38
Example (5)
39
Experimental Design
  • Number of hidden layers
      • Use the minimum number of hidden layers that
        gives the best performance (least
        mean-square error or best generalization)
      • In general, more than 2 hidden layers are
        rarely necessary
  • Number of hidden-layer neurons
      • Use the minimum number of hidden-layer neurons
        (> 2) that gives the best performance
  • Learning-rate and momentum parameters
      • The parameters that, on average, yield
        convergence to a local minimum in the least
        number of epochs
      • The parameters that, on average or in the worst
        case, yield convergence to the global minimum
        in the least number of epochs
      • The parameters that, on average, yield a network
        with the best generalization

40
Generalization (1)
  • A neural network is trained to learn the
    input-output patterns presented to it by
    minimizing an error function (e.g.
    mean-square-error)
  • In other words, the neural network tries to learn
    the given input-output mapping or association as
    accurately as possible
  • But can it generalize properly? And what is
    generalization?
  • A network is said to generalize well when the
    input-output relationship computed by the network
    is correct (or nearly so) for input-output
    patterns (test data) never used in creating and
    training the network
  • In other words, a network generalizes well when
    it learns the input-output mapping of the system
    from which the training data is obtained

41
Generalization (2)
  • Properly fitted data: good generalization

42
Generalization (3)
  • Overfitted data: bad generalization

43
Generalization (4)
  • How to achieve good generalization?
      • In general, good generalization implies a smooth
        nonlinear input-output mapping
      • A rigorous mathematical criterion was presented
        by Poggio and Girosi (1990)
      • In general, the simplest function that maps the
        input-output patterns is also the smoothest
  • In a neural network
      • Use the simplest architecture possible, with as
        few hidden neurons as needed for the mapping
      • Use a training set that is consistent with the
        complexity of the architecture (i.e. more
        patterns for more complex networks)

44
Cross-Validation
  • The design of a neural network is experimental.
    We select the best network parameters based on
    a criterion
  • From statistics, cross-validation provides a
    systematic way of experimenting
  • Randomly partition data into training and testing
    samples
  • Further partition training sample into an
    estimation and an evaluation (cross-validation)
    sample
  • Find the best model by training with estimation
    sample and validating with evaluation sample
  • Once the model has been found, train on entire
    training sample and test generalization using the
    testing sample
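
The partitioning steps above can be sketched as a single helper. The split fractions, the seed, and the name `partition` are assumptions for the illustration; the slides do not prescribe particular proportions.

```python
import random

def partition(data, test_frac=0.2, eval_frac=0.25, seed=0):
    """Split data into estimation, evaluation and testing samples.

    test_frac : fraction of all data held out for testing
    eval_frac : fraction of the training sample held out for evaluation
    """
    rng = random.Random(seed)
    items = list(data)
    rng.shuffle(items)                      # random partition
    n_test = int(len(items) * test_frac)
    testing, training = items[:n_test], items[n_test:]
    n_eval = int(len(training) * eval_frac)
    evaluation, estimation = training[:n_eval], training[n_eval:]
    return estimation, evaluation, testing

# Illustrative use with 100 pattern indices
estimation, evaluation, testing = partition(range(100))
```

Candidate models would be trained on `estimation` and compared on `evaluation`; the chosen model is then retrained on `estimation + evaluation` and its generalization measured once on `testing`, which must not influence any design decision.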

45
Universal Approximator
  • A neural network can be thought of as a mapping
    from a p-dimensional Euclidean space to a
    q-dimensional Euclidean space. In other words, a
    neural network can learn a function
    $s: \mathbb{R}^p \to \mathbb{R}^q$
  • A multilayer feedforward network with nonlinear,
    bounded, monotone-increasing activation functions
    is a universal approximator
  • Universal approximation: the ability to
    approximate a continuous nonlinear mapping to any
    desired degree of accuracy
  • This is an existence theorem. That is, it does
    not say anything about practical optimality
    (complexity, computation time, etc.)

46
MLP and BP Remarks (1)
  • The BP algorithm is the most popular algorithm
    for supervised training of MLPs. It is popular
    because
      • It is simple to compute locally
      • It performs stochastic gradient descent in
        weight space (for pattern-by-pattern mode of
        training)
  • The BP algorithm does not have a biological
    basis. Many of its operations are not found in
    biological neural networks. Nevertheless, it is
    of great engineering importance.
  • The hidden layers act as feature
    extractors/detectors
  • An MLP can be trained to learn an identity
    mapping (i.e. map a pattern to itself), in which
    case the hidden layers act as feature extractors
47
MLP and BP Remarks (2)
  • An MLP trained with BP is a universal
    approximator in the sense that it can approximate
    any continuous multivariate function to any
    desired degree of accuracy, provided that
    sufficiently many hidden neurons are available
  • The BP algorithm is a first-order approximation
    of the method of steepest descent. Consequently,
    it converges slowly.
  • The BP algorithm is a hill-climbing technique and
    therefore suffers from the possibility of getting
    trapped in a local minimum of the cost surface
  • The MLP scales poorly because of full connectivity

48
Accelerating Convergence
  • Four heuristics
  • Every adjustable network parameter of the cost
    function should have its own individual learning
    rate parameter
  • Every learning rate parameter should be allowed
    to vary from one iteration to the next
  • When the derivative of the cost function with
    respect to a synaptic weight has the same
    algebraic sign for several consecutive iterations
    of the algorithm, the learning rate parameter for
    that particular weight should be increased
  • When the algebraic sign of the derivative of the
    cost function with respect to a particular
    synaptic weight alternates for several
    consecutive iterations of the algorithm, the
    learning rate parameter for that weight should be
    decreased
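
The four heuristics can be illustrated with a simple per-weight rate adaptation. This sketch is a deliberate simplification and not a named algorithm from the slides: it compares only the current and the previous derivative, whereas the heuristics speak of several consecutive iterations, and the factors `up` and `down` are made-up values.

```python
def adapt_rates(rates, grads, prev_grads, up=1.1, down=0.5):
    """Adjust one learning rate per weight from the sign of its derivative.

    rates      : current learning rate for each weight
    grads      : derivative of the cost w.r.t. each weight, this iteration
    prev_grads : the same derivatives from the previous iteration
    up, down   : multiplicative factors (illustrative values)
    """
    new_rates = []
    for eta, g, g_prev in zip(rates, grads, prev_grads):
        if g * g_prev > 0:          # same sign: increase the rate
            eta *= up
        elif g * g_prev < 0:        # sign alternates: decrease the rate
            eta *= down
        new_rates.append(eta)       # zero derivative: leave the rate alone
    return new_rates

# Illustrative values for three weights
rates = adapt_rates([0.1, 0.1, 0.1],
                    grads=[0.3, -0.2, 0.0],
                    prev_grads=[0.5, 0.4, 0.1])
```

The first weight keeps its derivative sign and gets a larger rate, the second flips sign and gets a smaller one, and the third is left unchanged.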