1
Nonlinear Pattern Association
  • X → Y

2
Limitations of linear associators
  • Can only learn linearly separable problems
  • Cannot take advantage of multiple layers - all
    reduce to functioning like a single layer of
    weights (direct connection between input and
    output).
  • Although you can introduce nonlinearities by adding additional input units (e.g., x², log(x)), you must determine which ones to add.

3
New concept - Activation Functions
  • Thus far, we've computed the activation of the output as a_j = Σ_i w_ij a_i (or O = W·I),
  • where i indexes an input unit and j an output unit.
  • We can generalize this by introducing an activation function:
  • a_j = f(Σ_i w_ij a_i)
  • For linear networks, the function is linear, but there are other possible choices (see the sketch below).
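A minimal NumPy sketch (not from the slides) of the candidate activation functions discussed on the next slides, each applied to the net input Σ_i w_ij a_i; the exact form of the "linear threshold" (a ramp clipped to [0, 1]) is an assumption:

```python
import numpy as np

def net_input(W, a):
    """Net input to each output unit j: sum over i of w_ij * a_i."""
    return W @ a

# Candidate activation functions f, applied elementwise to the net input.
def linear(x):           return x
def sigmoid(x):          return 1.0 / (1.0 + np.exp(-x))   # logistic / ogive
def threshold(x):        return (x > 0).astype(float)      # hard 0/1 step
def linear_threshold(x): return np.clip(x, 0.0, 1.0)       # ramp clipped to [0, 1]

W = np.array([[0.2, -0.5, 0.1],
              [0.4,  0.3, -0.2]])          # 2 output units, 3 input units
a_in = np.array([1.0, 0.5, -1.0])
for f in (linear, sigmoid, threshold, linear_threshold):
    print(f.__name__, f(net_input(W, a_in)))
```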

4
[Plots of the linear and sigmoid activation functions]
5
[Plots of the threshold and linear-threshold activation functions]
6
The sigmoid is the most common activation function
The sigmoid is also known as the ogive or logistic function.
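For reference, the logistic function can be written as

f(x) = \frac{1}{1 + e^{-x}}

so its output always lies strictly between 0 and 1.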
7
Compensating for the shape
  • When using the delta rule, we now need to compensate for the nonlinear shape of the function when deciding how much to change a weight:
  • δ_j = (t_j − a_oj) f′(net_j),   ΔW_ij = η δ_j a_i

Here δ is the error, t the target output vector, a_o the actual output vector, and a_i the input vector; f′ is the derivative of the activation function.
8
f(x) and f′(x)
  • The inclusion of the derivative ensures that changes are largest for units whose activation is in the middle of the range (i.e., where the unit is most sensitive to changes in its input); see the formula below.
  • Note: f(x) must be differentiable!
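For the logistic activation in particular, the derivative takes the convenient form

f'(x) = f(x)\,\bigl(1 - f(x)\bigr)

which is largest when the activation is 0.5 (the middle of the range) and shrinks toward zero as the unit saturates near 0 or 1.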

9
Multiple Layers in Networks
  • The key to increasing the power of pattern
    associators is to add layers between the input
    and output layers.

10
A simple multi-layer network
11
Computing error - a problem
  • It's still easy to measure the error of the output units because their output can be compared directly to the target values (the delta rule).
  • But how do you measure the error of a hidden unit, for which there is no target value?

12
Hidden Unit Error Backpropagation
  • The Big Idea:
  • A hidden unit is in error to the extent that it contributes to the error of the units to which it provides direct input.
  • The algorithm propagates the output error backward through the system.
  • This is simply a way of identifying who is to blame and by how much: a method of blame assignment!

13
Computing hidden unit error
The error of a hidden unit j is the sum of the errors of the units it feeds, weighted by the connecting weights and scaled by the derivative of its own activation: δ_j = f′(net_j) Σ_k w_jk δ_k
Weights are changed just like before: ΔW_ij = η δ_j a_i (a sketch follows below)
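A minimal NumPy sketch (not from the slides; the layer sizes, variable names, and logistic activation are assumptions) of one training step for a network with a single hidden layer, showing the output error being propagated back to the hidden units:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W_ih, W_ho, a_in, target, lr=0.1):
    """One forward pass and one weight update for an input-hidden-output net."""
    # Forward pass
    a_h = sigmoid(W_ih @ a_in)   # hidden activations
    a_o = sigmoid(W_ho @ a_h)    # output activations

    # Output error: (target - actual) scaled by f'(net) = a(1 - a) for the logistic
    delta_o = (target - a_o) * a_o * (1.0 - a_o)
    # Hidden error: blame passed back through the weights each hidden unit feeds
    delta_h = (W_ho.T @ delta_o) * a_h * (1.0 - a_h)

    # Weight changes, just as before: dW_ij = lr * delta_j * a_i
    W_ho = W_ho + lr * np.outer(delta_o, a_h)
    W_ih = W_ih + lr * np.outer(delta_h, a_in)
    return W_ih, W_ho
```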
14
It works!
  • Backprop nets as universal function
    approximators
  • Technically, a backprop network can approximate any continuous functional relationship to arbitrary accuracy.
  • However, its ability to do so depends on the
    number of hidden units and the initial weights of
    the network.

15
Example in JMP
16
Issues in backprop
  • Local minima
  • Learning rate
  • Initial random weight choice
  • Number of hidden units
  • Input representations
  • Output activation functions
  • Catastrophic retroactive interference

17
Local minima
  • Nonlinear activation functions (unlike linear ones) may produce local minima. What can we do?
  • Momentum
  • The weight change is a function of both the ΔW determined by the current network error and the ΔW of the previous training trial (see the sketch after this list):
  • ΔW_t = η · OuterProduct(δ, a_in) + α · ΔW_(t−1)
  • Stochastic units with annealing
  • Temperature changes the slope of the logistic.
  • Run multiple times with different initial weights and choose the best solution (minimal error).
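A sketch of the momentum update, assuming α names the momentum coefficient (a symbol the slide does not spell out) and δ and a_in are the error and input vectors from the delta rule:

```python
import numpy as np

def momentum_update(W, delta, a_in, dW_prev, lr=0.1, alpha=0.9):
    """Current change = error-driven term plus a fraction of the previous change."""
    dW = lr * np.outer(delta, a_in) + alpha * dW_prev
    return W + dW, dW   # keep dW to pass in as dW_prev on the next trial
```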

18
Learning rate
  • Steps too big - failure to converge
  • Steps too small - slow to converge
  • Modern implementations use adaptive learning
    rates (e.g., R)
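One simple adaptive scheme (a sketch of the general idea, not a description of any particular package): grow the learning rate while the error keeps falling, and cut it back when a step makes the error worse.

```python
def adapt_learning_rate(lr, err, err_prev, grow=1.05, shrink=0.5):
    """Increase the rate modestly after an improvement; cut it sharply otherwise."""
    return lr * grow if err < err_prev else lr * shrink
```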

19
Choice of initial random weights is important
  • If they are too large, then units will saturate (sit at their min or max activation level) too quickly.
  • Remember the effect of f′?
  • This results in a local minimum.
  • A good rule of thumb is to keep initial values on the order of 1/√k, where k is the fan-in of a typical unit (see the sketch after this list).
  • Also, use smaller initial weights if your inputs are not between 0 and 1.
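A sketch of this initialization rule of thumb, assuming the intended scale is 1/√k for fan-in k:

```python
import numpy as np

def init_weights(n_units, fan_in, rng=np.random.default_rng(0)):
    """Small uniform random weights on the order of 1/sqrt(fan_in) to avoid early saturation."""
    scale = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-scale, scale, size=(n_units, fan_in))
```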

20
How many hidden units?
  • First, how many layers are necessary?
  • At most two (Cybenko, 1988, 1989)
  • At most one to approximate any continuous function
  • How many hidden units?
  • Arbitrarily high accuracy can be achieved by increasing the number of hidden units.
  • It is not generally known how to determine the optimal or required number of hidden units.
  • Corollary problem: too many hidden units can provide arbitrarily high accuracy at the cost of overfitting.
  • A common strategy is to use cross-validation to ensure that the solution generalizes (see the sketch after this list).
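A sketch of picking the number of hidden units by hold-out cross-validation; train_fn and error_fn are hypothetical stand-ins for whatever training and scoring routines you use.

```python
import numpy as np

def choose_hidden_units(X, Y, sizes, train_fn, error_fn, val_frac=0.2,
                        rng=np.random.default_rng(0)):
    """train_fn(X, Y, n_hidden) -> model;  error_fn(model, X, Y) -> scalar error."""
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_frac)
    val, tr = idx[:n_val], idx[n_val:]

    scores = {}
    for n_hidden in sizes:
        model = train_fn(X[tr], Y[tr], n_hidden)
        scores[n_hidden] = error_fn(model, X[val], Y[val])  # held-out error
    best = min(scores, key=scores.get)                      # size that generalizes best
    return best, scores
```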

21
Optimal network architectures
  • Pruning and weight decay
  • Weight decay can be used to get rid of useless connections (Hinton, 1986):
  • w_ij^new = (1 − ε) w_ij^old
  • Unfortunately, this equation can overly penalize large weights, which "cost" more than small weights.
  • i.e., large weights decay faster than small ones.
  • An alternative that causes small weights to decay faster than large ones is to make ε itself a decreasing function of the size of the weight (see the sketch after this list).
  • It might also be advantageous to remove whole units from the architecture.
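A sketch of the decay step; the second form, in which the decay rate falls off with the size of the weight, is one illustrative way to make small weights decay proportionally faster, and is an assumption rather than the equation from the original slide.

```python
import numpy as np

def decay_uniform(W, eps=0.01):
    """w_new = (1 - eps) * w_old: every weight shrinks by the same fraction."""
    return (1.0 - eps) * W

def decay_small_faster(W, eps0=0.01):
    """Let the decay rate fall off with weight size, so small weights decay
    proportionally faster than large ones (illustrative form: eps0 / (1 + w^2))."""
    return (1.0 - eps0 / (1.0 + W**2)) * W
```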

22
Input representations
  • The way that a problem is represented is critical.
  • Example 1: Train a network to respond 1 when the input number is odd.
  • Example 2: If the response of the system depends on the order of the inputs, the network won't be able to solve it.
  • To a network, the ordering of the input units is arbitrary!
  • i.e., you could reverse all of the patterns (before training) and the system would perform identically.
  • Solla (1988) offered a technique for dealing with the order problem by having the input representation be broadly tuned: adjacent units affect one another.
  • For example, the effective input to unit j might really be j + b(i) + b(k), where unit j lies between units i and k (see the sketch after this list).
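A sketch of the broadly tuned idea; the spillover function b(·) is assumed here to be a simple fraction of each neighbor's value, since the slide does not define it.

```python
import numpy as np

def broadly_tuned(x, b=0.5):
    """Each input unit's value is augmented by b times each adjacent unit's value."""
    out = x.astype(float).copy()
    out[1:]  += b * x[:-1]   # spillover from the left neighbor
    out[:-1] += b * x[1:]    # spillover from the right neighbor
    return out
```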

23
Output activation functions
  • Note that a sigmoid activation function on the output units will limit output values to the range (0, 1).
  • The outputs could be rescaled afterwards, but...
  • it is often a good idea simply to use a linear activation function on the output units to do this mapping.

24
Catastrophic retroactive interference
  • McCloskey & Cohen; Ratcliff (Discuss)
  • Attempts to solve the problem:
  • Most have one thing in common: reduce the amount of overlap between representations.
  • French's activation-sharpening algorithm
  • An extra step is added to backprop in which the hidden unit activations are sharpened.
  • Sharpening: the most active hidden units have their activations increased, whereas the rest are decreased (see the sketch after this list).
  • This results in units being more specialized.
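A sketch of the sharpening step; the exact update and parameters in French's algorithm are not given on the slide, so the top-k rule and the factor alpha below are assumptions.

```python
import numpy as np

def sharpen(a_hidden, k=2, alpha=0.2):
    """Push the k most active hidden units toward 1 and the rest toward 0."""
    a = a_hidden.copy()
    winners = np.argsort(a)[-k:]              # indices of the k most active units
    mask = np.zeros_like(a, dtype=bool)
    mask[winners] = True
    a[mask]  += alpha * (1.0 - a[mask])       # increase the winners' activations
    a[~mask] -= alpha * a[~mask]              # decrease everyone else's
    return a
```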

25
Orthogonalizing hidden unit representations
  • For example, using Gaussian units or other activation functions so that representations have less in common (fewer units are active for any given input pattern).
  • This can also be accomplished with an N-M-N encoder or an unsupervised PCA learning network.
  • Drawbacks:
  • Less distributed representations are less efficient for storage.
  • e.g., a completely localized representation requires one unit for each concept to be stored.
  • Poorer generalization to similar patterns.

26
Rehearsal schemes
  • This technique doesn't try to sparsify the representations but instead tries to side-step the problem.
  • Ratcliff (1990) proposed mixing old trials in with the new ones during training, to ensure that the system doesn't forget old relationships (see the sketch after this list).
  • McClelland, McNaughton, & O'Reilly (1995)
  • Used a combination of:
  • developing sparse representations (purportedly accomplished by the hippocampus), and
  • interleaved learning (a rehearsal scheme) in which the hippocampus purportedly plays back recent memories to the neocortex during sleep.
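A sketch of a rehearsal/interleaving scheme; the mixing ratio and the way old patterns are sampled are illustrative choices.

```python
import numpy as np

def interleave(new_patterns, old_patterns, old_frac=0.5,
               rng=np.random.default_rng(0)):
    """Mix a random sample of previously learned patterns in with the new ones."""
    n_old = int(len(new_patterns) * old_frac)
    picks = rng.integers(0, len(old_patterns), size=n_old)
    batch = list(new_patterns) + [old_patterns[i] for i in picks]
    order = rng.permutation(len(batch))       # shuffle old and new trials together
    return [batch[i] for i in order]
```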

27
Drawbacks to rehearsal schemes
  • Training is inefficient
  • Must somehow store old relationships and keep
    playing them back.
  • The sleep-based rehearsal scheme just seems
    downright hokey

28
Summary of the Catastrophic Retroactive Interference problem
  • It is only a problem in a dynamically changing environment.
  • Methods exist to deal with the problem; some of these methods are neurologically plausible.
  • E.g., limited receptive fields and a lack of full connectivity.
  • NOTE: The attempts to reduce the interference do not eliminate it; interference still occurs for very similar input patterns (as it should).

29
Autoassociative N-M-N encoders
  • Backprop can be used for autoassociation.
  • The M hidden units serve as a bottleneck, a
    reduced representation of the input.
  • The hidden unit representation will converge on
    the first M principal components of the input
    space.
  • Thus, the hidden units will tend to create an
    orthogonal representation of the inputs.
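A sketch of one training step for an N-M-N encoder, reusing the backprop logic shown earlier; a linear output layer is assumed here, since the slides do not specify the output activation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencoder_step(W_enc, W_dec, x, lr=0.1):
    """N-M-N encoder: the target is the input itself."""
    h = sigmoid(W_enc @ x)                    # M-unit bottleneck code
    y = W_dec @ h                             # linear output reconstructs x
    delta_o = x - y                           # error against the input as target
    delta_h = (W_dec.T @ delta_o) * h * (1.0 - h)
    W_dec = W_dec + lr * np.outer(delta_o, h)
    W_enc = W_enc + lr * np.outer(delta_h, x)
    return W_enc, W_dec
```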

30
In Sum
  • Backprop has been an extremely popular neural network method because of its ability to function as a universal function approximator.
  • Backprop, however, is slow (as originally conceived), suffers from local minima (as all nonlinear methods do), can suffer from massive retroactive interference, and lacks biological plausibility (the same weights are used both to feed activation forward and to propagate error backward).
  • Nevertheless, it is an excellent tool to add to your repertoire.