Title: Nonlinear Pattern Association
Limitations of linear associators
- Can only learn linearly separable problems.
- Cannot take advantage of multiple layers - all reduce to functioning like a single layer of weights (a direct connection between input and output).
- Although you can introduce nonlinearities by adding additional input units (e.g., x², log(x)), you must determine which ones to add.
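To see why multiple linear layers collapse into one, here is a quick NumPy check (my own illustration, not from the slides):

```python
import numpy as np

# Two linear layers collapse into one: W2 @ (W1 @ a) == (W2 @ W1) @ a.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input -> "hidden"
W2 = rng.normal(size=(2, 4))   # "hidden" -> output
a = rng.normal(size=3)

print(np.allclose(W2 @ (W1 @ a), (W2 @ W1) @ a))  # True: a single layer W = W2 @ W1 suffices
```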
New concept - Activation Functions
- Thus far, we've computed the activation of the output by a_j = Σ_i w_ij a_i (or O = W·I), where i indexes input units and j indexes output units.
- We can generalize this by introducing an activation function: a_j = f(Σ_i w_ij a_i).
- For linear networks, the function is linear, but there are other possible choices.
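As a concrete illustration, here is a minimal NumPy sketch of this generalization (the function and variable names are my own, not from the slides): the only change from the linear associator is wrapping the weighted sum in f.

```python
import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W, a_in, f=sigmoid):
    """Compute output activations a_out = f(W @ a_in)."""
    return f(W @ a_in)

# Example: 2 input units -> 3 output units with random weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
a_in = np.array([0.5, -1.0])
print(forward(W, a_in, f=linear))   # linear associator
print(forward(W, a_in, f=sigmoid))  # nonlinear (sigmoid) associator
```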
[Figures: example activation functions - linear, sigmoid, threshold, and linear threshold.]
Sigmoid is the most common activation function
- The sigmoid is also known as the ogive or logistic function.
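For reference, the logistic form of the sigmoid and its derivative (standard results, not reproduced on the slide) are:

```latex
f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr)
```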
Compensating for the shape
- When using the delta rule, we now need to compensate for the nonlinear shape of the function when deciding how much to change a weight: δ_j = (t_j - a_oj) · f'(net_j)
- Here δ is the error, t the target output vector, a_o the actual output vector, and a_i the input vector; f' is the derivative of the activation function.
f(x) and f'(x)
- The inclusion of the derivative ensures that changes are maximal for units whose activation is in the middle of the range (i.e., most sensitive to input changes).
- Note: f(x) must be differentiable!
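A minimal sketch of this modified delta rule for a single layer of sigmoid output units (the function names and learning rate are my own, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rule_step(W, a_in, target, lr=0.1):
    """One weight update for sigmoid output units:
    delta_j = (t_j - a_j) * f'(net_j), where f'(net) = a_j * (1 - a_j)."""
    net = W @ a_in
    a_out = sigmoid(net)
    delta = (target - a_out) * a_out * (1.0 - a_out)  # error scaled by the derivative
    W += lr * np.outer(delta, a_in)                   # dW_ij = lr * delta_j * a_i
    return W, a_out
```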
Multiple Layers in Networks
- The key to increasing the power of pattern associators is to add layers between the input and output layers.
A simple multi-layer network
Computing error - a problem
- It's still easy to measure the error of the output units because their output can be directly compared to the target values (delta rule).
- But how do you measure the error of a hidden unit, for which there is no target value?
Hidden Unit Error Backpropagation
- The Big Idea:
- A hidden unit is in error to the extent that it contributes to the error of units to which it provides direct input.
- The algorithm propagates the output error backward through the system.
- This is simply a way of identifying who is to blame and by how much - a method of blame assignment!
Computing hidden unit error
- Weights are changed just like before: ΔW_ij = η δ_j a_i
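The slide's equation for the hidden-unit error is not reproduced here; in the standard backprop formulation it is the unit's derivative times the weighted sum of the errors of the units it feeds, δ_j = f'(net_j) Σ_k w_jk δ_k. A minimal one-hidden-layer sketch of the whole procedure (NumPy, names my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W_hid, W_out, a_in, target, lr=0.1):
    """One forward/backward pass for an input -> hidden -> output sigmoid net."""
    # Forward pass
    a_hid = sigmoid(W_hid @ a_in)
    a_out = sigmoid(W_out @ a_hid)
    # Output-unit error: (t - a) * f'(net), with f'(net) = a * (1 - a)
    delta_out = (target - a_out) * a_out * (1.0 - a_out)
    # Hidden-unit error: blame assigned through the weights each hidden unit feeds
    delta_hid = (W_out.T @ delta_out) * a_hid * (1.0 - a_hid)
    # Weight changes, just like before: dW_ij = lr * delta_j * a_i
    W_out += lr * np.outer(delta_out, a_hid)
    W_hid += lr * np.outer(delta_hid, a_in)
    return W_hid, W_out
```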
It works!
- Backprop nets as universal function approximators.
- Technically, a backprop network can approximate any functional relationship.
- However, its ability to do so depends on the number of hidden units and the initial weights of the network.
Example in JMP
Issues in backprop
- Local minima
- Learning rate
- Initial random weight choice
- Number of hidden units
- Input representations
- Output activation functions
- Catastrophic retroactive interference
Local minima
- Nonlinear activation functions (unlike linear ones) may produce local minima - what to do?
- Momentum
- The weight change is a function of the ΔW determined by the network error and the ΔW of the previous training trial: ΔW_t = η · OuterProduct(δ, a_in) + α · ΔW_(t-1)
- Stochastic units with annealing
- Temperature changes the slope of the logistic.
- Run multiple times with different initial weights and choose the best solution (minimal error).
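A minimal sketch of the momentum update (the learning rate η and momentum coefficient α are my reading of the slide's garbled symbols):

```python
import numpy as np

def momentum_update(W, delta, a_in, dW_prev, lr=0.1, alpha=0.9):
    """Weight change = error-driven change plus a fraction of the previous change."""
    dW = lr * np.outer(delta, a_in) + alpha * dW_prev
    return W + dW, dW  # return dW so it can serve as dW_prev on the next trial
```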
Learning rate
- Steps too big - failure to converge
- Steps too small - slow to converge
- Modern implementations use adaptive learning
rates (e.g., R)
Choice of initial random weights is important
- If they are too large, then units will saturate (be at their min or max activation level) too quickly - remember the effect of f'?
- This results in a local minimum.
- A good rule of thumb is to keep initial values on the order of 1/√k, where k is the fan-in of a typical unit.
- Also, use smaller initial weights if your inputs are not between 0 and 1.
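A minimal sketch of this initialization rule (assuming the 1/√k reading of the rule of thumb above; the function name and uniform range are my own choices):

```python
import numpy as np

def init_weights(n_out, n_in, seed=0):
    """Initialize weights on the order of 1/sqrt(k), where k is the fan-in,
    so units start in the sensitive middle of the sigmoid rather than saturated."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(n_in)
    return rng.uniform(-scale, scale, size=(n_out, n_in))
```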
How many hidden units?
- First, how many layers are necessary?
- At most two (Cybenko, 1988, 1989).
- At most one to approximate any continuous function.
- How many hidden units?
- Arbitrarily high accuracy can be achieved by increasing the number of hidden units.
- It is not generally known how to determine the optimal or required number of hidden units.
- Corollary problem: too many hidden units can give arbitrarily high accuracy on the training data at the cost of overfitting.
- A common strategy is to use the technique of cross-validation to ensure that the solution generalizes.
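A minimal sketch of that strategy using scikit-learn (my own tooling choice; the slides use JMP): pick the hidden-layer size with the best cross-validated score.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# Compare candidate hidden-layer sizes by 5-fold cross-validation.
for n_hidden in (2, 4, 8, 16, 32):
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,), activation="logistic",
                        max_iter=2000, random_state=0)
    score = cross_val_score(net, X, y, cv=5).mean()
    print(f"{n_hidden:3d} hidden units: mean CV accuracy = {score:.3f}")
```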
Optimal network architectures
- Pruning and weight decay
- Weight decay can be used to get rid of useless connections (Hinton, 1986): w_ij_new = (1 - ε) w_ij_old
- Unfortunately, this equation can overly penalize large weights, which cost more than small weights - i.e., large weights decay faster than small ones.
- An alternative that causes small weights to decay faster than large ones is to make the decay rate ε a decreasing function of the weight's magnitude.
- It might also be advantageous to remove units from the architecture.
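A minimal sketch of both forms of decay; the weight-dependent rate shown here, ε/(1 + w²)², is one plausible choice on my part, since the slide's exact equation for ε is not reproduced in the text.

```python
import numpy as np

def simple_decay(W, eps=0.01):
    """Plain weight decay: w_new = (1 - eps) * w_old (large weights decay fastest)."""
    return (1.0 - eps) * W

def relative_decay(W, eps=0.01):
    """Weight-dependent decay rate (illustrative form, not the slide's equation):
    eps / (1 + w^2)^2, so small weights decay faster than large ones."""
    return (1.0 - eps / (1.0 + W**2) ** 2) * W
```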
Input representations
- The way that a problem is represented is critical.
- Example 1: Train a network to respond 1 when the input number is odd.
- Example 2: If the response of the system depends on the order of the inputs, it won't be able to solve it.
- To a network, each unit is arbitrarily ordered!
- i.e., you can turn all of the patterns backward (before training) and the system would perform identically.
- Solla (1988) offered a technique for dealing with the order problem by having the input representation be broadly tuned: adjacent units affect one another.
- For example, input j might really be j + b(i) + b(k), where unit j is between units i and k.
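A minimal sketch of such a broadly tuned input encoding (a simple neighbor-blending scheme of my own choosing, not Solla's exact formulation):

```python
import numpy as np

def broadly_tuned(pattern, b=0.5):
    """Blend each input unit with its immediate neighbors so that adjacent
    units affect one another (wraps around at the edges for simplicity)."""
    pattern = np.asarray(pattern, dtype=float)
    left = np.roll(pattern, 1)
    right = np.roll(pattern, -1)
    return pattern + b * (left + right)

print(broadly_tuned([0, 0, 1, 0, 0]))  # the active unit now "leaks" into its neighbors
```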
Output activation functions
- Note that a sigmoid activation function on the output units will limit output values to (0, 1).
- This value could be transformed, but it is often a good idea to simply use a linear output activation function to do the mapping.
Catastrophic retroactive interference
- McCloskey & Cohen; Ratcliff (discuss)
- Attempts to solve the problem:
- Most have one thing in common: reduce the amount of overlap between representations.
- French's activation sharpening algorithm
- An extra step is added to backprop in which the hidden unit activations are sharpened.
- Sharpening: the most active hidden units have their activations increased whereas the rest are decreased.
- This results in units being more specialized.
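A minimal sketch of such a sharpening step (my own simplification of the idea; French's full algorithm also specifies how the sharpened activations feed into the weight updates):

```python
import numpy as np

def sharpen(a_hid, k=2, factor=0.1):
    """Push the k most active hidden units toward 1 and the rest toward 0,
    making hidden representations more specialized (less overlapping)."""
    a_hid = np.asarray(a_hid, dtype=float)
    sharpened = a_hid * (1.0 - factor)                          # everyone decays a little...
    top = np.argsort(a_hid)[-k:]                                # ...except the k most active units,
    sharpened[top] = a_hid[top] + factor * (1.0 - a_hid[top])   # which move toward 1
    return sharpened

print(sharpen([0.2, 0.9, 0.6, 0.1], k=1))
```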
Orthogonalizing hidden unit representations
- For example, using Gaussian units or other activation functions so that representations have less in common (fewer units are active for any given input pattern).
- Can also be accomplished through an N-M-N encoder or an unsupervised PCA learning network.
- Drawbacks:
- Less distributed representations are less efficient at storage.
- e.g., a completely localized representation requires one unit for each concept to be stored.
- Poorer generalization to similar patterns.
Rehearsal schemes
- This technique doesn't try to sparsify the representations but tries to side-step the problem.
- Ratcliff (1990) proposed mixing old trials in with new ones during training on the new ones to ensure that the system doesn't forget old relationships.
- McClelland, McNaughton, & O'Reilly (1995)
- Used a combination of:
- developing sparse representations (purportedly accomplished by the hippocampus) and
- interleaved learning (a rehearsal scheme) in which the hippocampus purportedly plays back recent memories to the neocortex during sleep.
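A minimal sketch of interleaved (rehearsal) training - mixing a sample of old patterns into training on the new ones (details such as the mixing ratio are my own choices):

```python
import random

def interleaved_batches(old_patterns, new_patterns, n_old_per_batch=2):
    """Yield training batches of new patterns with a few old patterns mixed in,
    so earlier associations keep getting rehearsed while new ones are learned."""
    for new_item in new_patterns:
        rehearsal = random.sample(old_patterns, min(n_old_per_batch, len(old_patterns)))
        batch = [new_item] + rehearsal
        random.shuffle(batch)
        yield batch
```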
Drawbacks to rehearsal schemes
- Training is inefficient.
- Must somehow store old relationships and keep playing them back.
- The sleep-based rehearsal scheme just seems downright hokey.
Summary of the Catastrophic Retroactive Interference problem
- It is only a problem with a dynamically changing environment.
- Methods exist to deal with the problem. Some of these methods are neurologically plausible.
- E.g., limited receptive fields and lack of full connectivity.
- NOTE: The attempts to reduce the interference do not eliminate it - interference still occurs for very similar input patterns (as it should).
Autoassociative N-M-N encoders
- Backprop can be used for autoassociation.
- The M hidden units serve as a bottleneck, a reduced representation of the input.
- The hidden unit representation will converge on the first M principal components of the input space.
- Thus, the hidden units will tend to create an orthogonal representation of the inputs.
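A minimal sketch of an N-M-N encoder compared against PCA (scikit-learn is my own tooling choice; with linear activations the M-unit bottleneck spans essentially the same subspace as the first M principal components, so the reconstruction errors come out comparable):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                      # N = 8 input/output units
X[:, 4:] = X[:, :4] @ rng.normal(size=(4, 4))      # give the data low-rank structure

M = 3  # bottleneck size

# N-M-N autoassociator: train the network to reproduce its own input.
net = MLPRegressor(hidden_layer_sizes=(M,), activation="identity",
                   max_iter=5000, random_state=0)
net.fit(X, X)
ae_err = np.mean((net.predict(X) - X) ** 2)

# PCA reconstruction from the first M principal components.
pca = PCA(n_components=M).fit(X)
pca_err = np.mean((pca.inverse_transform(pca.transform(X)) - X) ** 2)

print(f"autoencoder MSE: {ae_err:.4f}   PCA({M}) MSE: {pca_err:.4f}")
```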
In Sum
- Backprop has been an extremely popular neural network method because of its ability to function as a universal function approximator.
- Backprop, however, is slow (as originally conceived), suffers from local minima (all nonlinear methods do), can suffer from massive retroactive interference, and lacks biological plausibility (the same weights are used to feed activation forward and to backpropagate error).
- Nevertheless, it is an excellent tool to add to your repertoire.