Title: Nonlinear Pattern Association
Limitations of linear associators
- Can only learn linearly separable problems.
- Cannot take advantage of multiple layers - all reduce to functioning like a single layer of weights (a direct connection between input and output).
- Although you can introduce nonlinearities by adding additional input units (e.g., x², log(x)), you must determine which ones to add.
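To see why multiple linear layers collapse into one, here is a quick NumPy check (my own illustration, not from the slides):

```python
import numpy as np

# Two linear layers collapse into one: W2 @ (W1 @ a) == (W2 @ W1) @ a.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input -> "hidden"
W2 = rng.normal(size=(2, 4))   # "hidden" -> output
a = rng.normal(size=3)

print(np.allclose(W2 @ (W1 @ a), (W2 @ W1) @ a))  # True: a single layer W = W2 @ W1 suffices
```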
New concept - Activation Functions
- Thus far, we've computed the activation of the output by a_j = Σ_i w_ij a_i (or O = W·I), where i indexes input units and j indexes output units.
- We can generalize this by introducing an activation function: a_j = f(Σ_i w_ij a_i).
- For linear networks, the function is linear, but there are other possible choices.
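As a concrete illustration, here is a minimal NumPy sketch of this generalization (the function and variable names are my own, not from the slides): the only change from the linear associator is wrapping the weighted sum in f.

```python
import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W, a_in, f=sigmoid):
    """Compute output activations a_out = f(W @ a_in)."""
    return f(W @ a_in)

# Example: 2 input units -> 3 output units with random weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
a_in = np.array([0.5, -1.0])
print(forward(W, a_in, f=linear))   # linear associator
print(forward(W, a_in, f=sigmoid))  # nonlinear (sigmoid) associator
```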
[Figures: example activation functions - linear, sigmoid, threshold, and linear threshold.]
Sigmoid is the most common activation function
- The sigmoid is also known as the ogive or logistic function.
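For reference, the logistic form of the sigmoid and its derivative (standard results, not reproduced on the slide) are:

```latex
f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr)
```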
Compensating for the shape
- When using the delta rule, we now need to compensate for the nonlinear shape of the function when deciding how much to change a weight: δ_j = (t_j - a_oj) · f'(net_j)
- Here δ is the error, t the target output vector, a_o the actual output vector, and a_i the input vector; f' is the derivative of the activation function.
f(x) and f'(x)
- The inclusion of the derivative ensures that changes are maximal for units whose activation is in the middle of the range (i.e., most sensitive to input changes).
- Note: f(x) must be differentiable!
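A minimal sketch of this modified delta rule for a single layer of sigmoid output units (the function names and learning rate are my own, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rule_step(W, a_in, target, lr=0.1):
    """One weight update for sigmoid output units:
    delta_j = (t_j - a_j) * f'(net_j), where f'(net) = a_j * (1 - a_j)."""
    net = W @ a_in
    a_out = sigmoid(net)
    delta = (target - a_out) * a_out * (1.0 - a_out)  # error scaled by the derivative
    W += lr * np.outer(delta, a_in)                   # dW_ij = lr * delta_j * a_i
    return W, a_out
```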
Multiple Layers in Networks
- The key to increasing the power of pattern associators is to add layers between the input and output layers.
A simple multi-layer network
Computing error - a problem
- It's still easy to measure the error of the output units because their output can be directly compared to the target values (delta rule).
- But how do you measure the error of a hidden unit, for which there is no target value?
Hidden Unit Error Backpropagation
- The Big Idea:
- A hidden unit is in error to the extent that it contributes to the error of units to which it provides direct input.
- The algorithm propagates the output error backward through the system.
- This is simply a way of identifying who is to blame and by how much - a method of blame assignment!
Computing hidden unit error
- Weights are changed just like before: ΔW_ij = η δ_j a_i
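The slide's equation for the hidden-unit error is not reproduced here; in the standard backprop formulation it is the unit's derivative times the weighted sum of the errors of the units it feeds, δ_j = f'(net_j) Σ_k w_jk δ_k. A minimal one-hidden-layer sketch of the whole procedure (NumPy, names my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W_hid, W_out, a_in, target, lr=0.1):
    """One forward/backward pass for an input -> hidden -> output sigmoid net."""
    # Forward pass
    a_hid = sigmoid(W_hid @ a_in)
    a_out = sigmoid(W_out @ a_hid)
    # Output-unit error: (t - a) * f'(net), with f'(net) = a * (1 - a)
    delta_out = (target - a_out) * a_out * (1.0 - a_out)
    # Hidden-unit error: blame assigned through the weights each hidden unit feeds
    delta_hid = (W_out.T @ delta_out) * a_hid * (1.0 - a_hid)
    # Weight changes, just like before: dW_ij = lr * delta_j * a_i
    W_out += lr * np.outer(delta_out, a_hid)
    W_hid += lr * np.outer(delta_hid, a_in)
    return W_hid, W_out
```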
It works!
- Backprop nets as universal function approximators.
- Technically, a backprop network can approximate any functional relationship.
- However, its ability to do so depends on the number of hidden units and the initial weights of the network.
Example in JMP
Issues in backprop
- Local minima
- Learning rate
- Initial random weight choice
- Number of hidden units
- Input representations
- Output activation functions
- Catastrophic retroactive interference
Local minima
- Nonlinear activation functions (unlike linear ones) may produce local minima - what to do?
- Momentum
- The weight change is a function of the ΔW determined by the network error and the ΔW of the previous training trial: ΔW_t = η · OuterProduct(δ, a_in) + α · ΔW_(t-1)
- Stochastic units with annealing
- Temperature changes the slope of the logistic.
- Run multiple times with different initial weights and choose the best solution (minimal error).
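A minimal sketch of the momentum update (the learning rate η and momentum coefficient α are my reading of the slide's garbled symbols):

```python
import numpy as np

def momentum_update(W, delta, a_in, dW_prev, lr=0.1, alpha=0.9):
    """Weight change = error-driven change plus a fraction of the previous change."""
    dW = lr * np.outer(delta, a_in) + alpha * dW_prev
    return W + dW, dW  # return dW so it can serve as dW_prev on the next trial
```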
Learning rate
- Steps too big - failure to converge
- Steps too small - slow to converge
- Modern implementations use adaptive learning
rates (e.g., R)
Choice of initial random weights is important
- If they are too large, then units will saturate (be at their min or max activation level) too quickly - remember the effect of f'?
- This results in a local minimum.
- A good rule of thumb is to keep initial values on the order of 1/√k, where k is the fan-in of a typical unit.
- Also, use smaller initial weights if your inputs are not between 0 and 1.
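A minimal sketch of this initialization rule (assuming the 1/√k reading of the rule of thumb above; the function name and uniform range are my own choices):

```python
import numpy as np

def init_weights(n_out, n_in, seed=0):
    """Initialize weights on the order of 1/sqrt(k), where k is the fan-in,
    so units start in the sensitive middle of the sigmoid rather than saturated."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(n_in)
    return rng.uniform(-scale, scale, size=(n_out, n_in))
```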
How many hidden units?
- First, how many layers are necessary?
- At most two (Cybenko, 1988, 1989).
- At most one to approximate any continuous function.
- How many hidden units?
- Arbitrarily high accuracy can be achieved by increasing the number of hidden units.
- It is not generally known how to determine the optimal or required number of hidden units.
- Corollary problem: too many hidden units can give arbitrarily high accuracy on the training data at the cost of overfitting.
- A common strategy is to use the technique of cross-validation to ensure that the solution generalizes.
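A minimal sketch of that strategy using scikit-learn (my own tooling choice; the slides use JMP): pick the hidden-layer size with the best cross-validated score.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# Compare candidate hidden-layer sizes by 5-fold cross-validation.
for n_hidden in (2, 4, 8, 16, 32):
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,), activation="logistic",
                        max_iter=2000, random_state=0)
    score = cross_val_score(net, X, y, cv=5).mean()
    print(f"{n_hidden:3d} hidden units: mean CV accuracy = {score:.3f}")
```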
Optimal network architectures
- Pruning and weight decay
- Weight decay can be used to get rid of useless connections (Hinton, 1986): w_ij_new = (1 - ε) w_ij_old
- Unfortunately, this equation can overly penalize large weights, which cost more than small weights - i.e., large weights decay faster than small ones.
- An alternative that causes small weights to decay faster than large ones is to make the decay rate ε a decreasing function of the weight's magnitude.
- It might also be advantageous to remove units from the architecture.
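A minimal sketch of both forms of decay; the weight-dependent rate shown here, ε/(1 + w²)², is one plausible choice on my part, since the slide's exact equation for ε is not reproduced in the text.

```python
import numpy as np

def simple_decay(W, eps=0.01):
    """Plain weight decay: w_new = (1 - eps) * w_old (large weights decay fastest)."""
    return (1.0 - eps) * W

def relative_decay(W, eps=0.01):
    """Weight-dependent decay rate (illustrative form, not the slide's equation):
    eps / (1 + w^2)^2, so small weights decay faster than large ones."""
    return (1.0 - eps / (1.0 + W**2) ** 2) * W
```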
Input representations
- The way that a problem is represented is critical.
- Example 1: Train a network to respond 1 when the input number is odd.
- Example 2: If the response of the system depends on the order of the inputs, it won't be able to solve it.
- To a network, each unit is arbitrarily ordered!
- i.e., you can turn all of the patterns backward (before training) and the system would perform identically.
- Solla (1988) offered a technique for dealing with the order problem by having the input representation be broadly tuned: adjacent units affect one another.
- For example, input j might really be j + b(i) + b(k), where unit j is between units i and k.
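A minimal sketch of such a broadly tuned input encoding (a simple neighbor-blending scheme of my own choosing, not Solla's exact formulation):

```python
import numpy as np

def broadly_tuned(pattern, b=0.5):
    """Blend each input unit with its immediate neighbors so that adjacent
    units affect one another (wraps around at the edges for simplicity)."""
    pattern = np.asarray(pattern, dtype=float)
    left = np.roll(pattern, 1)
    right = np.roll(pattern, -1)
    return pattern + b * (left + right)

print(broadly_tuned([0, 0, 1, 0, 0]))  # the active unit now "leaks" into its neighbors
```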
Output activation functions
- Note that a sigmoid activation function on the output units will limit output values to (0, 1).
- This value could be transformed, but it is often a good idea to simply use a linear output activation function to do the mapping.
Catastrophic retroactive interference
- McCloskey & Cohen; Ratcliff (discuss)
- Attempts to solve the problem:
- Most have one thing in common: reduce the amount of overlap between representations.
- French's activation sharpening algorithm
- An extra step is added to backprop in which the hidden unit activations are sharpened.
- Sharpening: the most active hidden units have their activations increased whereas the rest are decreased.
- This results in units being more specialized.
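A minimal sketch of such a sharpening step (my own simplification of the idea; French's full algorithm also specifies how the sharpened activations feed into the weight updates):

```python
import numpy as np

def sharpen(a_hid, k=2, factor=0.1):
    """Push the k most active hidden units toward 1 and the rest toward 0,
    making hidden representations more specialized (less overlapping)."""
    a_hid = np.asarray(a_hid, dtype=float)
    sharpened = a_hid * (1.0 - factor)                          # everyone decays a little...
    top = np.argsort(a_hid)[-k:]                                # ...except the k most active units,
    sharpened[top] = a_hid[top] + factor * (1.0 - a_hid[top])   # which move toward 1
    return sharpened

print(sharpen([0.2, 0.9, 0.6, 0.1], k=1))
```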
Orthogonalizing hidden unit representations
- For example, using Gaussian units or other activation functions so that representations have less in common (fewer units are active for any given input pattern).
- Can also be accomplished through an N-M-N encoder or an unsupervised PCA learning network.
- Drawbacks:
- Less distributed representations are less efficient at storage.
- e.g., a completely localized representation requires one unit for each concept to be stored.
- Poorer generalization to similar patterns.
Rehearsal schemes
- This technique doesn't try to sparsify the representations but tries to side-step the problem.
- Ratcliff (1990) proposed mixing old trials in with new ones during training on the new ones to ensure that the system doesn't forget old relationships.
- McClelland, McNaughton, & O'Reilly (1995)
- Used a combination of:
- developing sparse representations (purportedly accomplished by the hippocampus) and
- interleaved learning (a rehearsal scheme) in which the hippocampus purportedly plays back recent memories to the neocortex during sleep.
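A minimal sketch of interleaved (rehearsal) training - mixing a sample of old patterns into training on the new ones (details such as the mixing ratio are my own choices):

```python
import random

def interleaved_batches(old_patterns, new_patterns, n_old_per_batch=2):
    """Yield training batches of new patterns with a few old patterns mixed in,
    so earlier associations keep getting rehearsed while new ones are learned."""
    for new_item in new_patterns:
        rehearsal = random.sample(old_patterns, min(n_old_per_batch, len(old_patterns)))
        batch = [new_item] + rehearsal
        random.shuffle(batch)
        yield batch
```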
Drawbacks to rehearsal schemes
- Training is inefficient.
- Must somehow store old relationships and keep playing them back.
- The sleep-based rehearsal scheme just seems downright hokey.
Summary of the Catastrophic Retroactive Interference problem
- It is only a problem with a dynamically changing environment.
- Methods exist to deal with the problem. Some of these methods are neurologically plausible.
- E.g., limited receptive fields and lack of full connectivity.
- NOTE: The attempts to reduce the interference do not eliminate it - interference still occurs for very similar input patterns (as it should).
Autoassociative N-M-N encoders
- Backprop can be used for autoassociation.
- The M hidden units serve as a bottleneck, a reduced representation of the input.
- The hidden unit representation will converge on the first M principal components of the input space.
- Thus, the hidden units will tend to create an orthogonal representation of the inputs.
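A minimal sketch of an N-M-N encoder compared against PCA (scikit-learn is my own tooling choice; with linear activations the M-unit bottleneck spans essentially the same subspace as the first M principal components, so the reconstruction errors come out comparable):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                      # N = 8 input/output units
X[:, 4:] = X[:, :4] @ rng.normal(size=(4, 4))      # give the data low-rank structure

M = 3  # bottleneck size

# N-M-N autoassociator: train the network to reproduce its own input.
net = MLPRegressor(hidden_layer_sizes=(M,), activation="identity",
                   max_iter=5000, random_state=0)
net.fit(X, X)
ae_err = np.mean((net.predict(X) - X) ** 2)

# PCA reconstruction from the first M principal components.
pca = PCA(n_components=M).fit(X)
pca_err = np.mean((pca.inverse_transform(pca.transform(X)) - X) ** 2)

print(f"autoencoder MSE: {ae_err:.4f}   PCA({M}) MSE: {pca_err:.4f}")
```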
In Sum
- Backprop has been an extremely popular neural network method because of its ability to function as a universal function approximator.
- Backprop, however, is slow (as originally conceived), suffers from local minima (all nonlinear methods do), can suffer from massive retroactive interference, and lacks biological plausibility (the same weights are used to feed activation forward and to backpropagate error).
- Nevertheless, it is an excellent tool to add to your repertoire.