Title: Connectionist Models: Backprop
1Connectionist Models: Backprop
- Jerome Feldman
- CS182/CogSci110/Ling109
- Spring 2008
2Recruiting connections
- Given that LTP involves synaptic strength changes, and Hebb's rule involves coincident-activation-based strengthening of connections
- How can connections between two nodes be recruited using Hebb's rule?
3(Diagram: nodes X and Y)
4(Diagram: nodes X and Y)
5Finding a Connection
P (1-F) BK
- P Probability of NO link between X and Y
- N Number of units in a layer
- B Number of randomly outgoing units per unit
- F B/N , the branching factor
- K Number of Intermediate layers, 2 in the
example
N
106 107
108
K
Paths (1-P k-1)(N/F) (1-P k-1)B
6Finding a Connection in Random Networks
For networks with N nodes and branching factor F, there is a high probability of finding good links. (Valiant 1995)
7Recruiting a Connection in Random Networks
- Informal Algorithm (sketched in code below):
- Activate the two nodes to be linked
- Have nodes with double activation strengthen their active synapses (Hebb)
- There is evidence for a "now print" signal based on LTP (episodic memory)
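A toy Python sketch of this informal procedure with a single intermediate layer; the network sizes, the 0.1/1.0 weight values, and the function names are assumptions for illustration, not from the slides.

```python
import random

random.seed(0)
N, B = 1000, 30                       # units per layer, outgoing links per unit (toy sizes)
# random feed-forward links: unit u projects to B randomly chosen units in the next layer
out_links = {u: random.sample(range(N), B) for u in range(N)}
weights = {(u, v): 0.1 for u in out_links for v in out_links[u]}

def recruit(x_targets, y_unit):
    """Units activated by X that also connect to the active Y are treated as doubly
    activated; their synapses onto Y are strengthened in one shot (Hebb + 'now print')."""
    doubly_active = [u for u in x_targets if y_unit in out_links[u]]
    for u in doubly_active:
        weights[(u, y_unit)] += 1.0   # one-trial strengthening of the active synapse
    return doubly_active

# activate the two nodes to be linked: X drives its targets, Y is clamped on
X, Y = 0, 1
recruited = recruit(out_links[X], Y)
print(f"{len(recruited)} intermediate units recruited to link X and Y")
```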
8(No Transcript)
9(No Transcript)
10(Diagram: Has-color / Green, Has-shape / Round)
11(Diagram: Has-color / GREEN, Has-shape / ROUND)
12Hebb's rule is not sufficient
- What happens if the neural circuit fires perfectly, but the result is very bad for the animal, like eating something sickening?
- A pure invocation of Hebb's rule would strengthen all participating connections, which can't be good.
- On the other hand, it isn't right to weaken all the active connections involved; much of the activity was just recognizing the situation. We would like to change only those connections that led to the wrong decision.
- No one knows how to specify a learning rule that will change exactly the offending connections when an error occurs.
- Computer systems, and presumably nature as well, rely upon statistical learning rules that tend to make the right changes over time. More in later lectures.
13Hebb's rule is insufficient
- Should you punish all the connections?
14Models of Learning
- Hebbian: coincidence
- Supervised: correction (backprop)
- Recruitment: one-trial
- Reinforcement Learning: delayed reward
- Unsupervised: similarity
15Abstract Neuron
Threshold Activation Function
16Boolean XOR
(Diagram: 2-2-1 network for XOR with inputs x1, x2, hidden units h1, h2, and output o)
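As a concrete illustration (not from the slides), here is one hand-set weight assignment under which this 2-2-1 threshold network computes XOR: h1 acts as OR, h2 as AND, and the output fires when OR is true but AND is false.

```python
def step(x, theta):
    """Threshold activation: 1 if the net input exceeds the threshold theta."""
    return 1 if x > theta else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2, 0.5)   # h1 = x1 OR x2
    h2 = step(x1 + x2, 1.5)   # h2 = x1 AND x2
    o  = step(h1 - h2, 0.5)   # o fires when OR is true but AND is false
    return o

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table
```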
17Supervised Learning - Backprop
- How do we train the weights of the network?
- Basic Concepts:
- Use a continuous, differentiable activation function (sigmoid)
- Use the idea of gradient descent on the error surface
- Extend to multiple layers
18Backprop
- To learn on data which is not linearly separable:
- Build multiple-layer networks (hidden layers)
- Use a sigmoid squashing function instead of a step function
19Tasks
- Unconstrained pattern classification
- Credit assessment
- Digit Classification
- Speech Recognition
- Function approximation
- Learning control
- Stock prediction
-
20Sigmoid Squashing Function
(Diagram: a unit with inputs y1 ... yn and bias input y0 = 1, weights w0 ... wn, and a sigmoid squashing function between the net input and the output)
21The Sigmoid Function
(Plot: output y = a versus net input x_net)
22The Sigmoid Function
(Plot: y = a versus x_net,i; the output is bounded between 0 and 1)
23The Sigmoid Function
(Plot: y = a versus x_net; output bounded between 0 and 1, with the greatest sensitivity to input near the middle of the curve)
24Nice Property of Sigmoids
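The formula on this slide is not in the transcript; the "nice property" meant here is presumably the standard one, that the sigmoid's derivative can be written in terms of its own output:
σ(x) = 1 / (1 + e^(−x)),   dσ/dx = σ(x) · (1 − σ(x)) = y · (1 − y)
so during backprop the derivative at a unit costs only one multiplication once its output y is known.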
25Gradient Descent
26Gradient Descent on an error
27Learning as Gradient Descent
Complex error surface for a hypothetical network training problem
Error surface for a two-weight linear network
28Learning Rule: Gradient Descent on a Root Mean Square (RMS) Error
- Learn the weights wi that minimize the squared error E = ½ Σi (ti − yi)²
29Gradient Descent
30Gradient Descent
- global minimum: this is your goal
- it should really be 4-D (3 weights, plus the error) but you get the idea
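A minimal numeric sketch of gradient descent on the squared error of a two-weight linear unit, as in the pictured error surface; the toy data, learning rate, and true weights are made up for illustration.

```python
import numpy as np

# toy data for a 2-weight linear unit: y = w1*x1 + w2*x2
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
t = np.array([1.0, 2.0, 3.0, 1.5])   # targets generated by w = (2, 1)

w = np.zeros(2)                      # start somewhere on the error surface
eta = 0.1                            # learning rate

for epoch in range(200):
    y = X @ w                        # outputs of the linear unit
    grad = -(t - y) @ X              # dE/dw for E = 1/2 * sum (t - y)^2
    w -= eta * grad                  # step downhill on the error surface

print(w)                             # converges toward the global minimum (2, 1)
```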
31Backpropagation Algorithm
- Generalization to multiple layers and multiple
output units
32Back-Propagation Algorithm
(Equation: sigmoid activation function)
- We define the error term for a single node to be ti − yi
33Backprop Details
34The output layer
The derivative of the sigmoid is just y(1 − y)
35Nice Property of Sigmoids
36The hidden layer
37Let's just do an example
(Diagram: small example network; values shown include 0, 0.8, 0.6, 0.5)
y0 = 1/(1 + e^(−0.5)) = 0.6224
E = Error = ½ Σi (ti − yi)²
E = ½ (t0 − y0)²
With t0 = 0:  E = ½ (0 − 0.6224)² = 0.1937
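Checking the slide's arithmetic (assuming, as the figure suggests, a net input of 0.5 into a sigmoid output unit with target 0):

```python
import math

y = 1 / (1 + math.exp(-0.5))     # sigmoid of the net input 0.5
E = 0.5 * (0 - y) ** 2           # squared error against target t0 = 0
print(round(y, 4), round(E, 4))  # 0.6225 (slide truncates to 0.6224), 0.1937
```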
38An informal account of BackProp
After amassing Δw for all weights, change each weight a little bit, as determined by the learning rate
39Backprop learning algorithm (incremental-mode)
- n = 1
- initialize w(n) randomly
- while (stopping criterion not satisfied and n < max_iterations)
- for each example (x, d)
- run the network with input x and compute the output y
- update the weights in backward order, starting from those of the output layer
- with Δw computed using the (generalized) Delta rule
- end-for
- n = n + 1
- end-while
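A self-contained numpy sketch of this incremental-mode loop for a single-hidden-layer sigmoid network; names like train_example, the bias handling, and the toy XOR data are assumptions for illustration, not the course's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_example(x, d, W1, W2, eta=0.5):
    """One incremental-mode update: run the net on input x, then update weights backward."""
    x = np.append(x, 1.0)                         # bias input
    h = np.append(sigmoid(W1 @ x), 1.0)           # hidden activations + bias unit
    y = sigmoid(W2 @ h)                           # output activations

    delta_out = y * (1 - y) * (d - y)             # generalized delta rule, output units
    delta_hid = h * (1 - h) * (W2.T @ delta_out)  # hidden units: back-propagated error
    W2 += eta * np.outer(delta_out, h)            # update output-layer weights first
    W1 += eta * np.outer(delta_hid[:-1], x)       # then the hidden-layer weights
    return 0.5 * np.sum((d - y) ** 2)             # squared error on this example

# toy run: learn XOR with 2 hidden units and small random initial weights
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, size=(2, 3))              # 2 hidden units, 2 inputs + bias
W2 = rng.normal(0, 0.5, size=(1, 3))              # 1 output unit, 2 hidden + bias
data = [([0., 0.], [0.]), ([0., 1.], [1.]), ([1., 0.], [1.]), ([1., 1.], [0.])]

n, err = 1, 1.0
while err > 0.01 and n < 20000:                   # stopping criterion not satisfied
    err = sum(train_example(np.array(x), np.array(d), W1, W2) for x, d in data)
    n += 1
# (with an unlucky initialization this can stall in a local minimum; see the convergence slide)
print(n, err)
```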
40Backpropagation Algorithm
- Initialize all weights to small random numbers
- For each training example, do:
- For each hidden unit h: compute its output
- For each output unit k: compute its output
- For each output unit k: compute its error term δk
- For each hidden unit h: compute its error term δh
- Update each network weight wij with Δwij (formulas below)
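The per-step formulas were images in the original and are missing from the transcript; the standard generalized delta rule they presumably state is:
- output unit k:  δk = yk (1 − yk)(tk − yk)
- hidden unit h:  δh = yh (1 − yh) Σk wkh δk
- weight update:  Δwij = η δj xij, where xij is the input along connection i→j and η is the learning rate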
41Backpropagation Algorithm
42What if all the input-to-hidden-node weights are initially equal?
43Momentum term
- The speed of learning is governed by the learning rate.
- If the rate is low, convergence is slow.
- If the rate is too high, the error oscillates without reaching the minimum.
- Momentum tends to smooth out small weight-error fluctuations.
The momentum accelerates the descent in steady downhill directions and has a stabilizing effect in directions that oscillate in time.
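In symbols (the slide's own equation did not survive the transcript), the usual momentum form of the update is
Δw(n) = −η · ∂E/∂w + α · Δw(n−1),  with 0 ≤ α < 1 the momentum coefficient.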
44Convergence
- May get stuck in local minima
- Weights may diverge
- ... but often works well in practice
- Representation power:
- 2-layer networks: any continuous function
- 3-layer networks: any function
45Pattern Separation and NN architecture
46Local Minimum
USE A RANDOM COMPONENT: SIMULATED ANNEALING
47Adjusting Learning Rate and the Hessian
- The Hessian H is the second derivative of E with respect to w.
- The Hessian tells you about the shape of the cost surface.
- The eigenvalues of H are a measure of the steepness of the surface along the curvature directions.
- A large eigenvalue -> steep curvature -> need a small learning rate
- The learning rate should be proportional to 1/eigenvalue
48Overfitting and generalization
TOO MANY HIDDEN NODES TENDS TO OVERFIT
49Stopping criteria
- Sensible stopping criteria:
- total mean squared error change: Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
- generalization-based criterion: After each epoch, the NN is tested for generalization. If the generalization performance is adequate, then stop. If this stopping criterion is used, the part of the training set used for testing the network's generalization will not be used for updating the weights.
50Overfitting in ANNs
51Early Stopping (Important!!!)
- Stop training when error goes up on validation
set
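A minimal sketch of that rule; the helper names and the patience parameter are assumptions, not from the slides: keep a held-out validation set, track its error each epoch, and stop once the error starts rising.

```python
def train_with_early_stopping(train_epoch, validation_error, max_epochs=1000, patience=5):
    """train_epoch() does one pass over the training set; validation_error() scores the held-out set."""
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        err = validation_error()
        if err < best_err:                  # validation error still improving
            best_err, best_epoch, bad_epochs = err, epoch, 0
            # (in practice you would also snapshot the weights here)
        else:                               # error went up on the validation set
            bad_epochs += 1
            if bad_epochs >= patience:      # stop after a few consecutive bad epochs
                break
    return best_epoch, best_err
```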
52Stopping criteria (repeat of slide 49)
53Architectural Considerations
What is the right size network for a given job? How many hidden units?
Too many: no generalization. Too few: no solution.
Possible answer: constructive algorithms, e.g. Cascade Correlation (Fahlman and Lebiere 1990), etc.
54Network Topology
- The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.
- Two types of adaptive algorithms can be used:
- start from a large network and successively remove some nodes and links until network performance degrades
- begin with a small network and introduce new neurons until performance is satisfactory
55Cascade Correlation
- It starts with a minimal network, consisting only of an input and an output layer.
- Minimizing the overall error of the net, it adds new hidden units to the hidden layer step by step.
- Cascade-Correlation is a supervised learning architecture which builds a near-minimal multi-layer network topology.
- The two advantages of this architecture are that:
- there is no need for a user to worry about the topology of the network, and that
- Cascade-Correlation learns much faster than the usual learning algorithms.
56Supervised vs Unsupervised Learning
- Backprop requires a 'target'
- how realistic is that?
- Hebbian learning is unsupervised, but limited in power
- How can we combine the power of backprop (and friends) with the ideal of unsupervised learning?
57Autoassociative Networks
- Network trained to reproduce the input at the output layer
- Non-trivial if the number of hidden units is smaller than the number of inputs/outputs
- Forced to develop compressed representations of the patterns
- Hidden unit representations may reveal natural kinds (e.g. vowels vs consonants)
- Problem of an explicit teacher is circumvented: a copy of the input serves as the target
(Diagram: autoassociative network, input reproduced at the output)
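A compact numpy sketch of the idea; the 8-3-8 sizes, learning rate, and epoch count are assumptions for illustration. The net is trained to reproduce its one-hot input through a 3-unit bottleneck, so the hidden layer must form a compressed code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
X = np.eye(8)                           # the classic 8 one-hot patterns to auto-associate
W1 = rng.normal(0, 0.5, (3, 9))         # 8 inputs (+ bias) squeezed into 3 hidden units
W2 = rng.normal(0, 0.5, (8, 4))         # 3 hidden (+ bias) expanded back to 8 outputs

eta = 1.0
for epoch in range(5000):
    for x in X:
        xb = np.append(x, 1.0)
        h = np.append(sigmoid(W1 @ xb), 1.0)
        y = sigmoid(W2 @ h)
        d_out = y * (1 - y) * (x - y)   # the target is simply a copy of the input
        d_hid = h * (1 - h) * (W2.T @ d_out)
        W2 += eta * np.outer(d_out, h)
        W1 += eta * np.outer(d_hid[:-1], xb)

codes = sigmoid(W1 @ np.hstack([X, np.ones((8, 1))]).T).T
print(np.round(codes, 2))               # 8 compressed 3-unit codes, roughly distinct per pattern
```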
58Problems and Networks
- Some problems have natural "good" solutions
- Solving a problem may be possible by providing the right armory of general-purpose tools, and recruiting them as needed
- Networks are general-purpose tools.
- Choice of network type, training, architecture, etc. greatly influences the chances of successfully solving a problem
- Tension: tailoring tools for a specific job vs exploiting a general-purpose learning mechanism
59Summary
- Multiple-layer feed-forward networks
- Replace the step function with a (differentiable) sigmoid
- Learn weights by gradient descent on an error function
- Backpropagation algorithm for learning
- Avoid overfitting by early stopping
60ALVINN drives 70 mph on highways
61Use MLP Neural Networks when
- (vectored) real inputs, (vectored) real outputs
- You're not interested in understanding how it works
- Long training times are acceptable
- Short execution (prediction) times are required
- Robust to noise in the dataset
62Applications of FFNN
- Classification, pattern recognition:
- FFNN can be applied to tackle non-linearly separable learning problems:
- Recognizing printed or handwritten characters
- Face recognition
- Classification of loan applications into credit-worthy and non-credit-worthy groups
- Analysis of sonar and radar signals to determine the nature of the source of a signal
- Regression and forecasting:
- FFNN can be applied to learn non-linear functions (regression), and in particular functions whose inputs are a sequence of measurements over time (time series).
63(No Transcript)
64Extensions of Backprop Nets
- Recurrent Architectures
- Backprop through time
65Elman Nets and Jordan Nets
- Updating the context as we receive input
- In Jordan nets we model forgetting as well
- The recurrent connections have fixed weights
- You can train these networks using good ol' backprop
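A small sketch of the Elman-style forward pass; the class and parameter names are assumptions, and no training is shown. The context layer is a copy of the previous hidden state, fed back as extra input on the next time step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ElmanNet:
    """Hidden state is copied into a context layer and fed back on the next time step."""
    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in  = rng.normal(0, 0.5, (n_hid, n_in))
        self.W_ctx = rng.normal(0, 0.5, (n_hid, n_hid))  # context-to-hidden weights (trainable)
        self.W_out = rng.normal(0, 0.5, (n_out, n_hid))
        self.context = np.zeros(n_hid)                   # starts empty

    def step(self, x):
        h = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
        self.context = h.copy()   # the copy connection (hidden -> context) is the fixed-weight link
        return sigmoid(self.W_out @ h)

net = ElmanNet(n_in=2, n_hid=3, n_out=1)
for x in [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.])]:
    print(net.step(x))            # the output now depends on the input history
```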
66Recurrent Backprop
(Diagram: a three-node recurrent network a, b, c with weights w1-w4, unrolled for 3 iterations)
- we'll pretend to step through the network one iteration at a time
- backprop as usual, but average equivalent weights (e.g. all 3 highlighted edges on the right are equivalent)