Title: Connectionist Models: Backprop
1Connectionist Models: Backprop
- Jerome Feldman
- CS182/CogSci110/Ling109
- Spring 2008
2Recruiting connections
- Given that LTP involves synaptic strength changes, and Hebb's rule involves coincident-activation-based strengthening of connections
- How can connections between two nodes be recruited using Hebb's rule?
3(Diagram: nodes X and Y)
4(Diagram: nodes X and Y)
5Finding a Connection
P (1-F) BK
- P Probability of NO link between X and Y
- N Number of units in a layer
- B Number of randomly outgoing units per unit
- F B/N , the branching factor
- K Number of Intermediate layers, 2 in the
example
N
106 107
108
K
Paths (1-P k-1)(N/F) (1-P k-1)B
6Finding a Connection in Random Networks
For networks with N nodes and branching factor F, there is a high probability of finding good links. (Valiant 1995)
7Recruiting a Connection in Random Networks
- Informal Algorithm (sketched in code below):
- Activate the two nodes to be linked
- Have nodes with double activation strengthen their active synapses (Hebb)
- There is evidence for a "now print" signal based on LTP (episodic memory)
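A toy Python sketch of this informal procedure with a single intermediate layer; the network sizes, the 0.1/1.0 weight values, and the function names are assumptions for illustration, not from the slides.

```python
import random

random.seed(0)
N, B = 1000, 30                       # units per layer, outgoing links per unit (toy sizes)
# random feed-forward links: unit u projects to B randomly chosen units in the next layer
out_links = {u: random.sample(range(N), B) for u in range(N)}
weights = {(u, v): 0.1 for u in out_links for v in out_links[u]}

def recruit(x_targets, y_unit):
    """Units activated by X that also connect to the active Y are treated as doubly
    activated; their synapses onto Y are strengthened in one shot (Hebb + 'now print')."""
    doubly_active = [u for u in x_targets if y_unit in out_links[u]]
    for u in doubly_active:
        weights[(u, y_unit)] += 1.0   # one-trial strengthening of the active synapse
    return doubly_active

# activate the two nodes to be linked: X drives its targets, Y is clamped on
X, Y = 0, 1
recruited = recruit(out_links[X], Y)
print(f"{len(recruited)} intermediate units recruited to link X and Y")
```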
8(No Transcript)
9(No Transcript)
10(Diagram: Has-color / Green, Has-shape / Round)
11(Diagram: Has-color / GREEN, Has-shape / ROUND)
12Hebb's rule is not sufficient
- What happens if the neural circuit fires perfectly, but the result is very bad for the animal, like eating something sickening?
- A pure invocation of Hebb's rule would strengthen all participating connections, which can't be good.
- On the other hand, it isn't right to weaken all the active connections involved; much of the activity was just recognizing the situation. We would like to change only those connections that led to the wrong decision.
- No one knows how to specify a learning rule that will change exactly the offending connections when an error occurs.
- Computer systems, and presumably nature as well, rely upon statistical learning rules that tend to make the right changes over time. More in later lectures.
13Hebb's rule is insufficient
- Should you punish all the connections?
14Models of Learning
- Hebbian: coincidence
- Supervised: correction (backprop)
- Recruitment: one-trial
- Reinforcement Learning: delayed reward
- Unsupervised: similarity
15Abstract Neuron
Threshold Activation Function
16Boolean XOR
(Diagram: 2-2-1 network for XOR with inputs x1, x2, hidden units h1, h2, and output o)
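As a concrete illustration (not from the slides), here is one hand-set weight assignment under which this 2-2-1 threshold network computes XOR: h1 acts as OR, h2 as AND, and the output fires when OR is true but AND is false.

```python
def step(x, theta):
    """Threshold activation: 1 if the net input exceeds the threshold theta."""
    return 1 if x > theta else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2, 0.5)   # h1 = x1 OR x2
    h2 = step(x1 + x2, 1.5)   # h2 = x1 AND x2
    o  = step(h1 - h2, 0.5)   # o fires when OR is true but AND is false
    return o

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table
```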
17Supervised Learning - Backprop
- How do we train the weights of the network?
- Basic Concepts:
- Use a continuous, differentiable activation function (sigmoid)
- Use the idea of gradient descent on the error surface
- Extend to multiple layers
18Backprop
- To learn on data which is not linearly separable:
- Build multiple-layer networks (hidden layers)
- Use a sigmoid squashing function instead of a step function
19Tasks
- Unconstrained pattern classification
- Credit assessment
- Digit Classification
- Speech Recognition
- Function approximation
- Learning control
- Stock prediction
-
20Sigmoid Squashing Function
(Diagram: a unit with inputs y1 ... yn and bias input y0 = 1, weights w0 ... wn, and a sigmoid squashing function between the net input and the output)
21The Sigmoid Function
(Plot: output y = a versus net input x_net)
22The Sigmoid Function
(Plot: y = a versus x_net,i; the output is bounded between 0 and 1)
23The Sigmoid Function
(Plot: y = a versus x_net; output bounded between 0 and 1, with the greatest sensitivity to input near the middle of the curve)
24Nice Property of Sigmoids
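The formula on this slide is not in the transcript; the "nice property" meant here is presumably the standard one, that the sigmoid's derivative can be written in terms of its own output:
σ(x) = 1 / (1 + e^(−x)),   dσ/dx = σ(x) · (1 − σ(x)) = y · (1 − y)
so during backprop the derivative at a unit costs only one multiplication once its output y is known.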
25Gradient Descent
26Gradient Descent on an error
27Learning as Gradient Descent
Complex error surface for a hypothetical network training problem
Error surface for a two-weight linear network
28Learning Rule: Gradient Descent on a Root Mean Square (RMS) Error
- Learn the weights wi that minimize the squared error E = ½ Σi (ti − yi)²
29Gradient Descent
30Gradient Descent
- global minimum: this is your goal
- it should really be 4-D (3 weights, plus the error) but you get the idea
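A minimal numeric sketch of gradient descent on the squared error of a two-weight linear unit, as in the pictured error surface; the toy data, learning rate, and true weights are made up for illustration.

```python
import numpy as np

# toy data for a 2-weight linear unit: y = w1*x1 + w2*x2
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
t = np.array([1.0, 2.0, 3.0, 1.5])   # targets generated by w = (2, 1)

w = np.zeros(2)                      # start somewhere on the error surface
eta = 0.1                            # learning rate

for epoch in range(200):
    y = X @ w                        # outputs of the linear unit
    grad = -(t - y) @ X              # dE/dw for E = 1/2 * sum (t - y)^2
    w -= eta * grad                  # step downhill on the error surface

print(w)                             # converges toward the global minimum (2, 1)
```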
31Backpropagation Algorithm
- Generalization to multiple layers and multiple
output units
32Back-Propagation Algorithm
(Equation: sigmoid activation function)
- We define the error term for a single node to be ti − yi
33Backprop Details
34The output layer
The derivative of the sigmoid is just y(1 − y)
35Nice Property of Sigmoids
36The hidden layer
37Let's just do an example
(Diagram: small example network; values shown include 0, 0.8, 0.6, 0.5)
y0 = 1/(1 + e^(−0.5)) = 0.6224
E = Error = ½ Σi (ti − yi)²
E = ½ (t0 − y0)²
With t0 = 0:  E = ½ (0 − 0.6224)² = 0.1937
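Checking the slide's arithmetic (assuming, as the figure suggests, a net input of 0.5 into a sigmoid output unit with target 0):

```python
import math

y = 1 / (1 + math.exp(-0.5))     # sigmoid of the net input 0.5
E = 0.5 * (0 - y) ** 2           # squared error against target t0 = 0
print(round(y, 4), round(E, 4))  # 0.6225 (slide truncates to 0.6224), 0.1937
```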
38An informal account of BackProp
After amassing Δw for all weights, change each weight a little bit, as determined by the learning rate
39Backprop learning algorithm (incremental-mode)
- n = 1
- initialize w(n) randomly
- while (stopping criterion not satisfied and n < max_iterations)
- for each example (x, d)
- run the network with input x and compute the output y
- update the weights in backward order, starting from those of the output layer
- with Δw computed using the (generalized) Delta rule
- end-for
- n = n + 1
- end-while
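A self-contained numpy sketch of this incremental-mode loop for a single-hidden-layer sigmoid network; names like train_example, the bias handling, and the toy XOR data are assumptions for illustration, not the course's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_example(x, d, W1, W2, eta=0.5):
    """One incremental-mode update: run the net on input x, then update weights backward."""
    x = np.append(x, 1.0)                         # bias input
    h = np.append(sigmoid(W1 @ x), 1.0)           # hidden activations + bias unit
    y = sigmoid(W2 @ h)                           # output activations

    delta_out = y * (1 - y) * (d - y)             # generalized delta rule, output units
    delta_hid = h * (1 - h) * (W2.T @ delta_out)  # hidden units: back-propagated error
    W2 += eta * np.outer(delta_out, h)            # update output-layer weights first
    W1 += eta * np.outer(delta_hid[:-1], x)       # then the hidden-layer weights
    return 0.5 * np.sum((d - y) ** 2)             # squared error on this example

# toy run: learn XOR with 2 hidden units and small random initial weights
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, size=(2, 3))              # 2 hidden units, 2 inputs + bias
W2 = rng.normal(0, 0.5, size=(1, 3))              # 1 output unit, 2 hidden + bias
data = [([0., 0.], [0.]), ([0., 1.], [1.]), ([1., 0.], [1.]), ([1., 1.], [0.])]

n, err = 1, 1.0
while err > 0.01 and n < 20000:                   # stopping criterion not satisfied
    err = sum(train_example(np.array(x), np.array(d), W1, W2) for x, d in data)
    n += 1
# (with an unlucky initialization this can stall in a local minimum; see the convergence slide)
print(n, err)
```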
40Backpropagation Algorithm
- Initialize all weights to small random numbers
- For each training example, do:
- For each hidden unit h: compute its output
- For each output unit k: compute its output
- For each output unit k: compute its error term δk
- For each hidden unit h: compute its error term δh
- Update each network weight wij with Δwij (formulas below)
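The per-step formulas were images in the original and are missing from the transcript; the standard generalized delta rule they presumably state is:
- output unit k:  δk = yk (1 − yk)(tk − yk)
- hidden unit h:  δh = yh (1 − yh) Σk wkh δk
- weight update:  Δwij = η δj xij, where xij is the input along connection i→j and η is the learning rate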
41Backpropagation Algorithm
42What if all the input-to-hidden-node weights are initially equal?
43Momentum term
- The speed of learning is governed by the learning rate.
- If the rate is low, convergence is slow.
- If the rate is too high, the error oscillates without reaching the minimum.
- Momentum tends to smooth out small weight-error fluctuations.
The momentum accelerates the descent in steady downhill directions and has a stabilizing effect in directions that oscillate in time.
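In symbols (the slide's own equation did not survive the transcript), the usual momentum form of the update is
Δw(n) = −η · ∂E/∂w + α · Δw(n−1),  with 0 ≤ α < 1 the momentum coefficient.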
44Convergence
- May get stuck in local minima
- Weights may diverge
- ... but often works well in practice
- Representation power:
- 2-layer networks: any continuous function
- 3-layer networks: any function
45Pattern Separation and NN architecture
46Local Minimum
USE A RANDOM COMPONENT: SIMULATED ANNEALING
47Adjusting Learning Rate and the Hessian
- The Hessian H is the second derivative of E with respect to w.
- The Hessian tells you about the shape of the cost surface.
- The eigenvalues of H are a measure of the steepness of the surface along the curvature directions.
- A large eigenvalue -> steep curvature -> need a small learning rate
- The learning rate should be proportional to 1/eigenvalue
48Overfitting and generalization
TOO MANY HIDDEN NODES TENDS TO OVERFIT
49Stopping criteria
- Sensible stopping criteria:
- total mean squared error change: Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
- generalization-based criterion: After each epoch, the NN is tested for generalization. If the generalization performance is adequate, then stop. If this stopping criterion is used, the part of the training set used for testing the network's generalization will not be used for updating the weights.
50Overfitting in ANNs
51Early Stopping (Important!!!)
- Stop training when error goes up on validation
set
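A minimal sketch of that rule; the helper names and the patience parameter are assumptions, not from the slides: keep a held-out validation set, track its error each epoch, and stop once the error starts rising.

```python
def train_with_early_stopping(train_epoch, validation_error, max_epochs=1000, patience=5):
    """train_epoch() does one pass over the training set; validation_error() scores the held-out set."""
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        err = validation_error()
        if err < best_err:                  # validation error still improving
            best_err, best_epoch, bad_epochs = err, epoch, 0
            # (in practice you would also snapshot the weights here)
        else:                               # error went up on the validation set
            bad_epochs += 1
            if bad_epochs >= patience:      # stop after a few consecutive bad epochs
                break
    return best_epoch, best_err
```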
52Stopping criteria (repeat of slide 49)
53Architectural Considerations
What is the right size network for a given job? How many hidden units?
Too many: no generalization. Too few: no solution.
Possible answer: constructive algorithms, e.g. Cascade Correlation (Fahlman and Lebiere 1990), etc.
54Network Topology
- The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.
- Two types of adaptive algorithms can be used:
- start from a large network and successively remove some nodes and links until network performance degrades
- begin with a small network and introduce new neurons until performance is satisfactory
55Cascade Correlation
- It starts with a minimal network, consisting only of an input and an output layer.
- Minimizing the overall error of the net, it adds new hidden units to the hidden layer step by step.
- Cascade-Correlation is a supervised learning architecture which builds a near-minimal multi-layer network topology.
- The two advantages of this architecture are that:
- there is no need for a user to worry about the topology of the network, and that
- Cascade-Correlation learns much faster than the usual learning algorithms.
56Supervised vs Unsupervised Learning
- Backprop requires a 'target'
- how realistic is that?
- Hebbian learning is unsupervised, but limited in power
- How can we combine the power of backprop (and friends) with the ideal of unsupervised learning?
57Autoassociative Networks
- Network trained to reproduce the input at the output layer
- Non-trivial if the number of hidden units is smaller than the number of inputs/outputs
- Forced to develop compressed representations of the patterns
- Hidden unit representations may reveal natural kinds (e.g. vowels vs consonants)
- Problem of an explicit teacher is circumvented: a copy of the input serves as the target
(Diagram: autoassociative network, input reproduced at the output)
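A compact numpy sketch of the idea; the 8-3-8 sizes, learning rate, and epoch count are assumptions for illustration. The net is trained to reproduce its one-hot input through a 3-unit bottleneck, so the hidden layer must form a compressed code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
X = np.eye(8)                           # the classic 8 one-hot patterns to auto-associate
W1 = rng.normal(0, 0.5, (3, 9))         # 8 inputs (+ bias) squeezed into 3 hidden units
W2 = rng.normal(0, 0.5, (8, 4))         # 3 hidden (+ bias) expanded back to 8 outputs

eta = 1.0
for epoch in range(5000):
    for x in X:
        xb = np.append(x, 1.0)
        h = np.append(sigmoid(W1 @ xb), 1.0)
        y = sigmoid(W2 @ h)
        d_out = y * (1 - y) * (x - y)   # the target is simply a copy of the input
        d_hid = h * (1 - h) * (W2.T @ d_out)
        W2 += eta * np.outer(d_out, h)
        W1 += eta * np.outer(d_hid[:-1], xb)

codes = sigmoid(W1 @ np.hstack([X, np.ones((8, 1))]).T).T
print(np.round(codes, 2))               # 8 compressed 3-unit codes, roughly distinct per pattern
```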
58Problems and Networks
- Some problems have natural "good" solutions
- Solving a problem may be possible by providing the right armory of general-purpose tools, and recruiting them as needed
- Networks are general-purpose tools.
- Choice of network type, training, architecture, etc. greatly influences the chances of successfully solving a problem
- Tension: tailoring tools for a specific job vs exploiting a general-purpose learning mechanism
59Summary
- Multiple-layer feed-forward networks
- Replace the step function with a (differentiable) sigmoid
- Learn weights by gradient descent on an error function
- Backpropagation algorithm for learning
- Avoid overfitting by early stopping
60ALVINN drives 70 mph on highways
61Use MLP Neural Networks when
- (vectored) real inputs, (vectored) real outputs
- You're not interested in understanding how it works
- Long training times are acceptable
- Short execution (prediction) times are required
- Robust to noise in the dataset
62Applications of FFNN
- Classification, pattern recognition:
- FFNN can be applied to tackle non-linearly separable learning problems:
- Recognizing printed or handwritten characters
- Face recognition
- Classification of loan applications into credit-worthy and non-credit-worthy groups
- Analysis of sonar and radar signals to determine the nature of the source of a signal
- Regression and forecasting:
- FFNN can be applied to learn non-linear functions (regression), and in particular functions whose inputs are a sequence of measurements over time (time series).
63(No Transcript)
64Extensions of Backprop Nets
- Recurrent Architectures
- Backprop through time
65Elman Nets and Jordan Nets
- Updating the context as we receive input
- In Jordan nets we model forgetting as well
- The recurrent connections have fixed weights
- You can train these networks using good ol' backprop
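A small sketch of the Elman-style forward pass; the class and parameter names are assumptions, and no training is shown. The context layer is a copy of the previous hidden state, fed back as extra input on the next time step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ElmanNet:
    """Hidden state is copied into a context layer and fed back on the next time step."""
    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in  = rng.normal(0, 0.5, (n_hid, n_in))
        self.W_ctx = rng.normal(0, 0.5, (n_hid, n_hid))  # context-to-hidden weights (trainable)
        self.W_out = rng.normal(0, 0.5, (n_out, n_hid))
        self.context = np.zeros(n_hid)                   # starts empty

    def step(self, x):
        h = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
        self.context = h.copy()   # the copy connection (hidden -> context) is the fixed-weight link
        return sigmoid(self.W_out @ h)

net = ElmanNet(n_in=2, n_hid=3, n_out=1)
for x in [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.])]:
    print(net.step(x))            # the output now depends on the input history
```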
66Recurrent Backprop
(Diagram: a three-node recurrent network a, b, c with weights w1-w4, unrolled for 3 iterations)
- we'll pretend to step through the network one iteration at a time
- backprop as usual, but average equivalent weights (e.g. all 3 highlighted edges on the right are equivalent)