Connectionist Models: Backprop

Transcript and Presenter's Notes

1
Connectionist Models: Backprop
  • Jerome Feldman
  • CS182/CogSci110/Ling109
  • Spring 2008

2
Recruiting connections
  • Given that LTP involves synaptic strength changes
    and Hebb's rule involves coincident-activation-based
    strengthening of connections
  • How can connections between two nodes be
    recruited using Hebb's rule?

3
(Diagram: nodes X and Y to be linked)
4
(Diagram: nodes X and Y to be linked through intermediate layers)
5
Finding a Connection
P = (1 - F)^(B^K)
  • P = Probability of NO link between X and Y
  • N = Number of units in a layer
  • B = Number of randomly outgoing units per unit
  • F = B/N, the branching factor
  • K = Number of intermediate layers, 2 in the
    example

(Table: example values N = 10^6, 10^7, 10^8, the number of intermediate layers K, and the expected number of paths from X to Y)
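As a quick illustration, here is a minimal Python sketch of the probability calculation, assuming the reconstructed formula P = (1 - F)^(B^K) above; the value B = 1000 is an assumed example, and only the N values come from the slide.

    # Sketch: probability of NO link between X and Y in a random network,
    # assuming P = (1 - F)^(B^K) with F = B/N (reconstruction of the slide's formula).
    def prob_no_link(N, B, K):
        F = B / N                      # branching factor
        return (1.0 - F) ** (B ** K)   # none of the ~B^K reachable nodes links to Y

    for N in (1e6, 1e7, 1e8):          # the N values shown on the slide
        print(f"N={N:.0e}: P={prob_no_link(N, B=1000, K=2):.3g}")  # B is an assumed example value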
6
Finding a Connection in Random Networks
For networks with N nodes and branching
factor B, there is a high probability of finding
good links (Valiant, 1995).
7
Recruiting a Connection in Random Networks
  • Informal algorithm:
  • Activate the two nodes to be linked
  • Have nodes with double activation strengthen
    their active synapses (Hebb)
  • There is evidence for a "now print" signal based
    on LTP (episodic memory)

8
(No Transcript)
9
(No Transcript)
10
(Diagram: triangle nodes linking Has-color to Green and Has-shape to Round)
11
(Diagram: the same network with GREEN and ROUND highlighted)
12
Hebb's rule is not sufficient
  • What happens if the neural circuit fires
    perfectly, but the result is very bad for the
    animal, like eating something sickening?
  • A pure invocation of Hebb's rule would strengthen
    all participating connections, which can't be
    good.
  • On the other hand, it isn't right to weaken all
    the active connections involved: much of the
    activity was just recognizing the situation. We
    would like to change only those connections that
    led to the wrong decision.
  • No one knows how to specify a learning rule that
    will change exactly the offending connections
    when an error occurs.
  • Computer systems, and presumably nature as well,
    rely upon statistical learning rules that tend to
    make the right changes over time. More in later
    lectures.

13
Hebb's rule is insufficient
  • Should you punish all the connections?

14
Models of Learning
  • Hebbian: coincidence
  • Supervised: correction (backprop)
  • Recruitment: one-trial
  • Reinforcement learning: delayed reward
  • Unsupervised: similarity

15
Abstract Neuron
Threshold Activation Function
16
Boolean XOR
(Diagram: network computing XOR, with inputs x1 and x2, hidden units h1 and h2, and output o)
17
Supervised Learning - Backprop
  • How do we train the weights of the network?
  • Basic Concepts
  • Use a continuous, differentiable activation
    function (Sigmoid)
  • Use the idea of gradient descent on the error
    surface
  • Extend to multiple layers

18
Backprop
  • To learn on data which is not linearly separable
  • Build multiple layer networks (hidden layer)
  • Use a sigmoid squashing function instead of a
    step function.

19
Tasks
  • Unconstrained pattern classification
  • Credit assessment
  • Digit Classification
  • Speech Recognition
  • Function approximation
  • Learning control
  • Stock prediction

20
Sigmoid Squashing Function
(Diagram: a single unit with inputs y1 … yn, bias input y0 = 1, weights w0 … wn, and a sigmoid squashing function producing the output)
21
The Sigmoid Function
(Plot: sigmoid output y = a as a function of net input x_net)
22
The Sigmoid Function
(Plot: sigmoid output y approaching 1 for large net input x_net,i and 0 for small net input)
23
The Sigmoid Function
(Plot: sigmoid with the region of greatest sensitivity to input marked between the saturated regions where the output is near 0 or 1)
24
Nice Property of Sigmoids
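The formula on this slide is an image in the original deck; the property it presumably refers to is the standard identity for the sigmoid's derivative, written here in LaTeX:

    y = \sigma(x_{\text{net}}) = \frac{1}{1 + e^{-x_{\text{net}}}}, \qquad
    \frac{dy}{dx_{\text{net}}} = \sigma(x_{\text{net}})\,(1 - \sigma(x_{\text{net}})) = y(1 - y)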
25
Gradient Descent
26
Gradient Descent on an error
27
Learning as Gradient Descent
(Plots: a complex error surface for a hypothetical network training problem, and the error surface for a 2-weight linear network)
28
Learning Rule: Gradient Descent on Root Mean
Square (RMS) Error
  • Learn w_i's that minimize the squared error
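The update formulas on this slide are images in the original deck; the standard gradient-descent rule they presumably show is, in LaTeX:

    E = \tfrac{1}{2} \sum_i (t_i - y_i)^2, \qquad
    \Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}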

29
Gradient Descent
30
Gradient Descent
Global minimum: this is your goal
(it should be 4-D, with 3 weights, but you get the idea)
31
Backpropagation Algorithm
  • Generalization to multiple layers and multiple
    output units

32
Back-Propagation Algorithm
Sigmoid
  • We define the error term for a single node to be
    t_i - y_i

33
Backprop Details
  • Here we go

34
The output layer
The derivative of the sigmoid is just y(1 - y)
35
Nice Property of Sigmoids
36
The hidden layer
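The derivations on slides 34-36 are images in the original deck; the resulting delta terms for a sigmoid network, which they presumably arrive at, are (in LaTeX):

    \delta_k = y_k (1 - y_k)\,(t_k - y_k) \quad \text{(output unit } k\text{)}
    \delta_h = y_h (1 - y_h) \sum_k w_{hk}\,\delta_k \quad \text{(hidden unit } h\text{)}
    \Delta w_{ij} = \eta\, \delta_j\, y_i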
37
Let's just do an example
(Diagram: a small sigmoid network; the output unit's net input is 0.5 and its target is t0 = 0)
y0 = 1/(1 + e^(-0.5)) = 0.6224
E = Error = ½ Σ_i (t_i - y_i)²
E = ½ (t0 - y0)²
E = ½ (0 - 0.6224)² = 0.1937
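A minimal Python sketch reproducing the numbers above; only the net input (0.5), target (0), output (0.6224) and error (0.1937) come from the slide, the rest of the example network is not shown here.

    import math

    # Forward pass and squared error for the slide's example:
    # a sigmoid output unit with net input 0.5 and target t0 = 0.
    x_net = 0.5
    y0 = 1.0 / (1.0 + math.exp(-x_net))   # = 0.6225 (slide rounds to 0.6224)
    t0 = 0.0
    E = 0.5 * (t0 - y0) ** 2              # = 0.1937
    print(y0, E)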
38
An informal account of BackProp
After amassing Δw for all weights, change each
weight a little bit, as determined by the
learning rate
39
Backprop learning algorithm (incremental mode)
  • n ← 1
  • initialize w(n) randomly
  • while (stopping criterion not satisfied and
    n < max_iterations)
  • for each example (x, d)
  • - run the network with input x and compute the
    output y
  • - update the weights in backward order, starting
    from those of the output layer,
  • with Δw computed using the (generalized)
    Delta rule
  • end-for
  • n ← n + 1
  • end-while
    (a code sketch of this loop follows below)
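A minimal NumPy sketch of the incremental-mode loop above, using the sigmoid deltas from the earlier slides; the XOR training set, the network size, the learning rate, and the iteration count are illustrative assumptions rather than values from the slides.

    import numpy as np

    # Sketch of incremental (per-example) backprop for a one-hidden-layer
    # sigmoid network with bias units.  Sizes, learning rate, iteration
    # count, and the XOR data are illustrative assumptions.
    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs x
    D = np.array([0, 1, 1, 0], dtype=float)                      # targets d

    n_in, n_hid = 2, 3
    W1 = rng.normal(scale=0.5, size=(n_hid, n_in + 1))  # hidden weights (+ bias column)
    W2 = rng.normal(scale=0.5, size=(n_hid + 1,))       # output weights (+ bias entry)
    eta, max_iterations = 0.5, 10000

    n = 1
    while n <= max_iterations:   # a real run would also test an error-based stopping criterion
        for x, d in zip(X, D):
            xb = np.append(x, 1.0)           # input plus bias
            h = sigmoid(W1 @ xb)             # hidden activations
            hb = np.append(h, 1.0)           # hidden plus bias
            y = sigmoid(W2 @ hb)             # run the network, compute output y
            # update weights in backward order using the (generalized) Delta rule
            delta_out = (d - y) * y * (1 - y)
            delta_hid = h * (1 - h) * (W2[:-1] * delta_out)
            W2 += eta * delta_out * hb
            W1 += eta * np.outer(delta_hid, xb)
        n += 1

    for x in X:
        hb = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)
        print(x, float(sigmoid(W2 @ hb)))    # should be close to XOR of the inputs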

40
Backpropagation Algorithm
  • Initialize all weights to small random numbers
  • For each training example, do:
  • For each hidden unit h: compute its output
  • For each output unit k: compute its output
  • For each output unit k: compute its error term δ_k
  • For each hidden unit h: compute its error term δ_h
  • Update each network weight w_ij

with Δw_ij = η δ_j y_i
41
Backpropagation Algorithm
42
What if all the input-to-hidden weights are
initially equal?
43
Momentum term
  • The speed of learning is governed by the learning
    rate.
  • If the rate is low, convergence is slow.
  • If the rate is too high, error oscillates without
    reaching the minimum.
  • Momentum tends to smooth out small weight-error
    fluctuations.

The momentum term accelerates the descent in steady
downhill directions and has a stabilizing effect
in directions that oscillate in time.
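The momentum update the slide refers to is presumably the standard form, written here in LaTeX, with α as the momentum coefficient:

    \Delta w_{ij}(t) = -\eta\, \frac{\partial E}{\partial w_{ij}} + \alpha\, \Delta w_{ij}(t - 1), \qquad 0 \le \alpha < 1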
44
Convergence
  • May get stuck in local minima
  • Weights may diverge
  • but often works well in practice
  • Representation power:
  • 2-layer networks: any continuous function
  • 3-layer networks: any function

45
Pattern Separation and NN architecture
46
Local Minimum
USE A RANDOM COMPONENT: SIMULATED ANNEALING
47
Adjusting Learning Rate and the Hessian
  • The Hessian H is the second derivative of E with
    respect to w.
  • The Hessian tells you about the shape of the
    cost surface.
  • The eigenvalues of H are a measure of the
    steepness of the surface along the curvature
    directions.
  • a large eigenvalue => steep curvature => need a
    small learning rate
  • the learning rate should be proportional to
    1/eigenvalue
48
Overfitting and generalization
TOO MANY HIDDEN NODES TEND TO OVERFIT
49
Stopping criteria
  • Sensible stopping criteria:
  • total mean squared error change: Back-prop is
    considered to have converged when the absolute
    rate of change in the average squared error per
    epoch is sufficiently small (in the range
    [0.01, 0.1]).
  • generalization-based criterion: After each
    epoch the NN is tested for generalization. If the
    generalization performance is adequate, then stop.
    If this stopping criterion is used, then the part
    of the training set used for testing the network's
    generalization will not be used for updating the
    weights.

50
Overfitting in ANNs
51
Early Stopping (Important!!!)
  • Stop training when error goes up on validation
    set
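A short Python sketch of this rule; train_one_epoch, validation_error, and the copy_weights/set_weights methods are hypothetical helpers assumed for illustration, not part of the slides.

    # Early stopping sketch: keep the weights from the epoch with the lowest
    # validation error and stop once it has risen for `patience` epochs in a row.
    # `train_one_epoch`, `validation_error`, `copy_weights`, and `set_weights`
    # are assumed, hypothetical helpers.
    def train_with_early_stopping(net, train_set, val_set, max_epochs=1000, patience=5):
        best_err, best_weights, bad_epochs = float("inf"), net.copy_weights(), 0
        for epoch in range(max_epochs):
            train_one_epoch(net, train_set)        # one pass of backprop updates
            err = validation_error(net, val_set)   # error on held-out data
            if err < best_err:
                best_err, best_weights, bad_epochs = err, net.copy_weights(), 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:         # validation error keeps rising: stop
                    break
        net.set_weights(best_weights)              # roll back to the best epoch
        return net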

52
Stopping criteria
(Repeat of the Stopping criteria slide above.)

53
Architectural Considerations
What is the right size network for a given job?
How many hidden units?
Too many: no generalization.
Too few: no solution.
Possible answer: constructive algorithms, e.g.
Cascade Correlation (Fahlman and Lebiere, 1990),
etc.
54
Network Topology
  • The number of layers and of neurons depends on the
    specific task. In practice this issue is solved
    by trial and error.
  • Two types of adaptive algorithms can be used:
  • start from a large network and successively
    remove some nodes and links until network
    performance degrades.
  • begin with a small network and introduce new
    neurons until performance is satisfactory.

55
Cascade Correlation
  • It starts with a minimal network, consisting only
    of an input and an output layer.
  • Minimizing the overall error of the net, it adds
    new hidden units to the hidden layer one at a
    time.
  • Cascade-Correlation is a supervised learning
    architecture which builds a near-minimal
    multi-layer network topology.
  • The two advantages of this architecture are that:
  • there is no need for the user to worry about the
    topology of the network, and that
  • Cascade-Correlation learns much faster than the
    usual learning algorithms.

56
Supervised vs Unsupervised Learning
  • Backprop requires a 'target'
  • how realistic is that?
  • Hebbian learning is unsupervised, but limited in
    power
  • How can we combine the power of backprop (and
    friends) with the ideal of unsupervised learning?

57
Autoassociative Networks
  • Network trained to reproduce the input at the
    output layer
  • Non-trivial if number of hidden units is smaller
    than inputs/outputs
  • Forced to develop compressed representations of
    the patterns
  • Hidden unit representations may reveal natural
    kinds (e.g. vowels vs consonants)
  • The problem of an explicit teacher is circumvented

(Diagram: network whose target is a copy of its input)
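In code, autoassociative training is just the earlier backprop loop with the input reused as the target; backprop_update below is an assumed, hypothetical stand-in for that loop.

    # Autoassociative training: no external teacher, the target is a copy of
    # the input.  `backprop_update` is an assumed, hypothetical helper.
    def train_autoassociator(net, patterns, epochs=100):
        for _ in range(epochs):
            for x in patterns:
                backprop_update(net, inputs=x, targets=x)  # target = copy of input
        return net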
58
Problems and Networks
  • Some problems have natural "good" solutions
  • Solving a problem may be possible by providing
    the right armory of general-purpose tools, and
    recruiting them as needed
  • Networks are general-purpose tools.
  • Choice of network type, training, architecture,
    etc. greatly influences the chances of
    successfully solving a problem
  • Tension: tailoring tools for a specific job vs.
    exploiting a general-purpose learning mechanism

59
Summary
  • Multiple layer feed-forward networks
  • Replace Step with Sigmoid (differentiable)
    function
  • Learn weights by gradient descent on error
    function
  • Backpropagation algorithm for learning
  • Avoid overfitting by early stopping

60
ALVINN drives at 70 mph on highways
61
Use MLP Neural Networks when
  • (vectored) Real inputs, (vectored) real outputs
  • You're not interested in understanding how it
    works
  • Long training times acceptable
  • Short execution (prediction) times required
  • Robust to noise in the dataset

62
Applications of FFNN
  • Classification, pattern recognition:
  • FFNNs can be applied to tackle non-linearly
    separable learning problems.
  • Recognizing printed or handwritten characters
  • Face recognition
  • Classification of loan applications into
    credit-worthy and non-credit-worthy groups
  • Analysis of sonar and radar signals to determine
    the nature of the source of a signal
  • Regression and forecasting:
  • FFNNs can be applied to learn non-linear functions
    (regression) and, in particular, functions whose
    input is a sequence of measurements over time
    (time series).

63
(No Transcript)
64
Extensions of Backprop Nets
  • Recurrent Architectures
  • Backprop through time

65
Elman Nets and Jordan Nets
(Diagram: recurrent network with a context layer)
  • Updating the context as we receive input
  • In Jordan nets we model forgetting as well
  • The recurrent connections have fixed weights
  • You can train these networks using good ol'
    backprop
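A minimal sketch of one Elman-style step, assuming a sigmoid hidden layer; all names and sizes here are illustrative, and the recurrent weights W_context stay fixed, as the slide says.

    import numpy as np

    # One Elman-style step: the context layer holds a copy of the previous
    # hidden state and feeds back into the hidden layer over fixed
    # (untrained) recurrent connections.  Names and sizes are illustrative.
    def elman_step(x, context, W_in, W_context, W_out):
        h = 1.0 / (1.0 + np.exp(-(W_in @ x + W_context @ context)))  # hidden state
        y = 1.0 / (1.0 + np.exp(-(W_out @ h)))                       # output
        return y, h.copy()   # the copy of the hidden state becomes the new context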

66
Recurrent Backprop
(Diagram: a three-node recurrent network (a, b, c) with weights w1-w4,
unrolled for 3 iterations)
  • we'll pretend to step through the network one
    iteration at a time
  • backprop as usual, but average equivalent weights
    (e.g. all 3 highlighted edges on the right are
    equivalent)
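A small sketch of the weight-averaging step described in the last bullet; the gradients themselves would come from ordinary backprop through the unrolled network, and all names here are illustrative.

    import numpy as np

    # After backprop through the unrolled net, the copies of each recurrent
    # weight are "equivalent": average their gradients and apply the same
    # update to every copy so they stay tied together.
    def shared_weight_update(weight_copies, grad_copies, eta=0.1):
        avg_grad = np.mean(grad_copies, axis=0)               # average over unrolled copies
        return [w - eta * avg_grad for w in weight_copies]    # identical update for each copy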