Title: Error back-propagation (technical)
Slide 1: Error back-propagation (technical)

Slide 2: Deriving the back-prop algorithm for computing derivatives
- Will now derive the back-prop algorithm for a general network with arbitrary feed-forward structure
- Resulting formulae will then be illustrated using a simple layered network structure with
  - a single layer of sigmoidal hidden units, and
  - a sum-of-squares error
Slide 3: Feed-forward networks
Slide 4: First, what does each unit (artificial neuron) do?
- In a general feed-forward network, each unit j computes a weighted sum of its inputs:

    a_j = \sum_i w_{ji} z_i        (4.26)

  Here z_i is the activation (output) of a unit, or input, i, which sends a connection to unit j (it could be x_i if i is a network input); the summation is over all units i that send connections to unit j; and w_{ji} is the weight associated with that connection (from unit i to unit j).
Slide 5: What else?
- The sum a_j is then transformed by a (usually nonlinear) activation function g(·) (such as a sigmoid) to give the activation z_j of the form

    z_j = g(a_j)        (4.27)

  This z_j could be y_j if unit j is a network output.
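As a concrete illustration of (4.26) and (4.27), here is a minimal NumPy sketch of the forward computation for a single unit; the function and variable names (unit_forward, w_j, z_in) are illustrative assumptions, not from the slides, and the sigmoid is just one possible choice of g.

    import numpy as np

    def sigmoid(a):
        # Logistic sigmoid activation g(a) = 1 / (1 + exp(-a))
        return 1.0 / (1.0 + np.exp(-a))

    def unit_forward(w_j, z_in):
        # Forward computation for one unit j.
        # w_j  : weights w_ji from each sending unit i to unit j
        # z_in : activations z_i of the sending units (x_i for network inputs)
        a_j = np.dot(w_j, z_in)   # a_j = sum_i w_ji * z_i   (4.26)
        z_j = sigmoid(a_j)        # z_j = g(a_j)             (4.27)
        return a_j, z_j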
Slide 6: Error function
- We seek to determine suitable weights to minimize some appropriate error function
- Will consider error functions that can be written as a sum, over all patterns in the training set, of an error defined for each pattern separately:

    E = \sum_n E^n        (4.28)

  Here n is the pattern index (a superscript index, not a power!)
- Nearly all practical error functions can be written in this form
Slide 7: Error function (cont'd)
- We'll also assume that the error can be written as a function of the network output variables, the y_k's:

    E^n = E^n(y_1, ..., y_c)        (4.29)
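A minimal sketch of an error of this form, using the sum-of-squares error mentioned on Slide 2 (the 1/2 factor and the array names Y, T are conventions assumed here, not taken from the slides):

    import numpy as np

    def sum_of_squares_error(Y, T):
        # Total error E = sum_n E^n, per (4.28), with the per-pattern error
        # E^n = 1/2 * sum_k (y_k^n - t_k^n)^2 written in terms of the
        # network outputs, per (4.29).
        # Y : network outputs, one row per pattern n
        # T : corresponding target values, same shape as Y
        per_pattern = 0.5 * np.sum((Y - T) ** 2, axis=1)   # E^n for each pattern
        return np.sum(per_pattern)                          # E = sum_n E^n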
Slide 8: Our main goal (reminder)
- To find the derivatives of the error E with respect to the weights (and biases) in the network
- But from (4.28) we just need to know how to find the derivative of E for one pattern (at a time) and add up the results
- Thus we can just consider one pattern from now on
- Can drop the pattern index n
Slide 9: Consider evaluating the derivative of E (i.e., E^n) w.r.t. w_{ji}
- Note that E depends on the weight w_{ji} only via the summed input a_j to unit j.
- Thus, we apply the chain rule to get

    \partial E / \partial w_{ji} = (\partial E / \partial a_j)(\partial a_j / \partial w_{ji})        (4.30)    [chain rule]

  Introducing the notation

    \delta_j \equiv \partial E / \partial a_j        (4.31)

  and noting that, from the definition of a_j in (4.26),

    \partial a_j / \partial w_{ji} = z_i        (4.32)

  we obtain

    \partial E / \partial w_{ji} = \delta_j z_i        (4.33)
Slide 10: More on the derivative of E. Are we done?
- We just derived \partial E / \partial w_{ji} = \delta_j z_i   (4.33)
- This is Hebbian in form, since it correlates input and output
- Not quite done: we still need to evaluate \delta_j for every unit
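In vectorized form, (4.33) is just an outer product; a minimal sketch (the name weight_gradients is illustrative):

    import numpy as np

    def weight_gradients(delta, z):
        # dE/dw_ji = delta_j * z_i for one pattern, per (4.33).
        # delta : errors delta_j of the receiving units
        # z     : activations z_i of the sending units
        return np.outer(delta, z)   # entry (j, i) equals delta_j * z_i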
Slide 11: Easy for output units!
- For an output unit k, applying the chain rule once more:

    \delta_k = \partial E / \partial a_k = g'(a_k) \, \partial E / \partial y_k        (4.34)

  Here g is the activation function giving y_k as a function of a_k, i.e. y_k = g(a_k).
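For instance, with a sum-of-squares error E = 1/2 \sum_k (y_k - t_k)^2 (the example error from Slide 2), \partial E / \partial y_k = y_k - t_k, and (4.34) becomes the following sketch (g_prime is assumed to be the derivative of the chosen activation g):

    import numpy as np

    def output_deltas(a_out, y, t, g_prime):
        # delta_k = g'(a_k) * dE/dy_k for the output units, per (4.34),
        # specialized to a sum-of-squares error where dE/dy_k = y_k - t_k.
        return g_prime(a_out) * (y - t)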
Slide 12: What about hidden units?
- For a hidden unit j, using the definition of \delta and the chain rule once again:

    \delta_j = \partial E / \partial a_j = \sum_k (\partial E / \partial a_k)(\partial a_k / \partial a_j)        (4.35)

  The sum runs over all units k to which unit j sends connections. Note that units k are closer to the network output than unit j.
Slide 13: (No transcript)
Slide 14: We already had...
- From (4.35): \delta_j = \sum_k (\partial E / \partial a_k)(\partial a_k / \partial a_j)
- From (4.26) and (4.27), re-indexed for unit k: a_k = \sum_{j'} w_{kj'} z_{j'} = \sum_{j'} w_{kj'} g(a_{j'}), where the sum is over all units j' feeding into unit k, including unit j
- Thus, differentiating a_k with respect to a_j, all partial derivatives are 0 except the term with j' = j:

    \partial a_k / \partial a_j = w_{kj} g'(a_j)
Slide 15: All partial derivatives are 0 except the term with j' = j
- Substituting \partial a_k / \partial a_j = w_{kj} g'(a_j) into (4.35), we just derived the back-propagation formula

    \delta_j = g'(a_j) \sum_k w_{kj} \delta_k        (4.36)
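A minimal vectorized sketch of (4.36) for one layer of hidden units (the names W_out, delta_out, g_prime are illustrative; W_out is assumed to hold the weights w_kj with rows indexed by k and columns by j):

    import numpy as np

    def hidden_deltas(a_hidden, W_out, delta_out, g_prime):
        # delta_j = g'(a_j) * sum_k w_kj * delta_k, per (4.36).
        # a_hidden  : summed inputs a_j of the hidden units
        # W_out     : weights w_kj from hidden unit j to the units k it feeds
        # delta_out : errors delta_k already computed for those units k
        return g_prime(a_hidden) * (W_out.T @ delta_out)   # sum_k w_kj * delta_k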
Slide 16: Summary of back-propagation
- Apply an input vector x to the network and find the activations (outputs) of all neurons; this is forward propagation (4.26), (4.27)
- Evaluate \delta_k for all output units (4.34)
- Back-propagate the \delta's (4.36) to obtain the \delta_j's for hidden units
- Use (4.33) to evaluate the required derivatives
- The derivative of the total error E can be obtained by repeating the above steps for each pattern in the training set and summing over all patterns
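Putting the summary together, here is a minimal end-to-end sketch for the simple case announced on Slide 2 (a single layer of sigmoidal hidden units and a sum-of-squares error); biases are omitted and the names x, t, W1, W2 are illustrative assumptions, not from the slides:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sigmoid_prime(a):
        s = sigmoid(a)
        return s * (1.0 - s)

    def backprop_one_pattern(x, t, W1, W2):
        # Gradients of the per-pattern sum-of-squares error for a network
        # with one hidden layer of sigmoidal units and sigmoidal outputs.
        # W1 : hidden-layer weights w_ji (rows: hidden units j, cols: inputs i)
        # W2 : output-layer weights w_kj (rows: output units k, cols: hidden units j)

        # 1. Forward propagation (4.26), (4.27)
        a1 = W1 @ x              # summed inputs to hidden units
        z1 = sigmoid(a1)         # hidden activations
        a2 = W2 @ z1             # summed inputs to output units
        y = sigmoid(a2)          # network outputs

        # 2. Deltas for the output units (4.34), with dE/dy_k = y_k - t_k
        delta2 = sigmoid_prime(a2) * (y - t)

        # 3. Back-propagate the deltas to the hidden units (4.36)
        delta1 = sigmoid_prime(a1) * (W2.T @ delta2)

        # 4. Required derivatives (4.33): dE/dw_ji = delta_j * z_i
        dW2 = np.outer(delta2, z1)
        dW1 = np.outer(delta1, x)
        return dW1, dW2

    # Summing dW1, dW2 over all patterns gives the derivative of the
    # total error E, per (4.28).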