Error Back-Propagation (intro) - PowerPoint PPT Presentation

1
Error Back-Propagation (intro)
  • Bishop, Section 4.8, p. 140

2
Background and motivation
  • So far we concentrated on the representational
    capabilities of multilayer networks
  • Next we see how such a network can learn a
    suitable mapping from a given data set

3
Network with threshold units: the final (output) layer
  • The final layer of weights can be regarded as a
    perceptron with inputs given by the outputs of
    the last layer of hidden units
  • These weights can therefore be updated using the
    Perceptron Learning Rule of Chapter 3
  • If the presented pattern is misclassified, update
    the weight vector with the product of the target
    output and the input (see the sketch below)
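
A minimal sketch of this update rule, assuming bipolar targets t in {-1, +1}, a learning rate eta, and a bias folded into the weight vector (these choices are illustrative, not from the slides):

    import numpy as np

    def perceptron_update(w, z, t, eta=1.0):
        # w   : output-unit weight vector (w[0] is the bias weight)
        # z   : outputs of the last hidden layer, with z[0] = 1 for the bias
        # t   : target output, assumed here to be +1 or -1
        # eta : assumed learning rate
        y = np.sign(w @ z)          # thresholded output of the final unit
        if y != t:                  # pattern is misclassified
            w = w + eta * t * z     # add the product of target and input
        return w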

4
Problems with the hidden layers
  • Can't update weights to these layers with the
    perceptron rule
  • Because we don't know what the target outputs (of
    the hidden layers) are supposed to be!!
  • In other words, suppose you input a pattern to
    the entire network and get a misclassified
    output: you can't tell which hidden-layer units
    have the wrong output!

5
Hidden layer problem (cont'd)
  • In fact maybe all the hidden layer units give the
    right output, but the output unit doesn't!
  • This is known as the credit assignment problem
  • Should've been known as the fault assignment
    problem!
  • We don't know which hidden units (or output unit)
    are guilty of giving the wrong output
  • So we don't know which weights to adjust, or by how
    much

6
Solution to the Credit Assignment Problem
  • Is relatively simple! (in principle)
  • Hard to read at first
  • Consider network with differentiable activation
    functions
  • Then the output becomes a differentiable function
    of all the input variables and all the weights
    and biases
  • Note that operations like taking the vector
    product are already differentiable
  • Also, a composition of differentiable functions
    is differentiable (illustrated in the sketch below)
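
A minimal sketch of such a network, using the logistic sigmoid as the differentiable activation; the two-layer shape and the names are assumptions for illustration only:

    import numpy as np

    def sigmoid(a):
        # Logistic sigmoid: a smooth, differentiable stand-in for a hard threshold
        return 1.0 / (1.0 + np.exp(-a))

    def forward(x, W1, W2):
        # The output is a composition of vector products and sigmoids,
        # and therefore a differentiable function of every weight
        z = sigmoid(W1 @ x)   # hidden-layer activations
        y = sigmoid(W2 @ z)   # output activations
        return y, z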

7
What good is differentiability?
  • Ok, we chose a differentiable activation function
    (like a sigmoid), so that the output is a
    differentiable function of all the weights
  • Now suppose the error function is a
    differentiable function of the output
  • E.g. sum-of-squares error (Chapter 1)
  • Then we can evaluate the derivatives of the error
    wrt the weights
  • Finally, we use these derivatives to find weights
    that minimize the error!!
  • Gradient descent (see the sketch below)
  • Other techniques
  • We have one that uses the computer as a
    decision-maker rather than just performing
    iterations according to a mathematical formula
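
A rough sketch of the gradient-descent option only: with the sum-of-squares error E = 1/2 * sum_n ||y(x_n) - t_n||^2, each weight is moved against its own error derivative. The function below assumes the derivatives have already been evaluated, and eta is an assumed learning rate:

    def gradient_descent_step(weights, grads, eta=0.1):
        # weights, grads : matching lists of arrays (one pair per layer)
        # eta            : assumed learning rate
        return [w - eta * g for w, g in zip(weights, grads)]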

8
Back-propagation
  • Is a technique for evaluating the derivatives of
    the error function wrt weights
  • Really just an application of the chain rule
    (with partial derivatives), but made nicely
    canonical for programming convenience (see the
    sketch below)
  • The name comes from the propagation of errors
    backwards through the network
  • Popularized in a paper by Rumelhart, Hinton and
    Williams (1986)
  • Similar ideas developed earlier by a number of
    researchers including Werbos (1974) and Parker
    (1985)
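
A minimal sketch of the chain rule at work, for a two-layer sigmoid network with sum-of-squares error; the network shape, names, and error function are assumptions chosen for illustration, not the general derivation:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backprop(x, t, W1, W2):
        # Forward pass
        z = sigmoid(W1 @ x)                       # hidden-layer outputs
        y = sigmoid(W2 @ z)                       # network outputs
        # Backward pass: propagate errors (deltas) from the output layer back
        delta2 = (y - t) * y * (1.0 - y)          # dE/da at the output units
        delta1 = (W2.T @ delta2) * z * (1.0 - z)  # dE/da at the hidden units (chain rule)
        # Derivatives of the error wrt each weight matrix
        dW2 = np.outer(delta2, z)
        dW1 = np.outer(delta1, x)
        return dW1, dW2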

9
Meaning of the term back-propagation
  • Term is used to mean different things
  • The original meaning was as we just saw
  • Propagation of errors back through the network,
    for use in computing the derivatives of the error
    wrt the weights
  • The multi-layer perceptron is sometimes called a
    back-propagation network
  • Term is also used to refer to the training of a
    multi-layer perceptron using Gradient Descent
  • To clarify the terminology it is useful to
    consider the nature of the training process more
    carefully

10
The two stages of a training iteration
  • Weight training is done in two distinct stages
  • Computing derivatives of the error function wrt
    weights
  • Back-prop happens in this stage, so in this book
    we'll use the term to mean just this
    back-propagation of errors
  • These derivatives are then used to compute the
    weight adjustments
  • This can be done in many ways, including gradient
    descent (one full iteration is sketched below)
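
Putting the two stages together in one sketch, re-using the backprop() and gradient-descent sketches above; eta is an assumed learning rate, and gradient descent is only one possible choice for the second stage:

    def training_iteration(x, t, W1, W2, eta=0.1):
        # Stage 1: back-propagation of errors gives the derivatives dE/dW
        dW1, dW2 = backprop(x, t, W1, W2)
        # Stage 2: use the derivatives to adjust the weights (here, gradient descent)
        return W1 - eta * dW1, W2 - eta * dW2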