Title: Backpropagation in 2 hours
1 Backpropagation in 2 hours
- Christer Johansson
- Computational Linguistics
- Bergen University
2 Groundhog Day
- Very illustrative of how backprop works!
3 Groundhog Day
- The main character gets to relive the same day over and over until he gets it right.
- This is the core of neural network / backpropagation learning.
4 How does one get it right?
The implication is that there is a correct way, and that it is possible to detect that you are not doing the correct thing. The Bill Murray character initially does not know what he is supposed to do. Since it is an American romantic comedy, the goal is of course to get the girl. As he gets to relive the same day, he selects behaviour that makes it more probable to get the girl.
5 How does one get it right?
In the movie he hasn't actually set a goal for himself, but the same DATA SET keeps getting presented to him. Depending on his reaction, he gets feedback on different sub-behaviours. At first, his search to minimize his errors is more or less random. As he acquires more and more correct reactions to the data set, he also starts to see a goal (to get the girl). (Insight?) An artificial neural network would never see the goal, but would have gotten the girl anyway given the same data set. If it didn't get stuck in a local minimum, being content with partying and having fun with all those Miss-Not-So-Rights.
6 Learning
- Supervised learning
- Non-supervised learning
- Clustering of a data set
- Finding similarities
- Autonomous learning
- Selecting what to learn
7 A perceptron neuron
Sum up the input and decide how much to output.
[Figure: a neuron receiving inputs i1 and i2]
8 The perceptron learning law
Each new weight is the old weight adjusted by the error times the input value:
w_new = w_old + error x input
Example: out = 0, desired = 1, so error = (1 - 0) = 1; in = 1; weight (e.g. 0.1). The new weight is 0.1 + error x 1 = 1.1. A threshold value can be thought of as an input that is always on. It will be adjustable if it has a weight associated with it.
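The update rule above can be sketched as follows (a minimal illustration; the function name is my own, and the learning rate is omitted as in the slide's example):

```python
def update_weights(weights, inputs, actual, desired):
    """One step of the perceptron learning law: w_new = w_old + error * input."""
    error = desired - actual
    return [w + error * x for w, x in zip(weights, inputs)]

# The slide's example: output 0, desired 1, input 1, old weight 0.1.
print(update_weights([0.1], [1], actual=0, desired=1))  # [1.1] (up to float rounding)
```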
9 Summing up
The perceptron basically sums up all its input:
w1x1 + w2x2 + ... + wnxn (+ C)
Is that sum larger than a threshold value?
10 Drawing a line
The decision for on or off is bounded by a linear function. For example, with two inputs:
w1i1 + w2i2 = threshold value (0 in the previous example)
i.e. ax + by - C = 0, which is the equation of a line: y = (-a/b)x + C/b.
[Figure: the four binary input points on the unit square, with a line separating the points labelled 1 from those labelled 0]
11 A perceptron "or"
If sum > 0.5 then at least one input is on, so output 1.
[Figure: inputs i1 and i2 with weights w1 = 1 and w2 = 1]
12 A perceptron "and"
If sum > 1.5 then both inputs are on, so output 1.
[Figure: inputs i1 and i2 with weights w1 = 1 and w2 = 1]
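Both units can be sketched with the same code, differing only in the threshold (a minimal illustration, not code from the slides):

```python
def perceptron(i1, i2, threshold):
    """Threshold unit with w1 = w2 = 1, as in the two slides above."""
    return 1 if i1 * 1 + i2 * 1 > threshold else 0

for i1 in (0, 1):
    for i2 in (0, 1):
        # OR uses threshold 0.5, AND uses threshold 1.5.
        print(i1, i2, "or:", perceptron(i1, i2, 0.5), "and:", perceptron(i1, i2, 1.5))
```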
13 A problem
Some functions are not linearly separable.
[Figure: XOR on the unit square: (0,1) and (1,0) are labelled 1; (0,0) and (1,1) are labelled 0. No single line separates the two classes.]
14 The Dark Ages of NN
Since xor (which is a simple function) could not be separated by a line, the perceptron is very limited in what kind of functions it can learn. This discredited the inventor of the perceptron, Frank Rosenblatt, and made it almost impossible to get funding for neural network research. Funding instead went to Symbolic AI. Neural networks / perceptrons seemed a dead end. 1967-1982 were quiet years, though many worked underground.
15 Combining
xor can be composed of simpler logical functions:
A xor B = (A or B) and not (A and B)
The last term simply removes the troublesome value.
16 A combined perceptron
[Figure: inputs i1 and i2 feed both an AND unit (w1 = 1, w2 = 1) and an OR unit (w4 = 1, w5 = 1); the AND unit connects onward with weight w3 = -1, removing the troublesome (1,1) case]
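The decomposition A xor B = (A or B) and not (A and B) can be sketched with threshold units; here the OR and AND units feed a final unit with weights +1 and -1 (one working parameterization, chosen for illustration rather than read off the figure):

```python
def step(x, threshold):
    """Fire (1) if the summed input exceeds the threshold, else stay off (0)."""
    return 1 if x > threshold else 0

def xor(a, b):
    or_unit = step(a + b, 0.5)    # fires if at least one input is on
    and_unit = step(a + b, 1.5)   # fires only if both inputs are on
    # The -1 weight on the AND unit removes the troublesome (1, 1) case.
    return step(1 * or_unit - 1 * and_unit, 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))  # outputs 0, 1, 1, 0
```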
17 What was the problem?
There was no learning law for layered perceptrons. Nobody bothered to find one! The main problem was that the perceptron solves a simple linear equation. Any straight combination of linear equations (adding them, or multiplying them by constants) is still a linear equation.
18 A non-linear activation
- Giving the neuron a non-linear activation function
- Made it possible to combine neurons into layered networks
- The sum was more than the sum of the parts
19 Summing up in multi-layered perceptrons
Recall: the perceptron summed up all its input and determined if that sum was larger than a threshold value. Now: the sum of all inputs is converted into a range of outputs, 0..1. The inputs can be large, [0,1] x weight (importance to activation), as weights can be in [-L, L] where L is a large number. One sophisticated summation uses a sigmoid function:
1 / (1 + e^-sum(in))
As sum(in) gets large, the function approaches 1/(1 + 0) --> 1. As sum(in) gets large and negative, it approaches 1/L --> 0. As sum(in) nears 0, the sigmoid is near 0.5.
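The three limiting cases can be checked directly (a minimal sketch of the sigmoid):

```python
import math

def sigmoid(s):
    """Logistic function 1 / (1 + e^-s): squashes any sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

print(sigmoid(10))   # close to 1 for a large positive sum
print(sigmoid(-10))  # close to 0 for a large negative sum
print(sigmoid(0))    # exactly 0.5 for a sum of 0
```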
20 Outline of Algorithm
- Initialize weights and thresholds. Weights are the connection strength between neurons. Start with small random values.
- Present pairs of input and desired output.
- Calculate ACTUAL outputs.
21 Outline of Algorithm
- Adapt weights
- Weights to neurons that contributed to correct output are strengthened.
- Weights to neurons that counteract correct output are weakened.
- Keep presenting data and adapt weights until criterion is met.
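The outline above can be sketched for a single sigmoid neuron learning "or" (the learning rate 0.5 and the 5000 presentations are my own illustrative choices, not values from the slides):

```python
import math
import random

random.seed(0)
# Initialize with small random values: two input weights plus a bias,
# the "input that is always on" from slide 8.
w = [random.uniform(-0.1, 0.1) for _ in range(3)]

def out(i1, i2):
    s = w[0] * i1 + w[1] * i2 + w[2]
    return 1.0 / (1.0 + math.exp(-s))

# Pairs of input and desired output (the "or" function).
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

for _ in range(5000):                        # keep presenting the data set...
    for (i1, i2), desired in data:
        o = out(i1, i2)                      # calculate ACTUAL output
        delta = (desired - o) * o * (1 - o)  # error scaled by f = h(1 - h)
        w[0] += 0.5 * delta * i1             # strengthen weights that helped,
        w[1] += 0.5 * delta * i2             # weaken those that counteracted
        w[2] += 0.5 * delta * 1              # bias input is always on

print([round(out(i1, i2)) for (i1, i2), _ in data])
```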
22 Backpropagation
- The discrepancy between the desired and actual output is called the error.
- Neurons in the previous layer are blamed or rewarded according to their contribution to the error.
- The error is propagated back to the input.
- Each previous neuron can be blamed if it helped to activate a neuron that was to blame.
- Or it can be rewarded if it helped a rewarded neuron.
23 Backpropagation of errors
The errors are propagated backwards towards where they originated. The weights of the hidden layer are updated for each neuron: How much of the error did it contribute? How sure was it of itself? Unsure neurons are close to 0.5 and should be updated the most; f = h(1 - h) is good for this purpose. If the neuron is sure of its activity (h --> 1 or h --> 0), then f --> 0. If it is maximally unsure (0.5), then f = 0.5 x 0.5 = 0.25. Usually only a fraction of the error is used for updates.
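The behaviour of f = h(1 - h) is easy to tabulate (a toy illustration):

```python
def f(h):
    """Update factor h(1 - h): largest for unsure neurons, vanishing for sure ones."""
    return h * (1 - h)

print(f(0.5))   # 0.25: maximally unsure, largest update
print(f(0.99))  # ~0.0099: sure of itself, tiny update
print(f(0.01))  # ~0.0099: equally sure, the other way
```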
24 The derivative
f = h(1 - h) is also the derivative of the logistic function, which is used to decide the activity of the neuron. The derivative of a function tells how fast, and in which direction, the function is changing. If the derivative is small, then we can only make small changes, as we are likely close to a (possibly local) minimum (of the errors). We want to find the global minimum of errors. There are many heuristics to accomplish this quickly, but no practical fail-safe rule.
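That h(1 - h) really is the derivative of the logistic function can be verified numerically (a sketch; the test point 0.7 is arbitrary):

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

# Central-difference estimate of the derivative at an arbitrary point.
s = 0.7
h = logistic(s)
numeric = (logistic(s + 1e-6) - logistic(s - 1e-6)) / 2e-6
print(abs(numeric - h * (1 - h)) < 1e-6)  # True: the derivative matches h(1 - h)
```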
25 Enough maths
So far we have looked at how a very simple model of a neuron is capable of learning by adjusting the weights by which it activates other neurons. It does this by comparing what was desired of its activity with what it actually did. The desired values are provided by a teacher signal.
Error feedback --- principled blame assignment
26 Supervised Learning
- Teacher signal
- Evaluation of the teacher signal creates an error, which can be used to minimise errors in the future.
- Iterative process
- Data set
- Multiple presentations
- Different errors will be generated
27 Non-linearly separable problems
Types of decision regions by network structure:
- Single-layer: half plane bounded by a hyperplane
- Two-layer: convex open or closed regions
- Three-layer: arbitrary (complexity limited by the number of nodes)
[Figure: each structure illustrated on the exclusive-or problem, on classes with meshed regions, and on the most general region shapes]
From "Neural Networks: An Introduction", Dr. Andrew Hunter
28 Other problems
29 Sets of solutions: one-to-many
Imagine we trained a network on steering a car along a winding road. The road turns right and left an equal number of times. The network would find a solution that minimizes the expected errors.
0 = steer to the left; 0.5 = do nothing; 1 = steer to the right
30 Sets of solutions: one-to-many
What happens if we find a fork in the road? We get an equal signal for turn left and turn right. The network minimizes the error by compromise: (left + right) / 2 = do nothing. The car gets wrecked.
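The compromise can be seen directly: for a squared-error measure, the single output that minimizes the error against two conflicting targets is their mean (a toy illustration):

```python
# Targets for the same input at the fork: full left (0) and full right (1).
targets = [0.0, 1.0]
best_output = sum(targets) / len(targets)  # least-squares optimum is the mean
print(best_output)  # 0.5: "do nothing", straight into the fork
```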
31 Other problems
Spurious regularities.
- Does training data represent the problem -- and nothing but the problem?
- The net is minimizing errors: there are many sources of errors, and errors are necessary for learning.
- The most consistent regularities will be discovered first; they may block the problem we want solved.
32 Catastrophic forgetting
There are no guarantees that something which has been acquired will not be forgotten when a new pattern is learned.
33 False Memories
Interaction between patterns in the data may cause false memories, i.e., some untrained patterns will emerge very strongly, sometimes stronger than trained patterns. This depends on the data set, the error function, the presentation order, etc.