Title: 6th lecture
1 Topics in Machine Learning
- 6th lecture
- Feedforward neural networks
2 Definition
- Feedforward neural network (FNN)
- a set of neurons 1, ..., N connected in a directed acyclic graph, an edge from i to j written as i → j,
- a weight w_ij for each edge i → j and a bias b_i for each neuron i,
- an activation function f_i for each neuron,
- input neurons: neurons without predecessors,
- output neurons: neurons without successors,
- hidden neurons: all the others,
- often a layered structure: multilayer FNNs.
3 Definition
- Such a network computes a function f_w: R^n → R^o depending on all weights w = (w_ij, b_i)_{i,j}:
- the neurons iteratively process the input x ∈ R^n and pass their outputs on to their successors,
- the o output neurons yield the output f_w(x) ∈ R^o.
- [Figure: the n input neurons copy the inputs, the hidden layer computes, the o output neurons produce f_w(x); an edge weight w_12 and a bias b_2 are shown as examples → blackboard]
4 Network function
- f_w(x) is the vector (o_i(x))_i over the output neurons i, where the output o_i(x) of neuron i is defined recursively over the network structure:
- o_i(x) = x_i for input neurons i (just copy the input),
- o_i(x) = f_i(Σ_{j→i} w_ji o_j(x) + b_i) (compute a perceptron output based on the outputs of the predecessors),
- often f_i(x) = sgd(x) = (1 + exp(-x))^{-1}
- or f_i(x) = tanh(x)
- ... a smooth approximation of the Heaviside function H! (a forward-pass sketch follows below)
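As a concrete illustration, here is a minimal NumPy sketch of this recursive computation for a layered FNN with sigmoid activations (the layer sizes, random weights, and function names are illustrative choices, not taken from the lecture):

```python
import numpy as np

def sgd(x):
    """Logistic activation sgd(x) = (1 + exp(-x))^(-1)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Compute f_w(x) for a layered FNN.

    weights[k] has shape (n_k, n_{k+1}) and holds w_ji for the edges
    from layer k to layer k+1; biases[k] holds the biases b_i of
    layer k+1.  Input neurons just copy x; every other neuron i
    computes o_i = sgd(sum_j w_ji * o_j + b_i).
    """
    o = np.asarray(x, dtype=float)   # outputs of the input neurons
    for W, b in zip(weights, biases):
        o = sgd(o @ W + b)           # outputs of the next layer
    return o                         # outputs of the output neurons

# Tiny example: 2 inputs, 3 hidden neurons, 1 output (random weights).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(3), np.zeros(1)]
print(forward([0.5, -1.0], weights, biases))
```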
5 Training
- Tasks: regression or classification tasks f: R^n → R^o
- Examples: pattern set (training set) with patterns (x^i, y^i)
- Goal: minimize the deviation of the predicted outputs from the real ones
- Principled idea of FNN training:
- fix a network architecture (1 or 2 hidden layers, 5-50 neurons per layer),
- optimize the parameters of the architecture such that the quadratic error on the training set is minimized:
- E(w) = 0.5 Σ_i ||f_w(x^i) - y^i||^2 (sketched in code below)
- E(w) small ⇒ x^i is approximately mapped to y^i for each i
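Continuing the forward-pass sketch above (so `forward`, `weights`, and `biases` are assumed from there; the toy pattern set is made up), the quadratic error on a pattern set can be written as:

```python
import numpy as np

def quadratic_error(weights, biases, patterns):
    """E(w) = 0.5 * sum_i ||f_w(x^i) - y^i||^2 over the pattern set."""
    return 0.5 * sum(
        np.sum((forward(x, weights, biases) - np.asarray(y)) ** 2)
        for x, y in patterns
    )

# toy pattern set of pairs (x^i, y^i)
patterns = [([0.0, 0.0], [0.0]), ([1.0, 1.0], [1.0])]
print(quadratic_error(weights, biases, patterns))
```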
6 Training
- How is E(w) minimized? Usually, the activation function is sgd.
- gradient descent or a better method (steepest descent / conjugate descent, later (seminar): RProp)
- How can we compute the gradient (the derivative of E(w) with respect to a weight w_kl)?
- W.l.o.g.: treat biases as weights of "on"-neurons, consider just one pattern and just one output neuron o (the derivative of a sum is just the sum of the single derivatives), and consider the activation function sgd.
- Then
- ∂E(w)/∂w_kl = ∂(0.5 Σ_i ||f_w(x^i) - y^i||^2)/∂w_kl, which reduces to ∂(0.5 (o_o(x) - y)^2)/∂w_kl (a numerical sanity check is sketched below)
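Before deriving the gradient analytically, a finite-difference check (again continuing the sketches above; `eps` and the chosen weight index are arbitrary) shows operationally what ∂E(w)/∂w_kl means; backpropagation will compute exactly this quantity, only far more cheaply:

```python
# Numerical estimate of dE/dw_kl for one weight of the first layer.
eps = 1e-6
k, l = 0, 1                                  # edge k -> l in weight matrix 0
w_plus = [W.copy() for W in weights]
w_minus = [W.copy() for W in weights]
w_plus[0][k, l] += eps
w_minus[0][k, l] -= eps
grad_kl = (quadratic_error(w_plus, biases, patterns)
           - quadratic_error(w_minus, biases, patterns)) / (2 * eps)
print(grad_kl)
```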
7 Training
- ∂(0.5 (o_o(x) - y)^2)/∂w_kl = (*)
- 1st try: (*) = (o_o(x) - y) · ∂o_o(x)/∂w_kl
- per definition o_i(x) = x_i for inputs, o_i(x) = sgd(Σ_{j→i} w_ji o_j(x) + b_i)
- hence
- ∂o_i(x)/∂w_kl = 0 for inputs,
- ∂o_i(x)/∂w_kl = sgd'(Σ_{j→i} w_ji o_j(x) + b_i) · Σ_{j→i} (w_ji · ∂o_j(x)/∂w_kl + δ_{(kl),(ji)} · o_j(x)), where δ_{(kl),(ji)} = 1 iff (j,i) = (k,l)
- this formula allows us to compute the derivatives ∂o_i(x)/∂w_kl iteratively from the inputs to the outputs (see the sketch below).
- Effort: O(W^2) for the whole gradient, since the propagation is repeated once per weight
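A minimal sketch of this "1st try" for the layered special case, with the same toy architecture as before (forward-mode propagation of the derivative of every neuron output with respect to a single weight w_kl; repeating it once per weight is what makes the total effort O(W^2)):

```python
import numpy as np

def sgd(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_outputs_d_weight(x, weights, biases, layer, k, l):
    """Forward-mode derivative do_i(x)/dw_kl for all neurons i, where
    w_kl connects neuron k in `layer` to neuron l in `layer + 1`."""
    o = np.asarray(x, dtype=float)
    d = np.zeros_like(o)                  # do_i/dw_kl = 0 for input neurons
    for idx, (W, b) in enumerate(zip(weights, biases)):
        net = o @ W + b
        extra = np.zeros(W.shape[1])
        if idx == layer:
            extra[l] = o[k]               # the delta_{(kl),(ji)} * o_k(x) term
        s = sgd(net)
        d = s * (1 - s) * (d @ W + extra) # sgd'(net_i) * (sum_j w_ji do_j/dw_kl + ...)
        o = s
    return o, d                           # outputs and their derivatives w.r.t. w_kl

rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(3), np.zeros(1)]
print(d_outputs_d_weight([0.5, -1.0], weights, biases, layer=0, k=0, l=1))
```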
8 Training
- define net_l = Σ_{j→l} w_jl o_j(x) + b_l
- key issue: ∂E/∂w_kl = ∂E/∂net_l · ∂net_l/∂w_kl = ∂E/∂net_l · o_k(x)
- 2nd try:
- δ_o = ∂(0.5 (o_o(x) - y)^2)/∂net_o = (o_o(x) - y) · sgd'(net_o) for output neurons
- δ_l = ∂E/∂net_l = Σ_{l→k} ∂E/∂net_k · ∂net_k/∂net_l = Σ_{l→k} δ_k · w_lk · sgd'(net_l) for all other neurons
- ⇒ iterative computation from the outputs to the inputs!
- sgd'(x) = sgd(x) · (1 - sgd(x))
- δ_l is called the error signal of neuron l
9 Training
- Backpropagation (for one pattern)
- efficient O(W) computation of the derivatives
- forward pass: compute the outputs o_i(x)
- backward pass: compute the error signals δ_l
- change the weights along the negative gradient, using ∂E/∂w_kl = δ_l · o_k(x) (see the sketch below)
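Putting slides 8 and 9 together, here is a self-contained NumPy sketch of backpropagation for one pattern in a layered sigmoid FNN (the architecture, learning rate, and names such as `backprop` are illustrative choices, not the lecture's code):

```python
import numpy as np

def sgd(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop(x, y, weights, biases):
    """O(W) gradient of E = 0.5*||f_w(x) - y||^2 for one pattern (x, y)."""
    # forward pass: compute and store the outputs of every layer
    outputs = [np.asarray(x, dtype=float)]
    for W, b in zip(weights, biases):
        outputs.append(sgd(outputs[-1] @ W + b))

    # backward pass: error signals delta, from the outputs to the inputs
    o = outputs[-1]
    delta = (o - np.asarray(y, dtype=float)) * o * (1 - o)  # (o_o - y) * sgd'(net_o)
    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)
    for k in range(len(weights) - 1, -1, -1):
        grad_W[k] = np.outer(outputs[k], delta)        # dE/dw_kl = o_k(x) * delta_l
        grad_b[k] = delta                              # dE/db_l  = delta_l
        o = outputs[k]
        delta = (delta @ weights[k].T) * o * (1 - o)   # delta_l = sum_k delta_k w_lk sgd'(net_l)
    return grad_W, grad_b

# one gradient-descent step for a single pattern
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(3), np.zeros(1)]
eta = 0.5
gW, gb = backprop([0.5, -1.0], [1.0], weights, biases)
for W, g in zip(weights, gW):
    W -= eta * g
for b, g in zip(biases, gb):
    b -= eta * g
```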
10 Training - remarks
- online Backpropagation: change the weights after presenting just one pattern
- ⇒ so-called stochastic gradient descent; it can avoid small local optima due to the stochasticity and can also be applied when the patterns arrive online (e.g. on a robot)
- offline Backpropagation: first accumulate the changes for all patterns, then change the weights
- ⇒ improved variants such as steepest descent, conjugate descent, and RProp are offline (both variants are sketched below)
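A sketch of the two variants, assuming the hypothetical `backprop(x, y, weights, biases)` helper from the previous sketch and a fixed learning rate `eta`:

```python
import numpy as np

def online_epoch(patterns, weights, biases, eta=0.1):
    """Online (stochastic) Backpropagation: update after every single pattern."""
    for x, y in patterns:
        gW, gb = backprop(x, y, weights, biases)
        for W, g in zip(weights, gW):
            W -= eta * g
        for b, g in zip(biases, gb):
            b -= eta * g

def offline_epoch(patterns, weights, biases, eta=0.1):
    """Offline (batch) Backpropagation: accumulate all changes, then update once."""
    acc_W = [np.zeros_like(W) for W in weights]
    acc_b = [np.zeros_like(b) for b in biases]
    for x, y in patterns:
        gW, gb = backprop(x, y, weights, biases)
        for a, g in zip(acc_W, gW):
            a += g
        for a, g in zip(acc_b, gb):
            a += g
    for W, a in zip(weights, acc_W):
        W -= eta * a
    for b, a in zip(biases, acc_b):
        b -= eta * a
```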
11 Further remarks
- FNNs are universal approximators:
- every continuous function on compacta can be approximated arbitrarily well by an FNN with one hidden layer
- for p patterns, p·o neurons are sufficient
- Neural networks generalize to new data; the VC dimension (→ later lecture) lies between W^2 and W^4
- Pruning and growing algorithms are possible (→ seminar: OBD, CC)
- improved generalization with early stopping (monitor the error on a held-out test set and stop at its minimum) or weight decay (decrease all weights in every step); a sketch follows below
- alternative error functions instead of the quadratic error can be used
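One possible sketch of early stopping combined with weight decay, reusing the hypothetical `offline_epoch` and `quadratic_error` helpers from the sketches above (the decay factor, learning rate, and epoch limit are arbitrary):

```python
def train_early_stopping(train, held_out, weights, biases,
                         eta=0.1, decay=1e-4, max_epochs=500):
    """Offline Backpropagation with weight decay; keep the parameters at the
    minimum of the error on the held-out set (early stopping)."""
    best_err, best_params = float("inf"), None
    for epoch in range(max_epochs):
        offline_epoch(train, weights, biases, eta)
        for W in weights:               # weight decay: shrink all weights a bit
            W *= (1.0 - decay)
        err = quadratic_error(weights, biases, held_out)
        if err < best_err:              # remember the parameters at the minimum
            best_err = err
            best_params = ([W.copy() for W in weights],
                           [b.copy() for b in biases])
    return best_params
```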
12 Training pipeline
- given data, preprocess the data: normalization, input reduction, feature extraction, ...
- split into training and test set
- determine the architecture and the learning parameters via crossvalidation on the training set
- optimize the parameters on the training set using some modified backpropagation and early stopping, weight decay, ...
- possibly further simplify the network using pruning, ...
- evaluate the network on the test set (not used for training) ⇒ this estimates the true error
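For illustration only (not the lecture's own tooling), the same pipeline can be sketched with scikit-learn on synthetic data; `alpha` plays the role of weight decay and `early_stopping=True` holds back a validation fraction and stops at its error minimum:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy data set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# split into training and test set (preprocessing is done by StandardScaler below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# determine architecture and learning parameters via crossvalidation on the training set
net = make_pipeline(StandardScaler(),
                    MLPRegressor(max_iter=2000, early_stopping=True, random_state=0))
grid = GridSearchCV(net, {"mlpregressor__hidden_layer_sizes": [(5,), (20,), (50,)],
                          "mlpregressor__alpha": [1e-5, 1e-3, 1e-1]}, cv=5)

# optimize the parameters on the training set
grid.fit(X_train, y_train)

# evaluate on the test set (not used for training) -> estimates the true error
print("test R^2:", grid.score(X_test, y_test))
```

The pruning step of the slide has no direct counterpart in this sketch and is omitted.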