Title: 6th lecture
1 Topics in Machine Learning
- 6th lecture
- Feedforward neural networks
2 Definition
- Feedforward neural network (FNN)
- a set of neurons 1, ..., N connected in a directed acyclic graph, an edge from i to j written as i → j,
- a weight w_ij for each edge i → j and a bias b_i for each neuron i,
- an activation function f_i for each neuron,
- input neurons: neurons without predecessors,
- output neurons: neurons without successors,
- hidden neurons: all the others,
- often a layered structure: multilayer FNNs.
3 Definition
- Such a network computes a function f_w: R^n → R^o depending on all weights w = (w_ij, b_i)_{i,j}:
- the neurons iteratively process the input x ∈ R^n and pass their outputs on to their successors,
- the o output neurons yield the output f_w(x) ∈ R^o.
- [Figure: the n input neurons copy the inputs, the hidden layer computes, the o output neurons produce f_w(x); an edge weight w_12 and a bias b_2 are shown as examples → blackboard]
4 Network function
- f_w(x) is the vector (o_i(x))_i over the output neurons i, where the output o_i(x) of neuron i is defined recursively over the network structure:
- o_i(x) = x_i for input neurons i (just copy the input),
- o_i(x) = f_i(Σ_{j→i} w_ji o_j(x) + b_i) (compute a perceptron output based on the outputs of the predecessors),
- often f_i(x) = sgd(x) = (1 + exp(-x))^{-1}
- or f_i(x) = tanh(x)
- ... a smooth approximation of the Heaviside function H! (a forward-pass sketch follows below)
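As a concrete illustration, here is a minimal NumPy sketch of this recursive computation for a layered FNN with sigmoid activations (the layer sizes, random weights, and function names are illustrative choices, not taken from the lecture):

```python
import numpy as np

def sgd(x):
    """Logistic activation sgd(x) = (1 + exp(-x))^(-1)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Compute f_w(x) for a layered FNN.

    weights[k] has shape (n_k, n_{k+1}) and holds w_ji for the edges
    from layer k to layer k+1; biases[k] holds the biases b_i of
    layer k+1.  Input neurons just copy x; every other neuron i
    computes o_i = sgd(sum_j w_ji * o_j + b_i).
    """
    o = np.asarray(x, dtype=float)   # outputs of the input neurons
    for W, b in zip(weights, biases):
        o = sgd(o @ W + b)           # outputs of the next layer
    return o                         # outputs of the output neurons

# Tiny example: 2 inputs, 3 hidden neurons, 1 output (random weights).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(3), np.zeros(1)]
print(forward([0.5, -1.0], weights, biases))
```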
5 Training
- Tasks: regression or classification tasks f: R^n → R^o
- Examples: pattern set (training set) with patterns (x^i, y^i)
- Goal: minimize the deviation of the predicted outputs from the real ones
- Principled idea of FNN training:
- fix a network architecture (1 or 2 hidden layers, 5-50 neurons per layer),
- optimize the parameters of the architecture such that the quadratic error on the training set is minimized:
- E(w) = 0.5 Σ_i ||f_w(x^i) - y^i||^2 (sketched in code below)
- E(w) small ⇒ x^i is approximately mapped to y^i for each i
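Continuing the forward-pass sketch above (so `forward`, `weights`, and `biases` are assumed from there; the toy pattern set is made up), the quadratic error on a pattern set can be written as:

```python
import numpy as np

def quadratic_error(weights, biases, patterns):
    """E(w) = 0.5 * sum_i ||f_w(x^i) - y^i||^2 over the pattern set."""
    return 0.5 * sum(
        np.sum((forward(x, weights, biases) - np.asarray(y)) ** 2)
        for x, y in patterns
    )

# toy pattern set of pairs (x^i, y^i)
patterns = [([0.0, 0.0], [0.0]), ([1.0, 1.0], [1.0])]
print(quadratic_error(weights, biases, patterns))
```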
6 Training
- How is E(w) minimized? Usually, the activation function is sgd.
- gradient descent or a better method (steepest descent / conjugate descent, later (seminar): RProp)
- How can we compute the gradient (the derivative of E(w) with respect to a weight w_kl)?
- W.l.o.g.: treat biases as weights of "on"-neurons, consider just one pattern and just one output neuron o (the derivative of a sum is just the sum of the single derivatives), and consider the activation function sgd.
- Then
- ∂E(w)/∂w_kl = ∂(0.5 Σ_i ||f_w(x^i) - y^i||^2)/∂w_kl, which reduces to ∂(0.5 (o_o(x) - y)^2)/∂w_kl (a numerical sanity check is sketched below)
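Before deriving the gradient analytically, a finite-difference check (again continuing the sketches above; `eps` and the chosen weight index are arbitrary) shows operationally what ∂E(w)/∂w_kl means; backpropagation will compute exactly this quantity, only far more cheaply:

```python
# Numerical estimate of dE/dw_kl for one weight of the first layer.
eps = 1e-6
k, l = 0, 1                                  # edge k -> l in weight matrix 0
w_plus = [W.copy() for W in weights]
w_minus = [W.copy() for W in weights]
w_plus[0][k, l] += eps
w_minus[0][k, l] -= eps
grad_kl = (quadratic_error(w_plus, biases, patterns)
           - quadratic_error(w_minus, biases, patterns)) / (2 * eps)
print(grad_kl)
```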
7 Training
- ∂(0.5 (o_o(x) - y)^2)/∂w_kl = (*)
- 1st try: (*) = (o_o(x) - y) · ∂o_o(x)/∂w_kl
- per definition o_i(x) = x_i for inputs, o_i(x) = sgd(Σ_{j→i} w_ji o_j(x) + b_i)
- hence
- ∂o_i(x)/∂w_kl = 0 for inputs,
- ∂o_i(x)/∂w_kl = sgd'(Σ_{j→i} w_ji o_j(x) + b_i) · Σ_{j→i} (w_ji · ∂o_j(x)/∂w_kl + δ_{(kl),(ji)} · o_j(x)), where δ_{(kl),(ji)} = 1 iff (j,i) = (k,l)
- this formula allows us to compute the derivatives ∂o_i(x)/∂w_kl iteratively from the inputs to the outputs (see the sketch below).
- Effort: O(W^2) for the whole gradient, since the propagation is repeated once per weight
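A minimal sketch of this "1st try" for the layered special case, with the same toy architecture as before (forward-mode propagation of the derivative of every neuron output with respect to a single weight w_kl; repeating it once per weight is what makes the total effort O(W^2)):

```python
import numpy as np

def sgd(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_outputs_d_weight(x, weights, biases, layer, k, l):
    """Forward-mode derivative do_i(x)/dw_kl for all neurons i, where
    w_kl connects neuron k in `layer` to neuron l in `layer + 1`."""
    o = np.asarray(x, dtype=float)
    d = np.zeros_like(o)                  # do_i/dw_kl = 0 for input neurons
    for idx, (W, b) in enumerate(zip(weights, biases)):
        net = o @ W + b
        extra = np.zeros(W.shape[1])
        if idx == layer:
            extra[l] = o[k]               # the delta_{(kl),(ji)} * o_k(x) term
        s = sgd(net)
        d = s * (1 - s) * (d @ W + extra) # sgd'(net_i) * (sum_j w_ji do_j/dw_kl + ...)
        o = s
    return o, d                           # outputs and their derivatives w.r.t. w_kl

rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(3), np.zeros(1)]
print(d_outputs_d_weight([0.5, -1.0], weights, biases, layer=0, k=0, l=1))
```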
8 Training
- define net_l = Σ_{j→l} w_jl o_j(x) + b_l
- key issue: ∂E/∂w_kl = ∂E/∂net_l · ∂net_l/∂w_kl = ∂E/∂net_l · o_k(x)
- 2nd try:
- δ_o = ∂(0.5 (o_o(x) - y)^2)/∂net_o = (o_o(x) - y) · sgd'(net_o) for output neurons
- δ_l = ∂E/∂net_l = Σ_{l→k} ∂E/∂net_k · ∂net_k/∂net_l = Σ_{l→k} δ_k · w_lk · sgd'(net_l) for all other neurons
- ⇒ iterative computation from the outputs to the inputs!
- sgd'(x) = sgd(x) · (1 - sgd(x))
- δ_l is called the error signal of neuron l
9 Training
- Backpropagation (for one pattern)
- efficient O(W) computation of the derivatives
- forward pass: compute the outputs o_i(x)
- backward pass: compute the error signals δ_l
- change the weights along the negative gradient, using ∂E/∂w_kl = δ_l · o_k(x) (see the sketch below)
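Putting slides 8 and 9 together, here is a self-contained NumPy sketch of backpropagation for one pattern in a layered sigmoid FNN (the architecture, learning rate, and names such as `backprop` are illustrative choices, not the lecture's code):

```python
import numpy as np

def sgd(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop(x, y, weights, biases):
    """O(W) gradient of E = 0.5*||f_w(x) - y||^2 for one pattern (x, y)."""
    # forward pass: compute and store the outputs of every layer
    outputs = [np.asarray(x, dtype=float)]
    for W, b in zip(weights, biases):
        outputs.append(sgd(outputs[-1] @ W + b))

    # backward pass: error signals delta, from the outputs to the inputs
    o = outputs[-1]
    delta = (o - np.asarray(y, dtype=float)) * o * (1 - o)  # (o_o - y) * sgd'(net_o)
    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)
    for k in range(len(weights) - 1, -1, -1):
        grad_W[k] = np.outer(outputs[k], delta)        # dE/dw_kl = o_k(x) * delta_l
        grad_b[k] = delta                              # dE/db_l  = delta_l
        o = outputs[k]
        delta = (delta @ weights[k].T) * o * (1 - o)   # delta_l = sum_k delta_k w_lk sgd'(net_l)
    return grad_W, grad_b

# one gradient-descent step for a single pattern
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(3), np.zeros(1)]
eta = 0.5
gW, gb = backprop([0.5, -1.0], [1.0], weights, biases)
for W, g in zip(weights, gW):
    W -= eta * g
for b, g in zip(biases, gb):
    b -= eta * g
```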
10 Training - remarks
- online Backpropagation: change the weights after presenting just one pattern
- ⇒ so-called stochastic gradient descent; it can avoid small local optima due to the stochasticity and can also be applied when the patterns arrive online (e.g. on a robot)
- offline Backpropagation: first accumulate the changes for all patterns, then change the weights
- ⇒ improved variants such as steepest descent, conjugate descent, and RProp are offline (both variants are sketched below)
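A sketch of the two variants, assuming the hypothetical `backprop(x, y, weights, biases)` helper from the previous sketch and a fixed learning rate `eta`:

```python
import numpy as np

def online_epoch(patterns, weights, biases, eta=0.1):
    """Online (stochastic) Backpropagation: update after every single pattern."""
    for x, y in patterns:
        gW, gb = backprop(x, y, weights, biases)
        for W, g in zip(weights, gW):
            W -= eta * g
        for b, g in zip(biases, gb):
            b -= eta * g

def offline_epoch(patterns, weights, biases, eta=0.1):
    """Offline (batch) Backpropagation: accumulate all changes, then update once."""
    acc_W = [np.zeros_like(W) for W in weights]
    acc_b = [np.zeros_like(b) for b in biases]
    for x, y in patterns:
        gW, gb = backprop(x, y, weights, biases)
        for a, g in zip(acc_W, gW):
            a += g
        for a, g in zip(acc_b, gb):
            a += g
    for W, a in zip(weights, acc_W):
        W -= eta * a
    for b, a in zip(biases, acc_b):
        b -= eta * a
```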
11 Further remarks
- FNNs are universal approximators:
- every continuous function on compacta can be approximated arbitrarily well by an FNN with one hidden layer
- for p patterns, p·o neurons are sufficient
- Neural networks generalize to new data; the VC dimension (→ later lecture) lies between W^2 and W^4
- Pruning and growing algorithms are possible (→ seminar: OBD, CC)
- improved generalization with early stopping (monitor the error on a held-out test set and stop at its minimum) or weight decay (decrease all weights in every step); a sketch follows below
- alternative error functions instead of the quadratic error can be used
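One possible sketch of early stopping combined with weight decay, reusing the hypothetical `offline_epoch` and `quadratic_error` helpers from the sketches above (the decay factor, learning rate, and epoch limit are arbitrary):

```python
def train_early_stopping(train, held_out, weights, biases,
                         eta=0.1, decay=1e-4, max_epochs=500):
    """Offline Backpropagation with weight decay; keep the parameters at the
    minimum of the error on the held-out set (early stopping)."""
    best_err, best_params = float("inf"), None
    for epoch in range(max_epochs):
        offline_epoch(train, weights, biases, eta)
        for W in weights:               # weight decay: shrink all weights a bit
            W *= (1.0 - decay)
        err = quadratic_error(weights, biases, held_out)
        if err < best_err:              # remember the parameters at the minimum
            best_err = err
            best_params = ([W.copy() for W in weights],
                           [b.copy() for b in biases])
    return best_params
```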
12 Training pipeline
- given data, preprocess the data: normalization, input reduction, feature extraction, ...
- split into training and test set
- determine the architecture and the learning parameters via crossvalidation on the training set
- optimize the parameters on the training set using some modified backpropagation and early stopping, weight decay, ...
- possibly further simplify the network using pruning, ...
- evaluate the network on the test set (not used for training) ⇒ this estimates the true error
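For illustration only (not the lecture's own tooling), the same pipeline can be sketched with scikit-learn on synthetic data; `alpha` plays the role of weight decay and `early_stopping=True` holds back a validation fraction and stops at its error minimum:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy data set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# split into training and test set (preprocessing is done by StandardScaler below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# determine architecture and learning parameters via crossvalidation on the training set
net = make_pipeline(StandardScaler(),
                    MLPRegressor(max_iter=2000, early_stopping=True, random_state=0))
grid = GridSearchCV(net, {"mlpregressor__hidden_layer_sizes": [(5,), (20,), (50,)],
                          "mlpregressor__alpha": [1e-5, 1e-3, 1e-1]}, cv=5)

# optimize the parameters on the training set
grid.fit(X_train, y_train)

# evaluate on the test set (not used for training) -> estimates the true error
print("test R^2:", grid.score(X_test, y_test))
```

The pruning step of the slide has no direct counterpart in this sketch and is omitted.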