1
Topics in Machine Learning
  • 6th lecture
  • Feedforward neural networks

2
Definition
  • Feedforward neural network (FNN):
  • a set of neurons 1,...,N connected in a directed
    acyclic graph; a connection is written i→j,
  • weights w_ij for each connection i→j and a bias b_i
    for each neuron i,
  • an activation function f_i for each neuron,
  • input neurons = neurons without predecessors,
  • output neurons = neurons without successors,
  • hidden neurons = all other neurons,
  • often a layered structure: multilayer FNNs
    (a sketch of such a structure follows below).
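For a layered FNN, the weights and biases can be stored as one matrix and one
vector per layer. This is a minimal sketch of such a representation in NumPy;
the function name init_fnn and the initialization scheme are illustrative
choices, not part of the lecture.

```python
import numpy as np

# weights[t][i, j] is the weight from neuron j of layer t to neuron i of
# layer t+1; biases[t][i] is the bias of that neuron.
def init_fnn(layer_sizes, rng=np.random.default_rng(0)):
    weights = [rng.normal(0.0, 0.1, size=(m, n))
               for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [np.zeros(m) for m in layer_sizes[1:]]
    return weights, biases

# e.g. n = 3 inputs, one hidden layer with 5 neurons, o = 2 outputs
weights, biases = init_fnn([3, 5, 2])
```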

3
Definition
  • such a network computes a function f_w: R^n → R^o
    depending on all weights w = (w_ij, b_i)_ij

[Figure: layered FNN with n input neurons, one hidden layer, and o output
neurons (example labels w_12, b_2); the input neurons copy the inputs, the
other neurons iteratively process the input x and pass their outputs to
their successors, yielding the output f_w(x) ∈ R^o for x ∈ R^n]
→ blackboard
4
Network function
  • f_w(x) is the vector (o_i(x))_i over the output
    neurons i, where the output o_i(x) of neuron i is
    defined recursively over the network structure:
  • o_i(x) = x_i for input neurons i (just copy the
    input),
  • o_i(x) = f_i(Σ_{j→i} w_ji o_j(x) + b_i) (compute a
    perceptron output based on the outputs of the
    predecessors),
  • often f_i(x) = sgd(x) = (1 + exp(-x))^-1
  • or f_i(x) = tanh(x)
  • ... both smooth approximations of the threshold
    function H (a forward-pass sketch follows below).
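For a layered network the recursion above becomes one matrix-vector product
per layer. The following is a minimal NumPy sketch of the forward pass; it
reuses the layered weights/biases representation from the earlier sketch,
and the names and toy values are illustrative.

```python
import numpy as np

def sgd(z):
    """Logistic activation sgd(z) = (1 + exp(-z))^-1."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Compute f_w(x): copy the inputs, then apply each layer in turn."""
    o = np.asarray(x, dtype=float)        # o_i(x) = x_i for input neurons
    for W, b in zip(weights, biases):
        o = sgd(W @ o + b)                # o_i(x) = sgd(sum_j w_ji o_j(x) + b_i)
    return o                              # outputs of the output layer

# toy usage: 3 inputs, 5 hidden neurons, 2 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (5, 3)), rng.normal(0, 0.1, (2, 5))]
biases = [np.zeros(5), np.zeros(2)]
print(forward(weights, biases, [0.2, -0.4, 1.0]))   # f_w(x) in R^2
```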

5
Training
  • Tasks: regression or classification tasks f: R^n → R^o
  • Examples: pattern set (training set) with patterns
    (x_i, y_i)
  • Goal: minimize the deviation of the predicted outputs
    from the real ones
  • Principled idea of FNN training:
  • fix a network architecture (1 or 2 hidden layers,
    5-50 neurons per layer),
  • optimize the parameters of the architecture such
    that the quadratic error on the training set is
    minimized:
  • E(w) = 0.5 Σ_i ||f_w(x_i) - y_i||^2

E(w) small ⇒ x_i is approximately mapped to y_i for
each i
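As a small illustration, E(w) can be computed directly from the forward pass.
This sketch assumes a forward function f_w (for example the one sketched
above) and a pattern set of (x_i, y_i) arrays; the names are illustrative.

```python
import numpy as np

def quadratic_error(forward_fn, patterns):
    """E(w) = 0.5 * sum_i ||f_w(x_i) - y_i||^2 over the training set."""
    return 0.5 * sum(np.sum((forward_fn(x) - y) ** 2) for x, y in patterns)
```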
6
Training
  • How is E(w) minimized? Usually, the activation
    function is sgd.
  • gradient descent or a better method (steepest
    descent/conjugate gradient descent, later (seminar)
    RProp)
  • How can we compute the gradient (the derivative of
    E(w) with respect to a weight w_kl)?
  • W.l.o.g.: treat biases as weights from "on-neurons",
    consider just one pattern and just one output neuron o
    (the derivative of a sum is just the sum of the single
    derivatives), and take sgd as the activation function.
  • Then
  • ∂E(w)/∂w_kl = ∂(0.5 Σ_i ||f_w(x_i) - y_i||^2)/∂w_kl
    = ∂(0.5 (o_o(x) - y)^2)/∂w_kl

7
Training
  • ∂(0.5 (o_o(x) - y)^2)/∂w_kl =: (*)
  • 1st try: (*) = (o_o(x) - y) ∂o_o(x)/∂w_kl
  • per definition o_i(x) = x_i for inputs, o_i(x) =
    sgd(Σ_{j→i} w_ji o_j(x) + b_i) otherwise
  • hence
  • ∂o_i(x)/∂w_kl = 0 for inputs,
  • ∂o_i(x)/∂w_kl = sgd'(Σ_{j→i} w_ji o_j(x) + b_i)
  • · (Σ_{j→i} (w_ji ∂o_j(x)/∂w_kl + δ_{(k,l),(j,i)} o_j(x)))
  • this formula allows us to compute the
    derivative ∂o_i(x)/∂w_kl iteratively from
    the inputs to the outputs.
  • Effort: O(W^2) for the full gradient, i.e. one O(W)
    pass per weight for all W weights (see the sketch
    below).
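To illustrate where the O(W^2) effort comes from, here is a rough sketch of
this forward computation for a layered network, again using the layered
weights/biases representation from the earlier sketches; the function name
and parameters are illustrative. One call handles a single weight, so the
full gradient needs one call per weight.

```python
import numpy as np

def sgd(z):
    return 1.0 / (1.0 + np.exp(-z))

def doutput_dw(weights, biases, x, t_target, k, l):
    """Propagate d o / d w_kl forward for the single weight
    weights[t_target][k, l] (connection from neuron l of layer t_target
    to neuron k of layer t_target + 1)."""
    o = np.asarray(x, dtype=float)
    do = np.zeros_like(o)              # d o_j / d w_kl; zero for the inputs
    for t, (W, b) in enumerate(zip(weights, biases)):
        dnet = W @ do                  # chain rule through the predecessors
        if t == t_target:
            dnet[k] += o[l]            # direct dependence of net_k on w_kl
        o = sgd(W @ o + b)
        do = o * (1.0 - o) * dnet      # sgd'(net) = sgd(net)(1 - sgd(net))
    return do                          # derivatives of the output neurons

# full gradient this way: one pass (O(W) work) for each of the W weights
# -> O(W^2) in total, which motivates backpropagation below.
```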

8
Training
  • define net_l = Σ_{j→l} w_jl o_j(x) + b_l
  • key issue: ∂E/∂w_kl = ∂E/∂net_l · ∂net_l/∂w_kl
    = ∂E/∂net_l · o_k(x)
  • 2nd try:
  • δ_o = ∂(0.5 (o_o(x) - y)^2)/∂net_o
    = (o_o(x) - y) sgd'(net_o) for the output neuron o
  • δ_l = ∂E/∂net_l = Σ_{l→k} ∂E/∂net_k · ∂net_k/∂net_l
  • = Σ_{l→k} δ_k w_lk sgd'(net_l)
  • ⇒ iterative computation from the outputs to the
    inputs!
  • sgd'(x) = sgd(x)(1 - sgd(x))

error signal δ_l
9
Training
  • Backpropagation (for one pattern):
  • efficient O(W) computation of the derivatives
  • compute the outputs o_i(x) (forward, from inputs to
    outputs),
  • and the error signals δ_l (backward, from outputs to
    inputs),
  • change each weight along the negative gradient,
    Δw_kl = -η δ_l o_k(x) (see the sketch below).
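A minimal sketch of one backpropagation step for a single pattern on a
layered network with sgd activations. The error signals δ and the update
Δw_kl = -η δ_l o_k(x) follow the formulas above; the layered representation,
the names, and the learning-rate value are illustrative assumptions.

```python
import numpy as np

def sgd(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_one_pattern(weights, biases, x, y, eta=0.1):
    """One online backpropagation step for pattern (x, y)."""
    # forward pass: store the outputs o_i(x) of every layer
    outputs = [np.asarray(x, dtype=float)]
    for W, b in zip(weights, biases):
        outputs.append(sgd(W @ outputs[-1] + b))

    # error signal of the output layer: delta_o = (o_o(x) - y) * sgd'(net_o)
    delta = (outputs[-1] - np.asarray(y, dtype=float)) \
            * outputs[-1] * (1.0 - outputs[-1])

    # backward pass: propagate delta and change weights by -eta * delta_l * o_k(x)
    for t in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, outputs[t])      # dE/dw_kl = delta_l * o_k(x)
        grad_b = delta                            # biases treated as on-neurons
        if t > 0:                                 # delta of the layer below
            delta = (weights[t].T @ delta) * outputs[t] * (1.0 - outputs[t])
        weights[t] -= eta * grad_W
        biases[t] -= eta * grad_b
    return 0.5 * np.sum((outputs[-1] - y) ** 2)   # error before the update

# toy usage: 3 inputs, 5 hidden neurons, 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.5, (5, 3)), rng.normal(0, 0.5, (1, 5))]
biases = [np.zeros(5), np.zeros(1)]
for _ in range(100):
    backprop_one_pattern(weights, biases, [0.1, 0.9, -0.3], [0.7])
```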

10
Training - remarks
  • online backpropagation: change the weights after
    presenting just one pattern
  • ⇒ so-called stochastic gradient descent; can
    avoid small local optima due to the
    stochasticity; can also be applied when the
    patterns arrive online (e.g. on a robot)
  • offline backpropagation: first accumulate the
    changes for all patterns and then change the
    weights
  • ⇒ improved variants such as steepest descent,
    conjugate gradient descent, RProp are offline
    (see the sketch below)
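A minimal sketch of the difference between the two schemes, assuming a
generic function gradient(w, pattern) that returns the gradient of the
per-pattern error (an illustrative placeholder, not from the lecture):

```python
def online_epoch(w, patterns, gradient, eta=0.1):
    """Online (stochastic) backpropagation: update after every pattern."""
    for x, y in patterns:
        w = w - eta * gradient(w, (x, y))
    return w

def offline_epoch(w, patterns, gradient, eta=0.1):
    """Offline (batch) backpropagation: accumulate first, then update once."""
    total = sum(gradient(w, p) for p in patterns)
    return w - eta * total
```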

11
Further remarks
  • FNNs are universal approximators:
  • every continuous function on a compact set can be
    approximated arbitrarily well by an FNN with one
    hidden layer
  • for p patterns, p·o neurons are sufficient
  • neural networks generalize to new data; the VC
    dimension (→ later lecture) is between W^2 and W^4
  • pruning and growing algorithms are possible (→
    seminar: OBD, CC)
  • improved generalization with early stopping
    (monitor the error on a held-out validation/test set
    and stop at its minimum) or weight decay (decrease
    all weights a little in every step, as sketched below)
  • alternative error functions instead of the
    quadratic error can be used
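A minimal sketch of the two regularization ideas, assuming illustrative
callables train_epoch(w) (one training pass returning new weights) and
val_error(w) (error on the held-out set); nothing here is prescribed by the
lecture.

```python
def weight_decay_step(w, grad, eta=0.1, lam=1e-4):
    """Gradient step that also shrinks all weights a little in every step."""
    return w - eta * grad - lam * w

def train_with_early_stopping(w, train_epoch, val_error,
                              max_epochs=1000, patience=20):
    """Stop once the held-out error has not improved for `patience` epochs
    and return the weights at its observed minimum."""
    best_w, best_err, since_best = w, val_error(w), 0
    for _ in range(max_epochs):
        w = train_epoch(w)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_w
```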

12
Training pipeline
  • given data: preprocess the data (normalization,
    input reduction, feature extraction, ...)
  • split into training and test set
  • determine the architecture and learning
    parameters via crossvalidation on the training set
  • optimize the parameters on the training set using
    some modified backpropagation and early stopping,
    weight decay, ...
  • possibly further simplify the network using
    pruning, ...
  • evaluate the network on the test set (not used
    for training) ⇒ this estimates the true error
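As an end-to-end illustration of this pipeline, here is a sketch using
scikit-learn; the library choice, the synthetic data, the parameter grid,
and the regressor settings are assumptions for the example only. The alpha
parameter plays the role of weight decay, and early_stopping holds out part
of the training data internally.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPRegressor

# given data (here: synthetic); normalization happens inside the pipeline
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# determine architecture and learning parameters via crossvalidation
# on the training set (hidden layer size, weight-decay strength)
model = make_pipeline(StandardScaler(),
                      MLPRegressor(early_stopping=True, max_iter=2000,
                                   random_state=0))
grid = {"mlpregressor__hidden_layer_sizes": [(5,), (20,), (50,)],
        "mlpregressor__alpha": [1e-4, 1e-2]}
search = GridSearchCV(model, grid, cv=5)
search.fit(X_train, y_train)

# evaluate on the test set (not used for training) -> estimates the true error
print("test R^2:", search.score(X_test, y_test))
```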