Title: Network Training: The Gradient Descent Method
1. Network Training: The Gradient Descent Method
2. Learning as an Optimization Problem
- Minimizing the error function by optimizing the parameters
  - e.g., the sum-of-squares error E(w) = (1/2) Σ_i (y(x_i; w) − t_i)², where i is the index of the training example (see the sketch below)
- All examples: training set + testing set
- The performance of the learning algorithm
  - The convergence
  - The speed
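A minimal sketch of the sum-of-squares error, assuming a generic model function y(x, w); the function and variable names here are illustrative, not taken from the slides:

```python
import numpy as np

def sum_of_squares_error(w, X, t, model):
    """E(w) = 1/2 * sum_i (model(x_i, w) - t_i)^2 over the training set."""
    predictions = np.array([model(x, w) for x in X])
    return 0.5 * np.sum((predictions - t) ** 2)

# Illustrative linear model y(x, w) = w . x
linear = lambda x, w: np.dot(w, x)
X = np.array([[1.0, 0.5], [1.0, -1.0]])   # inputs (with a bias feature)
t = np.array([1.0, -1.0])                  # targets
w = np.zeros(2)
print(sum_of_squares_error(w, X, t, linear))
```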
3. A standard strategy: Gradient Descent (1)
- How to update (optimize) w based on E(w)?
- The gradient of the error function: ∇E(w) = (∂E/∂w_0, ∂E/∂w_1, ...)
- An important observation: at a point w, the gradient ∇E(w) points in the direction of steepest increase of E, so the negative gradient −∇E(w) gives a descent direction
4. Gradient Descent (2)
- The learning rule: w_new = w_old − η ∇E(w_old), i.e., w(t) = w(t-1) − η ∇E(w(t-1)) (see the sketch below)
- where η is the learning rate (a small positive number) and t indexes the learning steps
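A minimal sketch of one learning step under this rule, assuming the gradient of E is available as a function grad_E (an illustrative name, not from the slides):

```python
import numpy as np

def gradient_descent_step(w_old, grad_E, eta=0.1):
    """One update: w_new = w_old - eta * grad_E(w_old)."""
    return w_old - eta * grad_E(w_old)

# Illustrative quadratic error E(w) = ||w - 1||^2, so grad_E(w) = 2*(w - 1)
grad_E = lambda w: 2.0 * (w - np.ones_like(w))
w = np.zeros(3)
w = gradient_descent_step(w, grad_E, eta=0.1)
print(w)  # moves from 0 toward the minimum at 1
```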
5. Illustration 1: the landscape of the error function (figure: surface of E(w) over the weight space, axes w0 and w1)
6. Illustration 2: the current state (figure: the current point on the E(w) surface, axes w0 and w1)
7. Illustration 3: the descent direction (figure: on the E(w) surface, the descent direction is the direction of the negative gradient; axes w0 and w1)
8. Illustration 4: one-step learning (figure: the starting point in the weight space and the new point after learning, on the E(w) surface; axes w0 and w1)
9. A formal justification
- Taylor expansion: E(w_new) ≈ E(w_old) + (w_new − w_old)ᵀ ∇E(w_old) (written out below)
- After one-step learning, for η sufficiently small, E(w_new) ≈ E(w_old) − η ‖∇E(w_old)‖² ≤ E(w_old)
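A sketch of the argument in LaTeX, filling in the standard first-order expansion (the exact notation on the slide is not visible):

```latex
E(w_{\text{new}})
  \approx E(w_{\text{old}}) + (w_{\text{new}} - w_{\text{old}})^{\top} \nabla E(w_{\text{old}})
  = E(w_{\text{old}}) - \eta \,\lVert \nabla E(w_{\text{old}}) \rVert^{2}
  \le E(w_{\text{old}}),
```

using w_new = w_old − η ∇E(w_old); so for a sufficiently small learning rate each step does not increase the error.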
10. Gradient Descent: the Algorithm
- Step 0: choose the learning rate η and the initial value w(0)
- Step 1: at learning step t
  - Calculate the gradient ∇E(w) at w(t-1)
    - note: this can be done numerically or analytically
  - Update all the weights: w(t) = w(t-1) − η ∇E(w(t-1))
- Step 2: check whether E(w) has reached a minimum
  - i.e., the change of E(w) after one-step learning is smaller than a tolerable value
  - if so, stop
  - if not, go back to Step 1 (the full loop is sketched below)
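A minimal sketch of the whole loop, assuming a numerical (finite-difference) gradient; the function names and the stopping tolerance are illustrative, not from the slides:

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    """Finite-difference estimate of the gradient of E at w."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        grad[i] = (E(w + dw) - E(w - dw)) / (2 * eps)
    return grad

def gradient_descent(E, w0, eta=0.1, tol=1e-8, max_steps=10000):
    """Step 0: choose eta and w(0); Step 1: update; Step 2: check convergence."""
    w = np.array(w0, dtype=float)
    prev_E = E(w)
    for _ in range(max_steps):
        w = w - eta * numerical_gradient(E, w)   # Step 1: w(t) = w(t-1) - eta * grad E(w(t-1))
        cur_E = E(w)
        if abs(prev_E - cur_E) < tol:            # Step 2: change in E(w) below the tolerance
            break
        prev_E = cur_E
    return w

# Illustrative error: E(w) = (w0 - 2)^2 + (w1 + 1)^2, minimum at (2, -1)
E = lambda w: (w[0] - 2.0) ** 2 + (w[1] + 1.0) ** 2
print(gradient_descent(E, [0.0, 0.0]))
```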
11. Presenting Examples
- Sequential mode: evaluating the examples one by one
- Batch mode: evaluating E(w) based on all the examples (see the sketch below)
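A minimal sketch of the two modes for the sum-of-squares error with a linear model; all names and data are illustrative, not taken from the slides:

```python
import numpy as np

def grad_per_example(w, x, t):
    """Gradient of 1/2 * (w.x - t)^2 with respect to w, for one example."""
    return (np.dot(w, x) - t) * x

def sequential_epoch(w, X, T, eta=0.1):
    """Sequential mode: update w after each example."""
    for x, t in zip(X, T):
        w = w - eta * grad_per_example(w, x, t)
    return w

def batch_epoch(w, X, T, eta=0.1):
    """Batch mode: sum the gradient over all examples, then update once."""
    grad = sum(grad_per_example(w, x, t) for x, t in zip(X, T))
    return w - eta * grad

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
T = np.array([1.0, -1.0, 2.0])
w = np.zeros(2)
print(sequential_epoch(w, X, T))
print(batch_epoch(w, X, T))
```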
12. An example
13. The effect of the learning rate
- The learning rate controls how far to move along the gradient direction
- The effect of η:
  - If η is too small, learning is slow (slow movement)
  - If η is too large, learning is also slow (oscillation)
- The idea of momentum (a common form is sketched below)
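A minimal sketch of a momentum-style update (a common form; the slide does not spell out the exact rule), assuming a gradient function grad_E and an illustrative momentum coefficient mu:

```python
import numpy as np

def momentum_step(w, velocity, grad_E, eta=0.1, mu=0.9):
    """Keep a running velocity so steps along a consistent direction
    build up, while oscillating components partly cancel out."""
    velocity = mu * velocity - eta * grad_E(w)
    return w + velocity, velocity

# Illustrative quadratic error with minimum at w = (2, -1)
grad_E = lambda w: 2.0 * (w - np.array([2.0, -1.0]))
w, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad_E)
print(w)  # converges toward (2, -1)
```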
14. Beyond the standard gradient descent
- Natural gradient (Amari)
- Conjugate gradient
- Newton's method
- An intrinsic deficit of gradient-type methods: local minima
15. Perceptron learning (1)
- Note that
- The error function: the perceptron criterion (a standard form is given below)
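The slide does not show the formula; a standard form of the perceptron criterion, summing over the set M of misclassified patterns x_n with targets t_n in {−1, +1}, is:

```latex
E_{P}(\mathbf{w}) \;=\; -\sum_{n \in \mathcal{M}} \mathbf{w}^{\top}\mathbf{x}_{n}\, t_{n},
```

which is non-negative and equals zero exactly when every pattern is correctly classified.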
16. Perceptron Learning (2)
- Pattern-by-pattern gradient descent rule: for a misclassified pattern x_n, w(t) = w(t-1) + η x_n t_n
- η = 1 can be used.
- The algorithm is guaranteed to converge in a finite number of steps, provided the data set is linearly separable (see the sketch below).
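A minimal sketch of the pattern-by-pattern perceptron rule with η = 1, assuming targets in {−1, +1} and a bias feature folded into x (an illustrative setup, not taken from the slides):

```python
import numpy as np

def perceptron_train(X, T, eta=1.0, max_epochs=100):
    """Cycle through the patterns; on each misclassified x_n, add eta * t_n * x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, T):
            if t * np.dot(w, x) <= 0:      # misclassified (or on the boundary)
                w = w + eta * t * x
                mistakes += 1
        if mistakes == 0:                  # all patterns correctly classified
            break
    return w

# Illustrative linearly separable data (first feature is the bias)
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -0.5]])
T = np.array([1, 1, -1, -1])
print(perceptron_train(X, T))
```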
17. Initial step: x(3) is misclassified (figure: patterns x(1)–x(4) and the initial weight vector w(0) in the input space)
18. W(1) (figure: w(0), the correction −x(3), and the resulting w(1), with patterns x(1)–x(4)). First step: x(3) is now correctly classified, but x(2) becomes misclassified.
19. W(2) (figure: the second step, using x(2), giving w(2) from w(1), with patterns x(1)–x(4)). All patterns are correctly classified.