Title: Network Training: The Gradient Descent Method
1. Network Training: The Gradient Descent Method
2. Learning as an Optimization Problem
- Minimizing the error function by optimizing the parameters
  - e.g., the sum-of-squares error E(w) = (1/2) Σ_i (y(x_i; w) − t_i)², where i is the index of the training example (see the sketch below)
- All examples: training set + testing set
- The performance of the learning algorithm
  - The convergence
  - The speed
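A minimal sketch of the sum-of-squares error, assuming a generic model function y(x, w); the function and variable names here are illustrative, not taken from the slides:

```python
import numpy as np

def sum_of_squares_error(w, X, t, model):
    """E(w) = 1/2 * sum_i (model(x_i, w) - t_i)^2 over the training set."""
    predictions = np.array([model(x, w) for x in X])
    return 0.5 * np.sum((predictions - t) ** 2)

# Illustrative linear model y(x, w) = w . x
linear = lambda x, w: np.dot(w, x)
X = np.array([[1.0, 0.5], [1.0, -1.0]])   # inputs (with a bias feature)
t = np.array([1.0, -1.0])                  # targets
w = np.zeros(2)
print(sum_of_squares_error(w, X, t, linear))
```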
3. A standard strategy: Gradient Descent (1)
- How to update (optimize) w based on E(w)?
- The gradient of the error function: ∇E(w) = (∂E/∂w_0, ∂E/∂w_1, ...)
- An important observation: at a point w, the gradient ∇E(w) points in the direction of steepest increase of E, so the negative gradient −∇E(w) gives a descent direction
4. Gradient Descent (2)
- The learning rule: w_new = w_old − η ∇E(w_old), i.e., w(t) = w(t-1) − η ∇E(w(t-1)) (see the sketch below)
- where η is the learning rate (a small positive number) and t indexes the learning steps
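A minimal sketch of one learning step under this rule, assuming the gradient of E is available as a function grad_E (an illustrative name, not from the slides):

```python
import numpy as np

def gradient_descent_step(w_old, grad_E, eta=0.1):
    """One update: w_new = w_old - eta * grad_E(w_old)."""
    return w_old - eta * grad_E(w_old)

# Illustrative quadratic error E(w) = ||w - 1||^2, so grad_E(w) = 2*(w - 1)
grad_E = lambda w: 2.0 * (w - np.ones_like(w))
w = np.zeros(3)
w = gradient_descent_step(w, grad_E, eta=0.1)
print(w)  # moves from 0 toward the minimum at 1
```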
5. Illustration 1: the landscape of the error function (figure: surface of E(w) over the weight space, axes w0 and w1)
6. Illustration 2: the current state (figure: the current point on the E(w) surface, axes w0 and w1)
7. Illustration 3: the descent direction (figure: on the E(w) surface, the descent direction is the direction of the negative gradient; axes w0 and w1)
8. Illustration 4: one-step learning (figure: the starting point in the weight space and the new point after learning, on the E(w) surface; axes w0 and w1)
9. A formal justification
- Taylor expansion: E(w_new) ≈ E(w_old) + (w_new − w_old)ᵀ ∇E(w_old) (written out below)
- After one-step learning, for η sufficiently small, E(w_new) ≈ E(w_old) − η ‖∇E(w_old)‖² ≤ E(w_old)
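A sketch of the argument in LaTeX, filling in the standard first-order expansion (the exact notation on the slide is not visible):

```latex
E(w_{\text{new}})
  \approx E(w_{\text{old}}) + (w_{\text{new}} - w_{\text{old}})^{\top} \nabla E(w_{\text{old}})
  = E(w_{\text{old}}) - \eta \,\lVert \nabla E(w_{\text{old}}) \rVert^{2}
  \le E(w_{\text{old}}),
```

using w_new = w_old − η ∇E(w_old); so for a sufficiently small learning rate each step does not increase the error.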
10. Gradient Descent: the Algorithm
- Step 0: choose the learning rate η and the initial value w(0)
- Step 1: at learning step t
  - Calculate the gradient ∇E(w) at w(t-1)
    - note: this can be done numerically or analytically
  - Update all the weights: w(t) = w(t-1) − η ∇E(w(t-1))
- Step 2: check whether E(w) has reached a minimum
  - i.e., the change of E(w) after one-step learning is smaller than a tolerable value
  - if so, stop
  - if not, go back to Step 1 (the full loop is sketched below)
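A minimal sketch of the whole loop, assuming a numerical (finite-difference) gradient; the function names and the stopping tolerance are illustrative, not from the slides:

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    """Finite-difference estimate of the gradient of E at w."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        grad[i] = (E(w + dw) - E(w - dw)) / (2 * eps)
    return grad

def gradient_descent(E, w0, eta=0.1, tol=1e-8, max_steps=10000):
    """Step 0: choose eta and w(0); Step 1: update; Step 2: check convergence."""
    w = np.array(w0, dtype=float)
    prev_E = E(w)
    for _ in range(max_steps):
        w = w - eta * numerical_gradient(E, w)   # Step 1: w(t) = w(t-1) - eta * grad E(w(t-1))
        cur_E = E(w)
        if abs(prev_E - cur_E) < tol:            # Step 2: change in E(w) below the tolerance
            break
        prev_E = cur_E
    return w

# Illustrative error: E(w) = (w0 - 2)^2 + (w1 + 1)^2, minimum at (2, -1)
E = lambda w: (w[0] - 2.0) ** 2 + (w[1] + 1.0) ** 2
print(gradient_descent(E, [0.0, 0.0]))
```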
11. Presenting Examples
- Sequential mode: evaluating the examples one by one
- Batch mode: evaluating E(w) based on all the examples (see the sketch below)
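A minimal sketch of the two modes for the sum-of-squares error with a linear model; all names and data are illustrative, not taken from the slides:

```python
import numpy as np

def grad_per_example(w, x, t):
    """Gradient of 1/2 * (w.x - t)^2 with respect to w, for one example."""
    return (np.dot(w, x) - t) * x

def sequential_epoch(w, X, T, eta=0.1):
    """Sequential mode: update w after each example."""
    for x, t in zip(X, T):
        w = w - eta * grad_per_example(w, x, t)
    return w

def batch_epoch(w, X, T, eta=0.1):
    """Batch mode: sum the gradient over all examples, then update once."""
    grad = sum(grad_per_example(w, x, t) for x, t in zip(X, T))
    return w - eta * grad

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
T = np.array([1.0, -1.0, 2.0])
w = np.zeros(2)
print(sequential_epoch(w, X, T))
print(batch_epoch(w, X, T))
```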
12. An example
13. The effect of the learning rate
- The learning rate controls how far to move along the gradient direction
- The effect of η:
  - If η is too small, learning is slow (slow movement)
  - If η is too large, learning is also slow (oscillation)
- The idea of momentum (a common form is sketched below)
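A minimal sketch of a momentum-style update (a common form; the slide does not spell out the exact rule), assuming a gradient function grad_E and an illustrative momentum coefficient mu:

```python
import numpy as np

def momentum_step(w, velocity, grad_E, eta=0.1, mu=0.9):
    """Keep a running velocity so steps along a consistent direction
    build up, while oscillating components partly cancel out."""
    velocity = mu * velocity - eta * grad_E(w)
    return w + velocity, velocity

# Illustrative quadratic error with minimum at w = (2, -1)
grad_E = lambda w: 2.0 * (w - np.array([2.0, -1.0]))
w, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad_E)
print(w)  # converges toward (2, -1)
```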
14. Beyond the standard gradient descent
- Natural gradient (Amari)
- Conjugate gradient
- Newton's method
- An intrinsic deficit of gradient-type methods: local minima
15. Perceptron learning (1)
- Note that
- The error function: the perceptron criterion (a standard form is given below)
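The slide does not show the formula; a standard form of the perceptron criterion, summing over the set M of misclassified patterns x_n with targets t_n in {−1, +1}, is:

```latex
E_{P}(\mathbf{w}) \;=\; -\sum_{n \in \mathcal{M}} \mathbf{w}^{\top}\mathbf{x}_{n}\, t_{n},
```

which is non-negative and equals zero exactly when every pattern is correctly classified.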
16. Perceptron Learning (2)
- Pattern-by-pattern gradient descent rule: for a misclassified pattern x_n, w(t) = w(t-1) + η x_n t_n
- η = 1 can be used.
- The algorithm is guaranteed to converge in a finite number of steps, provided the data set is linearly separable (see the sketch below).
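A minimal sketch of the pattern-by-pattern perceptron rule with η = 1, assuming targets in {−1, +1} and a bias feature folded into x (an illustrative setup, not taken from the slides):

```python
import numpy as np

def perceptron_train(X, T, eta=1.0, max_epochs=100):
    """Cycle through the patterns; on each misclassified x_n, add eta * t_n * x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, T):
            if t * np.dot(w, x) <= 0:      # misclassified (or on the boundary)
                w = w + eta * t * x
                mistakes += 1
        if mistakes == 0:                  # all patterns correctly classified
            break
    return w

# Illustrative linearly separable data (first feature is the bias)
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -0.5]])
T = np.array([1, 1, -1, -1])
print(perceptron_train(X, T))
```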
17. Initial step: x(3) is misclassified (figure: patterns x(1)–x(4) and the initial weight vector w(0) in the input space)
18. W(1) (figure: w(0), the correction −x(3), and the resulting w(1), with patterns x(1)–x(4)). First step: x(3) is now correctly classified, but x(2) becomes misclassified.
19. W(2) (figure: the second step, using x(2), giving w(2) from w(1), with patterns x(1)–x(4)). All patterns are correctly classified.