Title: CS515 Neural Networks
1. Performance Optimization
- Steepest Descent
2. Basic Optimization Algorithm
x_{k+1} = x_k + \alpha_k p_k,  or  \Delta x_k = (x_{k+1} - x_k) = \alpha_k p_k
- p_k: search direction
- \alpha_k: learning rate
3. Steepest Descent
Choose the next step so that the function decreases: F(x_{k+1}) < F(x_k).
4. Steepest Descent
For small changes in x we can approximate F(x) with a first-order Taylor series expansion:
F(x_{k+1}) = F(x_k + \Delta x_k) \approx F(x_k) + g_k^T \Delta x_k,
where g_k \equiv \nabla F(x)|_{x = x_k}.
5. Steepest Descent
If we want the function to decrease, we need
g_k^T \Delta x_k = \alpha_k g_k^T p_k < 0.
6. Steepest Descent
We can maximize the decrease by choosing the steepest descent direction
p_k = -g_k, which gives the update x_{k+1} = x_k - \alpha_k g_k.
7. Steepest Descent
Two general methods to select \alpha_k:
- minimize F(x) w.r.t. \alpha_k (line search along p_k)
- use a predetermined value (e.g., 0.2, or a decreasing schedule such as 1/k)
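The following is a minimal sketch of steepest descent with a fixed, predetermined learning rate. The quadratic F(x), its Hessian, the learning rate 0.2, and the initial guess are assumed illustration values, not the slide's example.

import numpy as np

# Steepest descent with a fixed learning rate on an assumed quadratic
# F(x) = 0.5 * x^T A x + d^T x (not one of the slide examples).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
d = np.array([-1.0, -1.0])

def F(x):
    return 0.5 * x @ A @ x + d @ x

def grad_F(x):
    return A @ x + d

alpha = 0.2                     # predetermined learning rate
x = np.array([1.0, 1.5])        # initial guess x_0
for k in range(50):
    g = grad_F(x)               # g_k: gradient at x_k
    x = x - alpha * g           # x_{k+1} = x_k - alpha * g_k
print(x, F(x))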
8. Example
9. Plot
10. Stable Learning Rates (Quadratic)
For a quadratic function F(x) = (1/2) x^T A x + d^T x + c, the steepest descent update with a fixed learning rate is
x_{k+1} = x_k - \alpha (A x_k + d) = (I - \alpha A) x_k - \alpha d.
Stability is determined by the eigenvalues of this matrix, I - \alpha A.
(\lambda_i: eigenvalue of A)
The eigenvalues of I - \alpha A are (1 - \alpha \lambda_i).
11. Stable Learning Rates (Quadratic)
Stability requirement: |1 - \alpha \lambda_i| < 1 for all i. For positive real eigenvalues this gives \alpha < 2/\lambda_i for all i, i.e., \alpha < 2/\lambda_{max}.
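As a quick check of this bound, the following sketch computes the eigenvalues of an assumed Hessian A (not one of the slide's examples) and the corresponding stability limit.

import numpy as np

# Stable learning-rate check for a quadratic F(x) = 0.5 x^T A x + d^T x + c.
# The Hessian A below is an assumed example.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam = np.linalg.eigvalsh(A)          # eigenvalues of the Hessian
alpha_max = 2.0 / lam.max()          # stability bound: alpha < 2 / lambda_max
print("eigenvalues:", lam, "-> need alpha <", alpha_max)

# Eigenvalues of (I - alpha*A) for a candidate alpha; all must lie in (-1, 1).
alpha = 0.5
print("eig(I - alpha*A):", np.linalg.eigvalsh(np.eye(2) - alpha * A))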
12. Example
13. CHAPTER 10
14. Objectives
- Widrow-Hoff learning is an approximate steepest descent algorithm in which the performance index is the mean square error.
- It is widely used today in many signal-processing applications.
- It is a precursor to the backpropagation algorithm for multilayer networks.
15. ADALINE Network
- The ADALINE (ADAptive LInear NEuron) network and its learning rule, the LMS (Least Mean Square) algorithm, were proposed by Bernard Widrow and Marcian Hoff in 1960.
- Both the ADALINE network and the perceptron suffer from the same inherent limitation: they can only solve linearly separable problems.
- The LMS algorithm minimizes the mean square error (MSE), and therefore tries to move the decision boundaries as far from the training patterns as possible.
16. ADALINE Network
The ADALINE has a linear transfer function: a = purelin(Wp + b) = Wp + b.
17. Single ADALINE
- Set n = 0; then Wp + b = 0 specifies a decision boundary.
- The ADALINE can be used to classify objects into two categories if they are linearly separable.
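A tiny sketch of a single ADALINE used as a two-category classifier; the weights, bias, and test points below are assumed illustration values, not taken from the slide.

import numpy as np

# Single-ADALINE classification sketch: the decision boundary is W p + b = 0.
# Weights, bias, and test points are assumed illustration values.
W = np.array([1.0, 1.0])
b = -0.5

def classify(p):
    n = W @ p + b                 # net input; n = 0 is the decision boundary
    return 1 if n >= 0 else -1    # category 1 on one side, category 2 on the other

print(classify(np.array([1.0, 1.0])))    #  1 (one side of the boundary)
print(classify(np.array([-1.0, -1.0])))  # -1 (the other side)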
18. Mean Square Error
- The LMS algorithm is an example of supervised training.
- The LMS algorithm adjusts the weights and biases of the ADALINE in order to minimize the mean square error, where the error is the difference between the target output t_q and the network output a_q (the response to input p_q).
MSE: F(x) = E[e^2] = E[(t - a)^2], where E denotes the expected value.
19. Performance Optimization
- Develop algorithms to optimize a performance index F(x), where the word "optimize" means to find the value of x that minimizes F(x).
- The optimization algorithms are iterative:
x_{k+1} = x_k + \alpha_k p_k, or \Delta x_k = (x_{k+1} - x_k) = \alpha_k p_k,
where p_k is a search direction, \alpha_k is a positive learning rate that determines the length of the step, and x_0 is an initial guess.
20. Taylor Series Expansion
- Taylor series (scalar case): F(x) = F(x*) + F'(x*)(x - x*) + (1/2) F''(x*)(x - x*)^2 + ...
- Vector case: F(x) = F(x*) + \nabla F(x*)^T (x - x*) + (1/2)(x - x*)^T \nabla^2 F(x*) (x - x*) + ...
21. Gradient & Hessian
Gradient: \nabla F(x) = [\partial F/\partial x_1, \partial F/\partial x_2, ..., \partial F/\partial x_n]^T.
Hessian: [\nabla^2 F(x)]_{ij} = \partial^2 F / \partial x_i \partial x_j.
22. Directional Derivative
- The i-th element of the gradient, \partial F(x)/\partial x_i, is the first derivative of the performance index F along the x_i axis.
- Let p be a vector in the direction along which we wish to know the derivative. The directional derivative is (p^T \nabla F(x)) / ||p||.
- Example: find the derivative of F(x) at a specified point in a specified direction p (see the sketch below).
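A small numeric sketch of the directional derivative formula; the function F, the evaluation point, and the direction below are assumed illustration values, not the slide's example.

import numpy as np

# Directional derivative sketch: (p^T grad F(x)) / ||p||.
# F, the evaluation point, and the direction are assumed values.
def grad_F(x):
    # gradient of F(x) = x1^2 + 2*x2^2
    return np.array([2.0 * x[0], 4.0 * x[1]])

x = np.array([0.5, 0.5])
p = np.array([2.0, -1.0])

dir_deriv = p @ grad_F(x) / np.linalg.norm(p)
print(dir_deriv)   # first derivative of F along direction p at x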
23. Approximation-Based Formulation
- Given input/output training data {p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}, the objective of network training is to find the optimal weights that minimize the (least-squares) error between the target values and the actual network responses.
- Model (network) function: a_q = f(x, p_q), where x is the weight vector.
- Least-squares-error function: E(x) = \sum_{q=1}^{Q} (t_q - a_q)^2.
- The weight vector x can be trained by minimizing the error function along the gradient-descent direction: x_{k+1} = x_k - \alpha \nabla E(x_k).
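A minimal sketch of this gradient-descent training on the summed least-squares error, for a linear model a = w^T p + b. The training data and learning rate are assumed illustration values, not the slide's example.

import numpy as np

# Batch gradient descent on E(x) = sum_q (t_q - a_q)^2 for a linear
# model a = w^T p + b. Data and learning rate are assumed values.
P = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [-1.0, 1.0]])        # one input pattern p_q per row
T = np.array([1.0, -1.0, 0.0])     # targets t_q

w = np.zeros(2)
b = 0.0
alpha = 0.05
for k in range(200):
    a = P @ w + b                  # network responses a_q
    e = T - a                      # errors t_q - a_q
    grad_w = -2 * P.T @ e          # dE/dw
    grad_b = -2 * e.sum()          # dE/db
    w, b = w - alpha * grad_w, b - alpha * grad_b
print(w, b)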
24. Delta Learning Rule
- ADALINE: a = w^T p + b.
- Least-squares-error criterion: minimize E(x) = \sum_q (t_q - a_q)^2.
- Gradient: \nabla_w E = -2 \sum_q (t_q - a_q) p_q.
- Delta learning rule: w_{new} = w_{old} + \alpha (t_q - a_q) p_q, applied as each training pair is presented.
25. Mean Square Error
Let x = [w; b] and z = [p; 1], so that the network output is a = x^T z. Then
F(x) = E[e^2] = E[(t - x^T z)^2] = E[t^2] - 2 x^T E[t z] + x^T E[z z^T] x = c - 2 x^T h + x^T R x,
where c = E[t^2], h = E[t z] is the cross-correlation vector, and R = E[z z^T] is the input correlation matrix.
26. Mean Square Error
- If the correlation matrix R is positive definite, there will be a unique stationary point x* = R^{-1} h, which will be a strong minimum.
- Strong minimum: the point x* is a strong minimum of F(x) if a scalar \delta > 0 exists such that F(x*) < F(x* + \Delta x) for all \Delta x such that \delta > ||\Delta x|| > 0.
- Global minimum: the point x* is a unique global minimum of F(x) if F(x*) < F(x* + \Delta x) for all \Delta x \neq 0.
- Weak minimum: the point x* is a weak minimum of F(x) if it is not a strong minimum, and a scalar \delta > 0 exists such that F(x*) \leq F(x* + \Delta x) for all \Delta x such that \delta > ||\Delta x|| > 0.
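A short sketch that checks R for positive definiteness and computes the stationary point x* = R^{-1} h; the R and h values below are assumed for illustration, not the slide's example.

import numpy as np

# Strong minimum x* = R^{-1} h of the quadratic MSE
# F(x) = c - 2 x^T h + x^T R x. R and h are assumed values.
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])
h = np.array([0.5, 1.0])

eigvals = np.linalg.eigvalsh(R)
assert np.all(eigvals > 0), "R must be positive definite for a unique strong minimum"

x_star = np.linalg.solve(R, h)    # x* = R^{-1} h
print("eigenvalues of R:", eigvals)
print("minimum point x*:", x_star)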
27. LMS Algorithm
- The LMS algorithm locates the minimum point of the mean square error.
- It uses an approximate steepest descent algorithm, estimating the gradient from a single sample at each iteration.
- Estimate the mean square error F(x) by the squared error at iteration k: \hat{F}(x) = (t(k) - a(k))^2 = e^2(k).
- Estimated gradient: \hat{\nabla} F(x) = \nabla e^2(k).
28. LMS Algorithm
The elements of the gradient estimate are [\nabla e^2(k)]_j = 2 e(k) \partial e(k)/\partial w_{1,j} for j = 1, ..., R, and [\nabla e^2(k)]_{R+1} = 2 e(k) \partial e(k)/\partial b.
Since e(k) = t(k) - (w^T p(k) + b), we have \partial e(k)/\partial w_{1,j} = -p_j(k) and \partial e(k)/\partial b = -1.
Therefore \hat{\nabla} F(x) = \nabla e^2(k) = -2 e(k) z(k), where z(k) = [p(k); 1].
29. LMS Algorithm
Substituting the gradient estimate into the steepest descent update gives x_{k+1} = x_k - \alpha \nabla e^2(k) = x_k + 2\alpha e(k) z(k), or, in weight/bias form,
W(k+1) = W(k) + 2\alpha e(k) p^T(k),  b(k+1) = b(k) + 2\alpha e(k).
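The following sketch applies the LMS update above to a single ADALINE trained sample by sample. The training pairs and the learning rate are assumed illustration values, not the slide's data.

import numpy as np

# Minimal LMS (Widrow-Hoff) sketch for a single ADALINE: a = W p + b.
# The training pairs and learning rate are assumed illustration values.
P = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]   # inputs p_q
T = [1.0, -1.0]                                      # targets t_q
alpha = 0.1

W = np.zeros(2)
b = 0.0
for epoch in range(20):
    for p, t in zip(P, T):
        a = W @ p + b                 # network output a(k)
        e = t - a                     # error e(k) = t(k) - a(k)
        W = W + 2 * alpha * e * p     # W(k+1) = W(k) + 2*alpha*e(k)*p(k)^T
        b = b + 2 * alpha * e         # b(k+1) = b(k) + 2*alpha*e(k)
print(W, b)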
30. Quadratic Functions
General quadratic function: F(x) = (1/2) x^T A x + d^T x + c (A is the Hessian matrix).
Gradient: \nabla F(x) = A x + d.  Hessian: \nabla^2 F(x) = A.
For the mean square error F(x) = c - 2 x^T h + x^T R x, the gradient is \nabla F(x) = 2 R x - 2 h and the Hessian is 2R.
31. Stable Learning Rates
Taking the expectation of the LMS update gives E[x_{k+1}] = (I - 2\alpha R) E[x_k] + 2\alpha h, so stability is determined by the eigenvalues of (I - 2\alpha R).
32. Stable Learning Rates
Stability requires |1 - 2\alpha \lambda_i| < 1 for every eigenvalue \lambda_i of R, i.e., 0 < \alpha < 1/\lambda_{max}.
33. Analysis of Convergence
If the learning rate is stable, the steady-state solution satisfies E[x_{ss}] = (I - 2\alpha R) E[x_{ss}] + 2\alpha h, which gives E[x_{ss}] = R^{-1} h = x*.
Thus the LMS solution converges, in the mean, to the minimum mean square error solution.
34. Orange/Apple Example
In practical applications it may not be practical to calculate R (and therefore the stable learning rate), so \alpha is usually selected by trial and error.
35. Orange/Apple Example
Start, arbitrarily, with all the weights set to zero, and then apply the inputs p_1, p_2, p_1, p_2, etc., in that order, calculating the new weights after each input is presented.
36. Orange/Apple Example
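A sketch of this sequence of updates in code. The orange/apple prototype patterns, targets, and learning rate used below are assumptions for illustration and may differ from the numbers on the slide.

import numpy as np

# LMS iteration sketch for the orange/apple example.
# Patterns, targets, and alpha are assumed illustration values.
p1, t1 = np.array([1.0, -1.0, -1.0]), -1.0   # assumed orange pattern/target
p2, t2 = np.array([1.0,  1.0, -1.0]),  1.0   # assumed apple pattern/target
alpha = 0.2

W = np.zeros(3)                  # start with all weights set to zero
for k, (p, t) in enumerate([(p1, t1), (p2, t2), (p1, t1), (p2, t2)]):
    a = W @ p                    # no bias in this sketch
    e = t - a
    W = W + 2 * alpha * e * p    # LMS update after each presentation
    print(f"iteration {k+1}: e = {e:+.3f}, W = {W}")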
37. Solved Problem P10.2
Since they are linearly separable, we can design an ADALINE network to make such a distinction.
As shown in the figure, they are NOT linearly separable, so an ADALINE network CANNOT distinguish between them.
38. Solved Problem P10.3
39. Solved Problem P10.3
The Hessian matrix of F(x), 2R, has both eigenvalues at 2, so the contours of the performance surface are circular. The center of the contours (the minimum point) is x* = R^{-1} h.
40. Solved Problem P10.4
41. Tapped Delay Line
At the output of the tapped delay line we have an R-dimensional vector, consisting of the input signal at the current time and at delays of 1 to R-1 time steps.
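A small sketch of a tapped delay line; the input signal and the value of R below are assumed illustration values.

import numpy as np

# Tapped-delay-line sketch: turn a scalar signal y(k) into the R-dim.
# vector [y(k), y(k-1), ..., y(k-R+1)]. The signal is an assumed example.
def tapped_delay_line(y, R):
    """Return the delay-line output vector at each time step k."""
    buf = np.zeros(R)                # delay buffer, initially zero
    outputs = []
    for sample in y:
        buf = np.roll(buf, 1)        # shift older samples down the line
        buf[0] = sample              # current input y(k) at the first tap
        outputs.append(buf.copy())
    return np.array(outputs)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(tapped_delay_line(y, R=3))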
42. Adaptive Filter
An adaptive filter is an ADALINE whose input vector comes from a tapped delay line; its output is a(k) = \sum_{i=1}^{R} w_{1,i} y(k - i + 1) + b.
43. Solved Problem P10.1
44. Solved Problem P10.1
45. Solved Problem P10.1
46. Solved Problem P10.6
Application of ADALINE: an adaptive predictor. The purpose of this filter is to predict the next value of the input signal from the two previous values. Suppose that the input signal is a stationary random process with a given autocorrelation function C_y(n) = E[y(k) y(k + n)].
47. Solved Problem P10.6
48. Solved Problem P10.6
49. Solved Problem P10.6
ii. The maximum stable value of the learning rate for the LMS algorithm is set by the largest eigenvalue of R: \alpha < 1/\lambda_{max}.
iii. The LMS algorithm is approximate steepest descent, so the trajectory for small learning rates will move perpendicular to the contour lines.
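The following sketch combines a two-tap delay line with the LMS update to form the kind of adaptive predictor described here. The input signal (a noisy sinusoid) and the learning rate are assumptions for illustration; they are not the random process specified in P10.6.

import numpy as np

# Two-tap LMS adaptive predictor: predict y(k) from y(k-1) and y(k-2).
# The sinusoidal input and alpha are assumed illustration values.
np.random.seed(0)
y = np.sin(0.3 * np.arange(200)) + 0.1 * np.random.randn(200)

alpha = 0.02
w = np.zeros(2)
errors = []
for k in range(2, len(y)):
    z = np.array([y[k - 1], y[k - 2]])   # two previous values from the delay line
    a = w @ z                            # predicted value of y(k)
    e = y[k] - a                         # prediction error
    w = w + 2 * alpha * e * z            # LMS update
    errors.append(e)

print("final weights:", w)
print("mean squared prediction error (last 50 steps):", np.mean(np.array(errors[-50:])**2))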
50. Applications
- Noise cancellation system to remove 60-Hz noise from an EEG signal (Fig. 10.6)
- Echo cancellation system in long-distance telephone lines (Fig. 10.10)
- Filtering engine noise from a pilot's voice signal (Fig. P10.8)