Title: EE645 Neural Networks and Learning Theory
1 EE645 Neural Networks and Learning Theory
Spring 2003
Prof. Anthony Kuh, Dept. of Elec. Eng., University of Hawaii
Phone: (808)-956-7527, Fax: (808)-956-3427
Email: kuh_at_spectra.eng.hawaii.edu
2 I. Introduction to neural networks
- Goal: study computational capabilities of neural network and learning systems.
- Multidisciplinary field
- Algorithms, Analysis, Applications
3 A. Motivation
- Why study neural networks and machine learning?
- Biological inspiration (natural computation)
- Nonparametric models: adaptive learning systems, learning from examples, analysis of learning models
- Implementation
- Applications
- Cognitive (Human vs. Computer Intelligence)
- Humans superior to computers in pattern recognition, associative recall, learning complex tasks.
- Computers superior to humans in arithmetic computations, simple repeatable tasks.
- Biological (study human brain)
- $10^{10}$ to $10^{11}$ neurons in cerebral cortex, with an average of $10^{3}$ interconnections per neuron.
4 A neuron
Schematic of one neuron
5 Neural Network
- Connection of many neurons together forms a neural network.
- Neural network properties
- Highly parallel (distributed computing)
- Robust and fault tolerant
- Flexible (short and long term learning)
- Handles variety of information (often random, fuzzy, and inconsistent)
- Small, compact, dissipates very little power
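6 B. Single Neuron
(Computational node)
[Figure: inputs x are weighted by w and summed with threshold w0 to give s, which passes through g( ) to give output y]
- $s = w^T x + w_0$: synaptic strength (linearly weighted sum of inputs).
- $y = g(s)$: activation or squashing function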
7 Activation functions
- Linear units: $g(s) = s$.
- Linear threshold units: $g(s) = \mathrm{sgn}(s)$.
- Sigmoidal units: $g(s) = \tanh(Bs)$, $B > 0$.
- Neural networks generally have nonlinear activation functions.
Most popular models: linear threshold units and sigmoidal units.
Other types of computational units: receptive units (radial basis functions).
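The single neuron and the three activation functions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names are hypothetical, and returning +1 at s = 0 is an assumed convention, since the slides leave sgn(0) undefined.

```python
import math

def linear(s):
    """Linear unit: g(s) = s."""
    return s

def linear_threshold(s):
    """Linear threshold unit: g(s) = sgn(s).
    Assumption: sgn(0) is taken as +1 (the slides do not define it)."""
    return 1.0 if s >= 0 else -1.0

def sigmoidal(s, B=1.0):
    """Sigmoidal unit: g(s) = tanh(B s) with B > 0."""
    if B <= 0:
        raise ValueError("B must be positive")
    return math.tanh(B * s)

def neuron_output(w, x, w0, g):
    """Single neuron: s = w^T x + w0, then y = g(s)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return g(s)

# Same weighted sum, three different activation functions.
w, x, w0 = [1.0, -2.0], [3.0, 1.0], 0.5
y_linear = neuron_output(w, x, w0, linear)
y_threshold = neuron_output(w, x, w0, linear_threshold)
y_sigmoid = neuron_output(w, x, w0, sigmoidal)
```

Passing the activation as a function argument keeps the weighted sum separate from the squashing step, which mirrors the two-stage picture of the computational node.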
8 C. Neural Network Architectures
Systems composed of interconnected neurons
[Figure: network of interconnected neurons mapping inputs to output]
Neural network represented by directed graph: edges represent weights and nodes represent computational units.
9 Definitions
- Feedforward neural network has no loops in directed graph.
- Neural networks are often arranged in layers.
- Single layer feedforward neural network has one layer of computational nodes.
- Multilayer feedforward neural network has two or more layers of computational nodes.
- Computational nodes that are not output nodes are called hidden units.
10 D. Learning and Information Storage
- Neural networks have computational capabilities.
- Where is information stored in a neural network?
- What are parameters of neural network?
- How does a neural network work? (two phases)
- Training or learning phase (equivalent to write phase in conventional computer memory): weights are adjusted to meet certain desired criterion.
- Recall or test phase (equivalent to read phase in conventional computer memory): weights are fixed as neural network realizes some task.
11 Learning and Information (continued)
- 3) What can neural network models learn?
- Boolean functions
- Pattern recognition problems
- Function approximation
- Dynamical systems
- 4) What type of learning algorithms are there?
- Supervised learning (learning with a teacher)
- Unsupervised learning (no teacher)
- Reinforcement learning (learning with a critic)
12 Learning and Information (continued)
- 5) How do neural networks learn?
- Iterative algorithm: weights of neural network are adjusted on-line as training data is received.
- $w(k+1) = L(w(k), x(k), d(k))$ for supervised learning, where $d(k)$ is desired output.
- Need cost criterion; common cost criterion:
- Mean Squared Error for one output: $J(w) = \sum_k (y(k) - d(k))^2$
- Goal is to find minimum $J(w)$ over all possible $w$. Iterative techniques often use gradient descent approaches.
13 Learning and Information (continued)
- 6) Learning and Generalization
- Learning algorithm takes training examples as inputs and produces concept, pattern or function to be learned.
- How good is learning algorithm? Generalization ability measures how well learning algorithm performs.
- Sufficient number of training examples. (LLN, typical sequences)
- Occam's razor: simplest explanation is the best.
Regression problem
14 Learning and Information (continued)
- Generalization error
- $\epsilon_g = \epsilon_{emp} + \epsilon_{model}$
- Empirical error: average error from training data (desired output vs. actual output)
- Model error: due to dimensionality of class of functions or patterns
- Desire class to be large enough so that empirical error is small and small enough so that model error is small.
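15 II. Linear threshold units
A. Preliminaries
[Figure: inputs x are weighted by w and summed with threshold w0 to give s, which passes through sgn( ) to give output y]
$\mathrm{sgn}(s) = 1$ if $s > 0$, $-1$ if $s < 0$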
16 Linearly separable
Consider a set of points with two labels, + and o.
Set of points is linearly separable if a linear threshold function can partition the + points from the o points.
[Figure: set of linearly separable points]
17 Not linearly separable
A set of labeled points that cannot be partitioned by a linear threshold function is not linearly separable.
[Figure: set of points that are not linearly separable]
18 B. Perceptron Learning Algorithm
- An iterative learning algorithm that can find linear threshold function to partition two sets of points.
1) $w(0)$ arbitrary
2) Pick point $(x(k), d(k))$.
3) If $w(k)^T x(k) d(k) > 0$ go to 5)
4) $w(k+1) = w(k) + x(k) d(k)$
5) $k = k+1$, check if cycled through data, if not go to 2)
6) Otherwise stop.
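The steps of the perceptron learning algorithm can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold is absorbed by appending a constant input of 1, w(0) is set to zero, the stopping test is "one full pass through the data with no update", and `max_epochs` is a hypothetical safeguard for data that is not linearly separable.

```python
def perceptron_train(points, labels, max_epochs=1000):
    """Perceptron learning algorithm for labels d in {+1, -1}.
    The threshold weight is the last entry of w (constant input 1)."""
    n = len(points[0])
    w = [0.0] * (n + 1)                      # w(0): zero vector (arbitrary choice)
    for _ in range(max_epochs):
        updated = False
        for x, d in zip(points, labels):     # pick point (x(k), d(k))
            xa = list(x) + [1.0]             # augmented input
            s = sum(wi * xi for wi, xi in zip(w, xa))
            if s * d <= 0:                   # misclassified (or on the boundary)
                w = [wi + d * xi for wi, xi in zip(w, xa)]   # w <- w + x d
                updated = True
        if not updated:                      # cycled through data with no update
            return w
    raise RuntimeError("no separating weights found within max_epochs")

# Example: the Boolean AND function is linearly separable.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [-1, -1, -1, 1]
w = perceptron_train(points, labels)
```

On data that is not linearly separable (for example XOR) the loop never completes a pass without an update, which is why the sketch caps the number of passes.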
19 PLA comments
- Perceptron convergence theorem (requires margins)
- Sketch of proof
- Updating threshold weights
- Algorithm is based on cost function
- $J(w) = -$(sum of synaptic strengths of misclassified points)
- $w(k+1) = w(k) - \mu(k) \nabla J(w(k))$ (gradient descent)
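20 Perceptron Convergence Theorem
- Assumptions: $w^*$ is a solution with $\|w^*\| = 1$, no threshold, and $w(0) = 0$. Let $\max_k \|x(k)\| = \beta$ and $\min_k y(k) x(k)^T w^* = \gamma$, where $y(k)$ is the label of $x(k)$ and $k$ counts updates.
- $\langle w(k), w^* \rangle = \langle w(k-1) + x(k-1) y(k-1), w^* \rangle \ge \langle w(k-1), w^* \rangle + \gamma \ge k\gamma$.
- $\|w(k)\|^2 \le \|w(k-1)\|^2 + \|x(k-1)\|^2 \le \|w(k-1)\|^2 + \beta^2 \le k\beta^2$.
- By the Cauchy-Schwarz inequality, $k\gamma \le \langle w(k), w^* \rangle \le \|w(k)\| \le \sqrt{k}\,\beta$, which implies that $k \le (\beta/\gamma)^2$ (max number of updates).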
21 III. Linear Units
A. Preliminaries
[Figure: inputs x are weighted by w and summed to give s = y]
22 Model Assumptions and Parameters
- Training examples $(x(k), d(k))$ drawn randomly
- Parameters
- Inputs: $x(k)$
- Outputs: $y(k)$
- Desired outputs: $d(k)$
- Weights: $w(k)$
- Error: $e(k) = d(k) - y(k)$
- Error criterion (MSE)
- $\min J(w) = E[0.5\, e(k)^2]$
23 Wiener solution
- Define $P = E(x(k) d(k))$ and $R = E(x(k) x(k)^T)$.
- $J(w) = 0.5\, E[(d(k) - y(k))^2]$
- $= 0.5\, E(d(k)^2) - E(x(k) d(k))^T w + 0.5\, w^T E(x(k) x(k)^T) w$
- $= 0.5\, E(d(k)^2) - P^T w + 0.5\, w^T R w$
- Note $J(w)$ is a quadratic function of $w$. To minimize $J(w)$ find gradient, $\nabla J(w)$, and set to 0.
- $\nabla J(w) = -P + Rw = 0$
- $Rw = P$ (Wiener solution)
- If $R$ is nonsingular, then $w = R^{-1} P$.
- Resulting MSE: $0.5\,[E(d(k)^2) - P^T R^{-1} P]$
24 Iterative algorithms
- Steepest descent algorithm (move in direction of negative gradient)
- $w(k+1) = w(k) - \mu \nabla J(w(k)) = w(k) + \mu (P - R w(k))$
- Least mean square algorithm
- (approximate gradient from training example)
- $\nabla J(w(k)) \approx -e(k) x(k)$
- $w(k+1) = w(k) + \mu e(k) x(k)$
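The LMS update can be sketched directly from the rule w(k+1) = w(k) + (step size) e(k) x(k). The example below is an illustration only: the data is synthetic and noise free, the "true" weights, step size, and sample count are hypothetical choices, and the step-size symbol is written `mu`.

```python
import random

def lms(samples, mu, n_weights):
    """Least mean square algorithm for a linear unit y = w^T x.
    samples: iterable of (x, d) pairs; mu: step size."""
    w = [0.0] * n_weights
    for x, d in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))       # linear unit output
        e = d - y                                      # error e(k) = d(k) - y(k)
        w = [wi + mu * e * xi for wi, xi in zip(w, x)] # w <- w + mu e x
    return w

# Synthetic training data: d(k) = true_w^T x(k), inputs drawn at random.
rng = random.Random(0)
true_w = [2.0, -1.0]          # hypothetical target weights
samples = []
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    d = sum(ti * xi for ti, xi in zip(true_w, x))
    samples.append((x, d))

w = lms(samples, mu=0.05, n_weights=2)
```

With noisy desired outputs the weights would keep fluctuating around the Wiener solution, with larger step sizes giving faster convergence but larger fluctuation.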
25 Steepest Descent Convergence
- $w(k+1) = w(k) + \mu (P - R w(k))$. Let $w^*$ be solution.
- Center weight vector: $v = w - w^*$
- $v(k+1) = v(k) - \mu R v(k)$. Assume $R$ is nonsingular.
- Decorrelate weight vector: $u = Q^{-1} v$, where $R = Q \Lambda Q^{-1}$ is the transformation that diagonalizes $R$.
- $u(k+1) = (I - \mu \Lambda) u(k)$, so $u(k) = (I - \mu \Lambda)^k u(0)$.
- Conditions for convergence: $0 < \mu < 2/\lambda_{max}$.
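The step-size condition can be checked numerically with a small sketch. The matrix R, the solution `w_star`, and the two step sizes below are hypothetical values chosen for illustration; R has eigenvalues 1 and 3, so the bound 2/lambda_max is 2/3.

```python
def steepest_descent(R, P, mu, steps):
    """Iterate w(k+1) = w(k) + mu * (P - R w(k)) starting from w(0) = 0."""
    n = len(P)
    w = [0.0] * n
    for _ in range(steps):
        Rw = [sum(R[i][j] * w[j] for j in range(n)) for i in range(n)]
        w = [w[i] + mu * (P[i] - Rw[i]) for i in range(n)]
    return w

R = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 1 and 3
w_star = [1.0, 0.5]            # chosen solution of R w = P
P = [2.5, 2.0]                 # P = R w_star

def error(mu, steps=200):
    """Largest coordinate-wise distance from w_star after the given steps."""
    w = steepest_descent(R, P, mu, steps)
    return max(abs(wi - si) for wi, si in zip(w, w_star))

err_small = error(0.6)   # step size below 2/3: error shrinks toward zero
err_large = error(0.7)   # step size above 2/3: error grows
```

The mode along the eigenvector with the largest eigenvalue is multiplied by (1 - 3 mu) at every step, which has magnitude below one only when mu is less than 2/3.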
26 LMS Algorithm Properties
- Steepest Descent and LMS algorithm convergence depends on step size $\mu$ and eigenvalues of $R$.
- LMS algorithm is simple to implement.
- LMS algorithm convergence is relatively slow.
- Tradeoff between convergence speed and excess MSE.
- LMS algorithm can track training data that is time varying.
27 Adaptive MMSE Methods
- Training data
- Linear MMSE: LMS, RLS algorithms
- Nonlinear: Decision feedback detectors
- Blind algorithms
- Second order statistics
- Minimum Output Energy Methods
- Reduced order approximations: PCA, multistage Wiener Filter
- Higher order statistics
- Cumulants, Information based criteria
28 Designing a learning system
- Given a set of training data, design a system that can realize the desired task.
[Figure: block diagram from Inputs through Signal Processing, Feature Extraction, and Neural Network to Outputs]
29 IV. Multilayer Networks
- A. Capabilities
- Depend directly on total number of weights and threshold values.
- A one hidden layer network with a sufficient number of hidden units can approximate arbitrarily well any boolean function, pattern recognition problem, or well behaved function approximation problem.
- Sigmoidal units more powerful than linear threshold units.
30 B. Error backpropagation
- Error backpropagation algorithm: methodical way of implementing LMS algorithm for multilayer neural networks.
- Two passes: forward pass (computational pass), backward pass (weight correction pass).
- Analog computations based on MSE criterion.
- Hidden units usually sigmoidal units.
- Initialization: weights take on small random values.
- Algorithm may not converge to global minimum.
- Algorithm converges slower than for linear networks.
- Representation is distributed.
31 BP Algorithm Comments
- $\delta$s are error terms computed from output layer back to first layer in dual network.
- Training is usually done online.
- Examples presented in random or sequential order.
- Update rule is local as weight changes only involve connections to weight.
- Computational complexity depends on number of computational units.
- Initial weights randomized to avoid converging to local minima.
32 BP Algorithm Comments (continued)
- Threshold weights updated in similar manner to other weights (input fixed at 1).
- Momentum term added to speed up convergence.
- Step size set to small value.
- Sigmoidal activation derivatives simple to compute.
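The forward and backward passes can be sketched for a small network. This is an illustrative sketch under assumptions not stated in the slides: one hidden layer of tanh units, a single linear output unit, the cost J = 0.5 (d - y)^2 for one example, and hypothetical function names and weight values. It computes gradients only; momentum and the weight update loop are left out.

```python
import math

def forward(x, W1, b1, w2, b2):
    """Forward pass: tanh hidden units followed by one linear output unit."""
    h = [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
         for row, bi in zip(W1, b1)]
    y = sum(wi * hi for wi, hi in zip(w2, h)) + b2
    return h, y

def backward(x, d, W1, b1, w2, b2):
    """Backward pass: gradients of J = 0.5 * (d - y)^2 for every weight."""
    h, y = forward(x, W1, b1, w2, b2)
    delta_out = -(d - y)                         # error term at the output
    g_w2 = [delta_out * hi for hi in h]
    g_b2 = delta_out                             # threshold weight: input 1
    # Error terms propagated back through tanh, using tanh'(s) = 1 - tanh(s)^2.
    delta_hid = [delta_out * wi * (1.0 - hi * hi) for wi, hi in zip(w2, h)]
    g_W1 = [[dh * xj for xj in x] for dh in delta_hid]
    g_b1 = list(delta_hid)
    return g_W1, g_b1, g_w2, g_b2

def loss(x, d, W1, b1, w2, b2):
    """Cost for one example."""
    _, y = forward(x, W1, b1, w2, b2)
    return 0.5 * (d - y) ** 2

# Hypothetical example values.
x, d = [0.5, -1.0], 0.3
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
w2, b2 = [0.7, -0.5], 0.2
g_W1, g_b1, g_w2, g_b2 = backward(x, d, W1, b1, w2, b2)
```

A gradient descent step would subtract the step size times each gradient from the corresponding weight. Because the derivative of tanh is expressed through the hidden outputs already computed in the forward pass, the backward pass needs no extra function evaluations.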
33 BP Architecture
[Figure: forward network, where outputs of computational units are calculated, and sensitivity network, where error terms are calculated]
34 Modifications to BP Algorithm
- Batch procedure
- Variable step size
- Better approximation of gradient method (momentum term, conjugate gradient)
- Newton methods (Hessian)
- Alternate cost functions
- Regularization
- Network construction algorithms
- Incorporating time
35 When to stop training
- First major features captured. As training continues minor features captured.
- Look at training error.
- Crossvalidation (training, validation, and test sets)
[Figure: testing error and training error curves]
Learning typically slow and may find flat learning areas with little improvement in energy function.
36 C. Radial Basis Functions
- Use locally receptive units (potential functions)
- Transform input space to hidden unit space via potential functions.
- Output unit is linear.
[Figure: inputs feed potential units, whose outputs are summed by a linear output unit]
Potential units: $\phi(x) = \exp(-0.5\, \|x - c\|^2 / \sigma^2)$
37 Transformation of input space
[Figure: X and O points in input space mapped to feature space by $\Phi: X \to Z$]
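A small sketch can show how potential units transform an input space so that a linear output unit suffices. The XOR points, the two centers, the width of 1, and the equal output weights are hypothetical choices made for illustration; only the Gaussian potential function comes from the slides.

```python
import math

def potential(x, c, sigma=1.0):
    """Potential unit: exp(-0.5 * ||x - c||^2 / sigma^2)."""
    dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-0.5 * dist2 / sigma ** 2)

centers = [(0.0, 0.0), (1.0, 1.0)]     # assumed centers

def features(x):
    """Map an input point to hidden unit (feature) space."""
    return [potential(x, c) for c in centers]

# XOR labeling: not linearly separable in the input space.
xor_points = {(0.0, 0.0): 0, (1.0, 1.0): 0, (0.0, 1.0): 1, (1.0, 0.0): 1}

# Linear output unit with weights (1, 1): sum of the two features.
scores = {x: sum(features(x)) for x in xor_points}
class0 = [s for x, s in scores.items() if xor_points[x] == 0]
class1 = [s for x, s in scores.items() if xor_points[x] == 1]
```

In feature space every class 0 point scores higher than every class 1 point, so a single threshold on the linear output separates the two classes.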
38 Training Radial basis functions
- Use gradient descent on unknown parameters: centers, widths, and output weights
- Separate tasks for quicker training (first layer: centers, widths), (second layer: weights)
- First layer
- Fix widths, centers determined from lattice structure
- Fix widths, clustering algorithm for centers
- Resource allocation network
- Second layer: use LMS to learn weights
39 Comparisons between RBFs and BP Algorithm
- RBF has a single hidden layer, while BP algorithm can have many hidden layers.
- RBF (potential functions): locally receptive units, versus BP algorithm (sigmoidal units): distributed representations.
- RBF typically has many more hidden units.
- RBF training is typically quicker.
40 V. Alternate Detection Method
- Consider detection methods based on optimum margin classifiers or Support Vector Machines (SVM)
- SVM are based on concepts from statistical learning theory.
- SVM are easily extended to nonlinear decision regions via kernel functions.
- SVM solutions involve solving quadratic programming problems.
41 Optimal Margin Classifiers
Given a set of points that are linearly separable
Which hyperplane should you choose to separate points?
[Figure: linearly separable X and O points]
Choose hyperplane that maximizes distance between two sets of points.
42 Finding Optimal Hyperplane
- Draw convex hull around each set of points.
- Find shortest line segment connecting two convex hulls.
- Find midpoint of line segment.
- Optimal hyperplane intersects line segment at midpoint, perpendicular to line segment.
[Figure: X and O points with margins, weight vector w, and optimal hyperplane]
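43 Alternative Characterization of Optimal Margin Classifiers
Maximizing margins equivalent to minimizing magnitude of weight vector.
[Figure: X and O points separated by margin 2m, with weight vector w and points u and v on the margins]
$w^T u + b = 1$
$w^T v + b = -1$
$w^T (u - v) = 2$
$w^T (u - v) / \|w\| = 2 / \|w\| = 2m$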
44 Solution in 1 Dimension
O O O O O X O X X O X X X
Points on wrong side of hyperplane
If C is large SV include
If C is small SV include all points (scaled MMSE
solution)
Note that weight vector depends most heavily on
outer support vectors.
45 Comments on 1 Dimensional Solution
- Simple algorithm can be implemented to solve 1D problem.
- Solution in multiple dimensions is finding weight and then projecting down to 1D.
- Min. probability of error threshold depends on likelihood ratio.
- MMSE solution depends on all points, whereas SVM depends on SV (points that are under margin, closer to min. probability of error).
- Min. probability of error, MMSE solution, and SVM in general give different detectors.
46 Kernel Methods
In many classification and detection problems a linear classifier is not sufficient. However, working in higher dimensions can lead to curse of dimensionality.
Solution: Use kernel methods where computations done in dual observation space.
[Figure: X and O points in input space mapped to feature space by $\Phi: X \to Z$]
47 Solving QP problem
- SVM require solving large QP problems. However, many $\alpha$s are zero (not support vectors). Break up QP into subproblems.
- Chunking (Vapnik 1979): numerical solution.
- Osuna algorithm (1997): numerical solution.
- Platt algorithm (1998): Sequential Minimal Optimization (SMO), analytical solution.
48 SMO Algorithm
- Sequential Minimal Optimization breaks up QP program into small subproblems that are solved analytically.
- SMO solves dual QP SVM problem by examining points that violate KKT conditions.
- Algorithm converges and consists of
- Search for 2 points that violate KKT conditions.
- Solve QP program for 2 points.
- Calculate threshold value b.
- Continue until all points satisfy KKT conditions.
- On numerous benchmarks time to convergence of SMO varied from $O(l)$ to $O(l^{2.2})$. Convergence time depends on difficulty of classification problem and kernel functions used.
49 SVM Summary
- SVM are based on optimum margin classifiers and are solved using quadratic programming methods.
- SVM are easily extended to problems that are not linearly separable.
- SVM can create nonlinear separating surfaces via kernel functions.
- SVM can be efficiently programmed via the SMO algorithm.
- SVM can be extended to solve regression problems.
50 VI. Unsupervised Learning
- Motivation
- Given a set of training examples with no teacher or critic, why do we learn?
- Feature extraction
- Data compression
- Signal detection and recovery
- Self organization
- Information can be found about data from inputs.
51 B. Principal Component Analysis
- Introduction
- Consider a zero mean random vector $x \in \mathbb{R}^n$ with autocorrelation matrix $R = E(x x^T)$.
- $R$ has eigenvectors $q(1), \ldots, q(n)$ and associated eigenvalues $\lambda(1) \ge \cdots \ge \lambda(n)$.
- Let $Q = [q(1) \; \cdots \; q(n)]$ and $\Lambda$ be a diagonal matrix containing eigenvalues along diagonal.
- Then $R = Q \Lambda Q^T$ can be decomposed into eigenvector and eigenvalue decomposition.
52 First Principal Component
- Find $\max x^T R x$ subject to $\|x\| = 1$.
- Maximum obtained when $x = q(1)$ as this corresponds to $x^T R x = \lambda(1)$.
- $q(1)$ is first principal component of $x$ and also yields direction of maximum variance.
- $y(1) = q(1)^T x$ is projection of $x$ onto first principal component.
[Figure: vector x, first principal component q(1), and projection y(1)]
53 Other Principal Components
- $i$th principal component denoted by $q(i)$ and projection denoted by $y(i) = q(i)^T x$ with $E(y(i)) = 0$ and $E(y(i)^2) = \lambda(i)$.
- Note that $y = Q^T x$ and we can obtain data vector $x$ from $y$ by noting that $x = Qy$.
- We can approximate $x$ by taking first $m$ principal components (PC) to get $z$: $z = q(1) y(1) + \cdots + q(m) y(m)$. Error given by $e = x - z$. $e$ is orthogonal to $q(i)$ when $1 \le i \le m$.
54 Diagram of PCA
[Figure: scatter of data points with first PC and second PC directions]
First PC gives more information than second PC.
55 Learning algorithms for PCA
- Hebbian learning rule: when presynaptic and postsynaptic signals are positive, the weight associated with the synapse increases in strength.
[Figure: input x, weight w, output y]
$\Delta w = \mu x y$
56 Oja's rule
- Use normalized Hebbian rule applied to linear neuron.
[Figure: inputs x are weighted by w and summed to give s = y]
Need normalized Hebbian rule, otherwise weight vector will grow unbounded.
57 Oja's rule continued
- $w_i(k+1) = w_i(k) + \mu x_i(k) y(k)$ (apply Hebbian rule)
- $w(k+1) = w(k+1) / \|w(k+1)\|$ (renormalize weight)
- Unfortunately above rule is difficult to implement, so a modification approximates above rule, giving
- $w_i(k+1) = w_i(k) + \mu y(k) (x_i(k) - y(k) w_i(k))$
- Similar to Hebbian rule with modified input.
- Can show that $w(k) \to q(1)$ with probability one given that $x(k)$ is zero mean second order and drawn from a fixed distribution.
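Oja's rule can be tried on synthetic data with a known first principal component. The sketch below is illustrative only: the data distribution (standard deviation 2 along q1 and 0.5 along q2), the constant step size, and the starting weight vector are hypothetical choices, and a constant step size leaves a small residual fluctuation rather than exact convergence.

```python
import math
import random

def oja(samples, mu, w0):
    """Apply w(k+1) = w(k) + mu * y(k) * (x(k) - y(k) w(k)) over the samples."""
    w = list(w0)
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))       # linear neuron output
        w = [wi + mu * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# Zero mean 2D data whose autocorrelation matrix has eigenvectors q1 and q2.
rng = random.Random(0)
q1 = (1 / math.sqrt(2), 1 / math.sqrt(2))     # direction of largest variance
q2 = (1 / math.sqrt(2), -1 / math.sqrt(2))
samples = []
for _ in range(20000):
    a = rng.gauss(0.0, 2.0)    # component along q1
    b = rng.gauss(0.0, 0.5)    # component along q2
    samples.append((a * q1[0] + b * q2[0], a * q1[1] + b * q2[1]))

w = oja(samples, mu=0.005, w0=[1.0, 0.0])
alignment = abs(w[0] * q1[0] + w[1] * q1[1])   # |w . q1|, close to 1 if aligned
norm = math.sqrt(w[0] ** 2 + w[1] ** 2)        # stays close to 1 without renormalizing
```

The absolute value in `alignment` accounts for the sign ambiguity: the rule may settle on q1 or on its negative, both of which span the same principal direction.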
58 Learning other PCs
- Adaptive learning rules (subtract larger PCs out)
- Generalized Hebbian Algorithm
- APEX
- Batch Algorithm (singular value decomposition)
- Approximate correlation matrix R with time averages.
59 Applications of PCA
- Matched Filter problem: $x(k) = s(k) + \sigma v(k)$.
- Multiuser communications: CDMA
- Image coding (data compression)
[Figure: block diagram with GHA, quantizer, and PCA blocks]
60 Kernel Methods
In many classification and detection problems a linear classifier is not sufficient. However, working in higher dimensions can lead to curse of dimensionality.
Solution: Use kernel methods where computations done in dual observation space.
[Figure: X and O points in input space mapped to feature space by $\Phi: X \to Z$]
61 C. Independent Component Analysis
- PCA decorrelates inputs. However in many instances we may want to make outputs independent.
[Figure: inputs U pass through A to give X, which passes through W to give Y]
Inputs U assumed independent and user sees X. Goal is to find W so that Y is independent.
62 ICA Solution
- $Y = DPU$ where $D$ is a diagonal matrix and $P$ is a permutation matrix.
- Algorithm is unsupervised. What are assumptions where learning is possible? All components of $U$ except possibly one are nongaussian.
- Establish criterion to learn from (use higher order statistics): information based criteria, kurtosis function.
- Kullback Leibler Divergence
- $D(f,g) = \int f(x) \log (f(x)/g(x)) \, dx$
63 ICA Information Criterion
- Kullback Leibler Divergence nonnegative.
- Set $f$ to joint density of $Y$ and $g$ to products of marginals of $Y$, then
- $D(f,g) = -H(Y) + \sum_i H(Y_i)$
- which is minimized when components of $Y$ are independent.
- When outputs are independent they can be a permutation and scaled version of $U$.
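The identity between the divergence and the entropies can be checked numerically. The sketch below swaps the densities and integral of the slides for a discrete joint distribution over two binary outputs, which is an assumption made to keep the example short; the probability tables are hypothetical.

```python
import math

def entropy(p):
    """Entropy -sum p log p of a discrete distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_joint_vs_marginals(joint):
    """D(f, g) with f the joint table and g the product of its marginals."""
    rows = [sum(r) for r in joint]               # marginal of Y1
    cols = [sum(c) for c in zip(*joint)]         # marginal of Y2
    d = 0.0
    for i, r in enumerate(joint):
        for j, pij in enumerate(r):
            if pij > 0:
                d += pij * math.log(pij / (rows[i] * cols[j]))
    return d, rows, cols

# Dependent outputs: divergence is strictly positive.
joint_dep = [[0.4, 0.1], [0.1, 0.4]]
d_dep, rows, cols = kl_joint_vs_marginals(joint_dep)
flat = [p for r in joint_dep for p in r]
identity = -entropy(flat) + entropy(rows) + entropy(cols)   # -H(Y) + sum H(Yi)

# Independent outputs: joint equals product of marginals, divergence is zero.
joint_ind = [[0.3 * 0.6, 0.3 * 0.4], [0.7 * 0.6, 0.7 * 0.4]]
d_ind, _, _ = kl_joint_vs_marginals(joint_ind)
```

Here `d_dep` matches `identity` up to rounding, and `d_ind` is zero up to rounding, in line with the divergence being minimized when the components are independent.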
64 Learning Algorithms
- Can learn weights by approximating divergence cost function using contrast functions.
- Iterative gradient estimate algorithms can be used.
- Faster convergence can be achieved with fixed point algorithms that approximate Newton's methods.
- Algorithms have been shown to converge.
65 Applications of ICA
- Array antenna processing
- Blind source separation: speech separation, biomedical signals, financial data
66 D. Competitive Learning
- Motivation: Neurons compete with one another with only one winner emerging.
- Brain is a topologically ordered computational map.
- Array of neurons self organize.
- Generalized competitive learning algorithm.
1) Initialize weights
2) Randomly choose inputs
3) Pick winner.
4) Update weights associated with winner.
5) Go to 2).
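The generalized competitive learning loop can be sketched as below. Several details are assumptions rather than part of the slides: the winner is the unit whose weight vector is closest to the input in Euclidean distance, the winner moves a fraction `mu` of the way toward the input, weights are initialized from randomly chosen data points, and the loop stops after a fixed number of steps. The two-cluster data set is synthetic.

```python
import random

def competitive_learning(data, n_units, mu=0.05, steps=5000, seed=0):
    """Winner-take-all learning with no topological ordering."""
    rng = random.Random(seed)
    weights = [list(x) for x in rng.sample(data, n_units)]   # 1) initialize weights
    for _ in range(steps):
        x = rng.choice(data)                                 # 2) randomly choose input
        winner = min(                                        # 3) pick winner
            range(n_units),
            key=lambda j: sum((wi - xi) ** 2 for wi, xi in zip(weights[j], x)),
        )
        weights[winner] = [                                  # 4) update winner only
            wi + mu * (xi - wi) for wi, xi in zip(weights[winner], x)
        ]
    return weights                                           # 5) loop ends after `steps`

# Synthetic data: two clusters, around (0, 0) and (5, 5).
data_rng = random.Random(1)
data = [(data_rng.gauss(0.0, 0.3), data_rng.gauss(0.0, 0.3)) for _ in range(100)]
data += [(data_rng.gauss(5.0, 0.3), data_rng.gauss(5.0, 0.3)) for _ in range(100)]

units = sorted(competitive_learning(data, n_units=2))
```

After training, each unit sits near one cluster center, so the weights act as learned prototypes for the inputs.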
67 Competitive Learning Algorithm
- K means algorithm (no topological ordering)
- Online algorithm
- Update centers
- Reclassify points
- Converges to local minima
- Kohonen Self Organization Feature Map (topological ordering)
- Neurons arranged on lattice
- Weights that are updated depend on winner, step size, and neighborhood.
- Decrease step size and neighborhood size to get topological ordering.
68 KSOFM 2 dimensional lattice
69 Neural Network Applications
- Backgammon (Feedforward network)
- 459-24-24-1 network to rate moves
- Hand crafted examples, noise helped in training
- 59% winning percentage against SUN gammontools
- Later versions used reinforcement learning
- Handwritten zip code (Feedforward network)
- 16-768-192-30-10 network to distinguish numbers
- Preprocessed data, 2 hidden layers act as feature detectors
- 7291 training examples, 2000 test examples
- Training data .14%, test data 5%, test/reject data 1%, 12%
70 Neural Network Applications
- Speech recognition
- KSOFM map followed by feedforward neural network
- 40 to 120 frames mapped onto 12 by 12 Kohonen map
- Each frame composed of 600 to 1800 analog vector
- Output of Kohonen map fed to feedforward network
- Reduced search using KSOFM map
- TI 20 word data base: 98-99% correct on speaker dependent classification
71 Other topics
- Reinforcement learning
- Associative networks
- Neural dynamics and control
- Computational learning theory
- Bayesian learning
- Neuroscience
- Cognitive science
- Hardware implementation