Title: Neural Network Architecture and Learning
1 Neural Network Architecture and Learning
- Guest Lecture by
- Some slides by Jim Rehg
2 Recursive Error Propagation
Now we can recursively compute the errors for nodes
farther from the output, and use the results to
compute gradients for the intermediate weights.
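For reference, a standard way to write this recursion in the notation of Bishop (whose figures these slides cite); the symbols a_j, z_i, h, and w_kj are assumed here, since the slide itself gives no formulas:

```latex
% error (delta) at hidden node j, combining the deltas of the nodes k it feeds
\delta_j = h'(a_j) \sum_k w_{kj}\,\delta_k
% gradient for the weight from node i to node j:
% (backwards) error times (forwards) node activation
\frac{\partial E}{\partial w_{ji}} = \delta_j \, z_i
```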
3 Summary: Calculating the Gradient Using Backprop
- Do a forward pass with the current parameters to obtain the node activations
- Compute the errors for the output nodes
- Recursively backpropagate the errors from output to input
- The weight gradient is given by the (backwards) error and the (forwards) node activation
- Reduce the error with gradient-based methods, as sketched below
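A minimal numpy sketch of these steps for a single training pattern, assuming one tanh hidden layer, a linear output, sum-of-squares error, and no bias terms; all of these choices, and the numbers, are illustrative rather than taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (3, 1))        # hidden-layer weights (3 hidden units, 1 input)
W2 = rng.normal(0, 0.1, (1, 3))        # output-layer weights (1 output)
x = np.array([[0.5]]); t = np.array([[0.2]]); eta = 0.1

for _ in range(100):
    # 1. forward pass with the current parameters
    a1 = W1 @ x; z1 = np.tanh(a1)      # hidden activations
    y = W2 @ z1                        # linear output
    # 2. error (delta) at the output node
    d2 = y - t
    # 3. recursion: backpropagate the error toward the input
    d1 = (1.0 - z1**2) * (W2.T @ d2)   # tanh'(a1) times the backpropagated delta
    # 4. weight gradient = (backwards) error times (forwards) activation
    g2 = d2 @ z1.T; g1 = d1 @ x.T
    # 5. reduce the error with a gradient step
    W2 -= eta * g2; W1 -= eta * g1
```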
4 Properties of Neural Networks
- A fixed number of basis functions that adapt to the data
- Universal function approximator
- Wide range of architectural possibilities
- Trivially easy to handle very large datasets (out-of-memory training: the data need not fit in memory)
- Patterns are presented to the network sequentially and the weights updated as each arrives (see the sketch below)
- Backprop is efficient: O(W) in the number of weights W
- Many ways to make it faster
- Hessian updates, conjugate gradient, quickprop, etc.
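A toy sketch of the sequential, pattern-by-pattern update for a single linear unit with squared error; the dataset, target weights, and learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))             # 1000 patterns, 4 inputs each
t = X @ np.array([1.0, -2.0, 0.5, 0.0])    # targets from a known linear map
w, eta = np.zeros(4), 0.01

for epoch in range(5):
    for x_n, t_n in zip(X, t):             # patterns presented sequentially
        y_n = w @ x_n                      # forward pass for this pattern
        w -= eta * (y_n - t_n) * x_n       # immediate weight update, O(W) work
```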
5 Adaptive Bases
6 Construction of Input-Output Mapping
7 Neural Net as Universal Function Approximator
Fig. 5.3 from Bishop. Inputs are all one-dimensional in these examples; each panel shows the training data and the learned function. Neural nets are powerful.
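In the spirit of that figure, a small self-contained sketch that fits a three-hidden-unit tanh network to samples of a one-dimensional function; the target sin(pi*x), the sample count, and the training schedule are illustrative choices, not taken from Bishop:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50).reshape(1, -1)       # 50 one-dimensional inputs
t = np.sin(np.pi * x)                           # values of the target function
W1, b1 = rng.normal(0, 1, (3, 1)), np.zeros((3, 1))
W2, b2 = rng.normal(0, 1, (1, 3)), np.zeros((1, 1))
eta = 0.1

for _ in range(20000):                          # full-batch gradient descent
    z = np.tanh(W1 @ x + b1)                    # hidden layer (3 tanh units)
    y = W2 @ z + b2                             # linear output
    d2 = (y - t) / x.shape[1]                   # output error (mean squared error)
    d1 = (1.0 - z**2) * (W2.T @ d2)             # backpropagated hidden error
    W2 -= eta * d2 @ z.T; b2 -= eta * d2.sum(axis=1, keepdims=True)
    W1 -= eta * d1 @ x.T; b1 -= eta * d1.sum(axis=1, keepdims=True)
# after training, y should roughly follow sin(pi * x) on [-1, 1]
```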
8 Modular Training Via Jacobian
Fig. 5.8 from Bishop
- Given a pre-trained model
- How to update a weight w in the blue module efficiently?
- The green module has no effect
- The red module participates in learning via its Jacobian (see the chain rule below)
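One way to write this with the chain rule (the symbols are assumed, since the figure is not reproduced here): if the blue-module weight w affects that module's outputs z_j, which feed the red module with outputs y_k, then

```latex
\frac{\partial E}{\partial w}
  \;=\; \sum_{k}\sum_{j}
        \frac{\partial E}{\partial y_k}\;
        \frac{\partial y_k}{\partial z_j}\;
        \frac{\partial z_j}{\partial w},
\qquad
J_{kj} \;=\; \frac{\partial y_k}{\partial z_j}
\quad\text{(the red module's Jacobian)}
```

The green module never appears because none of its outputs lie on a path from w to the error E.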
9 Challenges in Neural Net Training
- Objective function is nonlinear, nonconvex
- Local minima are a significant problem
- How to control capacity?
10 Capacity Control
- Capacity of the network is roughly the number of hidden units
- Many schemes for determining the number of hidden units
- Standard approach to capacity control is regularization via early stopping
11 Early Stopping for Regularization
Fig. 5.13 from Bishop
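A generic sketch of early stopping as a training loop; step() (one epoch of training, returning the current parameters) and val_error() (error on a held-out validation set) are hypothetical callables, not names from the lecture:

```python
def train_with_early_stopping(step, val_error, max_epochs=1000, patience=20):
    # step():            run one epoch of training, return the current parameters
    # val_error(params): error of those parameters on a held-out validation set
    best_err, best_params, since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        params = step()
        err = val_error(params)
        if err < best_err:              # validation error still improving
            best_err, best_params, since_best = err, params, 0
        else:                           # validation error flat or rising
            since_best += 1
            if since_best >= patience:  # stop before overfitting gets worse
                break
    return best_params, best_err        # weights from the best validation point
```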
12 Numerical Optimization
- Training is a local, gradient-based method
- Various techniques for avoiding local minima
- Momentum, stochastic gradient, etc.
- Initialization procedure must be well-designed
- Suppose weights are chosen to saturate the activation function outputs?
- Suppose weights are initialized to zero?
- Solution: initialize weights to small nonzero values (on the linear part of the activation function); see the sketch below
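A minimal sketch of such an initialization; the 1/sqrt(fan_in) scaling is one common heuristic and is an assumption here, since the slide only asks for small nonzero values:

```python
import numpy as np

def init_small_weights(fan_in, fan_out, rng=None):
    # Small nonzero values break the symmetry that an all-zero initialization
    # would leave in place, and they keep pre-activations near zero, i.e. on
    # the roughly linear part of tanh/sigmoid, so units do not start out
    # saturated with near-zero gradients.
    rng = rng if rng is not None else np.random.default_rng(0)
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
```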
13 Invariance
- How to handle invariance to nuisance parameters?
- Rotation, position, and scale of patterns such as handwritten digits
- Solution 1: augment the training data set (see the sketch below)
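A sketch of Solution 1 for digit images, assuming scipy is available; the transformation ranges are illustrative, not taken from the lecture:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment_digit(img, rng):
    # Produce a randomly transformed copy of a digit image so the network sees
    # the rotation/position variation it should be invariant to.  Scale could
    # be perturbed similarly with scipy.ndimage.zoom.
    img = rotate(img, angle=rng.uniform(-15, 15), reshape=False, mode="nearest")
    img = shift(img, shift=rng.uniform(-2, 2, size=2), mode="nearest")
    return img

# e.g. train on the originals plus several augmented copies of each digit:
# extra = [augment_digit(d, np.random.default_rng(i)) for i, d in enumerate(digits)]
```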
14 Invariance
- Solution 2: Tangent Propagation
15 Tangent Propagation
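A sketch of the tangent propagation objective, following Bishop Section 5.5.4 (the notation is assumed, since the slide shows no formula): the usual error E is augmented with a penalty on how much the outputs y change along the tangent direction tau_n of the nuisance transformation (e.g., a small rotation) at each input x_n,

```latex
\tilde{E} \;=\; E \;+\; \lambda\,\Omega,
\qquad
\Omega \;=\; \tfrac{1}{2}\sum_{n}\sum_{k}
  \Bigl(\sum_{i} J_{nki}\,\tau_{ni}\Bigr)^{2},
\qquad
J_{nki} \;=\; \frac{\partial y_{nk}}{\partial x_{ni}}
```

so the network is encouraged to be invariant to small versions of the transformation without explicitly enlarging the training set.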
16 Convolutional Networks