Title: Learning: Neural Networks
1 Learning: Neural Networks
- Artificial Intelligence
- CMSC 25000
- February 5, 2004
2 Roadmap
- Neural Networks
- Motivation: Overcoming perceptron limitations
- Motivation: ALVINN
- Heuristic Training
- Backpropagation: Gradient descent
- Avoiding overfitting
- Avoiding local minima
- Conclusion: Teaching a Net to talk
3 Perceptron Summary
- Motivated by neuron activation
- Simple training procedure
- Guaranteed to converge
- IF linearly separable
4 Neural Nets
- Multi-layer perceptrons
- Inputs: real-valued
- Intermediate hidden nodes
- Output(s): one (or more) discrete-valued
[Figure: feedforward network with inputs x1-x4, two hidden layers, and outputs y1, y2]
5 Neural Nets
- Pro: More general than perceptrons
- Not restricted to linear discriminants
- Multiple outputs: one classification each
- Con: No simple, guaranteed training procedure
- Use greedy, hill-climbing procedure to train
- Gradient descent, Backpropagation
6 Solving the XOR Problem
- Network topology: 2 hidden nodes (o1, o2), 1 output (y)
[Figure: inputs x1, x2 and constant -1 bias inputs feed o1 (weights w11, w21, w01) and o2 (weights w12, w22, w02); o1, o2 and a -1 bias feed y (weights w13, w23, w03)]
- Desired behavior:
  x1  x2  o1  o2  y
   0   0   0   0  0
   1   0   0   1  1
   0   1   0   1  1
   1   1   1   1  0
- Weights: w11 = w12 = 1, w21 = w22 = 1, w01 = 3/2, w02 = 1/2, w03 = 1/2, w13 = -1, w23 = 1 (verified in the sketch below)
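To make the weight table concrete, here is a minimal Python sketch of the network above, assuming step-threshold units and treating each -1 input as a bias connection (the weight values come from the slide; the function names and code itself are illustrative, not from the lecture):

```python
# Minimal sketch of the slide's XOR network with step-threshold units.
# Weights follow the slide: w11 = w12 = w21 = w22 = 1, biases w01 = 3/2,
# w02 = 1/2, w03 = 1/2, and output weights w13 = -1, w23 = 1.

def step(z):
    """Threshold unit: fire (1) if the net input is positive, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    o1 = step(1 * x1 + 1 * x2 - 3/2)   # hidden node o1 acts as AND(x1, x2)
    o2 = step(1 * x1 + 1 * x2 - 1/2)   # hidden node o2 acts as OR(x1, x2)
    y = step(-1 * o1 + 1 * o2 - 1/2)   # output: OR but not AND, i.e. XOR
    return o1, o2, y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))   # reproduces the desired-behavior table
```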
7 Neural Net Applications
- Speech recognition
- Handwriting recognition
- NETtalk: Letter-to-sound rules
- ALVINN: Autonomous driving
8 ALVINN
- Driving as a neural network
- Inputs
- Image pixel intensities
- I.e. lane lines
- 5 Hidden nodes
- Outputs
- Steering actions
- E.g., turn left/right and how far
- Training
- Observe human behavior: sample images and steering
9 Backpropagation
- Greedy, hill-climbing procedure
- Weights are parameters to change
- Original hill-climb changes one parameter/step
- Slow
- If smooth function, change all parameters/step
- Gradient descent
- Backpropagation: Computes current output, works backward to correct error
10 Producing a Smooth Function
- Key problem
- Pure step threshold is discontinuous
- Not differentiable
- Solution
- Sigmoid (squashed s function): the logistic function (written out below)
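The logistic function meant here, together with the derivative identity used on slide 15, can be written as:

```latex
s(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{ds(z)}{dz} = s(z)\bigl(1 - s(z)\bigr)
```

Unlike the step threshold, s(z) is smooth and differentiable everywhere, which is what makes gradient descent applicable.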
11 Neural Net Training
- Goal
- Determine how to change weights to get correct output
- Large change in weight to produce large reduction in error
- Approach
- Compute actual output o
- Compare to desired output d
- Determine effect of each weight w on error (d - o)
- Adjust weights
12 Neural Net Example
- xi: i-th sample input vector; w: weight vector
- yi: desired output for the i-th sample
- Sum-of-squares error over training samples (written out below)
- Full expression of output in terms of input and weights
- From 6.034 notes (Lozano-Perez)
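The slide's equations are not reproduced in these notes; a standard way to write the sum-of-squares error, with o(x^i, w) denoting the network's output on the i-th sample (notation assumed here, superscripts indexing samples), is:

```latex
E(w) = \sum_i \bigl(y^i - o(x^i, w)\bigr)^2
```

Some treatments include a factor of 1/2 to simplify the derivative; the full expression of o(x^i, w) comes from composing the sigmoid units layer by layer.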
13 Gradient Descent
- Error: Sum-of-squares error of inputs with current weights
- Compute rate of change of error w.r.t. each weight
- Which weights have greatest effect on error?
- Effectively, partial derivatives of error w.r.t. weights
- In turn, these depend on other weights -> chain rule
14 Gradient Descent
- E = G(w)
- Error as function of weights
- Find rate of change of error
- Follow steepest rate of change
- Change weights s.t. error is minimized
[Figure: error E = G(w) plotted against weight w, with slope dG/dw, weights w0 and w1, and local minima marked]
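Concretely, following the steepest rate of change means adjusting each weight against its own partial derivative, scaled by the rate parameter r introduced on slide 16:

```latex
w \leftarrow w - r\,\frac{\partial E}{\partial w}
```

Starting from an initial weight such as w0, repeated updates walk downhill on G(w); as the figure indicates, this can settle in a local minimum rather than the global one.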
15 Gradient of Error
- Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1))
- From 6.034 notes (Lozano-Perez)
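The full derivation is in the referenced notes; the key step is the chain rule. For a weight w feeding a sigmoid unit with net input z and output o = s(z), a sketch of the decomposition is:

```latex
\frac{\partial E}{\partial w}
  = \frac{\partial E}{\partial o}\,\frac{\partial o}{\partial z}\,\frac{\partial z}{\partial w}
  = \frac{\partial E}{\partial o}\; s(z)\bigl(1 - s(z)\bigr)\,\frac{\partial z}{\partial w}
```

The middle factor is exactly the sigmoid-derivative identity noted above, which is why it reappears in every backpropagation update.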
16 From Effect to Update
- Gradient computation
- How each weight contributes to performance
- To train
- Need to determine how to CHANGE weight based on contribution to performance
- Need to determine how MUCH change to make per iteration
- Rate parameter r
- Large enough to learn quickly
- Small enough to reach but not overshoot target values
17 Backpropagation Procedure
[Figure: chain of nodes i -> j -> k]
- Pick rate parameter r
- Until performance is good enough,
- Do forward computation to calculate output
- Compute Beta in output node
- Compute Beta in all other nodes
- Compute change for all weights (one form of these update formulas is sketched below)
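The formulas referred to above are not reproduced in these notes; one common statement of the sigmoid-unit backpropagation rules, using Beta for the error term at each node and r for the rate parameter (a standard textbook form, offered as a sketch rather than the lecture's exact notation), is:

```latex
\begin{aligned}
\beta_z &= d_z - o_z && \text{(output node $z$)} \\
\beta_j &= \sum_k w_{j \to k}\, o_k (1 - o_k)\, \beta_k && \text{(node $j$ feeding nodes $k$)} \\
\Delta w_{i \to j} &= r\, o_i\, o_j (1 - o_j)\, \beta_j
\end{aligned}
```

Each update uses only the inputs and outputs local to the weight's two endpoints, which is the efficiency observation made on slide 19.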
18 Backprop Example
Forward prop: compute zi and yi given xk, wl (a worked code sketch follows)
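As a concrete illustration, here is a small Python sketch of one forward and one backward pass on a 2-2-1 sigmoid network (the same topology as the XOR example). The names, the rate value, and the bias convention (a weight on a constant +1 input) are illustrative choices, not taken from the lecture:

```python
import math
import random

# One forward pass and one backprop update on a 2-2-1 sigmoid network.
# Beta terms and weight changes follow the rules sketched on slide 17.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, d, w_h, b_h, w_o, b_o, r=0.5):
    # Forward computation: hidden outputs o_h, then network output o_y.
    o_h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
           for ws, b in zip(w_h, b_h)]
    o_y = sigmoid(sum(w * o for w, o in zip(w_o, o_h)) + b_o)

    # Backward pass: Beta at the output node, then at each hidden node.
    beta_y = d - o_y
    beta_h = [w_o[j] * o_y * (1 - o_y) * beta_y for j in range(len(o_h))]

    # Weight changes: r * (input to the weight) * o * (1 - o) * Beta.
    for j in range(len(o_h)):
        w_o[j] += r * o_h[j] * o_y * (1 - o_y) * beta_y
        for i in range(len(x)):
            w_h[j][i] += r * x[i] * o_h[j] * (1 - o_h[j]) * beta_h[j]
        b_h[j] += r * 1.0 * o_h[j] * (1 - o_h[j]) * beta_h[j]
    b_o += r * 1.0 * o_y * (1 - o_y) * beta_y
    return o_y, b_o

# Small random initial weights (see slide 22) and one step on x = [1, 0], d = 1.
random.seed(0)
w_h = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(2)]
b_h = [0.0, 0.0]
w_o = [random.uniform(-0.1, 0.1) for _ in range(2)]
b_o = 0.0
o_y, b_o = train_step([1, 0], 1, w_h, b_h, w_o, b_o)
print("output before this update:", o_y)
```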
19 Backpropagation Observations
- Procedure is (relatively) efficient
- All computations are local
- Use inputs and outputs of current node
- What is good enough?
- Rarely reach target (0 or 1) outputs
- Typically, train until within 0.1 of target
20 Neural Net Summary
- Training
- Backpropagation procedure
- Gradient descent strategy (usual problems)
- Prediction
- Compute outputs based on input vector and weights
- Pros: Very general; fast prediction
- Cons: Training can be VERY slow (1000s of epochs); overfitting
21 Training Strategies
- Online training
- Update weights after each sample
- Offline (batch) training
- Compute error over all samples
- Then update weights
- Online training is noisy (contrast sketched below)
- Sensitive to individual instances
- However, may escape local minima
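A schematic contrast between the two strategies, written for a toy linear unit with squared error so the gradient fits on one line (model, data, and rate here are illustrative, not from the lecture):

```python
import numpy as np

def gradient(w, x, y):
    """Gradient of the squared error (y - w.x)^2 for a single sample."""
    return -2.0 * (y - w @ x) * x

def online_epoch(w, samples, r):
    # Online: update after each sample -- noisy, but may escape local minima.
    for x, y in samples:
        w = w - r * gradient(w, x, y)
    return w

def batch_epoch(w, samples, r):
    # Offline/batch: accumulate error over all samples, then update once.
    g = sum(gradient(w, x, y) for x, y in samples)
    return w - r * g

samples = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 0.0)]
w0 = np.zeros(2)
print(online_epoch(w0, samples, 0.1))
print(batch_epoch(w0, samples, 0.1))
```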
22 Training Strategy
- To avoid overfitting
- Split data into training, validation, test sets
- Also, avoid excess weights (fewer weights than samples)
- Initialize with small random weights
- Small changes have noticeable effect
- Use offline training
- Until validation set error reaches its minimum (sketched below)
- Evaluate on test set
- No more weight changes
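A sketch of this strategy on the same kind of toy linear model as above; the data split, helper names, and parameter values are illustrative, not from the lecture:

```python
import numpy as np

def error(w, samples):
    """Sum-of-squares error of a linear unit over a set of samples."""
    return sum((y - w @ x) ** 2 for x, y in samples)

def batch_update(w, samples, r):
    g = sum(-2.0 * (y - w @ x) * x for x, y in samples)
    return w - r * g

def train_early_stopping(w, train, valid, r=0.01, max_epochs=200):
    best_w, best_err = w, error(w, valid)
    for _ in range(max_epochs):
        w = batch_update(w, train, r)         # offline training step
        e = error(w, valid)
        if e < best_err:                      # validation error still falling:
            best_w, best_err = w, e           # remember these weights
    return best_w                             # weights at the validation minimum

rng = np.random.default_rng(0)
data = [(rng.normal(size=2), float(rng.integers(0, 2))) for _ in range(30)]
train, valid, test = data[:20], data[20:25], data[25:]
w = rng.normal(scale=0.1, size=2)             # small random initial weights
w = train_early_stopping(w, train, valid)
print("test error:", error(w, test))          # evaluated once; no more weight changes
```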
23 Classification
- Neural networks are best for classification tasks
- Single output -> Binary classifier
- Multiple outputs -> Multiway classification
- Applied successfully to learning pronunciation
- Sigmoid pushes to binary classification
- Not good for regression
24 Neural Net Example
- NETtalk: Letter-to-sound by net
- Inputs
- Need context to pronounce
- 7-letter window: predict sound of middle letter
- 29 possible characters: alphabet plus space, comma, period
- 7 x 29 = 203 inputs
- 80 Hidden nodes
- Output: Generate 60 phones
- Nodes map to 26 units: 21 articulatory, 5 stress/sil
- Vector quantization of acoustic space
25 Neural Net Example: NETtalk
- Learning to talk
- 5 iterations / 1024 training words: boundaries/stress
- 10 iterations: intelligible
- 400 new test words: 80% correct
- Not as good as DecTalk, but automatic
26 Neural Net Conclusions
- Simulation based on neurons in brain
- Perceptrons (single neuron)
- Guaranteed to find linear discriminant
- IF one exists -> problem: XOR
- Neural nets (Multi-layer perceptrons)
- Very general
- Backpropagation training procedure
- Gradient descent: local minima, overfitting issues