Title: Chapter 4: Artificial Neural Networks
1 Chapter 4: Artificial Neural Networks
2 Artificial neural network (ANN)
- A general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples
- The BACKPROPAGATION algorithm
- uses gradient descent to tune network parameters to best fit a training set of input-output pairs
- ANN learning is robust to errors in the training examples
- Applications: interpreting visual scenes, speech recognition, learning robot control strategies
3 Biological motivation
- Modeled loosely after biological neural systems
- Parallel computation (parallel computing)
- Distributed representation
- ANNs differ in many ways from real biological systems
- The goal here is effective machine learning, not biological modeling
4 ALVINN system (a neural network that learns to steer an autonomous vehicle)
5 Problems appropriate for neural network learning
- Instances are represented by many attribute-value pairs
- The target function output may be discrete-valued, real-valued, or a vector of values
- The training examples may contain errors (noise)
- Long training times are acceptable
- Fast evaluation of the learned target function may be required
- The ability of humans to understand the learned target function is not important
6 Perceptrons
- Input: a vector of real-valued values
- Parameters: weights and a threshold
- Learning: choosing values for the weights (the output is defined below)
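For reference, the perceptron's output is a thresholded linear combination of its inputs, in the usual notation (w_0 plays the role of the negated threshold):

    % Perceptron output: thresholded linear combination of the inputs.
    o(x_1, \dots, x_n) =
      \begin{cases}
        1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\
        -1 & \text{otherwise}
      \end{cases}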
7 Hypothesis space of perceptron learning
- The set of all possible real-valued weight vectors for an n-dimensional input vector
8 Representational power of perceptrons
- A hyperplane decision surface: can classify linearly separable examples
- Many boolean functions (but not XOR, which is not linearly separable)
- m-of-n functions
- Every boolean function: networks of units two levels deep (disjunctive normal form), as illustrated below
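A quick illustration of the point above; the weight assignment for AND is a common textbook choice, not something given on the slide:

    # Hypothetical weights realizing AND: w0 = -0.8, w1 = w2 = 0.5.
    def perceptron(x1, x2, w0=-0.8, w1=0.5, w2=0.5):
        return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

    for x1 in (0, 1):
        for x2 in (0, 1):
            print((x1, x2), perceptron(x1, x2))  # outputs 1 only for (1, 1)

    # XOR has no such weight assignment: (0,1) and (1,0) cannot be
    # separated from (0,0) and (1,1) by a single hyperplane.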
9 Perceptron rule
- Begin with random weights, then modify the weights whenever the perceptron misclassifies an example (a sketch follows)
- Converges provided the training examples are linearly separable
- and the learning rate is sufficiently small
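A minimal sketch of the perceptron training rule, assuming bipolar targets in {-1, +1} and a leading bias column in X (function and variable names are ours):

    import numpy as np

    def perceptron_train(X, t, eta=0.1, epochs=50):
        """Perceptron rule: w_i <- w_i + eta * (t - o) * x_i."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x, target in zip(X, t):
                o = 1 if np.dot(w, x) > 0 else -1  # thresholded output
                w += eta * (target - o) * x        # zero update when correct
        return w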
10 Gradient descent / Delta rule
- For training examples that are not linearly separable
- Uses an unthresholded linear unit
- The output o_d is a linear function of the weight vector w
11 Hypothesis space
12 Gradient descent
- The gradient of E points in the direction of steepest increase in E, so each step moves the weights in the opposite direction (a sketch follows)
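A minimal sketch of one batch gradient-descent step for an unthresholded linear unit, assuming the squared-error objective E(w) = 1/2 * sum_d (t_d - o_d)^2 (names are ours):

    import numpy as np

    def gradient_descent_step(w, X, t, eta=0.05):
        """One batch step: move w against the gradient of E."""
        o = X @ w                  # linear outputs, one per training example
        grad = -(X.T @ (t - o))    # dE/dw_i = -sum_d (t_d - o_d) * x_id
        return w - eta * grad      # step in the direction of steepest decrease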
14 Gradient descent (cont'd)
- Even when the training examples are not linearly separable, gradient descent converges toward the weight vector with globally minimum error (given a sufficiently small learning rate)
- If the learning rate is too large, overstepping the minimum can occur -> gradually reduce the learning rate as the search proceeds
15 Stochastic approximation to gradient descent
- Conditions for applying gradient descent
- the hypothesis space is continuously parameterized
- the error can be differentiated with respect to the hypothesis parameters
- Practical difficulties of gradient descent
- convergence can be quite slow
- with multiple local minima, finding the global minimum is not guaranteed
16 Stochastic approximation to gradient descent (cont'd)
- Update the weights incrementally, following the error E_d of each individual training example
- Approximates true gradient descent
- arbitrarily closely, for a sufficiently small learning rate
- Can sometimes avoid falling into local minima, since it follows the varying per-example gradients
- Delta rule: Delta w_i = eta (t - o) x_i (a sketch follows)
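A sketch of the incremental (stochastic) version, updating after each example with the delta rule (names are ours):

    import numpy as np

    def delta_rule_epoch(w, X, t, eta=0.01):
        """One pass of the stochastic delta rule: w_i += eta*(t_d - o_d)*x_id."""
        for x_d, t_d in zip(X, t):
            o_d = np.dot(w, x_d)             # unthresholded linear output
            w = w + eta * (t_d - o_d) * x_d  # follow the per-example gradient
        return w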
17 Remark
- Perceptron rule
- thresholded output
- converges to weights that classify the training data perfectly
- requires linearly separable training examples
- Delta rule
- unthresholded output
- converges only asymptotically, toward the minimum-error weights
- works even when the examples are not linearly separable
18 Multilayer networks
- Nonlinear decision surface
19 Differentiable threshold unit
- Sigmoid function
- nonlinear, differentiable (sketched below)
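A small sketch of the sigmoid and its derivative; the convenient identity sigma' = sigma * (1 - sigma) is what backpropagation exploits:

    import numpy as np

    def sigmoid(y):
        """Squashes any real input into the interval (0, 1)."""
        return 1.0 / (1.0 + np.exp(-y))

    def sigmoid_prime(y):
        """Derivative expressed through the function's own output."""
        s = sigmoid(y)
        return s * (1.0 - s)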
20 (Figure: a layered network with input units i, hidden units j(h), and output units k; inputs x1..x3, weights w, net inputs net1..net3, and outputs o1..o3.)
22 The BACKPROPAGATION algorithm
23 The BACKPROPAGATION algorithm (cont'd)
- Error surface can have multiple local minima
- Termination conditions
- a fixed number of iterations
- training error below a threshold
- error on a separate validation set
24 The BACKPROPAGATION algorithm (cont'd)
- Adding momentum
- the weight update depends in part on the update made in the previous iteration of the loop (see the rule below)
- Learning in arbitrary acyclic networks
- the error term for an internal unit r sums over the units in downstream(r)
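The momentum update referred to above, in the usual notation (eta: learning rate, alpha: momentum constant):

    % nth weight update keeps a fraction alpha of the (n-1)th update:
    \Delta w_{ji}(n) = \eta \,\delta_j\, x_{ji} + \alpha \,\Delta w_{ji}(n-1),
    \qquad 0 \le \alpha < 1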
25 BACKPROPAGATION rule
26 BACKPROPAGATION rule (cont'd)
- Training rule for output units
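The resulting rule for a sigmoid output unit k, with t_k the target and o_k the output:

    % Error term and weight update for output unit k:
    \delta_k = o_k (1 - o_k)(t_k - o_k), \qquad
    \Delta w_{kj} = \eta \,\delta_k\, x_{kj}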
27 (Figure: the same layered network as slide 20, repeated for the derivation of the rule.)
28 BACKPROPAGATION rule (cont'd)
- Training rule for hidden units
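The hidden-unit rule backpropagates the output error terms through the weights; below is the formula plus a minimal two-layer sketch in NumPy (biases omitted, names ours):

    % Error term for hidden unit h, using the deltas of the units it feeds:
    \delta_h = o_h (1 - o_h) \sum_{k \in \mathrm{downstream}(h)} w_{kh}\, \delta_k

    import numpy as np

    def backprop_step(x, t, W_hidden, W_out, eta=0.05):
        """One stochastic-gradient step for a two-layer sigmoid network."""
        sigmoid = lambda y: 1.0 / (1.0 + np.exp(-y))

        # Forward pass.
        h = sigmoid(W_hidden @ x)   # hidden outputs o_h
        o = sigmoid(W_out @ h)      # network outputs o_k

        # Backward pass: error terms for output and hidden units.
        delta_out = o * (1 - o) * (t - o)
        delta_hidden = h * (1 - h) * (W_out.T @ delta_out)

        # Weight updates: Delta w_ji = eta * delta_j * x_ji.
        W_out += eta * np.outer(delta_out, h)
        W_hidden += eta * np.outer(delta_hidden, x)
        return W_hidden, W_out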
30 Convergence and local minima
- Gradient descent guarantees only a local minimum
- In practice this problem is not severe
- the algorithm is highly effective
- the more weights, the less severe the local-minima problem
- when the weights are initialized near zero, the network first represents a smooth, nearly linear function and adds nonlinearity only as training proceeds
- Heuristics
- momentum, stochastic gradient descent, training multiple networks from different initial weights
31 Representational power of feedforward networks
- Boolean functions
- with two layers
- disjunctive normal form
- but the number of hidden units required may grow exponentially
- Continuous functions (bounded)
- with two layers
- Arbitrary functions
- with three layers
- as a linear combination of many small, localized functions
32 Hypothesis space search and inductive bias
- Hypothesis space search
- continuous, in contrast to the discrete hypothesis spaces of methods seen earlier
- Inductive bias
- difficult to characterize precisely
- roughly: smooth interpolation between data points
33 Hidden layer representations
- Training examples constrain only the network inputs and outputs, so backpropagation is free to discover whatever hidden-layer representation best fits the data.
- It can invent features that are not explicit in the input representation but that capture the properties of the input most relevant to learning the target function.
38 Generalization, overfitting, stopping criterion
- Terminating conditions
- stopping when the training error falls below a threshold risks overfitting
- what matters is generalization accuracy
- Techniques
- weight decay
- a held-out validation set (a stopping sketch follows this list)
- cross-validation approach
- k-fold cross-validation
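A skeleton of validation-set early stopping as described above; `net`, `train_step`, and `val_error` are hypothetical placeholders for the caller's network object, one-epoch trainer, and validation-error function:

    import copy

    def train_with_early_stopping(net, train_step, val_error,
                                  max_epochs=1000, patience=20):
        """Keep the weights with the lowest validation error seen so far;
        stop once validation error has not improved for `patience` epochs."""
        best_err, best_net, since_best = float("inf"), copy.deepcopy(net), 0
        for _ in range(max_epochs):
            train_step(net)
            err = val_error(net)
            if err < best_err:
                best_err, best_net, since_best = err, copy.deepcopy(net), 0
            else:
                since_best += 1
                if since_best >= patience:
                    break
        return best_net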
40 Face recognition
41 Face recognition (cont'd)
- Input image: 120x128 pixels -> coarsened to 30x32
- reduces computational demands
- each coarse pixel is a mean of pixel intensities (cf. subsampling in ALVINN)
- 1-of-n output encoding
- provides many weights (degrees of freedom) for representing the target function
- the difference between the two largest outputs gives a measure of confidence
- target values <0.9, 0.1, 0.1, 0.1> rather than <1, 0, 0, 0>, since sigmoid outputs cannot reach 0 or 1 (an encoding sketch follows)
- 2 layers, 3 hidden units -> 90% success
- the learned hidden units can be inspected
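A one-liner showing the 0.9/0.1 target encoding mentioned above (function name is ours):

    def one_of_n_targets(label, n=4, hi=0.9, lo=0.1):
        """1-of-n targets with 0.9/0.1 because a sigmoid never reaches 0 or 1."""
        return [hi if i == label else lo for i in range(n)]

    one_of_n_targets(0)   # -> [0.9, 0.1, 0.1, 0.1]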
42 Alternative error functions
- Changing the error function changes the weight-tuning rule derived from it
- Penalty term for weight magnitude (see the formula after this list)
- reduces the risk of overfitting
- Including the derivatives of the target function
- Minimizing cross entropy
- for learning probabilistic functions
- Weight sharing
- e.g., in speech recognition
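The penalized error function referred to in the list, in the usual form (gamma controls the strength of the weight-magnitude penalty):

    % Squared error plus a weight-decay penalty:
    E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in \mathrm{outputs}}
      (t_{kd} - o_{kd})^2 \;+\; \gamma \sum_{i,j} w_{ji}^2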
44 Alternative error minimization procedures
- Line search
- direction: the same as in backpropagation (the negative gradient)
- distance: the minimum of the error function along this line
- the resulting steps can be very large or very small
- Conjugate gradient
- choose each new direction so that the component of the error gradient just made zero remains zero
45 Recurrent networks
46 Dynamically modifying network structure
- Motivation: improve generalization accuracy and training efficiency
- Growing: begin with no hidden units, then add structure as needed
- CASCADE-CORRELATION
- short training times, but a risk of overfitting
- Pruning: begin with a complex network, then remove unneeded parts
- optimal brain damage
- removes the weights to which the error is least sensitive