Title: Artificial Neural Networks
1. Artificial Neural Networks
- Learning real-valued, discrete-valued, and vector-valued functions from examples
- Robust to errors in the training data
- Applications: interpreting visual scenes, speech recognition, robot control strategies
2. Biological Motivation
- Human brain: about 10^11 neurons
- Each connected to about 10^4 others
- Neuron switching time: about 10^-3 seconds
- (Computer switching speed: about 10^-10 seconds)
- Yet only about 10^-1 seconds are needed to recognize a human face
- → highly parallel and distributed processing
3. Biological Motivation
- The ANN model is not the same as that of biological neural systems
- Using ANNs to study and model biological learning processes
- Obtaining highly effective machine learning algorithms
4. ANN Representation
(Figure: an example artificial neural network — omitted)
5. Appropriate Problems for ANNs
- Instances are represented by many attribute-value pairs
- The target function output may be discrete-valued, real-valued, or vector-valued
- Training examples can contain errors
- Long training times are acceptable
- Fast evaluation of the learned target function may be required
- Understanding the learned target concept is not important
6. Perceptrons
(Diagram: a perceptron unit with inputs x1, ..., xn, weights w1, ..., wn, and a bias input x0 = 1 with weight w0)
- net = Σi wi xi
- o = 1 if Σi wi xi > 0, −1 otherwise
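Below is a minimal NumPy sketch of the thresholded unit defined above; the example weights implementing Boolean AND are illustrative choices, not taken from the slides.

    import numpy as np

    def perceptron_output(w, x):
        """Thresholded perceptron output: o = 1 if sum_i w_i * x_i > 0, else -1."""
        x = np.concatenate(([1.0], np.asarray(x, dtype=float)))  # prepend the bias input x0 = 1
        return 1 if np.dot(w, x) > 0 else -1

    # Illustrative weights: w0 = -0.8, w1 = w2 = 0.5 implement Boolean AND
    # over two inputs encoded as -1 (false) / +1 (true).
    w = np.array([-0.8, 0.5, 0.5])
    print(perceptron_output(w, [1, 1]))    # 1
    print(perceptron_output(w, [1, -1]))   # -1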
7. Perceptrons
(Diagram: perceptron units representing a Boolean function of two inputs, e.g., A ∧ ¬B)
8. Perceptron Training Rule
- wi ← wi + Δwi
- Δwi = η (t − o) xi
- t: target output of the current training example
- o: the thresholded output generated by the perceptron
- η: learning rate (positive constant)
9. Perceptron Training Rule
wi ← wi + Δwi,   Δwi = η (t − o) xi
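A short Python sketch of this training rule, assuming targets in {−1, +1} and a fixed number of passes over the data (the function name, learning rate, and example data are illustrative):

    import numpy as np

    def train_perceptron(examples, eta=0.1, epochs=50):
        """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

        examples: list of (x, t) pairs with t in {-1, +1}; x excludes the bias input.
        Returns the learned weight vector (w0 is the bias weight).
        """
        n = len(examples[0][0])
        w = np.zeros(n + 1)                        # w0, w1, ..., wn
        for _ in range(epochs):
            for x, t in examples:
                x = np.concatenate(([1.0], x))     # x0 = 1
                o = 1 if np.dot(w, x) > 0 else -1  # thresholded output
                w += eta * (t - o) * x             # update only when o != t
        return w

    # Example: learning the OR function over inputs in {-1, +1} (illustrative data)
    data = [(np.array([-1., -1.]), -1), (np.array([-1., 1.]), 1),
            (np.array([ 1., -1.]),  1), (np.array([ 1., 1.]), 1)]
    print(train_perceptron(data))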
10. Perceptron Training Rule
(Figure: two sets of training examples plotted in the (x1, x2) plane — one set is linearly separable, the other is not linearly separable)
11Perceptron Training Rule
- The learning procedure converges to a weight
vector that correctly classifies all linearly
separable training examples -
12. Perceptron Training Rule
- Minsky, M. & Papert, S. (1969). Perceptrons. MIT Press.
13. Gradient Descent Rule
(Diagram: a linear unit with inputs x1, ..., xn, bias input x0 = 1, and weights w0, ..., wn; gradient descent is applied to the unthresholded output)
- o = Σi wi xi  (the unthresholded output w · x)
14. Gradient Descent Rule
- Training error
- E(w) = (1/2) Σd∈D (td − od)²
- td: target output of training example d
- od: the unthresholded output for d (= w · x)
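As a small illustration, this training error can be computed directly for a linear unit; the helper name and bias handling below are assumptions, not the slides' notation:

    import numpy as np

    def training_error(w, X, t):
        """E(w) = 1/2 * sum_{d in D} (t_d - o_d)^2 for a linear unit o_d = w . x_d.

        X: (D, n) matrix of training inputs (bias column added here), t: (D,) targets.
        """
        X = np.hstack([np.ones((X.shape[0], 1)), X])   # x0 = 1
        o = X @ w                                      # unthresholded outputs
        return 0.5 * np.sum((t - o) ** 2)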
15. Gradient Descent Rule
16. Gradient Descent Rule
- Gradient of E (the direction of steepest increase)
- ∇E(w) = [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]
- w ← w + Δw
- w ← w − η ∇E(w)
17. Gradient Descent Rule
- wi ← wi + Δwi
- Δwi = −η ∂E/∂wi
- ∂E/∂wi = −Σd∈D (td − od) xid
- Δwi = η Σd∈D (td − od) xid
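A sketch of the resulting batch gradient-descent procedure for a linear unit, assuming a small fixed learning rate and a fixed number of epochs (names and defaults are illustrative):

    import numpy as np

    def gradient_descent(X, t, eta=0.05, epochs=500):
        """Batch gradient descent for a linear unit o = w . x.

        Minimizes E(w) = 1/2 * sum_d (t_d - o_d)^2 using
        delta w_i = eta * sum_d (t_d - o_d) * x_id.
        X: (D, n) matrix of training inputs (bias column added here), t: (D,) targets.
        eta must be small enough for the updates to converge.
        """
        X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            o = X @ w                    # unthresholded outputs for all examples
            w += eta * X.T @ (t - o)     # one batch update over the whole training set
        return w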
18Gradient Descent Rule
- Converging to a local minimum can be quite slow
- No guarantee to converge to the global minimum
-
19. Stochastic Approximation
- Delta rule
- Δwi = η (td − od) xid
- Ed(w) = (td − od)² / 2
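A sketch of this incremental (stochastic) version, which applies the delta rule after each training example rather than after a full pass over the data (names and defaults are illustrative):

    import numpy as np

    def stochastic_delta_rule(X, t, eta=0.05, epochs=200):
        """Incremental (stochastic) gradient descent for a linear unit.

        The weights are updated after each training example d:
            delta w_i = eta * (t_d - o_d) * x_id,  with E_d(w) = (t_d - o_d)^2 / 2.
        """
        X = np.hstack([np.ones((X.shape[0], 1)), X])   # x0 = 1
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_d, t_d in zip(X, t):
                o_d = np.dot(w, x_d)                   # unthresholded output for example d
                w += eta * (t_d - o_d) * x_d           # per-example weight update
        return w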
20Stochastic Approximation
- Weights are updated upon examining each training
example - Less computation per weight update step is
required - Falling into local minima can be avoided
-
21Stochastic Approximation
- The delta rule converges towards a best-fit
approximation to the target concept, regardless
of whether the training data are linearly
separable -
22. Multilayer Networks
- Single perceptrons can express only linear decision surfaces
- A multilayer network can represent highly nonlinear decision surfaces
23. Multilayer Networks
(Figure: decision regions of a multilayer network trained to distinguish vowel sounds such as "head", "hid", "who'd", "hood" from two input features F1 and F2)
24. Multilayer Networks
25Multilayer Networks
- What type of unit ?
- Perceptrons non-differentiable
- Linear units only linear functions
- ....
-
26. Multilayer Networks
(Diagram: a sigmoid unit with inputs x1, ..., xn, bias input x0 = 1, and weights w0, ..., wn)
- net = Σi wi xi
- o = σ(net) = 1 / (1 + e^(−net))
27. Multilayer Networks
- Sigmoid unit
- ∂σ(y)/∂y = σ(y) · (1 − σ(y))
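The sigmoid and its derivative in Python, as used by backpropagation below (a sketch; function names are illustrative):

    import numpy as np

    def sigmoid(net):
        """Sigmoid squashing function: sigma(net) = 1 / (1 + e^(-net))."""
        return 1.0 / (1.0 + np.exp(-net))

    def sigmoid_derivative(net):
        """d sigma / d net = sigma(net) * (1 - sigma(net))."""
        s = sigmoid(net)
        return s * (1.0 - s)

    # Expressing the derivative in terms of the unit's own output keeps backpropagation cheap:
    # once o = sigmoid(net) is known, the slope is simply o * (1 - o).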
28. Backpropagation Algorithm
- Training error
- E(w) = (1/2) Σd∈D Σk∈outputs (tkd − okd)²
29. Backpropagation Algorithm
(Diagram: input unit i, hidden unit h with output oh, output unit k with output ok; weights whi into the hidden layer and wkh into the output layer)
- δk = ok (1 − ok)(tk − ok)  for each output unit k
- δh = oh (1 − oh) Σk wkh δk  for each hidden unit h
- wji ← wji + η δj xji
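A sketch of one stochastic backpropagation step for a two-layer network of sigmoid units, following the δ and weight-update equations above; the weight-matrix layout and names are assumptions for illustration, not the slides' notation:

    import numpy as np

    def backprop_step(x, t, W_hid, W_out, eta=0.3):
        """One stochastic-gradient backpropagation step for a two-layer sigmoid network.

        x: input vector (without bias), t: target output vector in (0, 1),
        W_hid: (n_hidden, n_in + 1) hidden-layer weights,
        W_out: (n_out, n_hidden + 1) output-layer weights.
        """
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

        # Forward pass
        x = np.concatenate(([1.0], x))             # x0 = 1
        o_h = sigmoid(W_hid @ x)                   # hidden-unit outputs
        h = np.concatenate(([1.0], o_h))           # bias input for the output layer
        o_k = sigmoid(W_out @ h)                   # output-unit outputs

        # Error terms
        delta_k = o_k * (1 - o_k) * (t - o_k)                     # output units
        delta_h = o_h * (1 - o_h) * (W_out[:, 1:].T @ delta_k)    # hidden units

        # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji
        W_out += eta * np.outer(delta_k, h)
        W_hid += eta * np.outer(delta_h, x)
        return W_hid, W_out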
30Backpropagation Algorithm
xji
j
i
wji
?j
wji ? wji ??jxji
31Backpropagation Algorithm
- Adding momentum
- ?wji(n) ??jxji a?wji(n - 1)
- iteration momentum
- Keeping the search direction ? passing small
local minima - Increasing the search step size ? speeding
convergence -
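A minimal sketch of the momentum update, assuming the caller keeps the previous weight change between iterations (names are illustrative):

    import numpy as np

    def momentum_update(W, grad_step, prev_delta, alpha=0.9):
        """Weight update with momentum: Delta w(n) = grad_step + alpha * Delta w(n-1).

        grad_step: the plain backpropagation step (eta * delta_j * x_ji) for this iteration.
        prev_delta: the weight change Delta w(n-1) applied on the previous iteration.
        alpha: momentum constant, 0 <= alpha < 1.
        Returns the updated weights and Delta w(n), to be passed in on the next call.
        """
        delta = grad_step + alpha * prev_delta
        return W + delta, delta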
32. Backpropagation Algorithm
- Learning in arbitrary acyclic networks
(Diagram: unit r in layer m feeds units s in layer m+1 through weights wsr)
- δr = or (1 − or) Σs∈layer m+1 wsr δs
33Backpropagation Algorithm
- Convergence and local minima
- Not guaranteed to converge towards the global
minimum error, but highly effective in practice - Approximately linear when the weights are close
to 0, hence passing local minima of non-linear
functions -
34Backpropagation Algorithm
- Heuristics to alleviate the local minima problem
- Add a momentum term to the weight-update rule
- Use stochastic gradient descent rather than true
gradient descent - Train multiple networks using the same data, but
initializing each network with different random
weights
35Backpropagation Algorithm
- Representation power of feedforward networks
- Boolean functions any one, using 2-layer (1
hidden 1 output) networks - Continuous functions any bounded one with
approximation, using 2-layer networks - Arbitrary functions any one with approximation,
using 3-layer networks
36Backpropagation Algorithm
- Hypothesis space and inductive bias
- Hypothesis every possible assignment of network
weights - Inductive bias smooth interpolation between data
points
37. Backpropagation Algorithm
- Hidden layer representations
(Figure: hidden-layer encoding learned for the identity function)
38. Backpropagation Algorithm
39. Backpropagation Algorithm
- Stopping criterion and overfitting
- A fixed number of iterations
- The training error falling below some threshold
40. Backpropagation Algorithm
41. Applications
- Recognizing face pose
- 30 × 32 resolution input images
- 4 pose directions: left, straight, right, up
- → a 960 × 3 × 4 network (see the sketch below)
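A minimal NumPy sketch of such a 960 × 3 × 4 network; the random weight-initialization range and the pose-index ordering are assumptions for illustration:

    import numpy as np

    # 30 x 32 = 960 grey-level inputs, 3 hidden sigmoid units, 4 outputs
    # (one per pose: left, straight, right, up). Weight values are illustrative.
    rng = np.random.default_rng(0)
    W_hid = rng.uniform(-0.05, 0.05, size=(3, 960 + 1))   # hidden weights (+1 for bias)
    W_out = rng.uniform(-0.05, 0.05, size=(4, 3 + 1))     # output weights (+1 for bias)

    def classify(image):
        """image: flattened 30x32 array scaled to [0, 1]; returns the predicted pose index."""
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        h = sigmoid(W_hid @ np.concatenate(([1.0], image)))
        o = sigmoid(W_out @ np.concatenate(([1.0], h)))
        return int(np.argmax(o))            # assumed ordering: 0=left, 1=straight, 2=right, 3=up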
42. Exercises
- In Mitchell's Machine Learning (Chapter 4): Exercises 4.1 to 4.10