Title: Artificial Neural Networks
1. Artificial Neural Networks
2. Outline
- Biological Motivation
- Perceptron
- Gradient Descent
- Least Mean Square Error
- Multi-layer networks
- Sigmoid node
- Backpropagation
3. Biological Neural Systems
- Neuron switching time: > 10^-3 seconds
- Number of neurons in the human brain: ~10^10
- Connections (synapses) per neuron: 10^4 to 10^5
- Face recognition: ~0.1 seconds
- High degree of parallel computation
- Distributed representations
4. Artificial Neural Networks
- Many simple neuron-like threshold units
- Many weighted interconnections
- Multiple outputs
- Highly parallel and distributed processing
- Learning by tuning the connection weights
5. Perceptron: Linear Threshold Unit
[Figure: perceptron — inputs x_1, ..., x_n plus a bias input x_0 = 1, with weights w_0, w_1, ..., w_n, feed a summation node Σ_{i=0..n} w_i x_i followed by a threshold that produces the output o]

o(x) = 1 if Σ_{i=0..n} w_i x_i > 0
       -1 otherwise
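A one-function sketch of the linear threshold unit above; numpy, the function name, and the example weights are illustrative choices, not part of the slides.

```python
import numpy as np

def perceptron_output(w, x):
    """Linear threshold unit: o(x) = 1 if sum_i w_i * x_i > 0, else -1.
    w includes the bias weight w_0; a bias input x_0 = 1 is prepended."""
    x = np.append(1.0, x)          # x_0 = 1
    return 1 if w @ x > 0 else -1

# Example with illustrative weight values
print(perceptron_output(np.array([-0.5, 1.0, 1.0]), np.array([1.0, 0.0])))  # -> 1
```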
6. Decision Surface of a Perceptron
[Figure: positive and negative examples in the (x_1, x_2) plane; the decision surface of a perceptron is a line (a hyperplane in general)]
- The perceptron can separate the classes exactly only when the data are linearly separable
- Theorem: VC-dim = n + 1
7. Perceptron Learning Rule
- S: training sample; x = (x_1, ..., x_n): input vector; t = c(x): target value; o: perceptron output; η: learning rate (a small constant; assume η = 1)
- Update rule: w_i ← w_i + Δw_i, where Δw_i = η (t - o) x_i
8. Perceptron Algorithm
- Correct output (t = o):
  - weights are unchanged
- Incorrect output (t ≠ o):
  - change the weights!
  - False negative (t = 1 and o = -1): add x to w
  - False positive (t = -1 and o = 1): subtract x from w
- A runnable sketch of this update loop is given below.
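Below is a minimal Python sketch of the perceptron learning rule from slides 7–8. The dataset, the function name, and the stopping condition (a fixed number of epochs, or an error-free pass) are illustrative assumptions and not part of the original slides.

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Perceptron learning rule: w_i <- w_i + eta * (t - o) * x_i.

    X: (m, n) array of inputs; t: (m,) array of targets in {-1, +1}.
    A bias input x_0 = 1 is prepended to every example."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])      # add x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1                # linear threshold unit
            if o != target:                           # incorrect output: change weights
                w += eta * (target - o) * x           # adds x (t=+1) or subtracts x (t=-1), scaled by 2*eta
                errors += 1
        if errors == 0:                               # converged (happens only if separable)
            break
    return w

# Example: a linearly separable AND-like problem (illustrative data)
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([1, -1, -1, -1])
print(perceptron_train(X, t))
```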
9. Perceptron Learning Rule
10. Perceptron Algorithm Analysis
- Theorem: the number of errors made by the Perceptron Algorithm is bounded
- Proof:
  - Make all examples positive: change <x_i, b_i> to <b_i x_i, 1>
  - Consider the margin of a separating hyperplane w*: γ = min_i (w* · x_i) / ||w*||
11. Perceptron Algorithm Analysis II
- Let m_i be the number of errors made on x_i
- M = Σ_i m_i (the total number of errors)
- From the algorithm: w = Σ_i m_i x_i
- Let w* be a separating hyperplane
12. Perceptron Algorithm Analysis III
- Change in weights: since w errs on x_i, we have w · x_i < 0, and the update adds x_i to w, so each error increases w* · w by w* · x_i ≥ γ ||w*||
- Total weight: ||w + x_i||^2 = ||w||^2 + 2 w · x_i + ||x_i||^2 ≤ ||w||^2 + R^2 (where R = max_i ||x_i||), so after M errors ||w||^2 ≤ M R^2
13. Perceptron Algorithm Analysis IV
- Consider the angle between w and w*: cos(angle) = (w* · w) / (||w*|| ||w||)
- Putting it all together yields the mistake bound M ≤ (R/γ)^2 (see the worked derivation below)
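The argument on slides 10–13 is the standard perceptron mistake bound; the derivation below writes it out in LaTeX under the definitions above (γ is the margin of w*, R = max_i ||x_i||). The exact constants on the original slides may differ slightly.

```latex
\[
  \text{On each error on } x_i:\qquad w \leftarrow w + x_i, \qquad w \cdot x_i < 0 .
\]
\[
  w^{*} \cdot w \;=\; \sum_i m_i \,(w^{*} \cdot x_i) \;\ge\; M\,\gamma\,\|w^{*}\| ,
  \qquad\qquad
  \|w\|^{2} \;\le\; M R^{2}, \quad R = \max_i \|x_i\| .
\]
\[
  1 \;\ge\; \cos(w^{*}, w)
  \;=\; \frac{w^{*} \cdot w}{\|w^{*}\|\,\|w\|}
  \;\ge\; \frac{M\gamma}{\sqrt{M}\,R}
  \;=\; \frac{\sqrt{M}\,\gamma}{R}
  \qquad\Longrightarrow\qquad
  M \;\le\; \Bigl(\frac{R}{\gamma}\Bigr)^{2} .
\]
```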
14. Gradient Descent Learning Rule
- Consider a linear unit without a threshold and with continuous output o (not just ±1):
  o = w_0 + w_1 x_1 + ... + w_n x_n
- Train the w_i so that they minimize the squared error:
  E[w_0, ..., w_n] = ½ Σ_{d∈S} (t_d - o_d)^2
- where S is the set of training examples
15. Gradient Descent
S = {((1,1), 1), ((-1,-1), 1), ((1,-1), -1), ((-1,1), -1)}

Δw = -η ∇E[w]
Δw_i = -η ∂E/∂w_i
∂E/∂w_i = ∂/∂w_i ½ Σ_d (t_d - o_d)^2
        = ∂/∂w_i ½ Σ_d (t_d - Σ_i w_i x_{i,d})^2
        = Σ_d (t_d - o_d)(-x_{i,d})
so Δw_i = η Σ_d (t_d - o_d) x_{i,d}
16. Gradient Descent
- Gradient-Descent(S = training_examples, η)
  - Until TERMINATION do:
    - Initialize each Δw_i to zero
    - For each <x, t> in S do:
      - Compute the output o = o(x, w)
      - For each weight w_i do:
        - Δw_i ← Δw_i + η (t - o) x_i
    - For each weight w_i do:
      - w_i ← w_i + Δw_i
- A runnable version of this loop is sketched below.
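A runnable version of the batch gradient-descent loop above for a linear unit o = w · x (with a bias input x_0 = 1). The termination test (a fixed number of epochs), the learning rate, and the random initialization are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, t, eta=0.05, epochs=200, seed=0):
    """Batch gradient descent for a linear unit, minimizing
    E(w) = 1/2 * sum_d (t_d - o_d)^2 with o_d = w . x_d."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])        # bias input x_0 = 1
    w = np.random.default_rng(seed).normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):                              # TERMINATION: fixed number of epochs
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = w @ x                                    # linear (unthresholded) output
            delta_w += eta * (target - o) * x            # accumulate eta * (t - o) * x_i
        w += delta_w                                     # apply the batch update once per pass
    return w

# The four examples from slide 15; they are not linearly separable, so the
# minimum-squared-error weights are close to zero.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
t = np.array([1, 1, -1, -1], dtype=float)
print(gradient_descent(X, t))
```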
17. Incremental (Stochastic) Gradient Descent
- Batch mode gradient descent:
  - w ← w - η ∇E_S[w], computed over the entire data set S
  - E_S[w] = ½ Σ_{d∈S} (t_d - o_d)^2
- Incremental mode (stochastic) gradient descent:
  - w ← w - η ∇E_d[w], computed over individual training examples d
  - E_d[w] = ½ (t_d - o_d)^2
- Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough (see the sketch below)
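For contrast with the batch loop above, a minimal sketch of the incremental mode, where the weights are updated immediately after every example instead of once per pass; the function name and settings are illustrative.

```python
import numpy as np

def stochastic_gradient_descent(X, t, eta=0.05, epochs=200):
    """Incremental mode: update w after each example d, using the
    per-example error E_d(w) = 1/2 * (t_d - o_d)^2."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # bias input x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = w @ x
            w += eta * (target - o) * x            # applied per example, not per batch
    return w
```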
18. Comparison: Perceptron and Gradient Descent Rule
- The perceptron learning rule is guaranteed to succeed if:
  - the training examples are linearly separable
  - no guarantee otherwise
- A linear unit trained with gradient descent:
  - converges to the hypothesis with minimum squared error,
  - given a sufficiently small learning rate η,
  - even when the training data contain noise,
  - even when the training data are not linearly separable
19. Multi-Layer Networks
[Figure: feed-forward network with an input layer, one or more hidden layers, and an output layer]
20. Sigmoid Unit
[Figure: sigmoid unit — inputs x_1, ..., x_n plus a bias input x_0 = 1, with weights w_0, ..., w_n, feed a summation node z = Σ_{i=0..n} w_i x_i, followed by the squashing function o = σ(z) = 1/(1 + e^-z)]

σ(z) = 1/(1 + e^-z) is the sigmoid function.
21. Sigmoid Function
- σ(z) = 1/(1 + e^-z)
- dσ(z)/dz = σ(z) (1 - σ(z))
- Gradient descent rule for a single sigmoid unit:
  - ∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d}
- Multilayer networks of sigmoid units are trained with backpropagation (a small numeric check of σ and its derivative follows below)
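A small numeric check of the sigmoid and the derivative identity on this slide; the finite-difference comparison is an illustrative sanity check, not part of the original slides.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """d sigma / dz = sigma(z) * (1 - sigma(z))"""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare the closed form against a finite-difference approximation
z, eps = 0.3, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert abs(numeric - sigmoid_deriv(z)) < 1e-8
```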
22. Backpropagation: Overview
- Make threshold units differentiable
- Use sigmoid functions
- Given a sample, compute:
- The error
- The Gradient
- Use the chain rule to compute the Gradient
23. Backpropagation: Motivation
- Consider the squared error over all output units:
  E_S[w] = ½ Σ_{d∈S} Σ_{k∈outputs} (t_{d,k} - o_{d,k})^2
- Gradient: ∇E_S[w]
- Update: w ← w - η ∇E_S[w]
- How do we compute the gradient?
24. Backpropagation Algorithm
- Forward phase:
  - given an input x, compute the output of each unit
- Backward phase:
  - for each output unit k, compute its error term δ_k = o_k (1 - o_k) (t_k - o_k)
25. Backpropagation Algorithm (cont.)
- Backward phase (cont.):
  - for each hidden unit h, compute δ_h = o_h (1 - o_h) Σ_k w_{h,k} δ_k
- Update the weights:
  - w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j} and x_{i,j} is the input from node i into node j
- A runnable sketch of the full algorithm is given below.
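A compact, runnable sketch of the two-phase algorithm on slides 24–25 for a network with one hidden layer of sigmoid units. The network size, the XOR data, the random initialization, and the per-example (stochastic) updates are assumptions made to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=2, eta=0.5, epochs=5000, seed=0):
    """Stochastic backpropagation for one hidden layer of sigmoid units."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(n_hidden, X.shape[1] + 1))  # hidden weights (with bias)
    W2 = rng.normal(scale=0.5, size=(T.shape[1], n_hidden + 1))  # output weights (with bias)
    for _ in range(epochs):
        for x, t in zip(X, T):
            x1 = np.append(1.0, x)                    # bias input x_0 = 1
            h = sigmoid(W1 @ x1)                      # forward phase: hidden outputs
            h1 = np.append(1.0, h)
            o = sigmoid(W2 @ h1)                      # forward phase: network outputs
            delta_o = o * (1 - o) * (t - o)           # backward phase: output error terms
            delta_h = h * (1 - h) * (W2[:, 1:].T @ delta_o)  # backward phase: hidden error terms
            W2 += eta * np.outer(delta_o, h1)         # delta_w[i,j] = eta * delta_j * x_[i,j]
            W1 += eta * np.outer(delta_h, x1)
    return W1, W2

# Example: learn XOR, which a single threshold unit cannot represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = backprop_train(X, T)
```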
26. Backpropagation: Output Node
27. Backpropagation: Output Node (cont.)
28. Backpropagation: Inner Node
29. Backpropagation: Inner Node (cont.)
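Slides 26–29 carried the chain-rule derivations as figures. A hedged reconstruction of the standard derivation for sigmoid units with squared error is written out below (notation: z_j denotes the net input Σ_i w_{i,j} x_{i,j} of unit j); the original slides may use slightly different symbols.

```latex
% Output node k:  E = (1/2) \sum_k (t_k - o_k)^2,  o_k = \sigma(z_k)
\[
  \frac{\partial E}{\partial w_{i,k}}
  = \frac{\partial E}{\partial o_k}\,
    \frac{\partial o_k}{\partial z_k}\,
    \frac{\partial z_k}{\partial w_{i,k}}
  = -(t_k - o_k)\, o_k (1 - o_k)\, x_{i,k}
  \quad\Longrightarrow\quad
  \delta_k = o_k (1 - o_k)(t_k - o_k).
\]

% Inner (hidden) node h: its output o_h feeds the downstream units k
\[
  \frac{\partial E}{\partial w_{i,h}}
  = \Bigl(\sum_{k} \frac{\partial E}{\partial z_k}\,
                   \frac{\partial z_k}{\partial o_h}\Bigr)
    \frac{\partial o_h}{\partial z_h}\,
    \frac{\partial z_h}{\partial w_{i,h}}
  = -\Bigl(\sum_{k} \delta_k\, w_{h,k}\Bigr)\, o_h (1 - o_h)\, x_{i,h},
\]
\[
  \text{so}\quad
  \delta_h = o_h (1 - o_h) \sum_{k} w_{h,k}\,\delta_k ,
  \qquad
  \Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}.
\]
```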
30. Backpropagation: Summary
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Finds a local, not necessarily global, error minimum
  - in practice it often works well
  - but requires multiple invocations with different initial weights
- A variation is to include a momentum term (a small sketch follows below):
  Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n-1)
- Minimizes the error over the training examples
- Training is fairly slow, yet prediction is fast
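A minimal sketch of the momentum update as a drop-in change to a single weight-update step; the value of α, the error term δ_j, and all names are illustrative assumptions.

```python
import numpy as np

eta, alpha = 0.5, 0.9              # learning rate and momentum coefficient (assumed values)

def momentum_step(w, prev_dw, delta_j, x):
    """One update with momentum: dw(n) = eta * delta_j * x_i + alpha * dw(n-1)."""
    dw = eta * delta_j * x + alpha * prev_dw
    return w + dw, dw

# One illustrative step for a single unit with three incoming weights
w, prev_dw = np.zeros(3), np.zeros(3)
w, prev_dw = momentum_step(w, prev_dw, delta_j=0.1, x=np.array([1.0, 0.5, -0.2]))
```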
31. Expressive Capabilities of ANNs
- Boolean functions:
  - every Boolean function can be represented by a network with a single hidden layer
  - but this might require a number of hidden units exponential in the number of inputs
- Continuous functions:
  - every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik 1989]
  - any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
32. VC-dim of ANNs
- A more general bound.
- Concept class F(C, G):
  - G: a directed acyclic graph
  - C: a concept class with d = VC-dim(C)
  - n input nodes
  - s inner nodes (each of degree r)
- Theorem: VC-dim(F(C, G)) < 2 d s log(e s)
33. Proof
- Bound the growth function Π_{F(C,G)}(m)
- Find the smallest m such that Π_{F(C,G)}(m) < 2^m; then VC-dim(F(C,G)) < m
- Let S = {x_1, ..., x_m}
- For each fixed G we define a matrix U:
  - U_{i,j} = c_i(x_j), where c_i is the specific concept computed at the i-th inner node
- U describes the computations on S in G
- T_{F(C,G)}: the number of different such matrices
34. Proof (continued)
- Clearly Π_{F(C,G)}(m) ≤ T_{F(C,G)}
- Let G' be G without the root
- Π_{F(C,G)}(m) ≤ T_{F(C,G)} ≤ T_{F(C,G')} · Π_C(m)
- Inductively, Π_{F(C,G)}(m) ≤ (Π_C(m))^s
- Recall the VC (Sauer) bound: Π_C(m) ≤ (em/d)^d
- Combined bound: Π_{F(C,G)}(m) ≤ (em/d)^{ds}
35. Proof (cont.)
- Solve (em/d)^{ds} < 2^m:
  - the inequality holds for every m ≥ 2 d s log(e s), hence VC-dim(F(C,G)) < 2 d s log(e s)
- QED
- Back to ANNs:
  - for linear threshold units, VC-dim(C) = n + 1
  - VC-dim(ANN) ≤ 2 (n + 1) s log(e s) (see the instantiation below)
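A one-line instantiation of the theorem for a network of s linear threshold units over n inputs, using VC-dim(C) = n + 1 from slide 6; written out only to make the last bullet explicit.

```latex
\[
  d = \mathrm{VCdim}(C) = n + 1
  \quad\Longrightarrow\quad
  \mathrm{VCdim}(\mathrm{ANN}) = \mathrm{VCdim}(F(C,G))
  \;\le\; 2\,d\,s\,\log(es)
  \;=\; 2\,(n+1)\,s\,\log(es).
\]
```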