Title: Artificial Neural Networks
1. Artificial Neural Networks
2. Outline
- Biological Motivation
- Perceptron
- Gradient Descent
- Least Mean Square Error
- Multi-layer networks
- Sigmoid node
- Backpropagation
3. Biological Neural Systems
- Neuron switching time: > 10^-3 seconds
- Number of neurons in the human brain: ~10^10
- Connections (synapses) per neuron: 10^4 to 10^5
- Face recognition: ~0.1 seconds
- High degree of parallel computation
- Distributed representations
4. Artificial Neural Networks
- Many simple neuron-like threshold units
- Many weighted interconnections
- Multiple outputs
- Highly parallel and distributed processing
- Learning by tuning the connection weights
5. Perceptron: Linear Threshold Unit
[Figure: perceptron — inputs x_1, ..., x_n plus a bias input x_0 = 1, with weights w_0, w_1, ..., w_n, feed a summation node Σ_{i=0..n} w_i x_i followed by a threshold that produces the output o]

o(x) = 1 if Σ_{i=0..n} w_i x_i > 0
       -1 otherwise
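A one-function sketch of the linear threshold unit above; numpy, the function name, and the example weights are illustrative choices, not part of the slides.

```python
import numpy as np

def perceptron_output(w, x):
    """Linear threshold unit: o(x) = 1 if sum_i w_i * x_i > 0, else -1.
    w includes the bias weight w_0; a bias input x_0 = 1 is prepended."""
    x = np.append(1.0, x)          # x_0 = 1
    return 1 if w @ x > 0 else -1

# Example with illustrative weight values
print(perceptron_output(np.array([-0.5, 1.0, 1.0]), np.array([1.0, 0.0])))  # -> 1
```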
6. Decision Surface of a Perceptron
[Figure: positive and negative examples in the (x_1, x_2) plane; the decision surface of a perceptron is a line (a hyperplane in general)]
- The perceptron can separate the classes exactly only when the data are linearly separable
- Theorem: VC-dim = n + 1
7. Perceptron Learning Rule
- S: training sample; x = (x_1, ..., x_n): input vector; t = c(x): target value; o: perceptron output; η: learning rate (a small constant; assume η = 1)
- Update rule: w_i ← w_i + Δw_i, where Δw_i = η (t - o) x_i
8. Perceptron Algorithm
- Correct output (t = o):
  - weights are unchanged
- Incorrect output (t ≠ o):
  - change the weights!
  - False negative (t = 1 and o = -1): add x to w
  - False positive (t = -1 and o = 1): subtract x from w
- A runnable sketch of this update loop is given below.
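Below is a minimal Python sketch of the perceptron learning rule from slides 7–8. The dataset, the function name, and the stopping condition (a fixed number of epochs, or an error-free pass) are illustrative assumptions and not part of the original slides.

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Perceptron learning rule: w_i <- w_i + eta * (t - o) * x_i.

    X: (m, n) array of inputs; t: (m,) array of targets in {-1, +1}.
    A bias input x_0 = 1 is prepended to every example."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])      # add x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1                # linear threshold unit
            if o != target:                           # incorrect output: change weights
                w += eta * (target - o) * x           # adds x (t=+1) or subtracts x (t=-1), scaled by 2*eta
                errors += 1
        if errors == 0:                               # converged (happens only if separable)
            break
    return w

# Example: a linearly separable AND-like problem (illustrative data)
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([1, -1, -1, -1])
print(perceptron_train(X, t))
```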
9. Perceptron Learning Rule
10. Perceptron Algorithm Analysis
- Theorem: the number of errors made by the Perceptron Algorithm is bounded
- Proof:
  - Make all examples positive: change <x_i, b_i> to <b_i x_i, 1>
  - Consider the margin of a separating hyperplane w*: γ = min_i (w* · x_i) / ||w*||
11. Perceptron Algorithm Analysis II
- Let m_i be the number of errors made on x_i
- M = Σ_i m_i (the total number of errors)
- From the algorithm: w = Σ_i m_i x_i
- Let w* be a separating hyperplane
12. Perceptron Algorithm Analysis III
- Change in weights: since w errs on x_i, we have w · x_i < 0, and the update adds x_i to w, so each error increases w* · w by w* · x_i ≥ γ ||w*||
- Total weight: ||w + x_i||^2 = ||w||^2 + 2 w · x_i + ||x_i||^2 ≤ ||w||^2 + R^2 (where R = max_i ||x_i||), so after M errors ||w||^2 ≤ M R^2
13. Perceptron Algorithm Analysis IV
- Consider the angle between w and w*: cos(angle) = (w* · w) / (||w*|| ||w||)
- Putting it all together yields the mistake bound M ≤ (R/γ)^2 (see the worked derivation below)
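The argument on slides 10–13 is the standard perceptron mistake bound; the derivation below writes it out in LaTeX under the definitions above (γ is the margin of w*, R = max_i ||x_i||). The exact constants on the original slides may differ slightly.

```latex
\[
  \text{On each error on } x_i:\qquad w \leftarrow w + x_i, \qquad w \cdot x_i < 0 .
\]
\[
  w^{*} \cdot w \;=\; \sum_i m_i \,(w^{*} \cdot x_i) \;\ge\; M\,\gamma\,\|w^{*}\| ,
  \qquad\qquad
  \|w\|^{2} \;\le\; M R^{2}, \quad R = \max_i \|x_i\| .
\]
\[
  1 \;\ge\; \cos(w^{*}, w)
  \;=\; \frac{w^{*} \cdot w}{\|w^{*}\|\,\|w\|}
  \;\ge\; \frac{M\gamma}{\sqrt{M}\,R}
  \;=\; \frac{\sqrt{M}\,\gamma}{R}
  \qquad\Longrightarrow\qquad
  M \;\le\; \Bigl(\frac{R}{\gamma}\Bigr)^{2} .
\]
```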
14. Gradient Descent Learning Rule
- Consider a linear unit without a threshold and with continuous output o (not just ±1):
  o = w_0 + w_1 x_1 + ... + w_n x_n
- Train the w_i so that they minimize the squared error:
  E[w_0, ..., w_n] = ½ Σ_{d∈S} (t_d - o_d)^2
- where S is the set of training examples
15. Gradient Descent
S = {((1,1), 1), ((-1,-1), 1), ((1,-1), -1), ((-1,1), -1)}

Δw = -η ∇E[w]
Δw_i = -η ∂E/∂w_i
∂E/∂w_i = ∂/∂w_i ½ Σ_d (t_d - o_d)^2
        = ∂/∂w_i ½ Σ_d (t_d - Σ_i w_i x_{i,d})^2
        = Σ_d (t_d - o_d)(-x_{i,d})
so Δw_i = η Σ_d (t_d - o_d) x_{i,d}
16. Gradient Descent
- Gradient-Descent(S = training_examples, η)
  - Until TERMINATION do:
    - Initialize each Δw_i to zero
    - For each <x, t> in S do:
      - Compute the output o = o(x, w)
      - For each weight w_i do:
        - Δw_i ← Δw_i + η (t - o) x_i
    - For each weight w_i do:
      - w_i ← w_i + Δw_i
- A runnable version of this loop is sketched below.
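A runnable version of the batch gradient-descent loop above for a linear unit o = w · x (with a bias input x_0 = 1). The termination test (a fixed number of epochs), the learning rate, and the random initialization are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, t, eta=0.05, epochs=200, seed=0):
    """Batch gradient descent for a linear unit, minimizing
    E(w) = 1/2 * sum_d (t_d - o_d)^2 with o_d = w . x_d."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])        # bias input x_0 = 1
    w = np.random.default_rng(seed).normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):                              # TERMINATION: fixed number of epochs
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = w @ x                                    # linear (unthresholded) output
            delta_w += eta * (target - o) * x            # accumulate eta * (t - o) * x_i
        w += delta_w                                     # apply the batch update once per pass
    return w

# The four examples from slide 15; they are not linearly separable, so the
# minimum-squared-error weights are close to zero.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
t = np.array([1, 1, -1, -1], dtype=float)
print(gradient_descent(X, t))
```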
17. Incremental (Stochastic) Gradient Descent
- Batch mode gradient descent:
  - w ← w - η ∇E_S[w], computed over the entire data set S
  - E_S[w] = ½ Σ_{d∈S} (t_d - o_d)^2
- Incremental mode (stochastic) gradient descent:
  - w ← w - η ∇E_d[w], computed over individual training examples d
  - E_d[w] = ½ (t_d - o_d)^2
- Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough (see the sketch below)
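For contrast with the batch loop above, a minimal sketch of the incremental mode, where the weights are updated immediately after every example instead of once per pass; the function name and settings are illustrative.

```python
import numpy as np

def stochastic_gradient_descent(X, t, eta=0.05, epochs=200):
    """Incremental mode: update w after each example d, using the
    per-example error E_d(w) = 1/2 * (t_d - o_d)^2."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # bias input x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = w @ x
            w += eta * (target - o) * x            # applied per example, not per batch
    return w
```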
18. Comparison: Perceptron and Gradient Descent Rule
- The perceptron learning rule is guaranteed to succeed if:
  - the training examples are linearly separable
  - no guarantee otherwise
- A linear unit trained with gradient descent:
  - converges to the hypothesis with minimum squared error,
  - given a sufficiently small learning rate η,
  - even when the training data contain noise,
  - even when the training data are not linearly separable
19. Multi-Layer Networks
[Figure: feed-forward network with an input layer, one or more hidden layers, and an output layer]
20. Sigmoid Unit
[Figure: sigmoid unit — inputs x_1, ..., x_n plus a bias input x_0 = 1, with weights w_0, ..., w_n, feed a summation node z = Σ_{i=0..n} w_i x_i, followed by the squashing function o = σ(z) = 1/(1 + e^-z)]

σ(z) = 1/(1 + e^-z) is the sigmoid function.
21. Sigmoid Function
- σ(z) = 1/(1 + e^-z)
- dσ(z)/dz = σ(z) (1 - σ(z))
- Gradient descent rule for a single sigmoid unit:
  - ∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d}
- Multilayer networks of sigmoid units are trained with backpropagation (a small numeric check of σ and its derivative follows below)
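A small numeric check of the sigmoid and the derivative identity on this slide; the finite-difference comparison is an illustrative sanity check, not part of the original slides.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """d sigma / dz = sigma(z) * (1 - sigma(z))"""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare the closed form against a finite-difference approximation
z, eps = 0.3, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert abs(numeric - sigmoid_deriv(z)) < 1e-8
```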
22. Backpropagation: Overview
- Make threshold units differentiable
- Use sigmoid functions
- Given a sample, compute:
- The error
- The Gradient
- Use the chain rule to compute the Gradient
23. Backpropagation: Motivation
- Consider the squared error over all output units:
  E_S[w] = ½ Σ_{d∈S} Σ_{k∈outputs} (t_{d,k} - o_{d,k})^2
- Gradient: ∇E_S[w]
- Update: w ← w - η ∇E_S[w]
- How do we compute the gradient?
24. Backpropagation Algorithm
- Forward phase:
  - given an input x, compute the output of each unit
- Backward phase:
  - for each output unit k, compute its error term δ_k = o_k (1 - o_k) (t_k - o_k)
25. Backpropagation Algorithm (cont.)
- Backward phase (cont.):
  - for each hidden unit h, compute δ_h = o_h (1 - o_h) Σ_k w_{h,k} δ_k
- Update the weights:
  - w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j} and x_{i,j} is the input from node i into node j
- A runnable sketch of the full algorithm is given below.
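A compact, runnable sketch of the two-phase algorithm on slides 24–25 for a network with one hidden layer of sigmoid units. The network size, the XOR data, the random initialization, and the per-example (stochastic) updates are assumptions made to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=2, eta=0.5, epochs=5000, seed=0):
    """Stochastic backpropagation for one hidden layer of sigmoid units."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(n_hidden, X.shape[1] + 1))  # hidden weights (with bias)
    W2 = rng.normal(scale=0.5, size=(T.shape[1], n_hidden + 1))  # output weights (with bias)
    for _ in range(epochs):
        for x, t in zip(X, T):
            x1 = np.append(1.0, x)                    # bias input x_0 = 1
            h = sigmoid(W1 @ x1)                      # forward phase: hidden outputs
            h1 = np.append(1.0, h)
            o = sigmoid(W2 @ h1)                      # forward phase: network outputs
            delta_o = o * (1 - o) * (t - o)           # backward phase: output error terms
            delta_h = h * (1 - h) * (W2[:, 1:].T @ delta_o)  # backward phase: hidden error terms
            W2 += eta * np.outer(delta_o, h1)         # delta_w[i,j] = eta * delta_j * x_[i,j]
            W1 += eta * np.outer(delta_h, x1)
    return W1, W2

# Example: learn XOR, which a single threshold unit cannot represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = backprop_train(X, T)
```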
26. Backpropagation: Output Node
27. Backpropagation: Output Node (cont.)
28. Backpropagation: Inner Node
29. Backpropagation: Inner Node (cont.)
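Slides 26–29 carried the chain-rule derivations as figures. A hedged reconstruction of the standard derivation for sigmoid units with squared error is written out below (notation: z_j denotes the net input Σ_i w_{i,j} x_{i,j} of unit j); the original slides may use slightly different symbols.

```latex
% Output node k:  E = (1/2) \sum_k (t_k - o_k)^2,  o_k = \sigma(z_k)
\[
  \frac{\partial E}{\partial w_{i,k}}
  = \frac{\partial E}{\partial o_k}\,
    \frac{\partial o_k}{\partial z_k}\,
    \frac{\partial z_k}{\partial w_{i,k}}
  = -(t_k - o_k)\, o_k (1 - o_k)\, x_{i,k}
  \quad\Longrightarrow\quad
  \delta_k = o_k (1 - o_k)(t_k - o_k).
\]

% Inner (hidden) node h: its output o_h feeds the downstream units k
\[
  \frac{\partial E}{\partial w_{i,h}}
  = \Bigl(\sum_{k} \frac{\partial E}{\partial z_k}\,
                   \frac{\partial z_k}{\partial o_h}\Bigr)
    \frac{\partial o_h}{\partial z_h}\,
    \frac{\partial z_h}{\partial w_{i,h}}
  = -\Bigl(\sum_{k} \delta_k\, w_{h,k}\Bigr)\, o_h (1 - o_h)\, x_{i,h},
\]
\[
  \text{so}\quad
  \delta_h = o_h (1 - o_h) \sum_{k} w_{h,k}\,\delta_k ,
  \qquad
  \Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}.
\]
```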
30. Backpropagation: Summary
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Finds a local, not necessarily global, error minimum
  - in practice it often works well
  - but requires multiple invocations with different initial weights
- A variation is to include a momentum term (a small sketch follows below):
  Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n-1)
- Minimizes the error over the training examples
- Training is fairly slow, yet prediction is fast
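A minimal sketch of the momentum update as a drop-in change to a single weight-update step; the value of α, the error term δ_j, and all names are illustrative assumptions.

```python
import numpy as np

eta, alpha = 0.5, 0.9              # learning rate and momentum coefficient (assumed values)

def momentum_step(w, prev_dw, delta_j, x):
    """One update with momentum: dw(n) = eta * delta_j * x_i + alpha * dw(n-1)."""
    dw = eta * delta_j * x + alpha * prev_dw
    return w + dw, dw

# One illustrative step for a single unit with three incoming weights
w, prev_dw = np.zeros(3), np.zeros(3)
w, prev_dw = momentum_step(w, prev_dw, delta_j=0.1, x=np.array([1.0, 0.5, -0.2]))
```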
31. Expressive Capabilities of ANNs
- Boolean functions:
  - every Boolean function can be represented by a network with a single hidden layer
  - but this might require a number of hidden units exponential in the number of inputs
- Continuous functions:
  - every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik 1989]
  - any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
32. VC-dim of ANNs
- A more general bound.
- Concept class F(C, G):
  - G: a directed acyclic graph
  - C: a concept class with d = VC-dim(C)
  - n input nodes
  - s inner nodes (each of degree r)
- Theorem: VC-dim(F(C, G)) < 2 d s log(e s)
33. Proof
- Bound the growth function Π_{F(C,G)}(m)
- Find the smallest m such that Π_{F(C,G)}(m) < 2^m; then VC-dim(F(C,G)) < m
- Let S = {x_1, ..., x_m}
- For each fixed G we define a matrix U:
  - U_{i,j} = c_i(x_j), where c_i is the specific concept computed at the i-th inner node
- U describes the computations on S in G
- T_{F(C,G)}: the number of different such matrices
34. Proof (continued)
- Clearly Π_{F(C,G)}(m) ≤ T_{F(C,G)}
- Let G' be G without the root
- Π_{F(C,G)}(m) ≤ T_{F(C,G)} ≤ T_{F(C,G')} · Π_C(m)
- Inductively, Π_{F(C,G)}(m) ≤ (Π_C(m))^s
- Recall the VC (Sauer) bound: Π_C(m) ≤ (em/d)^d
- Combined bound: Π_{F(C,G)}(m) ≤ (em/d)^{ds}
35. Proof (cont.)
- Solve (em/d)^{ds} < 2^m:
  - the inequality holds for every m ≥ 2 d s log(e s), hence VC-dim(F(C,G)) < 2 d s log(e s)
- QED
- Back to ANNs:
  - for linear threshold units, VC-dim(C) = n + 1
  - VC-dim(ANN) ≤ 2 (n + 1) s log(e s) (see the instantiation below)
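A one-line instantiation of the theorem for a network of s linear threshold units over n inputs, using VC-dim(C) = n + 1 from slide 6; written out only to make the last bullet explicit.

```latex
\[
  d = \mathrm{VCdim}(C) = n + 1
  \quad\Longrightarrow\quad
  \mathrm{VCdim}(\mathrm{ANN}) = \mathrm{VCdim}(F(C,G))
  \;\le\; 2\,d\,s\,\log(es)
  \;=\; 2\,(n+1)\,s\,\log(es).
\]
```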