Title: An Introduction to Artificial Neural Networks
1An Introduction to Artificial Neural Networks
- Piotr Golabek, Ph.D.
- Radom Technical University
- Poland
- pgolab_at_pr.radom.net
2An overview of the lecture
- What are ANNs? What are they for?
- Neural networks as inductive machines: the inductive reasoning tradition
- The evolution of the concept: keywords, structures, algorithms
3An overview of the lecture
- Two general tasks: classification and approximation
- The above tasks in a more familiar setting: decision making, signal processing, control systems
- Live presentations
4What are ANNs?
- Don't ask me ...
- An ANN is a set of processing elements (PEs) influencing each other
- (that definition suits almost everything...)
5What are ANNs
- ... but seriously...
- neural: following a biological (neurophysiological) inspiration
- artificial: don't forget these are not real neurons!
- networks: strongly interconnected (in fact massively parallel processing)
- and the implicit meaning: ANNs are learning machines, i.e. they adapt, just as biological neurons do
6Machine learning
- Important field of AI
- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"
- (Take a look at Machine Learning by Tom Mitchell)
7What is ANN?
- In the case of ANNs, the experience is the input data (examples)
- The ANN is an inductive learning machine, i.e. a machine constructing internal, generalized concepts based on the evidence brought by the data stream
- The ANN learns from examples: a paradigm shift
8What is ANN
- Structurally, an ANN is a complex, interconnected structure composed of simple processing elements, often mimicking biological neurons
- Functionally, an ANN is an inductive learning machine: it is able to undergo an adaptation process (learning) driven by examples
9What are ANNs used for?
- Recognition of images, OCR
- Recognition of time-signal signatures: vibration diagnostics, sonar signal interpretation, detection of intrusion patterns in various transaction systems
- Trend prediction, esp. in financial markets (bond rating prediction)
- Decision support, e.g. in credit assessment, medical diagnosis
- Industrial process control, e.g. the melting parameters in metallurgical processes
- Adaptive signal filtering to restore the information from a corrupted source
10Inductive process
- Concepts rooted in epistemology (episteme = knowledge)
- Heraclitus: "Nature likes to hide"
- Observations vs the true nature of the phenomenon
- The empirical (experimental) method of developing a model (hypothesis) of the true phenomenon: the inductive process
- Something like this goes on during ANN learning
11ANN as inductive learning machine
- The theory: the way the ANN behaves
- The experimental data: the examples the ANN learns from
- New examples cause the ANN to change its behaviour, in order to fit better to the evidence brought by the examples
12Inductive process
- Inductive bias: the initial theory (a priori knowledge)
- Variance: the evidence brought by the data
- A strong bias prevents the data from affecting the theory
- A weak bias makes the theory vulnerable to corruption in the data
- The game is to set the bias-variance balance properly
13ANN as inductive learning machines
- We can shape the inductive bias of the learning process, e.g. by tuning the number of neurons
- The more neurons, the more flexible the network (the more sensitive to the data)
14Inductive vs deductive reasoning
- Reasoning: premises → conclusions
- Deductive reasoning: the conclusions are more specific than the premises (we just reason out the consequences)
- Inductive reasoning: the conclusions are more general than the premises (we reason out the general rules governing the phenomenon from specific examples)
15The main goal of inductive reasoning
- The main goal: to achieve good generalization, i.e. to reason out a rule general enough that it fits any future data
- This is also the main goal of machine learning: to use the experience in order to build good enough performance (in every possible future situation)
16McCulloch-Pitts model
Warren McCulloch and Walter Pitts,
"A Logical Calculus of the Ideas Immanent in Nervous Activity", 1943
17McCulloch-Pitts model
- The logical calculus approach
- elementary logical operations: AND, OR, NOT
- the basic reasoning operator: implication
- (given premise p, we draw conclusion q)
18McCulloch-Pitts model
- Logical operators are functions
- Truth tables
x y | x → y      x y | x AND y      x y | x OR y      x | NOT x
0 0 |   1        0 0 |    0         0 0 |    0        0 |   1
0 1 |   1        0 1 |    0         0 1 |    1        1 |   0
1 0 |   0        1 0 |    0         1 0 |    1
1 1 |   1        1 1 |    1         1 1 |    1
19McCulloch-Pitts model
- The working question: can a neuron perform the logical functions AND, OR, NOT?
- If the answer is yes, a chain of implications (reasoning) could be implemented in a neural network
20McCulloch-Pitts model
[Diagram of the neuron model: inputs, weights, summation giving the total excitation, activation threshold, activation function, and the neuron output (activation)]
21McCulloch-Pitts transfer function
22Implementation of AND, OR, NOT
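As an illustration, a minimal MATLAB sketch of a McCulloch-Pitts threshold unit; the particular weight and threshold values below are a common textbook choice assumed here, not taken from the slides:

    % McCulloch-Pitts threshold unit: fires (1) when the weighted sum
    % of its inputs reaches the threshold theta.
    % (The threshold can also be folded into the weights as a bias
    % connected to a constant input of 1, as the next slide mentions.)
    mcp = @(x, w, theta) double(sum(w .* x) >= theta);

    % One possible choice of weights/thresholds (assumed):
    AND = @(x) mcp(x, [1 1],  1.5);   % fires only for x = [1 1]
    OR  = @(x) mcp(x, [1 1],  0.5);   % fires for any nonzero input
    NOT = @(x) mcp(x, -1,    -0.5);   % fires only for x = 0

    AND([1 1])   % -> 1
    OR([0 1])    % -> 1
    NOT(1)       % -> 0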
23Including threshold into weights
24McCulloch-Pitts model
25(vector dot product)
26(vector dot product)
[Figure: mutual orientation of the input vector x and the weight vector w; parallel vectors give maximum similarity, orthogonal vectors maximum dissimilarity (orthogonality), opposite vectors maximum anti-similarity]
27Vector dot product interpretation
- The inputs are called the input vector
- The weights are called the weight vector
- The neuron excites when the input vector is similar enough to the weight vector
- The weight vector is a template for some set of input vectors (see the numeric sketch below)
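A small numeric sketch of this interpretation (the vectors are chosen arbitrarily for illustration):

    % The neuron's excitation is the dot product of input and weight vectors;
    % after normalization it is the cosine of the angle between them.
    w  = [2; 1];           % weight vector (the "template")
    x1 = [4; 2];           % input parallel to w   -> maximum similarity
    x2 = [-1; 2];          % input orthogonal to w -> maximum dissimilarity
    x3 = [-2; -1];         % input opposite to w   -> maximum anti-similarity

    cosang = @(a, b) dot(a, b) / (norm(a) * norm(b));
    [cosang(w, x1), cosang(w, x2), cosang(w, x3)]   % -> [1 0 -1]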
28Neurons elements of the ANNs
- Don't be fooled...
- These are our neurons ...
29Neurons elements of the ANNs
Single neuron (stereoscopic)
30Neurons - the building blocks of neural networks
31The real neuron
Synaptic connection organic structure
32The real neuron
Synaptic connection the molecular level
33McCulloch-Pitts model
- The conclusion
- If we tune the weights of the neuron properly, we can make it implement the transfer function we need (AND, OR, NOT)
- The question
- How are the weights of neurons tuned in our brains? What is the adaptation mechanism?
34Neuron adaptation
- Donald Hebb (1949, neurophysiologist): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
35Hebb rule
36Hebb rule
- It is a local rule of adaptation
- The multiplication of the input and the output signifies a correlation between them
- The rule is unstable: a weight can grow without limits (the plain rule is sketched below)
- (that doesn't happen in nature, where resources are limited)
- Numerous modifications of the Hebb rule have been proposed to make it stable
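A minimal sketch of the plain (unstable) Hebbian update described above; the learning rate and the signals are illustrative assumptions:

    % Plain Hebb rule: the weight change is proportional to the product
    % (correlation) of the pre-synaptic input x and post-synaptic output y.
    eta = 0.1;                       % learning rate (assumed value)
    w   = 0.05;                      % initial weight
    for k = 1:50
        x = 1;                       % pre-synaptic activity
        y = w * x;                   % post-synaptic activity (linear unit)
        w = w + eta * x * y;         % Hebbian update: dw = eta * x * y
    end
    w    % keeps growing - the instability mentioned above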
37Hebb rule
- The Hebb rule is very important and useful ...
- ... but for now we want to make the neuron learn the function we need
38Rosenblatt Perceptron
- Frank Rosenblatt (1958): the Perceptron, a hardware (electromechanical) implementation of an ANN (effectively one neuron).
39Rosenblatt Perceptron
- One of the goals of the experiment was to train the neuron, i.e. to make it go active whenever a specific pattern appears on the retina
- The neuron was to be trained with examples
- The experimenter (teacher) was to expose the neuron to the different patterns and in each case tell it whether it should fire or not
- The learning algorithm should do its best to make the neuron do what the teacher requires
40Perceptron learning rule
- A kind of modification of the Hebbian rule: the weight correction depends on the error between the actual and the desired output (sketched below)
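A sketch of the perceptron rule on a toy problem (learning the AND function; the data, learning rate and initial weights are assumptions chosen for illustration):

    % Perceptron learning rule: w <- w + eta * (d - y) * x
    X = [0 0; 0 1; 1 0; 1 1];        % inputs, one pattern per row
    d = [0; 0; 0; 1];                % desired outputs (AND function)
    w = zeros(2, 1);  b = 0;  eta = 0.5;

    for epoch = 1:20
        for p = 1:size(X, 1)
            x = X(p, :)';
            y = double(w' * x + b > 0);         % threshold unit output
            w = w + eta * (d(p) - y) * x;       % error-driven, Hebb-like update
            b = b + eta * (d(p) - y);           % bias (threshold) update
        end
    end
    double(X * w + b > 0)'           % -> [0 0 0 1], the AND function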
41Supervised scheme
42Supervised scheme
- One training example, the pair <input value, desired output>, is called a training pair
- The set of all the training pairs is called the training set
43Unsupervised scheme
44Example of supervised learning
45Neural networks
- A set of processing elements influencing each other
- The neurons (PEs) are interconnected. The output of each neuron can be connected to the input of every neuron, including itself
46Neural networks
- If there is a path of propagation (direct or indirect) between the output of a neuron and its own input, we have feedback; such structures are called recurrent
- If there is no feedback in a network, the structure is called feedforward
47What does recurrent mean?
- a recursive definition is a definition of a concept that uses the very same concept (but perhaps in a lower-complexity setting)
- a recursive function is a function calling itself
- the classic recursive definition: the factorial function (see the sketch below)
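For example, the factorial written as a small MATLAB function (the file name fact.m is assumed here):

    function f = fact(n)
    % FACT  Recursive factorial: the function is defined in terms of itself,
    % on a smaller argument, until the base case n = 0 is reached.
        if n == 0
            f = 1;               % base case
        else
            f = n * fact(n - 1); % recursive call on a simpler problem
        end
    end

    % fact(5) -> 120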
48Recurrent connection
49Recurrent connection
- At any given moment, the whole history of past excitations influences the neuron output
- The concept of temporal memory emerges
- The past influences the present to a degree determined by the weight of the recurrent connection
- This weight is effectively a forgetting factor
50Feedforward layered network
51Our brain
- There are ca. 10^11 neurons in our brain
- Each of them is connected on average to 1000 other neurons
- There is only about one connection per 10 billion possible ones
- If every neuron were connected to every other one, our brain would have to be a few hundred meters in diameter
- There is a strong modularity
52Our brain
A fragment of the neural network connecting the retina to the visual perception area of the brain
53Our brain vs computers
- Memory size estimation: ca. 10^14 connections give an estimated size of 100 TB (each connection has a continuous, real-valued weight)
- Neurons are quite slow, capable of activating no more than 200 times per second, but there are a lot of them, which gives an estimate of 10^16 floating point operations per second.
54Neural networks vs computer
Neural networks:
- Many (10^11) simple processing elements (neurons)
- Massively parallel, distributed processing
- Memory evenly distributed in the whole structure, content-addressable
- Large fault tolerance
Computers:
- A few complex processing elements
- Sequential, centralized processing
- Compact memory, addressed by an index
- Large fault vulnerability
55How to train the whole network?
- For the Perceptron, the output of the neuron could be compared to the desired value
- But what about a layered structure? How to reach the hidden neurons?
- The original idea comes from the experiments of Widrow and Hoff in the 1960s
- Global error optimization using gradient descent was used
56Supervised scheme once again
57Error minimization
- The error function components can be defined quite elaborately
- But the goal is always to minimize the error
- One widely used technique of function optimization (minimization/maximization) is called gradient descent
58Error function
- One cycle of training consists of the presentation of many training pairs; it is called one epoch of learning
- The error accumulated over the whole epoch is an average (see the formula below)
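Assuming the usual sum-of-squares error (the slide's own notation is not reproduced in this copy), the epoch error can be written as:

    E = \frac{1}{P}\sum_{p=1}^{P} E^{(p)},
    \qquad
    E^{(p)} = \frac{1}{2}\sum_{k}\bigl(d_k^{(p)} - y_k^{(p)}\bigr)^2

where P is the number of training pairs in the epoch, d the desired output and y the actual network output.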
59Why quadratic function?
60Error function once again
- As subsequent input/output pairs are averaged out, we can think of the error function mainly as a function of the weights w
- The goal of learning: to choose the weights in such a way that the error is minimized
61Error function derivative
The derivative tells us whether the function increases or decreases when the argument increases (and how fast).
If the function is falling, the sign of the derivative is negative; since we want to minimize the function value, we then have to increase the argument (the weight w_i).
62The gradient rule
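The formula itself is missing from this copy of the slide; the gradient rule is usually written as (η is the learning rate discussed later):

    \Delta w_i = -\eta\,\frac{\partial E}{\partial w_i}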
63Error function gradient
- In the multidimensional case we deal with a vector of partial derivatives of the error function with respect to each dimension (the gradient)
64Gradient method
The method of moving against the gradient is commonly called hill-climbing
65Gradient method
66Steepest descent demo
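The original demo is not reproduced here; a minimal steepest-descent sketch on a simple quadratic error surface (the surface, starting point and step size are assumptions) might look like this:

    % Steepest descent on E(w) = w1^2 + 10*w2^2 (an illustrative error surface)
    gradE = @(w) [2*w(1); 20*w(2)];      % gradient of E
    w   = [9; 3];                        % starting point
    eta = 0.05;                          % step size (learning rate)
    for k = 1:100
        w = w - eta * gradE(w);          % move against the gradient
    end
    w    % ends up close to the minimum at [0; 0]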
67Other form of activation function
- The so-called sigmoidal function, e.g.
68Other form of activation function
[Plot: the sigmoid for β = 1, β = 100 and β = 0.4]
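A sketch of the sigmoid with the slope parameter β from the plot above, assuming the common form f(z) = 1 / (1 + exp(-β·z)):

    % Sigmoidal activation: beta controls the slope
    % (large beta approaches the threshold function).
    sigm = @(z, beta) 1 ./ (1 + exp(-beta * z));
    z = linspace(-5, 5, 201);
    plot(z, sigm(z, 1), z, sigm(z, 100), z, sigm(z, 0.4));
    legend('\beta = 1', '\beta = 100', '\beta = 0.4');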
69Backpropagation algorithm
70Backpropagation algorithm
71Chain rule
- Applying the chain rule of differentiation makes it possible to transfer the error backward toward the hidden units
72Chain rule
Backward propagation through neuron
73Backpropagation through neuron
74Backpropagation through neuron
75Backpropagation through neuron
76Backpropagation through neuron
77Backpropagation through neuron
78Backpropagation through neuron
79Backpropagation through neuron
80Backpropagation through neuron
81Backpropagation through neuron
- Conclusion: if we know the gradient of the error function with respect to the output of the neuron, we can compute the gradient with respect to each of its weights (see the sketch below)
- In general, our goal is to propagate the error function gradient from the output of the network to the outputs of the hidden units
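A numeric sketch of that conclusion for a single sigmoidal neuron (all values are arbitrary assumptions for illustration):

    % Given dE/dy (the gradient w.r.t. the neuron's output), propagate it
    % back through the activation function and the weights.
    x    = [0.5; -1.0; 0.2];        % inputs to the neuron
    w    = [0.1;  0.4; -0.3];       % its weights
    dEdy = 0.8;                     % gradient arriving from above (assumed)

    z    = w' * x;                  % excitation
    y    = 1 / (1 + exp(-z));       % sigmoid activation
    dEdz = dEdy * y * (1 - y);      % through f'(z) = y*(1-y)
    dEdw = dEdz * x;                % gradient w.r.t. each weight
    dEdx = dEdz * w;                % gradient passed back to the inputs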
82Backpropagation
- An additional problem: in general, each hidden neuron is connected to more than one neuron of the next layer
- There are many paths for the error gradient to be transmitted backward from the next layer
83Error backpropagation
84Backpropagation through layer
- Applying the rule of differentiation for functions of compound arguments, we can propagate the error gradient through the layer
85Backpropagation through layer
86Backpropagation through layer
87Backpropagation through layer
More generally
88Backpropagation through layer
89Forward propagation
The activations of the neurons are propagated
90Forward propagation
[Diagram: activations a1, a2, a3 multiplied by weights w11, w12, w13 and summed into the excitation z1]
The activations of the neurons are propagated
91Backpropagation
The error function gradient is propagated
92Backpropagation
[Diagram: the gradient reaching a2 arrives through the weights w12 and w22]
The error function gradient is propagated
93Single algorithm cycle
94Forward propagation
- One cycle of the algorithm
- get the inputs of the current layer
- compute the excitations of the considered layer by transferring the inputs through the layer of weights (multiplying the inputs by the corresponding weights and performing the summation)
- calculate the activations of the layer's neurons by transferring the neuron excitations through the activation functions
- Repeat that cycle, starting from layer 1 on to the output layer. The activations of the neurons of the output layer are the outputs of the network (see the sketch below)
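In matrix form, one such forward cycle through a layer could be sketched as follows (the layer size, weights and input are illustrative assumptions):

    % One forward-propagation cycle through a layer of sigmoidal neurons
    a_prev = [0.2; 0.7; -0.1];        % inputs of the current layer
    W      = randn(4, 3);             % weight matrix (4 neurons, 3 inputs)
    b      = zeros(4, 1);             % thresholds folded into biases
    z      = W * a_prev + b;          % excitations of the layer
    a      = 1 ./ (1 + exp(-z));      % activations (the layer outputs)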
95Backpropagation
- One cycle of the algorithm
- get the error function gradients with respect to the outputs of the layer
- compute the error gradients with respect to the excitations of the layer's neurons by transferring the gradients backward through the derivatives of the neuron activation functions
- compute the error function gradients with respect to the outputs of the prior layer by transferring the so-far computed gradients through the layer of weights (multiplying the gradients by the corresponding weights and performing the summation)
96Backpropagation
- Repeat that cycle starting from the last layer, where the error function gradients can be computed directly, on toward the first layer. The gradients computed through the process can be used to calculate the gradients with respect to the weights
97BP Algorithm
- It all ends up with a computationally effective and elegant procedure to compute the partial derivative of the error function with respect to every weight in the network.
- It allows us to correct every weight of the network in such a way as to reduce the error
- Repeating the process over and over gradually reduces the error and constitutes the learning process
98Example source code (MATLAB)
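The original listing is not reproduced in this copy of the slides. A minimal MATLAB sketch in the same spirit, training a one-hidden-layer sigmoidal network with backpropagation (the XOR task, the layer sizes, the learning rate and the epoch count are all assumptions made here for illustration):

    % Toy problem: learn XOR with one hidden layer of sigmoidal units.
    X = [0 0; 0 1; 1 0; 1 1]';   D = [0 1 1 0];     % training set
    nh = 4;  eta = 0.5;                             % hidden size, learning rate
    W1 = randn(nh, 2);  b1 = zeros(nh, 1);          % hidden layer
    W2 = randn(1, nh);  b2 = 0;                     % output layer
    sigm = @(z) 1 ./ (1 + exp(-z));

    for epoch = 1:5000
        for p = 1:size(X, 2)
            % forward pass
            a1 = sigm(W1 * X(:, p) + b1);
            y  = sigm(W2 * a1 + b2);
            % backward pass (quadratic error E = 0.5*(D(p) - y)^2)
            dy  = (y - D(p)) * y * (1 - y);         % gradient at the output unit
            da1 = (W2' * dy) .* a1 .* (1 - a1);     % propagated to the hidden units
            % weight corrections (the gradient rule)
            W2 = W2 - eta * dy * a1';       b2 = b2 - eta * dy;
            W1 = W1 - eta * da1 * X(:, p)'; b1 = b1 - eta * da1;
        end
    end
    sigm(W2 * sigm(W1 * X + b1) + b2)   % should approach [0 1 1 0]
                                        % (occasionally needs a re-run from new random weights)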
99Learning rate
- The term η is called the learning rate
- The faster, the better, but too fast a rate can cause the learning process to become unstable
100Learning rate
- In practice we have to manipulate the learning rate during the course of the learning process (e.g. decrease it over time, as sketched below)
- The strategy of a constant learning rate is not very good
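One simple (assumed, not taken from the slides) strategy is to decay the rate as the epochs progress:

    eta0 = 0.5;  tau = 100;                       % illustrative values
    eta  = @(epoch) eta0 / (1 + epoch / tau);     % learning rate shrinking over time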
101Two types of problems
- Data grouping/classification
- Function approximation
102Classification
103Classification
[Figure: classifier outputs for classes 0-9; class 7 responds most strongly (90) while the others stay near 1; "Brak decyzji" = no decision]
104Classification typical applications
- Classification = pattern recognition:
- medical diagnosis
- fault condition recognition
- handwriting recognition
- object identification
- decision support
105Classification example
- Applet: character recognition
106Classification
- Assumes that a class is a group of similar objects
- Similarity has to be defined
- Similar objects: objects having similar attributes
- We have to describe the attributes
107Classification
- E.g. some human attributes:
- Height
- Age
- Class K: tall people under 30
108Classification
- Object O1, belonging to class K:
- a person 180 cm tall, 23 years old: (180, 23)
- Object O2, not belonging to class K:
- a person 165 cm tall, 35 years old: (165, 35)
109Classification
110The similarity of objects
111The similarity
- Euclidean distance (the Euclidean metric)
112Other metrics
Manhattan metric
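Using the two example objects from the earlier slides, both metrics can be sketched as:

    % Distances between two objects described by attribute vectors
    o1 = [180; 23];                          % height [cm], age [years]
    o2 = [165; 35];
    d_euclid    = sqrt(sum((o1 - o2).^2))    % Euclidean metric
    d_manhattan = sum(abs(o1 - o2))          % Manhattan (city-block) metric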
113Classification
- The more attributes, the more dimensions
114Multidimensional metric
115Multidimensional data
116Classification
[Figure: an object described by many attributes: Atr 1, Atr 2, ..., Atr 8, etc.]
117Classification
- Drawing the boundary between the two groups: a line Y = K·X, here AGE = K·HEIGHT
- On one side of the line AGE > K·HEIGHT, on the other AGE < K·HEIGHT
118Classification
AGE = K·HEIGHT + B, equivalently AGE - K·HEIGHT - B = 0 (a weighted sum of the attributes compared with a threshold)
[Plot: HEIGHT-AGE plane with the separating line and the example objects at ages 23 and 35]
119Classification
- In general, in the multidimensional case, the so-called classification hyperplane is described by the linear equation sketched below
- We are very close to the McCulloch-Pitts model ...
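The equation itself is missing from this copy of the slide; in the usual notation (and with the same weights-and-threshold symbols as in the McCulloch-Pitts slides) it would read:

    w_1 x_1 + w_2 x_2 + \dots + w_n x_n - \theta = 0,
    \qquad \text{i.e.} \qquad \mathbf{w}^{T}\mathbf{x} = \theta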
120McCulloch-Pitts
121Neuron as a simple classifier
- A single McCulloch-Pitts threshold unit performs a linear dichotomy (separation of two classes in the multidimensional space)
- Tuning the weights and the threshold changes the orientation of the separating hyperplane
122Neuron as a simple classifier
- If we tune the weights properly (train the neuron properly), it will classify the processed objects
- Processing an object means exposing the object's attributes to the neuron's inputs
123More classes
- More neurons: a network
- Every neuron performs a bisection of the feature space
- A few neurons partition the space into a few distinct areas
124Sigmoidal activation function
125Classification example
- NeuroSolutions: Principal Component
126Complicated separation border
- NeuroSolutions: Support Vector Machine
127Approximation
[Diagram: an unknown system (?) transforming input X into output Y]
128Example
129Example
- There is only a limited number of observations
130Example
- And the observations are corrupted
131Typical situation
- We have a small amount of data
- The data is corrupted (we are not certain how reliable it is)
132Example
- The experimenter sees only the data
133Experimenter/system task
- To fill the gaps?
- We would call that interpolation
- But what we truly mean is approximation: looking for a model (trace) that is most similar (approximate) to the unknown (!) true phenomenon
134Example
- We can apply e.g. MATLAB's polyfit
135Polyfit
136Example
- Polyfit with a 2nd-order polynomial (see the sketch below)
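A sketch of what such a fit looks like in MATLAB (the data here is synthetic, chosen only to illustrate the calls):

    % Fit a 2nd-order polynomial to noisy observations of an unknown phenomenon
    x = linspace(0, 4, 15);
    y = 1 + 2*x - 0.5*x.^2 + 0.3*randn(size(x));   % corrupted observations (synthetic)
    p = polyfit(x, y, 2);                          % coefficients of the fitted polynomial
    xf = linspace(0, 4, 200);
    plot(x, y, 'o', xf, polyval(p, xf));           % the data and the fitted model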
137Example
- But how do we know that we should apply a 2nd-order polynomial?
138Example
- And what if we apply the 15th degree? It fits the data much better (but it doesn't fit the original phenomenon well)
139The variance factor
- The higher the degree, the more erratic the fit becomes
- A 15th-degree polynomial is quite flexible and can be fitted to many things
- However, generalization is sacrificed: the model fits the data well, but would most probably fail on other data coming later
- That gets too close to modelling the variance (noise) of the data
140Example
- We could also insist on the 1st order
141Example
- ... or even the 0th order (the data are almost completely ignored)...
142The bias factor
- A lower polynomial degree means lower flexibility
- An arbitrary choice of the model degree is what we called the inductive bias
- It is a kind of a priori knowledge that we introduce
- In the case of the 0th and 1st order the bias is too strong
143Polyfit
- A polynomial
- Training set
- Polyfit
144Approximation
- Linear model
- A model employing polynomials (still linear in the parameters)
145Approximation
146Approximation
- The hk() functions can be various: polynomials, sinusoids, ...
- They can be sigmoids as well (the general form is sketched below)
147Approximation
- An ANN can implement a linear model...
148Approximation
149ANN transfer function
- This indeed looks like a nonlinear function ...
150Approximation
- An Artificial Neural Network built of processing elements with sigmoidal activation functions is a universal approximator for functions of class C1 (continuous up to the first derivative) [Hornik, 1989]
- Every typical transfer function can be modelled with arbitrary precision, provided there is an appropriate number of neurons
151Example of function approximation
- Applet: Java function approximation
152Where to go now?
- This set of slides
- http://pr.radom.net/pgolabek/Antwerp/NNIntro.ppt
- Be sure to check the comp.ai.neural-nets FAQ
- http://www.faqs.org/faqs/ai-faq/neural-nets/
- Books
- Simon Haykin, Neural Networks: A Comprehensive Foundation
- Christopher Bishop, Neural Networks for Pattern Recognition
- Neural and Adaptive Systems: the NeuroSolutions interactive book (www.nd.com)
153Where to go now
- Software
- NeuroSolutions (www.nd.com)
- MATLAB Neural Network Toolbox
- SNNS - Stuttgart Neural Network Simulator
- and countless others