Title: An Introduction to Artificial Neural Networks
1An Introduction to Artificial Neural Networks
- Piotr Golabek, Ph.D.
- Radom Technical University
- Poland
- pgolab_at_pr.radom.net
2An overview of the lecture
- What are ANNs? What are they for?
- Neural networks as inductive machines: the inductive reasoning tradition
- The evolution of the concept: keywords, structures, algorithms
3An overview of the lecture
- Two general tasks: classification and approximation
- The above tasks in a more familiar setting: decision making, signal processing, control systems
- Live presentations
4What are ANNs?
- Don't ask me ...
- An ANN is a set of processing elements (PEs) influencing each other
- (that definition suits almost everything...)
5What are ANNs
- ... but seriously...
- neural: following a biological (neurophysiological) inspiration
- artificial: don't forget these are not real neurons!
- networks: strongly interconnected (in fact massively parallel processing)
- and the implicit meaning: ANNs are learning machines, i.e. they adapt, just as biological neurons do
6Machine learning
- Important field of AI
- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"
- (Take a look at Machine Learning by Tom Mitchell)
7What is ANN?
- In the case of ANNs, the experience is the input data (examples)
- The ANN is an inductive learning machine, i.e. a machine constructing internal, generalized concepts based on the evidence brought by the data stream
- The ANN learns from examples: a paradigm shift
8What is ANN
- Structurally, an ANN is a complex, interconnected structure composed of simple processing elements, often mimicking biological neurons
- Functionally, an ANN is an inductive learning machine: it is able to undergo an adaptation process (learning) driven by examples
9What are ANNs used for?
- Recognition of images, OCR
- Recognition of time-signal signatures: vibration diagnostics, sonar signal interpretation, detection of intrusion patterns in various transaction systems
- Trend prediction, esp. in financial markets (bond rating prediction)
- Decision support, e.g. in credit assessment, medical diagnosis
- Industrial process control, e.g. the melting parameters in metallurgical processes
- Adaptive signal filtering to restore the information from a corrupted source
10Inductive process
- Concepts rooted in epistemology (episteme = knowledge)
- Heraclitus: "Nature likes to hide"
- Observations vs the true nature of the phenomenon
- The empirical (experimental) method of developing a model (hypothesis) of the true phenomenon: the inductive process
- Something like this goes on during ANN learning
11ANN as inductive learning machine
- The theory: the way the ANN behaves
- The experimental data: the examples the ANN learns from
- New examples cause the ANN to change its behaviour, in order to fit better to the evidence brought by the examples
12Inductive process
- Inductive bias: the initial theory (a priori knowledge)
- Variance: the evidence brought by the data
- A strong bias prevents the data from affecting the theory
- A weak bias makes the theory vulnerable to corruption in the data
- The game is to set the bias-variance balance properly
13ANN as inductive learning machines
- We can shape the inductive bias of the learning process, e.g. by tuning the number of neurons
- The more neurons, the more flexible the network (the more sensitive to the data)
14Inductive vs deductive reasoning
- Reasoning: premises → conclusions
- Deductive reasoning: the conclusions are more specific than the premises (we just reason out the consequences)
- Inductive reasoning: the conclusions are more general than the premises (we reason out the general rules governing the phenomenon from specific examples)
15The main goal of inductive reasoning
- The main goal: to achieve good generalization, i.e. to reason out a rule general enough that it fits any future data
- This is also the main goal of machine learning: to use the experience in order to build good enough performance (in every possible future situation)
16McCulloch-Pitts model
Warren McCulloch and Walter Pitts,
"A Logical Calculus of the Ideas Immanent in Nervous Activity", 1943
17McCulloch-Pitts model
- The logical calculus approach
- elementary logical operations: AND, OR, NOT
- the basic reasoning operator: implication
- (given premise p, we draw conclusion q)
18McCulloch-Pitts model
- Logical operators are functions
- Truth tables
x y | x → y      x y | x AND y      x y | x OR y      x | NOT x
0 0 |   1        0 0 |    0         0 0 |    0        0 |   1
0 1 |   1        0 1 |    0         0 1 |    1        1 |   0
1 0 |   0        1 0 |    0         1 0 |    1
1 1 |   1        1 1 |    1         1 1 |    1
19McCulloch-Pitts model
- The working question: can a neuron perform the logical functions AND, OR, NOT?
- If the answer is yes, a chain of implications (reasoning) could be implemented in a neural network
20McCulloch-Pitts model
[Diagram of the neuron model: inputs, weights, summation giving the total excitation, activation threshold, activation function, and the neuron output (activation)]
21McCulloch-Pitts transfer function
22Implementation of AND, OR, NOT
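As an illustration, a minimal MATLAB sketch of a McCulloch-Pitts threshold unit; the particular weight and threshold values below are a common textbook choice assumed here, not taken from the slides:

    % McCulloch-Pitts threshold unit: fires (1) when the weighted sum
    % of its inputs reaches the threshold theta.
    % (The threshold can also be folded into the weights as a bias
    % connected to a constant input of 1, as the next slide mentions.)
    mcp = @(x, w, theta) double(sum(w .* x) >= theta);

    % One possible choice of weights/thresholds (assumed):
    AND = @(x) mcp(x, [1 1],  1.5);   % fires only for x = [1 1]
    OR  = @(x) mcp(x, [1 1],  0.5);   % fires for any nonzero input
    NOT = @(x) mcp(x, -1,    -0.5);   % fires only for x = 0

    AND([1 1])   % -> 1
    OR([0 1])    % -> 1
    NOT(1)       % -> 0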
23Including threshold into weights
24McCulloch-Pitts model
25(vector dot product)
26(vector dot product)
[Figure: mutual orientation of the input vector x and the weight vector w; parallel vectors give maximum similarity, orthogonal vectors maximum dissimilarity (orthogonality), opposite vectors maximum anti-similarity]
27Vector dot product interpretation
- The inputs are called the input vector
- The weights are called the weight vector
- The neuron excites when the input vector is similar enough to the weight vector
- The weight vector is a template for some set of input vectors (see the numeric sketch below)
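A small numeric sketch of this interpretation (the vectors are chosen arbitrarily for illustration):

    % The neuron's excitation is the dot product of input and weight vectors;
    % after normalization it is the cosine of the angle between them.
    w  = [2; 1];           % weight vector (the "template")
    x1 = [4; 2];           % input parallel to w   -> maximum similarity
    x2 = [-1; 2];          % input orthogonal to w -> maximum dissimilarity
    x3 = [-2; -1];         % input opposite to w   -> maximum anti-similarity

    cosang = @(a, b) dot(a, b) / (norm(a) * norm(b));
    [cosang(w, x1), cosang(w, x2), cosang(w, x3)]   % -> [1 0 -1]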
28Neurons elements of the ANNs
- Don't be fooled...
- These are our neurons ...
29Neurons elements of the ANNs
Single neuron (stereoscopic)
30Neurons - the building blocks of neural networks
31The real neuron
Synaptic connection organic structure
32The real neuron
Synaptic connection the molecular level
33McCulloch-Pitts model
- The conclusion
- If we tune the weights of the neuron properly, we can make it implement the transfer function we need (AND, OR, NOT)
- The question
- How are the weights of neurons tuned in our brains? What is the adaptation mechanism?
34Neuron adaptation
- Donald Hebb (1949, neurophysiologist): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
35Hebb rule
36Hebb rule
- It is a local rule of adaptation
- The multiplication of the input and the output signifies a correlation between them
- The rule is unstable: a weight can grow without limits (the plain rule is sketched below)
- (that doesn't happen in nature, where resources are limited)
- Numerous modifications of the Hebb rule have been proposed to make it stable
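A minimal sketch of the plain (unstable) Hebbian update described above; the learning rate and the signals are illustrative assumptions:

    % Plain Hebb rule: the weight change is proportional to the product
    % (correlation) of the pre-synaptic input x and post-synaptic output y.
    eta = 0.1;                       % learning rate (assumed value)
    w   = 0.05;                      % initial weight
    for k = 1:50
        x = 1;                       % pre-synaptic activity
        y = w * x;                   % post-synaptic activity (linear unit)
        w = w + eta * x * y;         % Hebbian update: dw = eta * x * y
    end
    w    % keeps growing - the instability mentioned above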
37Hebb rule
- The Hebb rule is very important and useful ...
- ... but for now we want to make the neuron learn the function we need
38Rosenblatt Perceptron
- Frank Rosenblatt (1958): the Perceptron, a hardware (electromechanical) implementation of an ANN (effectively one neuron).
39Rosenblatt Perceptron
- One of the goals of the experiment was to train the neuron, i.e. to make it go active whenever a specific pattern appears on the retina
- The neuron was to be trained with examples
- The experimenter (teacher) was to expose the neuron to the different patterns and in each case tell it whether it should fire or not
- The learning algorithm should do its best to make the neuron do what the teacher requires
40Perceptron learning rule
- A kind of modification of the Hebbian rule: the weight correction depends on the error between the actual and the desired output (sketched below)
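A sketch of the perceptron rule on a toy problem (learning the AND function; the data, learning rate and initial weights are assumptions chosen for illustration):

    % Perceptron learning rule: w <- w + eta * (d - y) * x
    X = [0 0; 0 1; 1 0; 1 1];        % inputs, one pattern per row
    d = [0; 0; 0; 1];                % desired outputs (AND function)
    w = zeros(2, 1);  b = 0;  eta = 0.5;

    for epoch = 1:20
        for p = 1:size(X, 1)
            x = X(p, :)';
            y = double(w' * x + b > 0);         % threshold unit output
            w = w + eta * (d(p) - y) * x;       % error-driven, Hebb-like update
            b = b + eta * (d(p) - y);           % bias (threshold) update
        end
    end
    double(X * w + b > 0)'           % -> [0 0 0 1], the AND function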
41Supervised scheme
42Supervised scheme
- One training example, the pair <input value, desired output>, is called a training pair
- The set of all the training pairs is called the training set
43Unsupervised scheme
44Example of supervised learning
45Neural networks
- A set of processing elements influencing each other
- The neurons (PEs) are interconnected. The output of each neuron can be connected to the input of every neuron, including itself
46Neural networks
- If there is a path of propagation (direct or indirect) between the output of a neuron and its own input, we have feedback; such structures are called recurrent
- If there is no feedback in a network, the structure is called feedforward
47What does recurrent mean?
- a recursive definition is a definition of a concept that uses the very same concept (but perhaps in a lower-complexity setting)
- a recursive function is a function calling itself
- the classic recursive definition: the factorial function (see the sketch below)
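For example, the factorial written as a small MATLAB function (the file name fact.m is assumed here):

    function f = fact(n)
    % FACT  Recursive factorial: the function is defined in terms of itself,
    % on a smaller argument, until the base case n = 0 is reached.
        if n == 0
            f = 1;               % base case
        else
            f = n * fact(n - 1); % recursive call on a simpler problem
        end
    end

    % fact(5) -> 120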
48Recurrent connection
49Recurrent connection
- At any given moment, the whole history of past excitations influences the neuron output
- The concept of temporal memory emerges
- The past influences the present to a degree determined by the weight of the recurrent connection
- This weight is effectively a forgetting factor
50Feedforward layered network
51Our brain
- There are ca. 10^11 neurons in our brain
- Each of them is connected on average to 1000 other neurons
- There is only about one connection per 10 billion possible ones
- If every neuron were connected to every other one, our brain would have to be a few hundred meters in diameter
- There is a strong modularity
52Our brain
A fragment of the neural network connecting the retina to the visual perception area of the brain
53Our brain vs computers
- Memory size estimation: ca. 10^14 connections give an estimated size of 100 TB (each connection has a continuous, real-valued weight)
- Neurons are quite slow, capable of activating no more than 200 times per second, but there are a lot of them, which gives an estimate of 10^16 floating point operations per second.
54Neural networks vs computer
Neural networks:
- Many (10^11) simple processing elements (neurons)
- Massively parallel, distributed processing
- Memory evenly distributed in the whole structure, content-addressable
- Large fault tolerance
Computers:
- A few complex processing elements
- Sequential, centralized processing
- Compact memory, addressed by an index
- Large fault vulnerability
55How to train the whole network?
- For the Perceptron, the output of the neuron could be compared to the desired value
- But what about a layered structure? How to reach the hidden neurons?
- The original idea comes from the experiments of Widrow and Hoff in the 1960s
- Global error optimization using gradient descent was used
56Supervised scheme once again
57Error minimization
- The error function components can be defined quite elaborately
- But the goal is always to minimize the error
- One widely used technique of function optimization (minimization/maximization) is called gradient descent
58Error function
- One cycle of training consists of the presentation of many training pairs; it is called one epoch of learning
- The error accumulated over the whole epoch is an average (see the formula below)
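Assuming the usual sum-of-squares error (the slide's own notation is not reproduced in this copy), the epoch error can be written as:

    E = \frac{1}{P}\sum_{p=1}^{P} E^{(p)},
    \qquad
    E^{(p)} = \frac{1}{2}\sum_{k}\bigl(d_k^{(p)} - y_k^{(p)}\bigr)^2

where P is the number of training pairs in the epoch, d the desired output and y the actual network output.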
59Why quadratic function?
60Error function once again
- As subsequent input/output pairs are averaged out, we can think of the error function mainly as a function of the weights w
- The goal of learning: to choose the weights in such a way that the error is minimized
61Error function derivative
The derivative tells us whether the function increases or decreases when the argument increases (and how fast).
If the function is falling, the sign of the derivative is negative; since we want to minimize the function value, we then have to increase the argument (the weight w_i).
62The gradient rule
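The formula itself is missing from this copy of the slide; the gradient rule is usually written as (η is the learning rate discussed later):

    \Delta w_i = -\eta\,\frac{\partial E}{\partial w_i}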
63Error function gradient
- In the multidimensional case we deal with a vector of partial derivatives of the error function with respect to each dimension (the gradient)
64Gradient method
The method of moving against the gradient is commonly called hill-climbing
65Gradient method
66Steepest descent demo
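The original demo is not reproduced here; a minimal steepest-descent sketch on a simple quadratic error surface (the surface, starting point and step size are assumptions) might look like this:

    % Steepest descent on E(w) = w1^2 + 10*w2^2 (an illustrative error surface)
    gradE = @(w) [2*w(1); 20*w(2)];      % gradient of E
    w   = [9; 3];                        % starting point
    eta = 0.05;                          % step size (learning rate)
    for k = 1:100
        w = w - eta * gradE(w);          % move against the gradient
    end
    w    % ends up close to the minimum at [0; 0]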
67Other form of activation function
- The so-called sigmoidal function, e.g.
68Other form of activation function
[Plot: the sigmoid for β = 1, β = 100 and β = 0.4]
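A sketch of the sigmoid with the slope parameter β from the plot above, assuming the common form f(z) = 1 / (1 + exp(-β·z)):

    % Sigmoidal activation: beta controls the slope
    % (large beta approaches the threshold function).
    sigm = @(z, beta) 1 ./ (1 + exp(-beta * z));
    z = linspace(-5, 5, 201);
    plot(z, sigm(z, 1), z, sigm(z, 100), z, sigm(z, 0.4));
    legend('\beta = 1', '\beta = 100', '\beta = 0.4');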
69Backpropagation algorithm
70Backpropagation algorithm
71Chain rule
- Applying the chain rule of differentiation makes it possible to transfer the error backward toward the hidden units
72Chain rule
Backward propagation through neuron
73Backpropagation through neuron
74Backpropagation through neuron
75Backpropagation through neuron
76Backpropagation through neuron
77Backpropagation through neuron
78Backpropagation through neuron
79Backpropagation through neuron
80Backpropagation through neuron
81Backpropagation through neuron
- Conclusion: if we know the gradient of the error function with respect to the output of the neuron, we can compute the gradient with respect to each of its weights (see the sketch below)
- In general, our goal is to propagate the error function gradient from the output of the network to the outputs of the hidden units
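A numeric sketch of that conclusion for a single sigmoidal neuron (all values are arbitrary assumptions for illustration):

    % Given dE/dy (the gradient w.r.t. the neuron's output), propagate it
    % back through the activation function and the weights.
    x    = [0.5; -1.0; 0.2];        % inputs to the neuron
    w    = [0.1;  0.4; -0.3];       % its weights
    dEdy = 0.8;                     % gradient arriving from above (assumed)

    z    = w' * x;                  % excitation
    y    = 1 / (1 + exp(-z));       % sigmoid activation
    dEdz = dEdy * y * (1 - y);      % through f'(z) = y*(1-y)
    dEdw = dEdz * x;                % gradient w.r.t. each weight
    dEdx = dEdz * w;                % gradient passed back to the inputs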
82Backpropagation
- An additional problem: in general, each hidden neuron is connected to more than one neuron of the next layer
- There are many paths for the error gradient to be transmitted backward from the next layer
83Error backpropagation
84Backpropagation through layer
- Applying the rule of differentiation for functions of compound arguments, we can propagate the error gradient through the layer
85Backpropagation through layer
86Backpropagation through layer
87Backpropagation through layer
More generally
88Backpropagation through layer
89Forward propagation
The activations of the neurons are propagated
90Forward propagation
[Diagram: activations a1, a2, a3 multiplied by weights w11, w12, w13 and summed into the excitation z1]
The activations of the neurons are propagated
91Backpropagation
The error function gradient is propagated
92Backpropagation
[Diagram: the gradient reaching a2 arrives through the weights w12 and w22]
The error function gradient is propagated
93Single algorithm cycle
94Forward propagation
- One cycle of the algorithm
- get the inputs of the current layer
- compute the excitations of the considered layer by transferring the inputs through the layer of weights (multiplying the inputs by the corresponding weights and performing the summation)
- calculate the activations of the layer's neurons by transferring the neuron excitations through the activation functions
- Repeat that cycle, starting from layer 1 on to the output layer. The activations of the neurons of the output layer are the outputs of the network (see the sketch below)
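In matrix form, one such forward cycle through a layer could be sketched as follows (the layer size, weights and input are illustrative assumptions):

    % One forward-propagation cycle through a layer of sigmoidal neurons
    a_prev = [0.2; 0.7; -0.1];        % inputs of the current layer
    W      = randn(4, 3);             % weight matrix (4 neurons, 3 inputs)
    b      = zeros(4, 1);             % thresholds folded into biases
    z      = W * a_prev + b;          % excitations of the layer
    a      = 1 ./ (1 + exp(-z));      % activations (the layer outputs)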
95Backpropagation
- One cycle of the algorithm
- get the error function gradients with respect to the outputs of the layer
- compute the error gradients with respect to the excitations of the layer's neurons by transferring the gradients backward through the derivatives of the neuron activation functions
- compute the error function gradients with respect to the outputs of the prior layer by transferring the so-far computed gradients through the layer of weights (multiplying the gradients by the corresponding weights and performing the summation)
96Backpropagation
- Repeat that cycle starting from the last layer, where the error function gradients can be computed directly, on toward the first layer. The gradients computed through the process can be used to calculate the gradients with respect to the weights
97BP Algorithm
- It all ends up with a computationally effective and elegant procedure to compute the partial derivative of the error function with respect to every weight in the network.
- It allows us to correct every weight of the network in such a way as to reduce the error
- Repeating the process over and over gradually reduces the error and constitutes the learning process
98Example source code (MATLAB)
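The original listing is not reproduced in this copy of the slides. A minimal MATLAB sketch in the same spirit, training a one-hidden-layer sigmoidal network with backpropagation (the XOR task, the layer sizes, the learning rate and the epoch count are all assumptions made here for illustration):

    % Toy problem: learn XOR with one hidden layer of sigmoidal units.
    X = [0 0; 0 1; 1 0; 1 1]';   D = [0 1 1 0];     % training set
    nh = 4;  eta = 0.5;                             % hidden size, learning rate
    W1 = randn(nh, 2);  b1 = zeros(nh, 1);          % hidden layer
    W2 = randn(1, nh);  b2 = 0;                     % output layer
    sigm = @(z) 1 ./ (1 + exp(-z));

    for epoch = 1:5000
        for p = 1:size(X, 2)
            % forward pass
            a1 = sigm(W1 * X(:, p) + b1);
            y  = sigm(W2 * a1 + b2);
            % backward pass (quadratic error E = 0.5*(D(p) - y)^2)
            dy  = (y - D(p)) * y * (1 - y);         % gradient at the output unit
            da1 = (W2' * dy) .* a1 .* (1 - a1);     % propagated to the hidden units
            % weight corrections (the gradient rule)
            W2 = W2 - eta * dy * a1';       b2 = b2 - eta * dy;
            W1 = W1 - eta * da1 * X(:, p)'; b1 = b1 - eta * da1;
        end
    end
    sigm(W2 * sigm(W1 * X + b1) + b2)   % should approach [0 1 1 0]
                                        % (occasionally needs a re-run from new random weights)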
99Learning rate
- The term η is called the learning rate
- The faster, the better, but too fast a rate can cause the learning process to become unstable
100Learning rate
- In practice we have to manipulate the learning rate during the course of the learning process (e.g. decrease it over time, as sketched below)
- The strategy of a constant learning rate is not very good
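One simple (assumed, not taken from the slides) strategy is to decay the rate as the epochs progress:

    eta0 = 0.5;  tau = 100;                       % illustrative values
    eta  = @(epoch) eta0 / (1 + epoch / tau);     % learning rate shrinking over time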
101Two types of problems
- Data grouping/classification
- Function approximation
102Classification
103Classification
[Figure: classifier outputs for classes 0-9; class 7 responds most strongly (90) while the others stay near 1; "Brak decyzji" = no decision]
104Classification typical applications
- Classification = pattern recognition:
- medical diagnosis
- fault condition recognition
- handwriting recognition
- object identification
- decision support
105Classification example
- Applet: character recognition
106Classification
- Assumes that a class is a group of similar objects
- Similarity has to be defined
- Similar objects: objects having similar attributes
- We have to describe the attributes
107Classification
- E.g. some human attributes:
- Height
- Age
- Class K: tall people under 30
108Classification
- Object O1, belonging to class K:
- a person 180 cm tall, 23 years old: (180, 23)
- Object O2, not belonging to class K:
- a person 165 cm tall, 35 years old: (165, 35)
109Classification
110The similarity of objects
111The similarity
- Euclidean distance (the Euclidean metric)
112Other metrics
Manhattan metric
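Using the two example objects from the earlier slides, both metrics can be sketched as:

    % Distances between two objects described by attribute vectors
    o1 = [180; 23];                          % height [cm], age [years]
    o2 = [165; 35];
    d_euclid    = sqrt(sum((o1 - o2).^2))    % Euclidean metric
    d_manhattan = sum(abs(o1 - o2))          % Manhattan (city-block) metric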
113Classification
- The more attributes, the more dimensions
114Multidimensional metric
115Multidimensional data
116Classification
[Figure: an object described by many attributes: Atr 1, Atr 2, ..., Atr 8, etc.]
117Classification
- Drawing the boundary between the two groups: a line Y = K·X, here AGE = K·HEIGHT
- On one side of the line AGE > K·HEIGHT, on the other AGE < K·HEIGHT
118Classification
AGE = K·HEIGHT + B, equivalently AGE - K·HEIGHT - B = 0 (a weighted sum of the attributes compared with a threshold)
[Plot: HEIGHT-AGE plane with the separating line and the example objects at ages 23 and 35]
119Classification
- In general, in the multidimensional case, the so-called classification hyperplane is described by the linear equation sketched below
- We are very close to the McCulloch-Pitts model ...
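The equation itself is missing from this copy of the slide; in the usual notation (and with the same weights-and-threshold symbols as in the McCulloch-Pitts slides) it would read:

    w_1 x_1 + w_2 x_2 + \dots + w_n x_n - \theta = 0,
    \qquad \text{i.e.} \qquad \mathbf{w}^{T}\mathbf{x} = \theta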
120McCulloch-Pitts
121Neuron as a simple classifier
- A single McCulloch-Pitts threshold unit performs a linear dichotomy (separation of two classes in the multidimensional space)
- Tuning the weights and the threshold changes the orientation of the separating hyperplane
122Neuron as a simple classifier
- If we tune the weights properly (train the neuron properly), it will classify the processed objects
- Processing an object means exposing the object's attributes to the neuron's inputs
123More classes
- More neurons: a network
- Every neuron performs a bisection of the feature space
- A few neurons partition the space into a few distinct areas
124Sigmoidal activation function
125Classification example
- NeuroSolutions: Principal Component
126Complicated separation border
- NeuroSolutions: Support Vector Machine
127Approximation
[Diagram: an unknown system (?) transforming input X into output Y]
128Example
129Example
- There is only a limited number of observations
130Example
- And the observations are corrupted
131Typical situation
- We have a small amount of data
- The data is corrupted (we are not certain how reliable it is)
132Example
- The experimenter sees only the data
133Experimenter/system task
- To fill the gaps?
- We would call that interpolation
- But what we truly mean is approximation: looking for a model (trace) that is most similar (approximate) to the unknown (!) true phenomenon
134Example
- We can apply e.g. MATLAB's polyfit
135Polyfit
136Example
- Polyfit with a 2nd-order polynomial (see the sketch below)
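A sketch of what such a fit looks like in MATLAB (the data here is synthetic, chosen only to illustrate the calls):

    % Fit a 2nd-order polynomial to noisy observations of an unknown phenomenon
    x = linspace(0, 4, 15);
    y = 1 + 2*x - 0.5*x.^2 + 0.3*randn(size(x));   % corrupted observations (synthetic)
    p = polyfit(x, y, 2);                          % coefficients of the fitted polynomial
    xf = linspace(0, 4, 200);
    plot(x, y, 'o', xf, polyval(p, xf));           % the data and the fitted model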
137Example
- But how do we know that we should apply a 2nd-order polynomial?
138Example
- And what if we apply the 15th degree? It fits the data much better (but it doesn't fit the original phenomenon well)
139The variance factor
- The higher the degree, the more erratic the fit becomes
- A 15th-degree polynomial is quite flexible and can be fitted to many things
- However, generalization is sacrificed: the model fits the data well, but would most probably fail on other data coming later
- That gets too close to modelling the variance (noise) of the data
140Example
- We could also insist on the 1st order
141Example
- ... or even the 0th order (the data are almost completely ignored)...
142The bias factor
- A lower polynomial degree means lower flexibility
- An arbitrary choice of the model degree is what we called the inductive bias
- It is a kind of a priori knowledge that we introduce
- In the case of the 0th and 1st order the bias is too strong
143Polyfit
- A polynomial
- Training set
- Polyfit
144Approximation
- Linear model
- A model employing polynomials (still linear in the parameters)
145Approximation
146Approximation
- The hk() functions can be various: polynomials, sinusoids, ...
- They can be sigmoids as well (the general form is sketched below)
147Approximation
- An ANN can implement a linear model...
148Approximation
149ANN transfer function
- This indeed looks like a nonlinear function ...
150Approximation
- An Artificial Neural Network built of processing elements with sigmoidal activation functions is a universal approximator for functions of class C1 (continuous up to the first derivative) [Hornik, 1989]
- Every typical transfer function can be modelled with arbitrary precision, provided there is an appropriate number of neurons
151Example of function approximation
- Applet: Java function approximation
152Where to go now?
- This set of slides
- http://pr.radom.net/pgolabek/Antwerp/NNIntro.ppt
- Be sure to check the comp.ai.neural-nets FAQ
- http://www.faqs.org/faqs/ai-faq/neural-nets/
- Books
- Simon Haykin, Neural Networks: A Comprehensive Foundation
- Christopher Bishop, Neural Networks for Pattern Recognition
- Neural and Adaptive Systems: the NeuroSolutions interactive book (www.nd.com)
153Where to go now
- Software
- NeuroSolutions (www.nd.com)
- MATLAB Neural Network Toolbox
- SNNS - Stuttgart Neural Network Simulator
- and countless others