Title: Lecture 2: Basics and definitions
1. Lecture 2: Basics and definitions
2. Last lecture: an artificial neuron
[Diagram: an artificial neuron with bias input x0 = 1, bias b1, inputs x1 ... xm, weights w11 ... w1m, and output y1]
3. Thus the artificial neuron is defined by the following components:
- A set of inputs, xi
- A set of weights, wij
- A bias, bi
- An activation function, f
- The neuron output, y
- The subscript i indicates the i-th input or weight
For the single neuron of the last slide this gives y1 = f(Σi w1i xi + b1).
As the inputs and output are external, the parameters of this model are therefore the weights, bias and activation function, and thus DEFINE the model.
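For concreteness, here is a minimal sketch of how these components combine (in Python; the tanh activation and the particular numbers are illustrative assumptions, not taken from the lecture):

```python
import numpy as np

def neuron_output(x, w, b, f=np.tanh):
    """Single artificial neuron: weighted sum of inputs plus bias,
    passed through an activation function f (tanh assumed here)."""
    return f(np.dot(w, x) + b)

# Example: one neuron with 3 inputs
x = np.array([0.5, -1.0, 2.0])   # inputs x1..x3
w = np.array([0.1, 0.4, -0.2])   # weights w11..w13
b = 0.3                          # bias b1
y = neuron_output(x, w, b)       # output y1
```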
4. More layers means more of the same parameters (and several subscripts)
[Diagram: a multi-layer network with inputs x1 ... xn (visual input), hidden layers, and output (motor output)]
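A sketch of the same computation repeated layer by layer; the layer sizes, random weights and tanh activation below are illustrative assumptions:

```python
import numpy as np

def layer(x, W, b, f=np.tanh):
    """One layer of neurons: y_i = f(sum_j W_ij x_j + b_i)."""
    return f(W @ x + b)

# Example: 3 inputs -> 4 hidden neurons -> 2 outputs (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden-layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output-layer parameters
y = layer(layer(x, W1, b1), W2, b2)             # network output
```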
5. Network as a data model
- We can view a network as a model which has a set of parameters associated with it
- Networks transform input data into an output
- The transformation is defined by the network parameters
- The parameters are set/adapted by an optimisation/adaptive procedure: learning (Haykin, 99)
- The idea is that, given a set of data points, the network (model) can be trained so as to generalise
6. NNs for function approximation
- That is, the network learns a (correct) mapping from inputs to outputs
- Thus NNs can be seen as a multivariate non-linear mapping and are often used for function approximation
- 2 main categories:
- Classification: given an input, say which class it is in
- Regression: given an input, what is the expected output
7. LEARNING: extracting principles from data
- The mapping/function needs to be learnt; various methods are available
- The learning process used shapes the final solution
- Supervised learning: you have a teacher, telling you where to go
- Unsupervised learning: no teacher, the net learns by itself
- Reinforcement learning: you have a critic, saying only wrong or correct
- The type of learning used depends on the task at hand. We will deal mainly with supervised and unsupervised learning. Reinforcement learning is taught in the Adaptive Systems course, or can be found in e.g. Haykin or Hertz et al., or in:
- Sutton R.S., and Barto A.G. (1998) Reinforcement Learning: An Introduction, MIT Press
8. Pattern recognition
- Pattern: the opposite of chaos; it is an entity, vaguely defined, that could be given a name or a classification
- Examples:
- Fingerprints
- Handwritten characters
- Human faces
- Speech (or deer/whale/bat etc.) signals
- Iris patterns
- Medical imaging (various screening procedures)
- Remote sensing, etc.
9. Given a pattern, we can perform:
a. supervised classification (discriminant analysis), in which the input pattern is identified as a member of a predefined class
b. unsupervised classification (e.g. clustering), in which the pattern is assigned to a hitherto unknown class
Unsupervised methods will be discussed further in future lectures.
10. E.g. handwritten character classification
[Images of handwritten examples of the characters 'a' and 'b']
First we need a data set to learn from: sets of characters. How are they represented? E.g. as an input vector x = (x1, ..., xn) to the network (e.g. a vector of ones and zeroes, one per pixel, according to whether it is black/white). The set of input vectors is our Training Set X, which has already been classified into a's and b's (note: capitals for the set, X; underlined small letters for an instance of the set, xi, i.e. the i-th training pattern/vector). Given a training set X, our goal is to tell if a new image is an 'a' or a 'b', i.e. classify it into one of 2 classes, C1 (all a's) or C2 (all b's) (in general, one of k classes C1 ... Ck).
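A minimal sketch of this representation, assuming small binary images stored as NumPy arrays and labels +1 for 'a' (C1) and -1 for 'b' (C2); the tiny 4x4 "images" are made-up toy data:

```python
import numpy as np

def image_to_vector(img):
    """Flatten a 2-D binary image (1 = black pixel, 0 = white) into an input vector x."""
    return img.reshape(-1).astype(float)

# Toy training set: two tiny 4x4 "images", already labelled as class C1 ('a') or C2 ('b')
img_a = np.array([[0,1,1,0],[1,0,0,1],[1,1,1,1],[1,0,0,1]])
img_b = np.array([[1,1,1,0],[1,0,0,1],[1,1,1,0],[1,0,0,1]])
X = np.stack([image_to_vector(img_a), image_to_vector(img_b)])  # training set X
labels = np.array([+1, -1])                                     # +1 = 'a' (C1), -1 = 'b' (C2)
```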
11. Generalisation
Q. How do we tell if a new, unseen image is an 'a' or a 'b'?
A. Brute force: have a library of all possible images. But 256 x 256 binary pixels gives 2^(256 x 256) ≈ 10^19,728 possible images. Impossible! Typically we have less than a few thousand images in the training set. Therefore, the system must be able to classify UNSEEN patterns from the patterns it has seen, i.e. it must be able to generalise from the data in the training set. Intuition: real neural networks do this well, so maybe artificial ones can do the same. As they are also shaped by experience, maybe we'll also learn something about how the brain does it ...
12. For 2-class classification we want the network output y (a function of the inputs and the network parameters) to be
y(x, w) = +1 if x is an 'a'
y(x, w) = -1 if x is a 'b'
where x is an input vector and the network parameters are grouped as a vector w. y is known as a discriminant function: it discriminates between the 2 classes.
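A minimal sketch of such a discriminant function, assuming (purely for illustration) a linear model thresholded at zero; the lecture does not specify the form of y:

```python
import numpy as np

def discriminant(x, w, b):
    """Simple linear discriminant: returns +1 for class C1 ('a'), -1 for class C2 ('b')."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```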
13. As the network mapping is defined by the parameters, we must use the data set to perform learning (training, adaptation), i.e.
- change the weights or interactions between neurons according to the training examples (and possibly prior knowledge of the problem)
- where the purpose of learning is to minimise:
- training errors on the learning data (the learning error)
- prediction errors on new, unseen data (the generalisation error)
Since, when the errors are minimised, the network discriminates between the 2 classes, we therefore need an error function to measure the network performance based on the training error. An optimisation algorithm can then be used to minimise the learning error and train the network.
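As a sketch of this training idea, assuming (for illustration only) a linear discriminant, a squared-error function and plain gradient descent; none of these specific choices come from the slides:

```python
import numpy as np

def train(X, t, lr=0.1, epochs=100):
    """Fit weights w and bias b by gradient descent on the squared training error.
    X: training patterns (one row per pattern), t: targets (+1 or -1)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        y = X @ w + b                 # network outputs (before thresholding)
        err = y - t                   # errors on the training data
        w -= lr * X.T @ err / len(t)  # gradient of the mean squared error w.r.t. w
        b -= lr * err.mean()          # ... and w.r.t. b
    return w, b
```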
14. Feature extraction
However, if we use all the pixels as inputs we are going to have a long training procedure and a very big network. We may want to analyse the data first (pre-process it) and extract some (lower-dimensional) salient features to be the inputs to the network.
We could use the ratio of the height and width of the letter, as b's will tend to be taller than a's (prior knowledge). This feature is also scale invariant.
[Diagram: feature extraction maps the pattern space (data) into a lower-dimensional feature space, giving a single feature x]
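A sketch of this particular feature, the height/width ratio of the black pixels in a binary image; the helper below is illustrative, not from the lecture:

```python
import numpy as np

def height_width_ratio(img):
    """Feature x: ratio of height to width of the bounding box of the black pixels."""
    rows, cols = np.nonzero(img)      # coordinates of black (1) pixels
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return height / width             # scale invariant: resizing the image leaves it unchanged
```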
15. We could then make a decision based on this feature. Suppose we make a histogram of the values of x for the input vectors in the training set X, one histogram per class.
For a new input with an x value of A, we would classify it as C1, as it is more likely to belong to this class.
16. Therefore, we get the idea of a Decision Boundary. Points on one side of the boundary are in one class, and those on the other side are in the other class, i.e. if x < d the pattern is in C1, else it is in C2. Intuitively it makes sense (and is optimal in a Bayesian sense) to place the boundary where the 2 histograms cross.
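A minimal sketch of this histogram-based decision, assuming shared bins across the two classes (an illustrative construction, not the lecture's code):

```python
import numpy as np

def histogram_classifier(x_c1, x_c2, bins=20):
    """Build per-class histograms of the feature x and return a decision function:
    a new value is assigned to the class with the higher count in its bin."""
    lo, hi = min(x_c1.min(), x_c2.min()), max(x_c1.max(), x_c2.max())
    edges = np.linspace(lo, hi, bins + 1)
    h1, _ = np.histogram(x_c1, edges)
    h2, _ = np.histogram(x_c2, edges)

    def classify(x_new):
        i = np.clip(np.searchsorted(edges, x_new) - 1, 0, bins - 1)
        return "C1" if h1[i] >= h2[i] else "C2"   # boundary falls where the histograms cross

    return classify
```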
17. We can then view pattern recognition as the process of assigning patterns to one of a number of classes by dividing up the feature space with decision boundaries, which in turn divides the original pattern space.
18. However, there can be lots of overlap, in which case we could use a rejection threshold e:
if x < d - e the pattern is in C1
if x > d + e the pattern is in C2
else refer it to a better/different classifier
This is related to the idea of minimising Risk, where it may be more important not to misclassify in one class rather than the other (especially important in medical applications). This can serve to shift the decision boundary one way or the other, based on the Loss function, which defines the relative importance/cost of the different errors.
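A small sketch of the rejection rule above; d and e are simply the boundary and threshold values from the slide:

```python
def classify_with_rejection(x, d, e):
    """Decision rule with a rejection region of width 2e around the boundary d."""
    if x < d - e:
        return "C1"
    if x > d + e:
        return "C2"
    return "reject"   # refer to a better/different classifier
```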
19. Alternatively, we can use more features.
Here, the use of any one feature leads to significant overlap (imagine projections onto the axes), but the use of both gives a good separation.
[Scatter plot: data points in the (x1, x2) feature plane, separable when both features are used together]
However, we cannot keep increasing the number of features, as there will come a point where the performance starts to degrade because there is not enough data to provide a good estimate (cf. using 256 x 256 pixels).
20. Curse of dimensionality
- Geometric e.g.: suppose we want to approximate a 1-d function y from m-dimensional training data. We could:
- divide each dimension into intervals (like a histogram)
- take the y value for an interval to be the mean y value of all points in the interval
- increase precision by increasing the number of intervals
- However, we need at least 1 point in each interval
- For k intervals in each dimension we need > k^m data points (see the small example after this list)
- Thus the number of data points grows at least exponentially with the input dimension
- Known as the Curse of Dimensionality
- "A function defined in high dimensional space is likely to be much more complex than a function defined in a lower dimensional space and those complications are harder to discern" (Friedman 95, in Haykin, 99)
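A tiny worked illustration of the k^m growth mentioned above (k = 10 intervals per dimension is an arbitrary choice):

```python
# Number of histogram cells (and hence minimum number of data points) for k intervals per dimension
k = 10
for m in (1, 2, 3, 5, 10):
    print(f"m = {m:2d} dimensions -> k^m = {k**m:,} cells")
# m = 10 already needs at least 10,000,000,000 data points
```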
21. Of course, the above is a particularly inefficient way of using data, and most NNs are less susceptible. However, the only practical way to beat the curse is to incorporate correct prior knowledge. In practice, we must make the underlying function smoother (i.e. less complex) with increasing input dimensionality. We also try to reduce the input dimension by pre-processing. Mainly, we learn to live with the fact that perfect performance is not possible: data in the real world sometimes overlaps. We treat the input data as random variables and instead look for a model which has the smallest probability of making a mistake.
22. Multivariate regression
A type of function approximation: we try to approximate a function from a set of (noisy) training data. E.g. suppose we have the function
y = 0.5 + 0.4 sin(2πx)
We generate training data at equal intervals of x and add a little random Gaussian noise with s.d. 0.05. We add noise since in practical applications data will inevitably be noisy. We then test the model by plugging in many values of x and viewing the resultant function. This gives an idea of the generalisation performance of the model.
23. E.g. suppose we have the function y = 0.5 + 0.4 sin(2πx).
We generate training data at equal intervals of x (red circles) and add a little random Gaussian noise with s.d. 0.05. The model is trained on this data.
24. We then test the model (in this case a piecewise linear model) by plugging in many values of x and viewing the resultant function (solid blue line). This gives an idea of the generalisation performance of the model.
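A minimal sketch of this experiment; the number of training points and the piecewise linear interpolant are illustrative choices rather than the lecture's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: the target function sampled at equal intervals, plus Gaussian noise (s.d. 0.05)
x_train = np.linspace(0, 1, 11)
y_train = 0.5 + 0.4 * np.sin(2 * np.pi * x_train) + rng.normal(0, 0.05, x_train.shape)

# Test the model at many x values; here the "model" is simply piecewise linear interpolation
x_test = np.linspace(0, 1, 200)
y_model = np.interp(x_test, x_train, y_train)          # model prediction (solid line)
y_true = 0.5 + 0.4 * np.sin(2 * np.pi * x_test)        # underlying function
test_error = np.mean((y_model - y_true) ** 2)          # rough measure of generalisation
```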
25. Model complexity
In the previous picture we used a piecewise linear function to approximate the data. It is better to use a polynomial, y = Σi ai x^i, to approximate the data, i.e.
y = a0 + a1x                                  1st order (straight line)
y = a0 + a1x + a2x^2                          2nd order (quadratic)
y = a0 + a1x + a2x^2 + a3x^3                  3rd order
y = a0 + a1x + a2x^2 + a3x^3 + ... + anx^n    nth order
As the order (highest power of x) increases, so does the potential complexity of the model/polynomial. This means that it can represent a more complex (non-smooth) function and thus approximate the data more accurately.
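A sketch of fitting polynomials of increasing order to the noisy sine data, using numpy.polyfit; the orders 1, 3 and 10 match the next slide, but the code itself is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 11)
y_train = 0.5 + 0.4 * np.sin(2 * np.pi * x_train) + rng.normal(0, 0.05, x_train.shape)

x_test = np.linspace(0, 1, 200)
y_true = 0.5 + 0.4 * np.sin(2 * np.pi * x_test)

for order in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, order)            # fit a0..an by least squares
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_true) ** 2)
    print(f"order {order:2d}: training error {train_err:.4f}, test error {test_err:.4f}")
```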
26. 1st order: model too simple
3rd order: models the underlying function well
10th order: more accurate in terms of passing through the data points, but too complex and non-smooth (curvy)
27. As the model complexity grows, performance improves for a while but starts to degrade after reaching an optimal level.
Note, though, that the training error continues to go down as the model matches the fine-scale detail of the data (i.e. the noise). Rather, we want to model the intrinsic dimensionality of the data, otherwise we get the problem of overfitting. This is analogous to the problem of overtraining, where a model is trained for too long, models the data too exactly, and loses its generality.
28. Similar problems occur in classification problems. A model with too much flexibility does not generalise well, resulting in a non-smooth decision boundary. This is somewhat like giving a system enough capacity to remember all the training points: there is then no need to generalise. With less memory, it must generalise to be able to model the training data. There is a trade-off between being a good fit to the training data and achieving a good generalisation, cf. the Bias-Variance trade-off (later).