Lecture 2: Basics and definitions

Transcript
1
Lecture 2: Basics and definitions
  • Networks as Data Models

2
Last lecture: an artificial neuron
[Figure: an artificial neuron with bias input x0 = 1 and bias b1, inputs x1, x2, ..., xm with weights w11, w12, ..., w1m, and output yi]
3
  • Thus the artificial neuron is defined by the
    following components:
  • A set of inputs, xi.
  • A set of weights, wij.
  • A bias, bi.
  • An activation function, f.
  • Neuron output, y
  • The subscript i indicates the i-th input or
    weight.

As the inputs and output are external, the
parameters of this model are the weights, the bias
and the activation function; these therefore
DEFINE the model (see the sketch below).
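In symbols, the neuron computes yi = f(sum_j wij xj + bi). A minimal sketch of this computation (not from the slides; the sigmoid activation is an assumption, purely for illustration):

```python
import numpy as np

def neuron_output(x, w, b, f=lambda a: 1.0 / (1.0 + np.exp(-a))):
    """Single artificial neuron: weighted sum of the inputs plus the bias,
    passed through an activation function f (sigmoid assumed here)."""
    return f(np.dot(w, x) + b)

# Example with three inputs and arbitrary weights/bias
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron_output(x, w, b=0.2))
```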
4
More layers mean more of the same parameters
(and more subscripts)
[Figure: a multilayer network with inputs x1, x2, ..., xn (visual input), hidden layers, and an output layer (motor output)]
5
Network as a data model
  • Can view a network as a model which has a set of
    parameters associated with it
  • Networks transform input data into an output
  • The transformation is defined by the network
    parameters
  • Parameters are set/adapted by an optimisation or
    adaptive procedure: learning (Haykin, 99)
  • The idea is that, given a set of data points, the
    network (model) can be trained so as to generalise

6
NNs for function approximation
  • That is, the network learns a (correct) mapping
    from inputs to outputs
  • Thus NNs can be seen as a multivariate
    non-linear mapping and are often used for
    function approximation
  • Two main categories:
  • Classification: given an input, say which class it
    is in
  • Regression: given an input, what is the expected
    output

7
LEARNING: extracting principles from data
  • The mapping/function needs to be learnt; various
    methods are available
  • The learning process used shapes the final solution
  • Supervised learning: have a teacher, telling you
    where to go
  • Unsupervised learning: no teacher, the net learns
    by itself
  • Reinforcement learning: have a critic, saying only
    wrong or correct
  • The type of learning used depends on the task at
    hand. We will deal mainly with supervised and
    unsupervised learning. Reinforcement learning is
    taught in the Adaptive Systems course, or can be
    found in e.g. Haykin, Hertz et al., or:
  • Sutton R.S., and Barto A.G. (1998)
    Reinforcement Learning: An Introduction, MIT Press

8
Pattern recognition
  • Pattern: the opposite of chaos; it is an entity,
    vaguely defined, that could be given a name or a
    classification
  • Examples:
  • Fingerprints,
  • Handwritten characters,
  • Human faces,
  • Speech (or deer/whale/bat etc.) signals,
  • Iris patterns,
  • Medical imaging (various screening procedures),
  • Remote sensing, etc.

9
Given a pattern:
a. supervised classification (discriminant analysis),
   in which the input pattern is identified as a member
   of a predefined class
b. unsupervised classification (e.g. clustering), in
   which the pattern is assigned to a hitherto unknown
   class.
Unsupervised methods will be discussed further in
future lectures.
10
E.g. handwritten character classification
[Images: handwritten examples of the characters 'a' and 'b']
First we need a data set to learn from: sets of
characters. How are they represented? E.g. as an
input vector x = (x1, ..., xn) to the network (e.g. a
vector of ones and zeroes, one per pixel, according to
whether that pixel is black/white). The set of input
vectors is our Training Set X, which has already been
classified into a's and b's (note: capitals for the
set, X, underlined small letters for an instance of
the set, xi, i.e. the i-th training pattern/vector).
Given a training set X, our goal is to tell if a new
image is an 'a' or a 'b', i.e. classify it into one of
2 classes, C1 (all a's) or C2 (all b's) (in general,
one of k classes C1, ..., Ck).
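As a minimal illustration of the pixel encoding described above (not part of the slides; the 3x3 image and the threshold are hypothetical):

```python
import numpy as np

def image_to_vector(img, threshold=0.5):
    """Flatten a 2-D greyscale image into a vector of ones and zeroes,
    one entry per pixel, according to whether the pixel is dark."""
    return (np.asarray(img) > threshold).astype(int).ravel()

# Hypothetical 3x3 'image', just to show the shape of the result
img = [[0.0, 0.9, 0.0],
       [0.8, 0.9, 0.0],
       [0.0, 0.9, 0.7]]
x = image_to_vector(img)   # x = (x1, ..., x9)
print(x)
```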
11
Generalisation
Q. How do we tell if a new unseen image is an 'a'
or a 'b'?
A. Brute force: have a library of all possible images.
But with 256 x 256 binary pixels there are
2^(256 x 256), roughly 10^19,700, possible images
(over 10^158,000 with 256 grey levels). Impossible!
Typically we have fewer than a few thousand images in
the training set. Therefore, the system must be able
to classify UNSEEN patterns from the patterns it has
seen, i.e. it must be able to generalise from the data
in the training set. Intuition: real neural networks
do this well, so maybe artificial ones can do the
same. As they are also shaped by experience, maybe
we'll also learn about how the brain does it ...
12
For 2-class classification we want the network
output y (a function of the inputs and the network
parameters) to be
  y(x, w) = 1   if x is an 'a'
  y(x, w) = -1  if x is a 'b'
where x is an input vector and the network parameters
are grouped as a vector w. y is known as a
discriminant function: it discriminates between the
2 classes.
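Not from the slides, but as one concrete example of such a function, a linear discriminant:

```python
import numpy as np

def discriminant(x, w, b=0.0):
    """Simple linear discriminant: +1 (class C1, an 'a') or -1
    (class C2, a 'b') depending on which side of the hyperplane
    w.x + b = 0 the input vector x falls."""
    return 1 if np.dot(w, x) + b >= 0 else -1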
13
  • As the network mapping is defined by the
    parameters, we must use the data set to perform
    Learning (training, adaptation), i.e.
  • change the weights or the interaction between
    neurons according to the training examples (and
    possibly prior knowledge of the problem)
  • where the purpose of learning is to minimise
  • training errors on the learning data: the
    learning error
  • prediction errors on new, unseen data: the
    generalisation error

When the errors are minimised, the network
discriminates between the 2 classes. We therefore
need an error function to measure the network's
performance based on the training error. An
optimisation algorithm can then be used to minimise
the learning errors and train the network.
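The slide does not fix a particular error function; a sum-of-squares error over the training set is one common choice, sketched here with hypothetical names:

```python
import numpy as np

def training_error(network, X, targets):
    """Sum-of-squares error of the network outputs over the
    training set (one common, assumed choice of error function)."""
    outputs = np.array([network(x) for x in X])
    return 0.5 * np.sum((outputs - np.asarray(targets)) ** 2)
```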
14
Feature Extraction
However, if we use all the pixels as inputs we are
going to have a long training procedure and a very
big network. We may want to analyse the data first
(pre-process it) and extract some (lower dimensional)
salient features to be the inputs to the network.
We could use the ratio of the height and width of the
letter, as b's will tend to be taller than a's (prior
knowledge). This feature is also scale invariant.
[Figure: feature extraction maps the pattern space (data) onto a lower-dimensional feature space]
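An illustrative sketch of that single feature (the exact definition is an assumption; here, the height/width ratio of the character's bounding box):

```python
import numpy as np

def height_width_ratio(binary_img):
    """Feature x: height-to-width ratio of the bounding box of the
    'on' pixels in a binary character image."""
    rows, cols = np.nonzero(np.asarray(binary_img))
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return height / width
```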
15
We could then make a decision based on this feature.
Suppose we make a histogram of the values of x for
the input vectors in the training set X.
For a new input with an x value of A, we would
classify it as C1, as it is more likely to belong
to this class.
16
This gives the idea of a Decision Boundary: points
on one side of the boundary are in one class, and
points on the other side are in the other class,
i.e. if x < d the pattern is in C1, else it is in
C2. Intuitively it makes sense (and is optimal in
a Bayesian sense) to place the boundary where the 2
histograms cross.
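The decision rule itself is trivial to write down (a sketch; d would be chosen where the class histograms cross):

```python
def classify(x, d):
    """Threshold decision rule: class C1 if x < d, otherwise C2."""
    return "C1" if x < d else "C2"
```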
17
We can then view pattern recognition as the process
of assigning patterns to one of a number of classes
by dividing up the feature space with decision
boundaries, which in turn divides up the original
pattern space.
18
However, there can be a lot of overlap, in which case
we could use a rejection threshold e, where:
  if x < d - e the pattern is in C1
  if x > d + e the pattern is in C2
  else refer it to a better/different classifier.
This is related to the idea of minimising Risk, where
it may be more important not to misclassify in one
class than in the other (especially important in
medical applications). It can serve to shift the
decision boundary one way or the other based on the
Loss function, which defines the relative
importance/cost of the different errors.
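Extending the earlier threshold rule with the rejection band (again just a sketch):

```python
def classify_with_reject(x, d, e):
    """Decision rule with a rejection threshold e around the
    boundary d: inputs inside the band [d - e, d + e] are referred
    to a better/different classifier."""
    if x < d - e:
        return "C1"
    if x > d + e:
        return "C2"
    return "reject"
```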
19
Alternatively, we can use more features.
Here, use of any one feature leads to significant
overlap (imagine projections onto the axes), but use
of both gives a good separation.
[Figure: scatter plot of the training data in the (x1, x2) feature plane, with the two classes well separated]
However, we cannot keep increasing the number of
features, as there will come a point where performance
starts to degrade because there is not enough data to
provide a good estimate (cf. using 256 x 256 pixels).
20
Curse of dimensionality
  • Geometric example: suppose we want to approximate a
    1-d function y from m-dimensional training data. We
    could
  • divide each dimension into intervals (like a
    histogram)
  • take the y value for an interval to be the mean y
    value of all points in that interval
  • increase precision by increasing the number of
    intervals
  • However, we need at least 1 point in each interval
  • For k intervals in each dimension we need > k^m
    data points
  • Thus the number of data points grows at least
    exponentially with the input dimension (see the
    quick count below)
  • This is known as the Curse of Dimensionality
  • A function defined in high dimensional space is
    likely to be much more complex than a function
    defined in a lower dimensional space, and those
    complications are harder to discern (Friedman
    95, in Haykin, 99)
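A quick count of the k^m cells (the values of k and m below are arbitrary, chosen only to show the growth):

```python
# Minimum number of data points needed for k intervals per
# dimension in m input dimensions: one per cell, i.e. k**m.
k = 10
for m in (1, 2, 5, 10):
    print(f"m = {m:2d} dimensions, k = {k} intervals -> {k**m:,} cells")
```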

21
Of course, the above is a particularly inefficient way
of using data, and most NNs are less susceptible.
However, the only practical way to beat the curse is
to incorporate correct prior knowledge. In practice,
we must make the underlying function smoother (i.e.
less complex) with increasing input dimensionality.
We can also try to reduce the input dimension by
pre-processing. Mainly, we learn to live with the fact
that perfect performance is not possible: data in the
real world sometimes overlap. We treat the input data
as random variables and instead look for the model
which has the smallest probability of making a mistake.
22
Multivariate regression
A type of function approximation: try to approximate a
function from a set of (noisy) training data. E.g.
suppose we have the function
  y = 0.5 + 0.4 sin(2πx).
We generate training data at equal intervals of x and
add a little random Gaussian noise with s.d. 0.05. We
add noise since in practical applications data will
inevitably be noisy. We then test the model by
plugging in many values of x and viewing the resultant
function. This gives an idea of the Generalisation
performance of the model.
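A minimal sketch of this data-generation step (the number of training points is an assumption, not given on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 20                                   # assumed number of points
x = np.linspace(0.0, 1.0, n_train)             # equal intervals of x
y = 0.5 + 0.4 * np.sin(2 * np.pi * x)          # underlying function
t = y + rng.normal(scale=0.05, size=x.shape)   # Gaussian noise, s.d. 0.05
```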
23
E.g. suppose we have the function
y = 0.5 + 0.4 sin(2πx).
We generate training data at equal intervals of x
(red circles) and add a little random Gaussian noise
with s.d. 0.05. The model is then trained on this
data.
24
We then test the model (in this case a piecewise
linear model) by plugging in many values of x and
viewing the resultant function (solid blue line).
This gives an idea of the Generalisation
performance of the model.
25
Model Complexity
In the previous picture we used a piecewise linear
function to approximate the data. It is better to use
a polynomial, y = Σ ai x^i, to approximate the data,
i.e.
  y = a0 + a1x                                 1st order (straight line)
  y = a0 + a1x + a2x^2                         2nd order (quadratic)
  y = a0 + a1x + a2x^2 + a3x^3                 3rd order
  y = a0 + a1x + a2x^2 + a3x^3 + ... + anx^n   nth order
As the order (the highest power of x) increases, so
does the potential complexity of the model/polynomial.
This means that it can represent a more complex
(non-smooth) function and thus approximate the data
more accurately.
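To make the comparison of orders concrete, a short sketch using least-squares polynomial fitting (the orders match the next slide; the data generation repeats the earlier assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)                      # assumed 20 training points
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=x.shape)

for order in (1, 3, 10):
    fit = np.poly1d(np.polyfit(x, t, deg=order))   # least-squares fit
    train_err = 0.5 * np.sum((fit(x) - t) ** 2)    # training error only
    print(f"order {order:2d}: training error = {train_err:.4f}")
```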
26
1st order: model too simple.
3rd order: models the underlying function well.
10th order: more accurate in terms of passing through
the data points, but too complex and non-smooth
(curvy).
27
As the model complexity grows, performance improves
for a while but starts to degrade after reaching an
optimal level.
Note though that the training error continues to go
down, as the model matches the fine-scale detail of
the data (i.e. the noise). Rather, we want to model
the intrinsic dimensionality of the data, otherwise we
get the problem of overfitting. This is analogous to
the problem of overtraining, where a model is trained
for too long, models the data too exactly and loses
its generality.
28
Similar problems occur in classification problems: a
model with too much flexibility does not generalise
well, resulting in a non-smooth decision boundary.
It is somewhat like giving a system enough capacity to
remember all the training points: there is no need to
generalise. Less memory => it must generalise to be
able to model the training data.
There is a trade-off between being a good fit to the
training data and achieving a good generalisation,
cf. the Bias-Variance trade-off (later).