Title: Giansalvo EXIN Cirrincione
1 unit 7/8
Neural Networks and Pattern Recognition
Giansalvo EXIN Cirrincione
2 ERROR FUNCTIONS
part one
Goal for REGRESSION: to model the conditional
distribution of the output variables, conditioned
on the input variables.
Goal for CLASSIFICATION: to model the posterior
probabilities of class membership, conditioned on
the input variables.
3 ERROR FUNCTIONS
Basic goal for TRAINING: to model the underlying
generator of the data, so as to generalize to new
data.
The most general and complete description of the
generator of the data is in terms of the
probability density p(x,t) in the joint
input-target space.
For a set of training data {x^n, t^n} drawn
independently from the same distribution, the
likelihood factorizes over the patterns.
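The equation that followed this statement is missing from the transcript. A sketch of the standard factorization, assuming the superscript n indexes the patterns and E denotes the negative log-likelihood used as the error function:

```latex
\mathcal{L} = \prod_{n} p(\mathbf{x}^n, \mathbf{t}^n)
            = \prod_{n} p(\mathbf{t}^n \mid \mathbf{x}^n)\, p(\mathbf{x}^n),
\qquad
E = -\ln \mathcal{L}
  = -\sum_{n} \ln p(\mathbf{t}^n \mid \mathbf{x}^n) - \sum_{n} \ln p(\mathbf{x}^n)
```

The second sum does not depend on the network parameters, so only the conditional term matters for training.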
4 ERROR FUNCTIONS
- discrete (classification: class membership)
- continuous (classification: probability of class
membership)
- continuous (regression: prediction)
5 OLS approach
Sum-of-squares error
- c target variables t_k
- the distributions of the target variables are
independent
- the distributions of the target variables are
Gaussian
- error ε_k ~ N(0, σ²); σ doesn't depend on x or
on k
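The derivation that these assumptions lead to is not in the transcript. A sketch of the usual maximum-likelihood argument, assuming t_k = h_k(x) + ε_k with the network output y_k(x; w) modelling h_k(x), N patterns and c outputs:

```latex
p(t_k \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left( -\frac{\{ y_k(\mathbf{x};\mathbf{w}) - t_k \}^2}{2\sigma^2} \right)
\\[6pt]
E = -\sum_{n=1}^{N} \sum_{k=1}^{c} \ln p(t_k^n \mid \mathbf{x}^n)
  = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{c}
      \{ y_k(\mathbf{x}^n;\mathbf{w}) - t_k^n \}^2
    + N c \ln \sigma + \frac{N c}{2} \ln (2\pi)
```

Dropping the terms and factors that do not depend on w leaves the sum-of-squares error E = (1/2) Σ_n Σ_k {y_k(x^n; w) - t_k^n}².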
7 Sum-of-squares error
w minimizes E
The optimum value of σ² is proportional to the
residual value of the sum-of-squares error
function at its minimum.
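A minimal numerical sketch of this statement, assuming independent N(0, σ²) noise on all N·c residuals. The names `gaussian_nll` and `ml_variance` and the synthetic residuals are illustrative, not from the slides.

```python
import numpy as np

def gaussian_nll(residuals, sigma):
    """Negative log-likelihood of the residuals under independent N(0, sigma^2) noise."""
    n_terms = residuals.size  # N patterns times c outputs
    sse = np.sum(residuals ** 2)
    return (sse / (2.0 * sigma ** 2)
            + n_terms * np.log(sigma)
            + 0.5 * n_terms * np.log(2.0 * np.pi))

def ml_variance(residuals):
    """Variance that minimizes gaussian_nll: the mean squared residual."""
    return np.sum(residuals ** 2) / residuals.size

# Hypothetical residuals y - t at the minimum of E, for N = 200 patterns and c = 2 outputs.
rng = np.random.default_rng(0)
residuals = rng.normal(scale=0.3, size=(200, 2))
sigma2_hat = ml_variance(residuals)
```

Evaluating `gaussian_nll` at `np.sqrt(sigma2_hat)` and at nearby values of sigma shows that the mean squared residual is the minimizer.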
Of course, the use of a sum-of-squares error
doesn't require the target data to have a
Gaussian distribution. However, if we use this
error, then the results cannot distinguish
between the true distribution and any other
distribution having the same mean and variance.
8 Sum-of-squares error
training
validation
- E = 0: perfect prediction of the test data
- E = 1: the network is predicting the test data
in the mean
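The definition of this error measure is missing from the transcript. A sketch, assuming it is the sum-of-squares error normalized by the spread of the targets around their mean, which gives exactly the two limiting values above; `normalized_error` is an illustrative name.

```python
import numpy as np

def normalized_error(y, t):
    """Sum-of-squares error divided by the squared deviation of the targets from their mean."""
    t_mean = t.mean(axis=0)  # mean target over the data set, one value per output
    return np.sum((y - t) ** 2) / np.sum((t - t_mean) ** 2)

# Hypothetical one-output test set with four patterns.
t = np.array([[0.0], [1.0], [2.0], [3.0]])
perfect = normalized_error(t, t)                             # predictions equal to the targets
mean_only = normalized_error(np.full_like(t, t.mean()), t)   # always predicting the mean
```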
9 linear output units
MLP, RBF
11 linear output units
- reduction of the number of iterations (smaller
search space)
- greater cost per iteration
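The equations for these slides are missing. One common way to exploit linear output units, sketched here as an assumption about what the slides showed, is to solve for the final-layer weights by linear least squares for each setting of the hidden weights. That leaves only the hidden weights to the nonlinear search (smaller search space) at the price of a linear solve per iteration. All names and the synthetic data are illustrative.

```python
import numpy as np

def solve_output_weights(Z, T):
    """Least-squares final-layer weights (bias in the last row) for hidden activations Z."""
    Z_tilde = np.hstack([Z, np.ones((Z.shape[0], 1))])  # append a bias unit fixed at 1
    W, *_ = np.linalg.lstsq(Z_tilde, T, rcond=None)
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))        # hypothetical inputs
V = rng.normal(size=(3, 5))          # hypothetical, fixed hidden-layer weights
Z = np.tanh(X @ V)                   # hidden activations
W_true = rng.normal(size=(6, 2))     # weights used only to generate consistent targets
T = np.hstack([Z, np.ones((100, 1))]) @ W_true
W = solve_output_weights(Z, T)
```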
13 Interpretation of network outputs
For a network trained by minimizing a
sum-of-squares error function, the outputs
approximate the conditional averages of the
target data.
Consider the limit in which the size N of the
training set (TS) goes to infinity.
15 Interpretation of network outputs
The network mapping is given by the conditional
average of the targets, that is, the
regression of t_k conditioned on x.
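The derivation on the preceding slides is missing from the transcript. A sketch of the standard infinite-data decomposition of the sum-of-squares error, assuming ⟨·|x⟩ denotes an average over p(t_k | x):

```latex
\langle t_k \mid \mathbf{x} \rangle = \int t_k \, p(t_k \mid \mathbf{x}) \, dt_k
\\[6pt]
E = \frac{1}{2} \sum_{k} \int
      \{ y_k(\mathbf{x};\mathbf{w}) - \langle t_k \mid \mathbf{x} \rangle \}^2
      \, p(\mathbf{x}) \, d\mathbf{x}
  + \frac{1}{2} \sum_{k} \int
      \{ \langle t_k^2 \mid \mathbf{x} \rangle - \langle t_k \mid \mathbf{x} \rangle^2 \}
      \, p(\mathbf{x}) \, d\mathbf{x}
```

Only the first term depends on the weights, and it vanishes when y_k(x; w*) = ⟨t_k | x⟩. The second term is the average variance of the targets around their conditional mean.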
16 Interpretation of network outputs
This result doesn't depend on the choice of
network architecture, or even on whether a neural
network is used at all. However, ANNs provide a
framework for approximating arbitrary nonlinear
multivariate mappings and can therefore in
principle approximate the conditional average to
arbitrary accuracy.
KEY ASSUMPTIONS
- the TS must be sufficiently large that it
approximates an infinite TS
- the network function must be sufficiently
general (there must exist weights that attain the
minimum)
- training must be carried out in such a way as to
find the appropriate minimum of the cost
20 Interpretation of network outputs
The sum-of-squares error function cannot
distinguish between the true distribution and a
Gaussian distribution having the same x-dependent
mean and average variance.
21 ERROR FUNCTIONS
part two
Goal for REGRESSION: to model the conditional
distribution of the output variables, conditioned
on the input variables.
Goal for CLASSIFICATION: to model the posterior
probabilities of class membership, conditioned on
the input variables.
22 We can exploit a number of results ...
Minimum error-rate decisions
Note that the network outputs need not be close
to 0 or 1 if the class-conditional density
functions are overlapping.
23 We can exploit a number of results ...
Minimum error-rate decisions
Outputs sum to 1
This can be enforced explicitly as part of the
choice of network structure.
The average of each output over all patterns in
the TS should approximate the corresponding prior
class probabilities.
These estimated priors can be compared with the
sample estimates of the priors obtained from the
fractions of patterns in each class within the
TS. Differences are an indication that the
network is not modelling the posterior
probabilities accurately.
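A minimal sketch of this consistency check, assuming the network outputs are stored row-wise and the classes are labelled 0..c-1. The function name and the toy numbers are illustrative.

```python
import numpy as np

def prior_estimates(outputs, labels, n_classes):
    """Return (priors from averaged outputs, priors from class fractions in the TS)."""
    from_outputs = outputs.mean(axis=0)
    from_counts = np.bincount(labels, minlength=n_classes) / labels.size
    return from_outputs, from_counts

# Hypothetical outputs of a two-class network on four training patterns.
outputs = np.array([[0.9, 0.1],
                    [0.8, 0.2],
                    [0.3, 0.7],
                    [0.2, 0.8]])
labels = np.array([0, 0, 1, 1])
from_outputs, from_counts = prior_estimates(outputs, labels, n_classes=2)
gap = np.abs(from_outputs - from_counts)  # large gaps suggest poorly modelled posteriors
```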
24 We can exploit a number of results ...
Minimum error-rate decisions
Outputs sum to 1
Compensating for different priors
Changes in priors can be accommodated without
retraining.
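The formula for this correction is missing from the transcript. A sketch of the usual approach, assuming the outputs are posteriors learned under the training-set priors: divide by the old priors, multiply by the new ones, and renormalize. `adjust_for_priors` is an illustrative name.

```python
import numpy as np

def adjust_for_priors(posteriors, train_priors, new_priors):
    """Rescale posteriors for new class priors and renormalize so they sum to 1."""
    scaled = posteriors * (new_priors / train_priors)
    return scaled / scaled.sum(axis=-1, keepdims=True)

# Hypothetical network output for one input, trained with balanced classes.
posteriors = np.array([[0.7, 0.3]])
adjusted = adjust_for_priors(posteriors,
                             train_priors=np.array([0.5, 0.5]),
                             new_priors=np.array([0.2, 0.8]))
```

No weights are touched: the correction acts only on the outputs of the trained network.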
25 sum-of-squares for classification
1-of-c coding
every input vector in the TS is labelled by its
class membership, represented by a set of target
values t_k^n
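A minimal sketch of 1-of-c coding, assuming integer class labels 0..c-1: t_k^n is 1 when pattern n belongs to class k and 0 otherwise. `one_of_c` is an illustrative name.

```python
import numpy as np

def one_of_c(labels, n_classes):
    """Return an (N, c) target matrix with a single 1 per row, at the class index."""
    targets = np.zeros((labels.size, n_classes))
    targets[np.arange(labels.size), labels] = 1.0
    return targets

targets = one_of_c(np.array([0, 2, 1]), n_classes=3)
```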
27 sum-of-squares for classification
The s-o-s error is not the most appropriate for
classification because it is derived from maximum
likelihood (ML) on the assumption of Gaussian
distributed target data.
- for a network with linear output units and s-o-s
error, if the target values satisfy a linear
constraint, then the outputs will satisfy the
same constraint for an arbitrary input
- in the case of a 1-of-c coding scheme, the target
values sum to unity for each pattern, and so the
outputs will satisfy the same constraint
- if the outputs represent probabilities, they
should lie in the range (0,1) and should sum to 1
- however, there is no guarantee that the outputs
lie in the range (0,1)
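A small numerical sketch of these points, assuming the simplest case of a purely linear model with a bias term fitted by least squares to 1-of-c targets. The data are synthetic and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patterns, n_inputs, n_classes = 60, 2, 3

# Hypothetical three-class data: Gaussian clouds around three centres.
labels = rng.integers(0, n_classes, size=n_patterns)
class_means = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
X = class_means[labels] + rng.normal(size=(n_patterns, n_inputs))

T = np.eye(n_classes)[labels]                        # 1-of-c targets, each row sums to 1
X_tilde = np.hstack([np.ones((n_patterns, 1)), X])   # bias input fixed at 1
W, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)      # minimizes the sum-of-squares error

# Evaluate on arbitrary new inputs, some far from the training data.
X_new = rng.normal(scale=5.0, size=(10, n_inputs))
Y_new = np.hstack([np.ones((10, 1)), X_new]) @ W
row_sums = Y_new.sum(axis=1)                 # the linear constraint is inherited
lowest, highest = Y_new.min(), Y_new.max()   # individual outputs may leave (0, 1)
```

The row sums equal 1 for every input, while nothing keeps the individual entries of `Y_new` inside (0,1).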
28 sum-of-squares for classification
two class problem
alternative approach
29 Interpretation of hidden units
Total covariance matrix for the activations at
the output of the final hidden layer w.r.t. TS
linear output units
31 Interpretation of hidden units
Nothing is specific to MLP or indeed to ANNs. The
same result is obtained regardless of the
functions (of the weights) z_j and applies to any
generalized linear discriminant in which the
kernels are adaptive.
linear output units
32 Interpretation of hidden units
The weights in the final layer are adjusted to
produce an optimum discrimination of the classes
of input vectors by means of a linear
transformation. Minimizing the error of this
linear discriminant requires that the input data
undergo a nonlinear transformation into the space
spanned by the activations of the hidden units, in
such a way as to maximize the discriminant
function J.
linear output units
33 Interpretation of hidden units
Strong weighting of the feature extraction
criterion in favour of classes with a larger
number of patterns
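The definitions of the covariance matrices and of J are missing from the transcript. The sketch below assumes the form J = Tr{S_B S_T^+}, with S_T the total covariance of the hidden activations and, for 1-of-c targets, a between-class matrix in which class k is weighted by N_k². That weighting is what would favour the more populous classes. Both the form of J and all names are assumptions.

```python
import numpy as np

def discriminant_criterion(Z, labels):
    """Assumed criterion J = Tr(S_B pinv(S_T)) on hidden activations Z of shape (N, M)."""
    z_bar = Z.mean(axis=0)
    centered = Z - z_bar
    S_T = centered.T @ centered                 # total covariance (unnormalized)
    S_B = np.zeros_like(S_T)
    for k in np.unique(labels):
        Z_k = Z[labels == k]
        diff = Z_k.mean(axis=0) - z_bar
        S_B += Z_k.shape[0] ** 2 * np.outer(diff, diff)  # N_k^2 weighting (assumed)
    return np.trace(S_B @ np.linalg.pinv(S_T))

# Hypothetical hidden activations for two classes of 50 patterns each.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
noise = rng.normal(size=(100, 2))
offsets = np.where(labels[:, None] == 0, -1.0, 1.0) * np.array([1.0, 0.0])
J_separated = discriminant_criterion(noise + 5.0 * offsets, labels)
J_overlapping = discriminant_criterion(noise + 0.1 * offsets, labels)
```

Under this assumed form, hidden representations that separate the classes well give a larger J than representations in which the classes overlap.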
linear output units
34 Cross-entropy for two classes
- Hopfield (1987)
- Baum and Wilczek (1988)
- Solla et al. (1988)
- Hinton (1989)
- Hampshire and Pearlmutter (1990)
cross-entropy error function
35 Cross-entropy for two classes
logistic activation function for the output
BP
Natural pairing:
- sum-of-squares error with linear output units
- cross-entropy error with a logistic output unit
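One way to see why the pairing is natural: with a logistic output y = g(a) and the cross-entropy error, the derivative of the error with respect to the output activation a is simply y - t, the same form that back-propagation obtains for sum-of-squares with a linear output. A sketch that checks this by finite differences; the function names and the sample values are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(a, t):
    """Two-class cross-entropy as a function of the output activations a."""
    y = sigmoid(a)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

a = np.array([-1.5, 0.2, 2.0])   # hypothetical output activations, one per pattern
t = np.array([0.0, 1.0, 1.0])    # binary targets

analytic = sigmoid(a) - t        # claimed derivative dE/da = y - t
eps = 1e-6
numeric = np.array([
    (cross_entropy(a + eps * e, t) - cross_entropy(a - eps * e, t)) / (2.0 * eps)
    for e in np.eye(a.size)      # central difference along each activation
])
```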
36 Cross-entropy for two classes
At its minimum, the error is 0 for 1-of-c coding.
It doesn't vanish when t^n is continuous in the
range (0,1), representing the probability of the
input x^n belonging to class C1.
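The equations for this slide are missing. A sketch of the standard expressions, assuming the minimum is reached at y^n = t^n: the value of the error there, and the shifted error obtained by subtracting it, which is zero at the minimum for any targets in (0,1).

```latex
E_{\min} = -\sum_{n} \left\{ t^n \ln t^n + (1 - t^n) \ln (1 - t^n) \right\}
\\[6pt]
E - E_{\min} = -\sum_{n} \left\{ t^n \ln \frac{y^n}{t^n}
  + (1 - t^n) \ln \frac{1 - y^n}{1 - t^n} \right\}
```

For binary targets every term of E_min is zero (with the convention 0 ln 0 = 0), which recovers the 1-of-c case.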
37 Cross-entropy for two classes
Example:
- MLP
- one input unit
- five hidden units (tanh)
- one output unit (logistic)
- cross-entropy
- BFGS
38 sigmoid activation functions
exponential family of distributions (e.g.
Gaussian, binomial, Bernoulli, Poisson)
The network output is given by a logistic sigmoid
activation function acting on a weighted linear
combination of the outputs of those hidden units
which send connections to the output unit.
Extension to the hidden units: provided such
units use logistic sigmoids, their outputs can be
interpreted as probabilities of the presence of
corresponding features, conditioned on the inputs
to the units.
39 properties of the cross-entropy error
- the cross-entropy error function performs better
than s-o-s at estimating small probabilities
- the s-o-s error function depends on the absolute
errors (its minimization tends to result in
similar absolute errors for each pattern)
- the cross-entropy error function depends on the
relative errors of the outputs (its minimization
tends to result in similar relative errors on
both small and large targets)
40 properties of the cross-entropy error
Manhattan error function, compared with s-o-s:
- gives much stronger weight to smaller errors
- is better for incorrectly labelled data
41 justification of the cross-entropy error
as for s-o-s, the output of the network
approximates the conditional average of the
target data for the given input
43 Multiple independent attributes
Determine the probabilities of the presence or
absence of a number of attributes (which need not
be mutually exclusive).
Assumption: independent attributes
multiple outputs
y_k represents the probability that the kth
attribute is present
x
With this choice of error function, the outputs
should each have a logistic sigmoid activation
function.
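The error function referred to here is missing from the transcript. A sketch of the standard form under the independence assumption, with binary targets t_k marking the presence of attribute k:

```latex
p(\mathbf{t} \mid \mathbf{x}) = \prod_{k=1}^{c} y_k^{\,t_k} (1 - y_k)^{1 - t_k}
\\[6pt]
E = -\sum_{n} \sum_{k=1}^{c}
    \left\{ t_k^n \ln y_k^n + (1 - t_k^n) \ln (1 - y_k^n) \right\}
```

This is the two-class cross-entropy summed over the c outputs, which is why each output keeps its own logistic sigmoid.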
44 Multiple independent attributes
Show that the entropy measure E, derived for
targets t_k ∈ {0, 1}, applies also in the case
where the targets are probabilities with values in
(0,1). Do this by considering an extended data
set in which each pattern t_k^n is replaced by a
set of M patterns, of which M t_k^n are set to 1
and the remainder are set to 0, and then
applying E to this extended TS.
45 Cross-entropy for multiple classes
One output y_k for each class
mutually exclusive classes
The probability of observing the set of target
values t_k^n = δ_kl, given an input vector x^n, is
just P(C_l | x^n) = y_l^n.
The y_k are not independent, as a result of the
constraint Σ_k y_k = 1.
The absolute minimum w.r.t. y_k^n occurs when
y_k^n = t_k^n ∀ k, n.
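The equations for this slide are missing. A sketch of the standard construction for 1-of-c targets, assuming c classes: the likelihood of one pattern, the resulting cross-entropy error, and the shifted form that is zero at the minimum.

```latex
p(\mathbf{t}^n \mid \mathbf{x}^n) = \prod_{k=1}^{c} (y_k^n)^{t_k^n}
\\[6pt]
E = -\sum_{n} \sum_{k=1}^{c} t_k^n \ln y_k^n
\\[6pt]
E - E_{\min} = -\sum_{n} \sum_{k=1}^{c} t_k^n \ln \frac{y_k^n}{t_k^n}
```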
46 Cross-entropy for multiple classes
If the output values are to be interpreted as
probabilities, they must lie in the range (0,1)
and sum to unity.
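The softmax activation satisfies both requirements by construction. A minimal sketch, assuming the activations a_k are stored along the last axis; subtracting the row maximum is a common numerical safeguard and does not change the result.

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis: y_k = exp(a_k) / sum_j exp(a_j)."""
    shifted = a - a.max(axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical output activations for two patterns and three classes.
a = np.array([[1000.0, 1001.0, 999.0],
              [0.0, 0.0, 0.0]])
y = softmax(a)
```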
47 Cross-entropy for multiple classes
As with the logistic sigmoid, we can give a
general motivation for the softmax by considering
the posterior probability that a hidden unit
activation z belongs to class C_k.
The outputs can be interpreted as probabilities
of class membership, conditioned on the outputs
of the hidden units.
48 Cross-entropy for multiple classes
BP training
Natural pairing:
- sum-of-squares error with linear output units
- 2-class cross-entropy error with a logistic
output unit
- c-class cross-entropy error with softmax output
units
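The third pairing can be checked in the same way as the two-class case: with softmax outputs and the c-class cross-entropy, the derivative of the error with respect to the activation a_k is y_k - t_k. A finite-difference sketch for a single pattern; names and values are illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def cross_entropy(a, t):
    """c-class cross-entropy for one pattern, as a function of the activations a."""
    return -np.sum(t * np.log(softmax(a)))

a = np.array([0.5, -1.0, 2.0])   # hypothetical output activations
t = np.array([0.0, 0.0, 1.0])    # 1-of-c target

analytic = softmax(a) - t        # claimed derivative dE/da_k = y_k - t_k
eps = 1e-6
numeric = np.array([
    (cross_entropy(a + eps * e, t) - cross_entropy(a - eps * e, t)) / (2.0 * eps)
    for e in np.eye(a.size)      # central difference along each activation
])
```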
49 homework
50 Consider the cross-entropy error function for
multiple classes, together with a network whose
outputs are given by a softmax activation
function, in the limit of an infinite data set.
Show that the network output functions y_k(x)
which minimize the error are given by the
conditional averages of the target data.
Since the outputs are not independent, consider
the functional derivative w.r.t. a_k(x) instead.
52 END