Title: Statistical Learning Theory
Statistical Learning Theory
Statistical Learning Theory
- A model of supervised learning consists of three components:
- a) Environment. It supplies an input vector $\mathbf{x}$ with a fixed but unknown pdf $p_{\mathbf{X}}(\mathbf{x})$.
- b) Teacher. It provides a desired response $d$ for every $\mathbf{x}$ according to a conditional pdf $p_{D|\mathbf{X}}(d \mid \mathbf{x})$. The two are related by
$$ d = f(\mathbf{x}, v), $$
Statistical Learning Theory
- where $v$ is a noise term.
- c) Learning machine. It is capable of implementing a set of input-output mapping functions
$$ y = F(\mathbf{x}, \mathbf{w}), \qquad \mathbf{w} \in \mathcal{W}, $$
where $y$ is the actual response and $\mathbf{w}$ is the set of free parameters (weights) selected from the parameter (weight) space $\mathcal{W}$.
Statistical Learning Theory
- The supervised learning problem is that of selecting the particular function $F(\mathbf{x}, \mathbf{w})$ that approximates $d$ in an optimum fashion. The selection itself is based on a set of $N$ iid training samples
$$ \mathcal{T} = \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}. $$
- Each sample is drawn from the environment-teacher pair with a joint pdf $p_{\mathbf{X},D}(\mathbf{x}, d)$.
Statistical Learning Theory
- Supervised learning depends on the following question: do the training examples contain enough information to construct a learning machine capable of good generalization?
- To answer it, we view the problem as an approximation problem: we wish to find the function $F(\mathbf{x}, \mathbf{w})$ that is the best possible approximation to the desired response $d$.
Statistical Learning Theory
- Let $L(d, F(\mathbf{x}, \mathbf{w}))$ denote a measure of the discrepancy (loss) between the desired response $d$ corresponding to an input vector $\mathbf{x}$ and the actual response $F(\mathbf{x}, \mathbf{w})$ produced by the learning machine.
- The expected value of the loss is defined by the risk functional
$$ R(\mathbf{w}) = \iint L(d, F(\mathbf{x}, \mathbf{w}))\, p_{\mathbf{X},D}(\mathbf{x}, d)\, \mathrm{d}\mathbf{x}\, \mathrm{d}d. $$
Statistical Learning Theory
- The risk functional may be easily understood from the finite approximation
$$ R(\mathbf{w}) \approx \sum_{i} L(d_i, F(\mathbf{x}_i, \mathbf{w}))\, P_i, $$
where $P_i$ denotes the probability of drawing the $i$-th sample $(\mathbf{x}_i, d_i)$.
Principle of Empirical Risk Minimization
- Instead of using $R(\mathbf{w})$ we use the empirical measure
$$ R_{\mathrm{emp}}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(d_i, F(\mathbf{x}_i, \mathbf{w})). $$
- This measure differs from $R(\mathbf{w})$ in two desirable ways:
- a) It does not depend on the unknown pdf $p_{\mathbf{X},D}(\mathbf{x}, d)$ explicitly.
Principle of Empirical Risk Minimization
- b) In theory it can be minimized with respect to the weight vector $\mathbf{w}$.
- Let $\mathbf{w}_{\mathrm{emp}}$ and $F(\mathbf{x}, \mathbf{w}_{\mathrm{emp}})$ denote the weight vector and the mapping that minimize $R_{\mathrm{emp}}(\mathbf{w})$.
- Also, let $\mathbf{w}_0$ and $F(\mathbf{x}, \mathbf{w}_0)$ denote the analogues for $R(\mathbf{w})$.
- Both $\mathbf{w}_{\mathrm{emp}}$ and $\mathbf{w}_0$ belong to the weight space $\mathcal{W}$.
Principle of Empirical Risk Minimization
- We must now consider under which conditions $F(\mathbf{x}, \mathbf{w}_{\mathrm{emp}})$ is close to $F(\mathbf{x}, \mathbf{w}_0)$, as measured by the mismatch between $R(\mathbf{w}_{\mathrm{emp}})$ and $R(\mathbf{w}_0)$.
Principle of Empirical Risk Minimization
- 1. In place of $R(\mathbf{w})$, construct $R_{\mathrm{emp}}(\mathbf{w})$ on the basis of the training set of iid samples $(\mathbf{x}_i, d_i)$, $i = 1, \ldots, N$.
Principle of Empirical Risk Minimization
- 2. $R(\mathbf{w}_{\mathrm{emp}})$ converges in probability to the minimum possible value of $R(\mathbf{w})$, $\mathbf{w} \in \mathcal{W}$, as $N \to \infty$, provided that $R_{\mathrm{emp}}(\mathbf{w})$ converges uniformly to $R(\mathbf{w})$.
- 3. Uniform convergence, in the sense
$$ \lim_{N \to \infty} P\Big( \sup_{\mathbf{w} \in \mathcal{W}} \big| R(\mathbf{w}) - R_{\mathrm{emp}}(\mathbf{w}) \big| > \varepsilon \Big) = 0 \quad \text{for every } \varepsilon > 0, $$
is necessary and sufficient for consistency of the principle of empirical risk minimization (PERM).
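The uniform-convergence condition can be observed numerically. The sketch below is an illustrative toy experiment (not from the text): for threshold classifiers on uniformly distributed inputs the true risk is known in closed form, so the supremum deviation $\sup_{\mathbf{w}} |R(\mathbf{w}) - R_{\mathrm{emp}}(\mathbf{w})|$ can be tracked as the sample size $N$ grows.

```python
# Sketch: Monte Carlo look at uniform convergence for threshold
# classifiers f_w(x) = 1[x >= w], x ~ Uniform(0,1), noiseless labels
# d = 1[x >= 0.5]. The true risk is R(w) = |w - 0.5|, so the supremum of
# |R(w) - R_emp(w)| over a grid of w can be computed directly.
import numpy as np

rng = np.random.default_rng(0)
w_grid = np.linspace(0.0, 1.0, 201)          # candidate thresholds w
true_risk = np.abs(w_grid - 0.5)             # R(w) for this toy problem

for N in (10, 100, 1_000, 10_000):
    x = rng.uniform(0.0, 1.0, size=N)
    d = (x >= 0.5).astype(int)
    # Empirical risk of every w: fraction of samples misclassified by 1[x >= w].
    preds = (x[None, :] >= w_grid[:, None]).astype(int)
    emp_risk = (preds != d[None, :]).mean(axis=1)
    print(f"N = {N:>6d}   sup_w |R(w) - R_emp(w)| = "
          f"{np.abs(true_risk - emp_risk).max():.4f}")
```

The printed deviation shrinks as $N$ grows, which is the behavior the uniform-convergence condition formalizes.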
The Vapnik-Chervonenkis Dimension
- The theory of uniform convergence of $R_{\mathrm{emp}}(\mathbf{w})$ to $R(\mathbf{w})$ includes bounds on the rate of convergence based on a parameter called the VC (Vapnik-Chervonenkis) dimension.
- The VC dimension is a measure of the capacity or expressive power of the family of classification functions realized by the learning machine.
The Vapnik-Chervonenkis Dimension
- To describe the concept of the VC dimension, let us consider a binary pattern classification problem for which the desired response is $d \in \{0, 1\}$.
- A dichotomy is a binary-valued classification function. Let $\mathcal{F}$ denote the set of dichotomies implemented by the learning machine:
$$ \mathcal{F} = \{ F(\mathbf{x}, \mathbf{w}) : \mathbf{w} \in \mathcal{W} \}, \qquad F(\mathbf{x}, \mathbf{w}) \in \{0, 1\}. $$
The Vapnik-Chervonenkis Dimension
- Let $\mathcal{L}$ denote a set of $N$ points in the $m$-dimensional space of input vectors:
$$ \mathcal{L} = \{ \mathbf{x}_1, \ldots, \mathbf{x}_N \}, \qquad \mathbf{x}_i \in \mathbb{R}^{m}. $$
- A dichotomy implemented by the learning machine partitions $\mathcal{L}$ into two disjoint subsets $\mathcal{L}_0$ and $\mathcal{L}_1$, such that
$$ F(\mathbf{x}, \mathbf{w}) = \begin{cases} 0, & \mathbf{x} \in \mathcal{L}_0, \\ 1, & \mathbf{x} \in \mathcal{L}_1. \end{cases} $$
The Vapnik-Chervonenkis Dimension
- Let $\Delta_{\mathcal{F}}(\mathcal{L})$ denote the number of distinct dichotomies of $\mathcal{L}$ implemented by the learning machine.
- Let $\Delta_{\mathcal{F}}(N)$ denote the maximum of $\Delta_{\mathcal{F}}(\mathcal{L})$ over all $\mathcal{L}$ with $|\mathcal{L}| = N$.
- $\mathcal{L}$ is said to be shattered by $\mathcal{F}$ if $\Delta_{\mathcal{F}}(\mathcal{L}) = 2^{N}$, that is, if all the possible dichotomies of $\mathcal{L}$ can be induced by functions in $\mathcal{F}$.
The Vapnik-Chervonenkis Dimension
- In the figure we illustrate a two-dimensional space consisting of four points $(\mathbf{x}_1, \ldots, \mathbf{x}_4)$. The decision boundaries of $F_0$ and $F_1$ correspond to the classes 0 and 1 being true. $F_0$ induces one dichotomy of the four points,

The Vapnik-Chervonenkis Dimension
- while $F_1$ induces another. With the set $\mathcal{L}$ consisting of the four points, the cardinality is $|\mathcal{L}| = 4$. Hence, the number of possible dichotomies of $\mathcal{L}$ is $2^{4} = 16$.
The Vapnik-Chervonenkis Dimension
- We now formally define the VC dimension as follows:
- The VC dimension of an ensemble of dichotomies $\mathcal{F}$ is the cardinality of the largest set $\mathcal{L}$ that is shattered by $\mathcal{F}$.
The Vapnik-Chervonenkis Dimension
- In more familiar terms, the VC dimension of the set of classification functions $\{F(\mathbf{x}, \mathbf{w}) : \mathbf{w} \in \mathcal{W}\}$ is the maximum number of training examples that can be learned by the machine without error, for all possible binary labelings of those examples.
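The shattering condition can be checked mechanically for small point sets. The Python sketch below is an illustrative construction (not taken from the text): it tests whether linear threshold dichotomies, i.e. separating hyperplanes, realize all $2^{N}$ labelings of a point set, using a small linear-programming feasibility check, and it recovers the familiar fact that the VC dimension of lines in the plane (with a bias term) is 3.

```python
# Sketch: check whether a point set is shattered by linear threshold
# dichotomies sign(w . x + b) by testing every one of the 2^N labelings
# for linear separability via an LP feasibility problem.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """True if some hyperplane w.x + b realizes the labeling (labels in {0,1});
    feasibility of y_i (w.x_i + b) >= 1 is checked with a zero-objective LP."""
    y = 2 * np.asarray(labels) - 1                  # map {0,1} -> {-1,+1}
    X = np.asarray(points, dtype=float)
    n, m = X.shape
    # Variables z = (w_1..w_m, b); constraints -y_i (x_i.w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(m + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (m + 1), method="highs")
    return res.status == 0                          # 0 means a feasible point exists

def is_shattered(points):
    """True if every one of the 2^N dichotomies of the point set is realizable."""
    n = len(points)
    return all(linearly_separable(points, labels)
               for labels in product([0, 1], repeat=n))

# Three points in general position are shattered by lines in the plane;
# four points in an XOR-like configuration are not. The VC dimension of
# linear dichotomies in R^2 (with bias) is therefore 3.
print(is_shattered([(0, 0), (1, 0), (0, 1)]))          # True
print(is_shattered([(0, 0), (1, 1), (1, 0), (0, 1)]))  # False
```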
Importance of the VC Dimension
- Roughly speaking, the number of examples needed to learn a class of interest reliably is proportional to the VC dimension of that class.
- In some cases the VC dimension is determined by the free parameters of a neural network.
- In this regard, the following two results are of interest.
Importance of the VC Dimension
- 1. Let $\mathcal{N}$ denote an arbitrary feedforward network built up from neurons with a threshold (Heaviside) activation function. The VC dimension of $\mathcal{N}$ is $O(W \log W)$, where $W$ is the total number of free parameters in the network.
Importance of the VC Dimension
- 2. Let $\mathcal{N}$ denote a multilayer feedforward network whose neurons use a sigmoid activation function. The VC dimension of $\mathcal{N}$ is $O(W^{2})$, where $W$ is the total number of free parameters in the network.
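A quick way to get a feel for these orders is to count the free parameters of a concrete architecture. The sketch below uses an arbitrary illustrative layer layout (not from the text), counts $W$ for a fully connected feedforward network, and evaluates the two capacity orders.

```python
# Sketch: count the free parameters W of a fully connected feedforward
# network and evaluate the orders quoted above (O(W log W) for threshold
# units, O(W^2) for sigmoid units). The layer sizes are illustrative.
import math

def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected net, e.g. [m, h1, h2, 1]."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))

W = count_parameters([10, 20, 10, 1])   # 10 inputs, two hidden layers, 1 output
print("W =", W)
print("threshold units: VC dim ~ O(W log W) ->", round(W * math.log(W)))
print("sigmoid units:   VC dim ~ O(W^2)     ->", W ** 2)
```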
Importance of the VC Dimension
- In the case of binary pattern classification, the loss function has only two possible values:
$$ L(d, F(\mathbf{x}, \mathbf{w})) = \begin{cases} 0, & F(\mathbf{x}, \mathbf{w}) = d, \\ 1, & \text{otherwise.} \end{cases} $$
- The risk functional $R(\mathbf{w})$ and the empirical risk functional $R_{\mathrm{emp}}(\mathbf{w})$ then assume the following interpretations.
Importance of the VC Dimension
- $R(\mathbf{w})$ is the probability of classification error, denoted by $P(\mathbf{w})$.
- $R_{\mathrm{emp}}(\mathbf{w})$ is the training error, denoted by $\nu(\mathbf{w})$.
- The uniform-convergence condition then reads (Haykin, p. 98)
$$ \lim_{N \to \infty} P\Big( \sup_{\mathbf{w} \in \mathcal{W}} \big| P(\mathbf{w}) - \nu(\mathbf{w}) \big| > \varepsilon \Big) = 0. $$
Importance of the VC Dimension
- The notion of the VC dimension provides a bound on the rate of uniform convergence. For a set of classification functions with VC dimension $h$, the following inequality holds:
$$ P\Big( \sup_{\mathbf{w} \in \mathcal{W}} \big| P(\mathbf{w}) - \nu(\mathbf{w}) \big| > \varepsilon \Big) \le 4 \left( \frac{2eN}{h} \right)^{h} \exp(-\varepsilon^{2} N), \tag{vc.1} $$
where $N$ is the size of the training sample. In other words, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.
Importance of the VC dimension
- The factor $(2eN/h)^{h}$ in (vc.1) is a bound on the growth function $\Delta_{\mathcal{F}}(2N)$ for the family of functions $\mathcal{F}$, valid for $N > h$. Provided that this factor does not grow too fast, the right-hand side of (vc.1) goes to zero as $N$ goes to infinity.
- This requirement is satisfied if the VC dimension is finite.
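The convergence can be seen numerically by evaluating the right-hand side of (vc.1) as given above for a fixed finite $h$. The sketch below works in log form to avoid overflow; the particular values of $h$ and $\varepsilon$ are illustrative assumptions.

```python
# Sketch: evaluate the right-hand side of (vc.1),
#   4 * (2 e N / h)^h * exp(-eps^2 N),
# in log form, and watch it drop from a vacuous value (> 1) to essentially
# zero as N grows while the VC dimension h stays finite.
import math

def vc_bound(N, h, eps):
    log_bound = math.log(4) + h * math.log(2 * math.e * N / h) - eps**2 * N
    return math.exp(min(log_bound, 700))   # clamp so exp() cannot overflow

h, eps = 10, 0.1
for N in (100, 1_000, 10_000, 100_000):
    print(f"N = {N:>7d}   bound = {vc_bound(N, h, eps):.3e}")
```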
Importance of the VC Dimension
- Thus, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.
- Let $\alpha$ denote the probability of occurrence of the event $\sup_{\mathbf{w} \in \mathcal{W}} |P(\mathbf{w}) - \nu(\mathbf{w})| > \varepsilon$. Using the previous bound (vc.1) we find
$$ \alpha \le 4 \left( \frac{2eN}{h} \right)^{h} \exp(-\varepsilon^{2} N). \tag{vc.2} $$
Importance of the VC Dimension
- Let $\varepsilon_0(N, h, \alpha)$ denote the special value of $\varepsilon$ that satisfies (vc.2) with equality. Then we obtain (Haykin, p. 99)
$$ \varepsilon_0(N, h, \alpha) = \sqrt{ \frac{h}{N}\left[ \ln\frac{2N}{h} + 1 \right] - \frac{1}{N} \ln\frac{\alpha}{4} }. $$
- We refer to $\varepsilon_0(N, h, \alpha)$ as the confidence interval.
Importance of the VC Dimension
- Conclusions:
- 1. With probability $1 - \alpha$, the generalization error is bounded by the guaranteed risk, $P(\mathbf{w}) \le \nu(\mathbf{w}) + \varepsilon_1$, where the confidence interval $\varepsilon_1$ is built from $\varepsilon_0(N, h, \alpha)$ and the training error $\nu(\mathbf{w})$.
- 2. For a small training error (close to zero), the confidence interval behaves like $\varepsilon_0^{2}(N, h, \alpha)$, so the gap between training and generalization error shrinks rapidly with $N$.
- 3. For a large training error (close to unity), the confidence interval behaves like $\varepsilon_0(N, h, \alpha)$, and the gap shrinks more slowly with $N$.
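The following minimal sketch turns the confidence interval $\varepsilon_0(N, h, \alpha)$ defined above into code and computes the simple additive guaranteed-risk bound $\nu(\mathbf{w}) + \varepsilon_0(N, h, \alpha)$, which follows directly from (vc.2). The values of $h$, $\alpha$, and the training error below are illustrative assumptions.

```python
# Sketch: confidence interval eps0(N, h, alpha) and the additive
# guaranteed-risk bound P(w) <= nu(w) + eps0(N, h, alpha).
import math

def eps0(N, h, alpha):
    """sqrt((h/N)[ln(2N/h) + 1] - (1/N) ln(alpha/4))."""
    return math.sqrt((h / N) * (math.log(2 * N / h) + 1)
                     - (1 / N) * math.log(alpha / 4))

def guaranteed_risk(train_error, N, h, alpha=0.05):
    return train_error + eps0(N, h, alpha)

for N in (1_000, 10_000, 100_000):
    print(f"N = {N:>7d}   eps0 = {eps0(N, 50, 0.05):.3f}   "
          f"guaranteed risk = {guaranteed_risk(0.08, N, 50):.3f}")
```

The printed values show how the confidence term, and with it the guaranteed risk, shrinks as the sample size grows for a fixed VC dimension.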
Structural Risk Minimization
- The training error is the frequency of errors made during the training session by a machine with weight vector $\mathbf{w}$.
- The generalization error is the frequency of errors made by the machine when it is tested with examples not seen before.
- Let these two errors be denoted by $\nu_{\mathrm{train}}(\mathbf{w})$ and $\nu_{\mathrm{gen}}(\mathbf{w})$, respectively.
Structural Risk Minimization
- Let $h$ be the VC dimension of a family of classification functions $\{F(\mathbf{x}, \mathbf{w}) : \mathbf{w} \in \mathcal{W}\}$ with respect to the input space $\mathcal{X}$.
- With probability $1 - \alpha$, the generalization error $\nu_{\mathrm{gen}}(\mathbf{w})$ is lower than the guaranteed risk, defined as the sum of two competing terms:
$$ \nu_{\mathrm{guarant}}(\mathbf{w}) = \nu_{\mathrm{train}}(\mathbf{w}) + \varepsilon_1, $$
where the confidence interval $\varepsilon_1$ is defined as before.
Structural Risk Minimization
- For a fixed number of training samples $N$, the training error decreases monotonically as the capacity $h$ is increased, whereas the confidence interval increases monotonically.
Structural Risk Minimization
- The challenge in solving a supervised learning
problem lies in realizing the best generalization
performance by matching the machine capacity to
the available amount of training data for the
problem at hand. The method of structural risk
minimization provides an inductive procedure to
achieve this goal by making the VC dimension of
the learning machine a control variable.
Structural Risk Minimization
- Consider an ensemble of pattern classifiers $\{F(\mathbf{x}, \mathbf{w}) : \mathbf{w} \in \mathcal{W}\}$, and define a nested structure of $n$ such machines,
$$ \mathcal{F}_k = \{ F(\mathbf{x}, \mathbf{w}) : \mathbf{w} \in \mathcal{W}_k \}, \qquad k = 1, 2, \ldots, n, $$
such that we have
$$ \mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}_n. $$
- Correspondingly, the VC dimensions of the individual pattern classifiers satisfy
$$ h_1 \le h_2 \le \cdots \le h_n, $$
which implies that the VC dimension of each classifier is finite (see next figure).
- Illustration of the relationship between the training error, the confidence interval, and the guaranteed risk (figure).
Structural Risk Minimization
- Then:
- a) The empirical risk (training error) of each classifier is minimized.
- b) The pattern classifier with the smallest guaranteed risk is identified; this particular machine provides the best compromise between the training error (quality of the approximation) and the confidence interval (complexity of the approximating function).
Structural Risk Minimization
- Our goal is to find a network structure such that decreasing the VC dimension occurs at the expense of the smallest possible increase in training error.
- We can achieve this, for example, by varying $h$ through the number of hidden neurons.
- We evaluate an ensemble of fully connected multilayer feedforward networks in which the number of neurons in one of the hidden layers is increased in a monotonic fashion.
Structural Risk Minimization
- The principle of SRM states that the best network
in this ensemble is the one for which the
guaranteed risk is the minimum.
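As a closing illustration, the sketch below applies the SRM recipe to a nested family of multilayer perceptrons whose hidden-layer width grows monotonically. It is a rough, illustrative implementation: the dataset is synthetic, the VC dimension of each machine is approximated by its parameter count $W$, and the confidence term reuses the $\varepsilon_0$ expression given earlier; none of these choices come from the original text.

```python
# Sketch of structural risk minimization: train each member of a nested
# family of MLPs (increasing hidden width), compute its training error and
# a crude confidence term, and select the machine with the smallest
# guaranteed risk (training error + confidence interval).
import math
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

def eps0(N, h, alpha=0.05):
    return math.sqrt((h / N) * (math.log(2 * N / h) + 1)
                     - (1 / N) * math.log(alpha / 4))

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
N, m = X.shape

best = None
for width in (2, 4, 8, 16, 32, 64):
    net = MLPClassifier(hidden_layer_sizes=(width,), max_iter=2000,
                        random_state=0).fit(X, y)
    train_error = 1.0 - net.score(X, y)            # empirical risk, 0-1 loss
    W = (m + 1) * width + (width + 1) * 1          # parameter count as capacity proxy
    guaranteed = train_error + eps0(N, W)
    print(f"width {width:>3d}: train error {train_error:.3f}, "
          f"guaranteed risk {guaranteed:.3f}")
    if best is None or guaranteed < best[1]:
        best = (width, guaranteed)

print("SRM choice: hidden width", best[0])
```

As the width grows, the training error falls while the confidence term rises, and the selected width is the one where the two terms strike the best balance, which is exactly the trade-off pictured in the earlier figure.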