Title: Nature Inspired Learning: Classification and Prediction Algorithms
1. Nature Inspired Learning: Classification and Prediction Algorithms
- Šarunas Raudys
- Computational Intelligence Group
- Department of Informatics
- Vilnius University, Lithuania
- e-mail: sarunas_at_raudys.com
- Juodkrante, 2009 05 22
2. Nature inspired learning
Statics: accuracy, and the relations between sample size and complexity, e.g. for the linear discriminant with weight vector $W = S^{-1}(M_1 - M_2)$.
Dynamics: learning rapidity becomes a very important issue (the perceptron).
4. Nature inspired learning
[Figure: perceptron diagram — inputs x1, x2, …, xp, weighted sum, nonlinearity, output y]
- A non-linear Single Layer Perceptron (SLP) is a main element in ANN theory.
5. Nature inspired learning
- TRAINING THE SINGLE LAYER PERCEPTRON: OUTLINE
[Figure: a plot of 300 bivariate vectors (dots and pluses) sampled from two Gaussian pattern classes, the linear decision boundary, and the weight trajectory from START to FINISH]
Three tasks: minimization of deviations (regression), classification, and clusterization (if target2 = target1).
6.
- 1. Cost function and training the SLP used for classification.
- 2. When to stop training?
- 3. Seven types of classifiers obtained while training the SLP:
  - 1. Euclidean distance (only the means),
  - 2. Regularized,
  - 3. Fisher, or
  - 4. Fisher with pseudo-inversion of S,
  - 5. Robust,
  - 6. Minimal empirical error,
  - 7. Support vector (maximal margin).
- How to train the SLP in the best way?
CLASSIFICATION: the 2-category case; I will also speak about the multi-category case.
7. Nature inspired learning
- Training the non-linear SLP
[Figure: perceptron diagram — inputs X = (x1, x2, …, xp), weighted sum, nonlinearity, output y; training data indexed 1, 2, …, N]
output $o = f(V^T X + v_0)$, where $f(\text{net})$ is a non-linear activation function, e.g. the sigmoid function $f(\text{net}) = 1/(1 + e^{-\text{net}})$, and $v_0$, $V^T = (v_1, v_2, \ldots, v_p)$ are the weights of the DF (the STANDARD case).
8. TRAINING THE SINGLE LAYER PERCEPTRON BASED CLASSIFIER
$o = f(V^T X + v_0)$, where $f(\text{net})$ is a non-linear activation function, and $v_0$, $V^T = (v_1, v_2, \ldots, v_p)$ are the weights.
Cost function (Amari, 1967; Tsypkin, 1966): $C = \frac{1}{N} \sum_j (y_j - f(V^T X_j + v_0))^2$.
Training rule: $V_{t+1} = V_t - \eta \times \text{gradient}$, where $\eta$ is a learning step parameter and $y_j$ is the training signal (desired output).
[Figure: the cost-function landscape — the trajectory from V(0) towards V(FINISH), the minimum of the empirical cost function; the true (unknown) minimum lies elsewhere, hence optimal stopping.]
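As a concrete illustration of the cost function and training rule above, here is a minimal Python/NumPy sketch (not from the talk; the data, learning step and epoch count are hypothetical) of gradient-descent training of a sigmoid SLP with the sum-of-squares cost:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_slp(X, y, eta=0.5, epochs=200):
    """Gradient descent on C = 1/N * sum_j (y_j - f(V'X_j + v0))^2
    with a sigmoid activation f; returns the weights V and the bias v0."""
    N, p = X.shape
    V, v0 = np.zeros(p), 0.0                    # V_start = 0
    for _ in range(epochs):
        o = sigmoid(X @ V + v0)                 # outputs o_j = f(V'X_j + v0)
        delta = (y - o) * o * (1.0 - o)         # (y_j - o_j) * f'(net_j)
        V += eta * (2.0 / N) * X.T @ delta      # V_{t+1} = V_t - eta * gradient
        v0 += eta * (2.0 / N) * delta.sum()
    return V, v0

# Hypothetical usage: two Gaussian pattern classes with targets 0.1 and 0.9.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (150, 2)), rng.normal(1, 1, (150, 2))])
y = np.hstack([np.full(150, 0.1), np.full(150, 0.9)])
V, v0 = train_slp(X, y)
pred = sigmoid(X @ V + v0) > 0.5
print(np.mean(pred == (y > 0.5)))               # training classification accuracy
```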
9. Training the Non-linear Single Layer Perceptron
$V_{t+1} = V_t - \eta \times \text{gradient}$
[Figure: the training-data cost landscape versus the true landscape — gradient descent heads towards the empirical minimum (Finish), while the true minimum $V_{\text{ideal}}$ lies elsewhere; optimal stopping lies between them.]
10. A general principle: early versus late stopping
$V_{t+1} = V_t - \eta \times \text{gradient}$
$V_{\text{opt}} = \alpha_{\text{opt}} V_{\text{start}} + (1 - \alpha_{\text{opt}}) V_{\text{finish}}$ (Raudys & Amari, 1998).
[Figure: accuracy as a function of the stopping moment — early stopping, optimal stopping, late stopping; the majority, who stopped too late, are here.]
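A small sketch (helper name hypothetical, not from the paper) of how the Raudys–Amari interpolation formula above could be applied in practice: search the segment between $V_{\text{start}}$ and $V_{\text{finish}}$ for the point with the lowest cost on a held-out (or pseudo-)validation set, used as a stand-in for the unknown true cost:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def interpolated_stopping(V_start, V_finish, X_val, y_val, n_alphas=101):
    """Evaluate V_opt = a*V_start + (1-a)*V_finish on a grid of a values and
    keep the one with the lowest validation cost (X_val is assumed to carry
    a constant 1 column, so the bias is part of the weight vector)."""
    best_a, best_cost = 0.0, np.inf
    for a in np.linspace(0.0, 1.0, n_alphas):
        V = a * V_start + (1.0 - a) * V_finish
        cost = np.mean((y_val - sigmoid(X_val @ V)) ** 2)
        if cost < best_cost:
            best_a, best_cost = a, cost
    return best_a * V_start + (1.0 - best_a) * V_finish, best_a
```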
11. Nature inspired learning
- Where to use early stopping? Knowledge discovery in very large databases.
[Figure: sequential training on Data Set 1, Data Set 2, Data Set 3 — train on each new set, but stop training early in order to preserve the previously learned information.]
12. Standard sum of squares cost function: standard regression
$C = \frac{1}{N} \sum_j (y_j - f(V^T X_j + v_0))^2$. We assume that the data are normalized (zero means, unit variances).
Let the correlations between the input variables x1, x2, …, xp be zero. Then, starting from zero weights, the components of vector V will be proportional to the correlations between x1, x2, …, xp and y. We may obtain such a regression already after the first iteration.
Gradient descent training algorithm: $V_{t+1} = V_t - \eta \times \text{gradient}$.
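The claim that one gradient step from zero weights yields correlation-proportional weights is easy to check numerically; a sketch with synthetic, approximately standardized and uncorrelated data (all values hypothetical), using a linear output for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 1000, 5
X = rng.standard_normal((N, p))            # approximately standardized, uncorrelated inputs
y = X @ np.array([0.8, -0.5, 0.3, 0.0, 0.1]) + 0.2 * rng.standard_normal(N)
y = (y - y.mean()) / y.std()               # normalize the target as well

eta = 0.5
V = np.zeros(p)                            # V_start = 0
grad = -(2.0 / N) * X.T @ (y - X @ V)      # gradient of the sum-of-squares cost
V_first = V - eta * grad                   # one gradient-descent step

corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(p)])
print(V_first / corr)                      # approximately constant: weights ~ correlations
```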
13. SLP AS SIX REGRESSIONS
[Figure: the training trajectory, starting from START.]
14. Nature inspired learning. Robust regression
[Figure: the square loss $(y_j - V^T X_j)^2$ versus a robust loss, both plotted as functions of the residual $y_j - V^T X_j$.]
In order to obtain robust regression, instead of the square function we have to use a robust function of the residual.
Š. Raudys (2000). Evolution and generalization of a single neurone. III. Primitive, regularized, standard, robust and minimax regressions. Neural Networks, 13 (3/4), 507-523.
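The Raudys (2000) paper defines its own robust loss; as a generic stand-in, the sketch below (parameter values hypothetical) performs gradient descent with the Huber loss, which is quadratic for small residuals and grows only linearly for large ones, so outliers have bounded influence:

```python
import numpy as np

def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss w.r.t. the residual r: proportional to r for
    small residuals, clipped to +-delta for large ones (bounded outlier influence)."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def robust_regression(X, y, eta=0.01, epochs=500, delta=1.0):
    """Gradient descent on a robust cost instead of the sum of squares."""
    V = np.zeros(X.shape[1])
    for _ in range(epochs):
        r = y - X @ V                       # residuals y_j - V'X_j
        grad = -X.T @ huber_grad(r, delta) / len(y)
        V -= eta * grad
    return V
```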
15. A real world problem. Use of robust regression to distinguish the very weak baby signal from the mother's ECG. Robust regression pays attention to the smallest deviations, not to the largest ones, which are considered outliers.
[Figure: the mother-and-fetus (baby) ECG, the two recorded signals, and the result — the extracted fetus signal.]
16. Nature inspired learning. Standard and regularized regression
Use statistical methods to perform diverse whitening data transformations, where the input variables x1, x2, …, xp are decorrelated and scaled in order to have the same variances. Then, while training the perceptron in the transformed feature space, we can obtain the standard regression after the very first iteration.
$X_{\text{new}}^T = X_{\text{old}}^T \Phi \Lambda^{-1/2}$, where $S_{XX} = \Phi \Lambda \Phi^T$ is the singular value (eigen) decomposition of the covariance matrix $S_{XX}$, and $V_{\text{start}} = 0$.
Speeding up the calculations (convergence): if $S_{XX} \leftarrow S_{XX} + \lambda I$, we obtain regularized regression. Moreover, we can equalize the eigenvalues and speed up the training process.
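A sketch (helper name hypothetical) of the whitening transform described above, with the optional $\lambda I$ term that yields the regularized variant:

```python
import numpy as np

def whiten(X, lam=0.0):
    """Whitening transform X_new = (X - mean) @ Phi @ Lambda^{-1/2}, where
    S_XX = Phi Lambda Phi^T; adding lam*I to S_XX gives the regularized
    variant (and lifts tiny eigenvalues). Assumes S_XX is non-singular
    when lam = 0."""
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    if lam > 0.0:
        S = S + lam * np.eye(S.shape[0])    # S_XX <- S_XX + lambda*I
    eigval, Phi = np.linalg.eigh(S)         # S = Phi diag(eigval) Phi^T
    X_new = Xc @ Phi @ np.diag(eigval ** -0.5)
    return X_new, Phi, eigval
```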
17. SLP AS SEVEN STATISTICAL CLASSIFIERS
[Figure: the training trajectory from START — small weights give the simplest classifier, large weights give the more complex ones.]
18. Nature inspired learning
There are conditions under which we obtain the Euclidean distance classifier just after the first iteration. When we train further, we have regularized discriminant analysis (RDA):
$V_{t+1} \propto \left(\frac{2}{(t-1)\eta} I + S\right)^{-1} (M_1 - M_2)$,
where $\lambda = 2/((t-1)\eta)$ is the regularization parameter; $\lambda \to 0$ with an increase in the number of training iterations, giving the Fisher classifier, or the Fisher classifier with a pseudo-inverse of the covariance matrix.
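The RDA weight expression above can also be computed directly; a sketch (function name and values hypothetical) showing how the same formula interpolates between the Euclidean distance classifier (large $\lambda$, i.e. few iterations) and the Fisher classifier ($\lambda \to 0$, i.e. prolonged training):

```python
import numpy as np

def rda_weights(M1, M2, S, lam):
    """Weights (lam*I + S)^{-1} (M1 - M2): for large lam (early training) the
    direction approaches the Euclidean distance classifier M1 - M2; for
    lam -> 0 (prolonged training) it becomes the Fisher classifier, or the
    Fisher classifier with a pseudo-inverse when S is singular."""
    p = S.shape[0]
    if lam > 0.0:
        return np.linalg.solve(lam * np.eye(p) + S, M1 - M2)
    return np.linalg.pinv(S) @ (M1 - M2)

# Hypothetical illustration: lam = 2 / ((t - 1) * eta) shrinks as t grows.
eta, S = 0.1, np.array([[2.0, 0.5], [0.5, 1.0]])
M1, M2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
for t in (2, 20, 2000):
    print(t, rda_weights(M1, M2, S, lam=2.0 / ((t - 1) * eta)))
```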
19. Nature inspired learning. Standard approach.
- Use the diversity of statistical methods and multivariate models in order to obtain an efficient estimate of the covariance matrix. Then perform whitening data transformations, where the input variables are decorrelated and scaled in order to have the same variances.
- While training the perceptron in the transformed feature space, we can obtain the Euclidean distance classifier after the first iteration. In the original feature space it corresponds to the Fisher classifier, or to a modification of the Fisher classifier (depending on the method used to estimate the covariance matrix).
[Figure: untransformed data with the Fisher classifier versus transformed data with the Euclidean classifier — the Euclidean classifier in the transformed space equals the Fisher classifier in the original space.]
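A numerical check of the equivalence stated above (synthetic data, all values hypothetical): the Euclidean distance classifier built in the whitened feature space uses the same decision direction as the Fisher classifier in the original space:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 500
S_true = np.diag([1.0, 2.0, 0.5, 3.0])
X1 = rng.multivariate_normal(np.zeros(p), S_true, n)
X2 = rng.multivariate_normal(np.ones(p), S_true, n)

S = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))  # pooled covariance
eigval, Phi = np.linalg.eigh(S)
T = Phi @ np.diag(eigval ** -0.5)           # whitening matrix

M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
V_new = T.T @ (M1 - M2)                     # EDC weights in the whitened space
V_orig = T @ V_new                          # the same decision, expressed in the original space
V_fisher = np.linalg.solve(S, M1 - M2)      # Fisher weights S^{-1}(M1 - M2)
print(np.allclose(V_orig, V_fisher))        # True
```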
20. Nature inspired learning
- [Figure: generalisation errors of the EDC, Fisher and Quadratic classifiers.]
21. A real world problem. There are dozens of ways to estimate the covariance matrix and perform the whitening data transformation. This is additional information (if correct) that can be useful in SLP training.
196-dimensional data (handwritten characters).
S. Raudys, M. Iwamura. Structures of the covariance matrix in handwritten character recognition. Lecture Notes in Computer Science, 3138, 725-733, 2004.
S. Raudys, A. Saudargiene. First-order tree-type dependence between variables and classification performance. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23 (2), 233-239, 2001.
22. Covariance matrices are different.
[Figure: decision boundaries of the EDC, LDF, QDF and the Anderson-Bahadur (AB) linear DF — the AB and Fisher boundaries are different.]
If we started from the AB decision boundary rather than from the Fisher one, the result would be better. Hence, we have proposed a special method of input data transformation.
S. Raudys (2004). Integration of statistical and neural methods to design classifiers in case of unequal covariance matrices. Lecture Notes in Artificial Intelligence, Springer-Verlag, Vol. 3238, pp. 270-280.
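For reference, a sketch (function name, grid search and thresholding are my assumptions; Gaussian classes and equal priors assumed; requires SciPy) of an Anderson-Bahadur style linear DF: weights $(k S_1 + (1-k) S_2)^{-1}(M_1 - M_2)$ with the scalar k chosen to minimise the estimated Gaussian error, and the threshold placed to equalise the two class-conditional error rates:

```python
import numpy as np
from scipy.stats import norm

def anderson_bahadur(M1, M2, S1, S2, n_grid=99):
    """Search k in (0, 1): V(k) = (k*S1 + (1-k)*S2)^{-1} (M1 - M2); keep the k
    with the smallest Gaussian error estimate for the projected classes."""
    best = None
    for k in np.linspace(0.01, 0.99, n_grid):
        V = np.linalg.solve(k * S1 + (1.0 - k) * S2, M1 - M2)
        m1, m2 = V @ M1, V @ M2                       # projected class means (m1 > m2)
        s1, s2 = np.sqrt(V @ S1 @ V), np.sqrt(V @ S2 @ V)
        theta = (m1 * s2 + m2 * s1) / (s1 + s2)       # equalises the two error rates
        err = 0.5 * (norm.cdf((theta - m1) / s1) + norm.cdf((m2 - theta) / s2))
        if best is None or err < best[0]:
            best = (err, V, theta, k)
    return best[1], best[2], best[3]                  # weights, threshold, chosen k
```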
23. Non-linear discrimination. Similarity features (LNCS 3686, pp. 136-145, 2005)
[Figure, panels a-d: 100 + 100 2D two-class training vectors (pluses and circles) and the decision boundaries of Kernel Discriminant Analysis (a), the SV classifier (b), and the SLP trained in a 200D dissimilarity feature space (c). Learning curve: the generalization error of the SLP classifier as a function of the number of training epochs, with the optimal stopping point of the SLP marked (d).]
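A sketch (function name hypothetical) of the dissimilarity feature construction used in panel (c): each vector is represented by its Euclidean distances to the training vectors (100 + 100 prototypes give a 200D feature space); the SLP from the earlier sketch can then be trained, with early stopping, on these features:

```python
import numpy as np

def dissimilarity_features(X, prototypes):
    """Map each input vector to its Euclidean distances to a set of prototype
    (training) vectors; a linear SLP trained on these features produces a
    non-linear decision boundary in the original space."""
    # squared pairwise distances ||x - p||^2 for every x in X and p in prototypes
    d2 = (X ** 2).sum(1)[:, None] - 2 * X @ prototypes.T + (prototypes ** 2).sum(1)[None, :]
    return np.sqrt(np.maximum(d2, 0.0))
```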
24. Nature inspired learning. A noise injection
A coloured noise is used to form a pseudo-validation set: we add noise in the directions of the closest training vectors, so we almost do not distort the geometry of the data. In this technique we use additional information: the space between neighbouring points in the multidimensional feature space is not empty, it is filled by vectors of the same class.
The pseudo-validation data set is used to realize early stopping.
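A minimal sketch (parameters hypothetical) of the coloured-noise injection described above: each pseudo-validation vector is obtained by moving a training vector part of the way towards one of its nearest neighbours; it should be applied separately within each class so that the generated vectors stay inside the class's own region:

```python
import numpy as np

def pseudo_validation_set(X, k=2, noise_scale=0.5, copies=1, rng=None):
    """Coloured-noise injection: for each training vector of one class, add
    noise directed towards one of its k nearest neighbours, so the geometry
    of the data is almost not distorted."""
    rng = np.random.default_rng() if rng is None else rng
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]               # indices of the k nearest neighbours
    out = []
    for _ in range(copies):
        j = nn[np.arange(len(X)), rng.integers(0, k, len(X))]
        direction = X[j] - X                          # direction to a random near neighbour
        out.append(X + noise_scale * rng.random((len(X), 1)) * direction)
    return np.vstack(out)
```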
25. Nature inspired learning. Multi-category cases
[Figure: decision regions for classes 1 and 2.]
Pair-wise classifiers, optimally stopped (using the noise-injected pseudo-validation sets); the SLPs are combined by H-T fusion. We need to obtain a classifier (SLP) of optimal complexity: early stopping.
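A sketch of the pairwise (one-vs-one) scheme: one binary SLP per pair of classes (each could be trained with early stopping on its own pseudo-validation set), with the decisions fused by simple voting as a stand-in for the Hastie-Tibshirani (H-T) coupling named on the slide; `train_binary` and `decision` are hypothetical caller-supplied callables:

```python
import numpy as np
from itertools import combinations

def train_pairwise(X, labels, train_binary):
    """One binary classifier per pair of classes (one-vs-one)."""
    classes = np.unique(labels)
    models = {}
    for a, b in combinations(classes, 2):
        mask = (labels == a) | (labels == b)
        models[(a, b)] = train_binary(X[mask], (labels[mask] == a).astype(float))
    return classes, models

def predict_pairwise(X, classes, models, decision):
    """Fuse the pairwise decisions by voting (a simple stand-in for H-T coupling)."""
    votes = np.zeros((len(X), len(classes)))
    index = {c: i for i, c in enumerate(classes)}
    for (a, b), model in models.items():
        pred_a = decision(model, X)                  # boolean array: "class a wins"
        votes[np.arange(len(X)), np.where(pred_a, index[a], index[b])] += 1
    return classes[np.argmax(votes, axis=1)]
```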
26. Learning Rapidity. Two Pattern Recognition (PR) tasks
The time to learn the second task is restricted, say to 300 training epochs.
Parameters that affect learning rapidity: $\eta$, the learning step; the growth of the weights; $s = \text{target}_1 - \text{target}_2$.
Regularization: a) a weight decay term, b) a noise injection into the input vectors, c) a corruption of the targets; scaling of the starting weights, $W_{\text{start}} \leftarrow w \cdot W_{\text{start}}$, where w also controls learning rapidity.
The key parameters: $\eta$, s, and w.
27. Optimal values of the learning parameters
[Figure: optimal values of the learning parameters $\eta$ (the learning step), $s = \text{target}_1 - \text{target}_2$, and w as functions of the number of epochs.]
28. Collective learning. A lengthy sequence of diverse PR tasks
The angle and/or the time between two changes vary all the time.
29. The multi-agent system composed of adaptive agents: the single layer perceptrons
In order to survive, the agents should learn rapidly. Unsuccessful agents are replaced by newborn ones. Inside the group the agents help each other; in a case of emergency, they help the weakest groups. Genetic learning is combined with adaptive learning.
The moral: a single agent (SLP) cannot learn a very long sequence of PR tasks successfully.
30. The power of the PR task changes and the parameter s as functions of time
[Figure: the power of the changes between PR tasks and $s = t_1 - t_2$ plotted over time; s follows the variation of the power of the changes.]
I tried to learn s, emotions, altruism, the noise intensity, the length of the learning set, etc.
31. Integrating Statistical Methods and Neural Networks. Nature inspired learning
Regression: Neural Networks, 13 (3/4), pp. 507-523, 2000.
The theory for the equal covariance matrix case.
The theory for unequal covariance matrices and the multicategory cases: LNCS, 4432, pp. 1-10, 2007; LNCS, 4472, pp. 62-71, 2007; LNCS, 4142, pp. 47-56, 2006; LNAI, 3238, pp. 270-280, 2004; JMLR; ICNC'08.