Title: Multivariate classification. SIMCA
1Multivariate classification. SIMCA
- Sergey Kucheryavskiy
- svk_at_aaue.dk
2Lecture outline
- Basic theory
- What is classification?
- Types and stages
- Geometrical view
- How to estimate classification results?
- Multivariate classification with SIMCA
- Introduction
- Examples
- Conclusions
3Part I. Basic theory
4Classification questions and answers
- How to recognize fake pills by their spectra?
- How to distinguish different types of glass, knowing their metal oxide content?
- How to recognize a human personality type (introvert/extravert) using questionnaire answers?
Classification: discrimination (arranging) of objects into several groups (classes) by finding analogies in their feature values
5What do we have?
- Object: anything (person, object, phenomenon, process, ...)
- Features: a set of variables and their values describing the object
- Group or class: a set of objects having similar (analogous) features
- Example (People dataset)
Object: a person
Features: height, weight, hair length, swimming ability, ...
Groups: sex, location
6Geometrical view
- Features: variables, the axes of the coordinate space
- Objects: points in this space
- Classes: subspaces of the variable space (hypercube, hypersphere, etc.)
[Figure: objects shown as points in the space of the features, grouped into regions labeled Class 1 and Class 2]
7Classification methods
Unsupervised classification
- Conditions (level of knowledge): no information about whether there are any classes or how many
  Methods: looking for analogies in the objects' locations in feature space
  Aim: looking for groupings of objects
- Conditions (level of knowledge): we know how many classes there are, but not which class an object belongs to
  Methods: looking for analogies in the objects' locations in feature space
  Aim: finding the reason for the grouping, clustering the data
Supervised classification
- Conditions (level of knowledge): we have a calibration dataset with objects of known classes
  Methods: calibrating a classification model
  Aim: predicting the class of unknown samples
8Example linear discriminant analysis
Step 1. The initial data: 26 samples, 2 features; we know nothing about any groups.
[Figure: scatter plot of the objects in the space of the two features]
9Example linear discriminant analysis
Step 2. Visual analysis of the geometrical representation of the data gives us information about the grouping.
Step 3. The reason for the grouping is found: a certain combination of variable values.
10Example linear discriminant analysis
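- Step 4. Building a classification model
- Find the equation of the separation line
- Find the equation of the perpendicular line
- Find the equation for projecting the samples
[Figure: samples projected onto the axis y, with y < 0 on one side of the separation line and y > 0 on the other]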
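Step 4 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the exact procedure behind the slide: the separation line is taken as the perpendicular bisector of the segment joining the two class centroids (full linear discriminant analysis would also use the pooled covariance), and the data and the names fit_line_classifier and project are made up for the example.

```python
import numpy as np

def fit_line_classifier(X1, X2):
    """Return (w, b) such that y = x @ w + b is the signed distance to the separation line."""
    c1, c2 = X1.mean(axis=0), X2.mean(axis=0)
    w = (c2 - c1) / np.linalg.norm(c2 - c1)  # unit vector along the perpendicular line
    b = -w @ (c1 + c2) / 2                   # separation line passes through the midpoint
    return w, b

def project(X, w, b):
    """Project samples onto the perpendicular line: y < 0 for class 1, y > 0 for class 2."""
    return X @ w + b

# Synthetic stand-in for the 26 samples with 2 features (assumed data)
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.3, size=(13, 2))
X2 = rng.normal([3.0, 3.0], 0.3, size=(13, 2))

w, b = fit_line_classifier(X1, X2)
y1, y2 = project(X1, w, b), project(X2, w, b)
print("class 1 projections negative:", bool(np.all(y1 < 0)))
print("class 2 projections positive:", bool(np.all(y2 > 0)))
```

The sign of y is the prediction; its magnitude tells how far the sample lies from the separation line.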
11Classification methods
- One-class classification
- A classification model is calibrated for each class
- The model gives a binary prediction
- 0: the sample belongs to the class
- 1: the sample does not belong to the class
- Multi-class classification
- A single model describing several classes is calibrated
- The model gives the class number as its prediction
- Multi-class classification can be done using several one-class classification models!
12The world is not so green and easy
Often, samples in real data are not clearly separated. There may be outliers. How can the quality of classification be estimated?
13Classification errors
- Classification errors
- Type I errors (false negatives): a sample belongs to the class, but the model said no
- Type II errors (false positives): a sample does not belong to the class, but the model said yes
- Decreasing type I errors increases type II errors, and vice versa
The choice depends on the problem!
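The trade-off can be seen by moving the acceptance threshold of a one-class model. The sketch below assumes a model that accepts a sample when its distance to the class is below a threshold; the distance values are invented for illustration.

```python
import numpy as np

# Hypothetical distances from samples to a class model (smaller = more similar)
d_members = np.array([0.5, 0.8, 1.0, 1.3, 1.9, 2.4])    # samples that truly belong to the class
d_strangers = np.array([1.6, 2.2, 2.8, 3.1, 3.5, 4.0])  # samples that truly do not belong

def error_counts(threshold):
    """Count both error types when samples with distance <= threshold are accepted."""
    false_neg = int(np.sum(d_members > threshold))     # type I: class members rejected
    false_pos = int(np.sum(d_strangers <= threshold))  # type II: strangers accepted
    return false_neg, false_pos

for t in (1.5, 2.0, 2.5, 3.0):
    fn, fp = error_counts(t)
    print(f"threshold={t}: type I = {fn}, type II = {fp}")
```

Raising the threshold rejects fewer class members but accepts more strangers, so the threshold has to be chosen according to which error is more costly.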
14Classification errors
- Decrease type I errors when it is very important not to lose a sample of the class
- Examples: recognition of hazardous substances, medical diagnosis
- Decrease type II errors when it is more important not to accept a wrong sample
- Examples: legal procedure (presumption of innocence)
15How to compare classification models?
16Back to our example
- Find the centroids (centers) of the classes
- Calculate the distances from the centroids to the samples
- Analyze the distance plot
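These three steps amount to a couple of NumPy calls. The sketch assumes Euclidean distance and uses small invented arrays; the function name centroid_distances is hypothetical.

```python
import numpy as np

def centroid_distances(X, X1, X2):
    """Euclidean distances from each row of X to the centroids of class 1 and class 2."""
    c1, c2 = X1.mean(axis=0), X2.mean(axis=0)
    d1 = np.linalg.norm(X - c1, axis=1)
    d2 = np.linalg.norm(X - c2, axis=1)
    return d1, d2

X1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # centroid at (0.5, 0.5)
X2 = np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0], [5.0, 5.0]])  # centroid at (4.5, 4.5)
new = np.array([[0.5, 0.5], [4.5, 4.5], [10.0, -5.0]])

d1, d2 = centroid_distances(new, X1, X2)
for a, b in zip(d1, d2):
    print(f"distance to class 1: {a:.2f}, distance to class 2: {b:.2f}")
```

Plotting d1 against d2 for all samples gives the distance plot analyzed on the next slides.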
17Coomans (distance) plot
Distance to the center of class 2
Samples of class 1: far from class 2, close to class 1
Outliers: far from class 2, far from class 1
Samples of class 2: far from class 1, close to class 2
Samples belonging to both classes: close to class 1, close to class 2
Distance to the center of class 1
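The four regions of the plot translate directly into a decision rule. The sketch below assumes one distance limit per class; the limits and the function name coomans_region are hypothetical, and in practice the limits come from the chosen significance level.

```python
def coomans_region(d1, d2, limit1, limit2):
    """Name the Coomans plot region for a sample with distances d1 and d2 to the two classes."""
    near1 = d1 <= limit1
    near2 = d2 <= limit2
    if near1 and near2:
        return "both classes"
    if near1:
        return "class 1"
    if near2:
        return "class 2"
    return "outlier"

# Assumed limits of 1.0 for both classes
print(coomans_region(0.4, 3.0, 1.0, 1.0))
print(coomans_region(3.0, 0.4, 1.0, 1.0))
print(coomans_region(0.8, 0.9, 1.0, 1.0))
print(coomans_region(3.0, 3.0, 1.0, 1.0))
```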
18Coomans plot
[Figure: Coomans plot with points labeled S1 and S2; axes: distance to the center of class 1 and distance to the center of class 2]
19Conclusions
- Classification: the process of arranging objects into two or more groups (classes)
- Supervised classification implies
- Building a classification model using a calibration set
- Estimating the discrimination power of the model
- Using the model to predict the class of new, unknown samples
20Classification methods
- Simple
- Linear and quadratic discriminant analysis
- K-nearest neighbors
- Cluster analysis
- More complex
- Bayesian classification
- Support vector machines
- Neural networks
21Known problems
- Number of features: 10, 100, 1000
- It is not possible to analyze the data visually
- It is quite difficult to find which features are relevant to the problem
- Data contain noise and outliers
- Variables are correlated
How to tackle these problems? Using projection methods!
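A small experiment shows why projection helps. The sketch below builds synthetic data in which 100 correlated variables are driven by only two hidden factors (an assumption made purely for the demonstration) and projects them with PCA computed through the singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
factors = rng.normal(size=(50, 2))           # two hidden (latent) factors for 50 samples
mixing = rng.normal(size=(2, 100))           # how the factors show up in 100 variables
X = factors @ mixing + 0.01 * rng.normal(size=(50, 100))  # correlated variables plus noise

Xc = X - X.mean(axis=0)                      # mean centering
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)              # fraction of variance per component
scores = U[:, :2] * s[:2]                    # coordinates of the samples in the latent space

print("variance explained by 2 components:", round(float(explained[:2].sum()), 4))
print("shape of scores:", scores.shape)
```

Two latent variables carry almost all the variation of the 100 original ones, so the samples can again be inspected on an ordinary scatter plot of the scores.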
22Part II. Multivariate classification. SIMCA
23Multivariate classification
- Main idea: using projections onto latent variables instead of the original samples/variables
- Unsupervised classification
- PCA classical algorithms
- Supervised classification
- SIMCA
- PLS-DA
24SIMCA
- SIMCA (Soft Independent Modeling of Class Analogy)
- An object may belong to several classes, which is quite typical for real data
- Basic idea: build a separate model for each class
- Proposed by Svante Wold in the 1970s
25SIMCA main steps
- Step one: build a separate PCA model for each class
- Different models can have different numbers of latent variables
- When calibrating a model, be careful about outliers. If data preprocessing is needed, it should be the same for each model
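Step one can be sketched as follows. This is a minimal illustration, not a full SIMCA implementation: preprocessing is limited to mean centering, outlier handling is left out, and the data and the names fit_pca and fit_simca are invented for the example.

```python
import numpy as np

def fit_pca(X, n_components):
    """PCA model of one class: the class mean and the first loading vectors."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return {"mean": mean, "loadings": Vt[:n_components].T}

def fit_simca(X, labels, n_components):
    """One PCA model per class; n_components maps class label -> number of latent variables."""
    classes = sorted(set(labels.tolist()))
    return {c: fit_pca(X[labels == c], n_components[c]) for c in classes}

# Assumed calibration data: two classes, four variables
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(10, 4)),
               rng.normal(5.0, 1.0, size=(12, 4))])
labels = np.array([1] * 10 + [2] * 12)

models = fit_simca(X, labels, n_components={1: 1, 2: 2})
for c, m in models.items():
    print(f"class {c}: loadings shape {m['loadings'].shape}")
```

Each class gets its own mean and its own number of latent variables, which is what makes the class models independent of each other.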
26SIMCA main steps
- Step two: apply the PCA models to each sample and analyze the distances and plots
- Distance between models
- Distance from model to sample
- Leverage of sample for each model
- Coomans plot
- And
- Modeling power of variables
- Discrimination power of variables
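Two of these quantities, the distance from a sample to a model and the leverage, can be computed as in the sketch below. The definitions are assumptions chosen for illustration: the distance is the Euclidean norm of the PCA residual, and the leverage is 1/n plus the sum of squared scores divided by the calibration sum of squares of each component. Software packages scale these quantities differently.

```python
import numpy as np

def fit_pca(X, n_components):
    """PCA model of one class, keeping what is needed for distances and leverages."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return {"mean": mean,
            "loadings": Vt[:n_components].T,
            "score_ss": s[:n_components] ** 2,  # sum of squared calibration scores per component
            "n": X.shape[0]}

def distance_and_leverage(model, X_new):
    Xc = X_new - model["mean"]
    T = Xc @ model["loadings"]                  # scores: projection onto the model
    E = Xc - T @ model["loadings"].T            # residuals: the part the model cannot describe
    distance = np.sqrt(np.sum(E**2, axis=1))
    leverage = 1.0 / model["n"] + np.sum(T**2 / model["score_ss"], axis=1)
    return distance, leverage

# Assumed class: nine samples lying on a straight line in three dimensions
t = np.linspace(-2.0, 2.0, 9)
X_class = np.outer(t, [1.0, 1.0, 0.0])
model = fit_pca(X_class, n_components=1)

X_new = np.array([[1.0, 1.0, 0.0],    # lies on the line
                  [1.0, 1.0, 2.0]])   # lies 2 units off the line
dist, lev = distance_and_leverage(model, X_new)
print("distances:", np.round(dist, 3))
print("leverages:", np.round(lev, 3))
```

A sample with a large distance does not fit the class model; a sample with a large leverage fits the model but lies far from the center of the class.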
27SIMCA models
28Glass dataset
- Pieces of two types of glass
- Vehicle headlights
- Street lamps
39 samples × 5 variables
29PCA analysis
How to make a classification model?
30SIMCA: calibrate separate PCA models
31SIMCA results
Discrimination power
- Shows the ability of each variable to discriminate between two models
Modelling power
- Shows the influence of each variable on the model
32SIMCA results
Model distance
- Shows how different the models are from each other
Coomans plot
- Shows the distance from the samples to the two models
33SIMCA results
- Distance to the model and leverage
34SIMCA results
- Distance to the model and leverage
35SIMCA results
36SIMCA conclusions
- A simple and efficient method for supervised classification
- Allows finding and excluding outliers at the calibration stage
- Allows comparing the models for each class
- Allows assigning a sample to several classes
- One-class classification needs only a calibration set with samples from that class!
37Example 1
- Wine dataset
- 178 samples: three types of wine grown in the same region in Italy but derived from three different cultivars
- Calibration set: 166 samples
- Test set: 12 samples
- 13 variables: Alcohol, Malic acid, Ash, Alcalinity, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315, Proline
38Example 2
- SOT dataset (pills)
- 30 samples: NIR spectra of three types of pills
- genuine: originals
- analogue: the same medicine, but produced by another company
- fake: counterfeit pills
- 1500 ?-variables: NIR spectra of the pills
39SIMCA step by step
- Data preprocessing
- Projection methods are very sensitive to data preprocessing. If there is no a priori information, try at least centering and autoscaling
- Preliminary analysis
- Build and analyze a PCA model for the whole dataset: are there any groups, outliers, or other irregularities?
- Separate class modeling
- Calibrate a separate PCA model for each class's samples. Analyze the scores and loadings plots for outliers and other anomalies. Save the models.
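Centering and autoscaling from the preprocessing step can be sketched as below. The example assumes the scaling parameters are estimated on the calibration set and then reused unchanged for any new samples; the data and the function names are invented.

```python
import numpy as np

def fit_autoscale(X_cal):
    """Estimate the column means and standard deviations on the calibration set."""
    mean = X_cal.mean(axis=0)
    std = X_cal.std(axis=0, ddof=1)
    return mean, std

def apply_autoscale(X, mean, std):
    """Center each variable and scale it to unit standard deviation."""
    return (X - mean) / std

# Assumed calibration data with variables on very different scales
X_cal = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0],
                  [4.0, 400.0]])
mean, std = fit_autoscale(X_cal)
Z = apply_autoscale(X_cal, mean, std)
print("column means after scaling:", np.round(Z.mean(axis=0), 6))
print("column std after scaling:", np.round(Z.std(axis=0, ddof=1), 6))

# A new sample is scaled with the calibration parameters, not with its own
z_new = apply_autoscale(np.array([[2.5, 250.0]]), mean, std)
```

After autoscaling, every variable has the same weight in the PCA model, regardless of its original units.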
40SIMCA step by step
- Analyze class models
- Use a separate test set to discover how good your models are. Classify the samples from the test set and analyze all the plots: Coomans, leverage vs. model distance, distance between models, and so on. Set a proper value for the significance level.
- Classification of unknown samples
- Simply use your model to classify new, unknown samples, keeping in mind the possible classification error rates of your model.