Title: PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION
1PATTERN RECOGNITION: CLUSTERING AND CLASSIFICATION
Richard Brereton, r.g.brereton_at_bris.ac.uk
2- CLUSTER ANALYSIS: UNSUPERVISED PATTERN RECOGNITION
- Grouping of objects according to similarity.
- No predefined classes
3TAXONOMY
4- CHEMICAL TAXONOMY
- Grouping organisms according to similarity from chemical fingerprints
- DNA base pairs, proteins
- NMR and pyrolysis of extracts
- NIR spectra
5- SIMILAR PRINCIPLES IN ALL TYPES OF CHEMISTRY
- Chemical archaeology
- Environmental samples
- Food
6STEPS IN CLUSTER ANALYSIS
Similarity measures.
Calculate the similarity between objects.
Example
7Correlation coefficient: higher means more similar
Euclidean distance: smaller means more similar
8Manhattan distance: smaller means more similar
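The three similarity measures named above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the three made-up objects are assumptions, not data from the slides.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient: higher means more similar."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def euclidean(x, y):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute differences: smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Made-up measurements on three objects (illustrative only)
obj_a = [1.0, 2.0, 3.0, 4.0]
obj_b = [2.0, 4.0, 6.0, 8.0]   # same profile as obj_a, doubled
obj_c = [4.0, 3.0, 2.0, 1.0]   # reversed profile

print(correlation(obj_a, obj_b), correlation(obj_a, obj_c))
print(euclidean(obj_a, obj_b), euclidean(obj_a, obj_c))
print(manhattan(obj_a, obj_b), manhattan(obj_a, obj_c))
```

Note that the measures can disagree: obj_a and obj_b are perfectly correlated yet further apart in distance than obj_a and obj_c, so the choice of measure matters.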
9Use correlations for illustration.
Group samples.
1. Find the most similar pair, i.e. the highest correlation: objects 2 and 5.
2. Combine them.
3. Work out the new correlation of the new combined object (2 and 5) with the other objects (1, 3, 4, 6).
10- Linkage methods: determination of new similarity measures of groups.
- Several methods.
- Nearest neighbour uses the highest correlation
- Furthest neighbour uses the lowest correlation
- Average linkage uses an average.
- Illustrate with nearest neighbour.
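The grouping steps and linkage rules can be sketched as follows. This is a minimal sketch, not the author's implementation: the function name `agglomerate`, the 4-object correlation matrix and the use of a simple mean of the two similarities for average linkage are all assumptions made for illustration.

```python
def agglomerate(labels, sim, linkage="nearest"):
    """Repeatedly merge the two most similar groups; return the merge history.

    labels : list of object names
    sim    : square symmetric matrix (list of lists) of similarities,
             e.g. correlation coefficients (higher = more similar)
    """
    combine = {
        "nearest": max,                       # highest correlation
        "furthest": min,                      # lowest correlation
        "average": lambda a, b: (a + b) / 2,  # simple mean (an assumption)
    }[linkage]
    clusters = [(name,) for name in labels]
    history = []
    while len(clusters) > 1:
        n = len(clusters)
        # 1. Find the most similar pair of groups.
        i, j = max(
            ((a, b) for a in range(n) for b in range(a + 1, n)),
            key=lambda pair: sim[pair[0]][pair[1]],
        )
        history.append((clusters[i], clusters[j], sim[i][j]))
        # 2. Combine them, and 3. work out the similarity of the
        #    new group to every remaining group using the linkage rule.
        keep = [k for k in range(n) if k not in (i, j)]
        new_row = [combine(sim[i][k], sim[j][k]) for k in keep]
        sim = [[sim[a][b] for b in keep] + [new_row[pos]]
               for pos, a in enumerate(keep)]
        sim.append(new_row + [1.0])
        clusters = [clusters[k] for k in keep] + [clusters[i] + clusters[j]]
    return history

# Made-up correlation matrix for four objects (illustrative only)
labels = ["1", "2", "3", "4"]
corr = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.4],
    [0.1, 0.8, 0.4, 1.0],
]

for left, right, similarity in agglomerate(labels, corr, "nearest"):
    print(left, "+", right, "at", similarity)
```

The merge history (which groups joined, and at what similarity) is exactly the information a dendrogram displays.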
12Dendrograms
13- CLUSTER ANALYSIS SUMMARY
- Similarity measures
- Linkage methods
- Dendrogram
14CLASSIFICATION
Many methods.
CONVENTIONAL
LDA (linear discriminant analysis)
Original statistics projections
15Examples
Orange juices: can we classify them by origin, and can we detect adulteration from NIR spectra?
Class modelling of mussels: can we find which come from a polluted site using GC?
Detailed mathematical model
16PRINCIPLES BIVARIATE EXAMPLE
17Often exact cut-off impossible
18Class distance plots
19Multivariate data: several measurements per class
Example: Fisher Iris data, four measurements per iris
Petal width, petal length, sepal width, sepal length
150 irises, divided into 50 of each species
I. setosa
I. versicolor
I. virginica
20SPECIAL DISTANCES USED.
Linear discriminant function between classes A and B
- The first term is simply the difference between the centres of each class, so a more positive value indicates class A.
- The middle term is the inverse of the pooled variance-covariance matrix.
- What does this mean?
- Sometimes measurements are correlated.
- Sometimes classes are more dispersed.
- Puts distances on a common scale.
- The final term is the measurement for each object.
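The equation itself did not survive in this transcript. Reading the three terms described above from left to right, the function presumably has the form below. The symbols (class centres written as row vectors, the pooled matrix C_AB, and the class sizes N_A and N_B) are assumptions chosen for illustration, not notation taken from the slide.

```latex
f_{AB}(\mathbf{x}_i) = (\bar{\mathbf{x}}_A - \bar{\mathbf{x}}_B)\,\mathbf{C}_{AB}^{-1}\,\mathbf{x}_i^{\mathsf{T}},
\qquad
\mathbf{C}_{AB} = \frac{(N_A - 1)\,\mathbf{C}_A + (N_B - 1)\,\mathbf{C}_B}{N_A + N_B - 2}
```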
22- Can shift the scale so that
- positive score probably class A,
- negative score probably class B.
- Note some ambiguities in the discriminant score WAB.
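A minimal sketch of such a shifted score for two measurements per object is given below. It assumes the usual convention of placing zero midway between the two class centres; the slides do not state how the shift is chosen. The function names and the two small made-up classes are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]

def scatter(rows, centre):
    """2x2 sums of squares and cross-products about the class centre."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - centre[0], r[1] - centre[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def make_discriminant(class_a, class_b):
    """Return W_AB(x): positive suggests class A, negative class B."""
    mean_a, mean_b = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter(class_a, mean_a), scatter(class_b, mean_b)
    dof = len(class_a) + len(class_b) - 2
    # Pooled variance-covariance matrix of the two classes
    pooled = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]
    inv = inverse_2x2(pooled)
    diff = [mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]]
    # weights = (centre of A - centre of B) times inverse pooled matrix
    w = [diff[0] * inv[0][j] + diff[1] * inv[1][j] for j in range(2)]
    # Shift the scale so the midpoint between the centres scores zero
    # (an assumed convention, not stated on the slide).
    mid = [(mean_a[k] + mean_b[k]) / 2 for k in range(2)]
    offset = w[0] * mid[0] + w[1] * mid[1]
    return lambda x: w[0] * x[0] + w[1] * x[1] - offset

# Made-up bivariate data (illustrative only)
class_a = [(5.0, 6.0), (6.0, 7.5), (7.0, 7.0), (6.0, 8.0)]
class_b = [(1.0, 2.0), (2.0, 3.5), (3.0, 3.0), (2.0, 4.0)]

w_ab = make_discriminant(class_a, class_b)
for sample in [(6.0, 7.0), (2.5, 3.0), (4.0, 5.0)]:
    score = w_ab(sample)
    print(sample, round(score, 2), "A" if score > 0 else "B")
```

Scores close to zero are the ambiguous cases: the object lies near the boundary between the two classes.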
23Extending to more than 2 classes
Three classes: 2 out of 3 possible discriminant parameters
- If we have 3 classes and choose to use WAB and WAC as the functions, it is easy to see that
- an object belongs to class A if WAB and WAC are both positive,
- an object belongs to class B if WAB is negative and WAC is greater than WAB, and
- an object belongs to class C if WAC is negative and WAB is greater than WAC.
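The three rules translate directly into code. In this sketch the function name `assign_class` and the "ambiguous" label for ties and zero scores are assumptions; the slide does not say how such cases are handled.

```python
def assign_class(w_ab, w_ac):
    """Apply the three-class rules to the two discriminant scores."""
    if w_ab > 0 and w_ac > 0:
        return "A"
    if w_ab < 0 and w_ac > w_ab:
        return "B"
    if w_ac < 0 and w_ab > w_ac:
        return "C"
    return "ambiguous"  # ties or zero scores (assumed handling)

print(assign_class(2.0, 1.0))    # both positive
print(assign_class(-1.0, 0.5))   # WAB negative, WAC greater than WAB
print(assign_class(1.0, -2.0))   # WAC negative, WAB greater than WAC
```

The third function, WBC, is not needed because it equals WAC minus WAB when all three are built from the same pooled matrix.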
25Mahalanobis distance
Similar idea to the Euclidean distance, i.e. distance to the centre of a class, but uses the variance-covariance matrix for scaling.
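The effect of the scaling is easiest to see with two correlated measurements. The sketch below is restricted to two variables so that the matrix inverse can be written out by hand; the function name and the data are made up for illustration.

```python
import math

def mahalanobis_2d(x, class_rows):
    """Distance from x to the class centre, scaled by the class
    variance-covariance matrix (two variables only)."""
    n = len(class_rows)
    centre = [sum(r[k] for r in class_rows) / n for k in range(2)]
    # Variance-covariance matrix of the class (divisor n - 1)
    c = [[sum((r[a] - centre[a]) * (r[b] - centre[b]) for r in class_rows) / (n - 1)
          for b in range(2)] for a in range(2)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det, -c[0][1] / det],
           [-c[1][0] / det, c[0][0] / det]]
    d = [x[0] - centre[0], x[1] - centre[1]]
    squared = sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))
    return math.sqrt(squared)

# Made-up class whose two measurements are strongly correlated
rows = [(1.0, 1.0), (2.0, 2.2), (3.0, 2.9), (4.0, 4.1), (5.0, 4.8)]

along_trend = (5.0, 5.0)   # far from the centre, but follows the correlation
off_trend = (4.0, 2.0)     # nearer the centre, but breaks the correlation

print(mahalanobis_2d(along_trend, rows))
print(mahalanobis_2d(off_trend, rows))
```

The point that breaks the correlation has the larger Mahalanobis distance even though its Euclidean distance to the centre is smaller.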
30Many classical statistical methods were developed first in biology.
Problem for chemists: the Mahalanobis distance depends on there being more samples than variables.
Spectroscopy and chromatography often give a huge number of measurements per sample.
31- Solutions
- Variable selection
- PCA prior to performing classification
32- Many diagnostics
- Modelling power of variables
- Discriminatory power of variables
- Quality of class model
- Probabilities of class membership
- Ambiguous classification: is the analytical data good enough?
33- MANY SOPHISTICATIONS
- Large number of methods for classification based on LDA.
- Bayesian methods based on prior probabilities.
- Methods that try to find optimal groupings before class modelling.
34- LOTS OF INFORMATION
- Class membership
- Outliers
- Whether there is another, new class
- Is a class well defined, or are there subclasses, e.g. subspecies or species from different environments?
- What measurements are most useful for discrimination? Can we reduce the number of measurements?
- Are there ambiguous samples, and if so do we need more or better measurements?
- Replicate analysis. Is our method sufficiently good for repeatability? Clinical diagnostics.
35- SIMCA is sometimes used in chemometrics as an alternative
- Soft
- Independent
- Modelling of
- Class analogy
36Use PCA models
37- Use PCA to model each class independently
- Choose optimal number of PCs
- Use distance from PC model as an indicator of
class distance
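The idea can be sketched for two measurements and a one-component model per class. This is a deliberately simplified illustration, not a full SIMCA implementation: the number of PCs is fixed at one rather than optimised, the residual distance is used without any statistical scaling, and the function names and data are assumptions.

```python
import math

def fit_one_pc_model(rows):
    """Centre one class and find its first principal component (2 variables)."""
    n = len(rows)
    centre = [sum(r[k] for r in rows) / n for k in range(2)]
    cxx = sum((r[0] - centre[0]) ** 2 for r in rows) / (n - 1)
    cyy = sum((r[1] - centre[1]) ** 2 for r in rows) / (n - 1)
    cxy = sum((r[0] - centre[0]) * (r[1] - centre[1]) for r in rows) / (n - 1)
    # Direction of largest variance for a 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    loading = (math.cos(theta), math.sin(theta))
    return centre, loading

def residual_distance(x, model):
    """Distance from x to the class PC model (the part the PC cannot explain)."""
    centre, loading = model
    d = (x[0] - centre[0], x[1] - centre[1])
    score = d[0] * loading[0] + d[1] * loading[1]
    residual = (d[0] - score * loading[0], d[1] - score * loading[1])
    return math.hypot(residual[0], residual[1])

# Made-up classes, each modelled independently (illustrative only)
class_a = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.0)]
class_b = [(6.0, 4.1), (7.0, 2.9), (8.0, 2.1), (9.0, 0.9), (10.0, 0.1)]
model_a = fit_one_pc_model(class_a)
model_b = fit_one_pc_model(class_b)

unknown = (3.5, 3.4)
print(residual_distance(unknown, model_a))
print(residual_distance(unknown, model_b))
```

Because each class has its own model, an object can fit one class, several classes, or none, which is what makes the approach "soft".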
38VALIDATION OF A CLASS MODEL
- Procedure.
- Establish a training set.
- Assess model with a test set.
- Use model on real data.
- Information
- Graphical - e.g. diagrams
- Quantitative - class distances
- Quantitative - probability of membership of a given class.
39Training set
Test set
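The training and test procedure can be sketched as below. A nearest-centre classifier stands in for the class model purely to keep the example short; it is not a method from the slides. The systematic split (every third sample held out), the function names and the data are likewise assumptions.

```python
import math

def centroid(rows):
    n = len(rows)
    return tuple(sum(r[k] for r in rows) / n for k in range(len(rows[0])))

def nearest_centroid(x, centroids):
    """Assign x to the class whose centre is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

def validate(samples, test_every=3):
    """samples: list of (measurements, label) pairs.

    Every `test_every`-th sample is held out as the test set;
    the remaining samples form the training set.
    Returns the fraction of test samples classified correctly.
    """
    test = samples[::test_every]
    train = [s for i, s in enumerate(samples) if i % test_every != 0]
    by_class = {}
    for x, label in train:
        by_class.setdefault(label, []).append(x)
    # The model is built from the training set only
    centroids = {label: centroid(rows) for label, rows in by_class.items()}
    correct = sum(nearest_centroid(x, centroids) == label for x, label in test)
    return correct / len(test)

# Made-up, well separated two-class data (illustrative only)
samples = [
    ((1.0, 1.2), "A"), ((5.0, 5.1), "B"), ((1.3, 0.9), "A"),
    ((4.8, 5.3), "B"), ((0.8, 1.1), "A"), ((5.2, 4.7), "B"),
    ((1.1, 1.4), "A"), ((5.1, 5.0), "B"), ((0.9, 0.8), "A"),
]

print("fraction of test set correctly classified:", validate(samples))
```

The key design point is that the test samples play no part in building the model, so the result estimates how the model will behave on real data.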
40- SUMMARY
- Cluster analysis: unsupervised pattern recognition
- Similarity measures
- Linkage
- Dendrograms
- Classification: supervised pattern recognition
- Class models
- Class distances
- Graphical methods