PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION




Slides: 41

Provided by: richardb165


Transcript and Presenter's Notes

Title: PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION

1
PATTERN RECOGNITION: CLUSTERING AND CLASSIFICATION
Richard Brereton
r.g.brereton_at_bris.ac.uk
2

CLUSTER ANALYSIS - UNSUPERVISED PATTERN RECOGNITION
Grouping of objects according to similarity.
No predefined classes

3
TAXONOMY
4

CHEMICAL TAXONOMY
Grouping organisms according to similarity from
chemical fingerprints
DNA base pairs, proteins
NMR and pyrolysis of extracts
NIR spectra

SIMILAR PRINCIPLES IN ALL TYPES OF CHEMISTRY
Chemical archaeology
Environmental samples
Food

6
STEPS IN CLUSTER ANALYSIS
Similarity measures.
Calculate similarity between objects.
Example
7
Correlation coefficient: higher, more similar
Euclidean distance: smaller, more similar
8
Manhattan distance: smaller, more similar
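
The three similarity measures named on slides 7 and 8 can be sketched in a few lines of plain Python. The function names and the three example objects are invented for illustration and are not taken from the slides.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient: higher means more similar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def euclidean(x, y):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute differences: smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical objects, each described by four measurements.
obj_a = [1.0, 2.0, 3.0, 4.0]
obj_b = [2.0, 4.0, 6.0, 8.0]   # same shape as obj_a, twice the size
obj_c = [4.0, 3.0, 2.0, 1.0]   # reversed trend

for name, other in (("b", obj_b), ("c", obj_c)):
    print(name, correlation(obj_a, other), euclidean(obj_a, other), manhattan(obj_a, other))
```

Note that the measures can disagree: obj_b is perfectly correlated with obj_a yet further away from it than obj_c by both distance measures, so the choice of measure matters.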
9
Use correlations for illustration. Group samples.
1. Find the most similar pair (highest correlation): objects 2 and 5.
2. Combine them.
3. Work out the new correlation of the combined object (2 and 5, labelled 25) with the other objects (1, 3, 4, 6).
10

Linkage methods: determination of new similarity measures between groups.
Several methods.
Nearest neighbour uses the highest correlation
Furthest neighbour uses the lowest correlation
Average linkage uses an average.
Illustrate with nearest neighbour.
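
A minimal sketch of one merge step under the three linkage rules. The correlation values are invented (the slide's own table did not survive in the transcript); they are chosen only so that objects 2 and 5 are the most similar pair, as in the slide's example. The names `sim`, `LINKAGE` and `merge` are hypothetical. The simple two-value average used here is only adequate for this first step, where both merged groups contain a single object.

```python
# Hypothetical correlations between six objects (values invented).
pairs = {
    ("1", "2"): 0.41, ("1", "3"): 0.62, ("1", "4"): 0.10, ("1", "5"): 0.35, ("1", "6"): 0.20,
    ("2", "3"): 0.55, ("2", "4"): 0.25, ("2", "5"): 0.95, ("2", "6"): 0.15,
    ("3", "4"): 0.30, ("3", "5"): 0.50, ("3", "6"): 0.05,
    ("4", "5"): 0.45, ("4", "6"): 0.80,
    ("5", "6"): 0.12,
}
sim = {frozenset(k): v for k, v in pairs.items()}   # order-free lookup

LINKAGE = {
    "nearest": max,                        # highest correlation
    "furthest": min,                       # lowest correlation
    "average": lambda a, b: (a + b) / 2,   # mean of the two correlations
}

def merge(sim, linkage="nearest"):
    """Combine the most similar pair; return (new_label, updated similarities)."""
    a, b = sorted(max(sim, key=sim.get))   # step 1: highest correlation
    new = a + b                            # step 2: "2" and "5" become "25"
    combine = LINKAGE[linkage]
    others = {o for pair in sim for o in pair} - {a, b}
    updated = {p: s for p, s in sim.items() if a not in p and b not in p}
    for o in others:                       # step 3: similarity of "25" to the rest
        updated[frozenset((new, o))] = combine(sim[frozenset((a, o))],
                                               sim[frozenset((b, o))])
    return new, updated

new_label, updated = merge(sim, "nearest")
print(new_label, {tuple(sorted(p)): s for p, s in updated.items() if new_label in p})
```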

11
(No Transcript)
12
Dendrograms
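
A dendrogram records the order in which groups merge and the similarity at each merge. The sketch below repeats the merge step until one group remains and returns the tree as nested tuples. It assumes nearest neighbour linkage and uses four invented objects with invented correlations; `build_dendrogram` is a hypothetical name.

```python
from itertools import combinations

# Hypothetical correlations between four objects (values invented).
sim = {
    frozenset("AB"): 0.9, frozenset("AC"): 0.4, frozenset("AD"): 0.2,
    frozenset("BC"): 0.5, frozenset("BD"): 0.3, frozenset("CD"): 0.7,
}

def group_similarity(g1, g2):
    """Nearest neighbour linkage: highest correlation between any two members."""
    return max(sim[frozenset((a, b))] for a in g1 for b in g2)

def build_dendrogram(objects):
    """Return (tree, merges): nested tuples plus the similarity at each merge."""
    clusters = {(o,): o for o in objects}          # members -> subtree
    merges = []
    while len(clusters) > 1:
        g1, g2 = max(combinations(sorted(clusters), 2),
                     key=lambda pair: group_similarity(*pair))
        merges.append((g1, g2, group_similarity(g1, g2)))
        subtree = (clusters.pop(g1), clusters.pop(g2))
        clusters[tuple(sorted(g1 + g2))] = subtree
    (tree,) = clusters.values()
    return tree, merges

tree, merges = build_dendrogram("ABCD")
print(tree)
for g1, g2, level in merges:
    print(g1, "+", g2, "at correlation", level)
```

The merge levels are what a plotted dendrogram shows on its similarity axis; cutting the tree at a chosen level gives the groups.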
13

CLUSTER ANALYSIS SUMMARY
Similarity measures
Linkage methods
Dendrogram

14
CLASSIFICATION
Many methods.
CONVENTIONAL: LDA (linear discriminant analysis)
Original statistics: projections
15
Examples
Orange juices: can we classify them by origin, and can we detect adulteration from NIR spectra?
Class modelling of mussels: can we find which come from a polluted site from GC?
Detailed mathematical model
16
PRINCIPLES: BIVARIATE EXAMPLE
17
Often exact cut-off impossible
18
Class distance plots
19
Multivariate data: several measurements per class
Example: Fisher Iris data, four measurements per iris
Petal width, petal length, sepal width, sepal length
150 irises, 50 of each species
I. setosa
I. versicolor
I. virginica
20
SPECIAL DISTANCES USED. Linear discriminant
function between classes A and B

The first term is simply the difference between
the centres of each class so a more positive
value indicates class A.
The middle term is the inverse of the pooled
variance covariance matrix.
What does this mean?
Sometimes measurements are correlated.
Sometimes classes are more dispersed.
Puts distances on common scale.
The final term is the measurement for each
object.
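
The equation itself did not survive in the transcript. From the three terms described above it can be read as f = (mean_A - mean_B) C^-1 x', where C is the pooled variance-covariance matrix and x is the row of measurements for one object; that reading is a reconstruction, not the slide's own notation. The sketch below evaluates it for two measurements per object, with invented data and hypothetical function names.

```python
def mean(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, centre):
    """Sum of outer products of the centred rows (2 x 2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - centre[0], r[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def pooled_covariance(a, b):
    """Pooled variance-covariance matrix of classes a and b."""
    sa, sb = scatter(a, mean(a)), scatter(b, mean(b))
    dof = len(a) + len(b) - 2
    return [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def discriminant(x, a, b):
    """(mean_A - mean_B) C^-1 x': a more positive value indicates class A."""
    ma, mb = mean(a), mean(b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]                 # first term
    cinv = inverse_2x2(pooled_covariance(a, b))           # middle term
    w = [diff[0] * cinv[0][j] + diff[1] * cinv[1][j] for j in range(2)]
    return w[0] * x[0] + w[1] * x[1]                      # final term

# Hypothetical training data: two measurements per object.
class_a = [(4.0, 2.0), (5.0, 3.0), (6.0, 2.0), (5.0, 1.0)]
class_b = [(1.0, 2.0), (2.0, 3.0), (3.0, 2.0), (2.0, 1.0)]

print(discriminant(mean(class_a), class_a, class_b))
print(discriminant(mean(class_b), class_a, class_b))
```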

21
(No Transcript)
22

Can shift the scale (giving WAB) so that
positive score probably class A,
negative score probably class B.
Note some ambiguities.
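
The slide does not show how the scale is shifted. One common choice, given here as an assumption, is to subtract the score of the midpoint between the two class centres, so that the function is zero halfway between them. The symbols are also assumptions: \bar{x}_A and \bar{x}_B are the class centres as row vectors, C_{AB} is the pooled variance-covariance matrix, x_i is the row of measurements for object i, and a prime denotes the transpose.

```latex
W_{AB} = (\bar{x}_A - \bar{x}_B)\, C_{AB}^{-1}\, x_i'
       \;-\; \tfrac{1}{2}\, (\bar{x}_A - \bar{x}_B)\, C_{AB}^{-1}\, (\bar{x}_A + \bar{x}_B)'
```

Setting x_i to the midpoint (\bar{x}_A + \bar{x}_B)/2 gives W_{AB} = 0, so objects nearer the centre of A score positive and objects nearer the centre of B score negative.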

23
Extending to more than 2 classes
Three classes: 2 out of 3 possible discriminant parameters

If we have 3 classes and choose to use WAB and
WAC as the functions, it is easy to see that
an object belongs to class A if WAB and WAC are
both positive,
an object belongs to class B if WAB is negative
and WAC is greater than WAB, and
an object belongs to class C if WAC is negative
and WAB is greater than WAC.
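
The three rules above translate directly into a small decision function. The name `assign_class` is hypothetical, and returning None for exact ties is a choice made here; the slide does not say how ties are handled.

```python
def assign_class(w_ab, w_ac):
    """Assign class A, B or C from the two discriminant scores WAB and WAC."""
    if w_ab > 0 and w_ac > 0:
        return "A"
    if w_ab < 0 and w_ac > w_ab:
        return "B"
    if w_ac < 0 and w_ab > w_ac:
        return "C"
    return None   # exact ties are left undecided (an assumption)

for scores in [(2.0, 1.5), (-1.0, 0.5), (0.5, -2.0), (-1.0, -3.0), (-1.0, -1.0)]:
    print(scores, assign_class(*scores))
```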

24
(No Transcript)
25
Mahalanobis distance
Similar idea to the Euclidean distance, i.e. distance to the centre of a class, but uses the variance-covariance matrix for scaling.
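
A minimal sketch for two measurements per object, assuming the usual definition d = sqrt((x - centre) C^-1 (x - centre)'), with C the variance-covariance matrix of the class. The class data and the two test points are invented. Both points are the same Euclidean distance from the class centre, but only one lies along the direction in which the class is spread.

```python
import math

def mahalanobis(x, rows):
    """Distance from x to the centre of the class, scaled by its covariance."""
    n = len(rows)
    cx = sum(r[0] for r in rows) / n
    cy = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - cx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - cy) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - cx) * (r[1] - cy) for r in rows) / (n - 1)
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - cx, x[1] - cy
    # (dx, dy) C^-1 (dx, dy)' with the 2 x 2 inverse written out explicitly
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical class with two positively correlated measurements.
class_rows = [(2.0, 1.0), (1.0, 2.0), (-2.0, -1.0), (-1.0, -2.0)]

along = (2.0, 2.0)     # lies along the direction of the correlation
across = (2.0, -2.0)   # same Euclidean distance, against the correlation

print(mahalanobis(along, class_rows), mahalanobis(across, class_rows))
```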
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
Many classical statistical methods were developed first in biology.
Problem for chemists: the Mahalanobis distance depends on there being more samples than variables.
Spectroscopy and chromatography often give a huge number of measurements (variables) per sample.
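
The reason is that the variance-covariance matrix cannot be inverted when there are too few samples. A small sketch with two variables and invented data: with only two samples the determinant is zero, and adding a third sample that is not on the same line makes the matrix invertible.

```python
def covariance_2x2(rows):
    """Variance-covariance matrix for two variables."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def determinant(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical samples, two variables each.
two_samples = [(1.0, 4.0), (3.0, 5.0)]
three_samples = two_samples + [(2.0, 7.0)]

print(determinant(covariance_2x2(two_samples)))    # zero: no inverse exists
print(determinant(covariance_2x2(three_samples)))  # non-zero: inverse exists
```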
31

Solutions
Variable selection
PCA prior to performing classification
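
A rough sketch of the PCA route: replace many correlated measurements by a few principal component scores, then classify on the scores. Power iteration is used here only because it needs no libraries; it is one way to find the first component, not necessarily what the slides intend. The data are invented and deliberately degenerate (the second variable is exactly twice the first, the third is constant), so a single score carries all the variation.

```python
import math

def centre(rows):
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def covariance(centred):
    n, p = len(centred), len(centred[0])
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue,
    provided the starting vector is not orthogonal to it."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def scores(centred, loading):
    """Project each centred sample onto the component."""
    return [sum(a * b for a, b in zip(row, loading)) for row in centred]

# Hypothetical samples with three measurements each.
samples = [(1.0, 2.0, 0.0), (2.0, 4.0, 0.0), (3.0, 6.0, 0.0), (4.0, 8.0, 0.0)]

centred = centre(samples)
loading = first_component(covariance(centred))
pc1 = scores(centred, loading)
print(loading)
print(pc1)
```

The scores in `pc1` would then take the place of the original measurements as input to LDA or a Mahalanobis distance.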

Many diagnostics
Modelling power of variables
Discriminatory power of variables
Quality of class model
Probabilities of class membership
Ambiguous classification: is the analytical data good enough?

MANY SOPHISTICATIONS
Large number of methods for classification based
on LDA.
Bayesian methods based on prior probabilities.
Methods that try to find optimal groupings before
class modelling.

LOTS OF INFORMATION
Class membership
Outliers
Whether there is another new class
Is a class well defined, or are there subclasses, e.g. subspecies or species from different environments?
What measurements are most useful for discrimination? Can we reduce the number of measurements?
Are there ambiguous samples, and if so do we need
more or better measurements?
Replicate analysis: is our method sufficiently repeatable? Clinical diagnostics.