PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION - PowerPoint PPT Presentation

Provided by: richardb165
1
PATTERN RECOGNITION: CLUSTERING AND CLASSIFICATION
Richard Brereton, r.g.brereton_at_bris.ac.uk
2
  • CLUSTER ANALYSIS - UNSUPERVISED PATTERN
    RECOGNITION
  • Grouping of objects according to similarity.
  • No predefined classes

3
TAXONOMY
4
  • CHEMICAL TAXONOMY
  • Grouping organisms according to similarity from
    chemical fingerprints
  • DNA base pairs, proteins
  • NMR and pyrolysis of extracts
  • NIR spectra

5
  • SIMILAR PRINCIPLES IN ALL TYPES OF CHEMISTRY
  • Chemical archaeology
  • Environmental samples
  • Food

6
STEPS IN CLUSTER ANALYSIS
Similarity measures. Calculate similarity between objects.
Example
7
Correlation coefficient: higher, more similar
Euclidean distance: smaller, more similar
8
Manhattan distance: smaller, more similar
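The three similarity measures named on these slides can be sketched in a few lines of Python. The objects and their measurements below are hypothetical, chosen only to show that correlation rises with similarity while the two distances fall.

```python
import math

def correlation(a, b):
    """Pearson correlation coefficient: higher means more similar."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / math.sqrt(ss_a * ss_b)

def euclidean(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences: smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical objects, four measurements each (not from the slides)
obj1 = [1.0, 2.0, 3.0, 4.0]
obj2 = [2.0, 4.0, 6.0, 8.0]   # same pattern as obj1, larger magnitude
obj3 = [4.0, 3.0, 2.0, 1.0]   # reversed pattern

print(correlation(obj1, obj2), correlation(obj1, obj3))
print(euclidean(obj1, obj2), euclidean(obj1, obj3))
print(manhattan(obj1, obj2), manhattan(obj1, obj3))
```

Note that the measures can disagree: obj1 and obj2 are perfectly correlated, yet obj3 is closer to obj1 by both distance measures, so the choice of measure matters.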
9
Use correlations for illustration.
Group samples.
1. Find most similar, highest correlation. Objects 2 and 5.
2. Combine them.
3. Work out new correlation of the new object 25 (objects 2 and 5
combined) with the other objects (1,3,4,6).
10
  • Linkage methods: determination of new similarity
    measures of groups.
  • Several methods.
  • Nearest neighbour uses the highest correlation
  • Furthest neighbour uses the lowest correlation
  • Average linkage uses an average.
  • Illustrate with nearest neighbour.
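One merge step under each linkage rule might look like the following sketch. The correlation values are hypothetical (the slide's own table is not in the transcript), and "average" here is a plain unweighted mean of the two correlations, which is only one of several averaging variants.

```python
def key(a, b):
    """Unordered pair of labels, stored as a sorted tuple."""
    return tuple(sorted((a, b)))

def most_similar_pair(sim):
    """Pair with the highest correlation."""
    return max(sim, key=sim.get)

def merge(sim, pair, linkage="nearest"):
    """Combine `pair` into one group and recompute its similarity to the rest."""
    a, b = pair
    merged = a + b  # e.g. "2" and "4" become "24"
    others = sorted({label for p in sim for label in p} - {a, b})
    combine = {
        "nearest": max,    # highest correlation
        "furthest": min,   # lowest correlation
        "average": lambda x, y: (x + y) / 2,
    }[linkage]
    # Keep similarities that do not involve the merged objects
    new_sim = {p: s for p, s in sim.items() if a not in p and b not in p}
    for other in others:
        new_sim[key(merged, other)] = combine(sim[key(a, other)],
                                              sim[key(b, other)])
    return new_sim

# Hypothetical correlations between four objects
sim = {
    key("1", "2"): 0.3, key("1", "3"): 0.8, key("1", "4"): 0.1,
    key("2", "3"): 0.5, key("2", "4"): 0.9, key("3", "4"): 0.2,
}

pair = most_similar_pair(sim)
for rule in ("nearest", "furthest", "average"):
    print(rule, merge(sim, pair, rule))
```

Repeating the find-and-merge step until one group remains gives the full clustering.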

11
(No Transcript)
12
Dendrograms
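A dendrogram is a drawing of the order in which groups merge and the similarity level at each merge. As a rough sketch, the merge history can be computed as below; this version assumes one-dimensional hypothetical data, absolute difference as the distance, and nearest neighbour linkage, none of which is taken from the slide.

```python
def single_linkage(points):
    """Agglomerate 1-D points; return merges as (group_a, group_b, distance)."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Nearest neighbour: smallest distance between any two members
                d = min(abs(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical measurements for five objects (indices 0 to 4)
points = [1.0, 2.0, 6.0, 6.5, 15.0]
merges = single_linkage(points)
for group_a, group_b, height in merges:
    print(f"join {group_a} + {group_b} at distance {height}")
```

Each printed line corresponds to one horizontal join in the dendrogram, with the distance giving the height of the join.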
13
  • CLUSTER ANALYSIS SUMMARY
  • Similarity measures
  • Linkage methods
  • Dendrogram

14
CLASSIFICATION
Many methods.
CONVENTIONAL: LDA (Linear discriminant analysis)
Original statistics projections
15
Examples
Orange juices: can we classify them by origin, and can we detect
adulteration from NIR spectra?
Class modelling of mussels: can we find which come from a polluted
site from GC?
Detailed mathematical model
16
PRINCIPLES: BIVARIATE EXAMPLE
17
Often exact cut-off impossible
18
Class distance plots
19
Multivariate data: several measurements per class
Example: Fisher Iris data, four measurements per iris
Petal width, petal length, sepal width, sepal length
150 irises, divided into 50 of each species
I. setosa
I. versicolor
I. virginica
20
SPECIAL DISTANCES USED. Linear discriminant
function between classes A and B
  • The first term is simply the difference between
    the centres of each class so a more positive
    value indicates class A.
  • The middle term is the inverse of the pooled
    variance covariance matrix.
  • What does this mean?
  • Sometimes measurements are correlated.
  • Sometimes classes are more dispersed.
  • Puts distances on common scale.
  • The final term is the measurement for each
    object.
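The formula itself was on the slide image and is not in the transcript. Reading the three terms described above literally, the score is (centre of A minus centre of B) times the inverse pooled variance-covariance matrix times the object's measurements. The sketch below follows that reading for two variables; the data and all function names are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, centre):
    """Sums of squares and cross-products about the class centre (2x2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - centre[0], r[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def pooled_covariance(class_a, class_b):
    sa = scatter(class_a, mean_vector(class_a))
    sb = scatter(class_b, mean_vector(class_b))
    dof = len(class_a) + len(class_b) - 2
    return [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def discriminant_weights(class_a, class_b):
    """First term times middle term: centre difference times inverse covariance."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    inv = inverse_2x2(pooled_covariance(class_a, class_b))
    return [diff[0] * inv[0][j] + diff[1] * inv[1][j] for j in range(2)]

def score(weights, x):
    """Final term: multiply by the measurements of one object."""
    return weights[0] * x[0] + weights[1] * x[1]

# Hypothetical two-variable data for classes A and B
class_a = [(4.0, 5.0), (5.0, 6.0), (6.0, 5.0), (5.0, 4.0)]
class_b = [(1.0, 2.0), (2.0, 3.0), (3.0, 2.0), (2.0, 1.0)]

weights = discriminant_weights(class_a, class_b)
print(score(weights, (5.0, 5.0)), score(weights, (2.0, 2.0)))
```

Objects near the centre of class A get the larger score, matching the statement that a more positive value indicates class A.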

21
(No Transcript)
22
  • Can shift the scale so that
  • positive score probably class A,
  • negative score probably class B.
  • Note some ambiguities in the score WAB.
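One common way to shift the scale is to subtract the score at the midpoint between the two class centres, so that zero falls halfway between them. The sketch below assumes that choice (the slide does not state which shift was used) and takes the discriminant weights and class centres as given, hypothetical values.

```python
def shifted_score(weights, centre_a, centre_b, x):
    """Raw discriminant score minus its value at the midpoint of the centres."""
    midpoint = [(a + b) / 2 for a, b in zip(centre_a, centre_b)]
    raw = sum(w * v for w, v in zip(weights, x))
    offset = sum(w * m for w, m in zip(weights, midpoint))
    return raw - offset

# Hypothetical weights and class centres (assumed, not from the slides)
weights = [4.5, 4.5]
centre_a = [5.0, 5.0]
centre_b = [2.0, 2.0]

for x in ([5.0, 5.0], [2.0, 2.0], [3.5, 3.5]):
    w_ab = shifted_score(weights, centre_a, centre_b, x)
    print(x, w_ab)
```

A score of exactly zero, or close to it, marks an ambiguous object that lies between the two classes.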

23
Extending to more than 2 classes
Three classes: 2 out of 3 possible discriminant parameters
  • If we have 3 classes and choose to use WAB and
    WAC as the functions, it is easy to see that
  • an object belongs to class A if WAB and WAC are
    both positive,
  • an object belongs to class B if WAB is negative
    and WAC is greater than WAB, and
  • an object belongs to class C if WAC is negative
    and WAB is greater than WAC.
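The three rules above translate directly into code. In this sketch the function name and the "ambiguous" fallback for ties and zero scores are additions for illustration; the slide only states the three class rules.

```python
def classify(w_ab, w_ac):
    """Assign class A, B or C from the two discriminant scores WAB and WAC."""
    if w_ab > 0 and w_ac > 0:
        return "A"
    if w_ab < 0 and w_ac > w_ab:
        return "B"
    if w_ac < 0 and w_ab > w_ac:
        return "C"
    return "ambiguous"  # ties or zero scores: not covered by the three rules

print(classify(2.0, 1.0))    # both positive
print(classify(-1.0, 2.0))   # WAB negative, WAC greater than WAB
print(classify(3.0, -1.0))   # WAC negative, WAB greater than WAC
print(classify(-1.0, -1.0))  # tie between B and C
```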

24
(No Transcript)
25
Mahalanobis distance
Similar idea to the Euclidean distance, i.e. distance to the centre
of a class, but use the variance-covariance matrix for scaling.
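A minimal two-variable sketch of the Mahalanobis distance follows. The class centre and variance-covariance matrix are hypothetical; in practice they would be estimated from the class's training objects.

```python
import math

def mahalanobis(x, centre, cov):
    """Distance from x to a class centre, scaled by a 2x2 covariance matrix."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - centre[0], x[1] - centre[1]]
    q = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

# Hypothetical class: centred at the origin, more dispersed along variable 1
centre = (0.0, 0.0)
cov = [[4.0, 0.0],
       [0.0, 1.0]]

# Both points are at Euclidean distance 2 from the centre
print(mahalanobis((2.0, 0.0), centre, cov))  # along the dispersed direction
print(mahalanobis((0.0, 2.0), centre, cov))  # along the tight direction
```

The point lying along the direction in which the class is more spread out comes out closer, which is the scaling effect the slide describes.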
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
Many classical statistical methods developed first in biology.
Problem for chemists: Mahalanobis distance depends on there being
more samples than variables.
Spectroscopy, chromatography: often a huge number of measurements
per sample.
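The reason is that the variance-covariance matrix must be inverted, and with N samples its rank is at most N - 1, so it is singular whenever there are not more samples than variables. A small sketch with hypothetical two-variable data:

```python
def covariance_2x2(rows):
    """Sample variance-covariance matrix for two variables."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def determinant(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical data: two variables measured on two, then three, samples
two_samples = [(1.0, 2.0), (3.0, 5.0)]
three_samples = [(1.0, 2.0), (3.0, 5.0), (2.0, 2.0)]

det_two = determinant(covariance_2x2(two_samples))
det_three = determinant(covariance_2x2(three_samples))
print(det_two)    # zero: matrix cannot be inverted
print(det_three)  # non-zero: matrix can be inverted
```

A spectrum with hundreds of wavelengths measured on a few dozen samples is the same situation on a larger scale.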
31
  • Solutions
  • Variable selection
  • PCA prior to performing classification

32
  • Many diagnostics
  • Modelling power of variables
  • Discriminatory power of variables
  • Quality of class model
  • Probabilities of class membership
  • Ambiguous classification: is analytical data
    good enough?

33
  • MANY SOPHISTICATIONS
  • Large number of methods for classification based
    on LDA.
  • Bayesian methods based on prior probabilities.
  • Methods that try to find optimal groupings before
    class modelling.

34
  • LOTS OF INFORMATION
  • Class membership
  • Outliers
  • Whether there is another new class
  • Is a class well defined or are there subclasses
    e.g. subspecies or species from different
    environments
  • What measurements are most useful for
    discrimination. Can we reduce the number of
    measurements?
  • Are there ambiguous samples, and if so do we need
    more or better measurements?
  • Replicate analysis: is our method sufficiently
    repeatable? Clinical diagnostics.

35
  • SIMCA sometimes used in chemometrics as an
    alternative
  • Soft Independent Modelling of Class Analogy

36
Use PCA models
37
  • Use PCA to model each class independently
  • Choose optimal number of PCs
  • Use distance from PC model as an indicator of
    class distance
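The idea can be sketched as follows, with strong simplifications that are assumptions of this example rather than part of SIMCA as published: each class is modelled with a single principal component (found by power iteration), and the class distance is the plain residual distance from that component, with no significance test. The data are hypothetical.

```python
import math

def centre_of(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def first_pc(rows, iterations=200):
    """Class centre and leading principal component, by power iteration."""
    c = centre_of(rows)
    centred = [[v - m for v, m in zip(r, c)] for r in rows]
    p = len(c)
    cov = [[sum(r[i] * r[j] for r in centred) for j in range(p)]
           for i in range(p)]
    vec = [1.0] * p  # assumes this start is not orthogonal to the first PC
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return c, vec

def residual_distance(x, model):
    """Distance from x to the one-component PC model of a class."""
    c, loading = model
    d = [v - m for v, m in zip(x, c)]
    t = sum(a * b for a, b in zip(d, loading))         # score on the PC
    resid = [a - t * b for a, b in zip(d, loading)]    # part not modelled
    return math.sqrt(sum(r * r for r in resid))

# Hypothetical classes: A lies along a diagonal, B along a horizontal line
class_a = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
class_b = [(5.0, 1.0), (6.0, 1.1), (7.0, 0.9), (8.0, 1.0)]

model_a = first_pc(class_a)   # each class modelled independently
model_b = first_pc(class_b)

x = (2.5, 2.5)
dist_a = residual_distance(x, model_a)
dist_b = residual_distance(x, model_b)
print(dist_a, dist_b)
```

Because each class has its own model, an object can be close to one class, to several, or to none, which is what makes the modelling "soft".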

38
VALIDATION OF A CLASS MODEL
  • Procedure:
  • Establish a training set.
  • Assess model with a test set.
  • Use model on real data.
  • Information:
  • Graphical - e.g. diagrams
  • Quantitative - class distances
  • Quantitative - probability of membership of a
    given class.

39
Training set
Test set
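A bare-bones version of the training set / test set procedure is sketched below. The classifier is a simple nearest-centre rule with Euclidean distance, used here only as a stand-in for whichever class model is being validated; the data and the every-third-object split are hypothetical.

```python
import math

def centroid(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def nearest_centroid(x, centroids):
    """Assign x to the class whose centre is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

def split(rows, every=3):
    """Hold out every `every`-th object as the test set (deterministic)."""
    train = [r for i, r in enumerate(rows) if i % every != 0]
    test = [r for i, r in enumerate(rows) if i % every == 0]
    return train, test

# Hypothetical two-variable data for two classes
data = {
    "A": [(5.0, 5.1), (4.8, 5.3), (5.2, 4.9), (5.1, 5.0), (4.9, 5.2), (5.3, 4.8)],
    "B": [(2.0, 2.1), (1.8, 2.3), (2.2, 1.9), (2.1, 2.0), (1.9, 2.2), (2.3, 1.8)],
}

centroids, test_set = {}, []
for label, rows in data.items():
    train, test = split(rows)
    centroids[label] = centroid(train)           # model built on training set only
    test_set += [(x, label) for x in test]       # test objects never seen by model

correct = sum(nearest_centroid(x, centroids) == label for x, label in test_set)
percent_correct = 100.0 * correct / len(test_set)
print(f"{correct} of {len(test_set)} test objects correct ({percent_correct:.0f}%)")
```

The key point is that the test objects play no part in building the model, so the percentage correctly classified is an honest estimate of how the model will behave on real data.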
40
  • SUMMARY
  • Cluster analysis: unsupervised pattern
    recognition
  • Similarity measures
  • Linkage
  • Dendrograms
  • Classification: supervised pattern recognition
  • Class models
  • Class distances
  • Graphical methods