PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS - PowerPoint PPT Presentation

Slides: 42
Provided by: richardb165

Transcript and Presenter's Notes

1
PATTERN RECOGNITION: PRINCIPAL COMPONENTS ANALYSIS
Richard Brereton, r.g.brereton_at_bris.ac.uk
2
  • NEED FOR PATTERN RECOGNITION
  • Exploratory data analysis, e.g. PCA
  • Unsupervised pattern recognition, e.g. cluster analysis
  • Supervised pattern recognition, e.g. classification

3
Case study: coupled chromatography in HPLC (profile)
4
MULTIVARIATE DATA
Columns: wavelengths
Rows: times

5
  • DATA MATRICES
  • The rows do not need to correspond to elution
    times in chromatography; they can be any type of
    sample:
  • Blood sample
  • Wood
  • Chromatograms
  • Samples from a reaction mixture
  • Chromatographic columns

6
  • The columns do not need to correspond to
    spectral wavelengths; they can be any type of
    measurement:
  • NMR peak heights
  • Atomic spectroscopy measurements of elements
  • Chromatographic intensities
  • Concentrations of compounds in a mixture
  • Results of chromatographic tests

7
Return to the chromatography example. Rows:
elution times. Columns: wavelengths.
8
Chemical factors: $X = C \cdot S + E$
9
It would be nice to look at the chemical factors
underlying the chromatogram. We can use
mathematical methods to do this.
10
ABSTRACT FACTORS: PRINCIPAL COMPONENTS
11
$X = T \cdot P + E = C \cdot S + E$
T are called scores; these correspond to the
elution profiles. P are called loadings; these
correspond to the spectra. Ideally the size of T
and P (the number of columns of T and of rows of
P) equals the number of compounds in the mixture.
This size equals the number of principal
components, e.g. 1, 2, 3 etc. Each PC has an
associated scores vector (column of T) and
loadings vector (row of P).
12

13
  • Hence if the original data matrix has dimensions
    30 × 28 (or I × J) (30 elution times and 28
    wavelengths, or 30 blood samples and 28 compound
    concentrations, or 30 chromatographic columns
    and 28 tests) and if the number of PCs is denoted
    by A, then
  • the dimensions of T will be 30 × A, and
  • the dimensions of P will be A × 28.
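
The dimensions above can be checked with a minimal sketch. It assumes NumPy, computes the PCs with a singular value decomposition (the slides do not say which algorithm is used), applies no centring, and uses made-up random data; the function name pca_svd is hypothetical.

```python
import numpy as np

def pca_svd(X, A):
    """Return scores T (I x A), loadings P (A x J) and residuals E (I x J)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :A] * s[:A]   # scores: one column per PC
    P = Vt[:A, :]          # loadings: one row per PC
    E = X - T @ P          # part of X not described by the first A PCs
    return T, P, E

rng = np.random.default_rng(0)
X = rng.random((30, 28))   # made-up 30 x 28 data matrix (I = 30, J = 28)
A = 3                      # number of PCs, chosen arbitrarily here

T, P, E = pca_svd(X, A)
print(T.shape, P.shape, E.shape)   # (30, 3) (3, 28) (30, 28)
```

By construction T.P + E reproduces X exactly; only the split between model and residual changes with A.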

14
(No Transcript)
15
A major reason for performing PCA is data
simplification. Datasets are often very complex:
it is possible to make many measurements, but
there are only a few underlying factors. PCA helps
to see the wood for the trees. We will look at
this in more detail later.
16
  • SCORES AND LOADINGS HAVE SPECIAL MATHEMATICAL
    PROPERTIES
  • Scores and loadings are orthogonal.
  • What does this mean?
  • Loadings are normalised.
  • What does this mean?
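
One way to make these two properties concrete is the sketch below. It assumes NumPy, an SVD-based PCA with no centring and made-up data. Orthogonal means that the dot product of the scores (or loadings) vectors of two different PCs is zero; normalised means that the sum of squares of each loadings vector is 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((30, 28))                  # made-up data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

A = 3
T = U[:, :A] * s[:A]                      # scores, 30 x A
P = Vt[:A, :]                             # loadings, A x 28

scores_gram = T.T @ T                     # A x A matrix of score dot products
loadings_gram = P @ P.T                   # A x A matrix of loading dot products

# Orthogonal scores: every off-diagonal dot product is (numerically) zero.
print(np.allclose(scores_gram, np.diag(np.diag(scores_gram))))
# Orthogonal and normalised loadings: the matrix is the identity.
print(np.allclose(loadings_gram, np.eye(A)))
```

Both checks print True for this decomposition.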

17
PCA is an abstract concept. Theory, in
non-mathematical terms: a spectrum is recorded at
different concentrations and several wavelengths.
Plot of wavelength 6 versus wavelength 9 for six
spectra.
18
(No Transcript)
19
  • Each spectrum becomes ONE POINT IN 2 DIMENSIONAL
    SPACE
  • (2D = 2 wavelengths)
  • The spectra fall on a straight line, which is the
    FIRST PRINCIPAL COMPONENT
  • The line has a DIRECTION, often called the
    LOADINGS, corresponding to the SPECTRAL
    CHARACTERISTICS
  • Each spectrum has a DISTANCE along the line, often
    called the SCORES, corresponding to CONCENTRATION
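
The geometric picture can be reproduced with a small sketch. It assumes NumPy and invented numbers: six hypothetical concentrations and a hypothetical two-wavelength spectrum, with no noise and no centring. The direction of the line is recovered as the loadings and the distance along it as the scores.

```python
import numpy as np

conc = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical concentrations
spectrum = np.array([0.6, 0.8])                   # hypothetical response at the two wavelengths
X = np.outer(conc, spectrum)                      # six spectra, each a point in 2D

U, s, Vt = np.linalg.svd(X, full_matrices=False)
loading = Vt[0]                                   # direction of the line (PC1)
scores = X @ loading                              # distance of each point along the line

# The sign of a PC is arbitrary; flip it so that the loadings are positive.
if loading.sum() < 0:
    loading, scores = -loading, -scores

print(loading)
print(scores)
```

Because the invented spectrum has unit length, the scores here equal the concentrations; in general they are only proportional to them.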

20
  • EXTENSIONS TO THE IDEA
  • Measurement error
  • Several wavelengths
  • Several compounds

21
Measurement error
  • Best fit straight line - statistics
  • Two PCs - the second relates to the error around
    the straight line
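
A hedged illustration of the effect of measurement error follows. It assumes NumPy, invented concentrations and spectrum, and Gaussian noise of an arbitrarily chosen size. Without centring, PC1 is the least-squares best-fit line through the origin, and PC2 picks up the scatter around it.

```python
import numpy as np

rng = np.random.default_rng(2)
conc = np.linspace(1.0, 6.0, 6)                   # hypothetical concentrations
spectrum = np.array([0.6, 0.8])                   # hypothetical two-wavelength spectrum
noise = rng.normal(scale=0.05, size=(6, 2))       # assumed measurement error
X = np.outer(conc, spectrum) + noise              # noisy points near a straight line

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s                                         # scores for PC1 and PC2
eigenvalues = (T ** 2).sum(axis=0)                # sum of squares of each scores vector

print(Vt[0])          # PC1 direction, close to the true spectrum (up to sign)
print(eigenvalues)    # PC1 is far larger than PC2, which describes the noise
```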

22
  • Several wavelengths
  • Now no longer a point in 2 dimensional space.
  • Typical spectrum: several thousand wavelengths.
  • The number of dimensions equals the number of
    wavelengths.
  • The spectra still fall (roughly) on a straight
    line.
  • Each spectrum is, e.g., a point in 1000
    dimensional space.

23
Several compounds: two compounds (A and B), two
wavelengths.
24
  • RANK AND EIGENVALUE
  • How many PCs describe a dataset?
  • Often unknown
  • How many compounds in a series of mixtures?
  • How many sources of pollution?
  • How many compounds in a reaction mixture?
  • Sometimes just a statistical concept.
  • Sometimes a mixture of physical and chemical
    factors, e.g. in a reaction mixture: compounds,
    temperature etc.

25
  • EVERY PRINCIPAL COMPONENT HAS A CORRESPONDING
    EIGENVALUE
  • The eigenvalue equals the sum of squares of the
    scores vector for each PC.
  • The more important the PC the bigger the
    eigenvalue.
  • The sum of the eigenvalues of a matrix should
    never exceed the sum of squares of the original
    matrix.
  • The sum of the eigenvalues of all significant PCs
    should approximate to the sum of squares of the
    original matrix.
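
These statements can be verified numerically. The sketch below assumes NumPy, an SVD-based PCA without centring and made-up data; it computes each eigenvalue as the sum of squares of the corresponding scores vector and compares the running total with the sum of squares of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((30, 28))                          # made-up data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

T = U * s                                         # scores for all PCs
eigenvalues = (T ** 2).sum(axis=0)                # sum of squares of each scores vector
total_ss = (X ** 2).sum()                         # sum of squares of the original matrix
cumulative = np.cumsum(eigenvalues)

print(np.allclose(eigenvalues, s ** 2))           # eigenvalue = squared singular value
print(np.all(np.diff(eigenvalues) <= 0))          # earlier PCs have larger eigenvalues
print(np.all(cumulative <= total_ss * (1 + 1e-12)))   # never exceeds the total
print(np.isclose(cumulative[-1], total_ss))       # all PCs together reach the total
```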

26
RESIDUAL SUM OF SQUARES decreases as the number
of PCs (eigenvalues) included increases. Plot log
eigenvalue versus component number. Where is the
cut-off?
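
As an illustrative sketch only, the numbers behind such a plot can be generated as follows. It assumes NumPy and a simulated two-compound mixture with a small, arbitrarily chosen noise level; the profiles C and spectra S are random, not real data.

```python
import numpy as np

rng = np.random.default_rng(4)
C = rng.random((30, 2))                           # made-up profiles of 2 compounds
S = rng.random((2, 28))                           # made-up spectra of 2 compounds
X = C @ S + rng.normal(scale=0.01, size=(30, 28)) # mixture data plus assumed noise

s = np.linalg.svd(X, compute_uv=False)            # singular values only
eigenvalues = s ** 2
total_ss = (X ** 2).sum()
residual_ss = total_ss - np.cumsum(eigenvalues)   # residual after 1, 2, 3, ... PCs
log_eigenvalues = np.log10(eigenvalues)

for a in range(5):
    print(a + 1, log_eigenvalues[a], residual_ss[a])
```

In this simulation the log eigenvalue drops sharply after the second component, which is the kind of break a cut-off looks for; real data rarely give such a clean gap.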
27
SEVERAL OTHER APPROACHES FOR THE DETERMINATION OF
THE NUMBER OF SIGNIFICANT EIGENVALUES.
28
  • SUMMARY SO FAR
  • PCA
  • Principal components: how many?
  • Scores
  • Loadings
  • Eigenvalues

29
GRAPHIC DISPLAY OF PCs. SCORES PLOT: PC2 VERSUS PC1

30
SCORES AGAINST TIME: PC1 AND PC2 VERSUS TIME
31
LOADINGS PLOT: PC2 VERSUS PC1
32
FOR REFERENCE: pure spectra
33
LOADINGS AGAINST WAVELENGTH: PC1 AND PC2 VERSUS
WAVELENGTH
34
BIPLOTS: SUPERIMPOSING SCORES AND LOADINGS PLOTS
35
  • MANY OTHER PLOTS
  • Not only PC2 versus 1, also PC3 versus 1, PC3
    versus 2 etc.
  • 3D PC plots, 3 axes, rotation etc.
  • Loadings and scores sometimes presented as bar
    graphs, not always a sequential meaning.
  • Plots of eigenvalues against component number

36
  • DATA SCALING AND PREPROCESSING
  • Influences appearance of plots
  • Column centring: common in traditional
    statistics
  • Standardisation of columns: subtract the mean and
    divide by the standard deviation.
  • If the data are of different types or on different
    absolute scales, this is an essential technique
  • Row scaling to constant total
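
The three preprocessing options listed above can be sketched as follows. This assumes NumPy and a small made-up matrix; the population standard deviation (ddof=0) is an assumption, since the slides do not say which definition is intended.

```python
import numpy as np

# Made-up data: 4 samples (rows) measured on 3 variables with very different scales.
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 180.0, 0.7],
              [3.0, 260.0, 0.2],
              [4.0, 240.0, 0.6]])

# Column centring: subtract each column mean.
centred = X - X.mean(axis=0)

# Standardisation: centre, then divide by the column standard deviation
# (population definition, ddof=0, assumed here).
standardised = centred / X.std(axis=0)

# Row scaling to constant total: each row sums to 1.
row_scaled = X / X.sum(axis=1, keepdims=True)

print(standardised.round(3))
print(row_scaled.sum(axis=1))
```

After standardisation every column has mean 0 and standard deviation 1, so no variable dominates simply because of its units.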

37
ANOTHER EXAMPLE: Grouping of elements from
fundamental properties using PCA.
38
Step 1: standardise the data. Why? The variables
are on different scales.
39
PERFORM PCA. Choose the first two PCs. Scores plot.
40
Loadings plot
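
The workflow of this example (standardise, perform PCA, keep the first two PCs for the scores and loadings plots) might be sketched as below. It assumes NumPy and uses random numbers on deliberately different scales in place of the element properties, which are not given in the transcript.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical table: 10 objects x 4 properties on very different scales.
scales = np.array([1.0, 100.0, 0.01, 1000.0])
X = rng.random((10, 4)) * scales

# Step 1: standardise each column (population standard deviation assumed).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: PCA on the standardised data via SVD.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Step 3: keep the first two PCs.
scores = (U * s)[:, :2]        # coordinates for the scores plot (PC2 versus PC1)
loadings = Vt[:2, :]           # coordinates for the loadings plot (PC2 versus PC1)
explained = s[:2] ** 2 / (s ** 2).sum()   # fraction of the sum of squares per PC

print(scores.shape, loadings.shape)       # (10, 2) (2, 4)
print(explained)
```

Plotting scores[:, 1] against scores[:, 0] would give the scores plot, and loadings[1] against loadings[0] the loadings plot.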
41
  • SUMMARY
  • Many types of plot from PCA.
  • Interpretation of the plots.
  • Preprocessing important.