Title: Multivariate Data Analysis
1 Multivariate Data Analysis
- Principal Component Analysis
2 Principal Component Analysis (PCA)
- Singular Value Decomposition
- Eigenvector / eigenvalue calculation
3 Data Matrix (I×K)
[Figure: data matrix X with I rows and K columns]
- Reduce variables
- Improve projections
- Remove noise
- Find outliers
- Find classes
4 PCA
- Example with 2 variables, 6 objects
- Find best (most informative) direction in space
- Describe direction
- Make projection
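The three steps above can be sketched numerically. The six points below are made up for illustration (the slide's own data are not given), and "most informative" is assumed to mean the direction with the largest sum of squared projections. The brute-force scan over angles is only there to make the idea concrete; it is not how PCA is computed in practice.

```python
import numpy as np

# Hypothetical data: 6 objects, 2 variables (not the values from the slide).
X = np.array([[1.0, 1.2], [2.0, 1.9], [3.0, 3.2],
              [4.0, 3.8], [5.0, 5.1], [6.0, 5.9]])
Xc = X - X.mean(axis=0)                  # mean-center each variable

# Find the best direction: scan unit vectors and keep the one whose
# projections have the largest sum of squares.
angles = np.linspace(0.0, np.pi, 1801)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
ss = ((Xc @ directions.T) ** 2).sum(axis=0)
best = directions[np.argmax(ss)]

# Describe the direction (a unit vector) and make the projection.
scores = Xc @ best
print("direction:", best)
print("scores:", scores)
```

The direction found this way should agree, up to sign and grid resolution, with the first right singular vector of the centered data.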
5 [Figure: plot with axes x1 and x2]
6 [Figure: plot with axes x1 and x2]
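7 1st PC
[Figure: the first principal component]
8 1st PC
[Figure: the first PC, with score and residual marked]
9 1st PC
[Figure: the unit vector along the first PC, with loadings p1 and p2]
10 1st PC
[Figure: the unit vector at angle a; loading p1 = cos(a), loading p2 = sin(a)]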
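The geometry of score, residual and loadings can be checked with a few lines of arithmetic. This is a minimal sketch: the angle and the object are hypothetical, and the angle a is assumed to be measured from the x1 axis.

```python
import numpy as np

a = np.deg2rad(30.0)                     # hypothetical angle between the 1st PC and x1
p = np.array([np.cos(a), np.sin(a)])     # loadings p1 = cos(a), p2 = sin(a)

x = np.array([2.0, 1.5])                 # one hypothetical mean-centered object

t = x @ p                                # score: position of the object along the PC
e = x - t * p                            # residual: the part of x the PC does not describe

print("length of p:", np.linalg.norm(p))
print("score:", t)
print("residual length:", np.linalg.norm(e))
```

Because p has unit length and the residual is perpendicular to it, the squared length of x splits into the squared score plus the squared residual length.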
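11 [Figure: data matrix X (I×K) with score vector t (length I) and loading vector p (length K); index i marked]
12 [Figure: the same diagram with index k marked]
13 [Figure: the same diagram: X, score vector t, loading vector p]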
14 $X = t_1 p_1^T + t_2 p_2^T + \dots + t_A p_A^T + E$
$X = T P^T + E$
X: properly preprocessed data matrix (I×K)
T: score matrix (I×A)
P: loading matrix (K×A)
E: residual matrix (I×K)
$t_a$: score vector
$p_a$: loading vector
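One common way to obtain T, P and E is through the singular value decomposition. The sketch below assumes random data, mean-centering as the only preprocessing, and hypothetical sizes I, K and A; the slides do not say which algorithm was used.

```python
import numpy as np

rng = np.random.default_rng(0)
I, K, A = 8, 5, 2                        # hypothetical sizes
X = rng.normal(size=(I, K))
X = X - X.mean(axis=0)                   # preprocessing: here only mean-centering

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :A] * s[:A]                     # score matrix, I x A
P = Vt[:A].T                             # loading matrix, K x A
E = X - T @ P.T                          # residual matrix, I x K

# The matrix form and the sum of rank-one terms t_a p_a^T agree.
X_sum = sum(np.outer(T[:, a], P[:, a]) for a in range(A)) + E
print(T.shape, P.shape, E.shape)
print(np.allclose(X, X_sum))
```

With this construction the loading vectors are orthonormal and the score vectors are mutually orthogonal.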
15 The Wine Example
People magazine
Wise & Gallagher
16 Data table (rows: countries; columns: variables)
              Wine      Beer    Spirit    LifeEx    HeartD
France     63.5000   40.1000    2.5000   78.0000   61.1000
Italy      58.0000   25.1000    0.9000   78.0000   94.1000
Switz      46.0000   65.0000    1.7000   78.0000  106.4000
Austra     15.7000  102.1000    1.2000   78.0000  173.0000
Brit       12.2000  100.0000    1.5000   77.0000  199.7000
U.S.A.      8.9000   87.8000    2.0000   76.0000  176.0000
Russia      2.7000   17.1000    3.8000   69.0000  373.6000
Czech       1.7000  140.0000    1.0000   73.0000  283.7000
Japan       1.0000   55.0000    2.1000   79.0000   34.7000
Mexico      0.2000   50.4000    0.8000   73.0000   36.4000
17 Mean and standard deviation of each variable
                        Wine      Beer    Spirit    LifeEx    HeartD
Mean                 20.9900   68.2600    1.7500   75.9000  153.8700
Standard deviation   24.9270   38.6718    0.9132    3.2128  110.8182
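The means and standard deviations can be recomputed from the data table. In this sketch the values are taken in the order Wine, Beer, Spirit, LifeEx, HeartD, and the standard deviation is the sample version (dividing by I - 1), which is the choice that matches the numbers on the slide. The autoscaling step at the end is an assumption about how these statistics were then used.

```python
import numpy as np

countries = ["France", "Italy", "Switz", "Austra", "Brit",
             "U.S.A.", "Russia", "Czech", "Japan", "Mexico"]
variables = ["Wine", "Beer", "Spirit", "LifeEx", "HeartD"]
X = np.array([
    [63.5,  40.1, 2.5, 78.0,  61.1],
    [58.0,  25.1, 0.9, 78.0,  94.1],
    [46.0,  65.0, 1.7, 78.0, 106.4],
    [15.7, 102.1, 1.2, 78.0, 173.0],
    [12.2, 100.0, 1.5, 77.0, 199.7],
    [ 8.9,  87.8, 2.0, 76.0, 176.0],
    [ 2.7,  17.1, 3.8, 69.0, 373.6],
    [ 1.7, 140.0, 1.0, 73.0, 283.7],
    [ 1.0,  55.0, 2.1, 79.0,  34.7],
    [ 0.2,  50.4, 0.8, 73.0,  36.4],
])

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)              # sample standard deviation (divides by I - 1)
Xs = (X - mean) / std                    # autoscaled data: mean 0, standard deviation 1

for name, m, sd in zip(variables, mean, std):
    print(f"{name:7s} mean {m:9.4f}  std {sd:9.4f}")
```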
18 [Figure: singular value against component; λ1 = 46%, followed by 32%, 12%, 8% and 2% for components 2 to 5]
19 [Figure: score plot, Score 2 (32%) against Score 1 (46%), with the countries labelled: Czech, Brit, Austral, Mex, USA, Japan, Switz, Italy, France, Russia]
20 [Figure: loading plot, Loading 2 against Loading 1, with the variables labelled: Beer, Life exp., Heart dis., Wine, Spirit]
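The numbers behind the score and loading plots can be recomputed from the wine data. This is a minimal sketch, assuming the data were autoscaled before the decomposition and that the percentages are shares of the total sum of squares. The signs of scores and loadings are arbitrary, so a plot made from this output may be mirrored relative to the slides.

```python
import numpy as np

countries = ["France", "Italy", "Switz", "Austra", "Brit",
             "U.S.A.", "Russia", "Czech", "Japan", "Mexico"]
variables = ["Wine", "Beer", "Spirit", "LifeEx", "HeartD"]
X = np.array([
    [63.5,  40.1, 2.5, 78.0,  61.1],
    [58.0,  25.1, 0.9, 78.0,  94.1],
    [46.0,  65.0, 1.7, 78.0, 106.4],
    [15.7, 102.1, 1.2, 78.0, 173.0],
    [12.2, 100.0, 1.5, 77.0, 199.7],
    [ 8.9,  87.8, 2.0, 76.0, 176.0],
    [ 2.7,  17.1, 3.8, 69.0, 373.6],
    [ 1.7, 140.0, 1.0, 73.0, 283.7],
    [ 1.0,  55.0, 2.1, 79.0,  34.7],
    [ 0.2,  50.4, 0.8, 73.0,  36.4],
])

# Autoscale (an assumption about the preprocessing used on the slides).
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
percent = 100 * s**2 / np.sum(s**2)      # share of the total sum of squares per component

T = U[:, :2] * s[:2]                     # scores: one row per country
P = Vt[:2].T                             # loadings: one row per variable

print("percent per component:", np.round(percent, 1))
for c, t in zip(countries, T):
    print(f"{c:7s} {t[0]:7.2f} {t[1]:7.2f}")
for v, p in zip(variables, P):
    print(f"{v:7s} {p[0]:7.2f} {p[1]:7.2f}")
```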
21 Conclusions
- Scores: positions of objects in multivariate space
- Loadings: importance of original variables for new directions
- Try to explain a large enough portion of X (46 + 32 = 78%)
22 The Apricot Example
Manley & Geladi
23 [Figure: apricot (Afrikaans: appelkoos) spectra, pseudoabsorbance against wavelength (nm)]
24 [Figure: scree plot, singular value against component number]
25 What is rank?
- Mathematical rank: at most min(I, K)
- Gives zero residual
- Effective rank: A
- Separates model from noise
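The difference between the two ranks shows up clearly on simulated data. In the sketch below the sizes, the noise level and the true number of components are all hypothetical: a matrix with two underlying components plus small noise has full mathematical rank, yet only two singular values stand out.

```python
import numpy as np

rng = np.random.default_rng(1)
I, K, A = 20, 6, 2                                   # hypothetical sizes; A = true structure
signal = rng.normal(size=(I, A)) @ rng.normal(size=(A, K))
X = signal + 0.01 * rng.normal(size=(I, K))          # small noise makes X full rank

U, s, Vt = np.linalg.svd(X, full_matrices=False)

def residual_ss(n_comp):
    """Sum of squares left after keeping n_comp components."""
    X_hat = (U[:, :n_comp] * s[:n_comp]) @ Vt[:n_comp]
    return np.sum((X - X_hat) ** 2)

print("mathematical rank:", np.linalg.matrix_rank(X), "limit:", min(I, K))
print("singular values:", np.round(s, 3))
print("residual SS with A components:", residual_ss(A))
print("residual SS with min(I, K) components:", residual_ss(min(I, K)))
```

Using all min(I, K) components leaves a residual that is zero up to rounding, while stopping at A leaves only the noise in the residual.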
26 ANOVA
Comp          SS    SS (%)   SScum (%)
1        68.8269     98.10       98.10
2         1.2843      1.83       99.93
3         0.0463      0.07      100
4         0.0045      0.01
5         0.0007      0.00
6         0.0003      0.00
7         0.0002      0.00
8         0.0001      0.00
9         0.0000      0.00
10        0.0000      0.00
Total    70.1634    100
27 [Figure: score plot, Score 2 (2%) against Score 1 (98%)]
28 ANOVA
- $SS_{tot} = SS_1 + SS_2 + SS_3 + \dots + SS_{(I \text{ or } K)}$
  $SS_{tot} = \lambda_1 + \lambda_2 + \lambda_3 + \dots + \lambda_{(I \text{ or } K)}$
  From largest to smallest!
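The decomposition of the total sum of squares can be verified on any centered matrix. The data below are random and purely illustrative. The squared singular values play the role of the eigenvalues, and NumPy returns them already sorted from largest to smallest.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(7, 4))              # hypothetical data, I = 7, K = 4
X = X - X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)   # singular values, largest first
eigenvalues = s ** 2                     # lambda_a = SS_a

ss_tot = np.sum(X ** 2)
print("SS per component:", eigenvalues)
print("sum of components:", eigenvalues.sum(), "SStot:", ss_tot)
```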
29 ANOVA
- $X = T P^T + E$
- data = model + residual
- $SS_{tot} = SS_{mod} + SS_{res}$
- $R^2 = SS_{mod} / SS_{tot} = 1 - SS_{res} / SS_{tot}$
- Coefficient of determination (often in %)
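As a worked example of the R2 formula, the sums of squares from the ANOVA table on slide 26 can be plugged in directly. The helper name `r_squared` is only illustrative.

```python
# SS per component and total, copied from the ANOVA table on slide 26.
ss = [68.8269, 1.2843, 0.0463, 0.0045, 0.0007,
      0.0003, 0.0002, 0.0001, 0.0000, 0.0000]
ss_tot = 70.1634

def r_squared(n_comp):
    """R^2 = SSmod / SStot for a model with n_comp components."""
    ss_mod = sum(ss[:n_comp])
    return ss_mod / ss_tot

for a in (1, 2, 3):
    r2 = r_squared(a)
    print(f"{a} comp.: R2 = {100 * r2:.2f} %, SSres = {100 * (1 - r2):.2f} %")
```

With two components this gives the 99.93% quoted on the slides. With three it comes out just below 100%, which the table rounds to 100.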
30 Examples
- Wines: $R^2 = SS_{mod}$ = 78%, $SS_{res}$ = 22%, 2 components
- Apricots 1: $R^2 = SS_{mod}$ = 99.93%, $SS_{res}$ = 0.07%, 2 components
- Apricots 2: $R^2 = SS_{mod}$ = 100%, $SS_{res}$ = 0.0%, 3 components
31 [Figure: spectra with outliers removed, absorbance against wavelength (nm)]
32 No outliers
[Figure: singular values against component; λ1 = 81%, followed by 16% and 3% for components 2 and 3]
33 [Figure: score plot, Score 3 (3%) against Score 2 (16%), with groups labelled: Whole fruit, No kernel, Thin slice]
34 [Figure: loadings 2 and 3 against wavelength (nm)]
35 [Figure: Loading 3 against Loading 2]
36 More nomenclature
- Score = Latent Variable
- Loading vector = Eigenvector
- Effective rank = Pseudorank = Model dimensionality = Number of components
- $SS_a$ = Eigenvalue
- Singular value = $SS_a^{1/2}$
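These equivalences can be checked numerically. The sketch below uses random, mean-centered data (hypothetical) and compares the singular value decomposition of X with the eigendecomposition of $X^T X$; eigenvectors are only defined up to sign, so absolute values are compared.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(9, 4))              # hypothetical data
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)           # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder: largest first

T = U * s                                # scores (latent variables)
print(np.allclose(s, np.sqrt(eigvals)))              # singular value = SSa^(1/2)
print(np.allclose((T ** 2).sum(axis=0), eigvals))    # SSa = eigenvalue
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))    # loading vectors = eigenvectors
```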
37 An analysis sequence
- 1. Scale, mean-center data
- 2. Calculate a few components
- 3. Check scores, loadings
- 4. Find outliers, groupings, explain
- 5. Remove outliers
38 An analysis sequence
- 6. Scale, mean-center data
- 7. Calculate enough components
- 8. Try to determine pseudorank
- 9. Check score plots
- 10. Check loading plots
- 11. Check residuals
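The sequence above can be sketched end to end on simulated data. Everything specific here is an assumption made for illustration: the data are synthetic with one deliberately odd object, the helper names `autoscale` and `pca` are hypothetical, and the cut-off of three times the median residual standard deviation is an arbitrary stand-in for the visual inspection the slides describe.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(X, n_comp):
    """Return scores T, loadings P and residuals E for n_comp components."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_comp] * s[:n_comp]
    P = Vt[:n_comp].T
    return T, P, X - T @ P.T

rng = np.random.default_rng(4)
# Hypothetical data: 30 objects, 6 variables, 2 underlying components.
loadings = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
X = rng.normal(size=(30, 2)) @ loadings + 0.1 * rng.normal(size=(30, 6))
X[0] += np.array([2.0, -2.0, 0.0, 2.0, -2.0, 0.0])   # one object that breaks the pattern

# Steps 1-4: scale, fit a few components, look for outliers.
T, P, E = pca(autoscale(X), n_comp=2)
resid_sd = E.std(axis=1, ddof=1)         # residual standard deviation per object
limit = 3.0 * np.median(resid_sd)        # hypothetical cut-off, not from the slides
keep = resid_sd <= limit

# Steps 5-11: remove outliers, rescale, refit, then check scores, loadings, residuals.
T2, P2, E2 = pca(autoscale(X[keep]), n_comp=2)
print("removed objects:", np.where(~keep)[0])
print("residual SS before / after:", np.sum(E ** 2), np.sum(E2 ** 2))
```

Rescaling after removing objects matters because the means and standard deviations themselves change once the outliers are gone.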
39 Wines
[Figure: residual standard deviation; axis ticks 0 to 4]
40 Wines
[Figure: residual standard deviation; axis ticks 0 to 4]