Title: Robust PCA in Stata
1. Robust PCA in Stata
- Vincenzo Verardi (vverardi_at_fundp.ac.be)
- FUNDP (Namur) and ULB (Brussels), Belgium
- FNRS Associate Researcher
2. Principal component analysis
PCA transforms a set of correlated variables
into a smaller set of uncorrelated variables
(principal components). For p random variables
$X_1, \ldots, X_p$, the goal of PCA is to construct a new
set of p axes in the directions of greatest
variability.
3–6. Principal component analysis
[Figures: a scatter of data in the (X1, X2) plane, with candidate axes rotated step by step toward the direction of greatest variability]
7. Principal component analysis
Hence, for the first principal component, the
goal is to find a linear transformation
$Y = \alpha_1 X_1 + \alpha_2 X_2 + \cdots + \alpha_p X_p = \alpha^T X$
such that the variance of Y, $\mathrm{Var}(\alpha^T X) = \alpha^T \Sigma \alpha$,
is maximal. The direction of $\alpha$ is given by the eigenvector
corresponding to the largest eigenvalue of the matrix S.
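As a quick check of this eigenvector characterization, a minimal Mata sketch; the dataset and variables are arbitrary choices for illustration:

* the first PC direction is the eigenvector of the correlation
* matrix associated with the largest eigenvalue
sysuse auto, clear
correlate price mpg weight length
matrix C = r(C)
mata:
    C = st_matrix("C")
    symeigensystem(C, X=., L=.)   // eigenvectors in X, eigenvalues in L
    L                             // eigenvalues, in decreasing order
    X[., 1]                       // direction of the first principal component
end

The same loadings come out of Stata's pca command (which analyzes the correlation matrix by default); the sketch only makes the eigen-decomposition explicit.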
8. Classical PCA
- The second vector (orthogonal to the first) is
the one along which the variance is second highest.
This corresponds to the eigenvector associated with
the second largest eigenvalue
- And so on
9. Classical PCA
- The new variables (PCs) have a variance equal to
their corresponding eigenvalue: $\mathrm{Var}(Y_i) = \lambda_i$ for all $i = 1, \ldots, p$
- The relative variance explained by each PC is
given by $\lambda_i / \sum_{j=1}^{p} \lambda_j$
10. Number of PCs
- How many PCs should be retained?
- Enough PCs to reach a cumulative explained
variance of at least 60–70% of the total
- Kaiser criterion: keep PCs with eigenvalues > 1
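In Stata these quantities come directly out of pca; a minimal sketch on an arbitrary illustrative dataset:

* eigenvalues, proportion of variance explained, and the Kaiser criterion
sysuse auto, clear
pca price mpg weight length trunk
* the output lists each eigenvalue and the cumulative proportion;
* the Kaiser criterion keeps components with eigenvalue > 1
screeplot, yline(1)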
11. Drawback of PCA
- PCA is based on the classical covariance matrix,
which is sensitive to outliers. Illustration:
12. Drawback of PCA
- PCA is based on the classical covariance matrix,
which is sensitive to outliers. Illustration:
. matrix C = (1, .7, .6 \ .7, 1, .5 \ .6, .5, 1)
. set obs 1000
. drawnorm x1-x3, corr(C)
. matrix list C
symmetric C[3,3]
     c1   c2   c3
r1    1
r2   .7    1
r3   .6   .5    1
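The slides then contaminate this clean sample to show the distortion. A hedged reconstruction of that step, mirroring the contamination scheme used on slide 32 (the shift of +5 and the 10% share are assumptions here):

* assumed contamination: overwrite the first 100 observations of x1
replace x1 = invnorm(uniform()) + 5 in 1/100
* classical PCA on the contaminated data; the loadings no longer
* reflect the clean correlation structure in C
pca x1-x3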
13–15. Drawback of PCA
[Figures: classical PCA output on the contaminated sample, illustrating how the outliers distort the components]
16. Minimum Covariance Determinant
This drawback can easily be solved by basing the
PCA on a robust estimation of the covariance
(correlation) matrix. A well-suited method for
this is MCD, which considers all subsets containing
h of the observations (generally 50%) and
estimates S and µ on the data of the subset
associated with the smallest covariance matrix
determinant. Intuition:
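A minimal Mata sketch of the objective being minimized; the toy sample stands in for one candidate subset (real MCD searches over all h-subsets, or a subsample of them):

mata:
    // determinant of the covariance matrix of one candidate subset;
    // MCD keeps the h-subset for which this determinant is smallest
    X = rnormal(100, 3, 0, 1)   // toy 100 x 3 subset
    S = variance(X)             // covariance matrix of the subset
    det(S)                      // the MCD objective
end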
17. Generalized Variance
The generalized variance, proposed by Wilks
(1932), is a one-dimensional measure of
multidimensional scatter. It is defined as
$GV = \det(\Sigma)$. In the 2×2 case it is easy to see
the underlying idea: $\det(\Sigma) = \sigma_1^2 \sigma_2^2 - \sigma_{12}^2 = \sigma_1^2 \sigma_2^2 (1 - \rho^2)$,
which is small when the variances are small, i.e. when the data points are tightly concentrated.
18. Minimum Covariance Determinant
Remember, MCD considers all subsets containing
50% of the observations. However, if N = 200, the
number of subsets to consider would
be $\binom{200}{100} \approx 9 \times 10^{58}$. Solution: use subsampling algorithms.
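Stata's comb() function makes the combinatorial explosion concrete:

* number of ways to choose 100 observations out of 200
display %12.0g comb(200,100)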
19. Fast-MCD Stata code
- The implemented algorithm: Rousseeuw and Van Driessen (1999)
- P-subset
- Concentration (sorting distances)
- Estimation of the robust covariance matrix S_MCD
- Estimation of robust PCA
20. P-subset
Consider a number of subsets containing (p+1)
points (where p is the number of variables),
sufficiently large to be sure that at least one
of the subsets does not contain
outliers. Calculate the covariance matrix on each
subset and keep the one with the smallest
determinant. Do some fine-tuning to get closer to
the global solution.
21–27. Number of subsets
The minimal number N of subsets we need to have a
probability Pr of getting at least one clean subset,
if a fraction α of outliers (the contamination)
corrupts the dataset, can easily be derived:
- $(1-\alpha)$ is the probability that one random point in
the dataset is not an outlier
- $(1-\alpha)^p$ is the probability that none of the p random
points in a p-subset is an outlier
- $1-(1-\alpha)^p$ is the probability that at least one of the
p random points in a p-subset is an outlier
- $\left(1-(1-\alpha)^p\right)^N$ is the probability that there is at least
one outlier in each of the N p-subsets considered
(i.e. that all p-subsets are corrupt)
- $\Pr = 1-\left(1-(1-\alpha)^p\right)^N$ is the probability that there is at least
one clean p-subset among the N considered
Rearranging we have $N = \log(1-\Pr) / \log\left(1-(1-\alpha)^p\right)$.
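A quick Stata computation under assumed values (p = 5 variables, 20% contamination, a 99% success target), which comes out at about a dozen subsets:

* number of p-subsets needed (assumptions: p = 5, alpha = .2, Pr = .99)
local p = 5
local alpha = .2
local Pr = .99
display ceil(log(1 - `Pr') / log(1 - (1 - `alpha')^`p'))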
28. Concentration steps
The preliminary p-subset step allowed us to estimate
a preliminary S and µ. Calculate Mahalanobis
distances using S and µ for all
individuals. Mahalanobis distances are defined as
$MD_i = \sqrt{(x_i - \mu)^T S^{-1} (x_i - \mu)}$. Squared MDs are distributed as $\chi^2_p$ for Gaussian data.
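A minimal Mata sketch of the distance computation on toy data (purely illustrative):

mata:
    // squared Mahalanobis distance of each row of X from (mu, S)
    X  = rnormal(200, 3, 0, 1)
    mu = mean(X)
    S  = variance(X)
    d2 = rowsum(((X :- mu) * invsym(S)) :* (X :- mu))
end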
29. Concentration steps
The preliminary p-subset step allowed us to estimate
a preliminary S and µ. Calculate Mahalanobis
distances using S and µ for all
individuals. Sort individuals according to
Mahalanobis distances and re-estimate S and µ
using the first 50% of observations. Repeat the
previous step until convergence.
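One concentration step, continuing the previous sketch (X and d2 persist in the Mata workspace; h set to 50% of the toy sample, matching the slide):

mata:
    // keep the h observations closest under the current (mu, S)
    // and re-estimate; iterating this is the concentration loop
    h  = 100                         // 50% of the 200 toy observations
    o  = order(d2, 1)                // permutation sorting d2 ascending
    Xh = X[o[1::h], .]
    mu = mean(Xh)                    // concentrated estimates
    S  = variance(Xh)
end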
30. Robust S in Stata
In Stata, Hadi's method is available to estimate
a robust covariance matrix. Unfortunately it is
not very robust. The reason for this is simple: it
relies on a non-robust preliminary estimation of
the covariance matrix.
31. Hadi's Stata code
- Compute a variant of MD.
- Sort individuals according to this distance. Use the
subset with the first p+1 points to re-estimate µ
and S.
- Compute MD and sort the data.
- Check if the first point out of the subset is an
outlier. If not, add this point to the subset and
repeat steps 3 and 4. Otherwise stop.
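In Stata this is the hadimvo command; a minimal usage sketch (variable names assumed for illustration):

* flag multivariate outliers with Hadi's method: gen() stores a 0/1
* outlier dummy and the final distance, p() sets the significance cutoff
hadimvo x1 x2 x3, gen(outlier dist) p(0.05)
list if outlier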
32. Fast-MCD vs hadimvo
clear
set obs 1000
local b = sqrt(invchi2(5, 0.95))
drawnorm x1-x5
replace x1 = invnorm(uniform()) + 5 in 1/100
mcd x*, outlier
gen RD = Robust_distance
hadimvo x*, gen(a b) p(0.5)
scatter RD b, xline(`b') yline(`b')
33. Illustration
[Figures: distance plots for Fast-MCD and for Hadi's method]
34–35. PCA without outliers
[Figures: classical PCA results on clean data]
36–38. PCA with outliers
[Figures: classical PCA results on contaminated data]
39–41. Robust PCA with outliers
[Figures: robust PCA results on contaminated data]
42. Application: rankings of universities
QUESTION: Can a single indicator accurately sum
up research excellence?
GOAL: Determine the underlying factors measured
by the variables used in the Shanghai ranking
→ principal component analysis
43. ARWU variables
- Alumni: alumni recipients of the Nobel Prize or the Fields Medal
- Award: current faculty Nobel laureates and Fields Medal winners
- HiCi: highly cited researchers
- NS: articles published in Nature and Science
- PUB: articles in the Science Citation Index-Expanded and the Social Science Citation Index
44–46. PCA analysis (on TOP 150)
[Figures: classical PCA output for the top-150 universities]
47. ARWU variables
The first component accounts for 68% of the
inertia and is given by
$F_1 = 0.42\,Alumni + 0.44\,Award + 0.48\,HiCi + 0.50\,NS + 0.38\,PUB$

Variable      Corr(F1, Xi)
Alumni        0.78
Awards        0.81
HiCi          0.89
NS            0.92
PUB           0.70
Total score   0.99
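A hedged sketch of how such a component and these correlations could be reproduced in Stata (the variable names alumni, award, hici, ns, pub, and total are assumptions, not the authors' dataset):

* first-component scores and their correlation with each indicator
pca alumni award hici ns pub
predict f1, score
correlate f1 alumni award hici ns pub total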
48–51. Robust PCA analysis (on TOP 150)
[Figures: robust PCA output for the top-150 universities]
52. ARWU variables
Two underlying factors are uncovered: F1 explains
38% of the inertia and F2 explains 28%.

Variable      Corr(F1, Xi)   Corr(F2, Xi)
Alumni        -0.05           0.78
Awards        -0.01           0.83
HiCi           0.74           0.88
NS             0.63           0.95
PUB            0.72           0.63
Total score    0.99           0.47
53. Conclusion
Classical PCA can be heavily distorted by the
presence of outliers. A robustified version of
PCA can be obtained either by relying on a
robust covariance matrix or by removing
multivariate outliers identified through a robust
identification method.