Title: Robust PCA in Stata
1. Robust PCA in Stata
- Vincenzo Verardi (vverardi_at_fundp.ac.be)
- FUNDP (Namur) and ULB (Brussels), Belgium
- FNRS Associate Researcher
2. Principal component analysis
PCA transforms a set of correlated variables
into a smaller set of uncorrelated variables
(principal components). For p random variables
$X_1, \ldots, X_p$, the goal of PCA is to construct a new
set of p axes in the directions of greatest
variability.
3–6. Principal component analysis
[Figures: a scatter of data in the (X1, X2) plane, with candidate axes rotated step by step toward the direction of greatest variability]
7. Principal component analysis
Hence, for the first principal component, the
goal is to find a linear transformation
$Y = \alpha_1 X_1 + \alpha_2 X_2 + \cdots + \alpha_p X_p = \alpha^T X$
such that the variance of Y, $\mathrm{Var}(\alpha^T X) = \alpha^T \Sigma \alpha$,
is maximal. The direction of $\alpha$ is given by the eigenvector
corresponding to the largest eigenvalue of the matrix S.
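As a quick check of this eigenvector characterization, a minimal Mata sketch; the dataset and variables are arbitrary choices for illustration:

* the first PC direction is the eigenvector of the correlation
* matrix associated with the largest eigenvalue
sysuse auto, clear
correlate price mpg weight length
matrix C = r(C)
mata:
    C = st_matrix("C")
    symeigensystem(C, X=., L=.)   // eigenvectors in X, eigenvalues in L
    L                             // eigenvalues, in decreasing order
    X[., 1]                       // direction of the first principal component
end

The same loadings come out of Stata's pca command (which analyzes the correlation matrix by default); the sketch only makes the eigen-decomposition explicit.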
8. Classical PCA
- The second vector (orthogonal to the first) is
the one along which the variance is second highest.
This corresponds to the eigenvector associated with
the second largest eigenvalue
- And so on
9. Classical PCA
- The new variables (PCs) have a variance equal to
their corresponding eigenvalue: $\mathrm{Var}(Y_i) = \lambda_i$ for all $i = 1, \ldots, p$
- The relative variance explained by each PC is
given by $\lambda_i / \sum_{j=1}^{p} \lambda_j$
10. Number of PCs
- How many PCs should be retained?
- Enough PCs to reach a cumulative explained
variance of at least 60–70% of the total
- Kaiser criterion: keep PCs with eigenvalues > 1
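In Stata these quantities come directly out of pca; a minimal sketch on an arbitrary illustrative dataset:

* eigenvalues, proportion of variance explained, and the Kaiser criterion
sysuse auto, clear
pca price mpg weight length trunk
* the output lists each eigenvalue and the cumulative proportion;
* the Kaiser criterion keeps components with eigenvalue > 1
screeplot, yline(1)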
11. Drawback of PCA
- PCA is based on the classical covariance matrix,
which is sensitive to outliers. Illustration:
12. Drawback of PCA
- PCA is based on the classical covariance matrix,
which is sensitive to outliers. Illustration:
. matrix C = (1, .7, .6 \ .7, 1, .5 \ .6, .5, 1)
. set obs 1000
. drawnorm x1-x3, corr(C)
. matrix list C
symmetric C[3,3]
     c1   c2   c3
r1    1
r2   .7    1
r3   .6   .5    1
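The slides then contaminate this clean sample to show the distortion. A hedged reconstruction of that step, mirroring the contamination scheme used on slide 32 (the shift of +5 and the 10% share are assumptions here):

* assumed contamination: overwrite the first 100 observations of x1
replace x1 = invnorm(uniform()) + 5 in 1/100
* classical PCA on the contaminated data; the loadings no longer
* reflect the clean correlation structure in C
pca x1-x3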
13–15. Drawback of PCA
[Figures: classical PCA output on the contaminated sample, illustrating how the outliers distort the components]
16. Minimum Covariance Determinant
This drawback can easily be solved by basing the
PCA on a robust estimation of the covariance
(correlation) matrix. A well-suited method for
this is MCD, which considers all subsets containing
h of the observations (generally 50%) and
estimates S and µ on the data of the subset
associated with the smallest covariance matrix
determinant. Intuition:
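A minimal Mata sketch of the objective being minimized; the toy sample stands in for one candidate subset (real MCD searches over all h-subsets, or a subsample of them):

mata:
    // determinant of the covariance matrix of one candidate subset;
    // MCD keeps the h-subset for which this determinant is smallest
    X = rnormal(100, 3, 0, 1)   // toy 100 x 3 subset
    S = variance(X)             // covariance matrix of the subset
    det(S)                      // the MCD objective
end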
17. Generalized Variance
The generalized variance, proposed by Wilks
(1932), is a one-dimensional measure of
multidimensional scatter. It is defined as
$GV = \det(\Sigma)$. In the 2×2 case it is easy to see
the underlying idea: $\det(\Sigma) = \sigma_1^2 \sigma_2^2 - \sigma_{12}^2 = \sigma_1^2 \sigma_2^2 (1 - \rho^2)$,
which is small when the variances are small, i.e. when the data points are tightly concentrated.
18. Minimum Covariance Determinant
Remember, MCD considers all subsets containing
50% of the observations. However, if N = 200, the
number of subsets to consider would
be $\binom{200}{100} \approx 9 \times 10^{58}$. Solution: use subsampling algorithms.
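Stata's comb() function makes the combinatorial explosion concrete:

* number of ways to choose 100 observations out of 200
display %12.0g comb(200,100)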
19. Fast-MCD Stata code
- The implemented algorithm: Rousseeuw and Van Driessen (1999)
- P-subset
- Concentration (sorting distances)
- Estimation of the robust covariance matrix S_MCD
- Estimation of robust PCA
20. P-subset
Consider a number of subsets containing (p+1)
points (where p is the number of variables),
sufficiently large to be sure that at least one
of the subsets does not contain
outliers. Calculate the covariance matrix on each
subset and keep the one with the smallest
determinant. Do some fine-tuning to get closer to
the global solution.
21–27. Number of subsets
The minimal number N of subsets we need to have a
probability Pr of getting at least one clean subset,
if a fraction α of outliers (the contamination)
corrupts the dataset, can easily be derived:
- $(1-\alpha)$ is the probability that one random point in
the dataset is not an outlier
- $(1-\alpha)^p$ is the probability that none of the p random
points in a p-subset is an outlier
- $1-(1-\alpha)^p$ is the probability that at least one of the
p random points in a p-subset is an outlier
- $\left(1-(1-\alpha)^p\right)^N$ is the probability that there is at least
one outlier in each of the N p-subsets considered
(i.e. that all p-subsets are corrupt)
- $\Pr = 1-\left(1-(1-\alpha)^p\right)^N$ is the probability that there is at least
one clean p-subset among the N considered
Rearranging we have $N = \log(1-\Pr) / \log\left(1-(1-\alpha)^p\right)$.
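A quick Stata computation under assumed values (p = 5 variables, 20% contamination, a 99% success target), which comes out at about a dozen subsets:

* number of p-subsets needed (assumptions: p = 5, alpha = .2, Pr = .99)
local p = 5
local alpha = .2
local Pr = .99
display ceil(log(1 - `Pr') / log(1 - (1 - `alpha')^`p'))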
28. Concentration steps
The preliminary p-subset step allowed us to estimate
a preliminary S and µ. Calculate Mahalanobis
distances using S and µ for all
individuals. Mahalanobis distances are defined as
$MD_i = \sqrt{(x_i - \mu)^T S^{-1} (x_i - \mu)}$. Squared MDs are distributed as $\chi^2_p$ for Gaussian data.
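A minimal Mata sketch of the distance computation on toy data (purely illustrative):

mata:
    // squared Mahalanobis distance of each row of X from (mu, S)
    X  = rnormal(200, 3, 0, 1)
    mu = mean(X)
    S  = variance(X)
    d2 = rowsum(((X :- mu) * invsym(S)) :* (X :- mu))
end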
29. Concentration steps
The preliminary p-subset step allowed us to estimate
a preliminary S and µ. Calculate Mahalanobis
distances using S and µ for all
individuals. Sort individuals according to
Mahalanobis distances and re-estimate S and µ
using the first 50% of observations. Repeat the
previous step until convergence.
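One concentration step, continuing the previous sketch (X and d2 persist in the Mata workspace; h set to 50% of the toy sample, matching the slide):

mata:
    // keep the h observations closest under the current (mu, S)
    // and re-estimate; iterating this is the concentration loop
    h  = 100                         // 50% of the 200 toy observations
    o  = order(d2, 1)                // permutation sorting d2 ascending
    Xh = X[o[1::h], .]
    mu = mean(Xh)                    // concentrated estimates
    S  = variance(Xh)
end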
30. Robust S in Stata
In Stata, Hadi's method is available to estimate
a robust covariance matrix. Unfortunately it is
not very robust. The reason for this is simple: it
relies on a non-robust preliminary estimation of
the covariance matrix.
31. Hadi's Stata code
- Compute a variant of MD.
- Sort individuals according to this distance. Use the
subset with the first p+1 points to re-estimate µ
and S.
- Compute MD and sort the data.
- Check if the first point out of the subset is an
outlier. If not, add this point to the subset and
repeat steps 3 and 4. Otherwise stop.
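In Stata this is the hadimvo command; a minimal usage sketch (variable names assumed for illustration):

* flag multivariate outliers with Hadi's method: gen() stores a 0/1
* outlier dummy and the final distance, p() sets the significance cutoff
hadimvo x1 x2 x3, gen(outlier dist) p(0.05)
list if outlier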
32. Fast-MCD vs hadimvo
clear
set obs 1000
local b = sqrt(invchi2(5, 0.95))
drawnorm x1-x5
replace x1 = invnorm(uniform()) + 5 in 1/100
mcd x*, outlier
gen RD = Robust_distance
hadimvo x*, gen(a b) p(0.5)
scatter RD b, xline(`b') yline(`b')
33. Illustration
[Figures: distance plots for Fast-MCD and for Hadi's method]
34–35. PCA without outliers
[Figures: classical PCA results on clean data]
36–38. PCA with outliers
[Figures: classical PCA results on contaminated data]
39–41. Robust PCA with outliers
[Figures: robust PCA results on contaminated data]
42. Application: rankings of universities
QUESTION: Can a single indicator accurately sum
up research excellence?
GOAL: Determine the underlying factors measured
by the variables used in the Shanghai ranking
→ principal component analysis
43. ARWU variables
- Alumni: alumni recipients of the Nobel Prize or the Fields Medal
- Award: current faculty Nobel laureates and Fields Medal winners
- HiCi: highly cited researchers
- NS: articles published in Nature and Science
- PUB: articles in the Science Citation Index-Expanded and the Social Science Citation Index
44–46. PCA analysis (on TOP 150)
[Figures: classical PCA output for the top-150 universities]
47. ARWU variables
The first component accounts for 68% of the
inertia and is given by
$F_1 = 0.42\,Alumni + 0.44\,Award + 0.48\,HiCi + 0.50\,NS + 0.38\,PUB$

Variable      Corr(F1, Xi)
Alumni        0.78
Awards        0.81
HiCi          0.89
NS            0.92
PUB           0.70
Total score   0.99
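A hedged sketch of how such a component and these correlations could be reproduced in Stata (the variable names alumni, award, hici, ns, pub, and total are assumptions, not the authors' dataset):

* first-component scores and their correlation with each indicator
pca alumni award hici ns pub
predict f1, score
correlate f1 alumni award hici ns pub total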
48–51. Robust PCA analysis (on TOP 150)
[Figures: robust PCA output for the top-150 universities]
52. ARWU variables
Two underlying factors are uncovered: F1 explains
38% of the inertia and F2 explains 28%.

Variable      Corr(F1, Xi)   Corr(F2, Xi)
Alumni        -0.05           0.78
Awards        -0.01           0.83
HiCi           0.74           0.88
NS             0.63           0.95
PUB            0.72           0.63
Total score    0.99           0.47
53. Conclusion
Classical PCA can be heavily distorted by the
presence of outliers. A robustified version of
PCA can be obtained either by relying on a
robust covariance matrix or by removing
multivariate outliers identified through a robust
identification method.