Title: Matrix Decompositions in Dimension Reduction for Undersampled Clustered Data
1. Matrix Decompositions in Dimension Reduction for Undersampled Clustered Data
- Haesun Park
- 8803: Numerical Methods in CSE
2. Cluster Structure Preserving Dimension Reduction for Feature Extraction
- Algorithms: LDA/GSVD, Orthogonal Centroid, extension to kernel-based nonlinear methods
- Applications: text classification, face recognition, fingerprint classification
- Experimental results: classification with kNN, SVM, and centroid-based classification
3. 2D Visualization
Importance of utilizing cluster structure.
[Figure: 2D representation of 150 x 1000 data with 7 clusters, LDA vs. SVD]
4. Facial Recognition
AT&T (ORL) Face Database
- image size: 92 x 112
- 400 frontal images: 40 persons, 10 images each, with variations in pose and facial expression
- severely undersampled
[Figure: the 1st and the 35th sample subjects]
5. Dimension Reduction
Pipeline:
1. Data preprocessing of the original images/data
2. Form the data matrix A = [a_1, ..., a_n] (m x n), one m x 1 item per column
3. Dimension-reducing transformation: each m x 1 item is mapped to a q x 1 vector, giving a q x n lower-dimensional representation
4. Classification in the lower-dimensional representation
We want a dimension-reducing transformation that can be applied effectively across many application areas.
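To make the shapes concrete, here is a minimal numpy sketch of the pipeline; the sizes and the random orthonormal G are illustrative placeholders, since the real G^T comes from the methods on the following slides:

```python
import numpy as np

m, n, q = 1024, 200, 9                     # illustrative sizes only
A = np.random.rand(m, n)                   # data matrix: one m x 1 item per column
G, _ = np.linalg.qr(np.random.rand(m, q))  # placeholder orthonormal G (m x q)

Y = G.T @ A                                # reduced representation, q x n
y = G.T @ A[:, 0]                          # a single item mapped from m x 1 to q x 1
assert Y.shape == (q, n) and y.shape == (q,)
```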
6. Measure for Cluster Quality
- A = [a_1, ..., a_n] (m x n): clustered data, with N_i the index set of items in class i, |N_i| = n_i, and r classes in total; c_i is the centroid (average of the data items) of class i, and c is the global centroid.
- (1) Within-class scatter matrix: S_w = \sum_{i=1}^{r} \sum_{j \in N_i} (a_j - c_i)(a_j - c_i)^T
- (2) Between-class scatter matrix: S_b = \sum_{i=1}^{r} \sum_{j \in N_i} (c_i - c)(c_i - c)^T
- (3) Total scatter matrix: S_t = \sum_{i=1}^{n} (a_i - c)(a_i - c)^T
7. Trace of Scatter Matrices
trace(S_w) = \sum_{i=1}^{r} \sum_{j \in N_i} \|a_j - c_i\|_2^2
trace(S_b) = \sum_{i=1}^{r} \sum_{j \in N_i} \|c_i - c\|_2^2
trace(S_t) = \sum_{i=1}^{r} \sum_{j \in N_i} \|a_j - c\|_2^2
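The scatter matrices are easy to form directly from the definitions above; the following numpy sketch (my own illustration, with names of my choosing) also checks the identity S_t = S_w + S_b:

```python
import numpy as np

def scatter_matrices(A, labels):
    """A: m x n data matrix (items as columns); labels: length-n class labels."""
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)              # global centroid
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    for k in np.unique(labels):
        Ak = A[:, labels == k]
        ck = Ak.mean(axis=1, keepdims=True)        # class centroid c_i
        D = Ak - ck
        Sw += D @ D.T                              # within-class scatter
        Sb += Ak.shape[1] * (ck - c) @ (ck - c).T  # between-class scatter
    St = (A - c) @ (A - c).T                       # total scatter
    return Sw, Sb, St

A = np.random.rand(5, 30)
labels = np.repeat(np.arange(3), 10)
Sw, Sb, St = scatter_matrices(A, labels)
assert np.allclose(St, Sw + Sb)                    # S_t = S_w + S_b
```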
8. Optimal Dimension Reducing Transformation
G^T (q x m) maps y (m x 1) to G^T y (q x 1).
- High-quality clusters have small trace(S_w) and large trace(S_b).
- Want dimension reduction by G^T s.t. trace(G^T S_w G) is minimized and trace(G^T S_b G) is maximized:
- max trace((G^T S_w G)^{-1} (G^T S_b G)) -> LDA (Fisher 36, Rao 48)
- max trace(G^T S_b G) subject to G^T G = I -> Orthogonal Centroid (Park et al. 03)
- max trace(G^T (S_w + S_b) G) subject to G^T G = I -> PCA (Hotelling 33)
- max trace(G^T A A^T G) subject to G^T G = I -> LSI (Deerwester et al. 90)
A small sketch evaluating these trace criteria follows.
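A hedged sketch of how the four criteria could be evaluated for a candidate G (assuming G^T S_w G is invertible; function and variable names are mine):

```python
import numpy as np

def trace_criteria(G, Sw, Sb, A):
    """Evaluate the four trace criteria for a candidate G (m x q).
    Assumes G^T S_w G is invertible for the LDA criterion."""
    lda = np.trace(np.linalg.solve(G.T @ Sw @ G, G.T @ Sb @ G))
    oc  = np.trace(G.T @ Sb @ G)          # Orthogonal Centroid objective
    pca = np.trace(G.T @ (Sw + Sb) @ G)   # PCA objective (total scatter)
    lsi = np.trace(G.T @ (A @ A.T) @ G)   # LSI objective
    return lda, oc, pca, lsi
```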
9. Classical LDA (Fisher 36, Rao 48)
- max trace((G^T S_w G)^{-1} (G^T S_b G))
- G = leading (r-1) eigenvectors of S_w^{-1} S_b, i.e., solutions of S_b x = \lambda S_w x
- Fails when m > n (undersampled), since S_w is singular
- S_w = H_w H_w^T, where H_w = [a_1 - c_1, a_2 - c_1, ..., a_n - c_r] (m x n)
- S_b = H_b H_b^T, where H_b = [\sqrt{n_1}(c_1 - c), ..., \sqrt{n_r}(c_r - c)] (m x r)
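When S_w is nonsingular, the classical solution is a few lines with scipy (a minimal sketch; eigh solves the symmetric-definite generalized eigenproblem S_b x = \lambda S_w x and returns eigenvalues in ascending order):

```python
import numpy as np
from scipy.linalg import eigh

def classical_lda(Sw, Sb, r):
    """Classical LDA: leading (r-1) generalized eigenvectors of
    S_b x = lambda S_w x. Requires S_w positive definite."""
    w, V = eigh(Sb, Sw)
    return V[:, ::-1][:, :r - 1]   # G: m x (r-1), largest eigenvalues first
```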
10. LDA based on GSVD (LDA/GSVD) (Howland, Jeon & Park 03, SIMAX)
- Works regardless of the singularity of the scatter matrices
- S_w^{-1} S_b x = \lambda x becomes \beta^2 H_b H_b^T x = \alpha^2 H_w H_w^T x, with \lambda = \alpha^2 / \beta^2
- Columns of G are the leading (r-1) generalized singular vectors of the pair (H_b^T, H_w^T):
U^T H_b^T X = [\Sigma_b  0],   V^T H_w^T X = [\Sigma_w  0]
11. Generalized Singular Value Decomposition (Van Loan 76, Paige & Saunders 81)
X^T S_b X = diag(\Sigma_b^2, 0),   X^T S_w X = diag(\Sigma_w^2, 0),
so \beta^2 H_b H_b^T x = \alpha^2 H_w H_w^T x holds for each column x of X = [X_1, X_2, X_3, X_4], where the blocks of X are partitioned according to the generalized singular value pairs (\alpha_i, \beta_i).
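One way to obtain the LDA/GSVD solution without forming the scatter matrices is the two-SVD construction along the lines of Howland, Jeon & Park: stack H_b^T over H_w^T, take an SVD, then an SVD of the top block. The numpy sketch below is my paraphrase of that idea, not the authors' code, and it skips the rank-deficient corner cases:

```python
import numpy as np

def lda_gsvd(A, labels):
    """Sketch of LDA/GSVD: G = leading (r-1) generalized singular vectors
    of the pair (H_b^T, H_w^T), computed via two SVDs."""
    m, n = A.shape
    classes = np.unique(labels)
    r = len(classes)
    c = A.mean(axis=1, keepdims=True)
    Hb_cols, Hw_blocks = [], []
    for k in classes:
        Ak = A[:, labels == k]
        ck = Ak.mean(axis=1, keepdims=True)
        Hb_cols.append(np.sqrt(Ak.shape[1]) * (ck - c))
        Hw_blocks.append(Ak - ck)
    Hb = np.hstack(Hb_cols)                   # m x r
    Hw = np.hstack(Hw_blocks)                 # m x n
    Z = np.vstack([Hb.T, Hw.T])               # (r + n) x m, stacked pair
    P, s, Qt = np.linalg.svd(Z)               # full SVD of the stacked matrix
    t = int((s > s[0] * max(Z.shape) * np.finfo(float).eps).sum())  # rank
    _, _, Wt = np.linalg.svd(P[:r, :t])       # SVD of the top-left block of P
    X1 = Qt.T[:, :t] @ (Wt.T / s[:t, None])   # leading columns of X
    return X1[:, :r - 1]                      # G: m x (r-1)
```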
12. Generalization of LDA for Undersampled Problems
- Regularized LDA (Friedman 89, Zhao et al. 99); see the sketch after this list
- LDA/GSVD: solution G = [X_1 X_2] (Howland, Jeon & Park 03)
- Solutions based on Null(S_w) and Range(S_b) (Chen et al. 00, Yu & Yang 01, Park & Park 03)
- Two-stage methods:
  - Face recognition: PCA + LDA (Swets & Weng 96, Zhao et al. 99)
  - Information retrieval: LSI + LDA (Torkkola 01)
- Mathematical equivalence (Howland & Park 03): LSI + LDA/GSVD = LDA/GSVD; PCA + LDA/GSVD = LDA/GSVD; more efficient: QRD + LDA/GSVD
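Of these generalizations, regularized LDA is the simplest to sketch: perturb S_w to S_w + \lambda I so the classical eigenproblem is well posed (a minimal sketch; \lambda corresponds to the \lambda = 1 setting in the Yale table later):

```python
import numpy as np
from scipy.linalg import eigh

def regularized_lda(Sw, Sb, r, lam=1.0):
    """Regularized LDA: replace S_w by S_w + lam * I so that the
    generalized eigenproblem S_b x = mu (S_w + lam I) x is well posed."""
    m = Sw.shape[0]
    w, V = eigh(Sb, Sw + lam * np.eye(m))
    return V[:, ::-1][:, :r - 1]
```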
13. Orthogonal Centroid (OC) Algorithm (Park, Jeon & Rosen 03, BIT)
- Algorithm:
  1. Form the centroid matrix C = [c_1, ..., c_r] (m x r)
  2. Compute the QRD of C: C = QR, with Q (m x r)
- Dimension reduction by Q^T to the r-dimensional space: y (m x 1) -> Q^T y (r x 1)
- Q solves max trace(G^T S_b G) subject to G^T G = I, and trace(Q^T S_b Q) = trace(S_b)
- Needs only the QRD of C (m x r), vs. the EVD of S_b (m x m) (or the SVD of H_b, m x r)
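A direct numpy implementation of the OC algorithm as stated (function and variable names are mine):

```python
import numpy as np

def orthogonal_centroid(A, labels):
    """Orthogonal Centroid: QR-decompose the centroid matrix and reduce
    each m x 1 item to r x 1 via Q^T."""
    classes = np.unique(labels)
    C = np.column_stack([A[:, labels == k].mean(axis=1) for k in classes])
    Q, _ = np.linalg.qr(C)    # thin QR: Q is m x r with orthonormal columns
    return Q

A = np.random.rand(100, 60)                  # 100-dim data, 60 items
labels = np.repeat(np.arange(3), 20)         # r = 3 classes
Q = orthogonal_centroid(A, labels)
Y = Q.T @ A                                  # reduced data, 3 x 60
```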
14. Text Classification on Medline Data (Kim, Howland & Park 03, JMLR)
[Table: classification accuracy (%), 5 classes; similarity measures: L2 norm and cosine]
15. Text Classification on Reuters Data (Kim, Howland & Park 03, JMLR)
[Table: classification accuracy (%), 90 classes; similarity measures: L2 norm and cosine]
16. Face Recognition on ATT Data
- Orthogonal Centroid: 88 / 96
- LDA/GSVD: 90 / 98
[Figure: query image and the top, second, and third choices retrieved by Orthogonal Centroid and LDA/GSVD]
Classification accuracy (%) using centroid-based classification and kNN (k = 1, 3, 5, 7) with the L2 norm; average of 100 runs with random splits of training and test data.
17. Face Recognition on Yale Data (C. Park and H. Park)

Dim. Red. Method                        Dim    k=1         k=5    k=9
Full Space                              8586   79.4        76.4   72.1
LDA/GSVD                                14     98.8 (90)   98.8   98.8
Regularized LDA (lambda = 1)            14     97.6 (85)   97.6   97.6
Proj. to Null(S_w) (Chen et al. 00)     14     97.6 (84)   97.6   97.6
Transf. to Range(S_b) (Yu & Yang 01)    14     89.7 (82)   94.6   91.5

Prediction accuracy in %, leave-one-out kNN (average over 100 random splits). Yale Face Database: 243 x 320 pixels, full dimension 77760; 11 images/person x 15 people = 165 images. After preprocessing (3 x 3 averaging): 8586 x 165.
18. Nonlinear Dimension Reduction by Kernel Functions
F maps x = (x_1, x_2) to F(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2), so that
k(x, y) = <F(x), F(y)> = <x, y>^2   (a polynomial kernel function)
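A short check (my own illustration) that this explicit feature map reproduces the degree-2 polynomial kernel:

```python
import numpy as np

def F(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(F(x) @ F(y), (x @ y) ** 2)   # <F(x), F(y)> = <x, y>^2
```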
19. Nonlinear Dimension Reduction by Kernel Functions
- If k(x, y) satisfies Mercer's condition, then there is a mapping F to an inner product space such that k(x, y) = <F(x), F(y)>.
- Mercer's condition for A = [a_1, ..., a_n]: the kernel matrix K = [k(a_i, a_j)], 1 <= i, j <= n, is positive semi-definite.
- Ex) RBF kernel function: k(a_i, a_j) = exp(-s \|a_i - a_j\|^2)
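A numpy sketch (names mine) forming the RBF kernel matrix and checking that it is positive semi-definite, as Mercer's condition requires:

```python
import numpy as np

def rbf_kernel_matrix(A, s=1.0):
    """Kernel matrix K with K[i, j] = exp(-s * ||a_i - a_j||^2),
    items as columns of A."""
    sq = np.sum(A * A, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A.T @ A)   # pairwise squared dists
    return np.exp(-s * np.maximum(d2, 0.0))

A = np.random.rand(5, 20)
K = rbf_kernel_matrix(A)
assert np.all(np.linalg.eigvalsh(K) > -1e-10)          # positive semi-definite
```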
20. Kernel Orthogonal Centroid (KOC) (C. Park and H. Park, PR)
- Apply OC in the feature-mapped space F(A)
- Need the QRD of the centroid matrix C in F(A):
  C = [ (1/n_1) \sum_{i \in N_1} F(a_i), ..., (1/n_r) \sum_{i \in N_r} F(a_i) ] = QR,
  but C is unknown since F is implicit
- C^T C = M^T K M = R^T R, where K = [k(a_i, a_j)], 1 <= i, j <= n
- z = Q^T y = R^{-T} C^T y, i.e.,
  R^T z = C^T y = [ (1/n_1) \sum_{i \in N_1} k(a_i, y); ...; (1/n_r) \sum_{i \in N_r} k(a_i, y) ]
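A numpy sketch of KOC following these formulas; here M is an n x r averaging matrix of my construction, with M[i, j] = 1/n_j when a_i belongs to class j, and the Cholesky factor of M^T K M serves as R (assumes M^T K M is positive definite):

```python
import numpy as np

def koc_fit(K, labels):
    """Kernel Orthogonal Centroid: factor C^T C = M^T K M = R^T R
    without ever forming the implicit centroid matrix C."""
    classes = np.unique(labels)
    n = len(labels)
    M = np.zeros((n, len(classes)))
    for j, k in enumerate(classes):
        idx = labels == k
        M[idx, j] = 1.0 / idx.sum()        # column j averages class j's items
    R = np.linalg.cholesky(M.T @ K @ M).T  # upper-triangular, R^T R = C^T C
    return M, R

def koc_transform(kvec, M, R):
    """Map a new item y to z = R^{-T} C^T y, where C^T y = M^T kvec and
    kvec[i] = k(a_i, y)."""
    return np.linalg.solve(R.T, M.T @ kvec)
```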
21. Experimental Results
Musk data (from UCI): dim 167, 2 classes, 6599 data items.

kNN     OC     KOC    Kernel PCA
k=1     87.2   95.7   87.8
k=15    88.5   96.0   89.2
k=29    88.5   96.1   88.5

Kernel PCA as in (Scholkopf et al., 1999).
22. Fingerprint Classification
Five classes: Left Loop, Right Loop, Whorl, Arch, Tented Arch.
Construction of directional images by DFT (see the sketch below):
1. Compute the directionality in each local neighborhood by FFT
2. Compute the dominant direction
3. Find the core point for unified centering of fingerprints within the same class
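A rough numpy sketch of steps 1-2 for a single image block (my own illustration of the idea, not the authors' pipeline): the strongest non-DC Fourier peak points across the ridges, so the ridge orientation is that angle rotated by 90 degrees.

```python
import numpy as np

def dominant_direction(block):
    """Estimate the dominant ridge orientation of an image block via its
    2-D FFT: locate the strongest non-DC frequency component and rotate
    its angle by 90 degrees (ridges run perpendicular to that frequency)."""
    f = np.fft.fftshift(np.abs(np.fft.fft2(block)))
    h, w = f.shape
    f[h // 2, w // 2] = 0.0                       # suppress the DC term
    ky, kx = np.unravel_index(np.argmax(f), f.shape)
    angle = np.arctan2(ky - h // 2, kx - w // 2)  # direction of the peak
    return (angle + np.pi / 2) % np.pi            # ridge orientation in [0, pi)
```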
23. Fingerprint Classification Results on NIST Fingerprint Database 4 (C. Park and H. Park 03)
KDA/GSVD: nonlinear extension of LDA/GSVD based on kernel functions.

Rejection rate (%)           0      1.8    8.5
KDA/GSVD                     90.7   91.3   92.8
kNN & NN (Jain et al. 99)    -      90.0   91.2
NN & SVM (Yao et al. 03)     -      90.0   92.2

4000 fingerprint images of size 512 x 512. By KDA/GSVD, the dimension is reduced from 105 x 105 to 4.
24. Support Vector Machine (SVM)
For binary classification (hard margin) (Vapnik, Scholkopf, Burges).
SVM constructs the optimal separating hyperplane that maximizes the margin between the two classes. For problems that are not linearly separable, the input data (in input space R^d) are mapped to a feature space H using a kernel function K(x_i, x_j).
[Figure: input space R^d mapped to feature space H; support vectors marked with an extra circle]
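As a usage sketch (assuming scikit-learn, which the slides do not mention), a kernel SVM with an RBF kernel whose support vectors can be inspected after fitting:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # 200 items in input space R^2
y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)       # not linearly separable

clf = SVC(kernel="rbf", C=1e6)                      # large C approximates hard margin
clf.fit(X, y)
print(clf.support_vectors_.shape)                   # the identified support vectors
```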