Title: Feature Extraction
1. Feature Extraction
2. Content
- Principal Component Analysis (PCA)
- PCA Calculation for Fewer-Sample Case
- Factor Analysis
- Fisher's Linear Discriminant Analysis
- Multiple Discriminant Analysis
3. Feature Extraction
Principal Component Analysis (PCA)
4. Principal Component Analysis
- It is a linear procedure to find the direction in input space where most of the energy of the input lies.
- Feature Extraction
- Dimension Reduction
- It is also called the (discrete) Karhunen-Loève transform, or the Hotelling transform.
5. The Basic Concept
Assume the data x (a random vector) has zero mean.
PCA finds a unit vector w that captures the largest amount of variance of the data.
That is, it finds w maximizing Var(w^T x) = E[(w^T x)^2] subject to ‖w‖ = 1.
6. The Method
Covariance matrix: C = E[x x^T] (x has zero mean).
Remark: C is symmetric and positive semidefinite.
7. The Method
Maximize w^T C w subject to w^T w = 1.
The method of Lagrange multipliers: define L(w, λ) = w^T C w − λ(w^T w − 1).
The extreme point, say, w, satisfies ∂L/∂w = 0.
8. The Method
Maximize w^T C w subject to w^T w = 1.
Setting ∂L/∂w = 2Cw − 2λw = 0 gives Cw = λw.
9. Discussion
At extreme points, Cw = λw; that is, w is an eigenvector of C, and λ is its corresponding eigenvalue, so the captured variance is w^T C w = λ.
- Let w1, w2, ..., wd be the eigenvectors of C whose corresponding eigenvalues are λ1 ≥ λ2 ≥ ... ≥ λd.
- They are called the principal components of C.
- Their significance can be ordered according to their eigenvalues.
10. Discussion
At extreme points, w is an eigenvector of C, and λ is its corresponding eigenvalue.
- Let w1, w2, ..., wd be the eigenvectors of C whose corresponding eigenvalues are λ1 ≥ λ2 ≥ ... ≥ λd.
- They are called the principal components of C.
- Their significance can be ordered according to their eigenvalues.
- Since C is symmetric and positive semidefinite, its eigenvectors can be chosen to be orthogonal.
- They, hence, form a basis of the feature space.
- For dimensionality reduction, choose only a few of them.
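A minimal NumPy sketch of the procedure above (eigendecomposition of the covariance matrix, eigenvectors sorted by eigenvalue, projection onto the top few); the function name pca and the synthetic data are illustrative, not part of the slides.

```python
import numpy as np

def pca(X, k):
    """PCA on a rows-as-samples data matrix X (N x d); returns the top-k
    principal components and the projected data."""
    Xc = X - X.mean(axis=0)            # center the data (zero mean)
    C = Xc.T @ Xc / Xc.shape[0]        # covariance matrix C (d x d)
    vals, vecs = np.linalg.eigh(C)     # eigh: C is symmetric, positive semidefinite
    order = np.argsort(vals)[::-1]     # sort eigenvalues in decreasing order
    W = vecs[:, order[:k]]             # top-k eigenvectors = principal components
    return W, Xc @ W                   # projection achieves dimensionality reduction

# Usage: project 5-D data onto its 2 most significant axes.
X = np.random.randn(200, 5) @ np.random.randn(5, 5)
W, Z = pca(X, 2)
```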
11. Applications
- Image Processing
- Signal Processing
- Compression
- Feature Extraction
- Pattern Recognition
12. Example
Projecting the data onto the most significant
axis will facilitate classification.
This also achieves dimensionality reduction.
13. Issues
[Figure: the most significant component obtained using PCA vs. the most significant component for classification.]
- PCA is effective for identifying the multivariate signal distribution.
- Hence, it is good for signal reconstruction.
- But it may be inappropriate for pattern classification.
14. Whitening
- Whitening is a process that transforms a random vector, say, x = (x1, x2, ..., xn)^T (assumed zero mean), into, say, z = (z1, z2, ..., zn)^T with zero mean and identity covariance.
- z is said to be white or sphered.
- This implies that all of its elements are uncorrelated.
- However, this doesn't imply that its elements are independent.
15. Whitening Transform
Let V be a whitening transform; then E[(Vx)(Vx)^T] = V Cx V^T = I.
Decompose Cx as Cx = E D E^T. Clearly, D is a diagonal matrix and E is an orthonormal matrix.
Set V = D^{-1/2} E^T. Then V Cx V^T = D^{-1/2} E^T (E D E^T) E D^{-1/2} = I.
16. Whitening Transform
If V is a whitening transform and U is any orthonormal matrix, show that UV, i.e., a rotation of V, is also a whitening transform.
Proof) (UV) Cx (UV)^T = U (V Cx V^T) U^T = U I U^T = I.
17. Why Whitening?
- With PCA, we usually choose several major eigenvectors as the basis for representation.
- This basis is efficient for reconstruction, but may be inappropriate for other applications, e.g., classification.
- By whitening, we can rotate the basis to get more interesting features.
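A short NumPy sketch of the whitening transform V = D^{-1/2} E^T derived above, with a check that the whitened data has (approximately) identity covariance; the function name and data are illustrative.

```python
import numpy as np

def whitening_transform(X):
    """Whitening sketch: decompose the covariance as Cx = E D E^T and
    return V = D^(-1/2) E^T so that z = V x has identity covariance."""
    Xc = X - X.mean(axis=0)
    Cx = np.cov(Xc, rowvar=False)
    D, E = np.linalg.eigh(Cx)              # D: eigenvalues, E: orthonormal eigenvectors
    V = np.diag(1.0 / np.sqrt(D)) @ E.T    # whitening transform
    return V, Xc @ V.T                     # z = V x for every sample x

V, Z = whitening_transform(np.random.randn(500, 3) * [1.0, 5.0, 0.2])
print(np.allclose(np.cov(Z, rowvar=False), np.eye(3), atol=0.2))  # approximately identity
# Any rotation U V (U orthonormal) is also a whitening transform.
```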
18. Feature Extraction
PCA Calculation for Fewer-Sample Case
19. Complexity for PCA Calculation
- Let C be of size n × n.
- Time complexity by direct computation: O(n^3).
- Is there any efficient method in case the number of samples N is much smaller than the dimension n?
20. PCA for Covariance Matrix from Fewer Samples
- Consider N samples xi ∈ R^n, i = 1, ..., N, with N << n and zero mean.
- Define the n × N matrix A = [x1 x2 ... xN], so that C = (1/N) A A^T.
21. PCA for Covariance Matrix from Fewer Samples
- Define the N × N matrix T = A^T A.
- Let v1, ..., vN be the orthonormal eigenvectors of T with corresponding eigenvalues μi, i.e., T vi = μi vi.
22. PCA for Covariance Matrix from Fewer Samples
Since A^T A vi = μi vi, we have A A^T (A vi) = μi (A vi); hence A vi is an eigenvector of A A^T.
23. PCA for Covariance Matrix from Fewer Samples
Define pi = A vi / ‖A vi‖ = A vi / √μi, since ‖A vi‖^2 = vi^T A^T A vi = μi.
24. PCA for Covariance Matrix from Fewer Samples
The pi so defined are orthonormal eigenvectors of C = (1/N) A A^T with eigenvalues μi / N.
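A NumPy sketch of the fewer-sample trick above, assuming the normalization C = (1/N) A A^T used in this reconstruction; the function and variable names are illustrative.

```python
import numpy as np

def pca_few_samples(A):
    """PCA when the number of samples N is much smaller than the dimension n.
    A is the n x N matrix of zero-mean samples. Instead of the n x n covariance
    C = A A^T / N, eigendecompose the small N x N matrix T = A^T A."""
    n, N = A.shape
    T = A.T @ A                          # N x N: much cheaper than n x n
    mu, V = np.linalg.eigh(T)            # T vi = mu_i vi
    keep = mu > 1e-12                    # discard numerically zero eigenvalues
    mu, V = mu[keep], V[:, keep]
    P = A @ V / np.sqrt(mu)              # p_i = A v_i / ||A v_i||: eigenvectors of A A^T
    return P[:, ::-1], (mu / N)[::-1]    # eigenvectors and eigenvalues of C, largest first

# Usage: 10 samples in a 10000-dimensional space.
A = np.random.randn(10000, 10)
A -= A.mean(axis=1, keepdims=True)       # make the samples zero mean
P, lam = pca_few_samples(A)
```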
25. Feature Extraction
Factor Analysis
26. What is a Factor?
- If several variables correlate highly, they might measure aspects of a common underlying dimension.
- These dimensions are called factors.
- Factors are classification axes along which the measures can be plotted.
- The greater the loading of variables on a factor, the more that factor can explain intercorrelations between those variables.
27. Graph Representation
28. What is Factor Analysis?
- A method for investigating whether a number of variables of interest Y1, Y2, ..., Yn are linearly related to a smaller number of unobservable factors F1, F2, ..., Fm.
- For data reduction and summarization.
- A statistical approach to analyzing interrelationships among a large number of variables in order to explain these variables in terms of their common underlying dimensions (factors).
29. Example
What factors influence students' grades?
[Figure: observable data (grades) explained by unobservable factors such as quantitative skill and verbal skill.]
30. The Model
y = Bf + ε
y: observation vector
B: factor-loading matrix
f: factor vector
ε: Gaussian-noise vector
32. The Model
With Cov(f) = I and Cov(ε) = Ψ (diagonal, independent of f), the model gives Cy = B B^T + Ψ.
Cy can be obtained from the model, and it can also be estimated from the data (the sample covariance).
33. The Model
Var(yj) = Σk bjk^2 + ψj
Communality (Σk bjk^2): the explained part of Var(yj), accounted for by the common factors.
Specific variance (ψj): the unexplained part.
34. Example
35. Goal
Our goal is to minimize the discrepancy between the sample covariance of y and the model covariance B B^T + Ψ.
Hence, B and Ψ are chosen so that B B^T + Ψ reproduces Cy as closely as possible.
36. Uniqueness
Is the solution unique?
There are an infinite number of solutions, since if B is a solution and T is an orthonormal transformation (rotation), then BT is also a solution: (BT)(BT)^T = B T T^T B^T = B B^T.
37. Example
[Two different factor-loading matrices reproducing the same Cy.] Which one is better?
38. Example
Left: each factor has nonzero loadings for all variables.
Right: each factor controls different variables.
39. The Method
- Determine the first set of loadings using the principal component method, as sketched below.
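A NumPy sketch of the principal component method for the first set of loadings, using the common convention B = Em diag(√λ) and Ψ = diag(Cy − B B^T); the function name is illustrative and the exact conventions on the slides may differ.

```python
import numpy as np

def factor_loadings_pc_method(Cy, m):
    """First set of factor loadings via the principal component method:
    approximate Cy ~= B B^T + Psi with B built from the m largest eigenpairs."""
    lam, E = np.linalg.eigh(Cy)
    idx = np.argsort(lam)[::-1][:m]        # m largest eigenvalues of Cy
    B = E[:, idx] * np.sqrt(lam[idx])      # b_jk = sqrt(lambda_k) * e_jk
    Psi = np.diag(np.diag(Cy - B @ B.T))   # specific variances (unexplained part)
    return B, Psi

# Usage on the sample covariance matrix of observed variables y.
Y = np.random.randn(300, 6)
B, Psi = factor_loadings_pc_method(np.cov(Y, rowvar=False), m=2)
```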
40. Example
41. Factor Rotation
B̃ = BT, where B is the factor-loading matrix and T is an orthonormal rotation matrix.
42. Factor Rotation
Criteria
- Varimax
- Quartimax
- Equimax
- Orthomax
- Oblimin
43. Varimax
Criterion: maximize the sum, over factors, of the variance of the squared rotated loadings,
V = Σk [ (1/p) Σj b̃jk^4 − ((1/p) Σj b̃jk^2)^2 ],
subject to T^T T = I, where B̃ = BT and b̃jk denotes the rotated loadings.
44. Varimax
Criterion: maximize V subject to T^T T = I.
Construct the Lagrangian by adding multipliers for the orthonormality constraints on T.
45. Varimax
(Equation slide: the derivation is carried out in terms of the quantities dk, cjk, and bjk.)
46. Varimax
(Equation slide: defines a quantity as the kth column of a matrix used in the update.)
47. Varimax
(Equation slide: continues the derivation with the kth column defined above.)
48. Varimax
Goal: the criterion reaches its maximum once the resulting fixed-point condition on T is satisfied.
49. Varimax
Goal: solve for the rotation T iteratively.
- Initially, obtain B0 by whatever method, e.g., PCA.
- Set T0 as the initial approximation to the rotation matrix, e.g., T0 = I.
50. Varimax
Goal (continued): pre-multiplying each side by its transpose.
51. Varimax
Criterion: maximize the varimax objective (equation slide).
52. Varimax
(Equation slide: restates the maximization and introduces the quantities used in the iterative solution.)
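A sketch of one standard iterative varimax scheme (an SVD-based orthogonal update); this is a common implementation of the varimax criterion, not necessarily the exact derivation on these slides.

```python
import numpy as np

def varimax(B, gamma=1.0, max_iter=100, tol=1e-8):
    """Varimax rotation sketch: find an orthonormal T maximizing the variance
    of the squared loadings in each column of B T."""
    p, k = B.shape
    T = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        L = B @ T                                          # rotated loadings
        # Target matrix for the orthogonal Procrustes step
        G = B.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                                         # closest orthonormal matrix to G
        d = np.sum(s)
        if d_old != 0 and d / d_old < 1 + tol:             # stop when the objective stalls
            break
        d_old = d
    return B @ T, T                                        # rotated loadings and rotation matrix

B_rot, T = varimax(np.random.rand(6, 2))
```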
53. Feature Extraction
Fisher's Linear Discriminant Analysis
54. Main Concept
- PCA seeks directions that are efficient for representation.
- Discriminant analysis seeks directions that are efficient for discrimination.
55. Classification Efficiencies on Projections
56. Criterion: Two-Category
[Figure: samples of classes ω1 and ω2 and their projections onto a direction w.]
57. Scatter
Between-class scatter matrix: SB = (m1 − m2)(m1 − m2)^T, where mi is the mean of class ωi.
Between-class scatter of the projection: w^T SB w = (m̃1 − m̃2)^2, where m̃i = w^T mi. The larger, the better.
58. Scatter
Between-class scatter matrix: SB = (m1 − m2)(m1 − m2)^T.
Within-class scatter matrix: SW = Σi Σ_{x∈ωi} (x − mi)(x − mi)^T.
Within-class scatter of the projection: w^T SW w. The smaller, the better.
59. Goal
Using the between-class scatter matrix SB and the within-class scatter matrix SW, define the generalized Rayleigh quotient
J(w) = (w^T SB w) / (w^T SW w).
The goal is to maximize J(w); the length of w is immaterial.
60. Generalized Eigenvector
To maximize J(w), w is the generalized eigenvector associated with the largest generalized eigenvalue.
That is, SB w = λ SW w, or, when SW is nonsingular, SW^{-1} SB w = λ w.
The length of w is immaterial.
61. Proof
Set ∇w J(w) = 0: (w^T SW w) SB w − (w^T SB w) SW w = 0.
That is, SB w = J(w) SW w, or SW^{-1} SB w = J(w) w.
Hence, to maximize J(w), w is the generalized eigenvector associated with the largest generalized eigenvalue. ∎
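A NumPy sketch of the two-class result: w ∝ SW^{-1}(m1 − m2) is the generalized eigenvector with the largest generalized eigenvalue; the function name and synthetic data are illustrative.

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher direction sketch: w maximizing the generalized
    Rayleigh quotient J(w) = (w^T SB w) / (w^T SW w), i.e. w ~ SW^{-1}(m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(Sw, m1 - m2)                          # direction; length is immaterial
    return w / np.linalg.norm(w)

# Usage: two Gaussian classes in 2-D, projected onto the Fisher direction.
X1 = np.random.randn(100, 2) + [0, 0]
X2 = np.random.randn(100, 2) + [3, 1]
w = fisher_lda_direction(X1, X2)
z1, z2 = X1 @ w, X2 @ w
```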
62. Example
63. Feature Extraction
Multiple Discriminant Analysis
64. Generalization of Fisher's Linear Discriminant
For the c-class problem, we seek a (c − 1)-dimensional projection for efficient discrimination.
65. Scatter Matrices: Feature Space
Total scatter matrix: ST = Σx (x − m)(x − m)^T, with m the overall mean.
Within-class scatter matrix: SW = Σ_{i=1}^{c} Σ_{x∈ωi} (x − mi)(x − mi)^T.
Between-class scatter matrix: SB = Σ_{i=1}^{c} ni (mi − m)(mi − m)^T, so that ST = SW + SB.
66. The (c − 1)-Dim Projection
The projection space will be described using a d × (c − 1) matrix W, with y = W^T x.
67. Scatter Matrices: Projection Space
After projecting with y = W^T x, the scatter matrices become
S̃T = W^T ST W, S̃W = W^T SW W, S̃B = W^T SB W.
68. Criterion
Maximize J(W) = |W^T SB W| / |W^T SW W|; the columns of the optimal W are the generalized eigenvectors of SB w = λ SW w associated with the c − 1 largest generalized eigenvalues.
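A sketch of the multiple discriminant projection using SciPy's generalized symmetric eigensolver, taking the c − 1 leading generalized eigenvectors of SB w = λ SW w; it assumes SciPy is available, and the function name and synthetic classes are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(Xs):
    """Multiple discriminant analysis sketch: for c classes (list Xs of sample
    matrices) return the d x (c-1) projection W whose columns are the leading
    generalized eigenvectors of SB w = lambda SW w."""
    m = np.vstack(Xs).mean(axis=0)                      # overall mean
    d = Xs[0].shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for X in Xs:
        mi = X.mean(axis=0)
        Sw += (X - mi).T @ (X - mi)                     # within-class scatter
        Sb += len(X) * np.outer(mi - m, mi - m)         # between-class scatter
    vals, vecs = eigh(Sb, Sw)                           # generalized eigenproblem
    W = vecs[:, np.argsort(vals)[::-1][:len(Xs) - 1]]   # top c-1 eigenvectors
    return W

# Usage: three classes in 4-D, projected into a 2-D discriminant space.
W = mda_projection([np.random.randn(50, 4) + mu
                    for mu in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0])])
```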