Title: PCA, LDA, HLDA and HDA
1. PCA, LDA, HLDA and HDA
- References
- E. Alpaydin, Introduction to Machine Learning, The MIT Press, 2004.
- S. R. Searle, Matrix Algebra Useful for Statistics, Wiley Series in Probability and Mathematical Statistics, New York, 1982.
- Berlin Chen's slides.
- N. Kumar and A. G. Andreou, "Heteroscedastic Discriminant Analysis and Reduced Rank HMMs for Improved Speech Recognition," Speech Communication, 26:283-297, 1998.
- G. Saon, M. Padmanabhan, R. Gopinath and S. Chen, "Maximum Likelihood Discriminant Feature Spaces," ICASSP, 2000.
- X. Liu, Linear Projection Schemes for Automatic Speech Recognition, M.Phil. thesis, University of Cambridge, 2001.
- ???, ????, 2005.
2. PCA: Introduction
- PCA (Principal Component Analysis) is a one-group, unsupervised projection method for reducing data dimensionality (feature extraction).
- We use PCA to find a mapping from the inputs in the original d-dimensional space to a new k-dimensional space (k < d), with
- minimum loss of information,
- maximum amount of information, measured in terms of variability.
- That is, we'll find the new variables or linear transformations (major axes, principal components) that reach this goal.
- The projection of x on the direction of w is z = w^T x.
3. PCA: Methodology (General)
- For maximizing the amount of information, PCA centers the sample and then rotates the axes to line up with the directions of highest variance.
- That is, find w such that Var(z) is maximized; this is the criterion.
- Var(z) = Var(w^T x) = E[(w^T x - w^T µ)^2] = E[(w^T x - w^T µ)(w^T x - w^T µ)]
  = E[w^T (x - µ)(x - µ)^T w]
  = w^T E[(x - µ)(x - µ)^T] w = w^T Σ w,
  where Var(x) = E[(x - µ)(x - µ)^T] = Σ.
4. PCA: Methodology (General) (cont.)
- Maximize Var(z1) = w1^T Σ w1 subject to ||w1|| = 1.
- For a unique solution, and to make the direction the important factor, we require ||w1|| = 1, i.e., w1^T w1 = 1.
- Setting the derivative of the Lagrangian w1^T Σ w1 - α(w1^T w1 - 1) to zero gives Σ w1 = α w1: w1 is an eigenvector of Σ, and α is the eigenvalue associated with w1.
- Var(z1) = Var(w1^T x) = w1^T Σ w1 = w1^T α w1 = α w1^T w1 = α.
- max Var(z1) = max α, so choose the eigenvector with the largest eigenvalue for Var(z1) to be maximal.
- α is the Lagrange multiplier.
5. PCA: Methodology (General) (cont.)
- Second principal component: max Var(z2), s.t. ||w2|| = 1 and w2 orthogonal to w1.
- The solution is the eigenvector of Σ with the second largest eigenvalue, so max Var(z2) = λ2.
- Conclusions
- wi is the eigenvector of Σ associated with the i-th largest eigenvalue, and Var(zi) = λi. (The above can be proved by mathematical induction.)
- The wi are uncorrelated (orthogonal).
- w1 explains as much as possible of the original variance in the data set.
- w2 explains as much as possible of the remaining variance, and so on.
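To make the recipe on the last few slides concrete, here is a minimal NumPy sketch (the function name pca_project and the choice of k are illustrative, not from the slides): center the data, eigendecompose the sample covariance Σ, and project onto the eigenvectors with the largest eigenvalues.

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X (N x d) onto the top-k principal components."""
    mu = X.mean(axis=0)                       # sample mean
    Xc = X - mu                               # center the data
    Sigma = np.cov(Xc, rowvar=False)          # sample covariance (d x d)
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]                 # d x k matrix of top-k eigenvectors
    Z = Xc @ W                                # z = W^T (x - mu) for each sample
    return Z, W, eigvals[order]

# Var(z_i) equals the i-th largest eigenvalue lambda_i, up to sampling noise.
```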
6. PCA: Some Discussions
- About dimensions
- If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues, so k will be much smaller than d and a large reduction in dimensionality may be attained.
- If the dimensions are not correlated, k will be as large as d and there is no gain through PCA.
7. PCA: Some Discussions (cont.)
- About Σ
- For two different eigenvalues, the eigenvectors are orthogonal.
- If Σ is positive definite (x^T Σ x > 0 for all x ≠ 0), all eigenvalues are positive.
- If Σ is singular, then its rank, the effective dimension, is k < d, and λi = 0 for i > k.
- About scaling
- Different variables may have completely different scalings.
- The eigenvalues of the matrix are scale dependent.
- If the scale of the data is unknown, it is better to use the correlation matrix instead of the covariance matrix.
- The interpretations of the principal components derived by these two methods can be completely different.
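As a small illustration of the scaling remark above, the sketch below (data and names are illustrative) compares principal directions computed from the covariance matrix with those from the correlation matrix; rescaling one variable changes the former but not the latter.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))
X = Z @ np.array([[1.0, 0.6, 0.3],
                  [0.0, 1.0, 0.5],
                  [0.0, 0.0, 1.0]])     # mildly correlated variables
X[:, 2] *= 100.0                        # give one variable a much larger scale

# np.linalg.eigh returns eigenvalues in ascending order, so the leading
# principal direction is the LAST column of each eigenvector matrix.
cov_dirs = np.linalg.eigh(np.cov(X, rowvar=False))[1]
corr_dirs = np.linalg.eigh(np.corrcoef(X, rowvar=False))[1]

print(cov_dirs[:, -1])   # nearly aligned with the rescaled third variable
print(corr_dirs[:, -1])  # reflects the correlation structure, unaffected by the rescaling
```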
8. PCA: Methodology (Spectral Decomposition)
- z = W^T x, or z = W^T (x - m) to center the data on the origin.
- We want to find a matrix W such that when z = W^T x, we get Cov(z) = D, a diagonal matrix.
- That is, we would like the zi to be uncorrelated.
- Let C = (c1, c2, ..., cd), where the ci are the normalized eigenvectors of S; then
- C^T C = I
- S = S C C^T = S (c1, c2, ..., cd) C^T = (S c1, S c2, ..., S cd) C^T = (λ1 c1, λ2 c2, ..., λd cd) C^T
  = λ1 c1 c1^T + λ2 c2 c2^T + ... + λd cd cd^T = C D C^T
- D = C^T S C
- Spectral decomposition is the factorization of a positive definite matrix S into S = C D C^T, where D is a diagonal matrix of eigenvalues and the columns of C are the eigenvectors.
- Keeping only the top k eigenvectors: D (k×k) = C^T (k×d) S (d×d) C (d×k), where d is the input space dimension and k is the feature space dimension; in the PCA notation above this is D = W^T Σ W.
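A short numerical check of the spectral decomposition S = C D C^T above (all names are illustrative): after projecting the centered data with C, the sample covariance of z is approximately the diagonal matrix D.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 0], [2, 3, 1], [0, 1, 2]], size=2000)

m = X.mean(axis=0)
S = np.cov(X, rowvar=False)              # sample covariance
eigvals, C = np.linalg.eigh(S)           # S = C D C^T with C^T C = I
D = np.diag(eigvals)

assert np.allclose(S, C @ D @ C.T)       # spectral decomposition holds
Z = (X - m) @ C                          # z = C^T (x - m)
# Cov(z) is approximately the diagonal matrix D: off-diagonals near zero.
print(np.round(np.cov(Z, rowvar=False), 2))
```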
9. Appendix A
- Another criterion for PCA is the MMSE (Minimum Mean-Squared Error) criterion, which reaches the same destination as the above two methods, though there may be interesting differences among them.
- Some important properties of symmetric matrices:
- Eigenvalues are all real.
- Symmetric matrices are diagonalizable.
- Eigenvectors are orthogonal.
- Eigenvectors corresponding to different eigenvalues are orthogonal.
- The mk linearly independent (LIN) eigenvectors corresponding to any eigenvalue λk of multiplicity mk can be chosen so that they are orthogonal.
- Rank equals the number of nonzero eigenvalues.
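These properties can be verified numerically; a brief illustrative check for a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
S = (A + A.T) / 2                                      # symmetrize

vals, vecs = np.linalg.eigh(S)                         # eigh handles symmetric matrices
assert np.all(np.isreal(vals))                         # eigenvalues are real
assert np.allclose(vecs.T @ vecs, np.eye(5))           # eigenvectors are orthonormal
assert np.allclose(S, vecs @ np.diag(vals) @ vecs.T)   # S is diagonalizable
assert np.linalg.matrix_rank(S) == np.sum(~np.isclose(vals, 0))  # rank = # nonzero eigenvalues
```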
10. LDA: Introduction
- LDA (Linear Discriminant Analysis) (Fisher, 1936) (Rao, 1935) is a supervised dimension-reduction method for classification problems.
- To obtain features suitable for speech sound classification, the use of LDA was proposed (Hunt, 1979).
- Brown showed that the LDA transform is superior to the PCA transform, using a DHMM classifier and incorporating context information (Brown, 1987).
- Later researchers have applied LDA to DHMM and CHMM speech recognition systems and have reported improved performance on small-vocabulary tasks, but mixed results on large-vocabulary phoneme-based systems.
11. LDA: Assumptions
- LDA is related to the MLE (Maximum Likelihood Estimation) of parameters for a Gaussian model, with two a priori assumptions (Campbell, 1984).
- First, all the class-discrimination information resides in a p-dimensional subspace of the n-dimensional feature space.
- Second, the within-class variances are equal for all classes.
- Another notable assumption is that the class distribution is a mixture of Gaussians (Hastie & Tibshirani, 1994). (Why not a single Gaussian?)
- That means LDA is optimal if the classes are normally distributed, but we can still use LDA for classification even when they are not.
12. LDA: Methodology
- Criterion: given a set of sample vectors with labeled (class) information, find a linear transform W such that the ratio of average between-class variation to average within-class variation is maximal.
- After projection, for all classes to be well separated, we would like the class means to be as far apart as possible and the examples of each class to be scattered in as small a region as possible.
13. LDA: Methodology (cont.)
- Let x be an n-dimensional feature vector. We seek a linear transformation R^n → R^p (p < n) of the form yp = θp^T x, where θp is an n×p matrix. Let θ be a nonsingular n×n matrix used to define the linear transformation y = θ^T x, and partition it as θ = [θp  θn-p].
- First, we apply a nonsingular linear transformation to x to obtain y = θ^T x. Second, we retain only the first p rows of y to give yp.
14. LDA: Methodology (cont.)
- Let there be a total of J classes, and let g(i) ∈ {1, ..., J} indicate the class associated with xi. Let {xi, i = 1, ..., N} be the set of training examples available, with Nj examples in class j.
- The sample mean: m = (1/N) Σ_i xi
- The class sample means: m_j = (1/N_j) Σ_{i: g(i)=j} xi
- The class sample covariances: S_j = (1/N_j) Σ_{i: g(i)=j} (xi - m_j)(xi - m_j)^T
15. LDA: Methodology (cont.)
- The average within-class variation: W = (1/N) Σ_j N_j S_j
- The average between-class variation: B = (1/N) Σ_j N_j (m_j - m)(m_j - m)^T
- The total sample covariance: T = (1/N) Σ_i (xi - m)(xi - m)^T = W + B
16. LDA: Methodology (cont.)
- To get a p-dimensional transformation, we maximize the ratio of projected between-class to within-class variation, |θp^T B θp| / |θp^T W θp|.
- To obtain θp, we choose those eigenvectors of W^{-1} B that correspond to the largest p eigenvalues, and let θp be the n×p matrix of these eigenvectors. The p-dimensional features obtained by yp = θp^T x are then uncorrelated (a code sketch follows).
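A minimal NumPy/SciPy sketch of the LDA recipe above (function and variable names are illustrative): compute the average within-class variation W and between-class variation B, then take the leading eigenvectors of W^{-1} B.

```python
import numpy as np
import scipy.linalg

def lda_transform(X, labels, p):
    """Return the n x p LDA projection matrix for samples X (N x n), labels an array of length N."""
    N, n = X.shape
    m = X.mean(axis=0)                               # global sample mean
    W = np.zeros((n, n))                             # average within-class variation
    B = np.zeros((n, n))                             # average between-class variation
    for j in np.unique(labels):
        Xj = X[labels == j]
        Nj = len(Xj)
        mj = Xj.mean(axis=0)
        W += (Xj - mj).T @ (Xj - mj) / N
        B += Nj * np.outer(mj - m, mj - m) / N
    # Generalized symmetric eigenproblem B v = lambda W v, i.e. the eigenvectors
    # of W^{-1} B; requires W positive definite (enough samples per dimension).
    eigvals, eigvecs = scipy.linalg.eigh(B, W)       # eigenvalues in ascending order
    theta_p = eigvecs[:, np.argsort(eigvals)[::-1][:p]]  # top-p eigenvectors
    return theta_p                                   # projected features: y_p = theta_p^T x
```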
17. HLDA: ML framework
- For LDA, since the final objective is classification, the implicit assumption is that the rejected subspace does not carry any classification information.
- For Gaussian models, the assumption of no classification information is equivalent to assuming that the means and the variances of the class distributions are the same for all classes in the rejected (n-p)-dimensional subspace.
- In an alternative view, let the full-rank linear transformation θ be such that the first p columns of θ span the p-dimensional subspace in which the class means, and possibly the class variances, are different.
- When rank(θ_{n×n}) = n, θ is said to have full rank, or to be of full rank: its rank equals its order, it is nonsingular, and its inverse exists.
- Can the θ obtained by LDA be full rank?
- Since the data variables x are Gaussian, their linear transformation y is also Gaussian.
18. HLDA: ML framework (cont.)
- The goal of HLDA (Heteroscedastic Discriminant Analysis) is to generalize LDA under the ML (Maximum Likelihood) framework.
- For notational convenience, we define µj as the class means and Sj as the class covariances after transformation.
19. HLDA: ML framework (cont.)
- The probability density of xi under the preceding model, where xi belongs to group g(i), is p(xi) = |θ| N(θ^T xi; µ_{g(i)}, S_{g(i)}). Note that although the Gaussian distribution is defined on the transformed variable yi, we are interested in maximizing the likelihood of the original data xi.
- The term |θ| comes from the Jacobian of the linear transformation y = θ^T x.
20. HLDA: ML framework (cont.)
21. HLDA: Full rank
- The log-likelihood of the data {xi} under the linear transformation θ and under the constrained Gaussian model assumption for each class is obtained by summing log p(xi) over all training examples.
- Doing a straightforward maximization with respect to the various parameters is computationally intensive. (Why?)
22. HLDA: Full rank (cont.)
- We simplify it considerably by first calculating the values of the mean and variance parameters that maximize the likelihood in terms of a fixed linear transformation θ.
- We'll get the ML estimates of the means and variances as functions of θ.
- Transformations vs. ML estimators?
23. HLDA: Full rank (cont.)
- By replacing the two sets of parameters by these estimates in terms of θ, the log-likelihood becomes a function of θ alone.
24. HLDA: Full rank (cont.)
- We can simplify the above log-likelihood using the following propositions.
- Proposition 1: Let F be any full-rank n×n matrix. Let t be any n×p rank-p matrix (p < n). Then Trace(t (t^T F t)^{-1} t^T F) = p.
- Proposition 2
25. HLDA: Full rank (cont.)
- Since there is no closed-form solution for maximizing the likelihood with respect to θ, the maximization has to be performed numerically.
- An initial guess of θ: the LDA solution.
- Quadratic programming algorithms in the MATLAB toolbox can be used (a sketch with a general-purpose optimizer follows).
- After optimization, we use only the first p columns of θ to obtain the dimension-reducing transformation.
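Because the slides' own formulas were lost in conversion, the sketch below assumes the Kumar-Andreou form of the full-rank HLDA log-likelihood after the means and covariances have been maximized out: N log|det θ|, minus (N/2) log|θ_{n-p}^T T θ_{n-p}| for the rejected dimensions (T the total covariance), minus Σ_j (N_j/2) log|θ_p^T S_j θ_p| for the retained ones. All names are illustrative, and SciPy's general-purpose BFGS stands in for the quadratic-programming routines mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_hlda_loglik(theta_flat, n, p, class_covs, class_counts, total_cov):
    """Negative full-rank HLDA log-likelihood, up to an additive constant
    (assumed Kumar-Andreou form; means/covariances already maximized out)."""
    theta = theta_flat.reshape(n, n)
    theta_p, theta_rej = theta[:, :p], theta[:, p:]
    N = sum(class_counts)
    ll = N * np.log(abs(np.linalg.det(theta)))                       # Jacobian term
    ll -= 0.5 * N * np.log(np.linalg.det(theta_rej.T @ total_cov @ theta_rej))
    for Nj, Sj in zip(class_counts, class_covs):                     # per-class terms
        ll -= 0.5 * Nj * np.log(np.linalg.det(theta_p.T @ Sj @ theta_p))
    return -ll

def fit_hlda(class_covs, class_counts, total_cov, p, theta_init):
    """Numerically maximize the likelihood; theta_init is a nonsingular n x n
    matrix, typically built from the LDA eigenvectors."""
    n = total_cov.shape[0]
    res = minimize(neg_hlda_loglik, theta_init.ravel(),
                   args=(n, p, class_covs, class_counts, total_cov),
                   method="BFGS")
    theta = res.x.reshape(n, n)
    return theta[:, :p]      # keep only the first p columns for dimension reduction
```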
26. HLDA: Diagonal
- In speech recognition, we often assume that the within-class covariances are diagonal.
- The log-likelihood of the data can then be written with the class covariances constrained to be diagonal.
27. HLDA: Diagonal (cont.)
- Using the same method as before, and maximizing the likelihood with respect to the means and variances, we get their ML estimates in terms of θ.
28. HLDA: Diagonal (cont.)
- Substituting the values of the maximizing mean and variance parameters gives the maximized likelihood of the data in terms of θ.
29. HLDA: Diagonal (cont.)
- We can simplify this maximization just as in the full-rank case; see the sketch below.
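Continuing the sketch after slide 25, the diagonal-covariance variant replaces each log-determinant of a projected covariance with the log-determinant of its diagonal (again an assumed Kumar-Andreou-style form; illustrative code reusing the names from the previous sketch).

```python
import numpy as np

def neg_hlda_loglik_diag(theta_flat, n, p, class_covs, class_counts, total_cov):
    """Diagonal-covariance HLDA: keep only the diagonal of each projected covariance."""
    theta = theta_flat.reshape(n, n)
    theta_p, theta_rej = theta[:, :p], theta[:, p:]
    N = sum(class_counts)
    ll = N * np.log(abs(np.linalg.det(theta)))
    ll -= 0.5 * N * np.sum(np.log(np.diag(theta_rej.T @ total_cov @ theta_rej)))
    for Nj, Sj in zip(class_counts, class_covs):
        ll -= 0.5 * Nj * np.sum(np.log(np.diag(theta_p.T @ Sj @ theta_p)))
    return -ll
```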
30. HLDA with equal parameters
- We finally consider the case where every class has an equal covariance matrix. Then the maximum-likelihood parameter estimates can be written down in the same way, with a single shared covariance.
31. HLDA with equal parameters (cont.)
- The solution obtained by taking the eigenvectors corresponding to the largest p eigenvalues of W^{-1} B also maximizes the expression above, thus supporting the claim that LDA is the maximum-likelihood parameter estimate of a constrained model.
32. HDA: Introduction
- As with HLDA, the essence of HDA (Heteroscedastic Discriminant Analysis) is to remove the equal within-class covariance constraint.
- HDA defines an objective function similar to LDA's, which maximizes the class discrimination in the projected subspace while ignoring the rejected dimensions.
- The assumptions of HDA:
- Being the intuitive heteroscedastic extension of LDA, HDA shares the same assumptions as LDA (Chang, 2005). But why?
- First, all of the classification information lies in the first p-dimensional feature subspace.
- Second, every class distribution is normal.
33. HDA: Derivation
- With the uniform class-specific variance assumption removed for HDA, we try to maximize a ratio of projected between-class scatter to the per-class projected covariances.
- Taking the log and rearranging terms, we obtain the objective function H.
34. HDA: Derivation (cont.)
- H has useful invariance properties:
- For every nonsingular matrix φ, H(φθ) = H(θ). This means that subsequent feature-space transformations of the range of θ will not affect the value of the objective function. So, like LDA, the HDA solution is invariant to linear transformations of the data in the original space.
- No special provisions have to be made for θ during the optimization of H, except that θ^T θ ≠ 0.
- The objective function is invariant to row or column scalings of θ or eigenvalue scalings of θ^T θ.
- Using matrix differentiation, the derivative of H can be written in closed form.
- There is no closed-form solution for ∇H(θ) = 0.
- Instead, we used a quasi-Newton conjugate gradient descent routine from the NAG Fortran library for the optimization of H (a sketch with a general-purpose optimizer follows).
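The slides' equation for H was lost in conversion; the sketch below assumes the Saon et al. form H(θ) = Σ_j N_j (log|θ^T B θ| - log|θ^T S_j θ|), with θ an n×p projection, B the between-class variation and S_j the class covariances, and uses a general-purpose SciPy conjugate-gradient routine in place of the NAG routine mentioned above. Names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def neg_hda_objective(theta_flat, n, p, B, class_covs, class_counts):
    """Negative HDA objective (assumed Saon et al. form).
    Note: B has rank at most J-1, so p should not exceed rank(B)."""
    theta = theta_flat.reshape(n, p)
    logdet_B = np.log(np.linalg.det(theta.T @ B @ theta))
    H = 0.0
    for Nj, Sj in zip(class_counts, class_covs):
        H += Nj * (logdet_B - np.log(np.linalg.det(theta.T @ Sj @ theta)))
    return -H

def fit_hda(B, class_covs, class_counts, p, theta_init):
    """Gradient-based maximization of H, initialized e.g. from the LDA solution."""
    n = B.shape[0]
    res = minimize(neg_hda_objective, theta_init.ravel(),
                   args=(n, p, B, class_covs, class_counts), method="CG")
    return res.x.reshape(n, p)
```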
35. HDA: Derivation (cont.)
36. HDA: Likelihood interpretation
- Assume a single full-covariance Gaussian model for each class, and write the log likelihood of the samples according to the induced ML model.
- It may be seen that the summation in H is related to the log likelihood of the projected samples. Thus, θ can be interpreted as a constrained ML projection, the constraint being given by the maximization of the projected between-class scatter volume.
37. HDA: Diagonal variance
- Consider the case when diagonal-variance modelling constraints are present in the final feature space.
- MLLT (Maximum Likelihood Linear Transform) is introduced when the dimensions of the original and the projected space are the same.
- MLLT aims at minimizing the loss in likelihood between full and diagonal covariance Gaussian models.
- The objective is to find a transformation φ that maximizes the log-likelihood difference of the data, as sketched below.
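A sketch of the MLLT criterion as it is usually written (an assumed standard form, not taken from the slides): choose a square transform φ minimizing Σ_j (N_j/2) log|diag(φ S_j φ^T)| - N log|det φ|, the likelihood lost by forcing diagonal covariances. Illustrative code:

```python
import numpy as np
from scipy.optimize import minimize

def mllt_loss(phi_flat, n, class_covs, class_counts):
    """Likelihood loss of diagonal vs. full covariance modelling under transform phi."""
    phi = phi_flat.reshape(n, n)
    N = sum(class_counts)
    loss = -N * np.log(abs(np.linalg.det(phi)))
    for Nj, Sj in zip(class_counts, class_covs):
        loss += 0.5 * Nj * np.sum(np.log(np.diag(phi @ Sj @ phi.T)))
    return loss

def fit_mllt(class_covs, class_counts):
    n = class_covs[0].shape[0]
    res = minimize(mllt_loss, np.eye(n).ravel(),        # start from the identity transform
                   args=(n, class_covs, class_counts), method="BFGS")
    return res.x.reshape(n, n)
```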
38. HDA: HDA vs. HLDA
- Consider the diagonal constraint in the projected feature space and compare the resulting objectives:
- For HDA (the quantity to be maximized)
- For HLDA (the quantity to be minimized)