Title: PCA, LDA, HLDA and HDA
1. PCA, LDA, HLDA and HDA
- References
- E. Alpaydin, Introduction to Machine Learning, The MIT Press, 2004.
- S. R. Searle, Matrix Algebra Useful for Statistics, Wiley Series in Probability and Mathematical Statistics, New York, 1982.
- Berlin Chen's slides.
- N. Kumar and A. G. Andreou, "Heteroscedastic Discriminant Analysis and Reduced Rank HMMs for Improved Speech Recognition," Speech Communication, 26:283-297, 1998.
- G. Saon, M. Padmanabhan, R. Gopinath and S. Chen, "Maximum Likelihood Discriminant Feature Spaces," ICASSP, 2000.
- X. Liu, Linear Projection Schemes for Automatic Speech Recognition, M.Phil. thesis, University of Cambridge, 2001.
- ???, ????, 2005.
2. PCA: Introduction
- PCA (Principal Component Analysis) is a one-group, unsupervised projection method for reducing data dimensionality (feature extraction).
- We use PCA to find a mapping from the inputs in the original d-dimensional space to a new k-dimensional space (k < d), with
- minimum loss of information,
- maximum amount of information, measured in terms of variability.
- That is, we'll find the new variables or linear transformations (major axes, principal components) that reach this goal.
- The projection of x on the direction of w is z = w^T x.
3. PCA: Methodology (General)
- For maximizing the amount of information, PCA centers the sample and then rotates the axes to line up with the directions of highest variance.
- That is, find w such that Var(z) is maximized; this is the criterion.
- Var(z) = Var(w^T x) = E[(w^T x - w^T µ)^2] = E[(w^T x - w^T µ)(w^T x - w^T µ)]
  = E[w^T (x - µ)(x - µ)^T w]
  = w^T E[(x - µ)(x - µ)^T] w = w^T Σ w,
  where Var(x) = E[(x - µ)(x - µ)^T] = Σ.
4. PCA: Methodology (General) (cont.)
- Maximize Var(z1) = w1^T Σ w1 subject to ||w1|| = 1.
- For a unique solution, and to make the direction the important factor, we require ||w1|| = 1, i.e., w1^T w1 = 1.
- Setting the derivative of the Lagrangian w1^T Σ w1 - α(w1^T w1 - 1) to zero gives Σ w1 = α w1: w1 is an eigenvector of Σ, and α is the eigenvalue associated with w1.
- Var(z1) = Var(w1^T x) = w1^T Σ w1 = w1^T α w1 = α w1^T w1 = α.
- max Var(z1) = max α, so choose the eigenvector with the largest eigenvalue for Var(z1) to be maximal.
- α is the Lagrange multiplier.
5. PCA: Methodology (General) (cont.)
- Second principal component: max Var(z2), s.t. ||w2|| = 1 and w2 orthogonal to w1.
- The solution is the eigenvector of Σ with the second largest eigenvalue, so max Var(z2) = λ2.
- Conclusions
- wi is the eigenvector of Σ associated with the i-th largest eigenvalue, and Var(zi) = λi. (The above can be proved by mathematical induction.)
- The wi are uncorrelated (orthogonal).
- w1 explains as much as possible of the original variance in the data set.
- w2 explains as much as possible of the remaining variance, and so on.
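To make the recipe on the last few slides concrete, here is a minimal NumPy sketch (the function name pca_project and the choice of k are illustrative, not from the slides): center the data, eigendecompose the sample covariance Σ, and project onto the eigenvectors with the largest eigenvalues.

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X (N x d) onto the top-k principal components."""
    mu = X.mean(axis=0)                       # sample mean
    Xc = X - mu                               # center the data
    Sigma = np.cov(Xc, rowvar=False)          # sample covariance (d x d)
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]                 # d x k matrix of top-k eigenvectors
    Z = Xc @ W                                # z = W^T (x - mu) for each sample
    return Z, W, eigvals[order]

# Var(z_i) equals the i-th largest eigenvalue lambda_i, up to sampling noise.
```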
6. PCA: Some Discussions
- About dimensions
- If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues, so k will be much smaller than d and a large reduction in dimensionality may be attained.
- If the dimensions are not correlated, k will be as large as d and there is no gain through PCA.
7. PCA: Some Discussions (cont.)
- About Σ
- For two different eigenvalues, the eigenvectors are orthogonal.
- If Σ is positive definite (x^T Σ x > 0 for all x ≠ 0), all eigenvalues are positive.
- If Σ is singular, then its rank, the effective dimension, is k < d, and λi = 0 for i > k.
- About scaling
- Different variables may have completely different scalings.
- The eigenvalues of the matrix are scale dependent.
- If the scale of the data is unknown, it is better to use the correlation matrix instead of the covariance matrix.
- The interpretations of the principal components derived by these two methods can be completely different.
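As a small illustration of the scaling remark above, the sketch below (data and names are illustrative) compares principal directions computed from the covariance matrix with those from the correlation matrix; rescaling one variable changes the former but not the latter.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))
X = Z @ np.array([[1.0, 0.6, 0.3],
                  [0.0, 1.0, 0.5],
                  [0.0, 0.0, 1.0]])     # mildly correlated variables
X[:, 2] *= 100.0                        # give one variable a much larger scale

# np.linalg.eigh returns eigenvalues in ascending order, so the leading
# principal direction is the LAST column of each eigenvector matrix.
cov_dirs = np.linalg.eigh(np.cov(X, rowvar=False))[1]
corr_dirs = np.linalg.eigh(np.corrcoef(X, rowvar=False))[1]

print(cov_dirs[:, -1])   # nearly aligned with the rescaled third variable
print(corr_dirs[:, -1])  # reflects the correlation structure, unaffected by the rescaling
```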
8. PCA: Methodology (Spectral Decomposition)
- z = W^T x, or z = W^T (x - m) to center the data on the origin.
- We want to find a matrix W such that when z = W^T x, we get Cov(z) = D, a diagonal matrix.
- That is, we would like the zi to be uncorrelated.
- Let C = (c1, c2, ..., cd), where the ci are the normalized eigenvectors of S; then
- C^T C = I
- S = S C C^T = S (c1, c2, ..., cd) C^T = (S c1, S c2, ..., S cd) C^T = (λ1 c1, λ2 c2, ..., λd cd) C^T
  = λ1 c1 c1^T + λ2 c2 c2^T + ... + λd cd cd^T = C D C^T
- D = C^T S C
- Spectral decomposition is the factorization of a positive definite matrix S into S = C D C^T, where D is a diagonal matrix of eigenvalues and the columns of C are the eigenvectors.
- Keeping only the top k eigenvectors: D (k×k) = C^T (k×d) S (d×d) C (d×k), where d is the input space dimension and k is the feature space dimension; in the PCA notation above this is D = W^T Σ W.
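A short numerical check of the spectral decomposition S = C D C^T above (all names are illustrative): after projecting the centered data with C, the sample covariance of z is approximately the diagonal matrix D.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 0], [2, 3, 1], [0, 1, 2]], size=2000)

m = X.mean(axis=0)
S = np.cov(X, rowvar=False)              # sample covariance
eigvals, C = np.linalg.eigh(S)           # S = C D C^T with C^T C = I
D = np.diag(eigvals)

assert np.allclose(S, C @ D @ C.T)       # spectral decomposition holds
Z = (X - m) @ C                          # z = C^T (x - m)
# Cov(z) is approximately the diagonal matrix D: off-diagonals near zero.
print(np.round(np.cov(Z, rowvar=False), 2))
```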
9. Appendix A
- Another criterion for PCA is the MMSE (Minimum Mean-Squared Error) criterion, which reaches the same destination as the above two methods, though there may be interesting differences among them.
- Some important properties of symmetric matrices:
- Eigenvalues are all real.
- Symmetric matrices are diagonalizable.
- Eigenvectors are orthogonal.
- Eigenvectors corresponding to different eigenvalues are orthogonal.
- The mk linearly independent (LIN) eigenvectors corresponding to any eigenvalue λk of multiplicity mk can be chosen so that they are orthogonal.
- Rank equals the number of nonzero eigenvalues.
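These properties can be verified numerically; a brief illustrative check for a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
S = (A + A.T) / 2                                      # symmetrize

vals, vecs = np.linalg.eigh(S)                         # eigh handles symmetric matrices
assert np.all(np.isreal(vals))                         # eigenvalues are real
assert np.allclose(vecs.T @ vecs, np.eye(5))           # eigenvectors are orthonormal
assert np.allclose(S, vecs @ np.diag(vals) @ vecs.T)   # S is diagonalizable
assert np.linalg.matrix_rank(S) == np.sum(~np.isclose(vals, 0))  # rank = # nonzero eigenvalues
```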
10. LDA: Introduction
- LDA (Linear Discriminant Analysis) (Fisher, 1936) (Rao, 1935) is a supervised dimension-reduction method for classification problems.
- To obtain features suitable for speech sound classification, the use of LDA was proposed (Hunt, 1979).
- Brown showed that the LDA transform is superior to the PCA transform, using a DHMM classifier and incorporating context information (Brown, 1987).
- Later researchers have applied LDA to DHMM and CHMM speech recognition systems and have reported improved performance on small-vocabulary tasks, but mixed results on large-vocabulary phoneme-based systems.
11. LDA: Assumptions
- LDA is related to the MLE (Maximum Likelihood Estimation) of parameters for a Gaussian model, with two a priori assumptions (Campbell, 1984).
- First, all the class-discrimination information resides in a p-dimensional subspace of the n-dimensional feature space.
- Second, the within-class variances are equal for all classes.
- Another notable assumption is that the class distribution is a mixture of Gaussians (Hastie & Tibshirani, 1994). (Why not a single Gaussian?)
- That means LDA is optimal if the classes are normally distributed, but we can still use LDA for classification even when they are not.
12. LDA: Methodology
- Criterion: given a set of sample vectors with labeled (class) information, find a linear transform W such that the ratio of average between-class variation to average within-class variation is maximal.
- After projection, for all classes to be well separated, we would like the class means to be as far apart as possible and the examples of each class to be scattered in as small a region as possible.
13. LDA: Methodology (cont.)
- Let x be an n-dimensional feature vector. We seek a linear transformation R^n → R^p (p < n) of the form yp = θp^T x, where θp is an n×p matrix. Let θ be a nonsingular n×n matrix used to define the linear transformation y = θ^T x, and partition it as θ = [θp  θn-p].
- First, we apply a nonsingular linear transformation to x to obtain y = θ^T x. Second, we retain only the first p rows of y to give yp.
14. LDA: Methodology (cont.)
- Let there be a total of J classes, and let g(i) ∈ {1, ..., J} indicate the class associated with xi. Let {xi, i = 1, ..., N} be the set of training examples available, with Nj examples in class j.
- The sample mean: m = (1/N) Σ_i xi
- The class sample means: m_j = (1/N_j) Σ_{i: g(i)=j} xi
- The class sample covariances: S_j = (1/N_j) Σ_{i: g(i)=j} (xi - m_j)(xi - m_j)^T
15. LDA: Methodology (cont.)
- The average within-class variation: W = (1/N) Σ_j N_j S_j
- The average between-class variation: B = (1/N) Σ_j N_j (m_j - m)(m_j - m)^T
- The total sample covariance: T = (1/N) Σ_i (xi - m)(xi - m)^T = W + B
16. LDA: Methodology (cont.)
- To get a p-dimensional transformation, we maximize the ratio of projected between-class to within-class variation, |θp^T B θp| / |θp^T W θp|.
- To obtain θp, we choose those eigenvectors of W^{-1} B that correspond to the largest p eigenvalues, and let θp be the n×p matrix of these eigenvectors. The p-dimensional features obtained by yp = θp^T x are then uncorrelated (a code sketch follows).
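A minimal NumPy/SciPy sketch of the LDA recipe above (function and variable names are illustrative): compute the average within-class variation W and between-class variation B, then take the leading eigenvectors of W^{-1} B.

```python
import numpy as np
import scipy.linalg

def lda_transform(X, labels, p):
    """Return the n x p LDA projection matrix for samples X (N x n), labels an array of length N."""
    N, n = X.shape
    m = X.mean(axis=0)                               # global sample mean
    W = np.zeros((n, n))                             # average within-class variation
    B = np.zeros((n, n))                             # average between-class variation
    for j in np.unique(labels):
        Xj = X[labels == j]
        Nj = len(Xj)
        mj = Xj.mean(axis=0)
        W += (Xj - mj).T @ (Xj - mj) / N
        B += Nj * np.outer(mj - m, mj - m) / N
    # Generalized symmetric eigenproblem B v = lambda W v, i.e. the eigenvectors
    # of W^{-1} B; requires W positive definite (enough samples per dimension).
    eigvals, eigvecs = scipy.linalg.eigh(B, W)       # eigenvalues in ascending order
    theta_p = eigvecs[:, np.argsort(eigvals)[::-1][:p]]  # top-p eigenvectors
    return theta_p                                   # projected features: y_p = theta_p^T x
```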
17. HLDA: ML framework
- For LDA, since the final objective is classification, the implicit assumption is that the rejected subspace does not carry any classification information.
- For Gaussian models, the assumption of no classification information is equivalent to assuming that the means and the variances of the class distributions are the same for all classes in the rejected (n-p)-dimensional subspace.
- In an alternative view, let the full-rank linear transformation θ be such that the first p columns of θ span the p-dimensional subspace in which the class means, and possibly the class variances, are different.
- When rank(θ_{n×n}) = n, θ is said to have full rank, or to be of full rank: its rank equals its order, it is nonsingular, and its inverse exists.
- Can the θ obtained by LDA be full rank?
- Since the data variables x are Gaussian, their linear transformation y is also Gaussian.
18. HLDA: ML framework (cont.)
- The goal of HLDA (Heteroscedastic Discriminant Analysis) is to generalize LDA under the ML (Maximum Likelihood) framework.
- For notational convenience, we define µj as the class means and Sj as the class covariances after transformation.
19. HLDA: ML framework (cont.)
- The probability density of xi under the preceding model, where xi belongs to group g(i), is p(xi) = |θ| N(θ^T xi; µ_{g(i)}, S_{g(i)}). Note that although the Gaussian distribution is defined on the transformed variable yi, we are interested in maximizing the likelihood of the original data xi.
- The term |θ| comes from the Jacobian of the linear transformation y = θ^T x.
20. HLDA: ML framework (cont.)
21. HLDA: Full rank
- The log-likelihood of the data {xi} under the linear transformation θ and under the constrained Gaussian model assumption for each class is obtained by summing log p(xi) over all training examples.
- Doing a straightforward maximization with respect to the various parameters is computationally intensive. (Why?)
22. HLDA: Full rank (cont.)
- We simplify it considerably by first calculating the values of the mean and variance parameters that maximize the likelihood in terms of a fixed linear transformation θ.
- We'll get the ML estimates of the means and variances as functions of θ.
- Transformations vs. ML estimators?
23. HLDA: Full rank (cont.)
- By replacing the two sets of parameters by these estimates in terms of θ, the log-likelihood becomes a function of θ alone.
24. HLDA: Full rank (cont.)
- We can simplify the above log-likelihood using the following propositions.
- Proposition 1: Let F be any full-rank n×n matrix. Let t be any n×p rank-p matrix (p < n). Then Trace(t (t^T F t)^{-1} t^T F) = p.
- Proposition 2
25. HLDA: Full rank (cont.)
- Since there is no closed-form solution for maximizing the likelihood with respect to θ, the maximization has to be performed numerically.
- An initial guess of θ: the LDA solution.
- Quadratic programming algorithms in the MATLAB toolbox can be used (a sketch with a general-purpose optimizer follows).
- After optimization, we use only the first p columns of θ to obtain the dimension-reducing transformation.
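Because the slides' own formulas were lost in conversion, the sketch below assumes the Kumar-Andreou form of the full-rank HLDA log-likelihood after the means and covariances have been maximized out: N log|det θ|, minus (N/2) log|θ_{n-p}^T T θ_{n-p}| for the rejected dimensions (T the total covariance), minus Σ_j (N_j/2) log|θ_p^T S_j θ_p| for the retained ones. All names are illustrative, and SciPy's general-purpose BFGS stands in for the quadratic-programming routines mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_hlda_loglik(theta_flat, n, p, class_covs, class_counts, total_cov):
    """Negative full-rank HLDA log-likelihood, up to an additive constant
    (assumed Kumar-Andreou form; means/covariances already maximized out)."""
    theta = theta_flat.reshape(n, n)
    theta_p, theta_rej = theta[:, :p], theta[:, p:]
    N = sum(class_counts)
    ll = N * np.log(abs(np.linalg.det(theta)))                       # Jacobian term
    ll -= 0.5 * N * np.log(np.linalg.det(theta_rej.T @ total_cov @ theta_rej))
    for Nj, Sj in zip(class_counts, class_covs):                     # per-class terms
        ll -= 0.5 * Nj * np.log(np.linalg.det(theta_p.T @ Sj @ theta_p))
    return -ll

def fit_hlda(class_covs, class_counts, total_cov, p, theta_init):
    """Numerically maximize the likelihood; theta_init is a nonsingular n x n
    matrix, typically built from the LDA eigenvectors."""
    n = total_cov.shape[0]
    res = minimize(neg_hlda_loglik, theta_init.ravel(),
                   args=(n, p, class_covs, class_counts, total_cov),
                   method="BFGS")
    theta = res.x.reshape(n, n)
    return theta[:, :p]      # keep only the first p columns for dimension reduction
```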
26. HLDA: Diagonal
- In speech recognition, we often assume that the within-class covariances are diagonal.
- The log-likelihood of the data can then be written with the class covariances constrained to be diagonal.
27. HLDA: Diagonal (cont.)
- Using the same method as before, and maximizing the likelihood with respect to the means and variances, we get their ML estimates in terms of θ.
28. HLDA: Diagonal (cont.)
- Substituting the values of the maximizing mean and variance parameters gives the maximized likelihood of the data in terms of θ.
29. HLDA: Diagonal (cont.)
- We can simplify this maximization just as in the full-rank case; see the sketch below.
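Continuing the sketch after slide 25, the diagonal-covariance variant replaces each log-determinant of a projected covariance with the log-determinant of its diagonal (again an assumed Kumar-Andreou-style form; illustrative code reusing the names from the previous sketch).

```python
import numpy as np

def neg_hlda_loglik_diag(theta_flat, n, p, class_covs, class_counts, total_cov):
    """Diagonal-covariance HLDA: keep only the diagonal of each projected covariance."""
    theta = theta_flat.reshape(n, n)
    theta_p, theta_rej = theta[:, :p], theta[:, p:]
    N = sum(class_counts)
    ll = N * np.log(abs(np.linalg.det(theta)))
    ll -= 0.5 * N * np.sum(np.log(np.diag(theta_rej.T @ total_cov @ theta_rej)))
    for Nj, Sj in zip(class_counts, class_covs):
        ll -= 0.5 * Nj * np.sum(np.log(np.diag(theta_p.T @ Sj @ theta_p)))
    return -ll
```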
30. HLDA with equal parameters
- We finally consider the case where every class has an equal covariance matrix. Then the maximum-likelihood parameter estimates can be written down in the same way, with a single shared covariance.
31. HLDA with equal parameters (cont.)
- The solution obtained by taking the eigenvectors corresponding to the largest p eigenvalues of W^{-1} B also maximizes the expression above, thus supporting the claim that LDA is the maximum-likelihood parameter estimate of a constrained model.
32. HDA: Introduction
- As with HLDA, the essence of HDA (Heteroscedastic Discriminant Analysis) is to remove the equal within-class covariance constraint.
- HDA defines an objective function similar to LDA's, which maximizes the class discrimination in the projected subspace while ignoring the rejected dimensions.
- The assumptions of HDA:
- Being the intuitive heteroscedastic extension of LDA, HDA shares the same assumptions as LDA (Chang, 2005). But why?
- First, all of the classification information lies in the first p-dimensional feature subspace.
- Second, every class distribution is normal.
33. HDA: Derivation
- With the uniform class-specific variance assumption removed for HDA, we try to maximize a ratio of projected between-class scatter to the per-class projected covariances.
- Taking the log and rearranging terms, we obtain the objective function H.
34. HDA: Derivation (cont.)
- H has useful invariance properties:
- For every nonsingular matrix φ, H(φθ) = H(θ). This means that subsequent feature-space transformations of the range of θ will not affect the value of the objective function. So, like LDA, the HDA solution is invariant to linear transformations of the data in the original space.
- No special provisions have to be made for θ during the optimization of H, except that θ^T θ ≠ 0.
- The objective function is invariant to row or column scalings of θ or eigenvalue scalings of θ^T θ.
- Using matrix differentiation, the derivative of H can be written in closed form.
- There is no closed-form solution for ∇H(θ) = 0.
- Instead, we used a quasi-Newton conjugate gradient descent routine from the NAG Fortran library for the optimization of H (a sketch with a general-purpose optimizer follows).
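The slides' equation for H was lost in conversion; the sketch below assumes the Saon et al. form H(θ) = Σ_j N_j (log|θ^T B θ| - log|θ^T S_j θ|), with θ an n×p projection, B the between-class variation and S_j the class covariances, and uses a general-purpose SciPy conjugate-gradient routine in place of the NAG routine mentioned above. Names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def neg_hda_objective(theta_flat, n, p, B, class_covs, class_counts):
    """Negative HDA objective (assumed Saon et al. form).
    Note: B has rank at most J-1, so p should not exceed rank(B)."""
    theta = theta_flat.reshape(n, p)
    logdet_B = np.log(np.linalg.det(theta.T @ B @ theta))
    H = 0.0
    for Nj, Sj in zip(class_counts, class_covs):
        H += Nj * (logdet_B - np.log(np.linalg.det(theta.T @ Sj @ theta)))
    return -H

def fit_hda(B, class_covs, class_counts, p, theta_init):
    """Gradient-based maximization of H, initialized e.g. from the LDA solution."""
    n = B.shape[0]
    res = minimize(neg_hda_objective, theta_init.ravel(),
                   args=(n, p, B, class_covs, class_counts), method="CG")
    return res.x.reshape(n, p)
```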
35. HDA: Derivation (cont.)
36. HDA: Likelihood interpretation
- Assume a single full-covariance Gaussian model for each class, and write the log likelihood of the samples according to the induced ML model.
- It may be seen that the summation in H is related to the log likelihood of the projected samples. Thus, θ can be interpreted as a constrained ML projection, the constraint being given by the maximization of the projected between-class scatter volume.
37. HDA: Diagonal variance
- Consider the case when diagonal-variance modelling constraints are present in the final feature space.
- MLLT (Maximum Likelihood Linear Transform) is introduced when the dimensions of the original and the projected space are the same.
- MLLT aims at minimizing the loss in likelihood between full and diagonal covariance Gaussian models.
- The objective is to find a transformation φ that maximizes the log-likelihood difference of the data, as sketched below.
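A sketch of the MLLT criterion as it is usually written (an assumed standard form, not taken from the slides): choose a square transform φ minimizing Σ_j (N_j/2) log|diag(φ S_j φ^T)| - N log|det φ|, the likelihood lost by forcing diagonal covariances. Illustrative code:

```python
import numpy as np
from scipy.optimize import minimize

def mllt_loss(phi_flat, n, class_covs, class_counts):
    """Likelihood loss of diagonal vs. full covariance modelling under transform phi."""
    phi = phi_flat.reshape(n, n)
    N = sum(class_counts)
    loss = -N * np.log(abs(np.linalg.det(phi)))
    for Nj, Sj in zip(class_counts, class_covs):
        loss += 0.5 * Nj * np.sum(np.log(np.diag(phi @ Sj @ phi.T)))
    return loss

def fit_mllt(class_covs, class_counts):
    n = class_covs[0].shape[0]
    res = minimize(mllt_loss, np.eye(n).ravel(),        # start from the identity transform
                   args=(n, class_covs, class_counts), method="BFGS")
    return res.x.reshape(n, n)
```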
38. HDA: HDA vs. HLDA
- Consider the diagonal constraint in the projected feature space and compare the resulting objectives:
- For HDA (the quantity to be maximized)
- For HLDA (the quantity to be minimized)