PCA, LDA, HLDA and HDA

1
PCA, LDA, HLDA and HDA
  • References
  • E. Alpaydin, Introduction to Machine Learning,
    The MIT Press, 2004.
  • S. R. Searle, Matrix Algebra Useful for
    Statistics, Wiley Series in Probability and
    Mathematical Statistics, New York, 1982.
  • Berlin Chen's slides.
  • N. Kumar and A. G. Andreou, "Heteroscedastic
    Discriminant Analysis and Reduced-Rank HMMs for
    Improved Speech Recognition," Speech
    Communication, 26:283-297, 1998.
  • G. Saon, M. Padmanabhan, R. Gopinath and S. Chen,
    "Maximum Likelihood Discriminant Feature Spaces,"
    ICASSP, 2000.
  • X. Liu, "Linear Projection Schemes for Automatic
    Speech Recognition," MPhil thesis,
    University of Cambridge, 2001.
  • Chang, [Chinese-language reference], 2005.

2
PCA Introduction
  • PCA (Principal Component Analysis) is a one-group,
    unsupervised projection method for reducing
    data dimensionality (feature extraction).
  • We use PCA to find a mapping from the
    inputs in the original d-dimensional space to a
    new (k < d)-dimensional space, with
  • minimum loss of information,
  • maximum amount of information, measured in terms
    of variability.
  • That is, we'll find the new variables or linear
    transformations (major axes, principal
    components) that reach this goal.
  • The projection of x onto the direction of w is
    z = w^T x.

3
PCA Methodology (General)
  • For maximizing the amount of information, PCA
    centers the sample and then rotates the axes to
    line up with the directions of highest variance.
  • That is, find w such that Var(z) is maximized,
    which is the criterion.
  • Var(z) = Var(w^T x) = E[(w^T x - w^T μ)^2]
    = E[(w^T x - w^T μ)(w^T x - w^T μ)^T]
    = E[w^T (x - μ)(x - μ)^T w]
    = w^T E[(x - μ)(x - μ)^T] w = w^T Σ w

Var(x) = E[(x - μ)^2] = E[(x - μ)(x - μ)^T] = Σ
4
PCA Methodology (General) (cont.)
  • Maximize Var(z1) = w1^T Σ w1 subject to ||w1|| = 1.
  • For a unique solution, and to make the direction
    the important factor, we require ||w1|| = 1, i.e.,
    w1^T w1 = 1.
  • Writing the Lagrangian and setting its derivative
    to zero gives Σ w1 = λ w1: w1 is an eigenvector of
    Σ, and λ is the eigenvalue associated with w1.
  • Var(z1) = Var(w1^T x) = w1^T Σ w1 = w1^T λ w1
    = λ w1^T w1 = λ.
  • max Var(z1) = max λ, so choose the eigenvector with
    the largest eigenvalue for Var(z1) to be maximal.

λ is the Lagrange multiplier.
5
PCA Methodology (General) (cont.)
  • Second principal component: maximize Var(z2),
    s.t. ||w2|| = 1 and w2 orthogonal to w1.
  • The solution is the eigenvector of Σ with the
    second-largest eigenvalue, so max Var(z2) = λ2.
  • Conclusions (see the sketch after this list):
  • wi is the eigenvector of Σ associated with the
    i-th largest eigenvalue, and Var(zi) = λi. (The
    above can be proved by mathematical induction.)
  • The wi are orthogonal, so the zi are uncorrelated.
  • w1 explains as much as possible of the original
    variance in the data set.
  • w2 explains as much as possible of the remaining
    variance, etc.
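A minimal NumPy sketch of the eigenvector view of PCA described above; the function and toy data are illustrative, not from the slides. It also checks that Var(z1) equals the largest eigenvalue.

```python
# Minimal NumPy sketch of PCA as an eigenvalue problem of the covariance matrix.
import numpy as np

def pca_directions(X, k):
    """Return the top-k principal directions and their variances (eigenvalues)."""
    Xc = X - X.mean(axis=0)                      # center the sample
    Sigma = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # symmetric => real eigenpairs
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    return eigvecs[:, order[:k]], eigvals[order[:k]]

X = np.random.randn(500, 5) @ np.random.randn(5, 5)   # correlated toy data
W, lam = pca_directions(X, k=2)
z1 = (X - X.mean(axis=0)) @ W[:, 0]                    # z1 = w1^T x for every sample
print(np.allclose(z1.var(ddof=1), lam[0]))             # Var(z1) equals lambda_1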

6
PCA Some Discussions
  • About dimensions
  • If the dimensions are highly correlated, there
    will be a small number of eigenvectors with large
    eigenvalues; k will be much smaller than d, and a
    large reduction in dimensionality can be attained.
  • If the dimensions are not correlated, k will be
    as large as d and there is no gain from PCA.

7
PCA Some Discussions (cont.)
  • About Σ
  • Eigenvectors corresponding to two different
    eigenvalues are orthogonal.
  • If Σ is positive definite (x^T Σ x > 0 for all
    x ≠ 0), all eigenvalues are positive.
  • If Σ is singular, then its rank, the effective
    dimension, is k < d, and λi = 0 for i > k.
  • About scaling
  • Different variables can have completely different
    scales.
  • The eigenvalues of the covariance matrix are
    scale dependent.
  • If the scale of the data is unknown, it is better
    to use the correlation matrix instead of the
    covariance matrix (see the sketch below).
  • The interpretation of the principal components
    derived by these two methods can be completely
    different.
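A short NumPy illustration of the scaling point, using toy data with illustrative names: the covariance spectrum is dominated by the large-scale variable, while the correlation matrix weights all variables comparably.

```python
# Covariance vs. correlation matrix under wildly different variable scales.
import numpy as np

X = np.random.randn(200, 3) * np.array([1.0, 100.0, 0.01])  # very different scales
S = np.cov(X, rowvar=False)        # covariance: eigenvalues dominated by column 2
R = np.corrcoef(X, rowvar=False)   # correlation: scale-free, comparable weighting
print(np.linalg.eigvalsh(S))
print(np.linalg.eigvalsh(R))
```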

8
PCA Methodology (SD)
  • z = W^T x, or z = W^T (x - m) to center the data
    at the origin.
  • We want a matrix W such that, for z = W^T x,
    Cov(z) = D for some diagonal matrix D.
  • We would like to get uncorrelated zi.
  • Let C = [c1, c2, ..., cd] hold the normalized
    eigenvectors of S; then
  • C^T C = I
  • S = S C C^T = S (c1, c2, ..., cd) C^T
    = (S c1, S c2, ..., S cd) C^T
    = (λ1 c1, λ2 c2, ..., λd cd) C^T
    = λ1 c1 c1^T + λ2 c2 c2^T + ... + λd cd cd^T
    = C D C^T
  • D = C^T S C

Spectral Decomposition is the factorization of a
positive definite matrix S into S = C D C^T, where D
is a diagonal matrix of eigenvalues and the columns
of C are the eigenvectors.
D(k×k) = C^T(k×d) S(d×d) C(d×k), where d = input
space dimension and k = feature space dimension;
that is, D = W^T Σ W.
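A hedged NumPy sketch of the spectral-decomposition view, with illustrative names: it checks S = C D C^T and that z = C^T (x - m) has diagonal covariance D.

```python
# Spectral decomposition of the sample covariance and decorrelation of the data.
import numpy as np

X = np.random.randn(1000, 4) @ np.random.randn(4, 4)   # toy data
m = X.mean(axis=0)
S = np.cov(X, rowvar=False)
lam, C = np.linalg.eigh(S)                  # columns of C: normalized eigenvectors
D = np.diag(lam)
print(np.allclose(S, C @ D @ C.T))          # S = C D C^T
Z = (X - m) @ C                             # z = C^T (x - m), applied row-wise
print(np.allclose(np.cov(Z, rowvar=False), D))   # Cov(z) = D (diagonal)
```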
9
Appendix A
  • Another criterion for PCA is the MMSE (Minimum
    Mean-Squared Error) criterion, which reaches the
    same destination as the two methods above, though
    there are interesting differences among them.
  • Some important properties of symmetric matrices:
  • Eigenvalues are all real.
  • Symmetric matrices are diagonalizable.
  • Eigenvectors can be chosen to be orthogonal.
  • Eigenvectors corresponding to different
    eigenvalues are orthogonal.
  • mk linearly independent (LIN) eigenvectors
    corresponding to any eigenvalue λk of multiplicity
    mk can be obtained such that they are orthogonal.
  • Rank equals the number of nonzero eigenvalues.

10
LDA Introduction
  • LDA (Linear Discriminant Analysis) (Fisher, 1936;
    Rao, 1935) is a supervised method of dimension
    reduction for classification problems.
  • To obtain features suitable for speech sound
    classification, the use of LDA was proposed
    (Hunt, 1979).
  • Brown showed that the LDA transform is superior
    to the PCA transform, using a DHMM classifier and
    incorporating context information (Brown, 1987).
  • Later researchers applied LDA to DHMM- and
    CHMM-based speech recognition systems and reported
    improved performance on small-vocabulary tasks,
    but mixed results on large-vocabulary
    phoneme-based systems.

11
LDA Assumptions
  • LDA is related to the MLE (Maximum Likelihood
    Estimation) of parameters for a Gaussian model,
    with two a priori assumptions (Campbell, 1984).
  • First, all the class-discrimination information
    resides in a p-dimensional subspace of the
    n-dimensional feature space.
  • Second, the within-class variances are equal for
    all classes.
  • Another notable assumption is that the class
    distributions are a mixture of Gaussians (Hastie &
    Tibshirani, 1994). (Why not a single Gaussian?)
  • That means LDA is optimal if the classes are
    normally distributed, but we can still use LDA
    for classification even when they are not.

12
LDA Methodology
  • Criterion: given a set of sample vectors with
    label (class) information, find a linear
    transform W such that the ratio of the average
    between-class variation to the average
    within-class variation is maximal.
  • After projection, for all classes to be well
    separated, we would like the means to be as far
    apart as possible and the examples of each class
    to be scattered in as small a region as possible.

13
LDA Methodology (cont.)
  • Let x be an n-dimensional feature vector. We seek
    a linear transformation R^n → R^p (p < n) of the
    form y_p = θ_p^T x, where θ_p is an n×p matrix.
    Let θ be a nonsingular n×n matrix used to define
    the linear transformation y = θ^T x. Let us
    partition θ as θ = [θ_p θ_{n-p}], where θ_p holds
    the first p columns.
  • First, we apply a nonsingular linear
    transformation to x to obtain y = θ^T x. Second,
    we retain only the first p rows of y to give y_p.

14
LDA Methodology (cont.)
  • Let there be a total of J classes, and let
    g(i) ∈ {1, ..., J} indicate the class associated
    with x_i. Let {x_i, i = 1, ..., N} be the set of
    training examples available.
  • The sample mean
  • The class sample means
  • The class sample covariances
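These three quantities appear only as images in the original slides. A hedged reconstruction of the standard definitions (the symbols x̄, x̄_j and Ŵ_j are introduced here, not taken from the slides) is:

\[
\bar{x}=\frac{1}{N}\sum_{i=1}^{N}x_i,\qquad
\bar{x}_j=\frac{1}{N_j}\sum_{i:\,g(i)=j}x_i,\qquad
\hat{W}_j=\frac{1}{N_j}\sum_{i:\,g(i)=j}(x_i-\bar{x}_j)(x_i-\bar{x}_j)^T,
\]

where N_j is the number of training examples in class j.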

15
LDA Methodology (cont.)
  • The average within-class variation
  • The average between-class variation
  • The total sample covariance
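The corresponding formulas are likewise images in the source. Under the usual conventions, and up to the 1/N normalization (which the slides do not show), the within-class, between-class and total covariances are:

\[
W=\frac{1}{N}\sum_{j=1}^{J}N_j\hat{W}_j,\qquad
B=\frac{1}{N}\sum_{j=1}^{J}N_j(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^T,\qquad
T=\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})(x_i-\bar{x})^T=W+B.
\]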

16
LDA Methodology (cont.)
  • To get a p-dimensional transformation, we
    maximize the ratio of the projected between-class
    variation to the projected within-class variation.
  • To obtain θ_p, we choose those eigenvectors of
    W^{-1} B that correspond to the largest p
    eigenvalues, and let θ_p be an n×p matrix of
    these eigenvectors. The p-dimensional features
    obtained by y_p = θ_p^T x are then uncorrelated.
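A minimal NumPy sketch of this recipe, assuming the scatter matrices W and B above; the function and variable names are illustrative, not from the slides.

```python
# LDA transform: eigenvectors of W^{-1} B for the largest p eigenvalues.
import numpy as np

def lda_transform(X, labels, p):
    N, n = X.shape
    xbar = X.mean(axis=0)
    W = np.zeros((n, n))
    B = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mj = Xj.mean(axis=0)
        W += (Xj - mj).T @ (Xj - mj)                       # within-class scatter
        B += len(Xj) * np.outer(mj - xbar, mj - xbar)      # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))  # eigenpairs of W^{-1} B
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:p]].real                      # n x p projection theta_p

# y_p = theta_p^T x, applied row-wise:
# Y = X @ lda_transform(X, labels, p=2)
```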

17
HLDA ML framework
  • For LDA, since the final objective is
    classification, the implicit assumption is that
    the rejected subspace does not carry any
    classification information.
  • For Gaussian models, the assumption of lack of
    classification information is equivalent to the
    assumption that the means and the variances of
    the class distributions are the same for all
    classes in the rejected (n-p)-dimensional
    subspace.
  • Now, in an alternative view, let the full-rank
    linear transformation θ be such that the first p
    columns of θ span the p-dimensional subspace in
    which the class means, and possibly the class
    variances, are different.
  • When rank(θ_{n×n}) = n, θ is said to have full
    rank, or to be of full rank: its rank equals its
    order, it is nonsingular, and its inverse exists.
  • (Can the θ obtained by LDA be of full rank?)
  • Since the data variables x are Gaussian, their
    linear transformation y is also Gaussian.

18
HLDA ML framework (cont.)
  • The goal of HLDA (Heteroscedastic Discriminant
    Analysis) is to generalize LDA under the ML
    (Maximum Likelihood) framework.
  • For notational convenience, we define the
    transformed class parameters as below, where μ_j
    represents the class means and Σ_j represents the
    class covariances after transformation.
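The defining equation is an image in the source. A hedged reconstruction of the constrained model of Kumar and Andreou (1998), consistent with the text on the previous slide, is:

\[
y_i=\theta^T x_i\sim\mathcal{N}\big(\mu_{g(i)},\,\Sigma_{g(i)}\big),\qquad
\mu_j=\begin{pmatrix}\mu_j^{(p)}\\ \mu_0\end{pmatrix},\qquad
\Sigma_j=\begin{pmatrix}\Sigma_j^{(p)}&0\\ 0&\Sigma_0\end{pmatrix},
\]

so only the first p dimensions carry class-specific means and covariances; the rejected n - p dimensions share a common mean μ_0 and covariance Σ_0.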

19
HLDA ML framework (cont.)
  • The probability density of x_i under the
    preceding model is given below, where x_i belongs
    to group g(i). Note that although the Gaussian
    distribution is defined on the transformed
    variable y_i, we are interested in maximizing the
    likelihood of the original data x_i.
  • The term |θ| comes from the Jacobian of the
    linear transformation y = θ^T x.
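The density itself is an image in the source; a hedged reconstruction consistent with the |θ| Jacobian remark above is:

\[
p(x_i)=\frac{|\det\theta|}{(2\pi)^{n/2}\,\big|\Sigma_{g(i)}\big|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}\big(\theta^T x_i-\mu_{g(i)}\big)^T\Sigma_{g(i)}^{-1}\big(\theta^T x_i-\mu_{g(i)}\big)\Big).
\]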

20
HLDA ML framework (cont.)
21
HLDA Full rank
  • The log-likelihood of the data {x_i} under the
    linear transformation θ and under the constrained
    Gaussian model assumption for each class is the
    sum over i of the log of the density above.
  • Doing a straightforward maximization with respect
    to various parameters is computationally
    intensive. (Why?)

22
HLDA Full rank (cont.)
  • We simplify it considerably by first calculating
    the values of the mean and variance parameters
    that maximize the likelihood in terms of a fixed
    linear transformation θ.
  • We'll get the estimates shown below.
  • Transformations vs. ML estimators?
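The estimates appear as an image in the source. For the full-rank case, a hedged reconstruction following Kumar and Andreou (1998), using the sample means and covariances x̄_j, x̄, Ŵ_j and T defined in the LDA slides, is:

\[
\hat{\mu}_j=\begin{pmatrix}\theta_p^T\,\bar{x}_j\\[2pt]\theta_{n-p}^T\,\bar{x}\end{pmatrix},\qquad
\hat{\Sigma}_j=\begin{pmatrix}\theta_p^T\,\hat{W}_j\,\theta_p&0\\0&\theta_{n-p}^T\,T\,\theta_{n-p}\end{pmatrix}.
\]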

23
HLDA Full rank (cont.)
  • Replacing the two parameters by these estimates,
    the log-likelihood becomes a function of θ alone.

24
HLDA Full rank (cont.)
  • We can simplify the above log-likelihood; the
    resulting objective is sketched after this list.
  • Proposition 1: Let F be any full-rank n×n matrix
    and let t be any n×p matrix of rank p (p < n).
    Then Trace(t (t^T F t)^{-1} t^T F) = p.
  • Proposition 2
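The simplified objective is an image in the source; a hedged reconstruction, in the notation above and up to an additive constant, is:

\[
\hat{\theta}=\arg\max_{\theta}\;\Big\{N\log|\det\theta|
-\frac{N}{2}\log\big|\theta_{n-p}^T\,T\,\theta_{n-p}\big|
-\sum_{j=1}^{J}\frac{N_j}{2}\log\big|\theta_p^T\,\hat{W}_j\,\theta_p\big|\Big\}.
\]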

25
HLDA Full rank (cont.)
  • Since there is no closed-form solution for
    maximizing the likelihood with respect to θ, the
    maximization has to be performed numerically.
  • An initial guess for θ: the LDA solution.
  • Quadratic programming algorithms in the MATLAB™
    toolbox.
  • After optimization, we use only the first p
    columns of θ to obtain the dimension-reducing
    transformation.
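The slide refers to MATLAB toolbox routines; purely as an illustration (a general-purpose SciPy optimizer, not the authors' implementation), the numerical maximization of the reconstructed objective above can be sketched as follows. W_list, T, N_j and theta_lda are illustrative names for the class sample covariances, total covariance, class counts, and a square (n×n) LDA-based initialization.

```python
# Hedged sketch: numerically maximize the full-rank HLDA objective with SciPy.
import numpy as np
from scipy.optimize import minimize

def neg_hlda_loglik(theta_flat, W_list, T, N_j, n, p):
    """Negative of the simplified full-rank HLDA objective (up to constants)."""
    theta = theta_flat.reshape(n, n)
    theta_p, theta_r = theta[:, :p], theta[:, p:]
    N = float(np.sum(N_j))
    sign, logdet = np.linalg.slogdet(theta)
    if sign == 0:
        return 1e12                           # keep theta nonsingular
    obj = N * logdet                          # N log|det(theta)|
    obj -= 0.5 * N * np.linalg.slogdet(theta_r.T @ T @ theta_r)[1]
    for Wj, Nj in zip(W_list, N_j):
        obj -= 0.5 * Nj * np.linalg.slogdet(theta_p.T @ Wj @ theta_p)[1]
    return -obj

def hlda(W_list, T, N_j, theta_lda, p):
    """Start from the (square) LDA solution and refine it numerically."""
    n = T.shape[0]
    res = minimize(neg_hlda_loglik, theta_lda.ravel(),
                   args=(W_list, T, N_j, n, p), method="BFGS")
    theta = res.x.reshape(n, n)
    return theta[:, :p]                       # first p columns give the reduction
```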

26
HLDA Diagonal
  • In speech recognition, we often assume that the
    within-class variances are diagonal.
  • The log-likelihood of the data can be written as

27
HLDA Diagonal (cont.)
  • Using the same method as before, and maximizing
    the likelihood with respect to means and
    variances, we get

28
HLDA Diagonal (cont.)
  • Substituting the values of the maximizing mean
    and variance parameters gives the maximized
    likelihood of the data in terms of θ.

29
HLDA Diagonal (cont.)
  • We can simplify this maximization to the
    objective sketched below.
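The simplified expression is an image in the source; a hedged reconstruction, analogous to the full-rank case but with diagonal projected covariances, is:

\[
\hat{\theta}=\arg\max_{\theta}\;\Big\{N\log|\det\theta|
-\frac{N}{2}\log\big|\mathrm{diag}\big(\theta_{n-p}^T\,T\,\theta_{n-p}\big)\big|
-\sum_{j=1}^{J}\frac{N_j}{2}\log\big|\mathrm{diag}\big(\theta_p^T\,\hat{W}_j\,\theta_p\big)\big|\Big\}.
\]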

30
HLDA with equal parameters
  • We finally consider the case where every class
    has the same covariance matrix. Then the
    maximum-likelihood parameter estimates can be
    written as follows.

31
HLDA with equal parameters (cont.)
  • The solution that we obtain by taking the
    eigenvectors corresponding to the largest p
    eigenvalues of the LDA eigenvalue problem also
    maximizes the expression above, thus establishing
    the claim that LDA is the maximum-likelihood
    estimate of a constrained model.

32
HDA Introduction
  • As with HLDA, the essence of HDA
    (Heteroscedastic Discriminant Analysis) is to
    remove the equal within-class covariance
    constraint.
  • HDA defines an objective function similar to
    LDA's, which maximizes the class discrimination in
    the projected subspace while ignoring the
    rejected dimensions.
  • The assumptions of HDA:
  • Being the intuitive heteroscedastic extension of
    LDA, HDA shares the same assumptions as LDA
    (Chang, 2005). But why?
  • First, all of the classification information
    lies in the first p-dimensional feature subspace.
  • Second, every class distribution is normal.

33
HDA Derivation
  • With the uniform class-specific covariance
    assumption removed for HDA, we try to maximize
    the criterion sketched below.
  • Taking the log and rearranging terms gives the
    objective H, also shown below.
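Both expressions are images in the source. A hedged reconstruction, following the HDA formulation of Saon et al. (2000) written in the y = θ^T x notation used here, is: maximize the product below, whose log gives H.

\[
\prod_{j=1}^{J}\left(\frac{\big|\theta_p^T B\,\theta_p\big|}{\big|\theta_p^T \hat{W}_j\,\theta_p\big|}\right)^{N_j}
\;\Longrightarrow\;
H(\theta_p)=N\log\big|\theta_p^T B\,\theta_p\big|
-\sum_{j=1}^{J}N_j\log\big|\theta_p^T \hat{W}_j\,\theta_p\big|.
\]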

34
HDA Derivation (cont.)
  • H has useful invariance properties:
  • For every nonsingular matrix f, H(fθ) = H(θ).
    This means that subsequent feature-space
    transformations of the range of θ will not affect
    the value of the objective function. So, like
    LDA, the HDA solution is invariant to linear
    transformations of the data in the original
    space.
  • No special provisions have to be made for θ
    during the optimization of H except for
    θ^T θ ≠ 0.
  • The objective function is invariant to row or
    column scalings of θ or eigenvalue scalings of
    θ^T θ.
  • Using matrix differentiation, the derivative of H
    can be obtained in closed form.
  • However, there is no closed-form solution for
    dH/dθ = 0.
  • Instead, we used a quasi-Newton conjugate
    gradient descent routine from the NAG Fortran
    library for the optimization of H.

35
HDA Derivation (cont.)
36
HDA Likelihood interpretation
  • Assume a single full-covariance Gaussian model
    for each class, and consider the log likelihood
    of these samples under the induced ML model.
  • It may be seen that the summation in H is related
    to the log likelihood of the projected samples.
    Thus, θ can be interpreted as a constrained ML
    projection, the constraint being given by the
    maximization of the projected between-class
    scatter volume.

37
HDA diagonal variance
  • Consider the case when diagonal variance modeling
    constraints are present in the final feature
    space.
  • MLLT (Maximum Likelihood Linear Transform) is
    introduced when the dimensions of the original
    and the projected space are the same.
  • MLLT aims at minimizing the loss in likelihood
    between full and diagonal covariance Gaussian
    models.
  • The objective is to find a transformation f that
    maximizes the log-likelihood difference of the
    data, sketched below.
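The slide's expression is an image; a hedged reconstruction of the usual MLLT criterion (keeping the slide's symbol f for the square transform; Σ̂_j are the class sample covariances) is:

\[
\hat{f}=\arg\max_{f}\;\sum_{j=1}^{J}N_j\Big(\log|\det f|
-\tfrac{1}{2}\log\big|\mathrm{diag}\big(f\,\hat{\Sigma}_j\,f^T\big)\big|\Big).
\]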

38
HDA: HDA vs. HLDA
  • Consider the diagonal constraint in the projected
    feature space
  • For HDA
  • For HLDA

(Slide annotations on the two expressions: "To be maximized", "To be minimized".)