Title: (Semi-)Supervised Probabilistic Principal Component Analysis
1 (Semi-)Supervised Probabilistic Principal Component Analysis
- Shipeng Yu
- University of Munich, Germany
- Siemens Corporate Technology
- http://www.dbs.ifi.lmu.de/spyu
- Joint work with Kai Yu, Volker Tresp, Hans-Peter Kriegel, Mingrui Wu
2 Dimensionality Reduction
- We are dealing with high-dimensional data
- Texts: e.g. bag-of-words features
- Images: color histogram, correlogram, etc.
- Web pages: texts, linkages, structures, etc.
- Motivations
- Noisy features: how to remove or down-weight them?
- Learnability: curse of dimensionality
- Inefficiency: high computational cost
- Visualization
- A pre-processing step for many data mining tasks
3 Unsupervised versus Supervised
- Unsupervised Dimensionality Reduction
- Only the input data are given
- PCA (principal component analysis)
- Supervised Dimensionality Reduction
- Should be biased by the outputs
- Classification: FDA (Fisher discriminant analysis)
- Regression: PLS (partial least squares)
- RVs: CCA (canonical correlation analysis)
- More general solutions?
- Semi-supervised?
4 Our Settings and Notations
- N data points, M input features, L output labels
- We aim to derive a mapping f: R^M -> R^K (K < M) such that the K-dimensional projections reflect both the inputs and the available outputs (formalized below)
[Figure: depending on how many points are labeled, the same setting covers unsupervised (no labels), semi-supervised (partially labeled), and supervised (fully labeled) dimensionality reduction]
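A hedged formalization of this setting; the symbols N, M, L, K are chosen to match the complexity notation used later in the talk, and the labeled/unlabeled split N_1 is an assumed convention, not taken from the slide:

```latex
% N data points with M input features; the first N_1 of them carry L output labels
X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times M}, \qquad
Y = [y_1, \ldots, y_{N_1}]^\top \in \mathbb{R}^{N_1 \times L}, \qquad 0 \le N_1 \le N
% Goal: a low-dimensional mapping
f : \mathbb{R}^M \to \mathbb{R}^K, \qquad K \ll M
% N_1 = 0: unsupervised;  0 < N_1 < N: semi-supervised;  N_1 = N: supervised
```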
5 Outline
- Principal Component Analysis
- Probabilistic PCA
- Supervised Probabilistic PCA
- Related Work
- Conclusion
6 PCA Motivation
- Find the K orthogonal projection directions that capture the most data variance
- Applications
- Visualization
- De-noising
- Latent semantic indexing
- Eigenfaces
[Figure: 2-D data cloud with the 1st and 2nd principal component directions drawn as orthogonal axes]
7 PCA Algorithm
- Basic Algorithm (sketched in code below)
- Center the data: subtract the mean from each data point
- Compute the sample covariance matrix
- Do an eigen-decomposition (sort eigenvalues decreasingly)
- The PCA directions are given by U_K, the first K columns of the eigenvector matrix U
- The PCA projection of a test point x is U_K^T (x - mean)
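A minimal NumPy sketch of the basic algorithm above; function and variable names are illustrative, not from the slides:

```python
import numpy as np

def pca_fit(X, K):
    """Basic PCA: center the data, eigen-decompose the sample covariance."""
    mu = X.mean(axis=0)                 # data mean
    Xc = X - mu                         # centered data
    S = Xc.T @ Xc / X.shape[0]          # sample covariance (M x M)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]   # sort eigenvalues decreasingly
    U_K = eigvecs[:, order[:K]]         # first K eigenvectors = PCA directions
    return mu, U_K

def pca_project(x, mu, U_K):
    """Project a test point onto the K principal directions."""
    return U_K.T @ (x - mu)

# usage
X = np.random.randn(200, 10)
mu, U_K = pca_fit(X, K=2)
z = pca_project(X[0], mu, U_K)
```

`eigh` is used because the covariance matrix is symmetric; for very high-dimensional data one would typically work with an SVD of the centered data instead of forming the M x M covariance explicitly.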
8 Supervised PCA?
- PCA is unsupervised
- When output information is available
- Classification: 0/1 labels
- Regression: real-valued responses
- Ranking orders: rank labels / preferences
- Multi-outputs: output dimension > 1
- Structured outputs, ...
- Can PCA be biased by outputs?
- And how?
9 Outline
- Principal Component Analysis
- Probabilistic PCA
- Supervised Probabilistic PCA
- Related Work
- Conclusion
10 Latent Variable Model for PCA
- Another interpretation of PCA [Pearson 1901]
- PCA minimizes the reconstruction error of the centered data (see the objective below)
- z_n are latent variables: the PCA projections of x_n
- W holds the factor loadings: the PCA mappings
- Equivalent to PCA up to a scaling factor
- Leads to the idea of PPCA
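The reconstruction-error objective referred to above, written out in standard notation (a sketch; the slide's own formula did not survive extraction):

```latex
\min_{W,\;\{z_n\}} \;\sum_{n=1}^{N} \bigl\| (x_n - \mu) - W z_n \bigr\|^2,
\qquad W \in \mathbb{R}^{M \times K},\; z_n \in \mathbb{R}^{K}
```

The optimal W spans the same subspace as the top K principal directions, which is why this view is equivalent to PCA up to scaling.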
11 Probabilistic PCA [TipBis99]
- Latent variable model: each input x is generated from a latent variable z plus Gaussian noise (see the model below)
- Conditional independence: the M input dimensions are independent given z
- If the noise variance sigma^2 -> 0, PPCA leads to the PCA solution (up to a rotation and scaling factor)
- x is Gaussian distributed, with mean vector mu and covariance W W^T + sigma^2 I
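The PPCA model of Tipping and Bishop (1999) in standard notation, restated here because the slide's equations were lost:

```latex
z \sim \mathcal{N}(0, I_K), \qquad
x \mid z \sim \mathcal{N}(W z + \mu,\; \sigma^2 I_M)
\quad\Longrightarrow\quad
x \sim \mathcal{N}\bigl(\mu,\; W W^\top + \sigma^2 I_M\bigr)
```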
12 From Unsupervised to Supervised
- Key insights of PPCA
- All the M input dimensions are conditionally independent given the K latent variables
- In PCA we are seeking the K latent variables that best explain the data covariance
- When we have outputs y, we believe
- There are inter-covariances between x and y
- There are intra-covariances within y (when there is more than one output dimension)
- Idea: let the latent variables explain all of them! (see the block covariance sketch below)
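One way to make this idea concrete (a sketch, not shown on the slide): if x and y are both generated linearly from the same latent z, with loading matrices W_x, W_y and isotropic noise, the joint covariance has the block form

```latex
\operatorname{Cov}\!\begin{pmatrix} x \\ y \end{pmatrix}
=
\begin{pmatrix}
W_x W_x^\top + \sigma_x^2 I_M & W_x W_y^\top \\
W_y W_x^\top & W_y W_y^\top + \sigma_y^2 I_L
\end{pmatrix}
```

so a single set of latent variables accounts for the intra-covariances of x and of y as well as their inter-covariance.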
13 Outline
- Principal Component Analysis
- Probabilistic PCA
- Supervised Probabilistic PCA
- Related Work
- Conclusion
14 Supervised Probabilistic PCA
- Supervised latent variable model: inputs x and outputs y are generated from the same latent variables z (see the model below)
- All input and output dimensions are conditionally independent given z
- (x, y) are jointly Gaussian distributed
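The SPPCA generative model in the same notation as the PPCA block above (a reconstruction; the exact noise parameterization on the slide is not recoverable):

```latex
z \sim \mathcal{N}(0, I_K), \qquad
x \mid z \sim \mathcal{N}(W_x z + \mu_x,\; \sigma_x^2 I_M), \qquad
y \mid z \sim \mathcal{N}(W_y z + \mu_y,\; \sigma_y^2 I_L)
```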
15 Semi-Supervised Probabilistic PCA
- Idea: an SPPCA model with missing output data!
- Likelihood: labeled points contribute the joint p(x, y), unlabeled points contribute the marginal p(x) (see below)
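The likelihood sketched in the same notation, assuming the first N_1 points are labeled and the rest are not (the split is an assumed convention):

```latex
\mathcal{L}(\Theta) \;=\; \sum_{n=1}^{N_1} \log p(x_n, y_n \mid \Theta)
\;+\; \sum_{n=N_1+1}^{N} \log p(x_n \mid \Theta)
```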
16 S2PPCA EM Learning
- Model parameters: the loading matrices, mean vectors and noise variances of the latent variable model
- EM Learning (a sketch for the PPCA base case follows below)
- E-step: estimate the latent variables for each data point (a projection problem; this is the inference problem)
- M-step: maximize the data log-likelihood w.r.t. the parameters
- An extension of EM learning for the PPCA model
- Can be kernelized!
- By-product: an EM learning algorithm for kernel PCA
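A minimal NumPy sketch of the EM iteration for the unsupervised PPCA base case; the S2PPCA updates additionally involve the output loadings and the labeled/unlabeled split, which the slides do not spell out. All names are illustrative:

```python
import numpy as np

def ppca_em(X, K, n_iter=100, seed=0):
    """EM for probabilistic PCA (Tipping & Bishop style updates)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.standard_normal((M, K))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(K))   # K x K
        Ez = Xc @ W @ Minv                                    # E[z_n], one row per point
        sumEzz = N * sigma2 * Minv + Ez.T @ Ez                # sum_n E[z_n z_n^T]
        # M-step: maximize the expected complete-data log-likelihood
        W = Xc.T @ Ez @ np.linalg.inv(sumEzz)
        sigma2 = (np.sum(Xc**2)
                  - 2.0 * np.sum((Xc @ W) * Ez)
                  + np.trace(sumEzz @ W.T @ W)) / (N * M)
    return W, mu, sigma2

# usage: project data with the learned model (posterior mean of z)
X = np.random.randn(300, 20)
W, mu, sigma2 = ppca_em(X, K=3)
Z = (X - mu) @ W @ np.linalg.inv(W.T @ W + sigma2 * np.eye(3))
```

The E-step computes the posterior mean of z for every point, which is exactly the projection/inference step mentioned above; the M-step re-estimates W and sigma^2.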
17 S2PPCA Toy Problem - Linear
18 S2PPCA Toy Problem - Nonlinear
19 S2PPCA Toy Problem - Nonlinear
20 S2PPCA Properties
- Semi-supervised projection
- Takes PCA and kernel PCA as special cases
- Applicable to large data sets
- Primal: O(t(M+L)NK) time, O((M+L)N) space
- Dual: O(tN^2 K) time, O(N^2) space (cheaper than the primal if M > N)
- A latent variable solution [Yu et al., SIGIR05]
- Cannot deal with the semi-supervised setting!
- Closed-form solutions exist for SPPCA
- No closed-form solutions for S2PPCA
21 SPPCA Primal Solution
- Closed-form solution via an eigen-problem over an (M+L) x (M+L) matrix built from the joint inputs and outputs
22 SPPCA Dual Solution
- The dual solution works with a new kernel matrix over the data points, combining input and output information
23 Experiments
- Train a nearest neighbor classifier after projection (pipeline sketched below)
- Evaluation metrics
- Multi-class classification: error rate
- Multi-label classification: F1-measure, AUC
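A hedged sketch of this evaluation pipeline, with ordinary PCA standing in for the (S)PPCA projection and scikit-learn providing the classifier and metric; the dataset and parameters are illustrative only:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# illustrative data; the talk uses text, face, and multi-label benchmarks
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 1) project to K dimensions (PCA here; SPPCA/S2PPCA would also use the labels)
proj = PCA(n_components=10).fit(X_tr)
Z_tr, Z_te = proj.transform(X_tr), proj.transform(X_te)

# 2) train a nearest neighbor classifier in the latent space
clf = KNeighborsClassifier(n_neighbors=1).fit(Z_tr, y_tr)

# 3) multi-class error rate
error_rate = 1.0 - accuracy_score(y_te, clf.predict(Z_te))
print(f"1-NN error rate after projection: {error_rate:.3f}")
```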
24 Multi-class Classification
- S2PPCA is almost always better than SPPCA
- LDA is very good for FACE data
- S2PPCA is very good on TEXT data
- S2PPCA has good scalability
25 Multi-label Classification
- S2PPCA is always better than SPPCA
- S2PPCA is better in most cases
- S2PPCA has good scalability
26 Extensions
- Put priors on factor loading matrices
- Learn MAP estimates for them
- Relax Gaussian noise model for outputs
- Better way to incorporate supervised information
- We need to do more approximations (using, e.g., EP)
- Directly predict missing outputs (i.e., in a single step)
- Mixture modeling in the latent variable space
- Achieve local PCA instead of global PCA
- Robust supervised PCA mapping
- Replace Gaussian with Student-t
- Outlier detection in PCA
27 Related Work
- Fisher discriminant analysis (FDA)
- Goal: find directions that maximize the between-class distance while minimizing the within-class distance
- Only deals with multi-class classification outputs
- Limited number of projection dimensions
- Canonical correlation analysis (CCA)
- Goal: maximize the correlation between inputs and outputs
- Ignores the intra-covariance of both inputs and outputs
- Partial least squares (PLS)
- Goal: sequentially find orthogonal directions that maximize the covariance with respect to the outputs
- A penalized CCA; poor generalization on new output dimensions
28 Conclusion
- A general framework for (semi-)supervised dimensionality reduction
- We can solve the problem analytically (EIG) or via iterative algorithms (EM)
- Trade-off
- EIG: optimization-based, easily extended to other losses
- EM: semi-supervised projection, good scalability
- Both algorithms can be kernelized
- PCA and kernel PCA are special cases
- More applications expected
29 Thank you!
http://www.dbs.ifi.lmu.de/spyu