Transcript and Presenter's Notes

Title: Closed-Form Supervised Dimensionality Reduction with Generalized Linear Models


1
Closed-Form Supervised Dimensionality Reduction
with Generalized Linear Models

Irina Rish, Genady Grabarnik, Guillermo Cecchi
IBM Watson Research, Yorktown Heights, NY, USA
Francisco Pereira
Princeton University, Princeton, NJ, USA
Geoffrey J. Gordon
Carnegie Mellon University, Pittsburgh, PA, USA
2
Outline
  • Why Supervised Dimensionality Reduction (SDR)?
  • General Framework for SDR
  • Simple and Efficient Algorithm
  • Empirical Results
  • Conclusions

3
Motivating Application: Brain Imaging
(Functional MRI)
  • Acquires sequence of MRI images of brain activity
    by measuring changes in blood oxygenation level
  • High-dimensional, small-sample real-valued data
  • 10,000 to 100,000 variables (voxels)
  • 100s of samples (time points)
  • Learning tasks
  • Predicting mental states
  • Is a person looking at a face or a building?
  • Is a person listening to a French or a Korean
    sentence?
  • How angry (happy, sad, anxious) is a person?
  • Extracting patterns of a mental disease (e.g.
    schizophrenia, depression)
  • Predicting brain activity given stimuli (e.g.,
    words)
  • Issues: Overfitting? Interpretability?

fMRI slicing image courtesy of fMRI Research
Center @ Columbia University
fMRI activation image and time-course courtesy
of Steve Smith, FMRIB
4
Other High-Dimensional Applications
  • Network connectivity data (sensor nets,
    peer-to-peer, Internet)
  • Collaborative prediction (rankings)
  • Microarray data
  • Text

Example: peer-to-peer interaction data (IBM
downloadGrid content distribution system),
10913 clients, 2746 servers

Variety of data types: real-valued, binary,
nominal, non-negative. A general
dimensionality-reduction framework?
5
Why Dimensionality Reduction?
  • Visualization
  • Noise reduction
  • Interpretability (component analysis)
  • Regularization to prevent overfitting
  • Computational complexity reduction
  • Numerous examples: PCA, ICA, LDA, NMF, ePCA,
    logistic PCA, etc.

Image courtesy of V. D. Calhoun and T. Adali,
"Unmixing fMRI with independent component
analysis," Engineering in Medicine and Biology
Magazine, IEEE, March-April 2006.
6
Why SUPERVISED Dimensionality Reduction (SDR)?
[Figure: PCA vs. LDA projections of the same labeled data]
In the supervised case (data points with class
labels), standard dimensionality reduction such
as PCA may result in poor discriminative
performance, while supervised dimensionality
reduction such as Fisher's Linear Discriminant
Analysis (LDA) may be able to separate the data
perfectly (although one may need to extend LDA
beyond 1-D projections).
7
SDR: A General Framework
  • There is an inherent low-dimensional structure in
    the data that is predictive of the class
  • Both data and labels are high-dimensional
    stochastic functions over that shared structure
    (e.g., for PCA, a linear Gaussian model)
  • We use Generalized Linear Models that assume
    exponential-family noise (Gaussian, Bernoulli,
    multinomial, exponential, etc.)


[Graphical model: hidden components U1, ..., UL
generate both the observed features X1, ..., XD
and the labels Y1, ..., YK]
Our goal: learn a predictor U -> Y simultaneously
with reducing dimensionality (learning X -> U).
Hypothesis: supervised DR works better than
unsupervised DR followed by learning a predictor.
unsupervised DR followed by learning a predictor
8
Related Work: Particular X -> U and U -> Y Mappings
  • 1. F. Pereira and G. Gordon. The Support Vector
    Decomposition Machine, ICML-06.
  • Real-valued X, discrete Y (linear map from X to
    U, SVM for Y(U))
  • 2. E. Xing, A. Ng, M. Jordan, and S. Russell.
    Distance metric learning with application to
    clustering with side information, NIPS-02.
  • 3. K. Weinberger, J. Blitzer and L. Saul.
    Distance Metric Learning for Large Margin Nearest
    Neighbor Classification, NIPS-05.
  • Real-valued X, discrete Y (linear map from X
    to U, nearest-neighbor Y(U))
  • 4. K. Weinberger and G. Tesauro.  Metric
    Learning for Kernel Regression, AISTATS-07.
  • Real-valued X, real-valued Y (linear
    map from X to U, kernel regression Y(U))
  • 5. Sajama and A. Orlitsky. Supervised
    Dimensionality Reduction using Mixture Models,
    ICML-05.
  • Multi-type X (exp.family), discrete Y
    (modeled as mixture of exp-family distributions)
  • 6. M. Collins, S. Dasgupta and R. Schapire. A
    generalization of PCA to the exponential family,
    NIPS-01.
  • 7. A. Schein, L. Saul and L. Ungar. A
    generalized linear model for PCA of binary data,
    AISTATS-03
  • Unsupervised dimensionality reduction
    beyond Gaussian data (nonlinear GLM mappings)

9
Our Contributions
  • General SDR framework that handles
  • mixed data types (continuous and discrete)
  • linear and nonlinear mappings produced by
    Generalized Linear Models
  • multiple label types (classification and
    regression)
  • multitask learning (multiple prediction problems
    at once)
  • semi-supervised learning, arbitrary missing
    labels
  • Simple closed-form iterative update rules
  • no optimization needs to be performed at each
    iteration
  • guaranteed convergence (to a local minimum)
  • Currently available for any combination of
    Gaussian, Bernoulli and multinomial variables
    (i.e., linear, logistic, and multinomial logistic
    regression models)

10
SDR Model: Exponential-Family Distributions
with Low-Dimensional Natural Parameters
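(The equations on this slide are only on the slide image. A
minimal sketch of the model the title and surrounding slides
describe, with U the N x L hidden representation and V, W
loading matrices; the exact notation, bias terms and
regularizers of the original slide are assumptions here.)

  \Theta^X = U V, \qquad \Theta^Y = U W

  P(X_{ij} \mid \theta^X_{ij}) = q_X(X_{ij}) \exp\{ \theta^X_{ij} X_{ij} - G_X(\theta^X_{ij}) \}

  P(Y_{ik} \mid \theta^Y_{ik}) = q_Y(Y_{ik}) \exp\{ \theta^Y_{ik} Y_{ik} - G_Y(\theta^Y_{ik}) \}

That is, both features X and labels Y are exponential-family
distributed, with natural parameters that are linear in the
shared low-dimensional representation U.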
11
Another View GLMs with Shared Hidden Data
12
SDR Optimization Problem
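(The objective itself is not reproduced in the transcript. A
hedged reconstruction, consistent with the model sketch above,
where alpha is the weight on the data-reconstruction loss
discussed on slide 21:)

  \min_{U, V, W} \;\; \alpha \, L_X(X, \, U V) \; + \; L_Y(Y, \, U W)

Here L_X and L_Y denote the exponential-family negative
log-likelihoods (Bregman divergences) of the data and label
models; any ridge-type regularization of U, V, W on the
original slide is omitted from this sketch.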
13
SDR: Alternating Minimization Algorithm
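(The update rules themselves are on the slide image. As an
illustration of the alternating structure, here is a minimal
Python sketch for the all-Gaussian case, where each half-step
is an ordinary least-squares problem with a closed-form
solution; it is not the paper's exact algorithm, and the
Bernoulli/multinomial cases instead minimize the
auxiliary-function bounds introduced on the next slides.)

  import numpy as np

  def sdr_gaussian(X, Y, L, alpha=0.1, n_iters=100, seed=0):
      # Toy alternating minimization of alpha*||X - UV||^2 + ||Y - UW||^2.
      N, D = X.shape
      _, K = Y.shape
      U = np.random.default_rng(seed).normal(size=(N, L))
      for _ in range(n_iters):
          # With U fixed, V and W solve independent least-squares problems.
          V = np.linalg.lstsq(U, X, rcond=None)[0]      # L x D
          W = np.linalg.lstsq(U, Y, rcond=None)[0]      # L x K
          # With V, W fixed, each row of U solves a stacked least-squares problem.
          A = np.vstack([np.sqrt(alpha) * V.T, W.T])    # (D+K) x L
          B = np.hstack([np.sqrt(alpha) * X, Y])        # N x (D+K)
          U = np.linalg.lstsq(A, B.T, rcond=None)[0].T  # N x L
      return U, V, W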
14
Optimization via Auxiliary Functions
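(The slide body is an image; for reference, the standard
majorize-minimize argument the method relies on, which is not
specific to this paper: Q(theta, theta') is an auxiliary
function for a loss f if

  f(\theta) \le Q(\theta, \theta') \;\; \text{for all } \theta, \theta', \qquad f(\theta) = Q(\theta, \theta).

Setting theta_{t+1} = argmin_theta Q(theta, theta_t) then gives
f(theta_{t+1}) <= Q(theta_{t+1}, theta_t) <= Q(theta_t, theta_t)
= f(theta_t), so the loss never increases; this is the
convergence guarantee claimed on slide 9.)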
15
Auxiliary Functions for SDR-GLM
Gaussian log-likelihood: the auxiliary function
coincides with the objective function
Bernoulli log-likelihood: use the bound on the
logistic function from Jaakkola and Jordan (1997)
Multinomial log-likelihood: a similar bound
recently derived by Bouchard (2007)
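(For reference, the Jaakkola-Jordan (1997) bound cited above,
quoted from the variational-bounds literature since the slide
equations are not in the transcript: for any xi,

  \log \sigma(x) \;\ge\; \log \sigma(\xi) + \frac{x - \xi}{2} - \lambda(\xi)\,(x^2 - \xi^2),
  \qquad \lambda(\xi) = \frac{1}{4\xi} \tanh\!\left(\frac{\xi}{2}\right).

The bound is quadratic in x, so a Bernoulli log-likelihood
bounded this way admits the same closed-form, least-squares-like
updates as the Gaussian case.)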
16
Key Idea: Combining Auxiliary Functions
Stack together the (known) auxiliary functions for
X and Y
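(Spelled out under the reconstruction above, with Q_X and Q_Y
the per-term auxiliary functions and superscript t the current
iterate; a sketch, not the slide's exact expression:

  \alpha\, L_X(X, U V) + L_Y(Y, U W) \;\le\; \alpha\, Q_X(U, V \mid U^t, V^t) + Q_Y(U, W \mid U^t, W^t).

Because each Q is at worst quadratic in the unknowns, the
stacked bound is too, so every alternating step of the
algorithm on slide 13 has a closed-form minimizer.)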
17
Derivation of Closed-Form Update Rules
18
Empirical Evaluation on Simulated Data
  • Generate a separable 2-D dataset U (N x L, L = 2)
  • Assume unit basis vectors (rows of the 2 x D
    matrix V)
  • Compute natural parameters
  • Generate exponential-family D-dimensional data X
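(A minimal sketch of this generating process for the
Bernoulli-noise case; the dimensions, the blob-shaped separable
U, and the random choice of V below are illustrative
assumptions, not the paper's exact settings.)

  import numpy as np

  rng = np.random.default_rng(0)
  N, L, D = 200, 2, 50

  # Separable 2-D latent data U (N x L): two well-separated blobs.
  labels = rng.integers(0, 2, size=N)
  U = rng.normal(size=(N, L)) + 3.0 * labels[:, None]

  # Unit-norm basis vectors as rows of the L x D matrix V.
  V = rng.normal(size=(L, D))
  V /= np.linalg.norm(V, axis=1, keepdims=True)

  # Natural parameters, then exponential-family (Bernoulli) data X.
  Theta = U @ V                          # N x D natural parameters
  P = 1.0 / (1.0 + np.exp(-Theta))       # logistic link
  X = (rng.uniform(size=(N, D)) < P).astype(float)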

19
Bernoulli Noise
  • SDR greatly outperforms unsupervised DR (logistic
    PCA) followed by SVM or logistic regression
  • Using the proper data model (e.g., Bernoulli-SDR for
    binary data) really matters
  • SDR recovers the structure (0% error), SVM does
    not (20% error)

20
Gaussian Noise
  • SDR greatly outperforms unsupervised DR (linear
    PCA) followed by SVM or logistic regression
  • For Gaussian data, SDR and SVM are comparable
  • Both SDR and SVM outperform SVDM (quadratic
    loss vs. hinge loss)

21
Regularization Parameter (Weight on Data
Reconstruction Loss)
  • Empirical trend: lowest errors are achieved for
    relatively low values (0.1 and less)
  • Putting too much weight on the reconstruction loss
    immediately worsens the performance
  • In all experiments, we use cross-validation to
    choose the best regularization parameter

22
Real-life Data: Sensor Network Connectivity
Binary data, classification task: 41 light
sensors (nodes), 41 x 41 connectivity matrix.
Given 40 columns, predict the 41st (N = 41,
D = 40, K = 1). Latent dimensionality L = 2, 4,
6, 8, 10.
  • Bernoulli-SDR uses only 10 out of 41 dimensions
    to achieve 12% error (vs. 17% by SVM)
  • Unsupervised DR followed by a classifier is
    inferior to supervised DR
  • Again, the appropriate data model (Bernoulli)
    outperforms a less appropriate one (Gaussian)

23
Real-life Data: Mental State Prediction from
fMRI
Real-valued data, classification task: predict
the type of word (tools or buildings) the subject
is seeing. 84 samples (words presented to a
subject), 14043 dimensions (voxels). Latent
dimensionality L = 5, 10, 15, 20, 25.
  • Gaussian-SDR achieves the overall best performance
  • SDR matches SVM's performance using only 5
    dimensions, while SVDM needs 15
  • SDR greatly outperforms unsupervised DR followed
    by learning a classifier
  • Optimal regularization parameter: 0.0001 for SDR,
    0.001 for SVDM

24
Real-life Data: PBAIC 2007 fMRI Dataset
Real-valued data, regression task
  • 3 subjects playing a virtual-reality videogame,
    3 times (runs) each
  • fMRI data: 704 time points (TRs) and 30,000
    voxels
  • Runs rated on 24 continuous measurements (target
    variables to predict)
  • Annoyance, Anxiety, Presence of a danger (barking
    dog), Looking at a face, Hearing Instructions,
    etc.
  • Goal: train on two runs, predict target variables
    given fMRI for the 3rd run (e.g., is the subject
    listening to Instructions?)

SDR-regression was competitive with the
state-of-the-art Elastic Net sparse regression for
predicting Instructions, and outperformed EN
in the low-dimensional regime
25
Conclusions
  • General SDR-GLM framework that handles
  • mixed data types (continuous and discrete)
  • linear and nonlinear mappings produced by
    Generalized Linear Models
  • multiple label types (classification and
    regression)
  • multitask learning (multiple prediction problems
    at once)
  • semi-supervised learning, arbitrary missing
    labels
  • Our model: features and labels are GLMs over shared
    hidden low-dimensional data
  • i.e., exponential-family distributions with
    low-dimensional natural parameters
  • Simple closed-form iterative algorithm
  • uses auxiliary functions (bounds) with
    easily-computed derivatives
  • short Matlab code, no need for optimization
    packages
  • runs fast (e.g., compared with the SVDM version
    based on SeDuMi)
  • Future work
  • Is there a general way to derive bounds (aux.
    functions) for arbitrary exp-family
    log-likelihoods (beyond Gaussian, Bernoulli and
    multinomial)?
  • Combining various other DR methods (e.g., NMF)
    that also allow for closed-form updates
  • Evaluating SDR-GLM on multi-task and
    semi-supervised problems (e.g., PBAIC)

26
Exponential-Family Distributions
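(The table of distributions is on the slide image; for
reference, the general exponential-family form assumed
throughout, in standard notation:

  P(x \mid \theta) = q(x)\, \exp\{ \theta x - G(\theta) \},

where theta is the natural parameter and G the log-partition
function. For the three members used in the paper,
G(theta) = theta^2 / 2 for the unit-variance Gaussian,
log(1 + e^theta) for the Bernoulli, and log sum_j e^{theta_j}
for the multinomial.)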
27
More Results on Real-life Data: Classification
Tasks
Internet connectivity (PlanetLab) data (binary)
Mass-spectrometry (protein expression levels)
28
Principal Component Analysis (PCA)
PCA finds an orthogonal linear transformation
that maps the data to a new coordinate system
such that the greatest variance by any projection
of the data comes to lie on the first coordinate
(called the first principal component), the
second greatest variance on the second
coordinate, and so on. PCA is theoretically the
optimal transform for given data in the
least-squares sense.
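(As a concrete illustration of the description above, a
minimal PCA via the singular value decomposition; this is
standard practice, not code from the original slides.)

  import numpy as np

  def pca(X, L):
      # Project N x D data X onto its top-L principal components.
      Xc = X - X.mean(axis=0)                      # center the data
      _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
      components = Vt[:L]                          # L x D orthonormal directions
      return Xc @ components.T, components         # N x L scores, loadings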
29
Probabilistic View of PCA
Traditional view of PCA: a least-squares error
minimization.
When the (Gaussian noise) variance is fixed, this
is equivalent to likelihood maximization with a
Gaussian model.
Thus, traditional PCA effectively assumes a
Gaussian distribution of the data: maximizing the
Gaussian likelihood is equivalent to finding
minimum Euclidean-distance projections.
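(Spelled out, a standard identity with the noise variance
sigma^2 held fixed:

  -\log P(X \mid U V) \;=\; \frac{1}{2\sigma^2}\, \| X - U V \|_F^2 + \text{const},

so minimizing the squared reconstruction error is exactly
maximizing a Gaussian likelihood with mean parameters U V,
which is what the next slide generalizes to other
exponential-family noise models.)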
30
Generalization of PCA to Exponential-Family Noise