Prediction using Side Information

About This Presentation

Title:

Prediction using Side Information

Description:

Bin Yu, Department of Statistics, UC Berkeley. Joint work with Peng Zhao, Guilherme Rocha, and Vince Vu.

Slides: 57

Provided by: Gang156

Transcript and Presenter's Notes

Title: Prediction using Side Information

1
Prediction using Side Information

Bin Yu
Department of Statistics, UC Berkeley
Joint work with
Peng Zhao, Guilherme Rocha, and
Vince Vu

2
Outline

Motivation
Background
Penalization methods (building in side
information through penalty)
L1-penalty (sparsity as the side information)
Group and hierarchy as side information
Composite Absolute Penalty (CAP)
Building blocks: $L_\gamma$-norm regularization
Definition
Interpretation
Algorithms
Examples and Results
Unlabeled data as side information
semi-supervised learning
Motivating example: image-fMRI problem
in neuroscience
Penalty based on population covariance
matrix
Theoretical result to compare with OLS
Experimental results on image-fMRI data

3
Characteristics of Modern Data Set Problems

Goal: efficient use of data for
Prediction
Interpretation
Larger number of variables
Number of variables (p) in data sets is large
Sample sizes (n) have not increased at same pace
Scientific opportunities
New findings in different scientific fields

4
Regression and classification

Data
Example: image-fMRI problem
Predictor: 11,000 features of an image
Response: (preprocessed) fMRI signal at a voxel
$n = 1750$ samples
Minimization of an empirical loss (e.g. L2)
leads to
an ill-posed computational problem, and
bad prediction

5
Regularization improves prediction

Penalization -- linked to computation
L2 (numerical stability: ridge, SVM)
Model selection (sparsity, combinatorial
search)
L1 (sparsity, convex optimization)
Early stopping: the tuning parameter is computational
Neural nets
Boosting
Hierarchical modeling (computational
considerations)

6
Lasso: L1-norm as a penalty

The L1 penalty is defined for coefficients $\beta$: $T(\beta) = \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$
Used initially with L2 loss
Signal processing: Basis Pursuit (Chen and
Donoho, 1994)
Statistics: Non-Negative Garrote (Breiman, 1995)
Statistics: LASSO (Tibshirani, 1996)
Properties of Lasso
Sparsity (variable selection)
Convexity (convex relaxation of L0-penalty)

7
Lasso: L1-norm as a penalty

Computation
The right tuning parameter is unknown, so the
path is needed
(discretized or continuous)
Initially: a quadratic program (QP) is solved
for each $\lambda$ on a grid
Later: path-following algorithms
homotopy by Osborne et al. (2000)
LARS by Efron et al. (2004)
Theoretical studies: much work recently on
Lasso
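
The discretized path mentioned above can be sketched as follows. This is a minimal illustration, not the QP or homotopy/LARS algorithms named on the slide: it swaps in cyclic coordinate descent with warm starts along a decreasing grid of $\lambda$ values. The function names (`soft_threshold`, `lasso_cd`, `lasso_path`), the $1/(2n)$ scaling of the squared error loss, and the grid spacing are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, beta0=None, n_iter=200):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    col_sq = (X ** 2).sum(axis=0) / n      # per-column curvature x_j'x_j / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of column j with the partial residual (beta_j removed)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)   # keep resid = y - X beta
            beta[j] = new
    return beta

def lasso_path(X, y, n_lambdas=20):
    """Trace the Lasso solution on a decreasing lambda grid with warm starts."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / n  # smallest lambda with an all-zero fit
    lambdas = lam_max * np.logspace(0, -3, n_lambdas)
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:
        beta = lasso_cd(X, y, lam, beta0=beta)
        path.append(beta.copy())
    return lambdas, np.array(path)
```

Each row of the returned array is the coefficient vector at one grid value; a tuning parameter would then be picked along this path, for instance by cross-validation.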

8
General Penalization Methods

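Given data $Z_i = (X_i, Y_i)$, $i = 1, \dots, n$
$X_i$: a p-dimensional predictor
$Y_i$: response variable
The parameters $\beta$ are defined by the penalized
problem
$\hat{\beta}(\lambda) = \arg\min_{\beta} \, L(Z, \beta) + \lambda \, T(\beta)$
where
$L(Z, \beta)$ is the empirical loss function
$T(\beta)$ is a penalty function
$\lambda$ is a tuning parameter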

9
Beyond Sparsity of Individual Predictors: Natural
Structures among predictors

Rationale: side information might be
available and/or
additional regularization is needed beyond
Lasso for $p \gg n$
Groups
Genes belonging to the same pathway
Categorical variables represented by dummies
Polynomial terms from the same variable
Noisy measurements of the same variable.
Hierarchy
Multi-resolution/wavelet models
Interaction terms in factorial analysis (ANOVA)
Order selection in Markov Chain models

10
Composite Absolute Penalties (CAP): Overview

The CAP family of penalties
Highly customizable
ability to perform grouped selection
ability to perform hierarchical selection
Computational considerations
Feasibility: convexity
Efficiency: piecewise linearity in some cases
Define groups according to structure
Combine properties of $L_\gamma$-norm penalties
Encompass and go beyond existing works
Elastic Net (Zou and Hastie, 2005)
GLASSO (Yuan and Lin, 2006)
Blockwise Sparse Regression (Kim, Kim, and Kim,
2006)

11
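Composite Absolute Penalties: Review of $L_\gamma$
Regularization

Given data $Z = \{(X_i, Y_i),\ i = 1, \dots, n\}$ and loss
function $L(Z, \beta)$
$L_\gamma$ regularization
Penalty: $T(\beta) = \|\beta\|_\gamma^\gamma = \sum_{j=1}^{p} |\beta_j|^\gamma$
Estimate: $\hat{\beta}_\gamma(\lambda) = \arg\min_{\beta} \, L(Z, \beta) + \lambda \|\beta\|_\gamma^\gamma$
where $\lambda > 0$ is a tuning parameter
For the squared error loss function
Hoerl and Kennard (1970): Ridge ($\gamma = 2$)
Frank and Friedman (1993): Bridge (general $\gamma$)
LASSO (1996) ($\gamma = 1$)
SCAD (Fan and Li, 1999) ($\gamma < 1$)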

12
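Composite Absolute Penalties: Definition

The CAP parameter estimate is given by
$\hat{\beta}(\lambda) = \arg\min_{\beta} \, L(Z, \beta) + \lambda \, T(\beta)$
$G_k$, $k = 1, \dots, K$: indices of the k-th pre-defined group
$\beta_{G_k}$: corresponding vector of coefficients
$\|\cdot\|_{\gamma_k}$: group $L_{\gamma_k}$ norm, $N_k = \|\beta_{G_k}\|_{\gamma_k}$
$\|\cdot\|_{\gamma_0}$: overall norm, $T(\beta) = \|N\|_{\gamma_0}^{\gamma_0} = \sum_{k=1}^{K} |N_k|^{\gamma_0}$
groups may overlap (hierarchical selection)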
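
The definition above can be made concrete with a short sketch that evaluates a CAP penalty for a given coefficient vector. It assumes the penalty is the sum over groups of the group norm raised to the overall exponent, which is how the garbled formulas on this slide are read here; the function name `cap_penalty` and its argument names are hypothetical.

```python
import numpy as np

def cap_penalty(beta, groups, gammas, gamma0=1.0):
    """Evaluate T(beta) = sum_k ||beta_{G_k}||_{gamma_k} ** gamma0.

    groups : list of index lists (groups may overlap)
    gammas : one norm order per group (np.inf is allowed)
    gamma0 : exponent of the overall norm
    """
    beta = np.asarray(beta, dtype=float)
    group_norms = np.array([
        np.linalg.norm(beta[np.asarray(g)], ord=gk)   # N_k for group k
        for g, gk in zip(groups, gammas)
    ])
    return float(np.sum(group_norms ** gamma0))

# Two non-overlapping groups, one with an L2 norm and one with an L-infinity norm.
beta = [3.0, -4.0, 0.0, 1.0]
value = cap_penalty(beta, groups=[[0, 1], [2, 3]], gammas=[2, np.inf], gamma0=1.0)
```

With every coefficient in its own group and `gamma0 = 1`, the function reduces to the L1 (Lasso) penalty, which is one way to see CAP as a generalization.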

13
Composite Absolute Penalties: A Bayesian
interpretation

For non-overlapping groups
Prior on group norms
Prior on individual coefficients

14
Composite Absolute Penalties: Group selection

Tailoring $T(\beta)$ for group selection
Define non-overlapping groups
Set $\gamma_k > 1$ for all $k \neq 0$
Group norm $\gamma_k$ tunes similarity within its group
$\gamma_k > 1$ causes all variables in group k to be
included/excluded together
Set $\gamma_0 = 1$
This yields grouped sparsity
$\gamma_k = 2$ has been studied by Yuan and Lin (Grouped
Lasso, 2005).
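
A small sketch can show why this choice of norms includes or excludes a whole group at once. It assumes the simplest setting only: an orthonormal design, squared error loss written as $\frac{1}{2}\|z - \beta\|_2^2$, non-overlapping groups, an overall exponent of 1, and L2 group norms. In that case the penalized estimate is a blockwise soft-thresholding of $z$. The name `group_soft_threshold` and the threshold `t` (playing the role of the tuning parameter) are hypothetical.

```python
import numpy as np

def group_soft_threshold(z, groups, t):
    """Minimize 0.5 * ||z - b||^2 + t * sum_k ||b_{G_k}||_2 over b.

    Each group is either shrunk as a block or set to zero as a block.
    """
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    for g in groups:
        idx = np.asarray(g)
        norm = np.linalg.norm(z[idx])
        if norm > t:
            # the whole group survives, shrunk toward zero by a common factor
            out[idx] = (1.0 - t / norm) * z[idx]
        # otherwise the whole group stays at zero
    return out

z = [3.0, 4.0, 0.3, -0.4]
b = group_soft_threshold(z, groups=[[0, 1], [2, 3]], t=1.0)
```

In this example the first group has norm 5 and is kept with every coefficient scaled by 0.8, while the second group has norm 0.5 and is dropped entirely.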

15
Composite Absolute Penalties: Hierarchical
Structures

Tailoring $T(\beta)$ for hierarchical structure
Set $\gamma_0 = 1$
Set $\gamma_i > 1$ for all $i$
Groups overlap
If $\beta_2$ appears in all groups where $\beta_1$ is included,
then $X_2$ enters the model after $X_1$
As an example

16
Composite Absolute Penalties: Hierarchical
Structures

Represent Hierarchy by a directed graph
Then construct penalty by
For the graph above, $\gamma_0 = 1$, $\gamma_r = \infty$

17
Composite Absolute Penalties: Computation

CAP with general $L_\gamma$ norms
Approximate algorithms available for tracing
regularization path
Two examples
Rosset (2004)
Boosted Lasso (BLASSO) (Zhao and Yu, 2004)
CAP with $L_1$ and $L_\infty$ norms
Exact algorithms for tracing regularization path
Some applications
Grouped selection: iCAP
Hierarchical selection: hiCAP for ANOVA and
wavelets

18
iCAP: Degrees of Freedom (DFs) for tuning parameter
selection

Two ways for selecting the tuning parameter in
iCAP
1. Cross-validation
2. Model selection criterion AIC_c
where the DF used is a generalization of the df of Zou et
al. (2004)
for Lasso to iCAP.

19
Simulation Studies ($p > n$) (partially adaptive
grouping): Summary of Results

Good prediction accuracy
Extra structure results in non-trivial reduction
of model error
Sparsity/Parsimony
Less sparse models in l0 sense
Sparser in terms of degrees of freedom
Estimated degrees of freedom (Group, iCAP only)
Good choices for regularization parameter
AICc model errors close to CV

20
ANOVA Hierarchical Selection: Simulation Setup

55 variables (10 main effects, 45 interactions)
121 observations
200 replications in results that follow

21
ANOVA Hierarchical Selection: Model Error
22
Summary on CAP: Group and Hierarchical Sparsity

CAP penalties
Are built from $L_\gamma$ blocks
Allow incorporation of different structures into the
fitted model
Group of variables
Hierarchy among predictors
Algorithms
Approximation using BLASSO for general CAP
penalties
Exact and efficient for particular cases (L2
loss, $L_1$ and $L_\infty$ norms)
Choice of regularization parameter $\lambda$
Cross-validation
AICc for particular cases (L2 loss, $L_1$ and $L_\infty$
norms)

23
Regularization using unlabeled data:
semi-supervised learning
Motivating example: image-fMRI problem in
neuroscience (Gallant Lab at UCB)
Goal: to understand how natural images
relate to fMRI signals
24
Stimuli

Natural image stimuli

25
Stimulus to response

Natural image stimuli drawn randomly from a
database of 11,499 images
Experiment designed so that responses from
different presentations are nearly independent
Response is pre-processed and roughly Gaussian

26
Gabor wavelet pyramid
27
Features
28
Linear model

Separate linear model for each voxel
$Y = X\beta + \varepsilon$
Model fitting
$X$: $p = 10921$ dimensions (features)
$n = 1750$ training samples
Fitted model tested on 120 validation samples
Performance measured by correlation

29
Ordinary Least Squares (OLS)

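Minimize empirical squared error risk:
$\hat{\beta}_{OLS} = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 = \hat{\Sigma}_{xx}^{-1} \hat{\Sigma}_{xy}$
Notice that the OLS estimate is a function of
estimates of the covariance of X ($\Sigma_{xx}$) and the covariance
of X with Y ($\Sigma_{xy}$)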
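
The plug-in view of OLS can be checked numerically with a few lines. The sketch assumes centered data with no intercept and more samples than features, so that the sample covariance is invertible; the function name `ols_plug_in` is hypothetical.

```python
import numpy as np

def ols_plug_in(X, y):
    """OLS written as a function of sample (cross-)covariances.

    Assumes X and y are centered, so no intercept is fitted.
    """
    n = X.shape[0]
    Sxx_hat = X.T @ X / n          # sample covariance of X
    Sxy_hat = X.T @ y / n          # sample covariance of X with y
    return np.linalg.solve(Sxx_hat, Sxy_hat)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
y = rng.standard_normal(40)
beta_plug_in = ols_plug_in(X, y)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # direct least squares fit
```

The two coefficient vectors agree up to rounding. When the number of features approaches or exceeds the number of samples, the solve step becomes ill-conditioned or fails.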

30
OLS

Sample covariance matrix of X is often nearly
singular and so inversion is ill-posed.
Some existing solutions
Ridge regression
Pseudo-inverse (or truncated SVD)
Lasso (closely related to L2boosting -- current
method at Gallant Lab)

31
Semi-supervised

Abundant unlabeled data available
samples from the marginal distribution of X
Book on semi-supervised learning (2006) (eds.
Chapelle, Schölkopf, and Zien)
Statistical Science article (2007) (Liang, Mukherjee,
and West)
Image-fMRI: images in the database are unlabeled
data
Semi-supervised linear regression
Use
labeled data $(X_i, Y_i)$, $i = 1, \dots, n$, and
unlabeled data $X_i$, $i = n+1, \dots, n+m$, to fit

32
Semi-supervised

Does marginal distribution of X play a role?
For fixed design X, the marginal distribution of X plays no
role.
(Brown 1990) shows that the OLS estimate of the
intercept is inadmissible if X is assumed random.

33
Refining OLS

The unknown parameter satisfies
$\Sigma_{xx} \beta = \Sigma_{xy}$
So OLS can be seen as a plug-in estimate for this
equation
Can we plug in an improved estimate of $\Sigma_{xx}$?

34
A first approach

Suppose population covariance of X is known
(infinite amount of unlabeled data)
Use a linear combination of the sample and
population covariances.
(Ledoit and Wolf 2004) considered convex
combinations of sample covariance and another
matrix from a parametric model

35
Semi-supervised OLS
Plugging in the improved estimate of $\Sigma_{xx}$, we get
semi-OLS
36
Semi-supervised OLS

Equivalent to penalized least squares
Equivalent to ridge regression in pre-whitened
covariates
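
The equivalence stated above can be checked numerically. The exact weights of the linear combination did not survive in this transcript, so the sketch assumes the simplest form: the sample covariance plus `a` times the known population covariance `Sigma`, with `a` a hypothetical tuning parameter. The function names are also hypothetical.

```python
import numpy as np

def inv_sqrt_psd(Sigma):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(Sigma)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def semi_ols(X, y, Sigma, a):
    """Plug-in estimate using Sxx_hat + a * Sigma in place of Sxx_hat (assumed form)."""
    n = X.shape[0]
    Sxx_hat = X.T @ X / n
    Sxy_hat = X.T @ y / n
    return np.linalg.solve(Sxx_hat + a * Sigma, Sxy_hat)

def ridge_prewhitened(X, y, Sigma, a):
    """Ridge regression on W = X Sigma^{-1/2}, mapped back to the original scale."""
    n, p = X.shape
    R = inv_sqrt_psd(Sigma)
    W = X @ R                                   # pre-whitened covariates
    theta = np.linalg.solve(W.T @ W / n + a * np.eye(p), W.T @ y / n)
    return R @ theta

rng = np.random.default_rng(0)
p, n = 5, 30
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)                     # a known population covariance
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
y = rng.standard_normal(n)
b_semi = semi_ols(X, y, Sigma, a=0.5)
b_ridge = ridge_prewhitened(X, y, Sigma, a=0.5)
```

Both routes give the same coefficients up to rounding, and both minimize the penalized least squares criterion $\frac{1}{n}\|Y - X\beta\|^2 + a\,\beta^T \Sigma \beta$ under the assumed form.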

37
Spectrally semi-supervised OLS

Ridge regression in (W,Y) is just a
transformation of ?, where W has spectral
decomposition
More generally, can consider arbitrary
transformations of the spectrum of W
Resulting estimator

38
Spectrally semi-supervised OLS

Examples
OLS
$h(s) = 1/s$
Semi-OLS: ridge on pre-whitened predictors
$h(s) = 1/(s + a)$
Truncated SVD on pre-whitened predictors (PCA
regression)
$h(s) = 1/s$ if $s > c$, otherwise 0
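
The three examples above can be put into one sketch. The estimator formula itself is missing from this transcript, so the form used here is an assumption chosen to be consistent with the examples: pre-whiten with a known population covariance `Sigma`, take the eigendecomposition of the sample covariance of the whitened predictors, apply `h` to each eigenvalue, and map back. All function names are hypothetical.

```python
import numpy as np

def spectral_semi_ols(X, y, Sigma, h):
    """Assumed form: Sigma^{-1/2} U h(S) U' (W'y / n), where W = X Sigma^{-1/2}
    and W'W / n = U S U' is the spectral decomposition."""
    n = X.shape[0]
    vals, vecs = np.linalg.eigh(Sigma)
    R = vecs @ np.diag(vals ** -0.5) @ vecs.T        # Sigma^{-1/2}
    W = X @ R                                        # pre-whitened predictors
    s, U = np.linalg.eigh(W.T @ W / n)
    hs = np.array([h(si) for si in s])               # transformed spectrum
    return R @ (U @ (hs * (U.T @ (W.T @ y / n))))

def h_ols(s):
    return 1.0 / s

def h_ridge(a):
    return lambda s: 1.0 / (s + a)

def h_tsvd(c):
    return lambda s: 1.0 / s if s > c else 0.0

rng = np.random.default_rng(0)
p, n = 5, 40
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
y = rng.standard_normal(n)
b_ols = spectral_semi_ols(X, y, Sigma, h_ols)
b_semi = spectral_semi_ols(X, y, Sigma, h_ridge(0.5))
b_tsvd = spectral_semi_ols(X, y, Sigma, h_tsvd(0.5))
```

Under this assumed form, `h_ols` reproduces ordinary least squares and `h_ridge(a)` reproduces the plug-in estimate built from the sample covariance plus `a` times `Sigma`.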

39
Large n,p asymptotic MSPE

Assumptions
$\Sigma$ non-degenerate
$Z = X \Sigma^{-1/2}$ is n-by-p with IID entries
satisfying
mean 0, variance 1
finite 4th moment
h is a bounded function
$\beta^T \Sigma_{xx} \beta / \sigma^2$ has finite limit $\mathrm{SNR}^2$ as $p, n$ tend
to $\infty$
$p/n$ has finite, strictly positive limit $r$

40
Large n,p MSPE

Theorem
The Mean Squared Prediction Error satisfies
where $F_r$ is the Marchenko-Pastur law with index $r$
and

41
Consequences

Asymptotically optimal h
Asymptotically better than OLS and truncated SVD
Reminiscent of shrinkage factor in James-Stein
estimate
SNR might be easily estimated

42
Back to image-fMRI problem

Fitting details
Regularization parameter selected by 5-fold cross
validation
L2 boosting applied to all 10,000 features
-- L2 boosting is the method of choice
in Gallant Lab
Other methods applied to 500 features
pre-selected by correlation

43
Other methods

k = 1: semi-OLS (theoretically better than OLS)
k = 0: ridge
k = -1: semi-OLS (inverse)

44
Validation correlation
voxel   OLS     L2 boost   Semi OLS   Ridge   Semi OLS inverse (k = -1)
1       0.434   0.738      0.572      0.775   0.773
2       0.228   0.731      0.518      0.725   0.731
3       0.320   0.754      0.607      0.741   0.754
4       0.303   0.764      0.549      0.742   0.737
5       0.357   0.742      0.544      0.763   0.763
6       0.325   0.696      0.520      0.726   0.734
7       0.427   0.739      0.579      0.728   0.736
8       0.512   0.741      0.608      0.747   0.743
9       0.247   0.746      0.571      0.754   0.753
10      0.458   0.751      0.606      0.761   0.758
11      0.288   0.778      0.610      0.801   0.801
45
Features used by L2boosting
46
500 features used by semi-methods
47
Comparison of the feature locations
Semi methods
L2boost
48
Further work

Image-fMRI problem based on a linear model
Compare methods for other voxels
Use fewer features for semi-methods?
(average number of features for L2boosting: 120;
features for semi-methods: 500, by design)
Interpretation of the results of different methods
Theoretical results for ridge and semi inverse
OLS?
Image-fMRI problem: non-linear modeling
understanding the image space
(clusters? manifolds?)
different linear models on different clusters (manifolds)?
non-linear models on different clusters (manifolds)?

CAP codes: www.stat.berkeley.edu/yugroup
Paper: www.stat.berkeley.edu/binyu
(to appear in Annals of
Statistics)
Thanks: Gallant Lab at UC Berkeley

50
Proof Ingredients

Can show that MSPE decomposes as
Results in random matrix theory can be applied
BIAS term is a quadratic form in sample
covariance matrix
VARIANCE term is an integral wrt empirical
spectral distribution of sample covariance matrix

51
Proof Ingredients

For bounded g and unit p-vector v
1.
2.
(1) shown in (Silverstein 1989) and strengthened
in (Bai, Miao, and Pan 2007)
(2) is Marchenko-Pastur result, strongest version
in (Bai and Silverstein 1995)

52
Grouping examples: Case 1 Settings

Goals
Comparison of different group norms
Comparison of CV against AICC
$Y = X\beta + \sigma\varepsilon$
Settings

$\beta$: Coefficient Profile
53
Grouping example: Case 1 LASSO vs. iCAP sample
paths
[Figure: LASSO path and iCAP path; x-axis: number of steps; y-axis: normalized coefficients]
54
Grouping example: Case 1 Comparison of norms and
clusterings
10 fold CV Model error
55
ANOVA Hierarchical Selection: Number of Selected
Terms