Title: Prediction using Side Information
1. Prediction using Side Information
- Bin Yu
- Department of Statistics, UC Berkeley
- Joint work with Peng Zhao, Guilherme Rocha, and Vince Vu
2. Outline
- Motivation
- Background
- Penalization methods (building in side information through the penalty)
- L1 penalty (sparsity as the side information)
- Group and hierarchy as side information
- Composite Absolute Penalty (CAP)
- Building blocks: Lγ-norm regularization
- Definition
- Interpretation
- Algorithms
- Examples and results
- Unlabeled data as side information: semi-supervised learning
- Motivating example: image-fMRI problem in neuroscience
- Penalty based on the population covariance matrix
- Theoretical result to compare with OLS
- Experimental results on image-fMRI data
3. Characteristics of Modern Data Set Problems
- Goal: efficient use of data for
- Prediction
- Interpretation
- Larger number of variables
- Number of variables (p) in data sets is large
- Sample sizes (n) have not increased at the same pace
- Scientific opportunities
- New findings in different scientific fields
4. Regression and classification
- Data
- Example: image-fMRI problem
- Predictor: 11,000 features of an image
- Response: (preprocessed) fMRI signal at a voxel
- n = 1750 samples
- Minimization of an empirical loss (e.g. L2) leads to
- an ill-posed computational problem, and
- bad prediction
5. Regularization improves prediction
- Penalization -- linked to computation
- L2 (numerical stability: ridge, SVM)
- Model selection (sparsity, combinatorial search)
- L1 (sparsity, convex optimization)
- Early stopping: the tuning parameter is computational
- Neural nets
- Boosting
- Hierarchical modeling (computational considerations)
6. Lasso: L1 norm as a penalty
- The L1 penalty is defined for coefficients β
- Used initially with the L2 loss
- Signal processing: Basis Pursuit (Chen & Donoho, 1994)
- Statistics: Non-Negative Garrote (Breiman, 1995)
- Statistics: LASSO (Tibshirani, 1996)
- Properties of the Lasso
- Sparsity (variable selection)
- Convexity (convex relaxation of the L0 penalty)
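A minimal sketch (not from the slides) of the sparsity property: fitting the Lasso with scikit-learn zeroes out most coefficients when only a few predictors truly matter. The data and the value of the tuning parameter `alpha` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]          # only 3 truly active predictors
y = X @ beta + 0.1 * rng.standard_normal(n)

fit = Lasso(alpha=0.1).fit(X, y)     # alpha plays the role of the tuning parameter
print(np.sum(fit.coef_ != 0))        # far fewer than p nonzero coefficients
```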
7. Lasso: L1 norm as a penalty
- Computation
- The right tuning parameter is unknown, so the whole path is needed (discretized or continuous)
- Initially a quadratic program: a QP is called for each λ on a grid
- Later, path-following algorithms:
- homotopy by Osborne et al. (2000)
- LARS by Efron et al. (2004)
- Theoretical studies: much recent work on the Lasso
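A sketch of path following with the LARS algorithm (Efron et al., 2004), using scikit-learn's `lars_path`; the data are illustrative. The full piecewise-linear path is returned at once rather than re-solving a QP per grid point.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(50)

# alphas: breakpoints of the path; coefs: one column of coefficients per breakpoint
alphas, _, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)   # (p, number of path breakpoints)
```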
8. General Penalization Methods
- Given data
- Xi, a p-dimensional predictor
- Yi, a response variable
- The parameters β are defined by the penalized problem
- where
- L(Y, Xβ) is the empirical loss function
- T(β) is a penalty function
- λ is a tuning parameter
9. Beyond Sparsity of Individual Predictors: Natural Structures among Predictors
- Rationale: side information might be available, and/or additional regularization is needed beyond the Lasso for p >> n
- Groups
- Genes belonging to the same pathway
- Categorical variables represented by dummies
- Polynomial terms from the same variable
- Noisy measurements of the same variable
- Hierarchy
- Multi-resolution/wavelet models
- Interaction terms in factorial analysis (ANOVA)
- Order selection in Markov chain models
10. Composite Absolute Penalties (CAP): Overview
- The CAP family of penalties
- Highly customizable
- ability to perform grouped selection
- ability to perform hierarchical selection
- Computational considerations
- Feasibility: convexity
- Efficiency: piecewise linearity in some cases
- Define groups according to structure
- Combine properties of Lγ-norm penalties
- Encompass and go beyond existing works
- Elastic Net (Zou & Hastie, 2005)
- GLASSO (Yuan & Lin, 2006)
- Blockwise Sparse Regression (Kim, Kim & Kim, 2006)
11. Composite Absolute Penalties: Review of Lγ Regularization
- Given data and a loss function
- Lγ regularization
- Penalty: T(β) = Σj |βj|^γ
- Estimate: β̂(λ) = argmin L(Y, Xβ) + λ T(β)
- where λ > 0 is a tuning parameter
- For the squared error loss function:
- Hoerl & Kennard (1970): Ridge (γ = 2)
- Frank & Friedman (1993): Bridge (general γ)
- LASSO (1996) (γ = 1)
- SCAD (Fan and Li, 1999) (γ < 1)
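The Lγ penalty Σj |βj|^γ can be sketched in a few lines of NumPy; the coefficient vector is illustrative. γ = 2 gives the ridge penalty and γ = 1 the lasso penalty.

```python
import numpy as np

def lgamma_penalty(beta, gamma):
    """The L_gamma penalty sum_j |beta_j|^gamma."""
    return np.sum(np.abs(beta) ** gamma)

beta = np.array([1.0, -2.0, 0.5])
print(lgamma_penalty(beta, 2))  # ridge: 1 + 4 + 0.25 = 5.25
print(lgamma_penalty(beta, 1))  # lasso: 1 + 2 + 0.5 = 3.5
```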
12. Composite Absolute Penalties: Definition
- The CAP parameter estimate is given by β̂(λ) = argmin L(Y, Xβ) + λ T(β)
- Gk, k = 1, …, K: indices of the k-th pre-defined group
- βGk: the corresponding vector of coefficients
- γk: group Lγk norm, Nk = ||βGk||γk
- γ0: overall norm, T(β) = ||(N1, …, NK)||γ0
- Groups may overlap (hierarchical selection)
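A sketch of the CAP penalty in NumPy: within-group Lγk norms Nk combined by an overall Lγ0 norm. The groups and norm choices below are illustrative, not from the slides.

```python
import numpy as np

def cap_penalty(beta, groups, gammas, gamma0):
    """groups: list of index arrays G_k (may overlap); gammas: per-group norm orders."""
    # N_k = ||beta_{G_k}||_{gamma_k}, then combine with the overall L_{gamma_0} norm
    N = np.array([np.linalg.norm(beta[g], ord=gk) for g, gk in zip(groups, gammas)])
    return np.linalg.norm(N, ord=gamma0)

beta = np.array([3.0, 4.0, 1.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3])]   # two non-overlapping groups
print(cap_penalty(beta, groups, [2, 2], 1))     # L1 of the L2 group norms: 5 + 1 = 6
```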
13. Composite Absolute Penalties: A Bayesian Interpretation
- For non-overlapping groups:
- a prior on the group norms
- a prior on the individual coefficients
14. Composite Absolute Penalties: Group Selection
- Tailoring T(β) for group selection
- Define non-overlapping groups
- Set γk > 1 for all k ≠ 0
- The group norm γk tunes similarity within its group
- γk > 1 causes all variables in group k to be included/excluded together
- Set γ0 = 1
- This yields grouped sparsity
- γk = 2 has been studied by Yuan and Lin (Group Lasso, 2005)
15. Composite Absolute Penalties: Hierarchical Structures
- Tailoring T(β) for hierarchical structure
- Set γ0 = 1
- Set γi > 1 for all i
- Groups overlap
- If β2 appears in all groups where β1 is included,
- then X2 enters the model after X1
- As an example:
16. Composite Absolute Penalties: Hierarchical Structures
- Represent the hierarchy by a directed graph
- Then construct the penalty group by group
- For the graph above: γ0 = 1, γr > 1
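One way to build the overlapping groups from a directed graph is to take, for each node, the node together with all of its descendants, so that a child can enter the model only after its parent. The three-variable graph below is illustrative, not the one from the slide.

```python
# Illustrative hierarchy: X0 is the parent of X1 and X2
children = {0: [1, 2], 1: [], 2: []}

def descendants(node):
    """Return node plus all of its descendants in the graph."""
    out = [node]
    for c in children[node]:
        out += descendants(c)
    return out

# One overlapping CAP group per node
groups = {r: sorted(descendants(r)) for r in children}
print(groups)   # {0: [0, 1, 2], 1: [1], 2: [2]}
```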
17. Composite Absolute Penalties: Computation
- CAP with general Lγ norms
- Approximate algorithms available for tracing the regularization path
- Two examples:
- Rosset (2004)
- Boosted Lasso (Zhao and Yu, 2004): BLASSO
- CAP with L1/L∞ norms
- Exact algorithms for tracing the regularization path
- Some applications:
- Grouped selection: iCAP
- Hierarchical selection: hiCAP for ANOVA and wavelets
18. iCAP: Degrees of Freedom (DF) for Tuning Parameter Selection
- Two ways of selecting the tuning parameter in iCAP:
- 1. Cross-validation
- 2. Model selection criterion AICc,
- where the DF used is a generalization of Zou et al. (2004)'s df for the Lasso to iCAP.
19. Simulation Studies (p > n, partially adaptive grouping): Summary of Results
- Good prediction accuracy
- Extra structure results in a non-trivial reduction of model error
- Sparsity/parsimony
- Less sparse models in the L0 sense
- Sparser in terms of degrees of freedom
- Estimated degrees of freedom (group and iCAP only)
- Good choices of the regularization parameter
- AICc model errors close to CV
20. ANOVA Hierarchical Selection: Simulation Setup
- 55 variables (10 main effects, 45 interactions)
- 121 observations
- 200 replications in the results that follow
21. ANOVA Hierarchical Selection: Model Error
22. Summary on CAP: Group and Hierarchical Sparsity
- CAP penalties
- are built from Lγ blocks
- allow incorporation of different structures into the fitted model
- groups of variables
- hierarchy among predictors
- Algorithms
- approximation using BLASSO for general CAP penalties
- exact and efficient for particular cases (L2 loss, L1 and L∞ norms)
- Choice of the regularization parameter λ
- cross-validation
- AICc for particular cases (L2 loss, L1 and L∞ norms)
23. Regularization Using Unlabeled Data: Semi-supervised Learning
Motivating example: the image-fMRI problem in neuroscience (Gallant Lab at UC Berkeley)
Goal: to understand how natural images relate to fMRI signals
24. Stimuli
25. Stimulus to Response
- Natural image stimuli drawn randomly from a database of 11,499 images
- Experiment designed so that responses from different presentations are nearly independent
- Response is pre-processed and roughly Gaussian
26. Gabor Wavelet Pyramid
27. Features
28. Linear Model
- Separate linear model for each voxel:
- Y = Xb + e
- Model fitting
- X: p = 10,921 dimensions (features)
- n = 1750 training samples
- Fitted model tested on 120 validation samples
- Performance measured by correlation
29. Ordinary Least Squares (OLS)
- Minimize the empirical squared error risk
- Notice that the OLS estimate is a function of estimates of the covariance of X (Sxx) and the covariance of X with Y (Sxy): β̂OLS = Sxx⁻¹ Sxy
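The covariance form of OLS can be checked numerically: solving Sxx β = Sxy with the sample covariances recovers the direct least-squares solution. The simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.5, 0.0, -1.0, 2.0]) + 0.1 * rng.standard_normal(n)

Sxx = X.T @ X / n                 # sample covariance of X
Sxy = X.T @ y / n                 # sample covariance of X with Y
beta_cov = np.linalg.solve(Sxx, Sxy)
beta_lsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_cov, beta_lsq))   # True
```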
30. OLS
- The sample covariance matrix of X is often nearly singular, so inversion is ill-posed.
- Some existing solutions:
- Ridge regression
- Pseudo-inverse (or truncated SVD)
- Lasso (closely related to L2 boosting, the current method at the Gallant Lab)
31. Semi-supervised
- Abundant unlabeled data available
- samples from the marginal distribution of X
- Book on semi-supervised learning (2006) (eds. Chapelle, Scholkopf, and Zien)
- Statistical Science article (2007) (Liang, Mukherjee and West)
- Image-fMRI: images in the database are unlabeled data
- Semi-supervised linear regression
- Use
- labeled data (Xi, Yi), i = 1, …, n, and
- unlabeled data Xi, i = n+1, …, n+m, to fit
32. Semi-supervised
- Does the marginal distribution of X play a role?
- For a fixed design X, the marginal distribution of X plays no role.
- Brown (1990) shows that the OLS estimate of the intercept is inadmissible if X is assumed random.
33. Refining OLS
- The unknown parameter satisfies β = Σxx⁻¹ Σxy
- So OLS can be seen as a plug-in estimate for this equation
- Can we plug in an improved estimate of Σxx?
34. A First Approach
- Suppose the population covariance of X is known
- (an infinite amount of unlabeled data)
- Use a linear combination of the sample and population covariances.
- Ledoit and Wolf (2004) considered convex combinations of the sample covariance and another matrix from a parametric model
35. Semi-supervised OLS
Plugging in the improved estimate of the covariance of X, we get semi-OLS
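A hedged sketch of the plug-in idea: replace the nearly singular sample covariance by a convex combination with the (assumed known) population covariance before solving. The mixing weight `alpha` and the identity population covariance are assumptions for illustration; the slides do not spell out the exact combination.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 20                       # n close to p: sample covariance nearly singular
X = rng.standard_normal((n, p))
y = X @ np.r_[np.ones(5), np.zeros(p - 5)] + 0.5 * rng.standard_normal(n)

Sxx = X.T @ X / n                   # sample covariance
Sigma = np.eye(p)                   # population covariance (assumed known here)
alpha = 0.5                         # illustrative mixing weight

# semi-OLS: plug the combined covariance into beta = Sxx^{-1} Sxy
beta_semi = np.linalg.solve((1 - alpha) * Sxx + alpha * Sigma, X.T @ y / n)
print(beta_semi.shape)              # (20,)
```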
36. Semi-supervised OLS
- Equivalent to penalized least squares
- Equivalent to ridge regression in pre-whitened covariates
37. Spectrally Semi-supervised OLS
- Ridge regression in (W, Y) is just a transformation of the spectrum, where W has a spectral decomposition
- More generally, one can consider arbitrary transformations of the spectrum of W
- Resulting estimator: apply a scalar function h to the eigenvalues
38. Spectrally Semi-supervised OLS
- Examples
- OLS:
- h(s) = 1/s
- Semi-OLS (ridge on pre-whitened predictors):
- h(s) = 1/(s + a)
- Truncated SVD on pre-whitened predictors (PCA regression):
- h(s) = 1/s if s > c, otherwise 0
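The spectral family can be sketched as follows: apply a scalar function h to the eigenvalues of the sample covariance of the predictors. With h(s) = 1/s this reduces to OLS, which the snippet verifies; the pre-whitening step is omitted for simplicity, and the data are illustrative.

```python
import numpy as np

def spectral_estimator(X, y, h):
    """Apply h to the spectrum of the sample covariance and form U h(s) U' Sxy."""
    n = X.shape[0]
    S = X.T @ X / n                       # sample covariance of the predictors
    s, U = np.linalg.eigh(S)              # spectral decomposition S = U diag(s) U'
    return U @ (h(s) * (U.T @ (X.T @ y / n)))

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.0]) + 0.1 * rng.standard_normal(100)

b_ols = spectral_estimator(X, y, lambda s: 1.0 / s)   # h(s) = 1/s recovers OLS
print(np.allclose(b_ols, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```

Ridge on the same data is just `lambda s: 1.0 / (s + a)` for some a > 0, and truncated SVD zeroes h below a cutoff.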
39. Large n, p Asymptotic MSPE
- Assumptions
- Σ non-degenerate
- Z = X Σ^(-1/2) is n-by-p with IID entries satisfying
- mean 0, variance 1
- finite 4th moment
- h is a bounded function
- βᵀ Σxx β / σ² has a finite limit SNR² as p, n tend to ∞
- p/n has a finite, strictly positive limit r
40. Large n, p MSPE
- Theorem
- The mean squared prediction error satisfies
- where Fr is the Marchenko-Pastur law with index r
41. Consequences
- Asymptotically optimal h
- Asymptotically better than OLS and truncated SVD
- Reminiscent of the shrinkage factor in the James-Stein estimate
- SNR might be easily estimated
42. Back to the Image-fMRI Problem
- Fitting details
- Regularization parameter selected by 5-fold cross-validation
- L2 boosting applied to all 10,000 features
- (L2 boosting is the method of choice in the Gallant Lab)
- Other methods applied to 500 features pre-selected by correlation
43. Other Methods
- k = 1: semi-OLS (theoretically better than OLS)
- k = 0: ridge
- k = -1: semi-OLS (inverse)
44. Validation Correlation
voxel  OLS    L2 boost  Semi-OLS  Ridge  Semi-OLS inverse (k = -1)
1 0.434 0.738 0.572 0.775 0.773
2 0.228 0.731 0.518 0.725 0.731
3 0.320 0.754 0.607 0.741 0.754
4 0.303 0.764 0.549 0.742 0.737
5 0.357 0.742 0.544 0.763 0.763
6 0.325 0.696 0.520 0.726 0.734
7 0.427 0.739 0.579 0.728 0.736
8 0.512 0.741 0.608 0.747 0.743
9 0.247 0.746 0.571 0.754 0.753
10 0.458 0.751 0.606 0.761 0.758
11 0.288 0.778 0.610 0.801 0.801
45. Features Used by L2 Boosting
46. 500 Features Used by Semi-methods
47. Comparison of the Feature Locations (panels: semi-methods, L2 boosting)
48. Further Work
- Image-fMRI problem based on a linear model
- Compare methods for other voxels
- Use fewer features for the semi-methods?
- (average number of features for L2 boosting: 120; features for semi-methods: 500, by design)
- Interpretation of the results of different methods
- Theoretical results for ridge and semi inverse OLS?
- Image-fMRI problem: non-linear modeling
- understanding the image space (clusters? manifolds?)
- different linear models on different clusters (manifolds)?
- non-linear models on different clusters (manifolds)?
49.
- CAP Codes: www.stat.berkeley.edu/yugroup
- Paper: www.stat.berkeley.edu/binyu
- to appear in the Annals of Statistics
- Thanks: Gallant Lab at UC Berkeley
50. Proof Ingredients
- One can show that the MSPE decomposes into a BIAS and a VARIANCE term
- Results from random matrix theory can be applied:
- the BIAS term is a quadratic form in the sample covariance matrix
- the VARIANCE term is an integral w.r.t. the empirical spectral distribution of the sample covariance matrix
51. Proof Ingredients
- For bounded g and a unit p-vector v:
- 1.
- 2.
- (1) was shown in (Silverstein 1989) and strengthened in (Bai, Miao, and Pan 2007)
- (2) is the Marchenko-Pastur result; the strongest version is in (Bai and Silverstein 1995)
52. Grouping Examples, Case 1: Settings
- Goals
- Comparison of different group norms
- Comparison of CV against AICc
- Y = Xβ + ε
- Settings
- (Figure: β coefficient profile)
53. Grouping Example, Case 1: LASSO vs. iCAP Sample Paths
(Figure: LASSO path and iCAP path; normalized coefficients vs. number of steps)
54. Grouping Example, Case 1: Comparison of Norms and Clusterings
(Figure: 10-fold CV model error)
55. ANOVA Hierarchical Selection: Number of Selected Terms