Title: Model Selection via Bilevel Optimization
1. Model Selection via Bilevel Optimization
- Kristin P. Bennett, Jing Hu, Xiaoyun Ji,
- Gautam Kunapuli and Jong-Shi Pang
- Department of Mathematical Sciences
- Rensselaer Polytechnic Institute
- Troy, NY
2. Convex Machine Learning
- Convex optimization approaches to machine learning have been a major obsession of machine learning for the last ten years.
- But are the problems really convex?
3. Outline
- The myth of convex machine learning
- Bilevel Programming Model Selection
- Regression
- Classification
- Extensions to other machine learning tasks
- Discussion
4. Modeler's Choices
- Data
- Function
- Loss/Regularization
These choices define a CONVEX problem: an optimization algorithm takes them and returns the model weights w.
5. Many Hidden Choices
- Data
- Variable Selection
- Scaling
- Feature Construction
- Missing Data
- Outlier removal
- Function Family
- linear, kernel (introduces kernel parameters)
- Optimization model
- loss function
- regularization
- Parameters/Constraints
6. Modeler's Choices (continued)
- Data
- Function
- Loss/Regularization
- Cross-Validation Strategy
- Generalization Error
Once the cross-validation strategy and the generalization error are included, the overall problem is NONCONVEX.
7. How does the modeler make choices?
- Best training set error
- Experience/policy
- Estimate of generalization error
- Cross-validation
- Bounds
- Optimize generalization error estimate
- Fiddle around.
- Grid Search
- Gradient methods
- Bilevel Programming
8. Splitting Data for T-fold CV
9. CV via Grid Search
- For every (C, ε):
- For every validation set, solve the model on the corresponding training set and use that solution to estimate the loss on the validation set.
- Estimate the generalization error for (C, ε).
- Return the best values of (C, ε).
- Make the final model using the best (C, ε).
A sketch of this procedure follows below.
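This is a minimal sketch of the grid-search procedure above, assuming scikit-learn's SVR as the inner training solver and mean absolute deviation as the validation loss; the grid values, fold count, and the helper name grid_search_cv are illustrative, not from the slides.

    # Illustrative grid-search cross-validation over (C, epsilon) for SVR.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import KFold

    def grid_search_cv(X, y, C_grid, eps_grid, T=3):
        folds = list(KFold(n_splits=T, shuffle=True, random_state=0).split(X))
        best_C, best_eps, best_err = None, None, np.inf
        for C in C_grid:
            for eps in eps_grid:
                # Estimate the generalization error for this (C, eps) by T-fold CV.
                errs = []
                for train_idx, val_idx in folds:
                    model = SVR(kernel="linear", C=C, epsilon=eps)
                    model.fit(X[train_idx], y[train_idx])
                    # Mean absolute deviation on the held-out fold.
                    errs.append(np.mean(np.abs(model.predict(X[val_idx]) - y[val_idx])))
                if np.mean(errs) < best_err:
                    best_C, best_eps, best_err = C, eps, np.mean(errs)
        # Final model on all training data with the selected parameters.
        return SVR(kernel="linear", C=best_C, epsilon=best_eps).fit(X, y)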
10. CV as a Continuous Optimization Problem
- Bilevel program for T folds (a sketch is given below):
- an outer-level validation problem
- T inner-level training problems
- Prior approach (Golub et al., 1979): Generalized Cross-Validation for one parameter in Ridge Regression.
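The bilevel program itself was lost from this slide; a hedged reconstruction for ε-insensitive regression, with assumed notation (Ω^t and Ω̄^t are the validation and training index sets of fold t), is

    \min_{C,\varepsilon,\{w^t,b^t\}} \; \frac{1}{T}\sum_{t=1}^{T}\frac{1}{|\Omega^t|} \sum_{i\in\Omega^t} \bigl| x_i^\top w^t + b^t - y_i \bigr|
    \text{s.t.}\quad (w^t,b^t) \in \arg\min_{w,b} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{j\in\bar\Omega^t} \max\bigl(|x_j^\top w + b - y_j| - \varepsilon,\, 0\bigr), \qquad t = 1,\dots,T.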
11. Benefit: More Design Variables
Add a feature box constraint in the inner-level problems.
12. ε-insensitive Loss Function
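The loss formula on this slide did not survive extraction; the standard ε-insensitive loss it refers to is

    L_\varepsilon\bigl(y, f(x)\bigr) = \max\bigl(|f(x) - y| - \varepsilon,\; 0\bigr),

i.e., errors smaller than ε are ignored and larger errors are penalized linearly.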
13. Inner-level Problem for the t-th Fold
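The formulation here was also lost; a hedged reconstruction of the t-th inner-level training problem, combining the ε-insensitive loss with the feature box constraint of slide 11 (notation assumed: Ω̄^t is the training index set of fold t and w̄ is the box vector), is

    \min_{w^t, b^t} \; \tfrac{1}{2}\|w^t\|_2^2 + C \sum_{j\in\bar\Omega^t} \max\bigl(|x_j^\top w^t + b^t - y_j| - \varepsilon,\, 0\bigr) \quad \text{s.t.} \quad -\bar w \le w^t \le \bar w.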
14. Optimality (KKT) Conditions for Fixed Hyperparameters
15. Key Transformation
- The KKT conditions for the inner-level training problems are necessary and sufficient.
- Replace the lower-level problems by their KKT conditions.
- The problem becomes a Mathematical Program with Equilibrium Constraints (MPEC).
16. Bilevel Problem as MPEC
Replace the T inner-level problems with their corresponding optimality conditions.
17. MPEC to NLP via Inexact Cross-Validation
- Relax the hard equilibrium (complementarity) constraints to soft, inexact constraints, as sketched below.
- tol is a user-defined tolerance.
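The constraints themselves were lost from the slide; the standard relaxation this describes, written for a generic complementarity pair (a, b), is

    0 \le a \;\perp\; b \ge 0 \qquad\Longrightarrow\qquad a \ge 0, \quad b \ge 0, \quad a^\top b \le \mathrm{tol},

so that complementarity need only hold up to the tolerance tol rather than exactly.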
18. Solvers
- Strategy: proof of concept using general-purpose nonlinear solvers from NEOS on the NLP
- FILTER and SNOPT (Sequential Quadratic Programming methods)
- FILTER results are almost always better.
- Many possible alternatives
- Integer Programming
- Branch and Bound
- Lagrangian Relaxations
19. Computational Experiments: Data
- Synthetic
- (5, 10, 15)-dimensional data with Gaussian and Laplacian noise and (3, 7, 10) relevant features.
- NLP with 3-fold CV
- Results: 30 to 90 training points, 1000 test points, 10 trials
- QSAR/Drug Design
- 4 datasets, 600 dimensions reduced to the top 25 principal components. NLP with 5-fold CV
- Results: 40 to 100 training points, the rest used for testing, 20 trials
20. Cross-Validation Methods Compared
- Unconstrained Grid
- Try 3 values each for C and ε
- Constrained Grid
- Try 3 values each for C and ε, and 0 or 1 for each component of the feature box
- Bilevel/FILTER: nonlinear program solved using an off-the-shelf SQP algorithm via NEOS
21. 15-D Data: Objective Value
22. 15-D Data: Computational Time
23. 15-D Data: Test MAD
24. QSAR Data: Objective Value
25. QSAR Data: Computation Time
26. QSAR Data: Test MAD
27. Classification Cross-Validation
- Given sample data from two classes (labeled +1 and -1).
- Find a classification function that minimizes an out-of-sample estimate of the classification error.
28. Lower Level: SVM
- Define parallel planes
- Minimize points on wrong side
- Maximize margin of separation
29. Lower-Level Loss Function: Hinge Loss
Measures the distance by which points violate the appropriate hyperplane constraints.
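The formula on this slide was lost; the standard hinge loss it refers to, for a point (x, y) with y ∈ {+1, -1} and a linear classifier f(x) = x^⊤w + b (sign convention assumed), is

    \ell_{\mathrm{hinge}}\bigl(y, f(x)\bigr) = \max\bigl(1 - y\,f(x),\; 0\bigr),

which is zero exactly when the point lies on the correct side of its supporting hyperplane.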
30. Lower-Level Problem: SVC with Box Constraint
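As in the regression case, the formulation itself is missing; a hedged reconstruction of the box-constrained support vector classification problem for fold t (notation assumed as before) is

    \min_{w^t, b^t} \; \tfrac{1}{2}\|w^t\|_2^2 + C \sum_{j\in\bar\Omega^t} \max\bigl(1 - y_j (x_j^\top w^t + b^t),\; 0\bigr) \quad \text{s.t.} \quad -\bar w \le w^t \le \bar w.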
31. Inner-Level KKT Conditions
32. Outer-Level Loss Functions
- Misclassification Minimization Loss (MM)
- The loss function used in classical CV
- Loss is 1 if the validation point is misclassified and 0 otherwise (computed using the step function)
- Hinge Loss (HL)
- Both the inner and outer levels use the same loss function
- Loss is the distance by which the validation point violates its hyperplane constraint (computed using the max function)
33. Hinge Loss is a Convex Approximation of the Misclassification Minimization Loss
34. Hinge Loss Bilevel Program (BilevelHL)
- Replace the max in the outer-level objective with convex constraints (see the sketch below)
- Replace the inner-level problems with their KKT conditions
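The convex constraints referred to here are presumably the standard epigraph reformulation of the max; a hedged sketch, with assumed slack variables ζ_i for the outer-level validation points, is

    \zeta_i \ge 1 - y_i\bigl(x_i^\top w^t + b^t\bigr), \qquad \zeta_i \ge 0,

with \sum_i \zeta_i minimized in the outer objective, so each ζ_i equals the hinge loss at optimality.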
35. Hinge Loss MPEC
36. Misclassification Minimization Bilevel Program (BilevelMM)
Misclassifications are counted using the step function, defined componentwise for an n-vector as follows.
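The definition itself was lost; the componentwise step function assumed here (the convention used in Mangasarian's work) is

    (r_*)_i = \begin{cases} 1, & r_i > 0, \\ 0, & r_i \le 0, \end{cases} \qquad i = 1, \dots, n.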
37. The Step Function
- Mangasarian (1994) showed that the step function can be characterized as the solution of a linear program,
- and that any solution to that LP recovers the componentwise step function defined above.
38. Misclassifications in the Validation Set
- A validation point is misclassified when the sign of the label times the decision-function value is negative.
- This can be recast for all validation points (within the t-th fold) using the step function.
39. Misclassification Minimization Bilevel Program (revisited)
Outer level: average misclassification minimization
Inner-level problems to determine the misclassified validation points
Inner-level training problems
40. Misclassification Minimization MPEC
41. Inexact Cross-Validation NLP
- Both the BilevelHL and BilevelMM MPECs are transformed into NLPs by relaxing the equilibrium constraints (inexact CV)
- Solved using FILTER on NEOS
- These are compared with classical cross-validation: unconstrained and constrained grid search.
42. Experiments: Data Sets
- 3-fold cross-validation for model selection
- Results averaged over 20 train/test splits
43. Computational Time
44. Training CV Error
45. Testing Error
46. Number of Variables
47. Progress
- Cross-validation is a bilevel problem solvable by continuous optimization methods
- The off-the-shelf NLP algorithm FILTER solved both the classification and regression problems
- Bilevel optimization is extendable to many machine learning problems
48. Extending the Bilevel Approach to Other Machine Learning Problems
- Kernel Classification/Regression
- Variable Selection/Scaling
- Multi-task Learning
- Semi-supervised Learning
- Generative methods
49. Semi-supervised Learning
- Have labeled data and unlabeled data
- Treat the missing labels as design variables in the outer level
- The lower-level problems are still convex
50. Semi-supervised Regression
The outer level minimizes the error on the labeled data to find the optimal parameters and labels.
ε-insensitive loss on the labeled data in the inner level
ε-insensitive loss on the unlabeled data in the inner level
Inner-level regularization
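A hedged sketch of the structure these annotations describe, with assumed notation (L and U index the labeled and unlabeled points, and the missing labels ŷ_u are outer-level design variables), is

    \min_{C,\varepsilon,\{\hat y_u\},w,b} \; \sum_{i\in\mathcal{L}} \bigl| x_i^\top w + b - y_i \bigr|
    \text{s.t.}\quad (w,b) \in \arg\min_{w,b} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i\in\mathcal{L}} \max\bigl(|x_i^\top w + b - y_i| - \varepsilon,\, 0\bigr) + C \sum_{u\in\mathcal{U}} \max\bigl(|x_u^\top w + b - \hat y_u| - \varepsilon,\, 0\bigr).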
51. Discussion
- New capacity offers new possibilities
- Outer-level objectives?
- Inner-level problems?
- classification, ranking, semi-supervised learning,
- missing values, kernel selection, variable selection
- Need special-purpose algorithms for greater efficiency, scalability, and robustness
This work was supported by Office of Naval Research Grant N00014-06-1-0014.
52. Experiments: Bilevel CV Procedure
- Run BilevelMM/BilevelHL to compute the optimal parameters
- Drop descriptors whose feature-box components are small
- Create a model on all training data using the optimal parameters
- Compute the test error on the hold-out set
53. Experiments: Grid Search CV Procedure
- Unconstrained Grid
- Try 6 values for the regularization parameter on a log10 scale
- Constrained Grid
- Try 6 values for the regularization parameter, and 0 or 1 for each component of the feature box (perform RFE if necessary)
- Create a model on all training data using the optimal grid point
- Compute the test error on the hold-out set
54. Extending the Bilevel Approach to Other Machine Learning Problems
- Kernel Classification/Regression
- Different Regularizations (L1, elastic nets)
- Enhanced Feature Selection
- Multi-task Learning
- Semi-supervised Learning
- Generative methods
55. Enhanced Feature Selection
- Assume that at most a given number of descriptors is allowed
- Introduce an outer-level constraint that counts the non-zero elements of the feature vector
- Rewrite the constraint and obtain additional conditions in the MPEC
56. Kernel Bilevel Discussion
- Pros
- Performs model selection in feature space
- Performs feature selection in input space
- Cons
- Highly nonlinear model
- Difficult to solve
57. Kernel Classification (MPEC form)
58. Is it okay to do 3 folds?
59. Applying the Kernel Trick
- Drop the box constraint
- Eliminate the primal weights from the optimality conditions
- Replace inner products with an appropriate kernel (see the sketch below)
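A hedged sketch of the substitution this describes, using standard SVM dual notation (the multipliers α_j are an assumption, not taken from the slide):

    w = \sum_j \alpha_j y_j x_j \quad\Longrightarrow\quad x_i^\top w = \sum_j \alpha_j y_j\, x_i^\top x_j \;\longrightarrow\; \sum_j \alpha_j y_j\, k(x_i, x_j).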
60. Feature Selection with Kernels
- Parameterize the kernel with a scaling vector such that when a component is zero, the corresponding descriptor vanishes from the kernel (sketches below)
- Linear kernel
- Polynomial kernel
- Gaussian kernel
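The parameterized kernels themselves were lost; hedged reconstructions with an assumed nonnegative scaling vector p (p_n = 0 removes descriptor n; the offset c and degree d are also assumptions) are

    k_{\mathrm{lin}}(x, z) = x^\top \mathrm{diag}(p)\, z, \qquad
    k_{\mathrm{poly}}(x, z) = \bigl(x^\top \mathrm{diag}(p)\, z + c\bigr)^d, \qquad
    k_{\mathrm{gauss}}(x, z) = \exp\Bigl(-\textstyle\sum_n p_n (x_n - z_n)^2\Bigr).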
61. Kernel Regression (Bilevel form)