1
Model Selection via Bilevel Optimization
  • Kristin P. Bennett, Jing Hu, Xiaoyun Ji,
  • Gautam Kunapuli and Jong-Shi Pang
  • Department of Mathematical Sciences
  • Rensselaer Polytechnic Institute
  • Troy, NY

2
Convex Machine Learning
  • Convex optimization approaches to machine
    learning have been a major obsession of the
    field for the last ten years.
  • But are the problems really convex?

3
Outline
  • The myth of convex machine learning
  • Bilevel Programming Model Selection
  • Regression
  • Classification
  • Extensions to other machine learning tasks
  • Discussion

4
Modeler's Choices
  • Data
  • Function
  • Loss/Regularization

[Diagram: Data, Function, and Loss/Regularization feed
a CONVEX optimization problem; the optimization
algorithm returns the model weights w.]
5
Many Hidden Choices
  • Data
    • Variable selection
    • Scaling
    • Feature construction
    • Missing data
    • Outlier removal
  • Function family
    • linear, kernel (introduces kernel parameters)
  • Optimization model
    • loss function
    • regularization
    • parameters/constraints

6
  • Data
  • Function
  • Loss/Regularization
  • Cross-Validation Strategy
  • Generalization Error

[Diagram: with the cross-validation strategy and
generalization error estimate included, the overall
problem is NONCONVEX.]
7
How does the modeler make choices?
  • Best training set error
  • Experience/policy
  • Estimate of generalization error
    • Cross-validation
    • Bounds
  • Optimize the generalization error estimate
    • Fiddle around
    • Grid search
    • Gradient methods
    • Bilevel programming

8
Splitting Data for T-fold CV
9
CV via Grid Search
  • For every pair (C, ε):
    • For every validation set, solve the model on the
      corresponding training set, and use the resulting
      function to estimate the loss on the validation
      set.
    • Estimate the generalization error for (C, ε).
  • Return the best values of (C, ε).
  • Make the final model using the best (C, ε)
    (see the sketch below).
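For concreteness, a minimal sketch of this loop in Python, assuming scikit-learn's SVR as the inner ε-insensitive regression model; the solver choice and grid values here are illustrative, not the talk's:

  import numpy as np
  from sklearn.model_selection import KFold
  from sklearn.svm import SVR
  from sklearn.metrics import mean_absolute_error

  def grid_search_cv(X, y, C_grid, eps_grid, T=3):
      # Classical T-fold cross-validation grid search over (C, epsilon).
      # X, y are numpy arrays.
      folds = list(KFold(n_splits=T, shuffle=True, random_state=0).split(X))
      best_params, best_err = None, np.inf
      for C in C_grid:
          for eps in eps_grid:
              fold_errs = []
              for trn, val in folds:
                  # Solve the training problem for this fold ...
                  model = SVR(kernel="linear", C=C, epsilon=eps)
                  model.fit(X[trn], y[trn])
                  # ... and estimate the loss on the validation fold.
                  fold_errs.append(
                      mean_absolute_error(y[val], model.predict(X[val])))
              err = np.mean(fold_errs)  # generalization error estimate
              if err < best_err:
                  best_params, best_err = (C, eps), err
      return best_params, best_err  # refit on all data with best_params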

10
CV as Continuous Optimization Problem
  • Bilevel program for T folds: an outer-level
    validation problem constrained by T inner-level
    training problems.
  • Prior approaches: Golub et al. (1979), Generalized
    Cross-Validation for a single parameter in ridge
    regression.
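The program itself appeared as an image on the slide; a plausible reconstruction from the talk's setup, with Ω_t the t-th validation fold, Ω̄_t its training complement, and the box bound w̄ introduced on the next slide:

  \min_{C,\,\varepsilon,\,\bar w,\,w^1,\dots,w^T}\;
    \frac{1}{T}\sum_{t=1}^{T}\frac{1}{|\Omega_t|}
    \sum_{i\in\Omega_t}\bigl|x_i^\top w^t - y_i\bigr|
  \quad\text{s.t., for } t = 1,\dots,T:
  \qquad
  w^t \in \arg\min_{-\bar w \le w \le \bar w}\;
    \tfrac12\|w\|_2^2
    + C\sum_{j\in\bar\Omega_t}\max\bigl(|x_j^\top w - y_j| - \varepsilon,\;0\bigr)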
11
Benefit: More Design Variables
Add a feature box constraint, -w̄ ≤ w ≤ w̄, in the
inner-level problems.
12
ε-insensitive Loss Function
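The definition on this slide was an image; the standard ε-insensitive loss is

  L_\varepsilon\bigl(y, f(x)\bigr) = \max\bigl(|f(x) - y| - \varepsilon,\; 0\bigr)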
13
Inner-level Problem for t-th Fold
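The slide's program was an image; a plausible reconstruction in constraint form (whose KKT conditions appear on the next slide), with slacks ξ for the ε-tube:

  \min_{w,\,\xi}\;\tfrac12\|w\|_2^2 + C\sum_{j\in\bar\Omega_t}\xi_j
  \quad\text{s.t.}\quad
  -\varepsilon - \xi_j \le x_j^\top w - y_j \le \varepsilon + \xi_j,
  \qquad \xi_j \ge 0,
  \qquad -\bar w \le w \le \bar w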
14
Optimality (KKT) conditions for fixed C, ε, and w̄
15
Key Transformation
  • The KKT conditions for the inner-level training
    problems are necessary and sufficient.
  • Replace the lower-level problems by their KKT
    conditions.
  • The problem becomes a Mathematical Program with
    Equilibrium Constraints (MPEC).

16
Bilevel Problem as MPEC
Replace the T inner-level problems with their
corresponding optimality conditions.
17
MPEC to NLP via Inexact Cross Validation
  • Relax the hard equilibrium constraints to soft,
    inexact constraints, where tol is a user-defined
    tolerance (see the sketch below).
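Schematically (the slide's formulas were images), each complementarity pair from the KKT systems is relaxed as

  0 \le a \;\perp\; b \ge 0
  \qquad\longrightarrow\qquad
  a \ge 0, \quad b \ge 0, \quad a\,b \le \mathrm{tol}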

18
Solvers
  • Strategy: proof of concept using general-purpose
    nonlinear solvers from NEOS on the NLP
    • FILTER, SNOPT
    • (sequential quadratic programming methods)
  • FILTER results are almost always better.
  • Many possible alternatives:
    • Integer programming
    • Branch and bound
    • Lagrangian relaxations

19
Computational Experiments: Data
  • Synthetic
    • (5, 10, 15)-D data with Gaussian and Laplacian
      noise and (3, 7, 10) relevant features.
    • NLP with 3-fold CV.
    • Results: 30 to 90 training points, 1000 test
      points, 10 trials.
  • QSAR/Drug Design
    • 4 datasets, 600 dimensions reduced to the top 25
      principal components. NLP with 5-fold CV.
    • Results: 40 to 100 training points, rest test,
      20 trials.

20
Cross-Validation Methods Compared
  • Unconstrained Grid
    • Try 3 values each for C and ε.
  • Constrained Grid
    • Try 3 values each for C and ε, and {0, 1} for
      each component of w̄.
  • Bilevel/FILTER
    • Nonlinear program solved using an off-the-shelf
      SQP algorithm via NEOS.

21
15-D Data: Objective Value [chart]
22
15-D Data: Computational Time [chart]
23
15-D Data: Test MAD (mean absolute deviation) [chart]
24
QSAR Data: Objective Value [chart]
25
QSAR Data: Computation Time [chart]
26
QSAR Data: Test MAD [chart]
27
Classification Cross Validation
  • Given sample data from two classes.
  • Find a classification function that minimizes an
    out-of-sample estimate of the classification error.

[Figure: two point classes, labeled +1 and -1]
28
Lower Level: SVM
  • Define parallel planes
  • Minimize the number of points on the wrong side
  • Maximize the margin of separation

29
Lower-Level Loss Function: Hinge Loss
Measures the distance by which points violate the
appropriate hyperplane constraints (see below).
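The formula was an image on the slide; the standard hinge loss for a training point (x_j, y_j) is

  \max\bigl(1 - y_j\,(x_j^\top w - b),\; 0\bigr)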
30
Lower-Level Problem: SVC with Box Constraint
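The program appeared as an image; a plausible reconstruction, adding the talk's box constraint to the usual soft-margin SVC for the t-th training fold:

  \min_{w,\,b,\,\xi}\;\tfrac12\|w\|_2^2 + C\sum_{j\in\bar\Omega_t}\xi_j
  \quad\text{s.t.}\quad
  y_j\,(x_j^\top w - b) \ge 1 - \xi_j,
  \qquad \xi_j \ge 0,
  \qquad -\bar w \le w \le \bar w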
31
Inner-level KKT Conditions
32
Outer-Level Loss Functions
  • Misclassification Minimization Loss (MM)
    • The loss function used in classical CV.
    • Loss is 1 if the validation point is
      misclassified, 0 otherwise (computed using the
      step function).
  • Hinge Loss (HL)
    • Both inner and outer levels use the same loss
      function.
    • Loss is the distance from the appropriate
      hyperplane (computed using the max function);
      see the symbols below.
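In symbols (reconstructed; the slide's formulas were images), for a validation point (x_i, y_i) in fold t with trained classifier (w^t, b^t):

  \text{MM:}\;\; \operatorname{step}\bigl(-y_i\,(x_i^\top w^t - b^t)\bigr)
  \qquad
  \text{HL:}\;\; \max\bigl(1 - y_i\,(x_i^\top w^t - b^t),\; 0\bigr)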

33
Hinge Loss is a Convex Approximation of the
Misclassification Minimization Loss
34
Hinge Loss Bilevel Program (BilevelHL)
  • Replace the max in the outer-level objective with
    convex constraints.
  • Replace the inner-level problems with their KKT
    conditions.

35
Hinge Loss MPEC
36
Misclassification Min. Bilevel Program (BilevelMM)
Misclassifications are counted using the step
function, defined componentwise for an n-vector r by
(r_*)_i = 1 if r_i > 0, and 0 otherwise.
37
The Step Function
  • Mangasarian (1994) showed that the step function
    can be characterized as a solution of a linear
    program, and that any solution to that LP yields
    the step vector.

38
Misclassifications in the Validation Set
  • Validation point misclassified when the sign of
    is negative i.e.,
  • This can be recast for all validation points
    (within the t-th fold) as
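A plausible vectorized form (reconstructed; Y_t is the diagonal matrix of validation labels in fold t, X_t the matrix of validation points, e the all-ones vector, and (·)_* the componentwise step function):

  \zeta^t = \bigl(-Y_t\,(X_t w^t - b^t e)\bigr)_*,
  \qquad
  e^\top \zeta^t = \text{number of misclassified validation points in fold } t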

39
Misclassification Minimization Bilevel Program
(revisited)
Outer-level average misclassification minimization
Inner-level problems to determine misclassified
validation points
Inner-level training problems
40
Misclassification Minimization MPEC
41
Inexact Cross Validation NLP
  • Both the BilevelHL and BilevelMM MPECs are
    transformed to NLPs by relaxing the equilibrium
    constraints (inexact CV).
  • Solved using FILTER on NEOS.
  • These are compared with classical cross-validation:
    unconstrained and constrained grid search.

42
Experiments Data sets
  • 3-fold cross-validation for model selection.
  • Results averaged over 20 train/test splits.

43
Computational Time [chart]
44
Training CV Error [chart]
45
Testing Error [chart]
46
Number of Variables [chart]
47
Progress
  • Cross-validation is a bilevel problem solvable by
    continuous optimization methods.
  • The off-the-shelf NLP algorithm FILTER solved both
    classification and regression.
  • Bilevel optimization is extendable to many machine
    learning problems.

48
Extending Bilevel Approach to other Machine
Learning Problems
  • Kernel Classification/Regression
  • Variable Selection/Scaling
  • Multi-task Learning
  • Semi-supervised Learning
  • Generative methods

49
Semi-supervised Learning
  • Have labeled data and unlabeled data.
  • Treat the missing labels as design variables in
    the outer level.
  • The lower-level problems are still convex.

50
Semi-supervised Regression
Outer level minimizes error on the labeled data to
find the optimal parameters and labels (program
sketched below).
  • ε-insensitive loss on labeled data in the inner
    level
  • ε-insensitive loss on unlabeled data in the inner
    level
  • Inner-level regularization
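A minimal sketch of the semi-supervised bilevel regression program, assembled from the labeled pieces above (my notation: L and U index the labeled and unlabeled points, ŷ are the outer-level label variables):

  \min_{C,\,\varepsilon,\,\hat y}\;\sum_{i\in L}\bigl|x_i^\top w - y_i\bigr|
  \quad\text{s.t.}\quad
  w \in \arg\min_{w}\;\tfrac12\|w\|_2^2
    + C\sum_{i\in L}\max\bigl(|x_i^\top w - y_i| - \varepsilon,\;0\bigr)
    + C\sum_{j\in U}\max\bigl(|x_j^\top w - \hat y_j| - \varepsilon,\;0\bigr)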
51
Discussion
  • New capacity offers new possibilities:
    • Outer-level objectives?
    • Inner-level problems? Classification, ranking,
      semi-supervised learning, missing values, kernel
      selection, variable selection, ...
  • Need special-purpose algorithms for greater
    efficiency, scalability, and robustness.

This work was supported by Office of Naval
Research Grant N00014-06-1-0014.
52
Experiments Bilevel CV Procedure
  • Run BilevelMM/BilevelHL to compute the optimal
    parameters.
  • Drop descriptors with small components of w̄.
  • Create the model on all training data using the
    optimal parameters.
  • Compute the test error on the hold-out set.

53
Experiments Grid Search CV Procedure
  • Unconstrained Grid
    • Try 6 values for C on a log10 scale.
  • Constrained Grid
    • Try 6 values for C and {0, 1} for each component
      of w̄ (perform RFE if necessary).
  • Create the model on all training data using the
    optimal grid point.
  • Compute the test error on the hold-out set.

54
Extending Bilevel Approach to other Machine
Learning Problems
  • Kernel Classification/Regression
  • Different Regularizations (L1, elastic nets)
  • Enhanced Feature Selection
  • Multi-task Learning
  • Semi-supervised Learning
  • Generative methods

55
Enhanced Feature Selection
  • Assume at most a fixed number of descriptors is
    allowed.
  • Introduce an outer-level constraint, with the step
    function counting the non-zero elements of w̄.
  • Rewrite the constraint using the LP
    characterization of the step function.
  • This yields additional conditions for the MPEC
    (sketched below).
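A sketch of the constraint (my notation: k is the descriptor budget, (·)_* the componentwise step function, e the all-ones vector); rewriting the step function via its LP characterization then contributes extra equilibrium-type conditions to the MPEC:

  e^\top (\bar w)_* \le k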

56
Kernel Bilevel Discussion
  • Pros
    • Performs model selection in feature space
    • Performs feature selection in input space
  • Cons
    • Highly nonlinear model
    • Difficult to solve

57
Kernel Classification (MPEC form)
58
Is it okay to do 3 folds?
59
Applying the kernel trick
  • Drop the box constraint on w.
  • Eliminate w from the optimality conditions.
  • Replace inner products of the data with an
    appropriate kernel.

60
Feature Selection with Kernels
  • Parameterize the kernel with a scaling vector such
    that setting a component to zero makes the
    corresponding descriptor vanish from the kernel
    (see the sketch below).
  • Linear kernel
  • Polynomial kernel
  • Gaussian kernel
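One natural parameterization (my notation: p ≥ 0 is the scaling vector; c and d are the usual polynomial kernel parameters):

  \text{Linear:}\quad k_p(x, z) = \sum_n p_n\, x_n z_n
  \text{Polynomial:}\quad k_p(x, z) = \Bigl(\sum_n p_n\, x_n z_n + c\Bigr)^{d}
  \text{Gaussian:}\quad k_p(x, z) = \exp\Bigl(-\sum_n p_n\,(x_n - z_n)^2\Bigr)

Setting p_n = 0 removes the n-th descriptor from all three kernels.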

61
Kernel Regression (Bilevel form)