Title: Feature Selection in Nonlinear Kernel Classification
1. Feature Selection in Nonlinear Kernel Classification
Workshop on Optimization-Based Data Mining Techniques with Applications, IEEE International Conference on Data Mining, Omaha, Nebraska, October 28, 2007
- Olvi Mangasarian, Edward Wild
- University of Wisconsin-Madison
2. Example
[Figure: two-class data plotted in the (x1, x2) plane]
- The best linear classifier that uses only one feature selects the feature x1
- However, the data is nonlinearly separable using only the feature x2
- The data is nonlinearly separable; in general, nonlinear kernels use both x1 and x2
- Feature selection in nonlinear classification is important
3. Outline
- Minimize the number of input space features selected by a nonlinear kernel classifier
- Start with a standard 1-norm nonlinear support vector machine (SVM)
- Add a 0-1 diagonal matrix to suppress or keep features
  - Leads to a nonlinear mixed-integer program
- Introduce an algorithm to obtain a good local solution to the resulting mixed-integer program
- Evaluate the algorithm on two public datasets from the UCI repository and on synthetic NDCC data
4. Support Vector Machines
Kernels:
- Linear kernel: $(K(A, B))_{ij} = (AB)_{ij} = A_i B_{\cdot j} = K(A_i, B_{\cdot j})$
- Gaussian kernel with parameter $\mu$: $(K(A, B))_{ij} = \exp(-\mu \|A_i' - B_{\cdot j}\|^2)$
SVMs:
- Input $x \in \mathbb{R}^n$
- The SVM is defined by the parameters $u$ and the threshold $\gamma$ of the nonlinear separating surface $K(x', A')u = \gamma$, with bounding surfaces $K(x', A')u = \gamma + 1$ and $K(x', A')u = \gamma - 1$
- $A$ contains all data points; $A_+ \subset A$ are the points in class $+1$ and $A_- \subset A$ the points in class $-1$
- $e$ is a vector of ones
- The classifier seeks $K(A_+, A')u \ge e\gamma + e$ and $K(A_-, A')u \le e\gamma - e$
- A slack variable $y \ge 0$ allows points to be on the wrong side of the bounding surfaces
- Minimize $e'y$ (the hinge loss, i.e., the plus function $\max\{\cdot, 0\}$) to fit the data
- Minimize $e's$ ($= \|u\|_1$ at the solution) to reduce overfitting
(These pieces are collected into a linear program sketched below.)
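A sketch of the resulting 1-norm nonlinear SVM linear program, consistent with the bullets above; here $D$ is the diagonal $\pm 1$ label matrix defined on the Notation slide, and $\nu > 0$ is the weight on the hinge loss (a symbol assumed for this sketch):

$$
\min_{u,\gamma,y,s}\;\; \nu e'y + e's
\quad\text{s.t.}\quad
D\bigl(K(A, A')u - e\gamma\bigr) + y \ge e,\qquad
-s \le u \le s,\qquad y \ge 0.
$$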
5. Reduced Feature SVM
- Start with the full SVM: all features are present in the kernel matrix $K(A, A')$
- Replace $A$ with $AE$, where $E$ is an $n \times n$ diagonal matrix with $E_{ii} \in \{1, 0\}$, $i = 1, \dots, n$
  - If $E_{ii}$ is 0, the $i$th feature is removed
- To suppress features, add the number of features present ($e'Ee$) to the objective with weight σ ≥ 0
- As σ is increased, more features will be removed from the classifier (the resulting program is sketched below)
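A sketch of the reduced feature SVM as a nonlinear mixed-integer program, obtained by making the substitutions above in the linear program of the previous slide ($\nu$ is again the assumed hinge-loss weight):

$$
\min_{u,\gamma,y,s,E}\;\; \nu e'y + e's + \sigma e'Ee
\quad\text{s.t.}\quad
D\bigl(K(AE, (AE)')u - e\gamma\bigr) + y \ge e,\qquad
-s \le u \le s,\qquad y \ge 0,\qquad E_{ii} \in \{0, 1\}.
$$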
6. Reduced Feature SVM (RFSVM)
1) Initialize the diagonal matrix E randomly
2) For fixed 0-1 values E, solve the SVM linear program to obtain (u, γ, y, s)
3) Fix (u, γ, s) and sweep through E repeatedly as follows: for each component of E, replace 1 by 0 and conversely, provided the change decreases the overall objective function by more than tol
4) Go to (3) if a change was made in the last sweep, otherwise continue to (5)
5) Solve the SVM linear program with the new matrix E. If the objective decrease is less than tol, stop; otherwise go to (3)
(A code sketch of this procedure follows.)
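A minimal Python sketch of this alternating procedure, assuming the mixed-integer formulation sketched above (nu = hinge-loss weight, sigma = feature-selection weight, mu = Gaussian kernel parameter). The function names and the use of scipy's linprog are illustrative choices, not part of the talk.

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(A, B, mu):
    """(K(A, B'))_ij = exp(-mu * ||A_i - B_j||^2) for rows A_i of A and B_j of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-mu * sq)

def solve_svm_lp(K, d, nu):
    """1-norm SVM LP for a fixed kernel matrix K and numpy label vector d in {+1, -1}.
    Decision variables are stacked as [u (m), gamma (1), y (m), s (m)]."""
    m = K.shape[0]
    c = np.concatenate([np.zeros(m + 1), nu * np.ones(m), np.ones(m)])
    D = np.diag(d.astype(float))
    # D(K u - e*gamma) + y >= e   rewritten as   -D K u + d*gamma - y <= -e
    A1 = np.hstack([-D @ K, d.reshape(-1, 1).astype(float), -np.eye(m), np.zeros((m, m))])
    # |u_i| <= s_i   as   u - s <= 0   and   -u - s <= 0
    A2 = np.hstack([np.eye(m), np.zeros((m, 1 + m)), -np.eye(m)])
    A3 = np.hstack([-np.eye(m), np.zeros((m, 1 + m)), -np.eye(m)])
    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * m)])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)    # u, gamma free; y, s >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    x = res.x
    return x[:m], x[m], x[m + 1:2 * m + 1], x[2 * m + 1:]        # u, gamma, y, s

def rfsvm(A, d, nu=1.0, sigma=1.0, mu=1.0, tol=1e-6, seed=0):
    """Alternate between the SVM LP (continuous variables) and sweeps over the
    0-1 diagonal of E (integer variables)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    e_diag = rng.integers(0, 2, size=n).astype(float)            # step 1: random 0-1 diagonal of E

    def objective(ed, u, gamma, s):
        K = gaussian_kernel(A * ed, A * ed, mu)                  # kernel of AE with itself
        y = np.maximum(1.0 - d * (K @ u - gamma), 0.0)           # hinge loss for fixed (u, gamma)
        return nu * y.sum() + s.sum() + sigma * ed.sum()

    best = np.inf
    while True:
        K = gaussian_kernel(A * e_diag, A * e_diag, mu)
        u, gamma, y, s = solve_svm_lp(K, d, nu)                  # steps 2 and 5
        obj = objective(e_diag, u, gamma, s)
        if best - obj <= tol:                                    # LP gave no real decrease: stop
            return u, gamma, e_diag
        best = obj
        changed = True
        while changed:                                           # steps 3-4: sweep through E
            changed = False
            for i in range(n):
                trial = e_diag.copy()
                trial[i] = 1.0 - trial[i]                        # flip one 0-1 component
                t_obj = objective(trial, u, gamma, s)
                if best - t_obj > tol:
                    e_diag, best, changed = trial, t_obj, True
```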
7. RFSVM Convergence (for tol = 0)
- The objective function value converges
  - Each step decreases the objective
  - The objective is bounded below by 0
- The limit of the objective function value is attained at any accumulation point of the sequence of iterates
- An accumulation point is a local minimum solution
  - The continuous variables are optimal for the fixed integer variables
  - Changing any single integer variable will not decrease the objective
8. Experimental Results
- Classification accuracy versus number of features used
- Compare our RFSVM to Relief and RFE (Recursive Feature Elimination)
- Results given on two public datasets from the UCI repository
- Ability of RFSVM to handle problems with up to 1000 features tested on synthetic NDCC datasets
- Set the feature selection parameter σ = 1
9. Relief and RFE
- Relief
  - Kira and Rendell, 1992
  - Filter method: feature selection is a preprocessing procedure
  - Features are selected as relevant if they tend to have different feature values for points in different classes
- RFE (Recursive Feature Elimination)
  - Guyon, Weston, Barnhill, and Vapnik, 2002
  - Wrapper method: feature selection is based on classification
  - Features are selected as relevant if removing them causes a large change in the margin of an SVM (a usage sketch follows)
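For context only, a common off-the-shelf instantiation of the RFE idea: scikit-learn's RFE wrapping a linear SVM, which recursively drops the lowest-weight features. This illustrates the wrapper approach; it is not the comparison code from the talk, and the toy data and parameter choices are mine.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))                 # 10 features, 2 of them informative
y = np.where(X[:, 0] + 0.5 * X[:, 3] > 0, 1, -1)

selector = RFE(SVC(kernel="linear"), n_features_to_select=2, step=1)
selector.fit(X, y)
print(selector.support_)    # boolean mask of the 2 selected features
print(selector.ranking_)    # rank 1 = kept; larger ranks were eliminated earlier
```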
10. Ionosphere Dataset: 351 Points in R^34
[Figure: cross-validation accuracy vs. the feature selection parameter σ and the number of features used; reference lines: nonlinear SVM with no feature selection, linear 1-norm SVM]
- If the appropriate value of σ is selected, RFSVM can obtain higher accuracy using fewer features than SVM1
- Even for feature selection parameter σ = 0, some features may be removed when removing them decreases the hinge loss
- Note that accuracy decreases slightly until about 10 features remain, and then decreases more sharply as they are removed
11. Normally Distributed Clusters on Cubes (NDCC) Dataset (Thompson, 2006)
- Points are generated from normal distributions centered at vertices of 1-norm cubes (a rough stand-in generator is sketched below)
- The dataset is not linearly separable
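The actual NDCC generator is linked on the Questions slide. The following is only a rough stand-in to convey the flavor of such data (normal clusters centered at cube vertices, one class label per vertex); it does not reproduce Thompson's construction or its 1-norm cube geometry, and all names and parameters are mine.

```python
import numpy as np

def cube_cluster_data(n_points, n_dims, n_vertices=8, spread=0.3, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.choice([-1.0, 1.0], size=(n_vertices, n_dims))  # cube vertices
    vertex_label = rng.choice([-1, 1], size=n_vertices)           # class per vertex
    which = rng.integers(0, n_vertices, size=n_points)            # assign points to vertices
    X = centers[which] + spread * rng.standard_normal((n_points, n_dims))
    y = vertex_label[which]
    return X, y

X, y = cube_cluster_data(n_points=400, n_dims=20)                 # e.g., 20 features
```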
12. RFSVM vs. SVM without Feature Selection (NKSVM1) on NDCC Data with 20 True Features and Varying Numbers of Irrelevant Features
Each point is the average test set correctness over 10 datasets with 200 training, 200 tuning, and 1000 testing points.
RFSVM vs. SVM without Feature Selection (NKSVM1) on NDCC Data with 100 True Features and 1000 Irrelevant Features
When 480 irrelevant features are added, the accuracy of RFSVM is 45% higher than that of NKSVM1.
13. Conclusion
- New rigorous formulation with a precise objective for feature selection in nonlinear SVM classifiers
- Obtain a local solution to the resulting mixed-integer program
  - Alternate between a linear program to compute the continuous variables and successive sweeps to update the integer variables
- Efficiently learns accurate nonlinear classifiers with reduced numbers of features
- Handles problems with 1000 features, 900 of which are irrelevant
14. Questions?
- Websites with links to papers and talks
  - http://www.cs.wisc.edu/olvi
  - http://www.cs.wisc.edu/wildt
- NDCC generator
  - http://www.cs.wisc.edu/dmi/svm/ndcc/
15. Running Time on the Ionosphere Dataset
- Averages 5.7 sweeps through the integer variables
- Averages 3.4 linear programs
- 75% of the time consumed in objective function evaluations
- 15% of the time consumed in solving linear programs
- Complete experiment (1960 runs) took 1 hour
  - 3 GHz Pentium 4
  - Written in MATLAB
  - CPLEX 9.0 used to solve the linear programs
  - Gaussian kernel written in C
16. Sonar Dataset: 208 Points in R^60
[Figure: cross-validation accuracy vs. the feature selection parameter σ and the number of features used]
17. Related Work
- Approaches that use specialized kernels
  - Weston, Mukherjee, Chapelle, Pontil, Poggio, and Vapnik, 2000: structural risk minimization
  - Gold, Holub, and Sollich, 2005: Bayesian interpretation
  - Zhang, 2006: smoothing spline ANOVA kernels
- Margin-based approach
  - Frölich and Zell, 2004: remove features if there is little change to the margin when they are removed
- Other approaches which combine feature selection with basis reduction
  - Bi, Bennett, Embrechts, Breneman, and Song, 2003
  - Avidan, 2004
18. Future Work
- Datasets with more features
- Reduce the number of objective function evaluations
  - Limit the number of integer cycles
- Other ways to update the integer variables
- Application to regression problems
- Automatic choice of σ
19. Algorithm
- A global solution to the nonlinear mixed-integer program cannot be found efficiently
  - It requires solving 2^n linear programs, one per 0-1 setting of the diagonal of E (for the 34-feature Ionosphere dataset that is 2^34, about 1.7 x 10^10, linear programs)
- For fixed values of the integer diagonal matrix, the problem is reduced to an ordinary SVM linear program
- Solution strategy: alternate optimization of the continuous and integer variables
  - For fixed values of E, solve a linear program for (u, γ, y, s)
  - For fixed values of (u, γ, s), sweep through the components of E and make updates which decrease the objective function
20. Notation
- Data points are represented as rows of an $m \times n$ matrix $A$
- Data labels of $+1$ or $-1$ are given as elements of an $m \times m$ diagonal matrix $D$
- Example: XOR, 4 points in $\mathbb{R}^2$
  - Points $(0, 1)$, $(1, 0)$ have label $+1$
  - Points $(0, 0)$, $(1, 1)$ have label $-1$
- Kernel $K(A, B): \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times k} \to \mathbb{R}^{m \times k}$ (a numerical check on the XOR points follows)
  - Linear kernel: $(K(A, B))_{ij} = (AB)_{ij} = A_i B_{\cdot j} = K(A_i, B_{\cdot j})$
  - Gaussian kernel with parameter $\mu$: $(K(A, B))_{ij} = \exp(-\mu \|A_i' - B_{\cdot j}\|^2)$
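A small numerical check of these definitions on the XOR points above (μ = 1 is an arbitrary value chosen for the illustration):

```python
import numpy as np

A = np.array([[0., 1.], [1., 0.],    # label +1
              [0., 0.], [1., 1.]])   # label -1
D = np.diag([1, 1, -1, -1])          # m x m diagonal label matrix

mu = 1.0
sq_dist = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-mu * sq_dist)            # Gaussian kernel K(A, A'), a 4 x 4 matrix
print(np.round(K, 3))
print(A @ A.T)                       # linear kernel K(A, A') = AA'
```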
21. Methodology
- UCI datasets
  - To reduce running time, 1/11 of each dataset was used as a tuning set to select ν and the kernel parameter
  - The remaining 10/11 was used for 10-fold cross validation
  - The procedure was repeated 5 times for each dataset with a different random choice of tuning set each time (a sketch of this protocol follows)
- NDCC
  - Generate multiple datasets with 200 training, 200 tuning, and 1000 testing points
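A rough sketch of the UCI splitting protocol described above (the function name and index handling are mine, not from the talk):

```python
import numpy as np

def uci_protocol_splits(n_points, n_repeats=5, n_folds=10, seed=0):
    """Yield (tuning, training, testing) index splits: ~1/11 of the data held out
    for tuning, 10-fold cross validation on the remaining ~10/11, repeated with
    a new random split each time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_points)
        n_tune = n_points // 11
        tune_idx, rest = perm[:n_tune], perm[n_tune:]
        # ...select the SVM and kernel parameters on tune_idx...
        folds = np.array_split(rest, n_folds)
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            yield tune_idx, train_idx, test_idx
```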