Title: Feature Selection in Nonlinear Kernel Classification
1. Feature Selection in Nonlinear Kernel Classification
Workshop on Optimization-Based Data Mining Techniques with Applications, IEEE International Conference on Data Mining, Omaha, Nebraska, October 28, 2007
- Olvi Mangasarian, Edward Wild
- University of Wisconsin-Madison
2. Example
[Figure: two-class data plotted in the (x1, x2) plane]
- The best linear classifier that uses only one feature selects the feature x1
- However, the data is nonlinearly separable using only the feature x2
- The data is nonlinearly separable; in general, nonlinear kernels use both x1 and x2
- Feature selection in nonlinear classification is important
3. Outline
- Minimize the number of input space features selected by a nonlinear kernel classifier
- Start with a standard 1-norm nonlinear support vector machine (SVM)
- Add a 0-1 diagonal matrix to suppress or keep features
  - Leads to a nonlinear mixed-integer program
- Introduce an algorithm to obtain a good local solution to the resulting mixed-integer program
- Evaluate the algorithm on two public datasets from the UCI repository and on synthetic NDCC data
4. Support Vector Machines
Kernels:
- Linear kernel: $(K(A, B))_{ij} = (AB)_{ij} = A_i B_{\cdot j} = K(A_i, B_{\cdot j})$
- Gaussian kernel with parameter $\mu$: $(K(A, B))_{ij} = \exp(-\mu \|A_i' - B_{\cdot j}\|^2)$
SVMs:
- Input $x \in \mathbb{R}^n$
- The SVM is defined by the parameters $u$ and the threshold $\gamma$ of the nonlinear separating surface $K(x', A')u = \gamma$, with bounding surfaces $K(x', A')u = \gamma + 1$ and $K(x', A')u = \gamma - 1$
- $A$ contains all data points; $A_+ \subset A$ are the points in class $+1$ and $A_- \subset A$ the points in class $-1$
- $e$ is a vector of ones
- The classifier seeks $K(A_+, A')u \ge e\gamma + e$ and $K(A_-, A')u \le e\gamma - e$
- A slack variable $y \ge 0$ allows points to be on the wrong side of the bounding surfaces
- Minimize $e'y$ (the hinge loss, i.e., the plus function $\max\{\cdot, 0\}$) to fit the data
- Minimize $e's$ ($= \|u\|_1$ at the solution) to reduce overfitting
(These pieces are collected into a linear program sketched below.)
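A sketch of the resulting 1-norm nonlinear SVM linear program, consistent with the bullets above; here $D$ is the diagonal $\pm 1$ label matrix defined on the Notation slide, and $\nu > 0$ is the weight on the hinge loss (a symbol assumed for this sketch):

$$
\min_{u,\gamma,y,s}\;\; \nu e'y + e's
\quad\text{s.t.}\quad
D\bigl(K(A, A')u - e\gamma\bigr) + y \ge e,\qquad
-s \le u \le s,\qquad y \ge 0.
$$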
5. Reduced Feature SVM
- Start with the full SVM: all features are present in the kernel matrix $K(A, A')$
- Replace $A$ with $AE$, where $E$ is an $n \times n$ diagonal matrix with $E_{ii} \in \{1, 0\}$, $i = 1, \dots, n$
  - If $E_{ii}$ is 0, the $i$th feature is removed
- To suppress features, add the number of features present ($e'Ee$) to the objective with weight σ ≥ 0
- As σ is increased, more features will be removed from the classifier (the resulting program is sketched below)
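A sketch of the reduced feature SVM as a nonlinear mixed-integer program, obtained by making the substitutions above in the linear program of the previous slide ($\nu$ is again the assumed hinge-loss weight):

$$
\min_{u,\gamma,y,s,E}\;\; \nu e'y + e's + \sigma e'Ee
\quad\text{s.t.}\quad
D\bigl(K(AE, (AE)')u - e\gamma\bigr) + y \ge e,\qquad
-s \le u \le s,\qquad y \ge 0,\qquad E_{ii} \in \{0, 1\}.
$$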
6. Reduced Feature SVM (RFSVM)
1) Initialize the diagonal matrix E randomly
2) For fixed 0-1 values E, solve the SVM linear program to obtain (u, γ, y, s)
3) Fix (u, γ, s) and sweep through E repeatedly as follows: for each component of E, replace 1 by 0 and conversely, provided the change decreases the overall objective function by more than tol
4) Go to (3) if a change was made in the last sweep, otherwise continue to (5)
5) Solve the SVM linear program with the new matrix E. If the objective decrease is less than tol, stop; otherwise go to (3)
(A code sketch of this procedure follows.)
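A minimal Python sketch of this alternating procedure, assuming the mixed-integer formulation sketched above (nu = hinge-loss weight, sigma = feature-selection weight, mu = Gaussian kernel parameter). The function names and the use of scipy's linprog are illustrative choices, not part of the talk.

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(A, B, mu):
    """(K(A, B'))_ij = exp(-mu * ||A_i - B_j||^2) for rows A_i of A and B_j of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-mu * sq)

def solve_svm_lp(K, d, nu):
    """1-norm SVM LP for a fixed kernel matrix K and numpy label vector d in {+1, -1}.
    Decision variables are stacked as [u (m), gamma (1), y (m), s (m)]."""
    m = K.shape[0]
    c = np.concatenate([np.zeros(m + 1), nu * np.ones(m), np.ones(m)])
    D = np.diag(d.astype(float))
    # D(K u - e*gamma) + y >= e   rewritten as   -D K u + d*gamma - y <= -e
    A1 = np.hstack([-D @ K, d.reshape(-1, 1).astype(float), -np.eye(m), np.zeros((m, m))])
    # |u_i| <= s_i   as   u - s <= 0   and   -u - s <= 0
    A2 = np.hstack([np.eye(m), np.zeros((m, 1 + m)), -np.eye(m)])
    A3 = np.hstack([-np.eye(m), np.zeros((m, 1 + m)), -np.eye(m)])
    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * m)])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)    # u, gamma free; y, s >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    x = res.x
    return x[:m], x[m], x[m + 1:2 * m + 1], x[2 * m + 1:]        # u, gamma, y, s

def rfsvm(A, d, nu=1.0, sigma=1.0, mu=1.0, tol=1e-6, seed=0):
    """Alternate between the SVM LP (continuous variables) and sweeps over the
    0-1 diagonal of E (integer variables)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    e_diag = rng.integers(0, 2, size=n).astype(float)            # step 1: random 0-1 diagonal of E

    def objective(ed, u, gamma, s):
        K = gaussian_kernel(A * ed, A * ed, mu)                  # kernel of AE with itself
        y = np.maximum(1.0 - d * (K @ u - gamma), 0.0)           # hinge loss for fixed (u, gamma)
        return nu * y.sum() + s.sum() + sigma * ed.sum()

    best = np.inf
    while True:
        K = gaussian_kernel(A * e_diag, A * e_diag, mu)
        u, gamma, y, s = solve_svm_lp(K, d, nu)                  # steps 2 and 5
        obj = objective(e_diag, u, gamma, s)
        if best - obj <= tol:                                    # LP gave no real decrease: stop
            return u, gamma, e_diag
        best = obj
        changed = True
        while changed:                                           # steps 3-4: sweep through E
            changed = False
            for i in range(n):
                trial = e_diag.copy()
                trial[i] = 1.0 - trial[i]                        # flip one 0-1 component
                t_obj = objective(trial, u, gamma, s)
                if best - t_obj > tol:
                    e_diag, best, changed = trial, t_obj, True
```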
7. RFSVM Convergence (for tol = 0)
- The objective function value converges
  - Each step decreases the objective
  - The objective is bounded below by 0
- The limit of the objective function value is attained at any accumulation point of the sequence of iterates
- An accumulation point is a local minimum solution
  - The continuous variables are optimal for the fixed integer variables
  - Changing any single integer variable will not decrease the objective
8. Experimental Results
- Classification accuracy versus number of features used
- Compare our RFSVM to Relief and RFE (Recursive Feature Elimination)
- Results given on two public datasets from the UCI repository
- Ability of RFSVM to handle problems with up to 1000 features tested on synthetic NDCC datasets
- Set the feature selection parameter σ = 1
9. Relief and RFE
- Relief
  - Kira and Rendell, 1992
  - Filter method: feature selection is a preprocessing procedure
  - Features are selected as relevant if they tend to have different feature values for points in different classes
- RFE (Recursive Feature Elimination)
  - Guyon, Weston, Barnhill, and Vapnik, 2002
  - Wrapper method: feature selection is based on classification
  - Features are selected as relevant if removing them causes a large change in the margin of an SVM (a usage sketch follows)
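For context only, a common off-the-shelf instantiation of the RFE idea: scikit-learn's RFE wrapping a linear SVM, which recursively drops the lowest-weight features. This illustrates the wrapper approach; it is not the comparison code from the talk, and the toy data and parameter choices are mine.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))                 # 10 features, 2 of them informative
y = np.where(X[:, 0] + 0.5 * X[:, 3] > 0, 1, -1)

selector = RFE(SVC(kernel="linear"), n_features_to_select=2, step=1)
selector.fit(X, y)
print(selector.support_)    # boolean mask of the 2 selected features
print(selector.ranking_)    # rank 1 = kept; larger ranks were eliminated earlier
```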
10. Ionosphere Dataset: 351 Points in R^34
[Figure: cross-validation accuracy vs. the feature selection parameter σ and the number of features used; reference lines: nonlinear SVM with no feature selection, linear 1-norm SVM]
- If the appropriate value of σ is selected, RFSVM can obtain higher accuracy using fewer features than SVM1
- Even for feature selection parameter σ = 0, some features may be removed when removing them decreases the hinge loss
- Note that accuracy decreases slightly until about 10 features remain, and then decreases more sharply as they are removed
11. Normally Distributed Clusters on Cubes (NDCC) Dataset (Thompson, 2006)
- Points are generated from normal distributions centered at vertices of 1-norm cubes (a rough stand-in generator is sketched below)
- The dataset is not linearly separable
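The actual NDCC generator is linked on the Questions slide. The following is only a rough stand-in to convey the flavor of such data (normal clusters centered at cube vertices, one class label per vertex); it does not reproduce Thompson's construction or its 1-norm cube geometry, and all names and parameters are mine.

```python
import numpy as np

def cube_cluster_data(n_points, n_dims, n_vertices=8, spread=0.3, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.choice([-1.0, 1.0], size=(n_vertices, n_dims))  # cube vertices
    vertex_label = rng.choice([-1, 1], size=n_vertices)           # class per vertex
    which = rng.integers(0, n_vertices, size=n_points)            # assign points to vertices
    X = centers[which] + spread * rng.standard_normal((n_points, n_dims))
    y = vertex_label[which]
    return X, y

X, y = cube_cluster_data(n_points=400, n_dims=20)                 # e.g., 20 features
```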
12. RFSVM vs. SVM without Feature Selection (NKSVM1) on NDCC Data with 20 True Features and Varying Numbers of Irrelevant Features
Each point is the average test set correctness over 10 datasets with 200 training, 200 tuning, and 1000 testing points.
RFSVM vs. SVM without Feature Selection (NKSVM1) on NDCC Data with 100 True Features and 1000 Irrelevant Features
When 480 irrelevant features are added, the accuracy of RFSVM is 45% higher than that of NKSVM1.
13. Conclusion
- New rigorous formulation with a precise objective for feature selection in nonlinear SVM classifiers
- Obtain a local solution to the resulting mixed-integer program
  - Alternate between a linear program to compute the continuous variables and successive sweeps to update the integer variables
- Efficiently learns accurate nonlinear classifiers with reduced numbers of features
- Handles problems with 1000 features, 900 of which are irrelevant
14. Questions?
- Websites with links to papers and talks
  - http://www.cs.wisc.edu/olvi
  - http://www.cs.wisc.edu/wildt
- NDCC generator
  - http://www.cs.wisc.edu/dmi/svm/ndcc/
15. Running Time on the Ionosphere Dataset
- Averages 5.7 sweeps through the integer variables
- Averages 3.4 linear programs
- 75% of the time consumed in objective function evaluations
- 15% of the time consumed in solving linear programs
- Complete experiment (1960 runs) took 1 hour
  - 3 GHz Pentium 4
  - Written in MATLAB
  - CPLEX 9.0 used to solve the linear programs
  - Gaussian kernel written in C
16. Sonar Dataset: 208 Points in R^60
[Figure: cross-validation accuracy vs. the feature selection parameter σ and the number of features used]
17. Related Work
- Approaches that use specialized kernels
  - Weston, Mukherjee, Chapelle, Pontil, Poggio, and Vapnik, 2000: structural risk minimization
  - Gold, Holub, and Sollich, 2005: Bayesian interpretation
  - Zhang, 2006: smoothing spline ANOVA kernels
- Margin-based approach
  - Frölich and Zell, 2004: remove features if there is little change to the margin when they are removed
- Other approaches which combine feature selection with basis reduction
  - Bi, Bennett, Embrechts, Breneman, and Song, 2003
  - Avidan, 2004
18. Future Work
- Datasets with more features
- Reduce the number of objective function evaluations
  - Limit the number of integer cycles
- Other ways to update the integer variables
- Application to regression problems
- Automatic choice of σ
19. Algorithm
- A global solution to the nonlinear mixed-integer program cannot be found efficiently
  - It requires solving 2^n linear programs, one per 0-1 setting of the diagonal of E (for the 34-feature Ionosphere dataset that is 2^34, about 1.7 x 10^10, linear programs)
- For fixed values of the integer diagonal matrix, the problem is reduced to an ordinary SVM linear program
- Solution strategy: alternate optimization of the continuous and integer variables
  - For fixed values of E, solve a linear program for (u, γ, y, s)
  - For fixed values of (u, γ, s), sweep through the components of E and make updates which decrease the objective function
20. Notation
- Data points are represented as rows of an $m \times n$ matrix $A$
- Data labels of $+1$ or $-1$ are given as elements of an $m \times m$ diagonal matrix $D$
- Example: XOR, 4 points in $\mathbb{R}^2$
  - Points $(0, 1)$, $(1, 0)$ have label $+1$
  - Points $(0, 0)$, $(1, 1)$ have label $-1$
- Kernel $K(A, B): \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times k} \to \mathbb{R}^{m \times k}$ (a numerical check on the XOR points follows)
  - Linear kernel: $(K(A, B))_{ij} = (AB)_{ij} = A_i B_{\cdot j} = K(A_i, B_{\cdot j})$
  - Gaussian kernel with parameter $\mu$: $(K(A, B))_{ij} = \exp(-\mu \|A_i' - B_{\cdot j}\|^2)$
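A small numerical check of these definitions on the XOR points above (μ = 1 is an arbitrary value chosen for the illustration):

```python
import numpy as np

A = np.array([[0., 1.], [1., 0.],    # label +1
              [0., 0.], [1., 1.]])   # label -1
D = np.diag([1, 1, -1, -1])          # m x m diagonal label matrix

mu = 1.0
sq_dist = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-mu * sq_dist)            # Gaussian kernel K(A, A'), a 4 x 4 matrix
print(np.round(K, 3))
print(A @ A.T)                       # linear kernel K(A, A') = AA'
```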
21. Methodology
- UCI datasets
  - To reduce running time, 1/11 of each dataset was used as a tuning set to select ν and the kernel parameter
  - The remaining 10/11 was used for 10-fold cross validation
  - The procedure was repeated 5 times for each dataset with a different random choice of tuning set each time (a sketch of this protocol follows)
- NDCC
  - Generate multiple datasets with 200 training, 200 tuning, and 1000 testing points
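A rough sketch of the UCI splitting protocol described above (the function name and index handling are mine, not from the talk):

```python
import numpy as np

def uci_protocol_splits(n_points, n_repeats=5, n_folds=10, seed=0):
    """Yield (tuning, training, testing) index splits: ~1/11 of the data held out
    for tuning, 10-fold cross validation on the remaining ~10/11, repeated with
    a new random split each time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_points)
        n_tune = n_points // 11
        tune_idx, rest = perm[:n_tune], perm[n_tune:]
        # ...select the SVM and kernel parameters on tune_idx...
        folds = np.array_split(rest, n_folds)
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            yield tune_idx, train_idx, test_idx
```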