Title: Classification via Mathematical Programming Based Support Vector Machines
1 Classification via Mathematical Programming Based Support Vector Machines
November 26, 2002
Computer Sciences Dept., University of Wisconsin - Madison
2 Outline of Talk
- (Standard) Support vector machines (SVM)
- Classify by halfspaces
- Proximal support vector machines (PSVM)
- Classify by proximity to planes
- Numerical experiments
- Incremental PSVM classifiers
- Synthetic dataset consisting of 1 billion points in 10-dimensional input space, classified in less than 2 hours and 26 minutes
- Knowledge-based linear SVMs
- Incorporating knowledge sets into a classifier
- Numerical experiments
3 Support Vector Machines: Maximizing the Margin between Bounding Planes
[Figure: the two point sets A+ and A- separated by parallel bounding planes, with the margin between the planes maximized]
4 Standard Support Vector Machine: Algebra of the 2-Category Linearly Separable Case
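The algebra itself was an image on the slide. A reconstruction in the talk's usual notation, assumed consistent with the MATLAB code on slide 16: A is the m x n matrix of data points, D the m x m diagonal matrix of +1/-1 class labels, and e a column vector of ones. The two bounding planes and the linear separability condition are

$$x'w = \gamma + 1, \qquad x'w = \gamma - 1, \qquad D(Aw - e\gamma) \ge e,$$

and the margin between the two bounding planes is $2/\|w\|$.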
5 Standard Support Vector Machine Formulation
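The formulation did not survive extraction; in this notation, with tuning parameter $\nu > 0$ and slack vector $y$, the standard QP is presumably

$$\min_{w,\gamma,y}\ \nu e'y + \tfrac{1}{2}w'w \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0.$$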
6 Proximal Support Vector Machines (KDD 2001): Fitting the Data Using Two Parallel Bounding Planes
[Figure: the two point sets A+ and A- clustered around two parallel proximal planes]
7 PSVM Formulation
Starting from the QP SVM formulation, a simple but critical modification (reconstructed below) changes the nature of the optimization problem tremendously!
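A reconstruction of the modification, consistent with the PSVM paper: the inequality constraint becomes an equality, the slack $y$ enters the objective through its squared 2-norm, and $\gamma^2$ joins the regularizer:

$$\min_{w,\gamma,y}\ \tfrac{\nu}{2}\|y\|^2 + \tfrac{1}{2}(w'w + \gamma^2) \quad \text{s.t.} \quad D(Aw - e\gamma) + y = e.$$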
8 Advantages of the New Formulation
- Objective function remains strongly convex
- An explicit exact solution can be written in terms of the problem data
- PSVM classifier is obtained by solving a single system of linear equations in the usually small dimensional input space
- Exact leave-one-out correctness can be obtained in terms of problem data
9 Linear PSVM
- Setting the gradient equal to zero gives a nonsingular system of linear equations
- Solution of the system gives the desired PSVM classifier
10 Linear PSVM Solution
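The explicit solution was an image on the slide; reconstructed here to match the MATLAB code on slide 16, with $H = [A\ \ {-e}]$:

$$\begin{bmatrix} w \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + H'H\right)^{-1} H'De,$$

an $(n+1) \times (n+1)$ system in the input-space dimension $n$.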
11 Linear Proximal SVM Algorithm
12 Nonlinear PSVM Formulation
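A reconstruction of the nonlinear formulation, assuming the usual kernel substitution $w = A'Du$ from the PSVM paper:

$$\min_{u,\gamma,y}\ \tfrac{\nu}{2}\|y\|^2 + \tfrac{1}{2}(u'u + \gamma^2) \quad \text{s.t.} \quad D\big(K(A,A')Du - e\gamma\big) + y = e.$$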
13 The Nonlinear Classifier
- where K is a nonlinear kernel, e.g. the Gaussian kernel sketched below
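A reconstruction of the classifier, with the Gaussian kernel as an assumed example (it is the standard choice in this line of work):

$$\operatorname{sign}\big(K(x', A')Du - \gamma\big), \qquad K(x', A')_j = \exp\big(-\mu\,\|x - A_j'\|^2\big),$$

where $A_j$ denotes the $j$-th row of $A$ and $\mu > 0$ is a kernel width parameter.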
14 Nonlinear PSVM
- The linear system to be solved now lives in the potentially large dimension m + 1, which grows with the number of data points rather than the number of input features
- However, reduced kernel techniques (RSVM) can be used to reduce this dimensionality
15 Linear and Nonlinear Proximal SVM Algorithm
[Figure: side-by-side comparison of the solve steps of the linear and nonlinear PSVM algorithms]
16 Linear and Nonlinear PSVM MATLAB Code
function [w, gamma] = psvm(A,d,nu)
% PSVM: linear and nonlinear classification
% INPUT: A, d = diag(D), nu. OUTPUT: w, gamma
% [w, gamma] = psvm(A,d,nu);
[m,n] = size(A); e = ones(m,1); H = [A -e];
v = (d'*H)';                    % v = H'*D*e
r = (speye(n+1)/nu + H'*H)\v;   % solve (I/nu + H'*H) r = v
w = r(1:n); gamma = r(n+1);     % getting w, gamma from r
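A hypothetical usage sketch on synthetic data (all names and values below are illustrative, not from the talk):

m = 200; n = 2;
A = [randn(m/2,n)+1; randn(m/2,n)-1];       % two Gaussian clusters
d = [ones(m/2,1); -ones(m/2,1)];            % +1/-1 class labels
[w, gamma] = psvm(A, d, 1);                 % nu = 1
correctness = mean(sign(A*w - gamma) == d)  % training set correctness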
17 Linear PSVM Comparisons with Other SVMs: Much Faster, Comparable Correctness
18 Linear PSVM vs. LSVM: 2-Million Point Dataset, Over 30 Times Faster
19 Nonlinear PSVM: Spiral Dataset, 94 Red Dots and 94 White Dots
20 Nonlinear PSVM Comparisons
A rectangular kernel of size 8124 x 215 was used.
21 Conclusion
- PSVM is an extremely simple procedure for generating linear and nonlinear classifiers
- The PSVM classifier is obtained by solving a single system of linear equations, in the usually small dimensional input space for a linear classifier
- Comparable test set correctness to standard SVMs
- Much faster than standard SVMs, typically by an order of magnitude
22 Incremental PSVM Classification (Second SIAM Data Mining Conference)
23 Linear Incremental Proximal SVM Algorithm
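The algorithm itself was an image; a minimal MATLAB sketch of the blockwise accumulation it relies on (the block interface blocks{k}.A, blocks{k}.d is a hypothetical stand-in for reading the data in chunks):

function [w, gamma] = incrementalPsvm(blocks, nu, n)
% Accumulate H'*H and H'*D*e block by block; only the small
% (n+1)x(n+1) matrix M and (n+1)-vector v stay in memory.
M = zeros(n+1); v = zeros(n+1,1);
for k = 1:numel(blocks)
    Ak = blocks{k}.A; dk = blocks{k}.d;   % k-th data block and labels
    Hk = [Ak -ones(size(Ak,1),1)];
    M = M + Hk'*Hk;                       % accumulate H'*H
    v = v + (dk'*Hk)';                    % accumulate H'*D*e
end
r = (eye(n+1)/nu + M)\v;                  % same solve as batch PSVM
w = r(1:n); gamma = r(n+1);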
24 Linear Incremental Proximal SVM: Adding and Retiring Data
- Capable of modifying an existing linear classifier by both adding and retiring data
- Option of retiring old data is similar to adding new data (see the sketch after this list)
- Financial data: old data is obsolete
- Option of keeping old data and merging it with the new data
- Medical data: old data does not obsolesce
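Because M and v are sums over blocks, retiring is just subtraction; a continuation of the hypothetical sketch above (Aold/dold and Anew/dnew are illustrative names for the retired and added blocks):

Hold = [Aold -ones(size(Aold,1),1)];
M = M - Hold'*Hold;  v = v - (dold'*Hold)';   % retire oldest block
Hnew = [Anew -ones(size(Anew,1),1)];
M = M + Hnew'*Hnew;  v = v + (dnew'*Hnew)';   % add newest block
r = (eye(n+1)/nu + M)\v;                      % recompute classifier
w = r(1:n); gamma = r(n+1);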
25 Numerical Experiments: One-Billion Two-Class Dataset
- Synthetic dataset consisting of 1 billion points in 10-dimensional input space
- Generated by the NDC (Normally Distributed Clustered) dataset generator
- Dataset divided into 500 blocks of 2 million points each
- Solution obtained in less than 2 hours and 26 minutes
- About 30% of the time was spent reading data from disk
- Testing set correctness: 90.79%
26 Numerical Experiments: Simulation of a Two-Month 60-Million Point Dataset
- Synthetic dataset consisting of 60 million points (1 million per day) in 10-dimensional input space
- Generated using NDC
- At the beginning, we only have data corresponding to the first month
- Every day:
- The oldest block of data is retired (1 million points)
- A new block is added (1 million points)
- A new linear classifier is calculated daily
- Only an 11 by 11 matrix is kept in memory at the end of each day; all other data is purged
27 Numerical Experiments: Separator Changing Through Time
28 Numerical Experiments: Normals to the Separating Hyperplanes, Corresponding to 5-Day Intervals
29 Conclusion
- The proposed algorithm is an extremely simple procedure for generating linear classifiers in an incremental fashion for huge datasets
- The linear classifier is obtained by solving a single system of linear equations in the small dimensional input space
- The proposed algorithm has the ability to retire old data and add new data in a very simple manner
- Only a matrix of the size of the input space is kept in memory at any time
30 Support Vector Machines: Linear Programming Formulation
- Use the 1-norm instead of the 2-norm
- This is equivalent to the following linear program (reconstructed below)
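A reconstruction of the 1-norm SVM and its LP equivalent, as is standard in this line of work (the bound variable $s$ is the usual device for linearizing $\|w\|_1$):

$$\min_{w,\gamma,y}\ \nu e'y + \|w\|_1 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0,$$

which is equivalent to the linear program

$$\min_{w,\gamma,y,s}\ \nu e'y + e's \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad -s \le w \le s, \quad y \ge 0.$$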
31 Conventional Data-Based SVM
32 Knowledge-Based SVM via Polyhedral Knowledge Sets (NIPS 2002)
33 Incorporating Knowledge Sets Into an SVM Classifier
- We will show that this implication (reconstructed below) is equivalent to a set of constraints that can be imposed on the classification problem
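The implication referred to, reconstructed from the NIPS 2002 formulation: a polyhedral knowledge set $\{x : Bx \le b\}$ asserted to belong to class $+1$ must lie on the positive side of the bounding plane,

$$Bx \le b \ \Longrightarrow\ x'w \ge \gamma + 1.$$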
34 Knowledge Set Equivalence Theorem
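A statement reconstructed to be consistent with the NIPS 2002 paper: if the knowledge set $\{x : Bx \le b\}$ is nonempty, then

$$Bx \le b \ \Rightarrow\ x'w \ge \gamma + 1 \quad \Longleftrightarrow \quad \exists\, u \ge 0:\ \ B'u + w = 0,\ \ b'u + \gamma + 1 \le 0.$$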
35 Proof of Equivalence Theorem (via Nonhomogeneous Farkas or LP Duality)
Proof by LP duality (sketched below).
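A sketch of the duality argument: the implication holds exactly when $\min\{x'w : Bx \le b\} \ge \gamma + 1$. By LP duality,

$$\min\{x'w : Bx \le b\} = \max\{-b'u : B'u + w = 0,\ u \ge 0\},$$

so the implication holds iff some feasible $u \ge 0$ attains $-b'u \ge \gamma + 1$, i.e. $B'u + w = 0$ and $b'u + \gamma + 1 \le 0$.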
36 Knowledge-Based SVM Classification
37 Knowledge-Based SVM Classification
38 Parametrized Knowledge-Based LP
39 Numerical Testing: The Promoter Recognition Dataset
- Promoter: a short DNA sequence that precedes a gene sequence
- A promoter consists of 57 consecutive DNA nucleotides belonging to {A, G, C, T}
- It is important to distinguish between promoters and nonpromoters
- This distinction identifies starting locations of genes in long uncharacterized DNA sequences
40 The Promoter Recognition Dataset: Numerical Representation
- Simple 1-of-N mapping scheme for converting nominal attributes into a real-valued representation (example below)
- Not the most economical representation, but commonly used
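A minimal MATLAB sketch of the 1-of-N mapping (the nucleotide ordering A, G, C, T is an assumption; each of the 57 nominal values becomes 4 binary values):

seq = 'ATCG';                        % illustrative 4-nucleotide fragment
alphabet = 'AGCT';
x = zeros(1, 4*numel(seq));
for i = 1:numel(seq)
    j = find(alphabet == seq(i));    % position within {A,G,C,T}
    x(4*(i-1)+j) = 1;                % set the corresponding binary bit
end
disp(x)  % 1 0 0 0   0 0 0 1   0 0 1 0   0 1 0 0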
41 The Promoter Recognition Dataset: Numerical Representation
- Feature space mapped from the 57-dimensional nominal space to a real-valued 57 x 4 = 228 dimensional space
57 nominal values become 57 x 4 = 228 binary values
42 Promoter Recognition Dataset: Prior Knowledge Rules
- Prior knowledge consists of the following 64 rules
43 Promoter Recognition Dataset: Sample Rules
44 The Promoter Recognition Dataset: Comparative Algorithms
- KBANN: knowledge-based artificial neural network [Shavlik et al.]
- BP: standard backpropagation for neural networks [Rumelhart et al.]
- O'Neill's method: empirical method suggested by the biologist O'Neill [O'Neill]
- NN: nearest neighbor with k = 3 [Cost et al.]
- ID3: Quinlan's decision tree builder [Quinlan]
- SVM1: standard 1-norm SVM [Bradley et al.]
45 The Promoter Recognition Dataset: Comparative Test Results
46 Wisconsin Breast Cancer Prognosis Dataset: Description of the Data
- 110 instances corresponding to 41 patients whose cancer had recurred and 69 patients whose cancer had not recurred
- 32 numerical features
- The domain theory: two simple rules used by doctors
47 Wisconsin Breast Cancer Prognosis Dataset: Numerical Testing Results
- Doctors' rules applicable to only 32 out of 110 patients
- Only 22 of 32 patients are classified correctly by this rule (20% correctness over all 110 patients)
- KSVM linear classifier applicable to all patients, with correctness of 66.4%
- Correctness comparable to the best available results using conventional SVMs
- KSVM can produce classifiers based on knowledge alone, without using any data
48 Conclusion
- Prior knowledge is easily incorporated into classifiers through polyhedral knowledge sets
- The resulting problem is a simple LP
- Knowledge sets can be used with or without conventional labeled data
- In either case, KSVM is better than most knowledge-based classifiers
49 Breast Cancer Treatment Response (Joint with ExonHit, a French Biotech Company)
- 35 patients treated by a drug cocktail
- 9 partial responders; 26 nonresponders
- 25 gene expression measurements made on each patient
- 1-norm SVM classifier selected 12 out of the 25 genes
- Combinatorially selected 6 genes out of the 12
- Separating plane obtained: 2.7915 T11 + 0.13436 S24 - 1.0269 U23 - 2.8108 Z23 - 1.8668 A19 - 1.5177 X05 + 2899.1 = 0
- Leave-one-out error: 1 out of 35 (97.1% correctness)
50 Other Papers
- A Fast and Global Two Point Low Storage Optimization Technique for Tracing Rays in 2D and 3D Isotropic Media (Journal of Applied Geophysics)
- Semi-Supervised Support Vector Machines for Unlabeled Data Classification (Optimization Methods and Software)
- Select a small subset of an unlabeled dataset to be labeled by an oracle or expert
- Use the new labeled data and the remaining unlabeled data to train an SVM classifier
51 Other Papers
- Multicategory Proximal SVM Classifiers
- Fast multicategory algorithm based on PSVM
- Newton refinement step proposed
- Data Selection for SVM Classifiers (KDD 2000)
- Reduce the number of support vectors of a linear SVM
- Minimal Kernel Classifiers (JMLR)
- Use a concave minimization formulation to reduce the SVM model complexity
- Useful for online testing where testing time is an issue
52 Other Papers
- A Feature Selection Newton Method for SVM Classification
- LP SVM solved using a Newton method
- Very sparse solutions are obtained
- Finite Newton Method for Lagrangian SVM Classifiers (Neurocomputing Journal)
- Very fast performance, especially when n >> m