Title: Support Vector Machines
1Support Vector Machines
Session 8
Dr. N.B. Venkateswarlu AITAM, Tekkali
2Overview
- Background
- Linear Classifier
- SVM
- Margin
- Non-Linear SVM
- Kernel Functions
- Java Demo Applets
3Some Background
- In the machine learning context, an example is represented as a vector x = (x1, x2, ..., xn): each attribute is a dimension of the vector.
- The inner product (dot product) of two vectors x and z is defined as ⟨x, z⟩ = x·z = Σ xi zi.
4Linearly Separable Classes
5(No Transcript)
6Separating Planes
7Linear Classifiers
f(x,w,b) = sign(w·x + b)
(Scatter plot of +1 and -1 points with a candidate separating line: w·x + b > 0 on one side, w·x + b = 0 on the line, w·x + b < 0 on the other.)
How would you classify this data?
8Linear Classifiers
f(x,w,b) = sign(w·x + b)
How would you classify this data?
9Linear Classifiers
f(x,w,b) = sign(w·x + b)
How would you classify this data?
10Linear Classifiers
f(x,w,b) = sign(w·x + b)
Any of these would be fine... but which is best?
11Linear Classifiers
f(x,w,b) = sign(w·x + b)
How would you classify this data? (Misclassified to +1 class.)
12Classifier Margin
f(x,w,b) = sign(w·x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
13Some Background: The Perceptron
- Goal: find a plane in the n-dimensional input space that classifies the data.
- The trained classifier has the form yi = m·xi + b, where y is the class label predicted by the perceptron, x is the example (instance, vector) to be classified, m is the weight vector and b is the offset.
- Main idea: each attribute is assigned a weight (negative, zero or positive). The sum of these weights multiplied by the corresponding attribute values of the instance gives (more or less) the class tag. The final decision rule is: if yi > 0 then class positive, if yi < 0 then class negative (a small sketch follows).
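A minimal sketch of this decision rule together with the classic perceptron weight update, written in Python with numpy; the toy data, learning rate and epoch limit are illustrative assumptions, not values from the slides.

    import numpy as np

    # Toy linearly separable data: each row is an example, each column an attribute.
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])           # class labels

    m = np.zeros(X.shape[1])               # weight vector: one weight per attribute
    b = 0.0                                # offset
    eta = 1.0                              # learning rate (illustrative)

    for epoch in range(100):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(m, xi) + b) <= 0:   # misclassified (or on the boundary)
                m += eta * yi * xi              # move the plane towards the example
                b += eta * yi
                errors += 1
        if errors == 0:                         # converged: every point classified
            break

    # Final decision rule: class positive if m.x + b > 0, class negative if < 0.
    print(m, b, np.sign(X @ m + b))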
14Non-Linearly Separable Classes
15Probable Misclassifications
16SVMs
- Support Vector Machines
- To summarize: an SVM finds a hyperplane separating the training set in a feature space induced by a kernel function, which is used as the inner product in the algorithm.
- The solution of the margin optimization process is sparse in α, which means that only a few examples are effectively used in the classifier. These examples are the closest to the classifying boundary, so they SUPPORT this hyperplane. The vectors support the classifier, hence the name Support Vector Machine.
17Maximum Margin
f(x,w,b) = sign(w·x + b)
- Maximizing the margin is good according to intuition and PAC theory.
- It implies that only support vectors are important; other training examples are ignorable.
- Empirically it works very, very well.
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Support vectors are those datapoints that the margin pushes up against.
Linear SVM
18Linear SVM Mathematically
(Figure: the planes w·x + b = +1, w·x + b = 0 and w·x + b = -1, the "Predict Class = +1" zone, the "Predict Class = -1" zone, the margin width M, and the closest points x+ and x- on either side.)
- What we know:
- w·x+ + b = +1
- w·x- + b = -1
- w·(x+ - x-) = 2
- Hence the margin width is M = 2 / ||w||.
19Linear SVM Mathematically
- Goal: 1) Correctly classify all training data:
- w·xi + b ≥ +1 if yi = +1
- w·xi + b ≤ -1 if yi = -1
- i.e. yi (w·xi + b) ≥ 1 for all i
- 2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ w·w.
- We can formulate a Quadratic Optimization Problem and solve for w and b:
- Minimize Φ(w) = ½ w·w
- subject to yi (w·xi + b) ≥ 1 for all i
20Example of linear SVM
Support vectors
margin
21Support Vectors
22Solving the Optimization Problem
Find w and b such that Φ(w) = ½ w·w is minimized, and for all (xi, yi): yi (w·xi + b) ≥ 1.
- We need to optimize a quadratic function subject to linear constraints.
- Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.
- The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every constraint in the primal problem:
Find α1…αN such that Q(α) = Σ αi - ½ ΣΣ αi αj yi yj (xi·xj) is maximized and (1) Σ αi yi = 0, (2) αi ≥ 0 for all αi.
23The Optimization Problem Solution
- The solution has the form: w = Σ αi yi xi and b = yk - w·xk for any xk such that αk ≠ 0.
- Each non-zero αi indicates that the corresponding xi is a support vector.
- The classifying function then has the form f(x) = Σ αi yi (xi·x) + b (a small numerical sketch follows).
- Notice that it relies on an inner product between the test point x and the support vectors xi; we will return to this later.
- Also keep in mind that solving the optimization problem involved computing the inner products xi·xj between all pairs of training points.
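As a small numerical sketch of these formulas (not the author's code), consider a toy problem whose dual solution is known by symmetry: x1 = (1, 0) with y1 = +1 and x2 = (-1, 0) with y2 = -1, for which α1 = α2 = 0.5. The snippet recovers w and b and classifies new points.

    import numpy as np

    X = np.array([[1.0, 0.0], [-1.0, 0.0]])   # both points end up as support vectors
    y = np.array([1.0, -1.0])
    alpha = np.array([0.5, 0.5])              # dual solution for this toy problem

    # w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X

    # b = y_k - w.x_k for any k with alpha_k != 0
    k = int(np.argmax(alpha > 0))
    b = y[k] - w @ X[k]

    def f(x):
        # f(x) = sign(sum_i alpha_i y_i (x_i . x) + b) = sign(w.x + b)
        return np.sign(w @ x + b)

    print(w, b)                                                # -> [1. 0.] 0.0
    print(f(np.array([0.3, 2.0])), f(np.array([-2.0, 1.0])))   # -> 1.0 -1.0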
24Dataset with noise
- Hard Margin: so far we have required all data points to be classified correctly, i.e. no training error.
- What if the training set is noisy?
- Solution 1: use very powerful kernels
OVERFITTING!
25Soft Margin Classification
Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
What should our quadratic optimization criterion be? Minimize ½ w·w + C Σ ξi.
26Hard Margin vs. Soft Margin
- The old formulation:
Find w and b such that Φ(w) = ½ w·w is minimized, and for all (xi, yi): yi (w·xi + b) ≥ 1.
- The new formulation, incorporating slack variables:
Find w and b such that Φ(w) = ½ w·w + C Σ ξi is minimized, and for all (xi, yi): yi (w·xi + b) ≥ 1 - ξi with ξi ≥ 0 for all i.
- Parameter C can be viewed as a way to control overfitting (a small sketch of its effect follows).
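A minimal sketch of how C trades margin width against slack, assuming scikit-learn is available (the data are synthetic and the values of C are arbitrary illustrations):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    # Two noisy classes in 2-D.
    X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
    y = np.array([1] * 50 + [-1] * 50)

    # Large C: margin violations are expensive (behaves like a hard margin).
    # Small C: many slack variables are tolerated (wider, softer margin).
    for C in (100.0, 0.01):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, clf.n_support_, clf.score(X, y))

A larger C typically fits the training set more tightly; a smaller C gives a smoother, more regularised boundary.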
27Linear SVMs Overview
- The classifier is a separating hyperplane.
- The most important training points are the support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points xi are support vectors: those with non-zero Lagrange multipliers αi.
- Both in the dual formulation of the problem and in the solution, training points appear only inside dot products:
Find α1…αN such that Q(α) = Σ αi - ½ ΣΣ αi αj yi yj (xi·xj) is maximized and (1) Σ αi yi = 0, (2) 0 ≤ αi ≤ C for all αi.
f(x) = Σ αi yi (xi·x) + b
28Linear SVM for non-separable data
29Non-linear SVM for linearly non-separable data
30Non-linear SVMs
- Datasets that are linearly separable with some noise work out great.
- But what are we going to do if the dataset is just too hard?
- How about mapping the data to a higher-dimensional space?
31Non-linear SVMs Feature spaces
- General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
32Kernel methods approach
- The kernel methods approach is to stick with
linear functions but work in a high dimensional
feature space - The expectation is that the feature space has a
much higher dimension than the input space.
33Kernel methods
(Block diagram: data, kernel, subspace, pattern analysis algorithm, identified pattern.)
34Mapping
35Mapping is known as Kernel Functions
- Here is a training set in the input space I and the same training set in the feature space F. We go from a 2-D space to a 3-D space.
The training set is not linearly separable in the input space.
The training set is linearly separable in the feature space. This is called the Kernel Trick.
36Mapping is known as Kernel Functions
- Here is a training set in the input space I and the same training set in the feature space F. We go from a 2-D space to a 3-D space.
The training set is not linearly separable in the input space.
37Mapping is known as Kernel Functions
The training set is not linearly separable in the
input space.
38(No Transcript)
39Classes separable after mapping
40Example
- Consider a mapping of the input into a feature space.
- If we consider a linear equation in this feature space,
- we actually have an ellipse, i.e. a non-linear shape, in the input space.
41Capacity of feature spaces
- The capacity is proportional to the dimension; for example:
- 2-dim
42- If data are mapped into a space of sufficiently high dimension, they will always be linearly separable (N data points in N-1 dimensions or more).
- Problem: a linear separator in a space of d dimensions has d parameters, which raises the problem of overfitting.
- This is the reason for the maximal margin / optimal separator.
43Form of the functions
- So kernel methods use linear functions in a feature space.
- For regression this could be the function f(x) = ⟨w, φ(x)⟩ + b.
- For classification we additionally require thresholding, e.g. sign(f(x)).
44Problems of high dimensions
- Capacity may easily become too large and lead to overfitting: being able to realise every classifier means we are unlikely to generalise well.
- There are computational costs involved in dealing with large vectors.
45Kernel Functions
- To make the data linearly separable we could:
- Project the data from the input space to a new space called the feature space.
- Since this feature space has more dimensions than the input space, we could separate the data THERE,
- using the normal Adatron (linear SVM).
46Example of a polynomial kernel
- Degree-d polynomial: K(x, x') = (1 + ⟨x, x'⟩)^d.
- For a feature space with two inputs x1, x2 and a polynomial kernel of degree 2: K(x, x') = (1 + ⟨x, x'⟩)².
- Let h(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2), and define h(x') in the same way; then K(x, x') = ⟨h(x), h(x')⟩.
47Kernel Functions
- Let's use an example projection.
- The inner product of two vectors x and y projected into the space F becomes:
48(No Transcript)
49Kernel Functions
- But what happens if, instead of projecting the data, we just compute ⟨x, y⟩²?
- This means taking the input-space inner product and squaring it.
- It will actually lead to the feature-space inner product!
This is much less time consuming, as we are implicitly projecting the training set.
50Kernel Functions
- So we could define a kernel function as follows: it is the function that represents the inner product of some space in ANOTHER space.
- Some spaces are known only by their kernel function (i.e. their projection is UNKNOWN).
- This is the case for these kernel functions:
- Gaussian RBF kernel
- Sigmoid kernel
- An experimental kernel called KMOD
51(No Transcript)
52Radial Basis Functions
53Sigmoidal Function
54The Kernel Trick
- The linear classifier relies on the dot product between vectors: K(xi, xj) = xi·xj.
- If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(xi, xj) = φ(xi)·φ(xj).
- A kernel function is some function that corresponds to an inner product in some expanded feature space.
- Example: 2-dimensional vectors x = (x1, x2); let K(xi, xj) = (1 + xi·xj)².
- Need to show that K(xi, xj) = φ(xi)·φ(xj):
- K(xi, xj) = (1 + xi·xj)²
- = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
- = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2) · (1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2)
- = φ(xi)·φ(xj), where φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2). A numerical check of this identity is sketched below.
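A quick numerical check of this identity (a sketch, not from the slides): compute the kernel in the 2-D input space and the explicit dot product in the 6-D feature space, and confirm they agree.

    import numpy as np

    def phi(x):
        # Explicit feature map for the degree-2 polynomial kernel from the slide.
        x1, x2 = x
        return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2])

    def K(xi, xj):
        # Kernel computed directly in the input space.
        return (1.0 + xi @ xj) ** 2

    xi = np.array([0.4, -1.3])
    xj = np.array([2.0, 0.7])

    print(K(xi, xj))          # ~0.7921
    print(phi(xi) @ phi(xj))  # ~0.7921: the kernel computes this dot product implicitly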
55What Functions are Kernels?
- For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)·φ(xj) can be cumbersome.
- Mercer's theorem:
- Every positive semi-definite symmetric function is a kernel.
- Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix K, with Kij = K(xi, xj) (see the check sketched below).
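A hedged sketch of how one would check this condition empirically for a sample of points: build the Gram matrix and verify that it is symmetric with non-negative eigenvalues (up to round-off).

    import numpy as np

    X = np.random.RandomState(1).randn(20, 3)      # arbitrary sample of 20 points

    def poly2(xi, xj):
        # Candidate kernel: the degree-2 polynomial kernel discussed above.
        return (1.0 + xi @ xj) ** 2

    # Gram matrix G[i, j] = K(x_i, x_j)
    G = np.array([[poly2(xi, xj) for xj in X] for xi in X])

    # Mercer's condition in matrix form: G symmetric and positive semi-definite.
    eigvals = np.linalg.eigvalsh(G)
    print(np.allclose(G, G.T), eigvals.min() >= -1e-8)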
56Examples of Kernel Functions
- Linear: K(xi, xj) = xi·xj
- Polynomial of power p: K(xi, xj) = (1 + xi·xj)^p
- Gaussian (radial-basis function network): K(xi, xj) = exp(-||xi - xj||² / (2σ²))
- Sigmoid: K(xi, xj) = tanh(β0 xi·xj + β1)
(Simple implementations are sketched below.)
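Straightforward implementations of these four kernels (a sketch; the parameter values in the demo call are arbitrary):

    import numpy as np

    def linear_kernel(xi, xj):
        return xi @ xj

    def polynomial_kernel(xi, xj, p=3):
        return (1.0 + xi @ xj) ** p

    def gaussian_rbf_kernel(xi, xj, sigma=1.0):
        return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

    def sigmoid_kernel(xi, xj, beta0=0.1, beta1=-1.0):
        return np.tanh(beta0 * (xi @ xj) + beta1)

    xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    for k in (linear_kernel, polynomial_kernel, gaussian_rbf_kernel, sigmoid_kernel):
        print(k.__name__, k(xi, xj))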
57Non-linear SVMs Mathematically
- Dual problem formulation:
Find α1…αN such that Q(α) = Σ αi - ½ ΣΣ αi αj yi yj K(xi, xj) is maximized and (1) Σ αi yi = 0, (2) αi ≥ 0 for all αi.
- The solution is f(x) = Σ αi yi K(xi, x) + b (sketched below).
- Optimization techniques for finding the αi remain the same!
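A sketch of the kernelised decision function itself; the support vectors, multipliers αi and offset b below are made up purely to exercise the formula, not a trained solution.

    import numpy as np

    def rbf(xi, xj, sigma=1.0):
        return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

    def decision_function(x, sv_X, sv_y, alpha, b, kernel=rbf):
        # f(x) = sum_i alpha_i y_i K(x_i, x) + b, summed over the support vectors only.
        return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, sv_y, sv_X)) + b

    sv_X = np.array([[1.0, 1.0], [-1.0, -1.0]])   # illustrative support vectors
    sv_y = np.array([1.0, -1.0])
    alpha = np.array([0.8, 0.8])                  # illustrative multipliers
    b = 0.0

    x_new = np.array([0.9, 1.2])
    print(np.sign(decision_function(x_new, sv_X, sv_y, alpha, b)))   # -> 1.0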
58Nonlinear SVM - Overview
- The SVM locates a separating hyperplane in the feature space and classifies points in that space.
- It does not need to represent the space explicitly; it simply defines a kernel function.
- The kernel function plays the role of the dot product in the feature space.
59Properties of SVM
- Flexibility in choosing a similarity function
- Sparseness of the solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
- Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
- Overfitting can be controlled by the soft margin approach
- Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
- Feature selection
60SVM Applications
- SVM has been used successfully in many real-world
problems - - text (and hypertext) categorization
- - image classification
- - bioinformatics (Protein classification,
- Cancer classification)
- - hand-written character recognition
61Application 1: Cancer Classification
- High dimensional: p > 1000, n < 100
- Imbalanced: fewer positive samples
- Many irrelevant features
- Noisy
FEATURE SELECTION: in the linear case, wi² gives the ranking of dimension i.
SVM is sensitive to noisy (mis-labeled) data.
62-75(No Transcript)
76SVMs
- There are many ways to implement this optimization process.
- The Kernel-Adatron is one, and the simplest, since it is derived from the very well known perceptron.
- It is simple but the slowest (awfully, painfully slow), since it passes through all the examples MANY times (many epochs).
- The first approach used was Quadratic Programming, since this optimization problem is quadratic.
- Subject to the complex quadratic programming theory
- Many numerical issues due to the method
- An entire QP matrix for a sparse solution
- Rather slow as well
- Chunking chops the QP matrix into smaller chunks to gain speed from the sparseness. It does work, but sometimes the improvement is insignificant.
- SMO stands for Sequential Minimal Optimization. It is chunking taken to its finest grain: using heavy heuristics, points are optimized in pairs.
- Very fast
- Well documented, see John Platt on Google
(A minimal usage sketch of an SMO-based library follows.)
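For a concrete starting point, scikit-learn's SVC (a wrapper around LIBSVM, which uses an SMO-style solver) can be used roughly as below; this is a hedged sketch with synthetic data, not part of the original slides.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(100, 2) + [2, 0], rng.randn(100, 2) - [2, 0]])
    y = np.array([1] * 100 + [-1] * 100)

    clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # gamma corresponds to 1/(2*sigma^2)
    clf.fit(X, y)

    print(len(clf.support_), clf.score(X, y))   # number of support vectors, training accuracy
    print(clf.predict([[1.5, 0.3], [-2.5, 1.0]]))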
77Kernel function details
- The first kernel shown is called the polynomial kernel: K(x, x') = (⟨x, x'⟩ + b)^n, where n is its order (in our example n = 2) and b is called the lower-order term (in our example b = 0).
- Why do we need a lower-order term?
- Because, without it, the origin of the input space matches the origin of the feature space induced by the polynomial kernel.
- It serves as an offset, a shift of the feature-space origin away from the input-space origin.
- When given the choice, it is strongly suggested that you use it.
78Kernel function details
- The most popular (and most powerful) kernel function is the Gaussian RBF kernel: K(x, x') = exp(-||x - x'||² / (2σ²)).
It is a powerful kernel, as its effect is to create a small classification hyperball around each instance. This kernel doesn't have a projection formula, since its dimension is infinite (you can create as many balls as you want).
Here σ is a measure of the radius of the hyperball around an instance. You want this ball to be big enough that hyperballs connect with each other (pattern recognition), but not so big that they overlap the other class (see the sketch of σ's effect below).
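A tiny sketch of how σ controls the reach of each hyperball: for two fixed points at distance √2, the kernel value is essentially 0 when σ is small and close to 1 when σ is large (the values in the comments are approximate).

    import numpy as np

    def rbf(xi, xj, sigma):
        return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

    xi = np.array([0.0, 0.0])
    xj = np.array([1.0, 1.0])        # a neighbouring example at distance sqrt(2)

    for sigma in (0.1, 0.5, 1.0, 3.0):
        # sigma = 0.1 -> ~0.0  (balls far too small to reach the neighbour)
        # sigma = 3.0 -> ~0.89 (balls overlap heavily)
        print(sigma, round(rbf(xi, xj, sigma), 6))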
79Kernel function details
- The Gaussian RBF overfits badly:
- the Christmas Tree effect.
- If you select σ too small for the distance between your examples, you will create small balls around all your instances.
- You will get 100% accuracy on your training set. Wonderful!
- But only your examples are classified (the balls are too small).
- In my applet, enter a few examples manually (using the mouse). Make them sparse. Use the Gaussian RBF kernel with a significantly smaller σ than the proposed values. HERE is YOUR Christmas Tree: your examples created small colored balls, like Christmas tree balls.
- The Christmas Tree effect is a joke. But your classifier is a joke too: no classification of new instances is possible with it.
80Karush-Kuhn-Tucker conditions???
- Just some conditions that are necessary for a
solution in non-linear programming to be optimal
81Common Kernels
- Polynomial
- Radial basis function
- Sigmoid
82Some RBFs
83(No Transcript)
84(No Transcript)
85(No Transcript)
86Soft Margin Method
- The modified maximum margin idea allows for mislabeled examples during training.
- It still creates a hyperplane that separates as cleanly as possible.
- The hyperplane maximizes the distance to the nearest cleanly split examples.
87SVM_light
- SVM package
- It allows you to choose the type (linear vs non-linear) and/or the kernel function.
- Two executables: svm_learn and svm_classify
88SVM_light
- Input format: <class> <feature nr>:<feature value> <feature nr>:<feature value> ...
- Example:
- 1 1:0.7 2:0.3 3:0.5
- -1 1:1.2 2:0.6 3:0.9
(A small writer for this format is sketched below.)
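A small helper for writing data in this format (a sketch; the file name and arrays are illustrative). The resulting file can then be passed to svm_learn.

    def write_svmlight(filename, X, y):
        # One line per example: "<class> <feature nr>:<feature value> ..."
        with open(filename, "w") as f:
            for label, features in zip(y, X):
                pairs = " ".join(f"{j}:{v}" for j, v in enumerate(features, start=1))
                f.write(f"{int(label)} {pairs}\n")

    # The two examples from the slide.
    X = [[0.7, 0.3, 0.5], [1.2, 0.6, 0.9]]
    y = [1, -1]
    write_svmlight("train.dat", X, y)   # "train.dat" is just an example name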
89Mathematical background
- The support vector (SV) machine is a new type of
learning machine. It is based on statistical
learning theory.
90Objective
- Use linear support vector machines (SVMs) to
classify 2-D data.
91Background
- Suppose we want to find a decision function f with the property f(xi) = yi, ∀i. (1)
- In practice, a separating hyperplane often does not exist. To allow for the possibility of examples violating (1), the slack variables ξi ≥ 0, i = 1, ..., l (2) are introduced, to get yi (w·xi + b) ≥ 1 - ξi, i = 1, ..., l. (3)
92Background(2)
The SV approach to minimizing the guaranteed risk bound consists of the following. Minimize
½ w·w + C Σ ξi (4)
subject to the constraints (2) and (3). Introducing Lagrange multipliers αi and using the Kuhn-Tucker theorem of optimization theory, the solution can be shown to have the expansion
w = Σ αi yi xi (5)
with nonzero coefficients αi only where the corresponding example (xi, yi) precisely meets the constraint (3). These xi are called support vectors. All remaining examples of the training set are irrelevant.
93Background(3)
The constraint (3) is satisfied automatically for them (with ξi = 0), and they do not appear in the expansion (5). The coefficients αi are found by solving the following quadratic programming problem. Maximize
W(α) = Σ αi - ½ ΣΣ αi αj yi yj (xi·xj) (6)
subject to
0 ≤ αi ≤ C, i = 1, ..., l, and Σ αi yi = 0. (7)
By linearity of the dot product, the decision function can be written as
f(x) = sign(Σ αi yi (x·xi) + b). (8)
94Background (4)
To allow for much more general decision surfaces, one can first nonlinearly transform a set of input vectors x1, ..., xl into a high-dimensional feature space. The decision function becomes
f(x) = sign(Σ αi yi K(x, xi) + b), (9)
where the RBF kernels are
K(x, xi) = exp(-||x - xi||² / (2σ²)). (10)
95Principal Component Analysis
- Given N data vectors in k dimensions, find c < k orthogonal vectors that can best be used to represent the data.
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions).
- Each data vector is a linear combination of the c principal component vectors.
- Works for numeric data only.
- Used when the number of dimensions is large.
96Principal Component Analysis
97Principal Component Analysis
- Aimed at finding a new coordinate system which has certain desirable characteristics. Example:
- Mean M = (4.5, 4.25)
- Covariance matrix = [[2.57, 1.86], [1.86, 6.21]]
- Eigenvalues: 6.99, 1.79
- Eigenvectors: (0.387, 0.922) and (-0.922, 0.387)
(These numbers are reproduced in the sketch below.)
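The eigenvalues and eigenvectors above can be reproduced with a few lines of numpy (only the covariance matrix from the slide is used; eigenvector signs may differ, which is immaterial).

    import numpy as np

    cov = np.array([[2.57, 1.86],
                    [1.86, 6.21]])      # covariance matrix from the slide

    # Eigen-decomposition of a symmetric matrix (eigenvalues in ascending order).
    eigvals, eigvecs = np.linalg.eigh(cov)

    print(eigvals[::-1])       # ~ [6.99, 1.79]
    print(eigvecs[:, ::-1].T)  # rows ~ (0.387, 0.922) and (-0.922, 0.387), up to sign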
98(No Transcript)
99However, in some cases it is not possible to make PCA work.
100Canonical Analysis
101- Unlike PCA, which uses the global mean and covariance, this takes the between-group and within-group covariance matrices and then calculates the canonical axes.
102(No Transcript)
103Underfitting and overfitting
104Non-Separable Training Sets
105SVMs work badly with outliers