Title: Support Vector Machines
1Support Vector Machines
Session 8
Dr. N.B. Venkateswarlu AITAM, Tekkali
2Overview
- Background
- Linear Classifier
- SVM
- Margin
- Non-Linear SVM
- Kernel Functions
- Java Demo Applets
3Some Background
- In the machine learning context, an example is represented as a vector x = (x1, x2, ..., xn): each attribute is a dimension of the vector.
- The inner product (dot product) of two vectors x and z is defined as ⟨x, z⟩ = x·z = Σ xi zi.
4Linearly Separable Classes
5(No Transcript)
6Separating Planes
7Linear Classifiers
f(x,w,b) = sign(w·x + b)
(Scatter plot of +1 and -1 points with a candidate separating line: w·x + b > 0 on one side, w·x + b = 0 on the line, w·x + b < 0 on the other.)
How would you classify this data?
8Linear Classifiers
f(x,w,b) = sign(w·x + b)
How would you classify this data?
9Linear Classifiers
f(x,w,b) = sign(w·x + b)
How would you classify this data?
10Linear Classifiers
f(x,w,b) = sign(w·x + b)
Any of these would be fine... but which is best?
11Linear Classifiers
f(x,w,b) = sign(w·x + b)
How would you classify this data? (Misclassified to +1 class.)
12Classifier Margin
f(x,w,b) = sign(w·x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
13Some Background: The Perceptron
- Goal: find a plane in the n-dimensional input space that classifies the data.
- The trained classifier has the form yi = m·xi + b, where y is the class label predicted by the perceptron, x is the example (instance, vector) to be classified, m is the weight vector and b is the offset.
- Main idea: each attribute is assigned a weight (negative, zero or positive). The sum of these weights multiplied by the corresponding attribute values of the instance gives (more or less) the class tag. The final decision rule is: if yi > 0 then class positive, if yi < 0 then class negative (a small sketch follows).
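A minimal sketch of this decision rule together with the classic perceptron weight update, written in Python with numpy; the toy data, learning rate and epoch limit are illustrative assumptions, not values from the slides.

    import numpy as np

    # Toy linearly separable data: each row is an example, each column an attribute.
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])           # class labels

    m = np.zeros(X.shape[1])               # weight vector: one weight per attribute
    b = 0.0                                # offset
    eta = 1.0                              # learning rate (illustrative)

    for epoch in range(100):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(m, xi) + b) <= 0:   # misclassified (or on the boundary)
                m += eta * yi * xi              # move the plane towards the example
                b += eta * yi
                errors += 1
        if errors == 0:                         # converged: every point classified
            break

    # Final decision rule: class positive if m.x + b > 0, class negative if < 0.
    print(m, b, np.sign(X @ m + b))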
14Non-Linearly Separable Classes
15Probable Misclassifications
16SVMs
- Support Vector Machines
- To summarize: an SVM finds a hyperplane separating the training set in a feature space induced by a kernel function, which is used as the inner product in the algorithm.
- The solution of the margin optimization process is sparse in α, which means that only a few examples are effectively used in the classifier. These examples are the closest to the classifying boundary, so they SUPPORT this hyperplane. The vectors support the classifier, hence the name Support Vector Machine.
17Maximum Margin
f(x,w,b) = sign(w·x + b)
- Maximizing the margin is good according to intuition and PAC theory.
- It implies that only support vectors are important; other training examples are ignorable.
- Empirically it works very, very well.
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Support vectors are those datapoints that the margin pushes up against.
Linear SVM
18Linear SVM Mathematically
(Figure: the planes w·x + b = +1, w·x + b = 0 and w·x + b = -1, the "Predict Class = +1" zone, the "Predict Class = -1" zone, the margin width M, and the closest points x+ and x- on either side.)
- What we know:
- w·x+ + b = +1
- w·x- + b = -1
- w·(x+ - x-) = 2
- Hence the margin width is M = 2 / ||w||.
19Linear SVM Mathematically
- Goal: 1) Correctly classify all training data:
- w·xi + b ≥ +1 if yi = +1
- w·xi + b ≤ -1 if yi = -1
- i.e. yi (w·xi + b) ≥ 1 for all i
- 2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ w·w.
- We can formulate a Quadratic Optimization Problem and solve for w and b:
- Minimize Φ(w) = ½ w·w
- subject to yi (w·xi + b) ≥ 1 for all i
20Example of linear SVM
Support vectors
margin
21Support Vectors
22Solving the Optimization Problem
Find w and b such that Φ(w) = ½ w·w is minimized, and for all (xi, yi): yi (w·xi + b) ≥ 1.
- We need to optimize a quadratic function subject to linear constraints.
- Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.
- The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every constraint in the primal problem:
Find α1…αN such that Q(α) = Σ αi - ½ ΣΣ αi αj yi yj (xi·xj) is maximized and (1) Σ αi yi = 0, (2) αi ≥ 0 for all αi.
23The Optimization Problem Solution
- The solution has the form: w = Σ αi yi xi and b = yk - w·xk for any xk such that αk ≠ 0.
- Each non-zero αi indicates that the corresponding xi is a support vector.
- The classifying function then has the form f(x) = Σ αi yi (xi·x) + b (a small numerical sketch follows).
- Notice that it relies on an inner product between the test point x and the support vectors xi; we will return to this later.
- Also keep in mind that solving the optimization problem involved computing the inner products xi·xj between all pairs of training points.
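As a small numerical sketch of these formulas (not the author's code), consider a toy problem whose dual solution is known by symmetry: x1 = (1, 0) with y1 = +1 and x2 = (-1, 0) with y2 = -1, for which α1 = α2 = 0.5. The snippet recovers w and b and classifies new points.

    import numpy as np

    X = np.array([[1.0, 0.0], [-1.0, 0.0]])   # both points end up as support vectors
    y = np.array([1.0, -1.0])
    alpha = np.array([0.5, 0.5])              # dual solution for this toy problem

    # w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X

    # b = y_k - w.x_k for any k with alpha_k != 0
    k = int(np.argmax(alpha > 0))
    b = y[k] - w @ X[k]

    def f(x):
        # f(x) = sign(sum_i alpha_i y_i (x_i . x) + b) = sign(w.x + b)
        return np.sign(w @ x + b)

    print(w, b)                                                # -> [1. 0.] 0.0
    print(f(np.array([0.3, 2.0])), f(np.array([-2.0, 1.0])))   # -> 1.0 -1.0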
24Dataset with noise
- Hard Margin: so far we have required all data points to be classified correctly, i.e. no training error.
- What if the training set is noisy?
- Solution 1: use very powerful kernels
OVERFITTING!
25Soft Margin Classification
Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
What should our quadratic optimization criterion be? Minimize ½ w·w + C Σ ξi.
26Hard Margin vs. Soft Margin
- The old formulation:
Find w and b such that Φ(w) = ½ w·w is minimized, and for all (xi, yi): yi (w·xi + b) ≥ 1.
- The new formulation, incorporating slack variables:
Find w and b such that Φ(w) = ½ w·w + C Σ ξi is minimized, and for all (xi, yi): yi (w·xi + b) ≥ 1 - ξi with ξi ≥ 0 for all i.
- Parameter C can be viewed as a way to control overfitting (a small sketch of its effect follows).
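A minimal sketch of how C trades margin width against slack, assuming scikit-learn is available (the data are synthetic and the values of C are arbitrary illustrations):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    # Two noisy classes in 2-D.
    X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
    y = np.array([1] * 50 + [-1] * 50)

    # Large C: margin violations are expensive (behaves like a hard margin).
    # Small C: many slack variables are tolerated (wider, softer margin).
    for C in (100.0, 0.01):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, clf.n_support_, clf.score(X, y))

A larger C typically fits the training set more tightly; a smaller C gives a smoother, more regularised boundary.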
27Linear SVMs Overview
- The classifier is a separating hyperplane.
- The most important training points are the support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points xi are support vectors: those with non-zero Lagrange multipliers αi.
- Both in the dual formulation of the problem and in the solution, training points appear only inside dot products:
Find α1…αN such that Q(α) = Σ αi - ½ ΣΣ αi αj yi yj (xi·xj) is maximized and (1) Σ αi yi = 0, (2) 0 ≤ αi ≤ C for all αi.
f(x) = Σ αi yi (xi·x) + b
28Linear SVM for non-separable data
29Non-linear SVM for linearly non-separable data
30Non-linear SVMs
- Datasets that are linearly separable with some noise work out great.
- But what are we going to do if the dataset is just too hard?
- How about mapping the data to a higher-dimensional space?
31Non-linear SVMs Feature spaces
- General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
32Kernel methods approach
- The kernel methods approach is to stick with
linear functions but work in a high dimensional
feature space - The expectation is that the feature space has a
much higher dimension than the input space.
33Kernel methods
(Block diagram: data, kernel, subspace, pattern analysis algorithm, identified pattern.)
34Mapping
35Mapping is known as Kernel Functions
- Here is a training set in the input space I and the same training set in the feature space F. We go from a 2-D space to a 3-D space.
The training set is not linearly separable in the input space.
The training set is linearly separable in the feature space. This is called the Kernel Trick.
36Mapping is known as Kernel Functions
- Here is a training set in the input space I and the same training set in the feature space F. We go from a 2-D space to a 3-D space.
The training set is not linearly separable in the input space.
37Mapping is known as Kernel Functions
The training set is not linearly separable in the
input space.
38(No Transcript)
39Classes separable after mapping
40Example
- Consider a mapping of the input into a feature space.
- If we consider a linear equation in this feature space,
- we actually have an ellipse, i.e. a non-linear shape, in the input space.
41Capacity of feature spaces
- The capacity is proportional to the dimension; for example:
- 2-dim
42- If data are mapped into a space of sufficiently high dimension, they will always be linearly separable (N data points in N-1 dimensions or more).
- Problem: a linear separator in a space of d dimensions has d parameters, which raises the problem of overfitting.
- This is the reason for the maximal margin / optimal separator.
43Form of the functions
- So kernel methods use linear functions in a feature space.
- For regression this could be the function f(x) = ⟨w, φ(x)⟩ + b.
- For classification we additionally require thresholding, e.g. sign(f(x)).
44Problems of high dimensions
- Capacity may easily become too large and lead to overfitting: being able to realise every classifier means we are unlikely to generalise well.
- There are computational costs involved in dealing with large vectors.
45Kernel Functions
- To make the data linearly separable we could:
- Project the data from the input space to a new space called the feature space.
- Since this feature space has more dimensions than the input space, we could separate the data THERE,
- using the normal Adatron (linear SVM).
46Example of a polynomial kernel
- Degree-d polynomial: K(x, x') = (1 + ⟨x, x'⟩)^d.
- For a feature space with two inputs x1, x2 and a polynomial kernel of degree 2: K(x, x') = (1 + ⟨x, x'⟩)².
- Let h(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2), and define h(x') in the same way; then K(x, x') = ⟨h(x), h(x')⟩.
47Kernel Functions
- Let's use an example projection.
- The inner product of two vectors x and y projected into the space F becomes:
48(No Transcript)
49Kernel Functions
- But what happens if, instead of projecting the data, we just compute ⟨x, y⟩²?
- This means taking the input-space inner product and squaring it.
- It will actually lead to the feature-space inner product!
This is much less time consuming, as we are implicitly projecting the training set.
50Kernel Functions
- So we could define a kernel function as follows: it is the function that represents the inner product of some space in ANOTHER space.
- Some spaces are known only by their kernel function (i.e. their projection is UNKNOWN).
- This is the case for these kernel functions:
- Gaussian RBF kernel
- Sigmoid kernel
- An experimental kernel called KMOD
51(No Transcript)
52Radial Basis Functions
53Sigmoidal Function
54The Kernel Trick
- The linear classifier relies on the dot product between vectors: K(xi, xj) = xi·xj.
- If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(xi, xj) = φ(xi)·φ(xj).
- A kernel function is some function that corresponds to an inner product in some expanded feature space.
- Example: 2-dimensional vectors x = (x1, x2); let K(xi, xj) = (1 + xi·xj)².
- Need to show that K(xi, xj) = φ(xi)·φ(xj):
- K(xi, xj) = (1 + xi·xj)²
- = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
- = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2) · (1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2)
- = φ(xi)·φ(xj), where φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2). A numerical check of this identity is sketched below.
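A quick numerical check of this identity (a sketch, not from the slides): compute the kernel in the 2-D input space and the explicit dot product in the 6-D feature space, and confirm they agree.

    import numpy as np

    def phi(x):
        # Explicit feature map for the degree-2 polynomial kernel from the slide.
        x1, x2 = x
        return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2])

    def K(xi, xj):
        # Kernel computed directly in the input space.
        return (1.0 + xi @ xj) ** 2

    xi = np.array([0.4, -1.3])
    xj = np.array([2.0, 0.7])

    print(K(xi, xj))          # ~0.7921
    print(phi(xi) @ phi(xj))  # ~0.7921: the kernel computes this dot product implicitly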
55What Functions are Kernels?
- For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)·φ(xj) can be cumbersome.
- Mercer's theorem:
- Every positive semi-definite symmetric function is a kernel.
- Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix K, with Kij = K(xi, xj) (see the check sketched below).
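A hedged sketch of how one would check this condition empirically for a sample of points: build the Gram matrix and verify that it is symmetric with non-negative eigenvalues (up to round-off).

    import numpy as np

    X = np.random.RandomState(1).randn(20, 3)      # arbitrary sample of 20 points

    def poly2(xi, xj):
        # Candidate kernel: the degree-2 polynomial kernel discussed above.
        return (1.0 + xi @ xj) ** 2

    # Gram matrix G[i, j] = K(x_i, x_j)
    G = np.array([[poly2(xi, xj) for xj in X] for xi in X])

    # Mercer's condition in matrix form: G symmetric and positive semi-definite.
    eigvals = np.linalg.eigvalsh(G)
    print(np.allclose(G, G.T), eigvals.min() >= -1e-8)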
56Examples of Kernel Functions
- Linear: K(xi, xj) = xi·xj
- Polynomial of power p: K(xi, xj) = (1 + xi·xj)^p
- Gaussian (radial-basis function network): K(xi, xj) = exp(-||xi - xj||² / (2σ²))
- Sigmoid: K(xi, xj) = tanh(β0 xi·xj + β1)
(Simple implementations are sketched below.)
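Straightforward implementations of these four kernels (a sketch; the parameter values in the demo call are arbitrary):

    import numpy as np

    def linear_kernel(xi, xj):
        return xi @ xj

    def polynomial_kernel(xi, xj, p=3):
        return (1.0 + xi @ xj) ** p

    def gaussian_rbf_kernel(xi, xj, sigma=1.0):
        return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

    def sigmoid_kernel(xi, xj, beta0=0.1, beta1=-1.0):
        return np.tanh(beta0 * (xi @ xj) + beta1)

    xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    for k in (linear_kernel, polynomial_kernel, gaussian_rbf_kernel, sigmoid_kernel):
        print(k.__name__, k(xi, xj))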
57Non-linear SVMs Mathematically
- Dual problem formulation:
Find α1…αN such that Q(α) = Σ αi - ½ ΣΣ αi αj yi yj K(xi, xj) is maximized and (1) Σ αi yi = 0, (2) αi ≥ 0 for all αi.
- The solution is f(x) = Σ αi yi K(xi, x) + b (sketched below).
- Optimization techniques for finding the αi remain the same!
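A sketch of the kernelised decision function itself; the support vectors, multipliers αi and offset b below are made up purely to exercise the formula, not a trained solution.

    import numpy as np

    def rbf(xi, xj, sigma=1.0):
        return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

    def decision_function(x, sv_X, sv_y, alpha, b, kernel=rbf):
        # f(x) = sum_i alpha_i y_i K(x_i, x) + b, summed over the support vectors only.
        return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, sv_y, sv_X)) + b

    sv_X = np.array([[1.0, 1.0], [-1.0, -1.0]])   # illustrative support vectors
    sv_y = np.array([1.0, -1.0])
    alpha = np.array([0.8, 0.8])                  # illustrative multipliers
    b = 0.0

    x_new = np.array([0.9, 1.2])
    print(np.sign(decision_function(x_new, sv_X, sv_y, alpha, b)))   # -> 1.0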
58Nonlinear SVM - Overview
- The SVM locates a separating hyperplane in the feature space and classifies points in that space.
- It does not need to represent the space explicitly; it simply defines a kernel function.
- The kernel function plays the role of the dot product in the feature space.
59Properties of SVM
- Flexibility in choosing a similarity function
- Sparseness of the solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
- Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
- Overfitting can be controlled by the soft margin approach
- Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
- Feature selection
60SVM Applications
- SVM has been used successfully in many real-world
problems - - text (and hypertext) categorization
- - image classification
- - bioinformatics (Protein classification,
- Cancer classification)
- - hand-written character recognition
61Application 1: Cancer Classification
- High dimensional: p > 1000, n < 100
- Imbalanced: fewer positive samples
- Many irrelevant features
- Noisy
FEATURE SELECTION: in the linear case, wi² gives the ranking of dimension i.
SVM is sensitive to noisy (mis-labeled) data.
62-75(No Transcript)
76SVMs
- There are many ways to implement this optimization process.
- The Kernel-Adatron is one, and the simplest, since it is derived from the very well known perceptron.
- It is simple but the slowest (awfully, painfully slow), since it passes through all the examples MANY times (many epochs).
- The first approach used was Quadratic Programming, since this optimization problem is quadratic.
- Subject to the complex quadratic programming theory
- Many numerical issues due to the method
- An entire QP matrix for a sparse solution
- Rather slow as well
- Chunking chops the QP matrix into smaller chunks to gain speed from the sparseness. It does work, but sometimes the improvement is insignificant.
- SMO stands for Sequential Minimal Optimization. It is chunking taken to its finest grain: using heavy heuristics, points are optimized in pairs.
- Very fast
- Well documented, see John Platt on Google
(A minimal usage sketch of an SMO-based library follows.)
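For a concrete starting point, scikit-learn's SVC (a wrapper around LIBSVM, which uses an SMO-style solver) can be used roughly as below; this is a hedged sketch with synthetic data, not part of the original slides.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(100, 2) + [2, 0], rng.randn(100, 2) - [2, 0]])
    y = np.array([1] * 100 + [-1] * 100)

    clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # gamma corresponds to 1/(2*sigma^2)
    clf.fit(X, y)

    print(len(clf.support_), clf.score(X, y))   # number of support vectors, training accuracy
    print(clf.predict([[1.5, 0.3], [-2.5, 1.0]]))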
77Kernel function details
- The first kernel shown is called the polynomial kernel: K(x, x') = (⟨x, x'⟩ + b)^n, where n is its order (in our example n = 2) and b is called the lower-order term (in our example b = 0).
- Why do we need a lower-order term?
- Because, without it, the origin of the input space matches the origin of the feature space induced by the polynomial kernel.
- It serves as an offset, a shift of the feature-space origin away from the input-space origin.
- When given the choice, it is strongly suggested that you use it.
78Kernel function details
- The most popular (and most powerful) kernel function is the Gaussian RBF kernel: K(x, x') = exp(-||x - x'||² / (2σ²)).
It is a powerful kernel, as its effect is to create a small classification hyperball around each instance. This kernel doesn't have a projection formula, since its dimension is infinite (you can create as many balls as you want).
Here σ is a measure of the radius of the hyperball around an instance. You want this ball to be big enough that hyperballs connect with each other (pattern recognition), but not so big that they overlap the other class (see the sketch of σ's effect below).
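A tiny sketch of how σ controls the reach of each hyperball: for two fixed points at distance √2, the kernel value is essentially 0 when σ is small and close to 1 when σ is large (the values in the comments are approximate).

    import numpy as np

    def rbf(xi, xj, sigma):
        return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

    xi = np.array([0.0, 0.0])
    xj = np.array([1.0, 1.0])        # a neighbouring example at distance sqrt(2)

    for sigma in (0.1, 0.5, 1.0, 3.0):
        # sigma = 0.1 -> ~0.0  (balls far too small to reach the neighbour)
        # sigma = 3.0 -> ~0.89 (balls overlap heavily)
        print(sigma, round(rbf(xi, xj, sigma), 6))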
79Kernel function details
- The Gaussian RBF overfits badly:
- the Christmas Tree effect.
- If you select σ too small for the distance between your examples, you will create small balls around all your instances.
- You will get 100% accuracy on your training set. Wonderful!
- But only your examples are classified (the balls are too small).
- In my applet, enter a few examples manually (using the mouse). Make them sparse. Use the Gaussian RBF kernel with a significantly smaller σ than the proposed values. HERE is YOUR Christmas Tree: your examples created small colored balls, like Christmas tree balls.
- The Christmas Tree effect is a joke. But your classifier is a joke too: no classification of new instances is possible with it.
80Karush-Kuhn-Tucker conditions???
- Just some conditions that are necessary for a
solution in non-linear programming to be optimal
81Common Kernels
- Polynomial
- Radial basis function
- Sigmoid
82Some RBFs
83(No Transcript)
84(No Transcript)
85(No Transcript)
86Soft Margin Method
- The modified maximum margin idea allows for mislabeled examples during training.
- It still creates a hyperplane that separates as cleanly as possible.
- The hyperplane maximizes the distance to the nearest cleanly split examples.
87SVM_light
- SVM package
- It allows you to choose the type (linear vs non-linear) and/or the kernel function.
- Two executables: svm_learn and svm_classify
88SVM_light
- Input format: <class> <feature nr>:<feature value> <feature nr>:<feature value> ...
- Example:
- 1 1:0.7 2:0.3 3:0.5
- -1 1:1.2 2:0.6 3:0.9
(A small writer for this format is sketched below.)
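A small helper for writing data in this format (a sketch; the file name and arrays are illustrative). The resulting file can then be passed to svm_learn.

    def write_svmlight(filename, X, y):
        # One line per example: "<class> <feature nr>:<feature value> ..."
        with open(filename, "w") as f:
            for label, features in zip(y, X):
                pairs = " ".join(f"{j}:{v}" for j, v in enumerate(features, start=1))
                f.write(f"{int(label)} {pairs}\n")

    # The two examples from the slide.
    X = [[0.7, 0.3, 0.5], [1.2, 0.6, 0.9]]
    y = [1, -1]
    write_svmlight("train.dat", X, y)   # "train.dat" is just an example name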
89Mathematical background
- The support vector (SV) machine is a new type of
learning machine. It is based on statistical
learning theory.
90Objective
- Use linear support vector machines (SVMs) to
classify 2-D data.
91Background
- Suppose we want to find a decision function f with the property f(xi) = yi, ∀i. (1)
- In practice, a separating hyperplane often does not exist. To allow for the possibility of examples violating (1), the slack variables ξi ≥ 0, i = 1, ..., l (2) are introduced, to get yi (w·xi + b) ≥ 1 - ξi, i = 1, ..., l. (3)
92Background(2)
The SV approach to minimizing the guaranteed risk bound consists of the following. Minimize
½ w·w + C Σ ξi (4)
subject to the constraints (2) and (3). Introducing Lagrange multipliers αi and using the Kuhn-Tucker theorem of optimization theory, the solution can be shown to have the expansion
w = Σ αi yi xi (5)
with nonzero coefficients αi only where the corresponding example (xi, yi) precisely meets the constraint (3). These xi are called support vectors. All remaining examples of the training set are irrelevant.
93Background(3)
The constraint (3) is satisfied automatically for them (with ξi = 0), and they do not appear in the expansion (5). The coefficients αi are found by solving the following quadratic programming problem. Maximize
W(α) = Σ αi - ½ ΣΣ αi αj yi yj (xi·xj) (6)
subject to
0 ≤ αi ≤ C, i = 1, ..., l, and Σ αi yi = 0. (7)
By linearity of the dot product, the decision function can be written as
f(x) = sign(Σ αi yi (x·xi) + b). (8)
94Background (4)
To allow for much more general decision surfaces, one can first nonlinearly transform a set of input vectors x1, ..., xl into a high-dimensional feature space. The decision function becomes
f(x) = sign(Σ αi yi K(x, xi) + b), (9)
where the RBF kernels are
K(x, xi) = exp(-||x - xi||² / (2σ²)). (10)
95Principal Component Analysis
- Given N data vectors in k dimensions, find c < k orthogonal vectors that can best be used to represent the data.
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions).
- Each data vector is a linear combination of the c principal component vectors.
- Works for numeric data only.
- Used when the number of dimensions is large.
96Principal Component Analysis
97Principal Component Analysis
- Aimed at finding a new coordinate system which has certain desirable characteristics. Example:
- Mean M = (4.5, 4.25)
- Covariance matrix = [[2.57, 1.86], [1.86, 6.21]]
- Eigenvalues: 6.99, 1.79
- Eigenvectors: (0.387, 0.922) and (-0.922, 0.387)
(These numbers are reproduced in the sketch below.)
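The eigenvalues and eigenvectors above can be reproduced with a few lines of numpy (only the covariance matrix from the slide is used; eigenvector signs may differ, which is immaterial).

    import numpy as np

    cov = np.array([[2.57, 1.86],
                    [1.86, 6.21]])      # covariance matrix from the slide

    # Eigen-decomposition of a symmetric matrix (eigenvalues in ascending order).
    eigvals, eigvecs = np.linalg.eigh(cov)

    print(eigvals[::-1])       # ~ [6.99, 1.79]
    print(eigvecs[:, ::-1].T)  # rows ~ (0.387, 0.922) and (-0.922, 0.387), up to sign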
98(No Transcript)
99However, in some cases it is not possible to make PCA work.
100Canonical Analysis
101- Unlike PCA, which uses the global mean and covariance, this takes the between-group and within-group covariance matrices and then calculates the canonical axes.
102(No Transcript)
103Underfitting and overfitting
104Non-Separable Training Sets
105SVMs work badly with outliers