Title: Introduction to SVMs
1 Introduction to SVMs
2 SVMs
- Geometric
- Maximizing Margin
- Kernel Methods
- Making nonlinear decision boundaries linear
- Efficiently!
- Capacity
- Structural Risk Minimization
3 Linear Classifiers
(Figure: input x is fed to classifier f, which outputs the estimate y_est; one symbol denotes +1, the other denotes -1)
f(x, w, b) = sign(w·x - b)
How would you classify this data?
4 Linear Classifiers
(Same figure with a different candidate decision boundary)
How would you classify this data?
5 Linear Classifiers
(Same figure with a different candidate decision boundary)
How would you classify this data?
6 Linear Classifiers
(Same figure with a different candidate decision boundary)
How would you classify this data?
7 Linear Classifiers
(Figure: x → f → y_est; several candidate separating lines drawn through the data)
f(x, w, b) = sign(w·x - b)
Any of these would be fine... but which is best?
8 Classifier Margin
(Figure: x → f → y_est; the margin shown as a band around the boundary)
f(x, w, b) = sign(w·x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
9 Maximum Margin
(Figure: x → f → y_est; the maximum-margin separating line)
f(x, w, b) = sign(w·x - b)
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
10 Maximum Margin
(Figure: x → f → y_est; support vectors highlighted on the margin)
f(x, w, b) = sign(w·x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are the datapoints that the margin pushes up against.
11 Why Maximum Margin?
- Intuitively this feels safest.
- If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
- There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
- Empirically it works very, very well.
12 A Good Separator
(Figure: O and X points with a separating line)
13 Noise in the Observations
(Figure: the same O and X points, with noise around each observation)
14 Ruling Out Some Separators
(Figure: separators that cut through the noise regions are ruled out)
15 Lots of Noise
(Figure: larger noise regions rule out more candidate separators)
16 Maximizing the Margin
(Figure: the separator with the maximum margin to the nearest O and X points)
17 Specifying a line and margin
(Figure: plus-plane, classifier boundary and minus-plane, with the "Predict Class = +1" and "Predict Class = -1" zones)
- How do we represent this mathematically?
- ...in m input dimensions?
18 Specifying a line and margin
(Figure: plus-plane w·x + b = +1, classifier boundary w·x + b = 0, minus-plane w·x + b = -1, with the "Predict Class = +1" and "Predict Class = -1" zones)
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
Classify as:
+1 if w·x + b ≥ 1
-1 if w·x + b ≤ -1
Universe explodes if -1 < w·x + b < 1
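As a side illustration (not from the slides), here is a minimal Python sketch of this decision rule. It uses the w·x + b convention of the plane equations above (the slides write the classifier as sign(w·x - b), a sign-convention difference on b); the weights and sample points are made-up values.

```python
import numpy as np

def predict(x, w, b):
    """Classify x as +1 if w.x + b >= 1, -1 if w.x + b <= -1.
    Points with -1 < w.x + b < 1 fall inside the margin ("universe explodes")."""
    s = np.dot(w, x) + b
    if s >= 1:
        return +1
    if s <= -1:
        return -1
    return 0  # inside the margin: the hard-margin classifier makes no promise here

# Hypothetical values purely for illustration
w = np.array([2.0, -1.0])
b = 0.5
print(predict(np.array([1.0, 0.0]), w, b))   # w.x + b = 2.5  -> +1
print(predict(np.array([-1.0, 0.5]), w, b))  # w.x + b = -2.0 -> -1
```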
19 Computing the margin width
(Figure: margin width M between the planes w·x + b = +1, w·x + b = 0, w·x + b = -1, with the two prediction zones)
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- Claim: The vector w is perpendicular to the plus-plane. Why?
20 Computing the margin width
(Figure as before: margin width M between the planes)
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- Claim: The vector w is perpendicular to the plus-plane. Why?
Let u and v be two vectors on the plus-plane. What is w·(u - v)?
And so of course the vector w is also perpendicular to the minus-plane.
21 Computing the margin width
(Figure: a point x+ on the plus-plane and a point x- on the minus-plane, margin width M)
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- The vector w is perpendicular to the plus-plane.
- Let x- be any point on the minus-plane.
- Let x+ be the closest plus-plane point to x-.
22 Computing the margin width
(Figure: x+ on the plus-plane, x- on the minus-plane, margin width M)
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- The vector w is perpendicular to the plus-plane.
- Let x- be any point on the minus-plane.
- Let x+ be the closest plus-plane point to x-.
- Claim: x+ = x- + λw for some value of λ. Why?
23 Computing the margin width
(Figure: x+ on the plus-plane, x- on the minus-plane, margin width M)
The line from x- to x+ is perpendicular to the planes. So to get from x- to x+, travel some distance in direction w.
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- The vector w is perpendicular to the plus-plane.
- Let x- be any point on the minus-plane.
- Let x+ be the closest plus-plane point to x-.
- Claim: x+ = x- + λw for some value of λ. Why?
24 Computing the margin width
(Figure as before: x+, x-, margin width M)
What we know:
- w·x+ + b = +1
- w·x- + b = -1
- x+ = x- + λw
- |x+ - x-| = M
It's now easy to get M in terms of w and b.
25 Computing the margin width
(Figure as before)
w·(x- + λw) + b = 1
⇒ w·x- + b + λ w·w = 1
⇒ -1 + λ w·w = 1
⇒ λ = 2 / (w·w)
What we know:
- w·x+ + b = +1
- w·x- + b = -1
- x+ = x- + λw
- |x+ - x-| = M
It's now easy to get M in terms of w and b.
26 Computing the margin width
(Figure as before)
M = |x+ - x-| = |λw| = λ|w| = λ√(w·w) = 2√(w·w)/(w·w) = 2/√(w·w)
What we know:
- w·x+ + b = +1
- w·x- + b = -1
- x+ = x- + λw
- |x+ - x-| = M
- λ = 2/(w·w)
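Restating the algebra of slides 25-26 in one place (a LaTeX transcription of the derivation above, nothing new):

```latex
\begin{align*}
w\cdot(x^{-} + \lambda w) + b = 1
  \;&\Rightarrow\; w\cdot x^{-} + b + \lambda\, w\cdot w = 1
  \;\Rightarrow\; -1 + \lambda\, w\cdot w = 1
  \;\Rightarrow\; \lambda = \frac{2}{w\cdot w},\\
M = \lVert x^{+} - x^{-}\rVert
  &= \lVert \lambda w\rVert
  = \lambda\sqrt{w\cdot w}
  = \frac{2}{\sqrt{w\cdot w}}
  = \frac{2}{\lVert w\rVert}.
\end{align*}
```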
27 Learning the Maximum Margin Classifier
(Figure: margin width M between the planes w·x + b = +1, w·x + b = 0, w·x + b = -1, with the two prediction zones)
- Given a guess of w and b we can
- Compute whether all data points are in the correct half-planes
- Compute the width of the margin
- So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?
- Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?
28 Don't worry, it's good for you...
- Linear Programming
- Find w
- argmax c·w
- subject to
- w·ai ≥ bi, for i = 1, ..., m
- wj ≥ 0, for j = 1, ..., n
There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar's algorithm.
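For concreteness, here is a small sketch (my own illustration, not part of the deck) of handing an LP in the form above to an off-the-shelf solver. scipy.optimize.linprog minimizes and uses ≤ constraints, so the objective and the ≥ rows are negated; the numbers and the finite upper bounds are arbitrary, chosen only to keep this toy instance bounded.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical LP in the slide's form: maximize c.w subject to A w >= b, w >= 0
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0],
              [2.0, 0.5]])
b = np.array([1.0, 1.0])

# linprog minimizes c'.x subject to A_ub x <= b_ub, so negate the objective
# and flip A w >= b into -A w <= -b. Bounds (0, 10) keep the toy problem bounded.
res = linprog(c=-c, A_ub=-A, b_ub=-b, bounds=[(0, 10)] * len(c))
print("w* =", res.x, " objective =", c @ res.x)
```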
29 Learning via Quadratic Programming
- QP is a well-studied class of optimization
algorithms to maximize a quadratic function of
some real-valued variables subject to linear
constraints.
30 Quadratic Programming
Quadratic criterion
Find
Subject to
n additional linear inequality constraints
And subject to
e additional linear equality constraints
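The criterion and constraints on this slide appeared as images in the original; a generic quadratic program of the kind meant here looks roughly as follows (the symbols c, d, R, a, b are my placeholders, not the slide's):

```latex
\begin{align*}
\text{Find } \mathbf{u}^{*} = \arg\max_{\mathbf{u}}\;
  & c + \mathbf{d}^{\mathsf T}\mathbf{u} + \tfrac{1}{2}\,\mathbf{u}^{\mathsf T} R\,\mathbf{u} \\
\text{subject to } \;
  & a_{i1}u_1 + \dots + a_{im}u_m \le b_i, \quad i = 1,\dots,n
    \quad\text{($n$ linear inequality constraints)} \\
\text{and subject to } \;
  & a_{j1}u_1 + \dots + a_{jm}u_m = b_j, \quad j = n+1,\dots,n+e
    \quad\text{($e$ linear equality constraints)}
\end{align*}
```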
31 Quadratic Programming
Quadratic criterion
Find
There exist algorithms for finding such
constrained quadratic optima much more
efficiently and reliably than gradient
ascent. (But they are very fiddly; you probably
don't want to write one yourself.)
Subject to
n additional linear inequality constraints
And subject to
e additional linear equality constraints
32 Learning the Maximum Margin Classifier
- Given a guess of w, b we can
- Compute whether all data points are in the correct half-planes
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M between the planes w·x + b = +1, w·x + b = 0, w·x + b = -1, with the two prediction zones)
What should our quadratic optimization criterion be?
Minimize w·w
How many constraints will we have? R. What should they be?
w·xk + b ≥ 1 if yk = +1
w·xk + b ≤ -1 if yk = -1
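To make the QP concrete, here is a minimal sketch using the cvxpy library (an assumption on tooling; the slides do not prescribe a solver) on a tiny made-up separable dataset. The R constraints are expressed in the combined form yk(w·xk + b) ≥ 1.

```python
import cvxpy as cp
import numpy as np

# Tiny, made-up, linearly separable data: rows of X are the xk, y in {+1, -1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
R, m = X.shape

w = cp.Variable(m)
b = cp.Variable()

# Minimize w.w subject to the R constraints yk (w.xk + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, " b =", b.value,
      " margin M =", 2 / np.linalg.norm(w.value))
```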
33 Uh-oh!
This is going to be a problem! What should we do?
Idea 1: Find minimum w·w while minimizing the number of training set errors.
Problem: Two things to minimize makes for an ill-defined optimization.
34 Uh-oh!
This is going to be a problem! What should we do?
Idea 1.1: Minimize w·w + C (#train errors)
(C is the tradeoff parameter)
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
35 Uh-oh!
This is going to be a problem! What should we do?
Idea 1.1: Minimize w·w + C (#train errors)
(C is the tradeoff parameter)
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
It can't be expressed as a quadratic programming problem. Solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)
So... any other ideas?
36 Uh-oh!
This is going to be a problem! What should we do?
Idea 2.0: Minimize w·w + C (distance of error points to their correct place)
37 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M between the planes w·x + b = +1, w·x + b = 0, w·x + b = -1)
What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?
38 Large-margin Decision Boundary
- The decision boundary should be as far away from the data of both classes as possible
- We should maximize the margin, m
- The distance between the origin and the line wᵀx = k is k/||w||
(Figure: Class 1 and Class 2 separated by a boundary with margin m)
39 Finding the Decision Boundary
- Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi
- The decision boundary should classify all points correctly ⇒ yi(wᵀxi + b) ≥ 1 for all i
- The decision boundary can be found by solving the following constrained optimization problem: minimize ½||w||² subject to yi(wᵀxi + b) ≥ 1 for all i
- This is a constrained optimization problem. Solving it requires some new tools
- Feel free to ignore the following several slides; what is important is the constrained optimization problem above
40 Back to the Original Problem
- The Lagrangian is L = ½ wᵀw - Σi αi (yi(wᵀxi + b) - 1), with αi ≥ 0
- Note that ||w||² = wᵀw
- Setting the gradient of L w.r.t. w and b to zero, we have w = Σi αi yi xi and Σi αi yi = 0
41 The Karush-Kuhn-Tucker conditions add, at the optimum, αi ≥ 0 and the complementarity condition αi (yi(wᵀxi + b) - 1) = 0 for all i.
42 The Dual Problem
- If we substitute w = Σi αi yi xi into the Lagrangian, we have W(α) = Σi αi - ½ Σi Σj αi αj yi yj xiᵀxj
- Note that Σi αi yi = 0
- This is a function of αi only
43 The Dual Problem
- The new objective function is in terms of αi only
- It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w
- The original problem is known as the primal problem
- The objective function of the dual problem needs to be maximized!
- The dual problem is therefore: maximize W(α) subject to αi ≥ 0 (the properties of αi when we introduce the Lagrange multipliers) and Σi αi yi = 0 (the result when we differentiate the original Lagrangian w.r.t. b)
44 The Dual Problem
- This is a quadratic programming (QP) problem
- A global maximum of αi can always be found
- w can be recovered by w = Σi αi yi xi
45 Characteristics of the Solution
- Many of the αi are zero
- w is a linear combination of a small number of data points
- This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier
- xi with non-zero αi are called support vectors (SV)
- The decision boundary is determined only by the SVs
- Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write w = Σj αtj ytj xtj
- For testing with a new data point z: compute wᵀz + b = Σj αtj ytj (xtjᵀz) + b, and classify z as class 1 if the sum is positive, and class 2 otherwise
- Note: w need not be formed explicitly
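A small sketch of the last two bullets (my own illustration; the α's, labels, support vectors and b are assumed to come from some solver): only the support vectors enter the sum, and w is never formed explicitly.

```python
import numpy as np

def decision_value(z, alphas, ys, Xs, b, kernel=np.dot):
    """Compute sum_j alpha_tj * y_tj * K(x_tj, z) + b, summing only over
    the support vectors (the points with non-zero alpha)."""
    sv = alphas > 1e-8
    return sum(a * y * kernel(x, z)
               for a, y, x in zip(alphas[sv], ys[sv], Xs[sv])) + b

def classify(z, alphas, ys, Xs, b, kernel=np.dot):
    # Class 1 if the sum is positive, class 2 otherwise
    return 1 if decision_value(z, alphas, ys, Xs, b, kernel) > 0 else 2
```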
46 A Geometrical Interpretation
(Figure: Class 1 and Class 2 points. The non-support vectors have αi = 0 (here α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0); the support vectors on the margin have α1 = 0.8, α6 = 1.4, α8 = 0.6)
47 Non-linearly Separable Problems
- We allow "errors" ξi in classification; they are based on the output of the discriminant function wᵀx + b
- ξi approximates the number of misclassified samples
48 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M between the planes w·x + b = +1, 0, -1; slack distances ε11, ε2, ε7 mark points on the wrong side of their margin plane)
What should our quadratic optimization criterion be? Minimize w·w + C Σk εk
How many constraints will we have? R. What should they be?
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
49 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints (records), each (xk, yk) where yk = +/- 1, with m input dimensions
(Figure: margin M with slack distances ε11, ε2, ε7)
Our original (noiseless-data) QP had m + 1 variables: w1, w2, ..., wm, and b. Our new (noisy-data) QP has m + 1 + R variables: w1, w2, ..., wm, b, ε1, ..., εR.
What should our quadratic optimization criterion be? Minimize w·w + C Σk εk
How many constraints will we have? R. What should they be?
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
50 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M with slack distances ε11, ε2, ε7)
What should our quadratic optimization criterion be? Minimize w·w + C Σk εk
How many constraints will we have? R. What should they be?
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
There's a bug in this QP. Can you spot it?
51 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M with slack distances ε11, ε2, ε7)
How many constraints will we have? 2R. What should they be?
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
εk ≥ 0 for all k
What should our quadratic optimization criterion be? Minimize w·w + C Σk εk
52 Learning Maximum Margin with Noise
(Same formulation as the previous slide: 2R constraints, minimize w·w + C Σk εk)
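As with the noiseless case, this "noisy" formulation is an ordinary QP. A sketch with cvxpy (toy data and the value of C are invented; the criterion is the reconstructed w·w + C Σ εk, to which some texts add a factor of ½):

```python
import cvxpy as cp
import numpy as np

# Toy data, deliberately not linearly separable (last point is on the wrong side)
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.5], [-0.2, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
R, m = X.shape
C = 1.0                      # tradeoff parameter

w = cp.Variable(m)
b = cp.Variable()
eps = cp.Variable(R)         # one slack variable per datapoint

constraints = [cp.multiply(y, X @ w + b) >= 1 - eps,   # R margin constraints
               eps >= 0]                               # R positivity constraints (2R total)
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(eps))
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value, " slacks =", np.round(eps.value, 3))
```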
53 An Equivalent Dual QP
The primal: Minimize w·w + C Σk εk subject to
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
εk ≥ 0, for all k
The dual: Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (xk·xl)
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
54 An Equivalent Dual QP
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (xk·xl)
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then classify with f(x, w, b) = sign(w·x - b)
55 Example: XOR problem revisited
Let the nonlinear mapping be φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ
and φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)ᵀ.
Therefore the feature space is 6-dimensional, with input data in 2D:
x1 = (-1, -1), d1 = -1
x2 = (-1, +1), d2 = +1
x3 = (+1, -1), d3 = +1
x4 = (+1, +1), d4 = -1
56 Q(α) = Σ αi - ½ Σ Σ αi αj di dj φ(xi)ᵀφ(xj)
= α1 + α2 + α3 + α4 - ½(9α1α1 - 2α1α2 - 2α1α3 + 2α1α4 + 9α2α2 + 2α2α3 - 2α2α4 + 9α3α3 - 2α3α4 + 9α4α4)
To optimize Q, we only need to set its partial derivatives to zero (due to the optimality conditions), which gives
1 = 9α1 - α2 - α3 + α4
1 = -α1 + 9α2 + α3 - α4
1 = -α1 + α2 + 9α3 - α4
1 = α1 - α2 - α3 + 9α4
57 The solution of this system gives the optimal values
α0,1 = α0,2 = α0,3 = α0,4 = 1/8
w0 = Σ α0,i di φ(xi) = (1/8)[-φ(x1) + φ(x2) + φ(x3) - φ(x4)],
where the first element of w0 gives the bias b.
58 From earlier we have that the optimal hyperplane is defined by w0ᵀφ(x) = 0.
That is, w0ᵀφ(x) = -x1x2 = 0,
which is the optimal decision boundary for the XOR problem. Furthermore, we note that the solution is unique since the optimal decision boundary is unique.
59 Output for polynomial and RBF kernels
(Figure: resulting decision boundaries)
60 Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too.
(Figure: 1-d datapoints along the x-axis, around x = 0, that are not linearly separable)
61 For a non-linearly separable problem we have to first map the data onto a feature space so that it is linearly separable: xi → φ(xi).
Given the training data sample {(xi, yi), i = 1, ..., N}, find the optimum values of the weight vector w and bias b:
w = Σ α0,i yi φ(xi),
where α0,i are the optimal Lagrange multipliers determined by maximizing the objective function
Q(α) = Σ αi - ½ Σ Σ αi αj yi yj φ(xi)ᵀφ(xj)
subject to the constraints Σ αi yi = 0 and αi ≥ 0.
62 SVM building procedure
- Pick a nonlinear mapping φ
- Solve for the optimal weight vector
- However, how do we pick the function φ?
- In practical applications, if it is not totally impossible to find φ, it is very hard
- In the previous example the function φ is quite complex: how would we find it?
- Answer: the Kernel Trick
63 Notice that in the dual problem the images of the input vectors are involved only through an inner product, meaning that the optimization can be performed in the (lower-dimensional) input space and that the inner product can be replaced by an inner-product kernel K(x, xi) = φ(x)ᵀφ(xi).
How do we relate the output of the SVM to the kernel K? Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation.
66 In the XOR problem, we chose to use the kernel function K(x, xi) = (xᵀxi + 1)²
= 1 + x1²xi1² + 2x1x2xi1xi2 + x2²xi2² + 2x1xi1 + 2x2xi2,
which implied the form of our nonlinear functions
φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ
and φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)ᵀ.
However, we did not need to calculate φ at all: we could simply have used the kernel to calculate
Q(α) = Σ αi - ½ Σ Σ αi αj di dj K(xi, xj),
maximized it, solved for the αi, and derived the hyperplane via w0 = Σ α0,i di φ(xi).
67 We therefore only need a suitable choice of kernel function; cf. Mercer's Theorem:
Let K(x, y) be a continuous symmetric kernel defined on the closed interval [a, b]. The kernel K can be expanded in the form K(x, y) = φ(x)ᵀφ(y) provided it is positive definite.
Some of the usual choices for K are:
Polynomial SVM: (xᵀxi + 1)^p, with p specified by the user
RBF SVM: exp(-1/(2σ²) ||x - xi||²), with σ specified by the user
MLP SVM: tanh(s0 xᵀxi + s1)
68 An Equivalent Dual QP
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (xk·xl)
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Datapoints with αk > 0 will be the support vectors.
Then define w = Σk αk yk xk and classify with f(x, w, b) = sign(w·x - b)
...so this sum only needs to be over the support vectors.
69 Quadratic Basis Functions
φ(x) = (1, √2x1, ..., √2xm, x1², ..., xm², √2x1x2, √2x1x3, ..., √2xm-1xm)ᵀ
(constant term, linear terms, pure quadratic terms, quadratic cross-terms)
- Number of terms (assuming m input dimensions): (m+2)-choose-2 = (m+2)(m+1)/2, i.e. (as near as makes no difference) m²/2
- You may be wondering what those √2's are doing.
- You should be happy that they do no harm.
- You'll find out why they're there soon.
70 QP with basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (φ(xk)·φ(xl))
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
71 QP with basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (φ(xk)·φ(xl))
We must do R²/2 dot products to get this matrix ready. Each dot product requires m²/2 additions and multiplications. The whole thing costs R²m²/4. Yeeks! ...or does it?
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
72 Quadratic Dot Products
φ(a)·φ(b) = 1 + 2 Σi ai bi + Σi ai²bi² + 2 Σi<j ai aj bi bj
73 Quadratic Dot Products
Just out of casual, innocent interest, let's look at another function of a and b:
(a·b + 1)² = (a·b)² + 2(a·b) + 1
74 Quadratic Dot Products
Just out of casual, innocent interest, let's look at another function of a and b:
(a·b + 1)² = Σi ai²bi² + 2 Σi<j ai aj bi bj + 2 Σi ai bi + 1
They're the same! And this is only O(m) to compute!
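The "they're the same" claim is easy to verify numerically. A short check (my own, spelling out the √2-weighted quadratic basis from slide 69):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic basis: constant, sqrt(2)-scaled linear, pure quadratic, cross terms."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

a = np.array([0.3, -1.2, 2.0])
b = np.array([1.5, 0.4, -0.7])

print(np.dot(phi(a), phi(b)))   # the O(m^2) quadratic-basis dot product
print((np.dot(a, b) + 1) ** 2)  # the O(m) kernel (a.b + 1)^2 -- same number
```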
75 QP with Quadratic basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl ((xk·xl + 1)²)
We must do R²/2 dot products to get this matrix ready. Each dot product now only requires m additions and multiplications.
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
76 Higher Order Polynomials
Polynomial | φ(x) | Cost to build Qkl matrix traditionally | Cost if 100 inputs | φ(a)·φ(b) | Cost to build Qkl matrix efficiently | Cost if 100 inputs
Quadratic | all m²/2 terms up to degree 2 | m²R²/4 | 2,500 R² | (a·b + 1)² | mR²/2 | 50 R²
Cubic | all m³/6 terms up to degree 3 | m³R²/12 | 83,000 R² | (a·b + 1)³ | mR²/2 | 50 R²
Quartic | all m⁴/24 terms up to degree 4 | m⁴R²/48 | 1,960,000 R² | (a·b + 1)⁴ | mR²/2 | 50 R²
77 QP with Quintic basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl ((xk·xl + 1)⁵)
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
78 QP with Quintic basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl ((xk·xl + 1)⁵)
We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million. But there are still worrying things lurking away. What are they?
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
- The fear of overfitting with this enormous number of terms
- The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
79 QP with Quintic basis functions
(Same slide, with the answers:)
- The fear of overfitting with this enormous number of terms: the use of maximum margin magically makes this not a problem.
- The evaluation phase will be very expensive: because each w·φ(x) (see below) needs 75 million operations. What can be done?
Then classify with f(x, w, b) = sign(w·φ(x) - b)
80 QP with Quintic basis functions
(Same slide, with the final answer:)
Evaluating w·φ(x) through the kernel takes only S·m operations, where S = number of support vectors.
Then classify with f(x, w, b) = sign(w·φ(x) - b)
81 QP with Quintic basis functions
(Duplicate of the previous slide)
82 QP with Quintic basis functions
Why SVMs don't overfit as much as you'd think: no matter what the basis function, there are really only up to R parameters (α1, α2, ..., αR), and usually most are set to zero by the maximum margin. Asking for small w·w is like weight decay in neural nets, like the ridge parameter in ridge regression, and like the use of priors in Bayesian regression: all designed to smooth the function and reduce overfitting.
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk); evaluating w·φ(x) via the kernel takes only S·m operations (S = number of support vectors).
Then classify with f(x, w, b) = sign(w·φ(x) - b)
83 SVM Kernel Functions
- K(a, b) = (a·b + 1)^d is an example of an SVM kernel function
- Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function
- Radial-basis-style kernel function: K(a, b) = exp(-||a - b||²/(2σ²))
- Neural-net-style kernel function: K(a, b) = tanh(κ a·b - δ)
σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM
84 SVM Implementations
- Sequential Minimal Optimization (SMO): an efficient implementation of SVMs, by Platt
- included in Weka
- SVMlight
- http://svmlight.joachims.org/
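For completeness, a minimal usage sketch of one packaged implementation (scikit-learn's SVC, which is built on an SMO-style solver; the toy data is invented):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up two-class problem
X = np.array([[-2, -1], [-1, -2], [1, 2], [2, 1], [0.1, 0.2], [-0.2, -0.1]])
y = np.array([-1, -1, 1, 1, 1, -1])

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # RBF kernel, soft margin with tradeoff C
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("prediction for (1.5, 1.5):", clf.predict([[1.5, 1.5]]))
```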
85 References
- A tutorial on VC-dimension and Support Vector Machines:
C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
- The VC/SRM/SVM Bible:
Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998