Title: Introduction to SVMs
1 Introduction to SVMs
2 SVMs
- Geometric
- Maximizing Margin
- Kernel Methods
- Making nonlinear decision boundaries linear
- Efficiently!
- Capacity
- Structural Risk Minimization
3 Linear Classifiers
(Figure: input x is fed to classifier f, which outputs the estimate y_est; one symbol denotes +1, the other denotes -1)
f(x, w, b) = sign(w·x - b)
How would you classify this data?
4 Linear Classifiers
(Same figure with a different candidate decision boundary)
How would you classify this data?
5 Linear Classifiers
(Same figure with a different candidate decision boundary)
How would you classify this data?
6 Linear Classifiers
(Same figure with a different candidate decision boundary)
How would you classify this data?
7 Linear Classifiers
(Figure: x → f → y_est; several candidate separating lines drawn through the data)
f(x, w, b) = sign(w·x - b)
Any of these would be fine... but which is best?
8 Classifier Margin
(Figure: x → f → y_est; the margin shown as a band around the boundary)
f(x, w, b) = sign(w·x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
9 Maximum Margin
(Figure: x → f → y_est; the maximum-margin separating line)
f(x, w, b) = sign(w·x - b)
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
10 Maximum Margin
(Figure: x → f → y_est; support vectors highlighted on the margin)
f(x, w, b) = sign(w·x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are the datapoints that the margin pushes up against.
11 Why Maximum Margin?
- Intuitively this feels safest.
- If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
- There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
- Empirically it works very, very well.
12 A Good Separator
(Figure: O and X points with a separating line)
13 Noise in the Observations
(Figure: the same O and X points, with noise around each observation)
14 Ruling Out Some Separators
(Figure: separators that cut through the noise regions are ruled out)
15 Lots of Noise
(Figure: larger noise regions rule out more candidate separators)
16 Maximizing the Margin
(Figure: the separator with the maximum margin to the nearest O and X points)
17 Specifying a line and margin
(Figure: plus-plane, classifier boundary and minus-plane, with the "Predict Class = +1" and "Predict Class = -1" zones)
- How do we represent this mathematically?
- ...in m input dimensions?
18 Specifying a line and margin
(Figure: plus-plane w·x + b = +1, classifier boundary w·x + b = 0, minus-plane w·x + b = -1, with the "Predict Class = +1" and "Predict Class = -1" zones)
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
Classify as:
+1 if w·x + b ≥ 1
-1 if w·x + b ≤ -1
Universe explodes if -1 < w·x + b < 1
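As a side illustration (not from the slides), here is a minimal Python sketch of this decision rule. It uses the w·x + b convention of the plane equations above (the slides write the classifier as sign(w·x - b), a sign-convention difference on b); the weights and sample points are made-up values.

```python
import numpy as np

def predict(x, w, b):
    """Classify x as +1 if w.x + b >= 1, -1 if w.x + b <= -1.
    Points with -1 < w.x + b < 1 fall inside the margin ("universe explodes")."""
    s = np.dot(w, x) + b
    if s >= 1:
        return +1
    if s <= -1:
        return -1
    return 0  # inside the margin: the hard-margin classifier makes no promise here

# Hypothetical values purely for illustration
w = np.array([2.0, -1.0])
b = 0.5
print(predict(np.array([1.0, 0.0]), w, b))   # w.x + b = 2.5  -> +1
print(predict(np.array([-1.0, 0.5]), w, b))  # w.x + b = -2.0 -> -1
```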
19 Computing the margin width
(Figure: margin width M between the planes w·x + b = +1, w·x + b = 0, w·x + b = -1, with the two prediction zones)
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- Claim: The vector w is perpendicular to the plus-plane. Why?
20 Computing the margin width
(Figure as before: margin width M between the planes)
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- Claim: The vector w is perpendicular to the plus-plane. Why?
Let u and v be two vectors on the plus-plane. What is w·(u - v)?
And so of course the vector w is also perpendicular to the minus-plane.
21 Computing the margin width
(Figure: a point x+ on the plus-plane and a point x- on the minus-plane, margin width M)
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- The vector w is perpendicular to the plus-plane.
- Let x- be any point on the minus-plane.
- Let x+ be the closest plus-plane point to x-.
22 Computing the margin width
(Figure: x+ on the plus-plane, x- on the minus-plane, margin width M)
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- The vector w is perpendicular to the plus-plane.
- Let x- be any point on the minus-plane.
- Let x+ be the closest plus-plane point to x-.
- Claim: x+ = x- + λw for some value of λ. Why?
23 Computing the margin width
(Figure: x+ on the plus-plane, x- on the minus-plane, margin width M)
The line from x- to x+ is perpendicular to the planes. So to get from x- to x+, travel some distance in direction w.
How do we compute M in terms of w and b?
- Plus-plane = { x : w·x + b = +1 }
- Minus-plane = { x : w·x + b = -1 }
- The vector w is perpendicular to the plus-plane.
- Let x- be any point on the minus-plane.
- Let x+ be the closest plus-plane point to x-.
- Claim: x+ = x- + λw for some value of λ. Why?
24 Computing the margin width
(Figure as before: x+, x-, margin width M)
What we know:
- w·x+ + b = +1
- w·x- + b = -1
- x+ = x- + λw
- |x+ - x-| = M
It's now easy to get M in terms of w and b.
25 Computing the margin width
(Figure as before)
w·(x- + λw) + b = 1
⇒ w·x- + b + λ w·w = 1
⇒ -1 + λ w·w = 1
⇒ λ = 2 / (w·w)
What we know:
- w·x+ + b = +1
- w·x- + b = -1
- x+ = x- + λw
- |x+ - x-| = M
It's now easy to get M in terms of w and b.
26 Computing the margin width
(Figure as before)
M = |x+ - x-| = |λw| = λ|w| = λ√(w·w) = 2√(w·w)/(w·w) = 2/√(w·w)
What we know:
- w·x+ + b = +1
- w·x- + b = -1
- x+ = x- + λw
- |x+ - x-| = M
- λ = 2/(w·w)
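Restating the algebra of slides 25-26 in one place (a LaTeX transcription of the derivation above, nothing new):

```latex
\begin{align*}
w\cdot(x^{-} + \lambda w) + b = 1
  \;&\Rightarrow\; w\cdot x^{-} + b + \lambda\, w\cdot w = 1
  \;\Rightarrow\; -1 + \lambda\, w\cdot w = 1
  \;\Rightarrow\; \lambda = \frac{2}{w\cdot w},\\
M = \lVert x^{+} - x^{-}\rVert
  &= \lVert \lambda w\rVert
  = \lambda\sqrt{w\cdot w}
  = \frac{2}{\sqrt{w\cdot w}}
  = \frac{2}{\lVert w\rVert}.
\end{align*}
```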
27 Learning the Maximum Margin Classifier
(Figure: margin width M between the planes w·x + b = +1, w·x + b = 0, w·x + b = -1, with the two prediction zones)
- Given a guess of w and b we can
- Compute whether all data points are in the correct half-planes
- Compute the width of the margin
- So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?
- Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?
28 Don't worry, it's good for you...
- Linear Programming
- Find w
- argmax c·w
- subject to
- w·ai ≥ bi, for i = 1, ..., m
- wj ≥ 0, for j = 1, ..., n
There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar's algorithm.
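For concreteness, here is a small sketch (my own illustration, not part of the deck) of handing an LP in the form above to an off-the-shelf solver. scipy.optimize.linprog minimizes and uses ≤ constraints, so the objective and the ≥ rows are negated; the numbers and the finite upper bounds are arbitrary, chosen only to keep this toy instance bounded.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical LP in the slide's form: maximize c.w subject to A w >= b, w >= 0
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0],
              [2.0, 0.5]])
b = np.array([1.0, 1.0])

# linprog minimizes c'.x subject to A_ub x <= b_ub, so negate the objective
# and flip A w >= b into -A w <= -b. Bounds (0, 10) keep the toy problem bounded.
res = linprog(c=-c, A_ub=-A, b_ub=-b, bounds=[(0, 10)] * len(c))
print("w* =", res.x, " objective =", c @ res.x)
```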
29 Learning via Quadratic Programming
- QP is a well-studied class of optimization
algorithms to maximize a quadratic function of
some real-valued variables subject to linear
constraints.
30 Quadratic Programming
Quadratic criterion
Find
Subject to
n additional linear inequality constraints
And subject to
e additional linear equality constraints
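The criterion and constraints on this slide appeared as images in the original; a generic quadratic program of the kind meant here looks roughly as follows (the symbols c, d, R, a, b are my placeholders, not the slide's):

```latex
\begin{align*}
\text{Find } \mathbf{u}^{*} = \arg\max_{\mathbf{u}}\;
  & c + \mathbf{d}^{\mathsf T}\mathbf{u} + \tfrac{1}{2}\,\mathbf{u}^{\mathsf T} R\,\mathbf{u} \\
\text{subject to } \;
  & a_{i1}u_1 + \dots + a_{im}u_m \le b_i, \quad i = 1,\dots,n
    \quad\text{($n$ linear inequality constraints)} \\
\text{and subject to } \;
  & a_{j1}u_1 + \dots + a_{jm}u_m = b_j, \quad j = n+1,\dots,n+e
    \quad\text{($e$ linear equality constraints)}
\end{align*}
```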
31 Quadratic Programming
Quadratic criterion
Find
There exist algorithms for finding such
constrained quadratic optima much more
efficiently and reliably than gradient
ascent. (But they are very fiddly; you probably
don't want to write one yourself.)
Subject to
n additional linear inequality constraints
And subject to
e additional linear equality constraints
32 Learning the Maximum Margin Classifier
- Given a guess of w, b we can
- Compute whether all data points are in the correct half-planes
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M between the planes w·x + b = +1, w·x + b = 0, w·x + b = -1, with the two prediction zones)
What should our quadratic optimization criterion be?
Minimize w·w
How many constraints will we have? R. What should they be?
w·xk + b ≥ 1 if yk = +1
w·xk + b ≤ -1 if yk = -1
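To make the QP concrete, here is a minimal sketch using the cvxpy library (an assumption on tooling; the slides do not prescribe a solver) on a tiny made-up separable dataset. The R constraints are expressed in the combined form yk(w·xk + b) ≥ 1.

```python
import cvxpy as cp
import numpy as np

# Tiny, made-up, linearly separable data: rows of X are the xk, y in {+1, -1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
R, m = X.shape

w = cp.Variable(m)
b = cp.Variable()

# Minimize w.w subject to the R constraints yk (w.xk + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, " b =", b.value,
      " margin M =", 2 / np.linalg.norm(w.value))
```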
33 Uh-oh!
This is going to be a problem! What should we do?
Idea 1: Find minimum w·w while minimizing the number of training set errors.
Problem: Two things to minimize makes for an ill-defined optimization.
34 Uh-oh!
This is going to be a problem! What should we do?
Idea 1.1: Minimize w·w + C (#train errors)
(C is the tradeoff parameter)
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
35 Uh-oh!
This is going to be a problem! What should we do?
Idea 1.1: Minimize w·w + C (#train errors)
(C is the tradeoff parameter)
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
It can't be expressed as a quadratic programming problem. Solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)
So... any other ideas?
36 Uh-oh!
This is going to be a problem! What should we do?
Idea 2.0: Minimize w·w + C (distance of error points to their correct place)
37 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M between the planes w·x + b = +1, w·x + b = 0, w·x + b = -1)
What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?
38 Large-margin Decision Boundary
- The decision boundary should be as far away from the data of both classes as possible
- We should maximize the margin, m
- The distance between the origin and the line wᵀx = k is k/||w||
(Figure: Class 1 and Class 2 separated by a boundary with margin m)
39 Finding the Decision Boundary
- Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi
- The decision boundary should classify all points correctly ⇒ yi(wᵀxi + b) ≥ 1 for all i
- The decision boundary can be found by solving the following constrained optimization problem: minimize ½||w||² subject to yi(wᵀxi + b) ≥ 1 for all i
- This is a constrained optimization problem. Solving it requires some new tools
- Feel free to ignore the following several slides; what is important is the constrained optimization problem above
40 Back to the Original Problem
- The Lagrangian is L = ½ wᵀw - Σi αi (yi(wᵀxi + b) - 1), with αi ≥ 0
- Note that ||w||² = wᵀw
- Setting the gradient of L w.r.t. w and b to zero, we have w = Σi αi yi xi and Σi αi yi = 0
41 The Karush-Kuhn-Tucker conditions add, at the optimum, αi ≥ 0 and the complementarity condition αi (yi(wᵀxi + b) - 1) = 0 for all i.
42 The Dual Problem
- If we substitute w = Σi αi yi xi into the Lagrangian, we have W(α) = Σi αi - ½ Σi Σj αi αj yi yj xiᵀxj
- Note that Σi αi yi = 0
- This is a function of αi only
43 The Dual Problem
- The new objective function is in terms of αi only
- It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w
- The original problem is known as the primal problem
- The objective function of the dual problem needs to be maximized!
- The dual problem is therefore: maximize W(α) subject to αi ≥ 0 (the properties of αi when we introduce the Lagrange multipliers) and Σi αi yi = 0 (the result when we differentiate the original Lagrangian w.r.t. b)
44 The Dual Problem
- This is a quadratic programming (QP) problem
- A global maximum of αi can always be found
- w can be recovered by w = Σi αi yi xi
45 Characteristics of the Solution
- Many of the αi are zero
- w is a linear combination of a small number of data points
- This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier
- xi with non-zero αi are called support vectors (SV)
- The decision boundary is determined only by the SVs
- Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write w = Σj αtj ytj xtj
- For testing with a new data point z: compute wᵀz + b = Σj αtj ytj (xtjᵀz) + b, and classify z as class 1 if the sum is positive, and class 2 otherwise
- Note: w need not be formed explicitly
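A small sketch of the last two bullets (my own illustration; the α's, labels, support vectors and b are assumed to come from some solver): only the support vectors enter the sum, and w is never formed explicitly.

```python
import numpy as np

def decision_value(z, alphas, ys, Xs, b, kernel=np.dot):
    """Compute sum_j alpha_tj * y_tj * K(x_tj, z) + b, summing only over
    the support vectors (the points with non-zero alpha)."""
    sv = alphas > 1e-8
    return sum(a * y * kernel(x, z)
               for a, y, x in zip(alphas[sv], ys[sv], Xs[sv])) + b

def classify(z, alphas, ys, Xs, b, kernel=np.dot):
    # Class 1 if the sum is positive, class 2 otherwise
    return 1 if decision_value(z, alphas, ys, Xs, b, kernel) > 0 else 2
```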
46 A Geometrical Interpretation
(Figure: Class 1 and Class 2 points. The non-support vectors have αi = 0 (here α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0); the support vectors on the margin have α1 = 0.8, α6 = 1.4, α8 = 0.6)
47 Non-linearly Separable Problems
- We allow "errors" ξi in classification; they are based on the output of the discriminant function wᵀx + b
- ξi approximates the number of misclassified samples
48 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M between the planes w·x + b = +1, 0, -1; slack distances ε11, ε2, ε7 mark points on the wrong side of their margin plane)
What should our quadratic optimization criterion be? Minimize w·w + C Σk εk
How many constraints will we have? R. What should they be?
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
49 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints (records), each (xk, yk) where yk = +/- 1, with m input dimensions
(Figure: margin M with slack distances ε11, ε2, ε7)
Our original (noiseless-data) QP had m + 1 variables: w1, w2, ..., wm, and b. Our new (noisy-data) QP has m + 1 + R variables: w1, w2, ..., wm, b, ε1, ..., εR.
What should our quadratic optimization criterion be? Minimize w·w + C Σk εk
How many constraints will we have? R. What should they be?
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
50 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M with slack distances ε11, ε2, ε7)
What should our quadratic optimization criterion be? Minimize w·w + C Σk εk
How many constraints will we have? R. What should they be?
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
There's a bug in this QP. Can you spot it?
51 Learning Maximum Margin with Noise
- Given a guess of w, b we can
- Compute the sum of distances of points to their correct zones
- Compute the margin width
- Assume R datapoints, each (xk, yk) where yk = +/- 1
(Figure: margin M with slack distances ε11, ε2, ε7)
How many constraints will we have? 2R. What should they be?
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
εk ≥ 0 for all k
What should our quadratic optimization criterion be? Minimize w·w + C Σk εk
52 Learning Maximum Margin with Noise
(Same formulation as the previous slide: 2R constraints, minimize w·w + C Σk εk)
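As with the noiseless case, this "noisy" formulation is an ordinary QP. A sketch with cvxpy (toy data and the value of C are invented; the criterion is the reconstructed w·w + C Σ εk, to which some texts add a factor of ½):

```python
import cvxpy as cp
import numpy as np

# Toy data, deliberately not linearly separable (last point is on the wrong side)
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.5], [-0.2, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
R, m = X.shape
C = 1.0                      # tradeoff parameter

w = cp.Variable(m)
b = cp.Variable()
eps = cp.Variable(R)         # one slack variable per datapoint

constraints = [cp.multiply(y, X @ w + b) >= 1 - eps,   # R margin constraints
               eps >= 0]                               # R positivity constraints (2R total)
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(eps))
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value, " slacks =", np.round(eps.value, 3))
```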
53 An Equivalent Dual QP
The primal: Minimize w·w + C Σk εk subject to
w·xk + b ≥ 1 - εk if yk = +1
w·xk + b ≤ -1 + εk if yk = -1
εk ≥ 0, for all k
The dual: Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (xk·xl)
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
54 An Equivalent Dual QP
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (xk·xl)
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then classify with f(x, w, b) = sign(w·x - b)
55 Example: XOR problem revisited
Let the nonlinear mapping be φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ
and φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)ᵀ.
Therefore the feature space is 6-dimensional, with input data in 2D:
x1 = (-1, -1), d1 = -1
x2 = (-1, +1), d2 = +1
x3 = (+1, -1), d3 = +1
x4 = (+1, +1), d4 = -1
56 Q(α) = Σ αi - ½ Σ Σ αi αj di dj φ(xi)ᵀφ(xj)
= α1 + α2 + α3 + α4 - ½(9α1α1 - 2α1α2 - 2α1α3 + 2α1α4 + 9α2α2 + 2α2α3 - 2α2α4 + 9α3α3 - 2α3α4 + 9α4α4)
To optimize Q, we only need to set its partial derivatives to zero (due to the optimality conditions), which gives
1 = 9α1 - α2 - α3 + α4
1 = -α1 + 9α2 + α3 - α4
1 = -α1 + α2 + 9α3 - α4
1 = α1 - α2 - α3 + 9α4
57 The solution of this system gives the optimal values
α0,1 = α0,2 = α0,3 = α0,4 = 1/8
w0 = Σ α0,i di φ(xi) = (1/8)[-φ(x1) + φ(x2) + φ(x3) - φ(x4)],
where the first element of w0 gives the bias b.
58 From earlier we have that the optimal hyperplane is defined by w0ᵀφ(x) = 0.
That is, w0ᵀφ(x) = -x1x2 = 0,
which is the optimal decision boundary for the XOR problem. Furthermore, we note that the solution is unique since the optimal decision boundary is unique.
59 Output for polynomial and RBF kernels
(Figure: resulting decision boundaries)
60 Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too.
(Figure: 1-d datapoints along the x-axis, around x = 0, that are not linearly separable)
61 For a non-linearly separable problem we have to first map the data onto a feature space so that it is linearly separable: xi → φ(xi).
Given the training data sample {(xi, yi), i = 1, ..., N}, find the optimum values of the weight vector w and bias b:
w = Σ α0,i yi φ(xi),
where α0,i are the optimal Lagrange multipliers determined by maximizing the objective function
Q(α) = Σ αi - ½ Σ Σ αi αj yi yj φ(xi)ᵀφ(xj)
subject to the constraints Σ αi yi = 0 and αi ≥ 0.
62 SVM building procedure
- Pick a nonlinear mapping φ
- Solve for the optimal weight vector
- However, how do we pick the function φ?
- In practical applications, if it is not totally impossible to find φ, it is very hard
- In the previous example the function φ is quite complex: how would we find it?
- Answer: the Kernel Trick
63 Notice that in the dual problem the images of the input vectors are involved only through an inner product, meaning that the optimization can be performed in the (lower-dimensional) input space and that the inner product can be replaced by an inner-product kernel K(x, xi) = φ(x)ᵀφ(xi).
How do we relate the output of the SVM to the kernel K? Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation.
66 In the XOR problem, we chose to use the kernel function K(x, xi) = (xᵀxi + 1)²
= 1 + x1²xi1² + 2x1x2xi1xi2 + x2²xi2² + 2x1xi1 + 2x2xi2,
which implied the form of our nonlinear functions
φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ
and φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)ᵀ.
However, we did not need to calculate φ at all: we could simply have used the kernel to calculate
Q(α) = Σ αi - ½ Σ Σ αi αj di dj K(xi, xj),
maximized it, solved for the αi, and derived the hyperplane via w0 = Σ α0,i di φ(xi).
67 We therefore only need a suitable choice of kernel function; cf. Mercer's Theorem:
Let K(x, y) be a continuous symmetric kernel defined on the closed interval [a, b]. The kernel K can be expanded in the form K(x, y) = φ(x)ᵀφ(y) provided it is positive definite.
Some of the usual choices for K are:
Polynomial SVM: (xᵀxi + 1)^p, with p specified by the user
RBF SVM: exp(-1/(2σ²) ||x - xi||²), with σ specified by the user
MLP SVM: tanh(s0 xᵀxi + s1)
68 An Equivalent Dual QP
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (xk·xl)
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Datapoints with αk > 0 will be the support vectors.
Then define w = Σk αk yk xk and classify with f(x, w, b) = sign(w·x - b)
...so this sum only needs to be over the support vectors.
69 Quadratic Basis Functions
φ(x) = (1, √2x1, ..., √2xm, x1², ..., xm², √2x1x2, √2x1x3, ..., √2xm-1xm)ᵀ
(constant term, linear terms, pure quadratic terms, quadratic cross-terms)
- Number of terms (assuming m input dimensions): (m+2)-choose-2 = (m+2)(m+1)/2, i.e. (as near as makes no difference) m²/2
- You may be wondering what those √2's are doing.
- You should be happy that they do no harm.
- You'll find out why they're there soon.
70 QP with basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (φ(xk)·φ(xl))
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
71 QP with basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl (φ(xk)·φ(xl))
We must do R²/2 dot products to get this matrix ready. Each dot product requires m²/2 additions and multiplications. The whole thing costs R²m²/4. Yeeks! ...or does it?
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
72 Quadratic Dot Products
φ(a)·φ(b) = 1 + 2 Σi ai bi + Σi ai²bi² + 2 Σi<j ai aj bi bj
73 Quadratic Dot Products
Just out of casual, innocent interest, let's look at another function of a and b:
(a·b + 1)² = (a·b)² + 2(a·b) + 1
74 Quadratic Dot Products
Just out of casual, innocent interest, let's look at another function of a and b:
(a·b + 1)² = Σi ai²bi² + 2 Σi<j ai aj bi bj + 2 Σi ai bi + 1
They're the same! And this is only O(m) to compute!
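The "they're the same" claim is easy to verify numerically. A short check (my own, spelling out the √2-weighted quadratic basis from slide 69):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic basis: constant, sqrt(2)-scaled linear, pure quadratic, cross terms."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

a = np.array([0.3, -1.2, 2.0])
b = np.array([1.5, 0.4, -0.7])

print(np.dot(phi(a), phi(b)))   # the O(m^2) quadratic-basis dot product
print((np.dot(a, b) + 1) ** 2)  # the O(m) kernel (a.b + 1)^2 -- same number
```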
75 QP with Quadratic basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl ((xk·xl + 1)²)
We must do R²/2 dot products to get this matrix ready. Each dot product now only requires m additions and multiplications.
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
76 Higher Order Polynomials
Polynomial | φ(x) | Cost to build Qkl matrix traditionally | Cost if 100 inputs | φ(a)·φ(b) | Cost to build Qkl matrix efficiently | Cost if 100 inputs
Quadratic | all m²/2 terms up to degree 2 | m²R²/4 | 2,500 R² | (a·b + 1)² | mR²/2 | 50 R²
Cubic | all m³/6 terms up to degree 3 | m³R²/12 | 83,000 R² | (a·b + 1)³ | mR²/2 | 50 R²
Quartic | all m⁴/24 terms up to degree 4 | m⁴R²/48 | 1,960,000 R² | (a·b + 1)⁴ | mR²/2 | 50 R²
77 QP with Quintic basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl ((xk·xl + 1)⁵)
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
78 QP with Quintic basis functions
Maximize Σk αk - ½ Σk Σl αk αl Qkl where Qkl = yk yl ((xk·xl + 1)⁵)
We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million. But there are still worrying things lurking away. What are they?
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
- The fear of overfitting with this enormous number of terms
- The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
Then define w = Σk αk yk φ(xk) (sum over the datapoints with αk > 0)
Then classify with f(x, w, b) = sign(w·φ(x) - b)
79 QP with Quintic basis functions
(Same slide, with the answers:)
- The fear of overfitting with this enormous number of terms: the use of maximum margin magically makes this not a problem.
- The evaluation phase will be very expensive: because each w·φ(x) (see below) needs 75 million operations. What can be done?
Then classify with f(x, w, b) = sign(w·φ(x) - b)
80 QP with Quintic basis functions
(Same slide, with the final answer:)
Evaluating w·φ(x) through the kernel takes only S·m operations, where S = number of support vectors.
Then classify with f(x, w, b) = sign(w·φ(x) - b)
81 QP with Quintic basis functions
(Duplicate of the previous slide)
82 QP with Quintic basis functions
Why SVMs don't overfit as much as you'd think: no matter what the basis function, there are really only up to R parameters (α1, α2, ..., αR), and usually most are set to zero by the maximum margin. Asking for small w·w is like weight decay in neural nets, like the ridge parameter in ridge regression, and like the use of priors in Bayesian regression: all designed to smooth the function and reduce overfitting.
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk φ(xk); evaluating w·φ(x) via the kernel takes only S·m operations (S = number of support vectors).
Then classify with f(x, w, b) = sign(w·φ(x) - b)
83 SVM Kernel Functions
- K(a, b) = (a·b + 1)^d is an example of an SVM kernel function
- Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function
- Radial-basis-style kernel function: K(a, b) = exp(-||a - b||²/(2σ²))
- Neural-net-style kernel function: K(a, b) = tanh(κ a·b - δ)
σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM
84 SVM Implementations
- Sequential Minimal Optimization (SMO): an efficient implementation of SVMs, by Platt
- included in Weka
- SVMlight
- http://svmlight.joachims.org/
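For completeness, a minimal usage sketch of one packaged implementation (scikit-learn's SVC, which is built on an SMO-style solver; the toy data is invented):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up two-class problem
X = np.array([[-2, -1], [-1, -2], [1, 2], [2, 1], [0.1, 0.2], [-0.2, -0.1]])
y = np.array([-1, -1, 1, 1, 1, -1])

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # RBF kernel, soft margin with tradeoff C
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("prediction for (1.5, 1.5):", clf.predict([[1.5, 1.5]]))
```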
85 References
- A tutorial on VC-dimension and Support Vector Machines:
C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
- The VC/SRM/SVM Bible:
Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998