Title: Support Vector Machines (SVMs), Chapter 5 (Duda et al.)
1 Support Vector Machines (SVMs), Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition
Dr. George Bebis
2 Learning through empirical risk minimization
- Estimate g(x) from a finite set of observations by minimizing an error function, for example, the training error (also called the empirical risk), computed from the training samples and their class labels.
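The formula on the slide is not preserved; a standard way of writing the training error referred to here, assuming n training pairs (x_k, y_k) with class labels y_k in {-1, +1}, is

  R_emp(g) = (1/n) * sum_{k=1..n} I[ sign(g(x_k)) != y_k ]

where I[.] is 1 when its argument is true and 0 otherwise, i.e., the fraction of training samples that g(x) misclassifies.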
3 Learning through empirical risk minimization (cont'd)
- Conventional empirical risk minimization does not imply good generalization performance.
- There could be several different functions g(x) which all approximate the training data set well.
- It is difficult to determine which of these functions would have the best generalization performance.
4 Learning through empirical risk minimization (cont'd)
[Figure: two candidate classifiers, labeled "Solution 1" and "Solution 2", that both fit the training data.]
Which solution is better?
5 Statistical Learning: Capacity and VC dimension
- To guarantee good generalization performance, the capacity (i.e., complexity) of the learned functions must be controlled.
- Functions with high capacity are more complicated (i.e., have many degrees of freedom).
[Figure: example fits illustrating high capacity vs. low capacity.]
6 Statistical Learning: Capacity and VC dimension (cont'd)
- How do we measure capacity?
- In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of capacity.
- The VC dimension can predict a probabilistic upper bound on the generalization error of a classifier.
7 Statistical Learning: Capacity and VC dimension (cont'd)
- A function that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space, with probability (1 - δ), where n is the number of training examples (Vapnik, 1995, Structural Risk Minimization Principle).
- This is the idea behind structural risk minimization.
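The bound behind this statement (as given, e.g., in Vapnik's work and in Burges' tutorial) is usually quoted as follows: with probability at least 1 - δ,

  R(g) <= R_emp(g) + sqrt( ( h (ln(2n/h) + 1) - ln(δ/4) ) / n )

where h is the VC dimension of the family of functions and n the number of training examples. Structural risk minimization chooses the function that minimizes the right-hand side, i.e., the training error plus a capacity term.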
8 VC dimension and margin of separation
- Vapnik has shown that maximizing the margin of separation (i.e., the empty space between the classes) is equivalent to minimizing the VC dimension.
- The optimal hyperplane is the one giving the largest margin of separation between the classes.
9 Margin of separation and support vectors
- How is the margin defined?
- The margin is defined by the distance of the nearest training samples from the hyperplane.
- We refer to these samples as support vectors.
- Intuitively speaking, these are the most difficult samples to classify.
10 Margin of separation and support vectors (cont'd)
[Figure: several different solutions and their corresponding margins.]
11 SVM Overview
- Primarily a two-class classifier, but can be extended to multiple classes.
- Performs structural risk minimization to achieve good generalization performance.
- The optimization criterion is the margin of separation between the classes.
- Training is equivalent to solving a quadratic programming problem with linear constraints.
12 Linear SVM: separable case
- Linear discriminant
- Class labels
- Consider the equivalent problem:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
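The equations that accompanied these bullets are not preserved; in the standard notation used with this material they are approximately

  g(x) = w^T x + b                        (linear discriminant)
  y_k in {+1, -1},  k = 1, ..., n         (class labels of the training samples x_k)
  y_k g(x_k) > 0  for all k               (the equivalent problem: every training sample correctly classified)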
13 Linear SVM: separable case (cont'd)
- The distance of a point x_k from the separating hyperplane should satisfy the constraint below.
- To constrain the length of w (uniqueness of the solution), we impose a normalization.
- Using the above constraint, the margin can be expressed in terms of ||w|| (see the sketch below).
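The equations on this slide are not preserved; the usual development is approximately

  y_k g(x_k) / ||w||  >=  M,   M > 0       (every training sample at distance at least M from the hyperplane)
  M ||w|| = 1                              (normalization that makes the solution unique)
  =>  y_k (w^T x_k + b) >= 1  for all k,   and the margin of separation is 2 / ||w||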
14 Linear SVM: separable case (cont'd)
Maximizing the margin leads to a quadratic programming problem (see below).
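A standard statement of this problem, using the normalization from the previous slide:

  minimize    J(w) = (1/2) ||w||^2
  subject to  y_k (w^T x_k + b) >= 1,   k = 1, ..., n

Minimizing ||w|| maximizes the margin 2/||w||; the objective is quadratic and the constraints are linear, hence a quadratic programming problem.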
15 Linear SVM: separable case (cont'd)
- Using Lagrange optimization, minimize the Lagrangian below.
- It is easier to solve the dual problem (Kuhn-Tucker construction).
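The Lagrangian and dual are not preserved on the slide; writing λ_k >= 0 for the Lagrange multipliers, they take the standard form

  L(w, b, λ) = (1/2) ||w||^2 - sum_k λ_k [ y_k (w^T x_k + b) - 1 ]

with the dual problem

  maximize    L_D(λ) = sum_k λ_k - (1/2) sum_j sum_k λ_j λ_k y_j y_k (x_j · x_k)
  subject to  sum_k λ_k y_k = 0,   λ_k >= 0.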
16 Linear SVM: separable case (cont'd)
Note that the dual objective depends on the training samples only through their dot products x_j · x_k.
17 Linear SVM: separable case (cont'd)
The solution, too, involves the data only through dot products.
- It can be shown that if x_k is not a support vector, then the corresponding λ_k = 0.
Only the support vectors contribute to the solution!
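The solution equations lost from this slide have, in standard form, the shape

  w = sum_k λ_k y_k x_k                    (effectively a sum over the support vectors, since λ_k = 0 otherwise)
  g(x) = sum_k λ_k y_k (x_k · x) + b

so both training (the dual) and classification use the data only through dot products.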
18 Linear SVM: non-separable case
- Allow misclassifications (i.e., a soft margin classifier) by introducing positive error (slack) variables ξ_k.
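A standard statement of the resulting soft-margin problem (using the slack variables ξ_k and the trade-off constant c discussed on the next slide):

  minimize    J(w, ξ) = (1/2) ||w||^2 + c * sum_k ξ_k
  subject to  y_k (w^T x_k + b) >= 1 - ξ_k,   ξ_k >= 0,   k = 1, ..., n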
19 Linear SVM: non-separable case (cont'd)
- The constant c controls the trade-off between the margin and the misclassification errors.
- This aims to prevent outliers from affecting the optimal hyperplane.
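As a practical aside (not part of the original slides), this trade-off appears in scikit-learn's SVC as the C parameter; a minimal sketch on assumed toy data:

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two Gaussian blobs with a little overlap (assumed example data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=0.8, size=(50, 2)),
               rng.normal(loc=+1.0, scale=0.8, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Small C -> wide (soft) margin, more slack allowed; large C -> narrow margin, fewer violations.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:g}  support vectors: {clf.n_support_.sum()}  "
          f"training accuracy: {clf.score(X, y):.2f}")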
20 Linear SVM: non-separable case (cont'd)
- It is easier to solve the dual problem (Kuhn-Tucker construction).
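The dual on this slide is, in standard form, the same as in the separable case except that the multipliers are also bounded above by c:

  maximize    L_D(λ) = sum_k λ_k - (1/2) sum_j sum_k λ_j λ_k y_j y_k (x_j · x_k)
  subject to  sum_k λ_k y_k = 0,   0 <= λ_k <= c.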
21 Nonlinear SVM
- Extending these concepts to the non-linear case involves mapping the data to a space of (much) higher dimensionality h.
- Mapping the data to a sufficiently high-dimensional space is likely to make the data linearly separable in that space.
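The mapping referred to here is typically written as

  Φ: R^d -> R^h,   x -> Φ(x),   with h > d (often h >> d)

and the linear SVM machinery is then applied to the transformed samples Φ(x_k), giving g(x) = w^T Φ(x) + b.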
22 Nonlinear SVM (cont'd)
Example
23 Nonlinear SVM (cont'd)
[Figure: decision boundaries obtained with a linear SVM vs. a non-linear SVM.]
24 Nonlinear SVM (cont'd)
- The disadvantage of this approach is that the mapping Φ(x) might be very computationally intensive to compute!
- Is there an efficient way to compute the dot products in the transformed space?
25 The kernel trick
- Compute dot products using a kernel function.
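In standard form, the idea is

  K(x_j, x_k) = Φ(x_j) · Φ(x_k)

so that the dual objective and the decision function can be written entirely in terms of K, e.g.

  g(x) = sum_k λ_k y_k K(x_k, x) + b,

without ever evaluating Φ explicitly.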
26 The kernel trick (cont'd)
- Comments:
- Kernel functions that can be expressed as a dot product in some space satisfy Mercer's condition (see Burges' paper).
- Mercer's condition does not tell us how to construct Φ() or even what the high-dimensional space is.
- Advantages of the kernel trick:
- No need to know Φ().
- Computations remain feasible even if the feature space has very high dimensionality.
27 Polynomial Kernel
K(x, y) = (x · y)^d
28 Polynomial Kernel - Example
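The worked example on this slide is not preserved; a minimal numpy sketch of the same idea, for d = 2 and 2-D inputs (numbers assumed, not the slide's), is:

import numpy as np

def poly_kernel(x, y, d=2):
    # K(x, y) = (x . y)^d, computed directly in the input space
    return np.dot(x, y) ** d

def phi(x):
    # Explicit feature map for d = 2 with 2-D input:
    # (x . y)^2 = x1^2 y1^2 + 2 x1 x2 y1 y2 + x2^2 y2^2
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly_kernel(x, y))        # 1.0  ((1*3 + 2*(-1))^2)
print(np.dot(phi(x), phi(y)))   # 1.0  -- same value, via the 3-D feature space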
29 Common Kernel functions
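The table on this slide is not preserved; kernels commonly listed in this context include

  Polynomial:     K(x, y) = (x · y + 1)^d   (or (x · y)^d)
  Gaussian (RBF): K(x, y) = exp( -||x - y||^2 / (2 σ^2) )
  Sigmoid:        K(x, y) = tanh( κ (x · y) + θ )   (satisfies Mercer's condition only for some κ, θ)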
30 Example
31 Example (cont'd)
h = 6
32 Example (cont'd)
33 Example (cont'd)
(Problem 4)
34 Example (cont'd)
35 Example (cont'd)
36 Example (cont'd)
37 Comments
- SVM training is based on exact optimization, not approximate methods (i.e., it is a global optimization method with no local optima).
- SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well using a small training set.
- Performance depends on the choice of the kernel and its parameters.
- The complexity of the classifier depends on the number of support vectors, not on the dimensionality of the transformed space.
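As a closing illustration (not from the original slides), these points can be seen directly in scikit-learn, where the kernel and its parameters are explicit arguments and the fitted model exposes its support vectors; a small sketch on assumed toy data:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data: two concentric circles (assumed example data).
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel, params in [("linear", {}), ("poly", {"degree": 2}), ("rbf", {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X_train, y_train)
    print(f"{kernel:6s}  support vectors: {clf.support_vectors_.shape[0]:3d}  "
          f"test accuracy: {clf.score(X_test, y_test):.2f}")

The linear kernel should struggle on this data while the polynomial and RBF kernels separate it well, and the number of support vectors (not the feature-space dimensionality) determines the cost of classifying a new point.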