Title: Support Vector Machines (SVMs), Chapter 5 (Duda et al.)
1 Support Vector Machines (SVMs), Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition
Dr. George Bebis
2 Learning through empirical risk minimization
- Estimate g(x) from a finite set of observations by minimizing an error function, for example, the training error (also called the empirical risk), computed from the training samples and their class labels.
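The formula on the slide is not preserved; a standard way of writing the training error referred to here, assuming n training pairs (x_k, y_k) with class labels y_k in {-1, +1}, is

  R_emp(g) = (1/n) * sum_{k=1..n} I[ sign(g(x_k)) != y_k ]

where I[.] is 1 when its argument is true and 0 otherwise, i.e., the fraction of training samples that g(x) misclassifies.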
3 Learning through empirical risk minimization (cont'd)
- Conventional empirical risk minimization does not imply good generalization performance.
- There could be several different functions g(x) which all approximate the training data set well.
- It is difficult to determine which of these functions would have the best generalization performance.
4 Learning through empirical risk minimization (cont'd)
[Figure: two candidate classifiers, labeled "Solution 1" and "Solution 2", that both fit the training data.]
Which solution is better?
5 Statistical Learning: Capacity and VC dimension
- To guarantee good generalization performance, the capacity (i.e., complexity) of the learned functions must be controlled.
- Functions with high capacity are more complicated (i.e., have many degrees of freedom).
[Figure: example fits illustrating high capacity vs. low capacity.]
6 Statistical Learning: Capacity and VC dimension (cont'd)
- How do we measure capacity?
- In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of capacity.
- The VC dimension can predict a probabilistic upper bound on the generalization error of a classifier.
7 Statistical Learning: Capacity and VC dimension (cont'd)
- A function that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space, with probability (1 - δ), where n is the number of training examples (Vapnik, 1995, Structural Risk Minimization Principle).
- This is the idea behind structural risk minimization.
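The bound behind this statement (as given, e.g., in Vapnik's work and in Burges' tutorial) is usually quoted as follows: with probability at least 1 - δ,

  R(g) <= R_emp(g) + sqrt( ( h (ln(2n/h) + 1) - ln(δ/4) ) / n )

where h is the VC dimension of the family of functions and n the number of training examples. Structural risk minimization chooses the function that minimizes the right-hand side, i.e., the training error plus a capacity term.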
8 VC dimension and margin of separation
- Vapnik has shown that maximizing the margin of separation (i.e., the empty space between the classes) is equivalent to minimizing the VC dimension.
- The optimal hyperplane is the one giving the largest margin of separation between the classes.
9 Margin of separation and support vectors
- How is the margin defined?
- The margin is defined by the distance of the nearest training samples from the hyperplane.
- We refer to these samples as support vectors.
- Intuitively speaking, these are the most difficult samples to classify.
10 Margin of separation and support vectors (cont'd)
[Figure: several different solutions and their corresponding margins.]
11 SVM Overview
- Primarily a two-class classifier, but can be extended to multiple classes.
- Performs structural risk minimization to achieve good generalization performance.
- The optimization criterion is the margin of separation between the classes.
- Training is equivalent to solving a quadratic programming problem with linear constraints.
12 Linear SVM: separable case
- Linear discriminant
- Class labels
- Consider the equivalent problem:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
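The equations that accompanied these bullets are not preserved; in the standard notation used with this material they are approximately

  g(x) = w^T x + b                        (linear discriminant)
  y_k in {+1, -1},  k = 1, ..., n         (class labels of the training samples x_k)
  y_k g(x_k) > 0  for all k               (the equivalent problem: every training sample correctly classified)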
13 Linear SVM: separable case (cont'd)
- The distance of a point x_k from the separating hyperplane should satisfy the constraint below.
- To constrain the length of w (uniqueness of the solution), we impose a normalization.
- Using the above constraint, the margin can be expressed in terms of ||w|| (see the sketch below).
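The equations on this slide are not preserved; the usual development is approximately

  y_k g(x_k) / ||w||  >=  M,   M > 0       (every training sample at distance at least M from the hyperplane)
  M ||w|| = 1                              (normalization that makes the solution unique)
  =>  y_k (w^T x_k + b) >= 1  for all k,   and the margin of separation is 2 / ||w||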
14 Linear SVM: separable case (cont'd)
Maximizing the margin leads to a quadratic programming problem (see below).
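A standard statement of this problem, using the normalization from the previous slide:

  minimize    J(w) = (1/2) ||w||^2
  subject to  y_k (w^T x_k + b) >= 1,   k = 1, ..., n

Minimizing ||w|| maximizes the margin 2/||w||; the objective is quadratic and the constraints are linear, hence a quadratic programming problem.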
15 Linear SVM: separable case (cont'd)
- Using Lagrange optimization, minimize the Lagrangian below.
- It is easier to solve the dual problem (Kuhn-Tucker construction).
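The Lagrangian and dual are not preserved on the slide; writing λ_k >= 0 for the Lagrange multipliers, they take the standard form

  L(w, b, λ) = (1/2) ||w||^2 - sum_k λ_k [ y_k (w^T x_k + b) - 1 ]

with the dual problem

  maximize    L_D(λ) = sum_k λ_k - (1/2) sum_j sum_k λ_j λ_k y_j y_k (x_j · x_k)
  subject to  sum_k λ_k y_k = 0,   λ_k >= 0.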
16 Linear SVM: separable case (cont'd)
Note that the dual objective depends on the training samples only through their dot products x_j · x_k.
17 Linear SVM: separable case (cont'd)
The solution, too, involves the data only through dot products.
- It can be shown that if x_k is not a support vector, then the corresponding λ_k = 0.
Only the support vectors contribute to the solution!
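The solution equations lost from this slide have, in standard form, the shape

  w = sum_k λ_k y_k x_k                    (effectively a sum over the support vectors, since λ_k = 0 otherwise)
  g(x) = sum_k λ_k y_k (x_k · x) + b

so both training (the dual) and classification use the data only through dot products.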
18 Linear SVM: non-separable case
- Allow misclassifications (i.e., a soft margin classifier) by introducing positive error (slack) variables ξ_k.
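A standard statement of the resulting soft-margin problem (using the slack variables ξ_k and the trade-off constant c discussed on the next slide):

  minimize    J(w, ξ) = (1/2) ||w||^2 + c * sum_k ξ_k
  subject to  y_k (w^T x_k + b) >= 1 - ξ_k,   ξ_k >= 0,   k = 1, ..., n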
19 Linear SVM: non-separable case (cont'd)
- The constant c controls the trade-off between the margin and the misclassification errors.
- This aims to prevent outliers from affecting the optimal hyperplane.
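As a practical aside (not part of the original slides), this trade-off appears in scikit-learn's SVC as the C parameter; a minimal sketch on assumed toy data:

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two Gaussian blobs with a little overlap (assumed example data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=0.8, size=(50, 2)),
               rng.normal(loc=+1.0, scale=0.8, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Small C -> wide (soft) margin, more slack allowed; large C -> narrow margin, fewer violations.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:g}  support vectors: {clf.n_support_.sum()}  "
          f"training accuracy: {clf.score(X, y):.2f}")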
20 Linear SVM: non-separable case (cont'd)
- It is easier to solve the dual problem (Kuhn-Tucker construction).
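The dual on this slide is, in standard form, the same as in the separable case except that the multipliers are also bounded above by c:

  maximize    L_D(λ) = sum_k λ_k - (1/2) sum_j sum_k λ_j λ_k y_j y_k (x_j · x_k)
  subject to  sum_k λ_k y_k = 0,   0 <= λ_k <= c.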
21 Nonlinear SVM
- Extending these concepts to the non-linear case involves mapping the data to a space of (much) higher dimensionality h.
- Mapping the data to a sufficiently high-dimensional space is likely to make the data linearly separable in that space.
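The mapping referred to here is typically written as

  Φ: R^d -> R^h,   x -> Φ(x),   with h > d (often h >> d)

and the linear SVM machinery is then applied to the transformed samples Φ(x_k), giving g(x) = w^T Φ(x) + b.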
22 Nonlinear SVM (cont'd)
Example
23 Nonlinear SVM (cont'd)
[Figure: decision boundaries obtained with a linear SVM vs. a non-linear SVM.]
24 Nonlinear SVM (cont'd)
- The disadvantage of this approach is that the mapping Φ(x) might be very computationally intensive to compute!
- Is there an efficient way to compute the dot products in the transformed space?
25 The kernel trick
- Compute dot products using a kernel function.
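In standard form, the idea is

  K(x_j, x_k) = Φ(x_j) · Φ(x_k)

so that the dual objective and the decision function can be written entirely in terms of K, e.g.

  g(x) = sum_k λ_k y_k K(x_k, x) + b,

without ever evaluating Φ explicitly.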
26 The kernel trick (cont'd)
- Comments:
- Kernel functions that can be expressed as a dot product in some space satisfy Mercer's condition (see Burges' paper).
- Mercer's condition does not tell us how to construct Φ() or even what the high-dimensional space is.
- Advantages of the kernel trick:
- No need to know Φ().
- Computations remain feasible even if the feature space has very high dimensionality.
27 Polynomial Kernel
K(x, y) = (x · y)^d
28 Polynomial Kernel - Example
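The worked example on this slide is not preserved; a minimal numpy sketch of the same idea, for d = 2 and 2-D inputs (numbers assumed, not the slide's), is:

import numpy as np

def poly_kernel(x, y, d=2):
    # K(x, y) = (x . y)^d, computed directly in the input space
    return np.dot(x, y) ** d

def phi(x):
    # Explicit feature map for d = 2 with 2-D input:
    # (x . y)^2 = x1^2 y1^2 + 2 x1 x2 y1 y2 + x2^2 y2^2
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly_kernel(x, y))        # 1.0  ((1*3 + 2*(-1))^2)
print(np.dot(phi(x), phi(y)))   # 1.0  -- same value, via the 3-D feature space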
29 Common Kernel functions
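The table on this slide is not preserved; kernels commonly listed in this context include

  Polynomial:     K(x, y) = (x · y + 1)^d   (or (x · y)^d)
  Gaussian (RBF): K(x, y) = exp( -||x - y||^2 / (2 σ^2) )
  Sigmoid:        K(x, y) = tanh( κ (x · y) + θ )   (satisfies Mercer's condition only for some κ, θ)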
30 Example
31 Example (cont'd)
h = 6
32 Example (cont'd)
33 Example (cont'd)
(Problem 4)
34 Example (cont'd)
35 Example (cont'd)
36 Example (cont'd)
37 Comments
- SVM training is based on exact optimization, not approximate methods (i.e., it is a global optimization method with no local optima).
- SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well using a small training set.
- Performance depends on the choice of the kernel and its parameters.
- The complexity of the classifier depends on the number of support vectors, not on the dimensionality of the transformed space.
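As a closing illustration (not from the original slides), these points can be seen directly in scikit-learn, where the kernel and its parameters are explicit arguments and the fitted model exposes its support vectors; a small sketch on assumed toy data:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data: two concentric circles (assumed example data).
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel, params in [("linear", {}), ("poly", {"degree": 2}), ("rbf", {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X_train, y_train)
    print(f"{kernel:6s}  support vectors: {clf.support_vectors_.shape[0]:3d}  "
          f"test accuracy: {clf.score(X_test, y_test):.2f}")

The linear kernel should struggle on this data while the polynomial and RBF kernels separate it well, and the number of support vectors (not the feature-space dimensionality) determines the cost of classifying a new point.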