Support Vector Machines (SVMs) Chapter 5 (Duda et al.)

1
Support Vector Machines (SVMs) Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition, Dr. George Bebis
2
Learning through empirical risk minimization
  • Estimate g(x) from a finite set of observations
    by minimizing an error function, for example,
    the training error (also called empirical risk)

[Equation not shown in transcript: the empirical risk computed over the training samples and their class labels]
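The formula itself is not in the transcript; a standard form of the empirical risk, assuming n training pairs (x_k, z_k) with class labels z_k and a per-sample loss L, is:

    R_{emp}(g) \;=\; \frac{1}{n} \sum_{k=1}^{n} L\big(z_k,\; g(x_k)\big)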
3
Learning through empirical risk minimization
(contd)
  • Conventional empirical risk minimization does not
    imply good generalization performance.
  • There could be several different functions g(x)
    which all approximate the training data set well.
  • Difficult to determine which function would have
    the best generalization performance.

4
Learning through empirical risk minimization
(contd)

[Figure: two candidate separating boundaries, Solution 1 and Solution 2, both consistent with the training data. Which solution is better?]
5
Statistical Learning: Capacity and VC dimension
  • To guarantee good generalization performance, the
    capacity (i.e., complexity) of the learned
    functions must be controlled.
  • Functions with high capacity are more complicated
    (i.e., have many degrees of freedom).

[Figure: a high-capacity (complex) decision boundary vs. a low-capacity (simple) one]
6
Statistical Learning: Capacity and VC dimension
(contd)
  • How do we measure capacity?
  • In statistical learning, the Vapnik-Chervonenkis
    (VC) dimension is a popular measure of capacity.
  • The VC dimension can predict a probabilistic
    upper bound on the generalization error of a
    classifier.

7
Statistical Learning: Capacity and VC dimension
(contd)
  • A function that
    (1) minimizes the empirical risk and
    (2) has low VC dimension
    will generalize well, regardless of the
    dimensionality of the input space,
    with probability (1 - δ), where n is the number of
    training examples
    (Vapnik, 1995, Structural Risk Minimization
    Principle).

[Equation not shown in transcript: the structural risk minimization bound]
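The bound is not reproduced in the transcript; the standard form of Vapnik's generalization bound, for a function class of VC dimension h trained on n examples, holding with probability 1 - δ, is:

    R(g) \;\le\; R_{emp}(g) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) \;-\; \ln\frac{\delta}{4}}{n}}

Structural risk minimization chooses the function class that minimizes the sum of the empirical risk and this capacity term.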
8
VC dimension and margin of separation
  • Vapnik has shown that maximizing the margin of
    separation (i.e., empty space between classes) is
    equivalent to minimizing the VC dimension.
  • The optimal hyperplane is the one giving the
    largest margin of separation between the classes.

9
Margin of separation and support vectors
  • How is the margin defined?
  • The margin is defined by the distance of the
    nearest training samples from the hyperplane.
  • We refer to these samples as support vectors.
  • Intuitively speaking, these are the most
    difficult samples to classify.

10
Margin of separation and support vectors (contd)

[Figure: several different separating hyperplanes and their corresponding margins]
11
SVM Overview
  • SVMs are primarily two-class classifiers but can
    be extended to multiple classes.
  • They perform structural risk minimization to
    achieve good generalization performance.
  • The optimization criterion is the margin of
    separation between classes.
  • Training is equivalent to solving a quadratic
    programming problem with linear constraints.

12
Linear SVM: separable case
  • Linear discriminant
  • Class labels
  • Consider the equivalent problem

Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
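The formulas on this slide are not in the transcript; a standard statement, assuming the notation of Duda et al., is:

    g(x) = w^{t} x + w_0, \qquad
    z_k = +1 \;\text{if}\; x_k \in \omega_1, \qquad
    z_k = -1 \;\text{if}\; x_k \in \omega_2

so the decision rule above is equivalent to requiring z_k \, g(x_k) > 0 for every training sample x_k.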
13
Linear SVM: separable case (contd)
  • The distance of a point xk from the separating
    hyperplane should satisfy a margin constraint.
  • To constrain the length of w (for uniqueness), we
    impose a normalization.
  • Using the above constraint, the margin follows
    (see the reconstruction below).
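These constraints are not reproduced in the transcript; a standard reconstruction is: the distance of x_k from the hyperplane is |g(x_k)| / \lVert w \rVert, and requiring

    \frac{z_k \, g(x_k)}{\lVert w \rVert} \;\ge\; b, \qquad b > 0

together with the normalization b \, \lVert w \rVert = 1 gives the canonical constraint

    z_k \, g(x_k) \;\ge\; 1, \qquad k = 1, \dots, n

so the margin of separation equals 2 / \lVert w \rVert.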

14
Linear SVM: separable case (contd)
[Equations not shown in transcript: maximizing the margin is posed as a quadratic programming problem]
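A standard statement of this quadratic program, assuming the canonical constraints from the previous slide, is:

    \min_{w, \, w_0} \; J(w) = \tfrac{1}{2} \lVert w \rVert^{2}
    \quad \text{subject to} \quad z_k (w^{t} x_k + w_0) \ge 1, \;\; k = 1, \dots, n

Minimizing \lVert w \rVert maximizes the margin 2 / \lVert w \rVert, and the constraints are linear in w and w_0.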

15
Linear SVM: separable case (contd)
  • Using Lagrange optimization, minimize the
    Lagrangian below.
  • It is easier to solve the dual problem (Kuhn-Tucker
    construction).
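The Lagrangian and the dual are not reproduced in the transcript; the standard forms are:

    L(w, w_0, \alpha) = \tfrac{1}{2} \lVert w \rVert^{2}
      - \sum_{k=1}^{n} \alpha_k \big[ z_k (w^{t} x_k + w_0) - 1 \big], \qquad \alpha_k \ge 0

and, after eliminating w and w_0, the dual problem:

    \max_{\alpha} \; \sum_{k=1}^{n} \alpha_k
      - \tfrac{1}{2} \sum_{k=1}^{n} \sum_{j=1}^{n} \alpha_k \alpha_j z_k z_j \, (x_k \cdot x_j)
    \quad \text{subject to} \quad \sum_{k=1}^{n} \alpha_k z_k = 0, \;\; \alpha_k \ge 0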

16
Linear SVM: separable case (contd)
  • The solution is given by

[Equations not shown in transcript: the solution is expressed through dot products of training samples]
17
Linear SVM: separable case (contd)
  • It can be shown that if xk is not a support
    vector, then the corresponding αk = 0.

Only the support vectors contribute to the
solution!
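The formulas themselves are not in the transcript; the standard solution is

    w = \sum_{k=1}^{n} \alpha_k z_k x_k, \qquad
    g(x) = \sum_{k \in SV} \alpha_k z_k \, (x_k \cdot x) + w_0

where SV denotes the support vectors (the samples with \alpha_k > 0). Both the dual problem and the final discriminant depend on the data only through dot products, which is what later makes the kernel trick possible.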
18
Linear SVM: non-separable case
  • Allow misclassifications (i.e., a soft-margin
    classifier) by introducing positive error (slack)
    variables ξk, as sketched below.
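The soft-margin formulation is not reproduced in the transcript; its standard form, with c the trade-off constant discussed on the next slide, is:

    \min_{w, \, w_0, \, \xi} \; \tfrac{1}{2} \lVert w \rVert^{2} + c \sum_{k=1}^{n} \xi_k
    \quad \text{subject to} \quad z_k (w^{t} x_k + w_0) \ge 1 - \xi_k, \;\; \xi_k \ge 0

A sample with 0 < \xi_k \le 1 lies inside the margin but on the correct side of the boundary; \xi_k > 1 corresponds to a misclassification.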

 
19
Linear SVM: non-separable case (contd)
  • The constant c controls the trade-off between the
    margin width and the misclassification errors.
  • This helps prevent outliers from dominating the
    optimal hyperplane.

20
Linear SVM: non-separable case (contd)
  • It is easier to solve the dual problem (Kuhn-Tucker
    construction), shown below.
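The dual is not reproduced in the transcript; its standard form differs from the separable case only in the box constraint on the multipliers:

    \max_{\alpha} \; \sum_{k=1}^{n} \alpha_k
      - \tfrac{1}{2} \sum_{k=1}^{n} \sum_{j=1}^{n} \alpha_k \alpha_j z_k z_j \, (x_k \cdot x_j)
    \quad \text{subject to} \quad \sum_{k=1}^{n} \alpha_k z_k = 0, \;\; 0 \le \alpha_k \le c

The slack variables \xi_k drop out of the dual entirely.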

21
Nonlinear SVM
  • Extending these concepts to the non-linear case
    involves mapping the data to a high-dimensional
    space (of dimensionality h) through a mapping Φ(x).
  • Mapping the data to a sufficiently high-dimensional
    space is likely to make the data linearly
    separable in that space.

22
Nonlinear SVM (contd)
Example
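The example on this slide is not in the transcript; a minimal illustration of the idea, not taken from the slides, is a one-dimensional data set that is not separable by a single threshold on the line but becomes linearly separable after the mapping

    \Phi(x) = (x, \; x^{2})

since a class boundary of the form x^{2} = a^{2} is a straight line in the (x, x^2) plane.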

23
Nonlinear SVM (contd)
[Figure: decision boundaries of a linear SVM vs. a non-linear SVM on the same data]

24
Nonlinear SVM (contd)
  • The disadvantage of this approach is that the
    mapping Φ(x) might be very expensive to compute.
  • Is there an efficient way to compute the dot
    products Φ(x) · Φ(y) needed by the non-linear SVM?

25
The kernel trick
  • Compute dot products using a kernel function
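The definition is not written out in the transcript; the standard statement of the kernel trick is

    K(x, y) = \Phi(x) \cdot \Phi(y)

so the non-linear discriminant can be evaluated without ever forming \Phi explicitly:

    g(x) = \sum_{k \in SV} \alpha_k z_k \, K(x_k, x) + w_0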

26
The kernel trick (contd)
  • Comments
  • Kernel functions which can be expressed as a dot
    product in some space satisfy Mercer's
    condition (see Burges' paper).
  • Mercer's condition does not tell us how to
    construct Φ() or even what the high-dimensional
    space is.
  • Advantages of the kernel trick
  • No need to know Φ()
  • Computations remain feasible even if the feature
    space has high dimensionality.
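As a concrete check of this point (an illustration, not from the slides; the points are made up), the degree-2 polynomial kernel evaluated in the original 2-D space matches the dot product in the explicitly mapped 3-D feature space:

    import numpy as np

    # Two arbitrary 2-D points, used only for illustration.
    x = np.array([1.0, 2.0])
    y = np.array([3.0, 0.5])

    def phi(v):
        # Explicit degree-2 feature map: (x1^2, sqrt(2)*x1*x2, x2^2).
        return np.array([v[0]**2, np.sqrt(2.0) * v[0] * v[1], v[1]**2])

    lhs = np.dot(x, y) ** 2        # kernel value computed in the original space
    rhs = np.dot(phi(x), phi(y))   # dot product in the mapped space
    print(lhs, rhs)                # both print 16.0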

27
Polynomial Kernel
K(x, y) = (x · y)^d

28
Polynomial Kernel - Example
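The worked example is not in the transcript; a standard example of this kind takes d = 2 and two-dimensional inputs x = (x_1, x_2), y = (y_1, y_2):

    (x \cdot y)^{2} = (x_1 y_1 + x_2 y_2)^{2}
                    = x_1^{2} y_1^{2} + 2 x_1 x_2 y_1 y_2 + x_2^{2} y_2^{2}
                    = \Phi(x) \cdot \Phi(y)
    \quad \text{with} \quad \Phi(x) = \big(x_1^{2}, \; \sqrt{2}\, x_1 x_2, \; x_2^{2}\big)

so one dot product in the original space replaces an explicit mapping into a 3-dimensional feature space.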

29
Common Kernel functions
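The table on this slide is not in the transcript; kernels commonly listed in this context (the exact forms and parameter names on the original slide may differ) are:

    Polynomial:     K(x, y) = (x \cdot y + 1)^{d}
    Gaussian (RBF): K(x, y) = \exp\!\big(-\lVert x - y \rVert^{2} / (2\sigma^{2})\big)
    Sigmoid:        K(x, y) = \tanh(\kappa \, x \cdot y - \delta)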

30
Example

31
Example (contd)

h = 6
32
Example (contd)

33
Example (contd)
(Problem 4)

34
Example (contd)

35
Example (contd)
36
Example (contd)

37
Comments
  • SVM training is based on exact optimization, not
    on approximate methods (it is a global optimization
    problem with no local optima).
  • SVMs appear to avoid overfitting in high-dimensional
    spaces and to generalize well from small training
    sets.
  • Performance depends on the choice of the kernel
    and its parameters.
  • Its complexity depends on the number of support
    vectors, not on the dimensionality of the
    transformed space.
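The slides contain no code; the following minimal scikit-learn sketch (the data are made up, only standard scikit-learn calls are used) illustrates the last two points: the kernel and its parameters are explicit choices, and the fitted model exposes its support vectors:

    import numpy as np
    from sklearn.svm import SVC

    # Toy two-class, 2-D data (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 0.7, size=(20, 2)),
                   rng.normal(+1.0, 0.7, size=(20, 2))])
    z = np.array([-1] * 20 + [+1] * 20)

    # The kernel, its parameters (here RBF with gamma), and the
    # soft-margin constant C control the classifier's behavior.
    clf = SVC(kernel='rbf', gamma=0.5, C=1.0)
    clf.fit(X, z)

    # Run-time complexity depends on the number of support vectors,
    # not on the dimensionality of the transformed space.
    print("support vectors per class:", clf.n_support_)
    print("prediction at the origin:", clf.predict([[0.0, 0.0]]))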