Transcript and Presenter's Notes

Title: Outline


1
Outline
  • Introduction
  • Optimal hyperplane for linearly separable
    patterns
  • How to build a support vector machine for pattern
    recognition
  • Example: XOR problem

2
Multilayer Perceptron (MLP) Properties
  • Universal approximation of continuous nonlinear
    functions.
  • Learning from input-output patterns, either
    off-line or on-line.
  • Parallel network architecture, multiple inputs
    and outputs
  • Use in feedforward and recurrent networks
  • Use in supervised and unsupervised
    learning
  • Wide range of applications.

3
Problems of MLP
  • Existence of many local minima!
  • How many neurons are needed for a given task?

4
Support Vector Machines (SVM)
  • Nonlinear classification and function estimation
    by convex optimization with a unique solution and
    primal-dual interpretations.
  • Number of neurons automatically follows from a
    convex program.
  • Learning and generalization in huge dimensional
    input spaces (able to avoid the curse of
    dimensionality!)
  • Use of kernels (e.g. linear, polynomial, RBF,
    MLP, splines, ... ).
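
The slides contain no code; as a hedged illustration of the kernel choices listed above, the sketch below uses scikit-learn's SVC on a synthetic data set (the library, the toy data, and the use of the sigmoid kernel as a stand-in for the MLP kernel are assumptions, not part of the presentation).

```python
# Illustrative sketch only: try the kernels mentioned above with scikit-learn.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # toy data, not from the slides

for kernel in ["linear", "poly", "rbf", "sigmoid"]:  # "sigmoid" plays the role of the MLP kernel
    clf = SVC(kernel=kernel).fit(X, y)
    # The number of support vectors (the "neurons") follows from the convex program.
    print(kernel, int(clf.n_support_.sum()), round(clf.score(X, y), 3))
```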

5
Support Vector Machines (SVM)
6
Outline
  • Introduction
  • Optimal hyperplane for linearly separable
    patterns
  • How to build a support vector machine for pattern
    recognition
  • Example: XOR problem

7
Linearly Separable Patterns
  • Consider the training samples {(x_i, d_i)}, i = 1, ..., N,
  • where x_i is the input pattern for the ith example and
  • d_i is the corresponding desired output.
  • To begin with, we assume that the pattern represented by the
    subset d_i = +1 and the pattern represented by the subset
    d_i = −1 are linearly separable.

8
Hyperplane
  • The equation of a decision surface in the form of a hyperplane
    that does the separation is
      w^T x + b = 0
  • where x is an input vector, w is an adjustable weight vector,
    and b is a bias.
  • We may write
      w^T x_i + b ≥ 0  for d_i = +1
      w^T x_i + b < 0  for d_i = −1                        (6.2)
  • For a given weight vector w and bias b, the separation between
    the hyperplane and the closest data point is called the margin
    of separation, denoted by ρ.

9
Optimal Separating Hyperplanes
Suboptimal (dashed) and optimal (bold)
separating hyperplanes
10
Margin of Separation
  • The goal of a support vector machine is to find
    the particular hyperplane for which the margin of
    separation ρ is maximized.

11
Optimal Hyperplane
  • Let wo and bo denote the optimum values of the
    weight vector and bias, respectively.
  • The optimal hyperplane is defined by
      w_o^T x + b_o = 0                                    (6.3)
  • The discriminant function
      g(x) = w_o^T x + b_o                                 (6.4)
  • gives an algebraic measure of the distance from x to the
    optimal hyperplane.
12
Figure for Optimal Hyperplane
(Figure: a point x, its normal projection x_p onto the optimal
hyperplane, and the algebraic distance r between them.)
13
Optimal Hyperplane
  • The easiest way to see this is to express x as
      x = x_p + r (w_o / ||w_o||)
  • where x_p is the normal projection of x onto the optimal
    hyperplane, and r is the desired algebraic distance.
  • r is positive if x is on the positive side of the
    optimal hyperplane.
  • r is negative if x is on the negative side.

14
Optimal Hyperplane
  • By definition, g(x_p) = 0. It follows that
      g(x) = w_o^T x + b_o = r ||w_o||
  • or
      r = g(x) / ||w_o||                                   (6.5)
15
Optimal Hyperplane
  • The distance from the origin (i.e., x = 0) to the optimal
    hyperplane is given by b_o / ||w_o||.
  • If b_o > 0, the origin is on the positive side of the
    optimal hyperplane.
  • If b_o < 0, it is on the negative side.
  • If b_o = 0, the optimal hyperplane passes through the origin.
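
A small numerical check of these distance statements (the numbers and the use of numpy are illustrative assumptions, not from the slides):

```python
import numpy as np

w = np.array([3.0, 4.0])          # weight vector, ||w|| = 5
b = 2.0                           # bias
g = lambda x: w @ x + b           # discriminant function g(x) = w^T x + b

x = np.array([1.0, 1.0])
print(g(x) / np.linalg.norm(w))   # signed distance r = g(x)/||w||  -> 1.8
print(b / np.linalg.norm(w))      # distance of the hyperplane from the origin -> 0.4
```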

16
Optimal Hyperplane
  • Given the training set {(x_i, d_i)}, the pair (w_o, b_o) must
    satisfy the constraint
      w_o^T x_i + b_o ≥ +1  for d_i = +1
      w_o^T x_i + b_o ≤ −1  for d_i = −1                   (6.6)
  • If (6.2) holds, i.e., the patterns are linearly separable, we
    can always rescale w_o and b_o so that (6.6) holds; this
    scaling operation leaves Eq. (6.3) unaffected.
17
(Figure: linearly separable patterns and the rescaled result.)
18
Optimal Hyperplane
  • The particular data points (xi,di) for which the
    first or second line of (6.6) is satisfied with
    the equality sign are called support vectors.
  • The support vectors are those data points that
    lie closest to the decision surface and are
    therefore the most difficult to classify.
  • Support vectors have a direct bearing on the
    optimum location of the decision surface.

19
Optimal Hyperplane
  • Consider a support vector x^(s) for which d^(s) = +1. Then
      g(x^(s)) = w_o^T x^(s) + b_o = ±1  for d^(s) = ±1    (6.7)
  • From (6.5), the algebraic distance from the support vector
    x^(s) to the optimal hyperplane is
      r = g(x^(s)) / ||w_o||
        = +1 / ||w_o||  if d^(s) = +1
        = −1 / ||w_o||  if d^(s) = −1                      (6.8)

20
Optimal Hyperplane
  • Let ρ denote the optimum value of the margin of separation
    between the two classes that constitute the training set.
    From (6.8),
      ρ = 2r = 2 / ||w_o||
  • ⇒ Maximizing the margin of separation ρ is equivalent to
    minimizing the Euclidean norm of the weight vector w.

21
Alternative View for ρ
(Figure: the support vectors and the margin of separation ρ
between the two classes.)
22
Primal problem
  • The primal problem:
  • Given the training sample {(x_i, d_i)}, i = 1, ..., N, find
    the optimal values of the weight vector w and bias b such
    that they satisfy the constraints
      d_i (w^T x_i + b) ≥ 1   for i = 1, 2, ..., N
  • and the weight vector w minimizes the cost function
      Φ(w) = ½ w^T w
    (A solver sketch for this convex program follows below.)
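
A minimal sketch of this primal problem solved as a convex quadratic program with cvxpy; the library and the hand-made linearly separable toy set are assumptions, not part of the slides.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5], [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
cost = 0.5 * cp.sum_squares(w)                    # Phi(w) = 1/2 w^T w
constraints = [cp.multiply(d, X @ w + b) >= 1]    # d_i (w^T x_i + b) >= 1
cp.Problem(cp.Minimize(cost), constraints).solve()
print(w.value, b.value)
```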

23
Primal problem (cont.)
  • The scaling factor ½ is for convenience of
    presentation.
  • We may solve the constrained optimization problem
    using the method of Lagrange multipliers

24
Constrained Optimization Problem
Objective (cost) function:  minimize f(x),  x ∈ R^n
Equality constraints:  h(x) = 0,  where h = [h_1, h_2, ..., h_m]^T
We assume that the function h is continuously differentiable,
that is, h ∈ C^1.
25
Two-dimension case
maximize (or minimize) f(x,y)
subject to g(x,y) = 0
(Figure: level curves of f and the constraint curve g(x,y) = 0;
at the constrained extremum the two gradients are parallel.)
26
Lagrange Multiplier
  • Hence, they are proportional to each other: ∇f = λ∇g, where λ
    is some number whose value we do not yet know. This number λ
    is called a Lagrange multiplier.
  • Rearranging, we see that
      ∇f − λ∇g = 0 = ∇F
  • where F(x,y) = f(x,y) − λ g(x,y).
  • F(x,y) is constructed as above; we then set all partial
    derivatives of F to zero and solve the resulting set of
    simultaneous equations.

27
Example
Solution
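
The worked example and its solution did not survive the transcript; as a stand-in of the same kind, here is a minimal Lagrange-multiplier calculation (the particular f and g are chosen only for illustration):

```latex
\begin{aligned}
&\text{Maximize } f(x,y) = xy \ \text{ subject to } g(x,y) = x + y - 2 = 0. \\
&F(x,y) = xy - \lambda (x + y - 2) \\
&\frac{\partial F}{\partial x} = y - \lambda = 0, \qquad
 \frac{\partial F}{\partial y} = x - \lambda = 0, \qquad
 x + y - 2 = 0 \\
&\Rightarrow\ x = y = \lambda = 1, \qquad f(1,1) = 1 .
\end{aligned}
```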
28
Remarks
  • Lagrange condition is only necessary but not
    sufficient.
  • That is, a solution x satisfying the above Lagrange condition
    need not be an extremizer.
  • Only when certain additional conditions are satisfied can we
    say that x is an extremizer.

29
Primal Problem
  • The cost function Φ(w) is a convex function of w.
  • The constraints are linear in w.

30
Lagrange function
  • Construct the Lagrange function (or Lagrangian)
      J(w, b, α) = ½ w^T w − Σ_{i=1}^{N} α_i [ d_i (w^T x_i + b) − 1 ]
  • where the α_i ≥ 0 are Lagrange multipliers.

Formally, we have to minimize J(w, b, α) with respect to w and b
and maximize it with respect to α (a saddle point of the Lagrangian).
The patterns x_i for which α_i > 0 are called support vectors.
31
Condition 1:  ∂J(w, b, α)/∂w = 0,  which yields  w = Σ_{i=1}^{N} α_i d_i x_i
32
Condition 2:  ∂J(w, b, α)/∂b = 0,  which yields  Σ_{i=1}^{N} α_i d_i = 0
33
Optimal Hyperplane for Linearly Separable Patterns
  • Two conditions of optimality:
  • Condition 1:
      w = Σ_{i=1}^{N} α_i d_i x_i                          (6.12)
  • Condition 2:
      Σ_{i=1}^{N} α_i d_i = 0                              (6.13)
34
Dual Problem
  • If the primal problem has an optimal solution,
    the dual problem also has an optimal solution,
    and the corresponding optimal values are equal.
  • In order for w_o to be an optimal primal solution and α_o to
    be an optimal dual solution,
  • it is necessary and sufficient that w_o is feasible for the
    primal problem, and
      Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o)

35
Expanding the Lagrangian term by term,
    J(w, b, α) = ½ w^T w − Σ_{i=1}^{N} α_i d_i w^T x_i − b Σ_{i=1}^{N} α_i d_i + Σ_{i=1}^{N} α_i     (6.15)
36
The third term on the right-hand side is zero by condition 2, and
from condition 1,
    w^T w = Σ_{i=1}^{N} α_i d_i w^T x_i = Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j     (6.16)
37
Dual Problem
  • The dual problem:
  • Given the training sample {(x_i, d_i)}, i = 1, ..., N, find
    the Lagrange multipliers {α_i}, i = 1, ..., N, that maximize
    the objective function
      Q(α) = Σ_{i=1}^{N} α_i − ½ Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j
  • subject to the constraints
  • (1)  Σ_{i=1}^{N} α_i d_i = 0
  • (2)  α_i ≥ 0  for i = 1, 2, ..., N

The dual problem is cast entirely in terms of the training data.
Q(α) to be maximized depends only on the input patterns in the
form of dot products.
38
Dual Problem (cont.)

39
Optimal Hyperplane for Linearly Separable Patterns
Step 2: Compute the optimum weight vector
    w_o = Σ_{i=1}^{N} α_{o,i} d_i x_i
Step 3: Compute the optimum bias, using a support vector x^(s)
with d^(s) = +1:
    b_o = 1 − w_o^T x^(s)
(A numerical sketch of these steps follows below.)
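
A minimal numerical sketch of this procedure (solve the dual QP, then recover the weight vector and bias), assuming numpy, cvxpy, and a toy data set that is not part of the slides:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])   # toy inputs
d = np.array([1.0, 1.0, -1.0, -1.0])                              # desired outputs
N = len(d)

alpha = cp.Variable(N, nonneg=True)                   # constraint (2): alpha_i >= 0
# Q(alpha) = sum_i alpha_i - 1/2 || sum_i alpha_i d_i x_i ||^2
Q = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, d))
cp.Problem(cp.Maximize(Q), [d @ alpha == 0]).solve()  # constraint (1)

a = alpha.value
w = X.T @ (a * d)                                     # Step 2: w_o = sum_i alpha_i d_i x_i
s = int(np.argmax(a))                                 # index of a support vector
b = d[s] - w @ X[s]                                   # Step 3: from d_s (w^T x_s + b) = 1
print(np.round(a, 4), w, b)
```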
40
Nonseparable Patterns
(Figure: a data point that violates the margin but is still
correctly classified, and a data point that falls on the wrong
side of the decision surface, i.e., a misclassification.)
41
Nonseparable Patterns (cont.)
  • It is not possible to construct a separating
    hyperplane without encountering classification
    errors.
  • The margin of separation between classes is said to be soft
    if a data point (x_i, d_i) violates the following condition:
      d_i (w^T x_i + b) ≥ +1,   i = 1, 2, ..., N

42
Soft-Margin Hyperplane
  • To allow for such violations, so-called slack variables
    ξ_i ≥ 0 are introduced.
  • The relaxed separation constraint is
      d_i (w^T x_i + b) ≥ 1 − ξ_i,   i = 1, 2, ..., N

43
Soft-Margin Hyperplane
  • Our goal is to find a separating hyperplane for
    which the misclassification error, averaged on
    the training set, is minimized.

44
Soft-Margin Hyperplane
  • Unfortunately, minimization of the number of margin violations
    with respect to w is a nonconvex optimization problem that is
    NP-complete.
  • To make the optimization problem mathematically tractable, we
    approximate this functional by the sum of the slack variables,
      Σ_{i=1}^{N} ξ_i

45
Primal Problem
  • We reformulate the primal optimization problem as
    follows.

minimize
    Φ(w, ξ) = ½ w^T w + C Σ_{i=1}^{N} ξ_i
subject to
    d_i (w^T x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0,   i = 1, 2, ..., N
46
Soft-Margin Hyperplane
  • The parameter C controls the tradeoff between
    complexity of the machine and the number of
    nonseparable points.
  • It may be viewed as a form of a "regularization"
    parameter.
  • The parameter C is determined experimentally via
    the standard use of a training/test set.
  • Generally speaking, a technique known as cross-validation can
    be used to verify performance using only the training set
    (a sketch follows below).
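
A hedged sketch of selecting C by cross-validation with scikit-learn; the library, the grid of C values, and the synthetic data are assumptions, not from the slides.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # toy data

# 5-fold cross-validation over a small grid of C values, using only the training set.
search = GridSearchCV(SVC(kernel="linear"), param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```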

47
Dual Problem
  • The dual of the soft-margin problem has the same objective as
    before,
      Q(α) = Σ_{i=1}^{N} α_i − ½ Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j
  • subject to Σ_{i=1}^{N} α_i d_i = 0 and the box constraints
    0 ≤ α_i ≤ C for i = 1, 2, ..., N.
48
Outline
  • Introduction
  • Optimal hyperplane for linearly separable
    patterns
  • How to build a support vector machine for pattern
    recognition
  • Example: XOR problem

49
How to Build a Support Vector Machine for Pattern
Recognition
  • The idea of an SVM hinges on two operations:
  • (1) Nonlinear mapping of an input vector into a
    high-dimensional feature space that is hidden from both the
    input and the output.
  • (2) Construction of an optimal hyperplane for separating the
    features discovered in step 1.

50
Non-linear
51
How to Build a Support Vector Machine for Pattern
Recognition
52
(No Transcript)
53
Nonlinear Support Vector Machine
(Figure: a classification problem in the original space and its
image in the feature space, where it becomes linearly separable.)
54
Kernel Function
  • From now on, we can compute the inner product between the
    projections of two points into the feature space without
    explicitly evaluating their coordinates:
      K(x, x') = φ(x)^T φ(x')
  • The feature space is not uniquely determined by the kernel
    function; that is, different feature maps φ can induce the
    same kernel K.

55
Kernel Trick
  • Expressing everything in terms of inner products
    in feature space and using the kernel function to
    efficiently compute these inner products is the
    kernel trick.

(Figure: the mapping φ from the original space to the feature space.)
  • Because everything is done with the kernel function, it is not
    even necessary to know the feature space and the inner product
    within it.

56
Example
57
Example (cont.)
Hence
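
The worked example on these two slides did not survive the transcript; in the same spirit, here is a numerical check of the kernel trick for the polynomial kernel K(x, y) = (1 + x^T y)^2 in two dimensions (the explicit feature map and the test points below are illustrative assumptions).

```python
import numpy as np

def phi(x):
    # Explicit feature map whose inner product reproduces (1 + x^T y)^2 in 2-D.
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def K(x, y):
    return (1.0 + x @ y) ** 2

x = np.array([0.7, -1.2])
y = np.array([1.5, 0.3])
print(K(x, y), phi(x) @ phi(y))   # both evaluate to the same number
```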
58
Mercer's theorem
59
Mercer's theorem
  • Mercer's theorem provides a coordinate basis representation.
  • Mercer's theorem guarantees the existence of the kernel trick.

60
Some Kernels
  • Linear:      K(x, x_i) = x^T x_i
  • Polynomial:  K(x, x_i) = (x^T x_i + 1)^p
  • RBF:         K(x, x_i) = exp( −||x − x_i||² / (2σ²) )
  • MLP (sigmoid): K(x, x_i) = tanh( β_0 x^T x_i + β_1 ),
    for certain values of β_0 and β_1 only
61
Decision Function in Feature Space
Since the optimum weight vector in the feature space is
    w_o = Σ_{i=1}^{N} α_{o,i} d_i φ(x_i)
we can rewrite the decision function as
    f(x) = sign( Σ_{i=1}^{N} α_{o,i} d_i K(x, x_i) + b_o )
62
Remarks
  • In the radial-basis function (RBF) type of a
    support vector machine, the number of
    radial-basis functions and their centers are
    determined automatically by the number of support
    vectors and their values, respectively.
  • In the two-layer perceptron type of a support
    vector machine, the number of hidden neurons and
    their weight vectors are determined automatically
    by the number of support vectors and their
    values, respectively.

63
Example
64
Architecture of support vector machine
(Figure: architecture of the support vector machine. An input
layer of size m_0 receives the input vector x = (x_1, ..., x_{m_0});
a hidden layer of m_1 inner-product kernels computes
K(x, x_1), K(x, x_2), ..., K(x, x_{m_1}); a linear output neuron
with bias b produces the output y.)
65
Outline
  • Introduction
  • Optimal hyperplane for linearly separable
    patterns
  • How to build a support vector machine for pattern
    recognition
  • Example: XOR problem

66
Example: XOR problem
Let (Cherkassky and Mulier, 1998)
    K(x, x_i) = (1 + x^T x_i)^2
with x = [x_1, x_2]^T and x_i = [x_{i1}, x_{i2}]^T.
67
  • Expressing the inner-product kernel in terms of monomials,
      K(x, x_i) = 1 + x_1² x_{i1}² + 2 x_1 x_2 x_{i1} x_{i2} + x_2² x_{i2}² + 2 x_1 x_{i1} + 2 x_2 x_{i2}
  • The image of the input vector x induced in the feature space
    is therefore
      φ(x) = [ 1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2 ]^T
  • Similarly,
      φ(x_i) = [ 1, x_{i1}², √2 x_{i1} x_{i2}, x_{i2}², √2 x_{i1}, √2 x_{i2} ]^T

68
To Build the Kernel Matrix
The XOR training data (input vector x_i, desired response d_i):
    x_1 = (−1, −1),  d_1 = −1
    x_2 = (−1, +1),  d_2 = +1
    x_3 = (+1, −1),  d_3 = +1
    x_4 = (+1, +1),  d_4 = −1
69
  • Then we can compute the kernel matrix, with entries
    K_{ij} = (1 + x_i^T x_j)^2, as follows:
      K = [ 9  1  1  1
            1  9  1  1
            1  1  9  1
            1  1  1  9 ]

70
The objective function of the dual problem is then
    Q(α) = α_1 + α_2 + α_3 + α_4
           − ½ ( 9α_1² − 2α_1α_2 − 2α_1α_3 + 2α_1α_4 + 9α_2² + 2α_2α_3 − 2α_2α_4 + 9α_3² − 2α_3α_4 + 9α_4² )
71
  • Optimizing Q(α) with respect to the Lagrange multipliers
    yields the set of simultaneous equations
      9α_1 − α_2 − α_3 + α_4 = 1
      −α_1 + 9α_2 + α_3 − α_4 = 1
      −α_1 + α_2 + 9α_3 − α_4 = 1
      α_1 − α_2 − α_3 + 9α_4 = 1
  • Hence, the optimum values of the Lagrange multipliers are
      α_{o,1} = α_{o,2} = α_{o,3} = α_{o,4} = 1/8
  • All four Lagrange multipliers are greater than 0, and
    therefore all four input vectors {x_i}, i = 1, ..., 4, are
    support vectors.

72
  • We find the optimum weight vector
      w_o = Σ_{i=1}^{4} α_{o,i} d_i φ(x_i)
          = (1/8) [ −φ(x_1) + φ(x_2) + φ(x_3) − φ(x_4) ]
          = [ 0, 0, −1/√2, 0, 0, 0 ]^T

73
  • From (6.33), the optimal hyperplane is defined by
    w_o^T φ(x) = 0, that is,
      −(1/√2)(√2 x_1 x_2) = −x_1 x_2 = 0

74
The output of the machine is y = −x_1 x_2, which reproduces the
XOR mapping on the four training points.
(Figure: the corresponding decision boundary; a numerical check
follows below.)
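
A numerical check of this XOR example (numpy is assumed; the data ordering and kernel follow the reconstruction above):

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)  # XOR inputs
d = np.array([-1.0, 1.0, 1.0, -1.0])                             # desired responses

K = (1.0 + X @ X.T) ** 2          # kernel matrix: 9 on the diagonal, 1 elsewhere
H = np.outer(d, d) * K

# All four points are support vectors, so dQ/dalpha_i = 0 gives the linear system H alpha = 1.
alpha = np.linalg.solve(H, np.ones(4))
print(alpha)                       # [0.125 0.125 0.125 0.125]

# Decision function g(x) = sum_i alpha_i d_i K(x, x_i); the bias works out to zero here.
def g(x):
    return float(np.sum(alpha * d * (1.0 + X @ x) ** 2))

for x in X:
    print(x, g(x), -x[0] * x[1])   # g(x) matches -x1*x2 on the training points
```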