Title: Outline
1 Outline
- Introduction
- Optimal hyperplane for linearly separable patterns
- How to build a support vector machine for pattern recognition
- Example: XOR problem
2 Multilayer Perceptron (MLP) Properties
- Universal approximation of continuous nonlinear functions.
- Learning from input-output patterns, either off-line or on-line.
- Parallel network architecture with multiple inputs and outputs.
- Used in feedforward and recurrent networks.
- Used in supervised and unsupervised learning.
- Many applications.
3 Problems of MLP
- Existence of many local minima!
- How many neurons are needed for a given task?
4 Support Vector Machines (SVM)
- Nonlinear classification and function estimation by convex optimization, with a unique solution and primal-dual interpretations.
- The number of neurons follows automatically from a convex program.
- Learning and generalization in high-dimensional input spaces (able to avoid the curse of dimensionality!).
- Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, ...).
5 Support Vector Machines (SVM)
6 Outline
- Introduction
- Optimal hyperplane for linearly separable patterns
- How to build a support vector machine for pattern recognition
- Example: XOR problem
7 Linearly Separable Patterns
- Consider the training samples {(x_i, d_i)}, i = 1, ..., N,
- where x_i is the input pattern for the ith example and
- d_i is the corresponding desired output.
- To begin with, we assume that the patterns represented by the subset d_i = +1 and the patterns represented by the subset d_i = -1 are linearly separable.
8 Hyperplane
- The equation of a decision surface in the form of a hyperplane that does the separation is
  w^T x + b = 0
- where x is an input vector, w is an adjustable weight vector, and b is a bias.
- We may write
  w^T x_i + b ≥ 0 for d_i = +1
  w^T x_i + b < 0 for d_i = -1        (6.2)
- For a given weight vector w and bias b, the separation between the hyperplane and the closest data point is called the margin of separation, denoted by ρ.
9 Optimal Separating Hyperplanes
Figure: suboptimal (dashed) and optimal (bold) separating hyperplanes.
10 Margin of Separation
- The goal of a support vector machine is to find the particular hyperplane for which the margin of separation ρ is maximized.
11 Optimal Hyperplane
- Let w_o and b_o denote the optimum values of the weight vector and bias, respectively.
- The optimal hyperplane is defined by
  w_o^T x + b_o = 0        (6.3)
- The discriminant function
  g(x) = w_o^T x + b_o        (6.4)
- gives an algebraic measure of the distance from x to the optimal hyperplane.
12 Figure for Optimal Hyperplane
Figure: a point x and its normal projection x_p onto the optimal hyperplane.
13 Optimal Hyperplane
- The easiest way to see this is to express x as
  x = x_p + r (w_o / ||w_o||)
- where x_p is the normal projection of x onto the optimal hyperplane, and r is the desired algebraic distance.
- r is positive if x is on the positive side of the optimal hyperplane.
- r is negative if x is on the negative side.
14 Optimal Hyperplane
- Since, by definition, g(x_p) = 0, it follows that
  g(x) = w_o^T x + b_o = r ||w_o||
- or
  r = g(x) / ||w_o||        (6.5)
15 Optimal Hyperplane
- The distance from the origin (i.e., x = 0) to the optimal hyperplane is given by b_o / ||w_o||, as the small example below illustrates.
- If b_o > 0, the origin is on the positive side of the optimal hyperplane.
- If b_o < 0, it is on the negative side.
- If b_o = 0, the optimal hyperplane passes through the origin.
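As a small numeric illustration (numbers of my own choosing, not from the slide): with w_o = (3, 4)^T and b_o = 5, the distance from the origin to the optimal hyperplane is b_o / ||w_o|| = 5/5 = 1, and since b_o > 0 the origin lies on the positive side.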
16 Optimal Hyperplane
- Given the training set {(x_i, d_i)}.
- The pair (w_o, b_o) must satisfy the constraint
  w_o^T x_i + b_o ≥ +1 for d_i = +1
  w_o^T x_i + b_o ≤ -1 for d_i = -1        (6.6)
- If (6.2) holds, i.e. the patterns are linearly separable, we can rescale w_o and b_o such that (6.6) holds; this scaling operation leaves Eq. (6.3) unaffected.
17 Linearly Separable Patterns (Rescaled)
Figure: the linearly separable patterns after the rescaling that makes (6.6) hold.
18 Optimal Hyperplane
- The particular data points (x_i, d_i) for which the first or second line of (6.6) is satisfied with the equality sign are called support vectors.
- The support vectors are those data points that lie closest to the decision surface and are therefore the most difficult to classify.
- Support vectors have a direct bearing on the optimum location of the decision surface.
19 Optimal Hyperplane
- Consider a support vector x^(s) for which d^(s) = +1. Then
  g(x^(s)) = w_o^T x^(s) + b_o = +1
- From (6.5), the algebraic distance from the support vector x^(s) to the optimal hyperplane is
  r = g(x^(s)) / ||w_o|| = +1/||w_o|| if d^(s) = +1, and -1/||w_o|| if d^(s) = -1        (6.8)
20 Optimal Hyperplane
- Let ρ denote the optimal value of the margin of separation between the two classes that constitute the training set F. From (6.8),
  ρ = 2r = 2/||w_o||
- Therefore, maximizing the margin of separation ρ is equivalent to minimizing the Euclidean norm of the weight vector w.
21 Alternative View of the Margin ρ
Figure: the support vectors and the margin of separation ρ.
22 Primal Problem
- The primal problem:
- Given the training sample F = {(x_i, d_i)}, i = 1, ..., N, find the optimal values of the weight vector w and bias b such that they satisfy the constraints
  d_i (w^T x_i + b) ≥ 1 for i = 1, 2, ..., N
- and the weight vector w minimizes the cost function
  Φ(w) = (1/2) w^T w
23 Primal Problem (cont.)
- The scaling factor 1/2 is included for convenience of presentation.
- We may solve the constrained optimization problem using the method of Lagrange multipliers.
24 Constrained Optimization Problem
Objective (cost) function: minimize f(x)
Equality constraints: h(x) = 0, where h = (h_1, ..., h_m)
We assume that the function h is continuously differentiable, that is, h ∈ C^1.
25 Two-Dimensional Case
maximize (or minimize) f(x, y)
subject to g(x, y) = 0
Figure: at a constrained extremum, a level curve of f is tangent to the constraint curve g(x, y) = 0.
26 Lagrange Multiplier
- Hence, the two gradients are proportional to each other, ∇f = λ∇g, where λ is some number whose value we do not yet know. This number λ is called a Lagrange multiplier.
- Rearranging, we see that
  ∇f - λ∇g = 0 = ∇F,
- where F(x, y) = f(x, y) - λ g(x, y).
- F(x, y) is constructed as above; we then set all partial derivatives of F to zero and solve the resulting set of simultaneous equations.
27 Example
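The worked example and its solution on this slide were shown as equations that did not survive transcription, so here is a minimal illustrative instance of the method (my own choice of f and g, not necessarily the slide's original example): maximize f(x, y) = xy subject to g(x, y) = x + y - 1 = 0. Form F(x, y) = xy - λ(x + y - 1). Setting the partial derivatives to zero gives ∂F/∂x = y - λ = 0 and ∂F/∂y = x - λ = 0, together with the constraint x + y - 1 = 0, so x = y = 1/2, λ = 1/2, and the constrained maximum is f = 1/4.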
28 Remarks
- The Lagrange condition is only necessary, not sufficient.
- That is, a solution x satisfying the above Lagrange condition need not be an extremizer.
- Only when certain additional conditions are satisfied can we say that x is an extremizer.
29 Primal Problem
- The cost function Φ(w) is a convex function of w.
- The constraints are linear in w.
30 Lagrange Function
- Construct the Lagrange function (or Lagrangian)
  J(w, b, α) = (1/2) w^T w - Σ_i α_i [d_i (w^T x_i + b) - 1]
- where the α_i are Lagrange multipliers.
Formally, we have to minimize J(w, b, α) with respect to w and b and maximize it with respect to α.
The patterns x_i for which α_i > 0 are called support vectors.
31 Condition 1
∂J(w, b, α)/∂w = 0  =>  w = Σ_i α_i d_i x_i
32 Condition 2
∂J(w, b, α)/∂b = 0  =>  Σ_i α_i d_i = 0
33 Optimal Hyperplane for Linearly Separable Patterns
- Two conditions of optimality:
- Condition 1: w = Σ_i α_i d_i x_i        (6.12)
- Condition 2: Σ_i α_i d_i = 0        (6.13)
34 Dual Problem
- If the primal problem has an optimal solution, the dual problem also has an optimal solution, and the corresponding optimal values are equal.
- In order for w_o to be an optimal primal solution and α_o to be an optimal dual solution,
- it is necessary and sufficient that w_o is feasible for the primal problem, and
  Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b, α_o)
35 Expanding the Lagrangian term by term,
  J(w, b, α) = (1/2) w^T w - Σ_i α_i d_i w^T x_i - b Σ_i α_i d_i + Σ_i α_i        (6.15)
36 By condition 2 the third term vanishes (= 0), and substituting condition 1 gives the dual objective
  Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j d_i d_j x_i^T x_j        (6.16)
37 Dual Problem
- The dual problem:
- Given the training sample {(x_i, d_i)}, i = 1, ..., N, find the Lagrange multipliers {α_i}, i = 1, ..., N, that maximize the objective function
  Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j d_i d_j x_i^T x_j
- subject to the constraints
- (1) Σ_i α_i d_i = 0
- (2) α_i ≥ 0 for i = 1, 2, ..., N
The dual problem is cast entirely in terms of the training data.
The objective Q(α) to be maximized depends on the input patterns only in the form of dot products; a small numerical sketch follows.
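As an illustration of this dual formulation, here is a minimal Python sketch (not from the original slides) that maximizes Q(α) with a general-purpose solver on a small assumed toy data set and then recovers the optimum weight vector and bias from the support vectors:

import numpy as np
from scipy.optimize import minimize

# Toy data (assumed for illustration): two linearly separable classes in 2-D.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
d = np.array([1, 1, 1, -1, -1, -1], dtype=float)

K = X @ X.T                          # Gram matrix of dot products x_i^T x_j

def neg_Q(alpha):                    # negative dual objective (we minimize it)
    v = alpha * d
    return -(alpha.sum() - 0.5 * v @ K @ v)

res = minimize(neg_Q, np.zeros(len(d)), method="SLSQP",
               bounds=[(0.0, None)] * len(d),                          # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ d}])   # sum_i alpha_i d_i = 0
alpha = res.x

w = (alpha * d) @ X                  # optimum weight vector: sum_i alpha_i d_i x_i
sv = alpha > 1e-6                    # support vectors have alpha_i > 0
b = np.mean(d[sv] - X[sv] @ w)       # optimum bias computed from the support vectors
print(np.round(alpha, 3), w, b)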
38 Dual Problem (cont.)
39 Optimal Hyperplane for Linearly Separable Patterns
Step 2: Compute the optimum weight vector
  w_o = Σ_i α_{o,i} d_i x_i
Step 3: Compute the optimum bias
  b_o = 1 - w_o^T x^(s), for a support vector x^(s) with d^(s) = +1
40 Nonseparable Patterns
Figure: data points that violate the margin, one still correctly classified and one misclassified.
41 Nonseparable Patterns (cont.)
- It is not possible to construct a separating hyperplane without encountering classification errors.
- The margin of separation between classes is said to be soft if a data point (x_i, d_i) violates the condition
  d_i (w^T x_i + b) ≥ +1, i = 1, ..., N
42 Soft Margin Hyperplane
- To allow for such violations, so-called slack variables ξ_i ≥ 0 are introduced.
- The relaxed separation constraint becomes
  d_i (w^T x_i + b) ≥ 1 - ξ_i, i = 1, ..., N
43 Soft Margin Hyperplane
- Our goal is to find a separating hyperplane for which the misclassification error, averaged over the training set, is minimized.
44 Soft Margin Hyperplane
- Unfortunately, minimization of the number of margin violations with respect to w is a nonconvex optimization problem that is NP-complete.
- To make the optimization problem mathematically tractable, we approximate that functional by writing
  Φ(ξ) = Σ_i ξ_i
45 Primal Problem
- We reformulate the primal optimization problem as follows.
minimize  Φ(w, ξ) = (1/2) w^T w + C Σ_i ξ_i
subject to  d_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0, for i = 1, ..., N
46 Soft Margin Hyperplane
- The parameter C controls the trade-off between the complexity of the machine and the number of nonseparable points.
- It may be viewed as a form of "regularization" parameter.
- The parameter C is determined experimentally via the standard use of a training/test set.
- Generally speaking, a technique known as cross-validation can be used to verify performance using only the training set, as sketched below.
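A minimal sketch of choosing C by cross-validation on the training set alone, using scikit-learn's soft-margin SVM (the library, the candidate values of C, and the toy data are my own choices, not from the slides):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Assumed noisy, overlapping 2-D classes, so a soft margin is needed.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
d = np.array([-1] * 50 + [+1] * 50)

# 5-fold cross-validation over candidate values of the regularization parameter C.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, d)
print(search.best_params_, round(search.best_score_, 3))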
47 Dual Problem
The dual problem has the same objective Q(α) as before; the only change is that the constraint α_i ≥ 0 is replaced by 0 ≤ α_i ≤ C.
48 Outline
- Introduction
- Optimal hyperplane for linearly separable patterns
- How to build a support vector machine for pattern recognition
- Example: XOR problem
49 How to Build a Support Vector Machine for Pattern Recognition
- The idea of an SVM hinges on two steps:
- (1) Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both the input and the output.
- (2) Construction of an optimal hyperplane for separating the features discovered in step 1.
50 Non-linear
51 How to Build a Support Vector Machine for Pattern Recognition
53 Nonlinear Support Vector Machine
Figure: separability of the patterns in the original space vs. in the feature space.
54 Kernel Function
- From now on, we can compute the inner product between the projections of two points into the feature space without explicitly evaluating their coordinates:
  K(x, x') = φ(x)^T φ(x')
- The feature space is not uniquely determined by the kernel function; different feature maps φ can induce the same kernel K.
55 Kernel Trick
- Expressing everything in terms of inner products in the feature space, and using the kernel function to efficiently compute these inner products, is the kernel trick.
Figure: mapping from the original space to the feature space.
- Since everything is done with the kernel function, it is not even necessary to know the feature space or the inner product within it; a small numerical check follows.
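A quick numerical check of the kernel trick (my own example, assuming the polynomial kernel K(x, y) = (1 + x^T y)^2 and one of its possible feature maps):

import numpy as np

def K(x, y):                      # kernel evaluated directly in the original space
    return (1.0 + x @ y) ** 2

def phi(x):                       # one explicit feature map inducing this kernel (not unique)
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

x, y = np.array([0.3, -1.2]), np.array([2.0, 1.0])
print(K(x, y), phi(x) @ phi(y))   # both are 0.16: the inner product without coordinates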
56 Example
57 Example (cont.)
58 Mercer's Theorem
59 Mercer's Theorem
- Mercer's theorem provides a coordinate-basis representation of the kernel.
- Mercer's theorem guarantees the existence of the kernel trick (see the check below).
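In finite-sample terms, Mercer's condition means that the Gram matrix built from an admissible kernel on any set of points is symmetric positive semidefinite. A small numerical check (assuming an RBF kernel; not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                              # 20 arbitrary points in R^3
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
G = np.exp(-sq / 2.0)                                     # RBF Gram matrix
print(np.linalg.eigvalsh(G).min() >= -1e-10)              # True: no negative eigenvalues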
60 Some Kernels
Common choices include the linear kernel x^T x_i, the polynomial kernel (x^T x_i + 1)^p, the RBF kernel exp(-||x - x_i||^2 / (2σ^2)), and the MLP (sigmoid) kernel tanh(β_0 x^T x_i + β_1).
61 Decision Function in Feature Space
Since the optimum weight vector is
  w_o = Σ_i α_{o,i} d_i φ(x_i),
we can rewrite the decision function as
  y(x) = sign( Σ_i α_{o,i} d_i K(x, x_i) + b_o )
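This kernel form translates directly into code; a minimal sketch (parameter names are my own) that assumes the support vectors, their desired outputs, the multipliers, the bias, and a kernel function are already available:

import numpy as np

def decision(x, support_vectors, d, alpha, b, kernel):
    # y(x) = sign( sum_i alpha_i d_i K(x, x_i) + b )
    s = sum(a * t * kernel(x, xi) for a, t, xi in zip(alpha, d, support_vectors))
    return np.sign(s + b)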
62 Remarks
- In the radial-basis function (RBF) type of support vector machine, the number of radial-basis functions and their centers are determined automatically by the number of support vectors and their values, respectively.
- In the two-layer perceptron type of support vector machine, the number of hidden neurons and their weight vectors are determined automatically by the number of support vectors and their values, respectively.
63 Example
64 Architecture of Support Vector Machine
Figure: an input layer of size m_0 (input vector x = (x_1, ..., x_{m_0})), a hidden layer of m_1 inner-product kernels K(x, x_1), K(x, x_2), ..., K(x, x_{m_1}), and a linear output neuron y with bias b.
65 Outline
- Introduction
- Optimal hyperplane for linearly separable patterns
- How to build a support vector machine for pattern recognition
- Example: XOR problem
66 Example: XOR Problem
Let the inner-product kernel be (Cherkassky and Mulier, 1998)
  K(x, x_i) = (1 + x^T x_i)^2
with x = (x_1, x_2) and x_i = (x_{i1}, x_{i2}).
67 - Expressing the inner-product kernel in terms of monomials,
  K(x, x_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}
- The image of the input vector x induced in the feature space is
  φ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2]^T
- Similarly,
  φ(x_i) = [1, x_{i1}^2, √2 x_{i1} x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}]^T
68 To Build the Kernel Matrix
K_{ij} = K(x_i, x_j) = φ(x_i)^T φ(x_j), for the four XOR input vectors x_1 = (-1,-1), x_2 = (-1,+1), x_3 = (+1,-1), x_4 = (+1,+1) with desired outputs d = (-1, +1, +1, -1).
69 - Then we can write the kernel matrix as follows:
  K = [ 9 1 1 1 ; 1 9 1 1 ; 1 1 9 1 ; 1 1 1 9 ]
70 The Objective Function
Q(α) = α_1 + α_2 + α_3 + α_4 - (1/2) Σ_i Σ_j α_i α_j d_i d_j K(x_i, x_j)
     = α_1 + α_2 + α_3 + α_4 - (1/2)(9α_1^2 + 9α_2^2 + 9α_3^2 + 9α_4^2 - 2α_1α_2 - 2α_1α_3 + 2α_1α_4 + 2α_2α_3 - 2α_2α_4 - 2α_3α_4)
71 - Optimizing Q(α) with respect to the Lagrange multipliers yields the set of simultaneous equations ∂Q/∂α_i = 0, i = 1, ..., 4.
- Hence, the optimum values of the Lagrange multipliers are
  α_{o,1} = α_{o,2} = α_{o,3} = α_{o,4} = 1/8
- All the Lagrange multipliers are bigger than 0, and therefore all four input vectors {x_i}, i = 1, ..., 4, are support vectors.
72 - We find the optimum weight vector
  w_o = Σ_i α_{o,i} d_i φ(x_i) = [0, 0, -1/√2, 0, 0, 0]^T
73 The resulting decision function is y = w_o^T φ(x) = -x_1 x_2.
74 y = -x_1 x_2
Figure: decision boundary for the XOR problem.
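The XOR result above can be verified numerically; a short sketch (my own code, following the slides' kernel and feature map):

import numpy as np

X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)  # XOR inputs
d = np.array([-1, +1, +1, -1], dtype=float)                          # desired outputs

K = (1.0 + X @ X.T) ** 2          # kernel matrix: 9 on the diagonal, 1 elsewhere

# Stationarity of Q(alpha): (d_i d_j K_ij) alpha = 1; the resulting alpha also
# satisfies the equality constraint sum_i alpha_i d_i = 0.
alpha = np.linalg.solve(np.outer(d, d) * K, np.ones(4))
print(alpha)                      # [0.125 0.125 0.125 0.125]

def phi(x):                       # feature map induced by (1 + x^T x_i)^2
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

w = sum(a * t * phi(x) for a, t, x in zip(alpha, d, X))
print(np.round(w, 4))             # [0 0 -0.7071 0 0 0]  =>  y = -x1 * x2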