Title: Outline
1 Outline
- Introduction
- Optimal hyperplane for linearly separable patterns
- How to build a support vector machine for pattern recognition
- Example: XOR problem
2 Multilayer Perceptron (MLP) Properties
- Universal approximation of continuous nonlinear functions.
- Learning from input-output patterns, either off-line or on-line.
- Parallel network architecture with multiple inputs and outputs.
- Used in feedforward and recurrent networks.
- Used in supervised and unsupervised learning.
- Many applications.
3 Problems of MLP
- Existence of many local minima!
- How many neurons are needed for a given task?
4 Support Vector Machines (SVM)
- Nonlinear classification and function estimation by convex optimization, with a unique solution and primal-dual interpretations.
- The number of neurons follows automatically from a convex program.
- Learning and generalization in high-dimensional input spaces (able to avoid the curse of dimensionality!).
- Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, ...).
5 Support Vector Machines (SVM)
6 Outline
- Introduction
- Optimal hyperplane for linearly separable patterns
- How to build a support vector machine for pattern recognition
- Example: XOR problem
7 Linearly Separable Patterns
- Consider the training samples {(x_i, d_i)}, i = 1, ..., N,
- where x_i is the input pattern for the ith example and
- d_i is the corresponding desired output.
- To begin with, we assume that the patterns represented by the subset d_i = +1 and the patterns represented by the subset d_i = -1 are linearly separable.
8 Hyperplane
- The equation of a decision surface in the form of a hyperplane that does the separation is
  w^T x + b = 0
- where x is an input vector, w is an adjustable weight vector, and b is a bias.
- We may write
  w^T x_i + b ≥ 0 for d_i = +1
  w^T x_i + b < 0 for d_i = -1        (6.2)
- For a given weight vector w and bias b, the separation between the hyperplane and the closest data point is called the margin of separation, denoted by ρ.
9 Optimal Separating Hyperplanes
Figure: suboptimal (dashed) and optimal (bold) separating hyperplanes.
10 Margin of Separation
- The goal of a support vector machine is to find the particular hyperplane for which the margin of separation ρ is maximized.
11 Optimal Hyperplane
- Let w_o and b_o denote the optimum values of the weight vector and bias, respectively.
- The optimal hyperplane is defined by
  w_o^T x + b_o = 0        (6.3)
- The discriminant function
  g(x) = w_o^T x + b_o        (6.4)
- gives an algebraic measure of the distance from x to the optimal hyperplane.
12 Figure for Optimal Hyperplane
Figure: a point x and its normal projection x_p onto the optimal hyperplane.
13 Optimal Hyperplane
- The easiest way to see this is to express x as
  x = x_p + r (w_o / ||w_o||)
- where x_p is the normal projection of x onto the optimal hyperplane, and r is the desired algebraic distance.
- r is positive if x is on the positive side of the optimal hyperplane.
- r is negative if x is on the negative side.
14 Optimal Hyperplane
- Since, by definition, g(x_p) = 0, it follows that
  g(x) = w_o^T x + b_o = r ||w_o||
- or
  r = g(x) / ||w_o||        (6.5)
15 Optimal Hyperplane
- The distance from the origin (i.e., x = 0) to the optimal hyperplane is given by b_o / ||w_o||, as the small example below illustrates.
- If b_o > 0, the origin is on the positive side of the optimal hyperplane.
- If b_o < 0, it is on the negative side.
- If b_o = 0, the optimal hyperplane passes through the origin.
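As a small numeric illustration (numbers of my own choosing, not from the slide): with w_o = (3, 4)^T and b_o = 5, the distance from the origin to the optimal hyperplane is b_o / ||w_o|| = 5/5 = 1, and since b_o > 0 the origin lies on the positive side.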
16 Optimal Hyperplane
- Given the training set {(x_i, d_i)}.
- The pair (w_o, b_o) must satisfy the constraint
  w_o^T x_i + b_o ≥ +1 for d_i = +1
  w_o^T x_i + b_o ≤ -1 for d_i = -1        (6.6)
- If (6.2) holds, i.e. the patterns are linearly separable, we can rescale w_o and b_o such that (6.6) holds; this scaling operation leaves Eq. (6.3) unaffected.
17 Linearly Separable Patterns (Rescaled)
Figure: the linearly separable patterns after the rescaling that makes (6.6) hold.
18 Optimal Hyperplane
- The particular data points (x_i, d_i) for which the first or second line of (6.6) is satisfied with the equality sign are called support vectors.
- The support vectors are those data points that lie closest to the decision surface and are therefore the most difficult to classify.
- Support vectors have a direct bearing on the optimum location of the decision surface.
19 Optimal Hyperplane
- Consider a support vector x^(s) for which d^(s) = +1. Then
  g(x^(s)) = w_o^T x^(s) + b_o = +1
- From (6.5), the algebraic distance from the support vector x^(s) to the optimal hyperplane is
  r = g(x^(s)) / ||w_o|| = +1/||w_o|| if d^(s) = +1, and -1/||w_o|| if d^(s) = -1        (6.8)
20 Optimal Hyperplane
- Let ρ denote the optimal value of the margin of separation between the two classes that constitute the training set F. From (6.8),
  ρ = 2r = 2/||w_o||
- Therefore, maximizing the margin of separation ρ is equivalent to minimizing the Euclidean norm of the weight vector w.
21 Alternative View of the Margin ρ
Figure: the support vectors and the margin of separation ρ.
22 Primal Problem
- The primal problem:
- Given the training sample F = {(x_i, d_i)}, i = 1, ..., N, find the optimal values of the weight vector w and bias b such that they satisfy the constraints
  d_i (w^T x_i + b) ≥ 1 for i = 1, 2, ..., N
- and the weight vector w minimizes the cost function
  Φ(w) = (1/2) w^T w
23 Primal Problem (cont.)
- The scaling factor 1/2 is included for convenience of presentation.
- We may solve the constrained optimization problem using the method of Lagrange multipliers.
24 Constrained Optimization Problem
Objective (cost) function: minimize f(x)
Equality constraints: h(x) = 0, where h = (h_1, ..., h_m)
We assume that the function h is continuously differentiable, that is, h ∈ C^1.
25 Two-Dimensional Case
maximize (or minimize) f(x, y)
subject to g(x, y) = 0
Figure: at a constrained extremum, a level curve of f is tangent to the constraint curve g(x, y) = 0.
26 Lagrange Multiplier
- Hence, the two gradients are proportional to each other, ∇f = λ∇g, where λ is some number whose value we do not yet know. This number λ is called a Lagrange multiplier.
- Rearranging, we see that
  ∇f - λ∇g = 0 = ∇F,
- where F(x, y) = f(x, y) - λ g(x, y).
- F(x, y) is constructed as above; we then set all partial derivatives of F to zero and solve the resulting set of simultaneous equations.
27 Example
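The worked example and its solution on this slide were shown as equations that did not survive transcription, so here is a minimal illustrative instance of the method (my own choice of f and g, not necessarily the slide's original example): maximize f(x, y) = xy subject to g(x, y) = x + y - 1 = 0. Form F(x, y) = xy - λ(x + y - 1). Setting the partial derivatives to zero gives ∂F/∂x = y - λ = 0 and ∂F/∂y = x - λ = 0, together with the constraint x + y - 1 = 0, so x = y = 1/2, λ = 1/2, and the constrained maximum is f = 1/4.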
28 Remarks
- The Lagrange condition is only necessary, not sufficient.
- That is, a solution x satisfying the above Lagrange condition need not be an extremizer.
- Only when certain additional conditions are satisfied can we say that x is an extremizer.
29 Primal Problem
- The cost function Φ(w) is a convex function of w.
- The constraints are linear in w.
30 Lagrange Function
- Construct the Lagrange function (or Lagrangian)
  J(w, b, α) = (1/2) w^T w - Σ_i α_i [d_i (w^T x_i + b) - 1]
- where the α_i are Lagrange multipliers.
Formally, we have to minimize J(w, b, α) with respect to w and b and maximize it with respect to α.
The patterns x_i for which α_i > 0 are called support vectors.
31 Condition 1
∂J(w, b, α)/∂w = 0  =>  w = Σ_i α_i d_i x_i
32 Condition 2
∂J(w, b, α)/∂b = 0  =>  Σ_i α_i d_i = 0
33 Optimal Hyperplane for Linearly Separable Patterns
- Two conditions of optimality:
- Condition 1: w = Σ_i α_i d_i x_i        (6.12)
- Condition 2: Σ_i α_i d_i = 0        (6.13)
34 Dual Problem
- If the primal problem has an optimal solution, the dual problem also has an optimal solution, and the corresponding optimal values are equal.
- In order for w_o to be an optimal primal solution and α_o to be an optimal dual solution,
- it is necessary and sufficient that w_o is feasible for the primal problem, and
  Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b, α_o)
35 Expanding the Lagrangian term by term,
  J(w, b, α) = (1/2) w^T w - Σ_i α_i d_i w^T x_i - b Σ_i α_i d_i + Σ_i α_i        (6.15)
36 By condition 2 the third term vanishes (= 0), and substituting condition 1 gives the dual objective
  Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j d_i d_j x_i^T x_j        (6.16)
37 Dual Problem
- The dual problem:
- Given the training sample {(x_i, d_i)}, i = 1, ..., N, find the Lagrange multipliers {α_i}, i = 1, ..., N, that maximize the objective function
  Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j d_i d_j x_i^T x_j
- subject to the constraints
- (1) Σ_i α_i d_i = 0
- (2) α_i ≥ 0 for i = 1, 2, ..., N
The dual problem is cast entirely in terms of the training data.
The objective Q(α) to be maximized depends on the input patterns only in the form of dot products; a small numerical sketch follows.
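As an illustration of this dual formulation, here is a minimal Python sketch (not from the original slides) that maximizes Q(α) with a general-purpose solver on a small assumed toy data set and then recovers the optimum weight vector and bias from the support vectors:

import numpy as np
from scipy.optimize import minimize

# Toy data (assumed for illustration): two linearly separable classes in 2-D.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
d = np.array([1, 1, 1, -1, -1, -1], dtype=float)

K = X @ X.T                          # Gram matrix of dot products x_i^T x_j

def neg_Q(alpha):                    # negative dual objective (we minimize it)
    v = alpha * d
    return -(alpha.sum() - 0.5 * v @ K @ v)

res = minimize(neg_Q, np.zeros(len(d)), method="SLSQP",
               bounds=[(0.0, None)] * len(d),                          # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ d}])   # sum_i alpha_i d_i = 0
alpha = res.x

w = (alpha * d) @ X                  # optimum weight vector: sum_i alpha_i d_i x_i
sv = alpha > 1e-6                    # support vectors have alpha_i > 0
b = np.mean(d[sv] - X[sv] @ w)       # optimum bias computed from the support vectors
print(np.round(alpha, 3), w, b)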
38 Dual Problem (cont.)
39 Optimal Hyperplane for Linearly Separable Patterns
Step 2: Compute the optimum weight vector
  w_o = Σ_i α_{o,i} d_i x_i
Step 3: Compute the optimum bias
  b_o = 1 - w_o^T x^(s), for a support vector x^(s) with d^(s) = +1
40 Nonseparable Patterns
Figure: data points that violate the margin, one still correctly classified and one misclassified.
41 Nonseparable Patterns (cont.)
- It is not possible to construct a separating hyperplane without encountering classification errors.
- The margin of separation between classes is said to be soft if a data point (x_i, d_i) violates the condition
  d_i (w^T x_i + b) ≥ +1, i = 1, ..., N
42 Soft Margin Hyperplane
- To allow for such violations, so-called slack variables ξ_i ≥ 0 are introduced.
- The relaxed separation constraint becomes
  d_i (w^T x_i + b) ≥ 1 - ξ_i, i = 1, ..., N
43 Soft Margin Hyperplane
- Our goal is to find a separating hyperplane for which the misclassification error, averaged over the training set, is minimized.
44 Soft Margin Hyperplane
- Unfortunately, minimization of the number of margin violations with respect to w is a nonconvex optimization problem that is NP-complete.
- To make the optimization problem mathematically tractable, we approximate that functional by writing
  Φ(ξ) = Σ_i ξ_i
45 Primal Problem
- We reformulate the primal optimization problem as follows.
minimize  Φ(w, ξ) = (1/2) w^T w + C Σ_i ξ_i
subject to  d_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0, for i = 1, ..., N
46 Soft Margin Hyperplane
- The parameter C controls the trade-off between the complexity of the machine and the number of nonseparable points.
- It may be viewed as a form of "regularization" parameter.
- The parameter C is determined experimentally via the standard use of a training/test set.
- Generally speaking, a technique known as cross-validation can be used to verify performance using only the training set, as sketched below.
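A minimal sketch of choosing C by cross-validation on the training set alone, using scikit-learn's soft-margin SVM (the library, the candidate values of C, and the toy data are my own choices, not from the slides):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Assumed noisy, overlapping 2-D classes, so a soft margin is needed.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
d = np.array([-1] * 50 + [+1] * 50)

# 5-fold cross-validation over candidate values of the regularization parameter C.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, d)
print(search.best_params_, round(search.best_score_, 3))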
47 Dual Problem
The dual problem has the same objective Q(α) as before; the only change is that the constraint α_i ≥ 0 is replaced by 0 ≤ α_i ≤ C.
48 Outline
- Introduction
- Optimal hyperplane for linearly separable patterns
- How to build a support vector machine for pattern recognition
- Example: XOR problem
49 How to Build a Support Vector Machine for Pattern Recognition
- The idea of an SVM hinges on two steps:
- (1) Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both the input and the output.
- (2) Construction of an optimal hyperplane for separating the features discovered in step 1.
50 Non-linear
51 How to Build a Support Vector Machine for Pattern Recognition
53 Nonlinear Support Vector Machine
Figure: separability of the patterns in the original space vs. in the feature space.
54 Kernel Function
- From now on, we can compute the inner product between the projections of two points into the feature space without explicitly evaluating their coordinates:
  K(x, x') = φ(x)^T φ(x')
- The feature space is not uniquely determined by the kernel function; different feature maps φ can induce the same kernel K.
55 Kernel Trick
- Expressing everything in terms of inner products in the feature space, and using the kernel function to efficiently compute these inner products, is the kernel trick.
Figure: mapping from the original space to the feature space.
- Since everything is done with the kernel function, it is not even necessary to know the feature space or the inner product within it; a small numerical check follows.
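A quick numerical check of the kernel trick (my own example, assuming the polynomial kernel K(x, y) = (1 + x^T y)^2 and one of its possible feature maps):

import numpy as np

def K(x, y):                      # kernel evaluated directly in the original space
    return (1.0 + x @ y) ** 2

def phi(x):                       # one explicit feature map inducing this kernel (not unique)
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

x, y = np.array([0.3, -1.2]), np.array([2.0, 1.0])
print(K(x, y), phi(x) @ phi(y))   # both are 0.16: the inner product without coordinates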
56 Example
57 Example (cont.)
58 Mercer's Theorem
59 Mercer's Theorem
- Mercer's theorem provides a coordinate-basis representation of the kernel.
- Mercer's theorem guarantees the existence of the kernel trick (see the check below).
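In finite-sample terms, Mercer's condition means that the Gram matrix built from an admissible kernel on any set of points is symmetric positive semidefinite. A small numerical check (assuming an RBF kernel; not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                              # 20 arbitrary points in R^3
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
G = np.exp(-sq / 2.0)                                     # RBF Gram matrix
print(np.linalg.eigvalsh(G).min() >= -1e-10)              # True: no negative eigenvalues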
60 Some Kernels
Common choices include the linear kernel x^T x_i, the polynomial kernel (x^T x_i + 1)^p, the RBF kernel exp(-||x - x_i||^2 / (2σ^2)), and the MLP (sigmoid) kernel tanh(β_0 x^T x_i + β_1).
61 Decision Function in Feature Space
Since the optimum weight vector is
  w_o = Σ_i α_{o,i} d_i φ(x_i),
we can rewrite the decision function as
  y(x) = sign( Σ_i α_{o,i} d_i K(x, x_i) + b_o )
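This kernel form translates directly into code; a minimal sketch (parameter names are my own) that assumes the support vectors, their desired outputs, the multipliers, the bias, and a kernel function are already available:

import numpy as np

def decision(x, support_vectors, d, alpha, b, kernel):
    # y(x) = sign( sum_i alpha_i d_i K(x, x_i) + b )
    s = sum(a * t * kernel(x, xi) for a, t, xi in zip(alpha, d, support_vectors))
    return np.sign(s + b)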
62 Remarks
- In the radial-basis function (RBF) type of support vector machine, the number of radial-basis functions and their centers are determined automatically by the number of support vectors and their values, respectively.
- In the two-layer perceptron type of support vector machine, the number of hidden neurons and their weight vectors are determined automatically by the number of support vectors and their values, respectively.
63 Example
64 Architecture of Support Vector Machine
Figure: an input layer of size m_0 (input vector x = (x_1, ..., x_{m_0})), a hidden layer of m_1 inner-product kernels K(x, x_1), K(x, x_2), ..., K(x, x_{m_1}), and a linear output neuron y with bias b.
65 Outline
- Introduction
- Optimal hyperplane for linearly separable patterns
- How to build a support vector machine for pattern recognition
- Example: XOR problem
66 Example: XOR Problem
Let the inner-product kernel be (Cherkassky and Mulier, 1998)
  K(x, x_i) = (1 + x^T x_i)^2
with x = (x_1, x_2) and x_i = (x_{i1}, x_{i2}).
67 - Expressing the inner-product kernel in terms of monomials,
  K(x, x_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}
- The image of the input vector x induced in the feature space is
  φ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2]^T
- Similarly,
  φ(x_i) = [1, x_{i1}^2, √2 x_{i1} x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}]^T
68 To Build the Kernel Matrix
K_{ij} = K(x_i, x_j) = φ(x_i)^T φ(x_j), for the four XOR input vectors x_1 = (-1,-1), x_2 = (-1,+1), x_3 = (+1,-1), x_4 = (+1,+1) with desired outputs d = (-1, +1, +1, -1).
69 - Then we can write the kernel matrix as follows:
  K = [ 9 1 1 1 ; 1 9 1 1 ; 1 1 9 1 ; 1 1 1 9 ]
70 The Objective Function
Q(α) = α_1 + α_2 + α_3 + α_4 - (1/2) Σ_i Σ_j α_i α_j d_i d_j K(x_i, x_j)
     = α_1 + α_2 + α_3 + α_4 - (1/2)(9α_1^2 + 9α_2^2 + 9α_3^2 + 9α_4^2 - 2α_1α_2 - 2α_1α_3 + 2α_1α_4 + 2α_2α_3 - 2α_2α_4 - 2α_3α_4)
71 - Optimizing Q(α) with respect to the Lagrange multipliers yields the set of simultaneous equations ∂Q/∂α_i = 0, i = 1, ..., 4.
- Hence, the optimum values of the Lagrange multipliers are
  α_{o,1} = α_{o,2} = α_{o,3} = α_{o,4} = 1/8
- All the Lagrange multipliers are bigger than 0, and therefore all four input vectors {x_i}, i = 1, ..., 4, are support vectors.
72 - We find the optimum weight vector
  w_o = Σ_i α_{o,i} d_i φ(x_i) = [0, 0, -1/√2, 0, 0, 0]^T
73 The resulting decision function is y = w_o^T φ(x) = -x_1 x_2.
74 y = -x_1 x_2
Figure: decision boundary for the XOR problem.
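The XOR result above can be verified numerically; a short sketch (my own code, following the slides' kernel and feature map):

import numpy as np

X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)  # XOR inputs
d = np.array([-1, +1, +1, -1], dtype=float)                          # desired outputs

K = (1.0 + X @ X.T) ** 2          # kernel matrix: 9 on the diagonal, 1 elsewhere

# Stationarity of Q(alpha): (d_i d_j K_ij) alpha = 1; the resulting alpha also
# satisfies the equality constraint sum_i alpha_i d_i = 0.
alpha = np.linalg.solve(np.outer(d, d) * K, np.ones(4))
print(alpha)                      # [0.125 0.125 0.125 0.125]

def phi(x):                       # feature map induced by (1 + x^T x_i)^2
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

w = sum(a * t * phi(x) for a, t, x in zip(alpha, d, X))
print(np.round(w, 4))             # [0 0 -0.7071 0 0 0]  =>  y = -x1 * x2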