Title: Instructor: Saeed Shiry
1. Support Vector Machines (SVM)
2. Introduction
- SVM is a classification method that, together with the broader family of Kernel Methods, has attracted a great deal of attention in recent years.
- SVM was introduced in 1992 by Vapnik and is founded on statistical learning theory.
- In practice SVM has performed very well, and on many applications it ranks among the most accurate classifiers available.
3. Advantages
- The method has been applied successfully to classification problems in many domains (for example bioinformatics, text categorization, handwritten character recognition, and image processing).
- Good generalization performance.
- Ability to handle high-dimensional data.
- Resistance to the problem of overfitting.
4. Main idea
- If the training data are linearly separable, SVM picks the separating hyperplane with the maximum margin, i.e. the one whose distance to the nearest training examples of either class is as large as possible.
- If the data are not linearly separable, they are first mapped into a higher-dimensional space in which they become linearly separable, and the maximum-margin hyperplane is found there.
5. Definition
- "Support Vector Machines are a system for efficiently training linear learning machines in kernel-induced feature spaces, while respecting the insights of generalisation theory and exploiting optimisation theory."
- Cristianini and Shawe-Taylor (2000)
6. Review of Linear Discrimination
- Suppose each sample is described by an n-dimensional feature vector and a linear decision surface is to be used to separate the two classes.
- Many different hyperplanes can separate the training examples correctly.
- The question is which of these separators generalizes best to data not seen during training.
7. Intuitions
[Figure: a scatter of O and X points from two classes; many different lines could separate them]
8. Intuitions
[Figure: the same O/X data with one candidate separating line drawn]
9. Intuitions
[Figure: the same O/X data with a different candidate separating line]
10. Intuitions
[Figure: the same O/X data with yet another candidate line; all of these separate the training points correctly]
11. A Good Separator
[Figure: the O/X data with a separator that leaves a wide gap to both classes]
12. Noise in the Observations
[Figure: the O/X data with observation noise; each point may actually lie anywhere in a small region around its plotted position]
13. Ruling Out Some Separators
[Figure: candidate lines that pass too close to the data are ruled out once the noise regions are taken into account]
14. Lots of Noise
[Figure: with larger noise regions, only separators that stay far from both classes remain acceptable]
15. Maximizing the Margin
[Figure: the surviving separator is the one that maximizes its distance (the margin) to the nearest points of both classes]
16. The decision boundary
- A separating surface places the examples of one class on one side and the examples of the other class on the other side.
- In an n-dimensional input space the linear separator is a hyperplane.
17. Which decision boundary is best?
- Among the boundaries that classify the training data correctly, prefer the one whose distance (margin) to the closest examples is largest; it is the most robust to small perturbations of the data.
- In an n-dimensional space this boundary is again a hyperplane.
18. Why does SVM look for the maximum margin?
- Intuitively it feels safest.
- If we have made a small error in estimating the position of the boundary, the maximum-margin choice gives us the least chance of causing a misclassification.
- The resulting model depends only on the few examples nearest the boundary, and empirically it generalizes very well.
19. Support vectors
- In this approach the separator is determined not by all of the training examples but only by the examples that lie closest to the decision boundary; the maximum-margin hyperplane depends entirely on these nearby points, which are called the support vectors.
20. Why the maximum margin?
- Intuitively, a boundary far from both classes feels safest.
- Using arguments based on the VC dimension, it can be shown that a large margin leads to good generalization bounds.
- Empirically this method works very well.
21. Support vectors and the margin
- The training points that lie exactly on the margin boundaries determine the separating hyperplane; these points are called the support vectors.
[Figure: Class 1 and Class -1 in the (X1, X2) plane, separated by the maximum-margin hyperplane; the points on the margin, marked SV, are the support vectors]
22. Generalization and SVM
- Unlike many other learning methods, the generalization ability of SVM does not deteriorate simply because the input is high-dimensional.
- Even with very high dimensionality it does not overfit easily; this robustness comes from the margin-maximizing optimization at its core.
- The computational cost is also contained:
- the final classifier depends only on the support vectors, not on the whole training set.
23. A linear classifier for two classes
- Training data:
  - x ∈ R^n
  - y ∈ {-1, 1}
- Linear decision function:
  - f(x) = sign(⟨w, x⟩ + b)
  - w ∈ R^n
  - b ∈ R
- The hyperplane:
  - ⟨w, x⟩ + b = 0
  - w1x1 + w2x2 + … + wnxn + b = 0
- The parameters w and b of the classifier determine the position of the hyperplane.
- The goal of training is to find a hyperplane that separates the two classes.
- Since the decision surface is a hyperplane, such classifiers are called linear classifiers (a small sketch of the decision rule follows below).
- Training amounts to solving an optimization problem over w and b.
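A minimal sketch of this decision rule, with made-up values for w and b (actually learning them is the subject of the following slides):

```python
import numpy as np

# Linear decision rule f(x) = sign(<w, x> + b); w and b are illustrative only.
w = np.array([2.0, -1.0])   # normal vector of the hyperplane
b = -0.5                    # offset

def predict(x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(predict(np.array([1.0, 0.0])))   # <w,x> + b =  1.5 -> +1
print(predict(np.array([0.0, 2.0])))   # <w,x> + b = -2.5 -> -1
```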
24. Linear SVM Mathematically
- Let the training set {(xi, yi)}, i = 1…n, xi ∈ R^d, yi ∈ {-1, 1} be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):
  wᵀxi + b ≤ -ρ/2 if yi = -1
  wᵀxi + b ≥ ρ/2 if yi = +1
  or, combined: yi(wᵀxi + b) ≥ ρ/2
- For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xs and the hyperplane is r = ys(wᵀxs + b)/||w|| = 1/||w||.
- Then the margin can be expressed through the (rescaled) w and b as ρ = 2r = 2/||w||.
25. A linear classifier for two classes
- The hyperplane f(x) = 0 splits the input space into the two regions f(x) > 0 and f(x) < 0.
- The signed distance of a point x from the hyperplane is (⟨w, x⟩ + b) / ||w||.
- The vector w is perpendicular to the separating hyperplane.
[Figure: the hyperplane f(x) = 0 in the (X1, X2) plane with the half-spaces f(x) > 0 and f(x) < 0 on either side; the normal vector w and a sample point x are marked]
26. Specifying the margin with two parallel planes
- Plus-plane: {x : w·x + b = +1}
- Minus-plane: {x : w·x + b = -1}
- Classify as:
  - +1 if w·x + b ≥ 1
  - -1 if w·x + b ≤ -1
27. Computing the margin width
- How wide is the margin, i.e. the distance between the plus-plane and the minus-plane, in terms of w and b?
- Plus-plane: {x : w·x + b = +1}
- Minus-plane: {x : w·x + b = -1}
- The vector w is perpendicular to both planes.
- Let x- be any point on the minus-plane, and let x+ be the point on the plus-plane closest to x-.
28. Computing the margin width
- The segment from x- to x+ is perpendicular to both planes, so it points along the direction of w.
- Therefore we can write:
- x+ = x- + λw, for some value of λ.
29. Computing the margin width
- What we know so far:
  - w·x+ + b = +1
  - w·x- + b = -1
  - x+ = x- + λw
  - |x+ - x-| = M
- From these, M can be computed in terms of w and b.
30. Computing the margin width
- w·x+ + b = +1
- w·x- + b = -1
- x+ = x- + λw
- |x+ - x-| = M
Substituting the third equation into the first:
w·(x- + λw) + b = 1
w·x- + λ w·w + b = 1
-1 + λ w·w = 1
λ = 2 / (w·w)
31. Computing the margin width
M = |x+ - x-| = |λw| = λ √(w·w) = 2/√(w·w) = 2/||w||
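A quick numeric check of this result, assuming an arbitrary illustrative w; the plus- and minus-plane points are constructed exactly as in the derivation above:

```python
import numpy as np

w = np.array([3.0, 4.0])                 # ||w|| = 5, so M should be 2/5 = 0.4
b = 0.0

x_minus = -w / np.dot(w, w)              # a point with w.x + b = -1 (minus-plane)
lam = 2.0 / np.dot(w, w)                 # lambda = 2/(w.w) from the derivation
x_plus = x_minus + lam * w               # the nearest point on the plus-plane

print(np.dot(w, x_plus) + b)             # 1.0: x_plus indeed lies on the plus-plane
print(np.linalg.norm(x_plus - x_minus))  # 0.4
print(2.0 / np.linalg.norm(w))           # 0.4, i.e. M = 2/||w||
```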
32. Constraints
- If we scale w and b so that f takes the values +1 and -1 at the points of the two classes closest to the hyperplane, we obtain the constraints:
  - ⟨w, xi⟩ + b ≥ +1 for yi = +1
  - ⟨w, xi⟩ + b ≤ -1 for yi = -1
- Both conditions can be written compactly as:
  - yi(⟨w, xi⟩ + b) ≥ 1 for all i
33. The optimization problem
- Training an SVM therefore amounts to solving the following optimization problem:
- Given the training examples (xi, yi), i = 1, 2, …, N, with yi ∈ {1, -1}:
  - Minimise ||w||²
  - Subject to yi(⟨w, xi⟩ + b) ≥ 1 for all i
  - Note that ||w||² = wᵀw
- This is a quadratic programming problem with linear inequality constraints; quadratic optimization problems are a well-studied class for which efficient algorithms exist.
34. Quadratic Programming
35. Recap of Constrained Optimization
- Suppose we want to minimize f(x) subject to g(x) = 0.
- A necessary condition for x0 to be a solution:
  ∇f(x0) + α ∇g(x0) = 0
  with α the Lagrange multiplier.
- For multiple constraints gi(x) = 0, i = 1, …, m, we need a Lagrange multiplier αi for each of the constraints:
  ∇f(x0) + Σi αi ∇gi(x0) = 0
36. Recap of Constrained Optimization
- The case of an inequality constraint gi(x) ≤ 0 is similar, except that the Lagrange multiplier αi should be positive.
- If x0 is a solution to the constrained optimization problem min f(x) subject to gi(x) ≤ 0 for i = 1, …, m,
- then there must exist αi ≥ 0 for i = 1, …, m such that x0 satisfies ∇f(x0) + Σi αi ∇gi(x0) = 0.
- The function L(x, α) = f(x) + Σi αi gi(x) is also known as the Lagrangian; we want to set its gradient to 0 (a small worked example follows below).
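As a worked instance of this recap, the stationarity condition can be solved symbolically; the toy objective and constraint below are our own choice, not from the slides:

```python
import sympy as sp

# Minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0.
x, y, a = sp.symbols('x y a', real=True)
f = x**2 + y**2
g = x + y - 1

L = f + a * g                  # the Lagrangian L(x, y, a)
solution = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, a], dict=True)
print(solution)                # [{x: 1/2, y: 1/2, a: -1}]
```

Setting the gradient of L to zero recovers both the optimum (x = y = 1/2) and the multiplier.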
37. Solving via the Lagrangian
- Construct and minimise the Lagrangian:
  L(w, b, α) = (1/2)||w||² - Σi αi [yi(⟨w, xi⟩ + b) - 1], with αi ≥ 0
- Take derivatives w.r.t. w and b and equate them to 0:
  ∂L/∂w = 0 ⇒ w = Σi αi yi xi
  ∂L/∂b = 0 ⇒ Σi αi yi = 0
- The Lagrange multipliers αi are called dual variables; each training point has an associated dual variable.
- The parameter vector w is thus expressed as a linear combination of training points, and only the SVs will have non-zero αi.
38. Solving via the Lagrangian
[Figure: two classes with the dual variable of each training point annotated; only the points on the margin have non-zero values (α1 = 0.8, α6 = 1.4, α8 = 0.6), while all the other αi are 0]
39. The Dual Problem
- If we substitute w = Σi αi yi xi into the Lagrangian L(w, b, α), we have
  L(α) = Σi αi - (1/2) Σi Σj αi αj yi yj xiᵀxj - b Σi αi yi
- Note that Σi αi yi = 0, so the term involving b vanishes.
- This is a function of the αi only.
40. The Dual Problem
- The new objective function is in terms of the αi only.
- It is known as the dual problem: if we know w, we know all the αi; if we know all the αi, we know w.
- The original problem is known as the primal problem.
- The objective function of the dual problem needs to be maximized!
- The dual problem is therefore:
  max W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj xiᵀxj
  subject to αi ≥ 0 (the properties of the αi when we introduced the Lagrange multipliers)
  and Σi αi yi = 0 (the result when we differentiate the original Lagrangian w.r.t. b)
41. The Dual Problem
- This is a quadratic programming (QP) problem: a global maximum over the αi can always be found (a sketch with a generic QP solver follows below).
- w can be recovered by w = Σi αi yi xi.
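A sketch of solving this dual with a generic QP solver; the cvxopt package is one possible choice here, and the four-point toy data set is made up for illustration:

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# maximize sum(a) - 1/2 a^T P a  <=>  minimize 1/2 a^T P a - 1^T a
P = matrix(np.outer(y, y) * (X @ X.T))   # P_ij = y_i y_j <x_i, x_j>
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))                   # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))             # sum_i a_i y_i = 0
b_eq = matrix(0.0)

alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)['x'])
w = (alpha * y) @ X                      # w = sum_i a_i y_i x_i
sv = alpha > 1e-6                        # support vectors have nonzero a_i
b = np.mean(y[sv] - X[sv] @ w)           # from y_s (<w, x_s> + b) = 1
print(alpha.round(4), w.round(4), round(b, 4))
```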
42. Solving via the Lagrangian
- So w = Σi αi yi xi.
- Plugging this back into the Lagrangian yields the dual formulation.
- The resulting dual is solved for α with a QP solver:
  max W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj ⟨xi, xj⟩, subject to αi ≥ 0 and Σi αi yi = 0
- b does not appear in the dual, so it is determined separately from the initial constraints, e.g. from ys(⟨w, xs⟩ + b) = 1 for any support vector xs.
- Data enters only in the form of dot products!
43. Classifying new examples
- Once the parameters (α, b) have been found by solving the quadratic optimization problem on the training data, the SVM can classify examples it has never seen before.
- A new example x is assigned the label
  sign f(x; α, b), where f(x; α, b) = Σi αi yi ⟨xi, x⟩ + b
- Data enters only in the form of dot products!
44. An important property of the solution
- The solution of the SVM, i.e. of the quadratic programming problem with linear inequality constraints, has the nice property that the data enters only in the form of dot products!
- Dot product (notation refresher): given x = (x1, x2, …, xn) and y = (y1, y2, …, yn), the dot product of x and y is x·y = x1y1 + x2y2 + … + xnyn.
- This is nice because it allows us to make SVMs non-linear without complicating the algorithm.
45. The Quadratic Programming Problem
- Many approaches have been proposed: LOQO, CPLEX, etc.
- Most are interior-point methods:
  - Start with an initial solution that may violate the constraints.
  - Improve this solution by optimizing the objective function and/or reducing the amount of constraint violation.
- For SVM, sequential minimal optimization (SMO) seems to be the most popular:
  - A QP with two variables is trivial to solve.
  - Each iteration of SMO picks a pair (αi, αj) and solves the QP with these two variables; repeat until convergence (a toy sketch of this pair update follows below).
- In practice, we can just regard the QP solver as a black box without bothering how it works.
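A heavily simplified sketch of one SMO pair update; Platt's real algorithm adds pair-selection heuristics and a more careful threshold update, so this is illustration, not production code:

```python
import numpy as np

def smo_pair_step(i, j, alpha, b, y, C, K):
    """One analytic update of the pair (alpha_i, alpha_j); K is the Gram matrix."""
    f = lambda k: np.sum(alpha * y * K[:, k]) + b      # current output at x_k
    E_i, E_j = f(i) - y[i], f(j) - y[j]                # prediction errors

    # Box [L, H] for alpha_j implied by 0 <= alpha <= C and sum_k alpha_k y_k = 0
    if y[i] != y[j]:
        L, H = max(0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2 * K[i, j] - K[i, i] - K[j, j]              # curvature along the pair
    if L == H or eta >= 0:
        return alpha, b                                # nothing to optimize here

    a_j = np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)    # keeps sum_k alpha_k y_k = 0

    # Crude threshold update (Platt distinguishes several KKT cases here)
    b = b - E_i - y[i] * (a_i - alpha[i]) * K[i, i] - y[j] * (a_j - alpha[j]) * K[i, j]
    alpha = alpha.copy(); alpha[i], alpha[j] = a_i, a_j
    return alpha, b
```

Sweeping this update over pairs until the KKT conditions hold is the essence of SMO.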
46. Data that are not linearly separable
- The discussion so far assumed that the SVM training data are linearly separable. In practice this assumption frequently fails, for instance because of noise in the data.
47. Using slack variables
- When the data are not separable, no hyperplane satisfies all the constraints, and the previous optimization problem has no solution!
- The idea is to allow some training examples xi to violate the margin: each example receives a slack that measures how far it falls on the wrong side of its margin plane wᵀx + b = ±1.
48. Using slack variables
- For the training examples xi, i = 1, 2, …, N, the constraints so far required
  - yi(⟨w, xi⟩ + b) ≥ 1
- Introducing slack variables ξi relaxes these constraints to
  - yi(⟨w, xi⟩ + b) ≥ 1 - ξi, with ξi ≥ 0
- An example with ξi > 0 may fall inside the margin or even on the wrong side of the boundary; this relaxed formulation is known as the soft margin.
49. Using slack variables (continued)
- To keep the slack variables as small as possible, a penalty term is added to the objective: instead of minimizing ||w||² alone, we minimize ||w||² + C Σi ξi.
- Here C > 0 is a constant; choosing it larger penalizes the slack variables (i.e. margin violations) more heavily.
50. Using slack variables (continued)
- The optimization problem changes accordingly, and a suitable value of C is usually chosen empirically.
- In the dual, the parameter C shows up only as an upper bound on the αi:
  find the αi that maximize
  W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj ⟨xi, xj⟩
  subject to
  0 ≤ αi ≤ C and Σi αi yi = 0
51. Soft Margin Hyperplane
- If we minimize Σi ξi, the ξi can be computed by
  ξi = max(0, 1 - yi(wᵀxi + b))
- The ξi are slack variables in the optimization.
- Note that ξi = 0 if there is no error for xi, and Σi ξi is an upper bound on the number of training errors.
- We want to minimize
  (1/2)||w||² + C Σi ξi
- C: tradeoff parameter between error and margin.
- The optimization problem becomes:
  minimize (1/2)||w||² + C Σi ξi subject to yi(wᵀxi + b) ≥ 1 - ξi and ξi ≥ 0 for all i
52. The Optimization Problem
- The dual of this new constrained optimization problem is
  max W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj xiᵀxj, subject to 0 ≤ αi ≤ C and Σi αi yi = 0
- w is recovered as w = Σi αi yi xi.
- This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the αi.
- Once again, a QP solver can be used to find the αi (a library-based sketch follows below).
53. Handling non-separable data by mapping to a feature space
- Idea: transform the data into a (usually higher-dimensional) feature space in which they become linearly separable.
54. Mapping the data to a feature space
[Figure: the map φ(.) carries points from the input space into a feature space; note that in practice the feature space is of higher dimension than the input space]
- Carrying out the computations in the feature space can be very costly because of its high dimensionality.
- In general this space may have very many (even infinitely many) dimensions.
- The kernel trick is used to get around this problem.
55. Problems with the feature space
- Working directly in a very high-dimensional feature space is not easy.
- Besides the computational cost, mapping into many dimensions exposes the learner to the curse of dimensionality.
56. Overcoming the problems of the feature space
- We will introduce kernels:
  - They solve the computational problem of working with many dimensions.
  - They can even make it possible to use infinitely many dimensions, efficiently in time and space.
  - They have other advantages as well, both practical and conceptual.
57. Kernels
- Transform: x → φ(x).
- The linear algorithm depends only on the dot products x·xi, hence the transformed algorithm depends only on φ(x)·φ(xi).
- Use a kernel function K(xi, xj) such that K(xi, xj) = φ(xi)·φ(xj).
58. An Example for φ(.) and K(.,.)
- Suppose φ(.) is given as follows:
  φ([x1, x2]) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2)
- An inner product in the feature space is then
  ⟨φ([x1, x2]), φ([y1, y2])⟩ = (1 + x1 y1 + x2 y2)²
- So, if we define the kernel function as K(x, y) = (1 + x1 y1 + x2 y2)², there is no need to carry out φ(.) explicitly (this is checked numerically below).
- This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.
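The claim can be checked numerically: the explicit map φ and the kernel K(x, y) = (1 + ⟨x, y⟩)² must produce identical inner products:

```python
import numpy as np

def phi(v):
    """The explicit degree-2 feature map from the slide."""
    v1, v2 = v
    return np.array([1.0, np.sqrt(2)*v1, np.sqrt(2)*v2,
                     v1**2, v2**2, np.sqrt(2)*v1*v2])

def K(u, v):
    return (1.0 + np.dot(u, v)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # 4.0, via the explicit 6-dimensional map
print(K(x, z))                  # 4.0, without ever computing phi
```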
59. Common kernel functions
- Linear: K(x, y) = ⟨x, y⟩
- Polynomial of degree p: K(x, y) = (⟨x, y⟩ + 1)^p
- Gaussian (RBF): K(x, y) = exp(-||x - y||² / 2σ²)
- Sigmoid: K(x, y) = tanh(κ⟨x, y⟩ + θ)
60. Examples of kernels
61. Modification Due to Kernel Function
- Change all inner products to kernel functions.
- For training:
  Original: max W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj ⟨xi, xj⟩
  With kernel function: max W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj K(xi, xj)
62. Modification Due to Kernel Function
- For testing, the new data z is classified as class 1 if f ≥ 0, and as class 2 if f < 0:
  Original: w = Σi αi yi xi, f = ⟨w, z⟩ + b
  With kernel function: f = Σi αi yi K(xi, z) + b (w itself never needs to be formed)
63. Modularity
- Any kernel-based learning algorithm is composed of two modules:
  - A general-purpose learning machine
  - A problem-specific kernel function
- Any kernel-based algorithm can be fitted with any kernel.
- Kernels themselves can be constructed in a modular way.
- Great for software engineering (and for analysis).
64. Constructing kernels
- New kernels can be built out of previously defined kernels:
- If K, K′ are kernels, then:
  - K + K′ is a kernel
  - cK is a kernel, if c > 0
  - aK + bK′ is a kernel, for a, b > 0
  - and so on
- In this way a kernel suited to the problem at hand can be assembled from simpler building blocks (a numeric check follows below).
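A numeric sanity check of these closure rules (the data and the two base kernels are arbitrary choices): sums and positive scalings of valid kernels yield Gram matrices that remain positive semidefinite, i.e. all eigenvalues stay non-negative up to rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

K1 = X @ X.T                                                      # linear kernel
K2 = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # RBF kernel
for name, K in [("K1+K2", K1 + K2), ("3*K1", 3 * K1), ("2*K1+5*K2", 2 * K1 + 5 * K2)]:
    print(name, np.linalg.eigvalsh(K).min() >= -1e-9)             # True each time
```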
65. Example
- Suppose we have five 1-D data points: x1 = 1, x2 = 2, x3 = 4, x4 = 5, x5 = 6, with {1, 2, 6} as class 1 and {4, 5} as class 2, i.e. y1 = 1, y2 = 1, y3 = -1, y4 = -1, y5 = 1.
- We use the polynomial kernel of degree 2: K(x, y) = (xy + 1)².
- C is set to 100.
- We first find the αi (i = 1, …, 5) by maximizing
  W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xi xj + 1)² subject to 0 ≤ αi ≤ 100 and Σi αi yi = 0
66. Example
- By using a QP solver, we get α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833.
- Note that the constraints are indeed satisfied.
- The support vectors are {x2 = 2, x4 = 5, x5 = 6}.
- The discriminant function is
  f(z) = 2.5 (2z + 1)² - 7.333 (5z + 1)² + 4.833 (6z + 1)² + b = 0.6667 z² - 5.333 z + b
- b is recovered by solving f(2) = 1, or by f(5) = -1, or by f(6) = 1, as x2 and x5 lie on the line f(z) = +1 and x4 lies on the line f(z) = -1.
- All three give b = 9 (reproduced in the sketch below).
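The numbers can be reproduced directly from the αi quoted above (rounded as on the slide, so the outputs match to roughly two decimals):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])
b = 9.0

def f(z):
    """Discriminant f(z) = sum_i alpha_i y_i K(x_i, z) + b with K(u, v) = (uv+1)^2."""
    return np.sum(alpha * y * (x * z + 1) ** 2) + b

for z in [1, 2, 4, 5, 6]:
    print(z, round(float(f(z)), 2))   # f > 0 at 1, 2, 6 (class 1); f < 0 at 4, 5
```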
67. Example
[Figure: the value of the discriminant function f(z) plotted over the input axis; f is positive around z = 1, 2 (class 1), negative around z = 4, 5 (class 2), and positive again around z = 6 (class 1)]
68. A look back at the kernel
- What did the kernel buy us in this example?
- Using the kernel corresponds to implicitly mapping the one-dimensional inputs into a higher-dimensional feature space, in which the two classes (which were not separable on the original line) become linearly separable.
69. Steps for using SVM for classification
- Prepare the data matrix.
- Select the kernel function to use.
- Execute the training algorithm, using a QP solver to obtain the αi values.
- Unseen data can then be classified using the αi values and the support vectors (an end-to-end sketch follows below).
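An end-to-end sketch of these four steps, using scikit-learn as the QP black box; the data set and parameter values are only for illustration:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)   # 1. prepare the data matrix
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_tr)
clf = SVC(kernel='rbf', C=1.0, gamma='scale')          # 2. select the kernel
clf.fit(scaler.transform(X_tr), y_tr)                  # 3. train: solve for the alphas
print(clf.score(scaler.transform(X_te), y_te))         # 4. classify unseen data
```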
70. Choosing the kernel function
- Choosing the kernel function is probably the trickiest part of using SVM.
- Many kernels have been designed for specific data types and problems, e.g. the diffusion kernel, Fisher kernel, string kernel, …
- There is still no general recipe for picking the kernel best suited to a given problem.
- However:
  - In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try.
  - Note that SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen by the SVM.
71. SVM applications
- SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
- SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
- SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
- SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
- Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi at a time, e.g. SMO [Platt '99] and [Joachims '99].
- Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.
72. Strengths and weaknesses of SVM
- Strengths:
  - Training is relatively easy.
  - Good generalization in theory and practice.
  - Works well with few training instances.
  - Finds the globally best model: no local optima, unlike neural networks.
  - It scales relatively well to high-dimensional data.
  - The tradeoff between classifier complexity and error can be controlled explicitly.
  - Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors.
- Weaknesses:
  - Need to choose a good kernel function.
73. Summary
- SVMs find the optimal linear separator.
- They pick the hyperplane that maximises the margin.
- The optimal hyperplane turns out to be a linear combination of the support vectors.
- The kernel trick makes SVMs non-linear learning algorithms: nonlinear problems are transformed into a higher-dimensional space using kernel functions, where there is a better chance that the classes will be linearly separable.
74. Other aspects of SVM
- How to use SVM for multi-class classification?
  - One can change the QP formulation to become multi-class.
  - More often, multiple binary classifiers are combined: one can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers intelligently.
- How to interpret the SVM discriminant function value as a probability?
  - By performing logistic regression on the SVM output of a set of data (a validation set) that is not used for training.
- Some SVM software (like libsvm) has these features built in.
75. Multi-class Classification
- SVM is basically a two-class classifier.
- One can change the QP formulation to allow multi-class classification.
- More commonly, the data set is divided into two parts intelligently in different ways, and a separate SVM is trained for each way of dividing it.
- Multi-class classification is done by combining the outputs of all the SVM classifiers (a sketch follows below):
  - Majority rule
  - Error-correcting code
  - Directed acyclic graph
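A minimal sketch of the two common reductions in scikit-learn (SVC itself trains all one-versus-one pairs; OneVsRestClassifier implements one-versus-all); the 3-class iris data set is used purely as an example:

```python
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

ovo = SVC(kernel='linear')                      # one-versus-one pairs internally
ovr = OneVsRestClassifier(SVC(kernel='linear')) # one binary SVM per class
print(ovo.fit(X, y).score(X, y))
print(ovr.fit(X, y).score(X, y))
```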
76. Software
- A list of SVM software can be found at:
  - http://www.kernel-machines.org/software.html
- Among these, LIBSVM is one of the most widely used implementations.
- SVMLight is another well-known implementation.
- Several Matlab toolboxes for SVM have also been written.
77. References
- [1] B. E. Boser et al. A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, 144-152, Pittsburgh, 1992.
- [2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82.
- [3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.