Title: SVM Classifier Introduction
SVM Classifier: Introduction
- Linear SVM (separable data)
- Linear SVM (nonseparable data)
- Nonlinear SVM (nonseparable data)
1) Linear SVM (separable data)
- Hyperplane definition
- Maximum margin
- Scaling
- Final formula
Hyperplane definition
- If the data are linearly separable, a hyperplane f(x) = w·x + b = 0 exists such that y_i(w·x_i + b) > 0 for every training point (x_i, y_i), with y_i in {-1, +1}.
w and maximum margin
- Given a point x and a hyperplane w·x + b = 0, the distance is d(x) = |w·x + b| / ||w||, which is a function of ||w||.
- Finding the hyperplane that classifies the training set correctly and has minimum norm (minimum ||w||²) means finding the hyperplane with the maximum margin from the points of the training set, as the sketch below illustrates.
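As a minimal sketch (w, b and the test point are made-up values, not from the slides), the decision function and the point-to-hyperplane distance |w·x + b| / ||w|| can be computed as:

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative values only)
w = np.array([2.0, 1.0])
b = -1.0

def decision(x):
    """f(x) = w·x + b; its sign is the predicted class."""
    return np.dot(w, x) + b

def distance(x):
    """Geometric distance of x from the hyperplane w·x + b = 0."""
    return abs(decision(x)) / np.linalg.norm(w)

x = np.array([1.0, 2.0])
print(np.sign(decision(x)), distance(x))  # class +1, distance 3/sqrt(5)
```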
Maximum margin
- It can be shown that the generalization capacity of the SVM grows as the margin grows.
- So we obtain the best generalization when the hyperplane has the maximum margin. This is the optimal separating hyperplane (OSH).
Final formula
- Minimize (1/2)||w||² subject to y_i(w·x_i + b) ≥ 1 for all training points.
Lagrangian solution
- We must minimize the Lagrangian L(w, b, α) = (1/2)||w||² - Σ_i α_i [y_i(w·x_i + b) - 1].
- This is a QP problem with solution w = Σ_i α_i y_i x_i.
- Here the α_i are the Lagrange multipliers.
Optimum hyperplane
- So the final optimum hyperplane is f(x) = Σ_i α_i y_i (x_i·x) + b, and the class of x is sign(f(x)).
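A small sketch (scikit-learn and a toy dataset are assumptions, not part of the slides) that rebuilds w = Σ_i α_i y_i x_i from the dual coefficients of a fitted linear SVM and checks it against the primal weight vector:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative values only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors,
# so w = sum_i alpha_i y_i x_i can be rebuilt from the dual solution.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True: same hyperplane
```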
2) Linear SVM (nonseparable data)
- Generalization, soft margins, slacks.
- Formulation
Optimum hyperplane generalization
- In this phase the constraint of exact classification is relaxed (so we talk about soft margins).
- We introduce slack variables ξ_i ≥ 0, so the constraint becomes y_i(w·x_i + b) ≥ 1 - ξ_i.
Optimum hyperplane generalization
- Depending on the value of the slack variable ξ_i we have:
- ξ_i = 0: the point is on or beyond the margin and correctly classified
- 0 < ξ_i ≤ 1: the point is inside the margin but still correctly classified
- ξ_i > 1: the point is misclassified
Soft margins and slacks
Formulation - Primal
- Minimize (1/2)||w||² + C Σ_i ξ_i subject to y_i(w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0.
Lagrangian solution
- We must minimize the Lagrangian.
- This is a QP problem with the same solution w = Σ_i α_i y_i x_i.
- But now 0 ≤ α_i ≤ C.
- C manages the compromise between the size of the margin (favoured by lower values of C) and the number of errors tolerated in the training phase (if C → ∞ we recover the perfectly separating hyperplane). The sketch below illustrates this trade-off.
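A sketch (scikit-learn on made-up, slightly overlapping data) of this trade-off: a small C buys a wider margin at the price of more margin violations, while a large C approaches the separable-case behaviour:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian blobs (illustrative data)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)  # geometric margin width
    n_sv = clf.support_vectors_.shape[0]      # points on or inside the margin
    print(f"C={C:>6}: margin width={margin:.2f}, support vectors={n_sv}")
```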
Final optimum hyperplane
- So the final optimum hyperplane is the same: f(x) = Σ_i α_i y_i (x_i·x) + b.
3) Nonlinear SVM (nonseparable data)
- Mapping F(x)
- Kernel functions
- Loss functions
- Formulation
Mapping F(x)
- When the sets are not linearly separable, we introduce a mapping F(x) into a higher-dimensional space in order to obtain linearly separable sets.
- So instead of increasing the complexity of the classifier (it is still a hyperplane), we increase the dimensionality of the feature space, as the sketch below shows.
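A sketch of such an explicit mapping (scikit-learn and the circular toy data are assumptions for illustration): adding the feature x1² + x2² makes a disc-versus-ring problem linearly separable:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, -1, 1)  # inner disc vs. outer region

def feature_map(X):
    """Hypothetical mapping F(x) = (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([X, (X ** 2).sum(axis=1)])

acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)                            # poor
acc_3d = SVC(kernel="linear").fit(feature_map(X), y).score(feature_map(X), y)
print(acc_2d, acc_3d)  # the mapped data is (almost) perfectly separated
```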
Kernel functions
- The transformed space can have a very large dimension, so the mapping function F(·) can be very expensive to evaluate.
- In the learning and classification phases we only need to handle the scalar product F(x)·F(y).
- By Mercer's theorem, there exists a kernel function K(x,y) such that K(x,y) = F(x)·F(y).
- So the discriminant function is f(x) = Σ_i α_i y_i K(x_i, x) + b.
- The use of kernel functions therefore avoids the explicit mapping into the high-dimensional space (see the check below).
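A tiny numpy check (vectors chosen arbitrarily) of this identity for the homogeneous degree-2 polynomial kernel: K(x,y) = (x·y)² equals the scalar product of the explicit maps F(x) = (x1², √2·x1·x2, x2²):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs (assumed for illustration)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k_poly2(x, y):
    """Homogeneous polynomial kernel of degree 2."""
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.isclose(k_poly2(x, y), np.dot(phi(x), phi(y))))  # True
```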
Some kernel functions
- Linear kernel: K(x,y) = x·y
- Polynomial kernel: K(x,y) = (x·y + c)^d
- Gaussian kernel (RBF, Radial Basis Function): K(x,y) = exp(-||x - y||² / (2σ²))
- MLP kernel (multi-layer perceptron): K(x,y) = tanh(κ x·y + θ)
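Minimal numpy versions of these kernels (the parameter names c, d, sigma, kappa and theta are generic placeholders, not values taken from the slides):

```python
import numpy as np

def k_linear(x, y):
    return np.dot(x, y)

def k_poly(x, y, c=1.0, d=3):
    return (np.dot(x, y) + c) ** d

def k_rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def k_mlp(x, y, kappa=1.0, theta=0.0):
    # Sigmoid / MLP kernel; a valid Mercer kernel only for some parameter choices
    return np.tanh(kappa * np.dot(x, y) + theta)
```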
Loss functions
- Consider a true multilabel y and a predicted one t.
- The basic goal is to learn a function f that approximates the unknown target function.
- To evaluate the goodness of the approximation we need a loss function l(y,t), denoting the price we pay for predicting t when the associated true label is y.
Loss function conditions
- Basic condition of the loss function:
- The loss should be monotonically decreasing with respect to the sets of incorrect multilabels.
Final formulation - Primal
- With a single slack variable for each training example.
Lagrangian solution
- Now we must minimize the Lagrangian.
- This is a QP problem with the solution w = Σ_i α_i y_i F(x_i).
- So the final optimum hyperplane is f(x) = Σ_i α_i y_i K(x_i, x) + b.
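A sketch (scikit-learn, toy data and an arbitrary gamma are assumptions) that rebuilds this decision function f(x) = Σ_i α_i y_i K(x_i, x) + b by hand from the support vectors of a trained RBF-kernel SVM:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def decision_by_hand(x):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b over the support vectors."""
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return (clf.dual_coef_ @ k).item() + clf.intercept_[0]

x_new = np.array([0.3, -0.2])
print(np.isclose(decision_by_hand(x_new), clf.decision_function([x_new])[0]))  # True
```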
Learning Hierarchical Multi-Category Text Classification Models
- Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor
Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, 2005
Hierarchical Multilabel Classification: union of partial paths model
- Goal: given a document x and a hierarchy T = (V,E), predict a multilabel y in {+1,-1}^|V| whose positive microlabels y_k form a union of partial paths in T.
A news article about David and Victoria Beckham could belong to different partial paths and might not belong to any leaf categories; the sketch below checks this partial-path condition.
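A small sketch of the partial-path condition (the tree and labels are hand-made, not from the paper): a multilabel is a union of partial paths exactly when every node labelled +1 has a parent labelled +1:

```python
# parent[j] gives the parent of node j in the hierarchy T; the root has parent None.
parent = {"root": None, "sport": "root", "football": "sport",
          "celebrity": "root", "music": "celebrity"}

def is_union_of_partial_paths(y):
    """y maps each node to +1 / -1; positives must be closed under taking parents."""
    return all(parent[j] is None or y[parent[j]] == 1
               for j, label in y.items() if label == 1)

# A Beckham-style article: positive on two partial paths, no leaf category needed.
y = {"root": 1, "sport": 1, "football": -1, "celebrity": 1, "music": -1}
print(is_union_of_partial_paths(y))  # True
```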
Frequently used learning strategies for hierarchies
- Flatten the hierarchy: learn each microlabel independently with a classification learner of your choice (see the sketch after this list).
- Computationally relatively inexpensive.
- Does not make use of the dependencies between microlabels.
- Hierarchical training: train node j only with the examples (x,y) that belong to its parent.
- Some of the microlabel dependencies are learned.
- However, the training data fragments toward the leaves, hence the estimation becomes less reliable.
- The model is not explicitly trained in terms of a loss function for the hierarchy.
- The authors try to improve on these approaches.
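A minimal sketch of the "flatten the hierarchy" baseline (the features, labels and use of scikit-learn are assumptions for illustration): one independent binary SVM per microlabel, with no use of the tree structure:

```python
import numpy as np
from sklearn.svm import SVC

# Toy documents (bag-of-words-like features) and per-node microlabels, made up.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 0.9]])
Y = np.array([[ 1,  1, -1],   # each column is one node of the flattened hierarchy
              [ 1,  1, -1],
              [ 1, -1,  1],
              [-1, -1,  1]])

# Train one classifier per microlabel, independently of the others.
models = [SVC(kernel="linear").fit(X, Y[:, j]) for j in range(Y.shape[1])]
pred = np.column_stack([m.predict(X) for m in models])
print(pred)
```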
Multi-classification
- Multi-classification: the multilabel is a union of partial paths in the hierarchy.
- Results are post-processed: if the label of a node is predicted as -1, then all descendants of that node are also labelled negatively (done to obtain good accuracy); a sketch of this step follows.
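A sketch of that post-processing step (the hierarchy and predictions are hypothetical): once a node is predicted -1, its whole subtree is relabelled -1:

```python
# children[j] lists the children of node j (hypothetical hierarchy).
children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": [], "A1": [], "A2": []}

def force_negative_subtrees(pred, node="root"):
    """If a node is predicted -1, relabel its whole subtree as -1 (in place)."""
    for child in children[node]:
        if pred[node] == -1:
            pred[child] = -1
        force_negative_subtrees(pred, child)
    return pred

pred = {"root": 1, "A": -1, "A1": 1, "A2": -1, "B": 1}
print(force_negative_subtrees(pred))  # A1 becomes -1 because its parent A is -1
```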
Loss functions for multilabel classification
- Consider a true multilabel y and a predicted one t.
- There are many choices:
- Zero-one loss
- Symmetric difference loss
- They don't take the hierarchy into account (see the sketch below).
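A small numpy sketch of the two flat losses (the multilabels are made-up): zero-one loss charges one unit for any difference between the vectors, symmetric difference loss counts the differing microlabels:

```python
import numpy as np

def zero_one_loss(y, t):
    """1 if the predicted multilabel differs anywhere from the true one, else 0."""
    return int(not np.array_equal(y, t))

def symmetric_difference_loss(y, t):
    """Number of microlabels on which y and t disagree."""
    return int(np.sum(y != t))

y = np.array([1, 1, -1, -1])   # true multilabel (illustrative)
t = np.array([1, -1, -1, 1])   # predicted multilabel
print(zero_one_loss(y, t), symmetric_difference_loss(y, t))  # 1 2
```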
Hierarchical loss functions
- Goal: take the hierarchy into account.
- Hierarchical loss: only the first mistake along a path is penalized (Cesa-Bianchi).
- Simplified hierarchical loss: a mistake in a child is penalized only if the parent was predicted correctly (see the sketch below).
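A sketch of the two hierarchical losses as stated above (the tree, the labels and the uniform coefficients c_j = 1 are assumptions for illustration, not the paper's exact setup): the hierarchical loss charges a mistaken node only when all of its ancestors are correct, the simplified version only checks the parent:

```python
# parent[j] gives the parent of node j; the root has parent None (hypothetical tree).
parent = {"root": None, "A": "root", "A1": "A"}

def ancestors_correct(y, t, node):
    """True if every ancestor of `node` is predicted correctly."""
    p = parent[node]
    while p is not None:
        if y[p] != t[p]:
            return False
        p = parent[p]
    return True

def hierarchical_loss(y, t, c=1.0):
    """Penalize only the first mistake along each path (Cesa-Bianchi style)."""
    return sum(c for j in y if y[j] != t[j] and ancestors_correct(y, t, j))

def simplified_hierarchical_loss(y, t, c=1.0):
    """Penalize a mistaken node only if its parent was predicted correctly."""
    return sum(c for j in y
               if y[j] != t[j] and (parent[j] is None or y[parent[j]] == t[parent[j]]))

y = {"root": 1, "A": 1, "A1": 1}     # true multilabel (illustrative)
t = {"root": -1, "A": 1, "A1": -1}   # prediction: mistakes at root and A1
print(hierarchical_loss(y, t), simplified_hierarchical_loss(y, t))  # 1 2
```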
Coefficients c_j
- The coefficients c_j are used for down-scaling the loss when going deeper in the tree. They can be chosen in many ways:
- Uniform loss
- Siblings loss
- Subtree loss
Maximum margin learning
- The model class is defined on the edges of a Markov tree T = (V,E).
- F(x) is the vector representation of the document x (bag of words). In the training data some F(x) are duplicated with different weights.
- Maximize the ratio between the probability of the correct labelling y_i and the worst competing labelling y.
- With the exponential family, the problem translates into maximizing the minimum linear margin.
Optimization problem - Primal
- Using a single slack variable for each training example.
Optimization problem - Dual
- Where K is the joint kernel.
- There is an exponential number (in the size of the hierarchy) of primal constraints and dual variables for each example.
Marginalized problem
- To obtain a polynomial-size problem:
- Edge-marginals of the dual variables.
- Loss function decomposed by edges.
- Kernel decomposed by edges.
- Conditional gradient descent is used to optimize the marginalized problem (few iterations needed to update the variables).
Prediction quality
- On the REUTERS and WIPO datasets, the flat SVM obtains the highest precision, but the lowest recall and F1. The F1 values are similar for all the hierarchical models.
References
- Rousu, J., Saunders, C., Szedmak, S. and Shawe-Taylor, J. (2004). On Maximum Margin Hierarchical Classification. In Proceedings of the Workshop on Learning with Structured Outputs at NIPS 2004, Whistler.
- Rousu, J., Saunders, C., Szedmák, S. and Shawe-Taylor, J. (2006). Kernel-Based Learning of Hierarchical Multilabel Classification Models. Journal of Machine Learning Research 7, 1601-1626.
- Cesa-Bianchi, N., Gentile, C., Tironi, A. and Zaniboni, L. (2004). Incremental Algorithms for Hierarchical Classification. Neural Information Processing Systems.
- Cesa-Bianchi, N., Gentile, C. and Zaniboni, L. (2006). Incremental Algorithms for Hierarchical Classification. Journal of Machine Learning Research 7, 31-54.