1
Support Vector Machines
  • H. Clara Pong
  • Julie Horrocks1, Marianne Van den Heuvel2, Francis
    Tekpetey3, B. Anne Croy4.
  • 1 Mathematics & Statistics, University of Guelph,
  • 2 Biomedical Sciences, University of Guelph,
  • 3 Obstetrics and Gynecology, University of
    Western Ontario,
  • 4 Anatomy & Cell Biology, Queen's University

2
Outline
  • Background
  • Separating Hyper-plane
  • Basis Expansion
  • Support Vector Machines
  • Simulations
  • Remarks

3
Background
  • Motivation
  • The IVF (In-Vitro Fertilization) project
  • 18 infertile women
  • each undergoing the IVF treatment
  • Outcome (Outputs, Ys): Binary (pregnancy)
  • Predictor (Inputs, Xs): Longitudinal data
    (adhesion)

4
Background
  • Classification methods
  • Relatively new method: Support Vector Machines
  • First proposed by V. Vapnik in 1979
  • Maps the input space into a high-dimensional feature
    space
  • Constructs a linear classifier in the new feature
    space
  • Traditional method: Discriminant Analysis
  • R.A. Fisher, 1936
  • Classifies according to the values of the
    discriminant functions
  • Assumption: the predictors X in a given class have
    a multivariate normal distribution.

5
Separating Hyper-plane
  • Suppose there are 2 classes (A, B)
  • y = 1 for group A, y = -1 for group B.
  • Let a hyper-plane be defined as f(X) = β0 + βᵀX = 0;
    then f(X) = 0 is the decision boundary that
    separates the two groups.
  • f(X) = β0 + βᵀX > 0 for X ∈ A
  • f(X) = β0 + βᵀX < 0 for X ∈ B

Given X0 ∈ A, it is misclassified when f(X0) < 0.
Given X0 ∈ B, it is misclassified when f(X0) > 0.
6
Separating Hyper-plane
  • The perceptron learning algorithm searches for
    a hyper-plane that minimizes the distance of
    misclassified points to the decision boundary
    (a sketch follows below).

However, this does not provide a unique solution.
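As an illustration, here is a minimal R sketch of the perceptron update rule, assuming X is a numeric matrix of inputs and y is coded in {-1, +1} as on the previous slide; the function name, learning rate eta, and epoch count are illustrative placeholders, not values from the talk.

  # Minimal perceptron sketch: cycle through the points and nudge the
  # hyper-plane (beta0, beta) toward each misclassified point.
  perceptron <- function(X, y, eta = 0.1, epochs = 100) {
    beta <- rep(0, ncol(X)); beta0 <- 0
    for (e in seq_len(epochs)) {
      for (i in seq_len(nrow(X))) {
        if (y[i] * (sum(X[i, ] * beta) + beta0) <= 0) {  # misclassified
          beta  <- beta + eta * y[i] * X[i, ]            # move beta toward y_i x_i
          beta0 <- beta0 + eta * y[i]
        }
      }
    }
    list(beta0 = beta0, beta = beta)
  }

Different starting values or orderings of the points can end at different separating hyper-planes, which is the non-uniqueness noted above.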
7
Optimal Separating Hyper-plane
  • Let C be the distance of the closest point from
    the two groups to the hyper-plane.
  • The Optimal Separating hyper-plane is the unique
    separating hyper-plane f(X) = β0 + βᵀX = 0,
    where (β0, β) maximizes C.

8
Optimal Separating Hyper-plane
  • Maximization problem: choose (β0, β) to maximize the
    margin C subject to yi(xiᵀβ + β0) ≥ C for all i
    (equivalently, minimize ½||β||² subject to
    yi(xiᵀβ + β0) ≥ 1).

The solution satisfies the Kuhn-Tucker conditions:
1. αi [yi(xiᵀβ + β0) - 1] = 0
2. αi ≥ 0 for all i = 1, ..., N
3. β = Σi=1..N αi yi xi
4. Σi=1..N αi yi = 0
Hence f(X) only depends on the xi's where αi ≠ 0
(the support vectors; see the sketch below).
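In software this is visible directly. With a fitted svm object from the e1071 package (cited in the references), the points with nonzero αi can be inspected; a sketch, assuming fit is such an object as in the slide-16 example later:

  # Only the support vectors (points with nonzero alpha_i) enter f(X).
  fit$index   # row indices of the support vectors in the training data
  fit$coefs   # the corresponding alpha_i * y_i coefficients
  fit$SV      # the support vectors themselves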
9
Optimal Separating Hyper-plane
10
Basis Expansion
  • Suppose there are p inputs, X = (x1, ..., xp)
  • Let hk(X) be a transformation that maps X from
    Rp → R.
  • hk(X) is called the basis function.
  • H = {h1(X), ..., hm(X)} is the basis of a new
    feature space (dim = m)

Example: X = (x1, x2), H = {h1(X), h2(X), h3(X)} with
h1(X) = h1(x1, x2) = x1, h2(X) = h2(x1, x2) = x2,
h3(X) = h3(x1, x2) = x1x2.
X_new = H(X) = (x1, x2, x1x2)
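The slide's example written as a small R function (the name H is just illustrative):

  # Map (x1, x2) to the new 3-dimensional feature (x1, x2, x1*x2).
  H <- function(x) c(h1 = x[1], h2 = x[2], h3 = x[1] * x[2])
  H(c(2, 3))   # returns (2, 3, 6)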
11
Support Vector Machines
  • The optimal hyper-plane is {X : f(X) = β0 + βᵀX = 0}.
  • f(X) = β0 + βᵀX is called the Support Vector
    Classifier.

12
Support Vector Machines
Non-separable case: the training data cannot be
separated without error.
  • Hyper-plane {X : f(X) = β0 + βᵀX = 0}

Xi crosses the margin of its group when
C - yi f(Xi) > 0.
Si = C - yi f(Xi) when Xi crosses the margin, and
Si = 0 when Xi is outside the margin.
Let ξi C = Si; ξi is the proportion of C by which the
prediction has crossed the margin. Misclassification
occurs when Si > C (ξi > 1).
13
Support Vector Machines
The overall misclassification is Σi ξi, and it is
bounded by d.
14
Support Vector Machines
  • SVM searches for an optimal hyper-plane in a new
    feature space where the data are more separable.

Suppose H = {h1(X), ..., hm(X)} is the basis of
the new feature space F. Every element of the new
feature space is a linear basis expansion of X.
15
Support Vector Machines
The kernel and the basis transformation define
one another: K(X, X') = ⟨H(X), H(X')⟩.
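A quick numerical illustration (my own example, not from the talk): for the degree-2 polynomial kernel K(u, v) = (uᵀv)² on R², the explicit basis is h(x) = (x1², √2 x1x2, x2²), and the two computations agree:

  # Degree-2 polynomial kernel vs. its explicit feature map.
  h <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
  u <- c(1, 2); v <- c(3, 4)
  sum(u * v)^2       # kernel value: (1*3 + 2*4)^2 = 121
  sum(h(u) * h(v))   # inner product in feature space: also 121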
16
Support Vector Machines
  • Dual Lagrange function
    LD = Σi αi - ½ Σi Σj αi αj yi yj ⟨H(xi), H(xj)⟩
       = Σi αi - ½ Σi Σj αi αj yi yj K(xi, xj)

This shows that the basis transformation in SVM does
not need to be defined explicitly; the kernel alone
is enough.
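This is how kernel-based fitting looks with the e1071 package cited in the references; a minimal sketch, assuming train and test are data frames with predictors x1, x2 and a factor response y (as simulated on the next slides):

  library(e1071)
  # The kernel is named directly; the basis transformation is never built.
  fit  <- svm(y ~ x1 + x2, data = train, kernel = "radial", cost = 1)
  pred <- predict(fit, newdata = test)

The cost argument is the tuning constant that, in the Lagrangian form, plays the role of the bound d on the total slack from slide 13.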
17
Simulations
  • 3 cases
  • 100 simulations per case
  • Each simulation consists of 200 points
  • 100 points from each group
  • Input space: 2-dimensional
  • Output: 0 or 1 (2 groups)
  • Half of the points are randomly selected as the
    training set (a sketch of one simulation follows
    below).

X = (x1, x2), Y ∈ {0, 1}
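A sketch of one simulation under the Case 1 setup (two bivariate normals with a common covariance matrix); the means and covariance below are placeholders, since the talk does not list the exact values:

  library(MASS)                                      # for mvrnorm()
  n  <- 100
  g0 <- mvrnorm(n, mu = c(0, 0), Sigma = diag(2))    # group 0
  g1 <- mvrnorm(n, mu = c(1, 1), Sigma = diag(2))    # group 1, shifted mean
  dat <- data.frame(rbind(g0, g1), y = factor(rep(c(0, 1), each = n)))
  names(dat)[1:2] <- c("x1", "x2")
  idx   <- sample(nrow(dat), nrow(dat) / 2)          # half as the training set
  train <- dat[idx, ]
  test  <- dat[-idx, ]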
18
Simulations
  • Case 1 (Normal with same covariance matrix)

19
Simulations
  • Case 1

Misclassifications (in 100 simulations)
        Training        Testing
        Mean    Sd      Mean    Sd
LDA     7.85    2.65    8.07    2.51
SVM     6.98    2.33    8.48    2.81
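A sketch of how one such comparison can be computed in R, using lda() from MASS and svm() from e1071 on the simulated train/test split above (the talk's exact settings are not given):

  library(MASS); library(e1071)
  lda_fit <- lda(y ~ x1 + x2, data = train)
  svm_fit <- svm(y ~ x1 + x2, data = train)
  # Count test-set misclassifications for each method.
  lda_err <- sum(predict(lda_fit, test)$class != test$y)
  svm_err <- sum(predict(svm_fit, test) != test$y)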
20
Simulations
  • Case 2 (Normal with unequal covariance matrices)

21
Simulations
  • Case 2

Misclassifications (in 100 simulations)
        Training        Testing
        Mean    Sd      Mean    Sd
QDA     15.5    3.75    16.84   3.48
SVM     13.6    4.03    18.8    4.01
22
Simulations
  • Case 3 (Non-normal)

23
Simulations
  • Case 3

Misclassifications (in 100 simulations)
        Training        Testing
        Mean    Sd      Mean    Sd
QDA     14      3.79    16.8    3.63
SVM     9.34    3.46    14.8    3.21
24
Simulations
  • Paired t-test for differences in
    misclassifications (a sketch of the test follows
    below)
  • H0: mean difference = 0; Ha: mean difference ≠ 0
  • Case 1
  • mean difference (LDA - SVM) = -0.41, se = 0.3877
  • t = -1.057, p-value = 0.29 (not significant)
  • Case 2
  • mean difference (QDA - SVM) = -1.96, se = 0.4170
  • t = -4.70, p-value = 8.42e-06 (significant)
  • Case 3
  • mean difference (QDA - SVM) = 2, se = 0.4218
  • t = 4.74, p-value = 7.13e-06 (significant)
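The test itself is one line in R; a sketch assuming lda_errs and svm_errs are hypothetical length-100 vectors holding the per-simulation test misclassification counts:

  # Paired t-test on per-simulation misclassification counts.
  t.test(lda_errs, svm_errs, paired = TRUE)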

25
Remarks
  • Support Vector Machines
  • Map the original input space onto a feature
    space of higher dimension
  • No assumption on the distribution of the Xs
  • Performance
  • The performances of Discriminant Analysis and SVM
    are similar (when X given Y has a normal
    distribution and the groups share the same Σ)
  • Discriminant Analysis performs better
    (when the covariance matrices for the two groups
    are different)
  • SVM performs better
    (when the input X violates the distributional
    assumption)

26
Reference
  • N. Cristianini and J. Shawe-Taylor. An
    Introduction to Support Vector Machines and Other
    Kernel-based Learning Methods. New York:
    Cambridge University Press, 2000.
  • J. Friedman, T. Hastie, and R. Tibshirani. The
    Elements of Statistical Learning. New York:
    Springer, 2001.
  • D. Meyer, C. Chang, and C. Lin. R Documentation:
    Support Vector Machines.
    http://www.maths.lth.se/help/R/.R/library/e1071/html/svm.html
    Last updated March 2006.
  • H. Planatscher and J. Dietzsch. SVM Tutorial
    using R (e1071 package).
    http://www.potschi.de/svmtut/svmtut.htm
  • M. Van Den Heuvel, J. Horrocks, S. Bashar, S.
    Taylor, S. Burke, K. Hatta, E. Lewis, and A.
    Croy. Menstrual Cycle Hormones Induce Changes in
    Functional Interactions Between Lymphocytes and
    Endothelial Cells. Journal of Clinical
    Endocrinology and Metabolism, 2005.

27
Thank You!