A Rough Overview of Support Vector Machines

Slides: 31

Provided by: sil973

1
A Rough Overview of Support Vector Machines

References
Support vector machines and machine learning on documents. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval, 2008.
Support Vector Machines: Training and Applications. E. Osuna, et al. MIT A.I. Lab, 1997.
An Improved Training Algorithm for Support Vector Machines. E. Osuna, et al. IEEE NNSP '97.
A Tutorial on Support Vector Machines for Pattern Recognition. C. J. C. Burges. Data Mining and Knowledge Discovery, 1998.
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. T. Joachims. ICML, 1997.
Text Categorization with Support Vector Machines: Learning with Many Relevant Features. T. Joachims. 1997.
http://www-csli.stanford.edu/hinrich/information-retrieval-book.html
http://www-csli.stanford.edu/hinrich/newslides.html
http://en.wikipedia.org/wiki/Quadratic_programming
http://www.cmlab.csie.ntu.edu.tw/cyy/learning/tutorials/SVM3.pdf

Presenter: Suhan Yu
2
The main idea of SVM

An SVM is a kind of large-margin classifier
Its goal is to find a decision boundary between two classes
The subject was started in the late seventies by Vapnik (1979)

Vladimir Naumovich Vapnik
Russian
Master's degree: Mathematics
Ph.D.: Statistics
3
Applications of SVM

Isolated handwritten digit recognition
Object recognition
Speaker identification
Face detection
Text categorization (Joachims, 1997)

4
Text classification

Earlier
TFIDF classifier
k-NN

5
Text classification

Earlier
Naïve Bayes Classifier
Rocchio
Today
SVM

6
Why should SVMs work well for text categorization?

High-dimensional input space
Learning text classifiers has to deal with more than 10,000 features
Few irrelevant features
The features are strongly related to one another
Document vectors are sparse

7
The main idea of SVM
[Figure: a separating hyperplane and its margin]
8
Support Vector Machine (SVM)

SVMs maximize the margin around the separating
hyperplane.
A.k.a. large margin classifiers
The decision function is fully specified by a
subset of training samples, the support vectors.
Training is a quadratic programming problem

9
Maximum Margin Formalization

$w$: decision hyperplane normal
$x_i$: data point $i$
$y_i$: class of data point $i$ ($+1$ or $-1$). NB: not 1/0
Classifier is $f(x_i) = \mathrm{sign}(w^T x_i + b)$
Functional margin of $x_i$ is $y_i (w^T x_i + b)$
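
The classifier sign(w^T x + b) and the functional margin y (w^T x + b) can be sketched in a few lines of plain Python. The hyperplane and the two points below are hypothetical values chosen only for illustration, and a score of exactly zero is mapped to +1 by assumption.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def classify(w, b, x):
    """f(x) = sign(w^T x + b), returned as +1 or -1 (a score of 0 maps to +1)."""
    return 1 if dot(w, x) + b >= 0 else -1

def functional_margin(w, b, x, y):
    """y * (w^T x + b); positive exactly when x lies on the side given by y."""
    return y * (dot(w, x) + b)

# Hypothetical 2-D hyperplane and points, not taken from the slides.
w, b = [1.0, 1.0], -3.0
x_pos, x_neg = [3.0, 2.0], [1.0, 0.5]

print(classify(w, b, x_pos), functional_margin(w, b, x_pos, +1))
print(classify(w, b, x_neg), functional_margin(w, b, x_neg, -1))
```

Note that the labels must be +1/-1 for the functional margin to be positive on correctly classified points, which is why the slide warns against 1/0 labels.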

10
[Figure: the planar decision surface in data space for the simple linear discriminant function]
11
Linear Support Vector Machine (SVM)
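Hyperplane: $w^T x + b = 0$
Extra scale constraint: $\min_{i=1,\dots,n} |w^T x_i + b| = 1$
This implies: $w^T (x_a - x_b) = 2$, so $\rho = \|x_a - x_b\|_2 = 2 / \|w\|_2$

[Figure: the hyperplane $w^T x + b = 0$ between the margin boundaries $w^T x_a + b = 1$ and $w^T x_b + b = -1$, separated by the margin $\rho$]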
12
Linear SVM Mathematically

Assume that all data is at least distance 1 from the hyperplane; then the following two constraints follow for a training set $\{(x_i, y_i)\}$:
$w^T x_i + b \ge 1$ if $y_i = 1$
$w^T x_i + b \le -1$ if $y_i = -1$
For support vectors, the inequality becomes an equality
Then, since each example's distance from the hyperplane is $r = y (w^T x + b) / \|w\|$
The margin is $\rho = 2 / \|w\|$
13
Geometric Margin

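Distance from an example $x$ to the separator is $r = y (w^T x + b) / \|w\|$
Examples closest to the hyperplane are support vectors.
Margin $\rho$ of the separator is the width of separation between support vectors of classes.

[Figure: a point $x$ at distance $r$ from its projection $x'$ on the hyperplane]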
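
As a numerical sanity check, assuming the standard distance formula r = y (w^T x + b) / ||w||, the sketch below computes the distance of one support vector per class and confirms that the two distances add up to 2 / ||w||. The hyperplane and points are hypothetical and scaled so that the support vectors have functional margin 1.

```python
import math

def geometric_distance(w, b, x, y):
    """r = y * (w^T x + b) / ||w||: signed distance of x from the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return y * score / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane, scaled so the closest points have functional margin 1.
w, b = [0.5, 0.5], 0.0
support_pos, support_neg = [1.0, 1.0], [-1.0, -1.0]

r_pos = geometric_distance(w, b, support_pos, +1)
r_neg = geometric_distance(w, b, support_neg, -1)
margin = r_pos + r_neg  # width of the separation between the two classes

print(margin, 2 / math.sqrt(sum(wi * wi for wi in w)))
```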
14
Linear SVM Mathematically

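To summarize
Find $w$ and $b$ such that $\Phi(w) = \tfrac{1}{2} w^T w$ is minimized, and for all $(x_i, y_i)$: $y_i (w^T x_i + b) \ge 1$
Quadratic function
A quadratic function $f$ is a function of the form $f(x) = \tfrac{1}{2} x^T Q x + c^T x$

A necessary condition for a point $x$ to be a global minimizer is for it to satisfy the Karush-Kuhn-Tucker (KKT) conditions. The KKT conditions are also sufficient when $f(x)$ is convex.
Convex function
$f$ is convex when $Q$ is positive semidefinite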
15
Linear SVM Mathematically

Lagrange Multiplier
Differentiating
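
As a sketch of the standard derivation (not necessarily the exact form shown on the slide), take the hard-margin problem of minimizing $\tfrac{1}{2} w^T w$ subject to $y_i (w^T x_i + b) \ge 1$. Introducing one multiplier $\alpha_i \ge 0$ per constraint, differentiating with respect to $w$ and $b$, and substituting back gives the dual objective $Q(\alpha)$:

```latex
\begin{align*}
L(w, b, \alpha) &= \tfrac{1}{2} w^T w - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right], \qquad \alpha_i \ge 0 \\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i \\
\frac{\partial L}{\partial b} = 0 &\;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
Q(\alpha) &= \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^T x_j
\end{align*}
```

The data enter $Q(\alpha)$ only through the inner products $x_i^T x_j$.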

16
An example of SVM
17
Non-linear SVMs

Datasets that are linearly separable (with some
noise) work out great
But what are we going to do if the dataset is
just too hard?
How about mapping data to a higher-dimensional space?

[Figure: data plotted against axes $x$ and $x^2$]
18
Nonlinear SVMs

Project the linearly inseparable data to a high-dimensional space where it is linearly separable, and then we can use a linear SVM
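
A minimal sketch of this idea, using hypothetical one-dimensional data: the positive class sits between the negatives, so no single threshold on x separates the classes, but after the assumed mapping x -> (x, x^2) a hand-picked linear rule does. The weights and bias are chosen by inspection, not learned by an SVM.

```python
# Hypothetical 1-D data: positives lie between the negatives.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [-1, -1, +1, +1, +1, -1, -1]

def phi(x):
    """Map a 1-D point to the 2-D feature space (x, x^2)."""
    return (x, x * x)

def linear_classifier(z, w=(0.0, -1.0), b=2.5):
    """Linear rule in feature space: sign(w^T z + b), i.e. +1 when x^2 <= 2.5."""
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score >= 0 else -1

predictions = [linear_classifier(phi(x)) for x in xs]
print(predictions == ys)
```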

19
[Figure: data that is not linearly separable in the original coordinates becomes linearly separable in polar coordinates; axes: angular degree (phase) and distance from center (radius)]
Need to transform the coordinates: polar coordinates, or a kernel transformation into a higher-dimensional space (support vector machines).
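
The polar-coordinate transformation can be sketched as follows. The two rings of points are hypothetical data, and the radius threshold of 2.5 is picked by hand for illustration.

```python
import math

def to_polar(x, y):
    """Convert Cartesian (x, y) to (radius, angle in degrees)."""
    return math.hypot(x, y), math.degrees(math.atan2(y, x))

# Hypothetical data: an inner ring (class -1) surrounded by an outer ring (class +1).
angles = [0, 60, 120, 180, 240, 300]
inner = [(1.0 * math.cos(math.radians(a)), 1.0 * math.sin(math.radians(a))) for a in angles]
outer = [(4.0 * math.cos(math.radians(a)), 4.0 * math.sin(math.radians(a))) for a in angles]

def classify_by_radius(point, threshold=2.5):
    """In polar coordinates a single threshold on the radius separates the rings."""
    radius, _ = to_polar(*point)
    return 1 if radius > threshold else -1

print([classify_by_radius(p) for p in inner])
print([classify_by_radius(p) for p in outer])
```

No straight line separates the rings in the original (x, y) plane, but in (radius, angle) space the boundary radius = 2.5 is a straight line.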
20
Non-linear SVMs: Feature spaces
$\Phi: x \to \varphi(x)$
21
(cont'd)

Kernel functions and the kernel trick are used to
transform data into a different linearly
separable feature space
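
To illustrate the kernel trick, the sketch below checks that the degree-2 polynomial kernel (1 + x^T z)^2 on 2-D inputs equals an ordinary inner product in a 6-dimensional feature space. The explicit map phi is one standard choice for this kernel and is written out here as an assumption; the test points are arbitrary.

```python
import math

def poly_kernel(x, z):
    """K(x, z) = (1 + x^T z)^2, computed in the original 2-D space."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit 6-D feature map whose inner product equals poly_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

x, z = [1.0, 2.0], [3.0, -1.0]  # arbitrary illustrative points
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))

print(poly_kernel(x, z), explicit)
```

The kernel needs one 2-D inner product, while the explicit route builds two 6-D vectors first; the gap widens quickly with the degree and the input dimension.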

22
Soft Margin Classification

If the training set is not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
Allow some errors
Let some points be moved to where they belong, at
a cost
Still, try to minimize training set errors, and
to place hyperplane far from each class (large
margin)

23
Soft Margin Classification Mathematically

The old formulation:
Find $w$ and $b$ such that $\Phi(w) = \tfrac{1}{2} w^T w$ is minimized, and for all $(x_i, y_i)$: $y_i (w^T x_i + b) \ge 1$
The new formulation incorporating slack variables:
Find $w$ and $b$ such that $\Phi(w) = \tfrac{1}{2} w^T w + C \sum_i \xi_i$ is minimized, and for all $(x_i, y_i)$: $y_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
Parameter $C$ can be viewed as a way to control overfitting: a regularization term
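
For a fixed hyperplane, the smallest slack that satisfies y (w^T x + b) >= 1 - xi with xi >= 0 is max(0, 1 - y (w^T x + b)). The sketch below evaluates the soft-margin objective (1/2) w^T w + C * sum(xi) on hypothetical data with one noisy point, for two assumed values of C. It evaluates the objective only; it does not solve the optimization.

```python
def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y * (w^T x + b) >= 1 - xi."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, data, C):
    """(1/2) w^T w + C * (sum of slacks) for a fixed (w, b)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(slack(w, b, x, y) for x, y in data)
    return regularizer + C * total_slack

# Hypothetical data: the last point is a noisy example on the wrong side.
data = [([2.0, 2.0], +1), ([-2.0, -2.0], -1), ([-0.5, -0.5], +1)]
w, b = [0.5, 0.5], 0.0

print([slack(w, b, x, y) for x, y in data])
print(soft_margin_objective(w, b, data, C=1.0))
print(soft_margin_objective(w, b, data, C=10.0))
```

A larger C makes the same violation more expensive, pushing the optimizer toward fitting the noisy point; a smaller C favors a wider margin.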
24
Soft Margin Classification Solution

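The dual problem for soft margin classification:
Find $\alpha_1 \dots \alpha_N$ such that $Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized, and (1) $\sum_i \alpha_i y_i = 0$ (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
Neither slack variables $\xi_i$ nor their Lagrange multipliers appear in the dual problem!
Again, $x_i$ with non-zero $\alpha_i$ will be support vectors.
Solution to the dual problem is:
$w = \sum_i \alpha_i y_i x_i$
$b = y_k (1 - \xi_k) - w^T x_k$ where $k = \arg\max_k \alpha_k$
But $w$ is not needed explicitly for classification!
$f(x) = \sum_i \alpha_i y_i x_i^T x + b$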
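
The sketch below recovers w and b from a dual solution using w = sum(alpha_i y_i x_i) and b = y_k (1 - xi_k) - w^T x_k, then scores a new point using inner products only. The two-point training set is a toy problem whose dual optimum, alpha_1 = alpha_2 = 1/4, can be worked out by hand (Q reduces to 2a - 4a^2); it is an illustrative assumption, not an example from the slides, and both slacks are zero because the data are separable.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Toy problem with a hand-derived dual solution (illustrative assumption).
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [+1, -1]
alpha = [0.25, 0.25]

# Constraint (1): sum_i alpha_i * y_i = 0
assert abs(sum(a * yi for a, yi in zip(alpha, y))) < 1e-12

# w = sum_i alpha_i * y_i * x_i
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2)]

# b = y_k * (1 - xi_k) - w^T x_k with k = argmax alpha_k; here xi_k = 0
k = max(range(len(alpha)), key=lambda i: alpha[i])
b = y[k] - dot(w, X[k])

def f(x):
    """Decision value computed from inner products with the training points."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

print(w, b, f([2.0, 0.5]))
```

Because f(x) touches the data only through dot(xi, x), replacing dot with a kernel function is all that is needed to obtain a non-linear classifier.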
25
Classification with SVMs

Given a new point $(x_1, x_2)$, we can score its projection onto the hyperplane normal
In 2 dims: score $= w_1 x_1 + w_2 x_2 + b$.
I.e., compute score: $w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$
Set confidence threshold $t$.

Score $> t$: yes
Score $< -t$: no
Else: don't know
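
The three-way decision rule (yes if the score exceeds t, no if it is below -t, otherwise don't know) can be sketched as below. The weights, bias, threshold, and test points are hypothetical values.

```python
def score(w, b, x):
    """Score of x: w^T x + b (in 2-D, w1*x1 + w2*x2 + b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def decide(s, t):
    """Three-way decision with confidence threshold t >= 0."""
    if s > t:
        return "yes"
    if s < -t:
        return "no"
    return "don't know"

w, b, t = [1.0, 1.0], -5.0, 1.0  # hypothetical values
for x in ([4.0, 3.0], [1.0, 1.0], [3.0, 2.5]):
    print(x, decide(score(w, b, x), t))
```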
26
Kernels

Why use kernels?
Make non-separable problem separable.
Map data into better representational space
Common kernels
Linear
Polynomial: $K(x, z) = (1 + x^T z)^d$
Radial basis function (infinite-dimensional space)
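
The three kernels listed above can be written as small functions. The polynomial form follows the slide; the radial basis function is given in its common Gaussian form exp(-||x - z||^2 / (2 sigma^2)), which is an assumption since the slide does not state a formula, and the width sigma is a free parameter.

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x^T z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, d=2):
    """K(x, z) = (1 + x^T z)^d"""
    return (1.0 + linear_kernel(x, z)) ** d

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); sigma is an assumed width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, z = [1.0, 0.0], [0.0, 1.0]  # arbitrary orthogonal points
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```

Any of these can stand in for the inner product $x_i^T x$ in the decision function $f(x) = \sum_i \alpha_i y_i x_i^T x + b$.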

27
A problem with SVMs

Training an SVM on large data sets (5000 samples) is a very difficult problem to approach without some kind of data or problem decomposition (Osuna, 1997)

28
Features for text

Good feature engineering can often markedly
improve the performance of a text classifier
Use terms as features
Document zones
Upweighting document zones
Separate feature spaces for document zones
Connections to text summarization
Relevance signal
Cosine score
Title match
Query term proximity is often very indicative of a document being on topic, especially with longer documents and on the web

29
Result ranking by machine learning

Classification problem vs. regression problem
Classification problem: a categorical variable is predicted
Regression problem: a real number is predicted
Ordinal regression
A ranking is predicted
The goal is to rank a set of documents for a query
Ranking SVM

30
Ranking SVM

Construct a vector of features
for each document/query pair
For two documents, form the vector of feature
differences
Other ranking methods
RankNet: uses a neural network for ranking
FRank: differs in its cost function
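
The feature-difference step of a ranking SVM can be sketched as a pairwise transform: for each pair of documents retrieved for the same query, subtract their feature vectors and label the difference by which document should rank higher. The feature layout (cosine score, title match) and the relevance grades are hypothetical; a linear SVM would then be trained on the resulting pairs.

```python
def pairwise_differences(docs):
    """docs: list of (feature_vector, relevance) for ONE query.

    Returns (difference_vector, label) pairs, where label is +1 if the
    first document should rank above the second and -1 otherwise.
    """
    pairs = []
    for i, (fi, ri) in enumerate(docs):
        for fj, rj in docs[i + 1:]:
            if ri == rj:
                continue  # ties carry no ordering information
            diff = [a - b for a, b in zip(fi, fj)]
            pairs.append((diff, 1 if ri > rj else -1))
    return pairs

# Hypothetical features per document/query pair: [cosine score, title match]
docs = [([0.75, 1.0], 2), ([0.25, 0.0], 1), ([0.5, 1.0], 1)]
for diff, label in pairwise_differences(docs):
    print(diff, label)
```

This turns ranking into binary classification on difference vectors, which is why the same SVM machinery applies.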
