Rough overview of Support Vector Machines

1
Rough overview of Support Vector Machines
  • Reference
  • Support vector machines and machine learning on
    documents. Christopher D. Manning, Prabhakar
    Raghavan and Hinrich Schütze. Introduction to
    Information Retrieval, 2008.
  • Support Vector Machines: Training and
    Applications. E. Osuna, et al. MIT A.I. Lab,
    1997.
  • An Improved Training Algorithm for Support Vector
    Machines. E. Osuna, et al. IEEE NNSP'97.
  • A Tutorial on Support Vector Machines for Pattern
    Recognition. C.J.C. Burges. Data Mining and
    Knowledge Discovery, 1998.
  • A Probabilistic Analysis of the Rocchio Algorithm
    with TFIDF for Text Categorization. T. Joachims.
    NIPS, 1997.
  • Text Categorization with Support Vector Machines:
    Learning with Many Relevant Features. T. Joachims.
    1997.
  • http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
  • http://www-csli.stanford.edu/~hinrich/newslides.html
  • http://en.wikipedia.org/wiki/Quadratic_programming
  • http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/SVM3.pdf

Presenter: Suhan Yu
2
The main idea of SVM
  • An SVM is a kind of large-margin classifier
  • It finds a decision boundary between two classes
  • The subject was started in the late seventies by
    Vapnik (1979)

Vladimir Naumovich Vapnik
Russian
Master's degree: Mathematics
Ph.D.: Statistics
3
Applications of SVM
  • Isolated handwritten digit recognition
  • Object recognition
  • Speaker identification
  • Face detection
  • Text categorization (Joachims, 1997)

4
Text classification
  • Earlier: TFIDF classifier, k-NN

5
Text classification
  • Earlier: Naïve Bayes classifier, Rocchio
  • Today: SVM

6
Why should SVMs work well for text categorization?
  • High-dimensional input space: learning text
    classifiers has to deal with more than 10,000
    features
  • Few irrelevant features
  • The features are strongly related to one another
  • Document vectors are sparse

7
The main idea of SVM
[Figure: the margin around the separating hyperplane]
8
Support Vector Machine (SVM)
  • SVMs maximize the margin around the separating
    hyperplane.
  • A.k.a. large margin classifiers
  • The decision function is fully specified by a
    subset of training samples, the support vectors.
  • Training is a quadratic programming problem

9
Maximum Margin Formalization
  • $w$: decision hyperplane normal vector
  • $x_i$: data point $i$
  • $y_i$: class of data point $i$ ($+1$ or $-1$). NB: not
    1/0
  • Classifier is $f(x_i) = \mathrm{sign}(w^T x_i + b)$
  • Functional margin of $x_i$ is $y_i (w^T x_i + b)$
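
The classifier and the functional margin can be sketched in a few lines of plain Python. The weight vector `w`, the bias `b`, and the sample points are made-up values for illustration, and mapping a score of exactly 0 to the positive class is an assumption the slide does not specify.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def classify(w, b, x):
    """f(x) = sign(w^T x + b); a score of exactly 0 is mapped to +1 (assumption)."""
    return 1 if dot(w, x) + b >= 0 else -1


def functional_margin(w, b, x, y):
    """y * (w^T x + b): positive when x lies strictly on the side matching y."""
    return y * (dot(w, x) + b)


# Hypothetical hyperplane and labelled points (not from the slides).
w, b = [0.4, 0.8], -2.2
points = [([2.0, 3.0], 1), ([1.0, 1.0], -1), ([2.0, 0.0], -1)]

for x, y in points:
    print(x, classify(w, b, x), round(functional_margin(w, b, x, y), 3))
```

A larger functional margin means the point sits further on the correct side, but the value scales with `w`, which is why the later slides normalize by the length of `w`.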

10
[Figure: the planar decision surface in data space for the
simple linear discriminant function]
11
Linear Support Vector Machine (SVM)
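  • Hyperplane
  • $w^T x + b = 0$
  • Extra scale constraint
  • $\min_{i=1,\dots,n} |w^T x_i + b| = 1$
  • This implies
  • $w^T (x_a - x_b) = 2$
  • $\rho = \|x_a - x_b\|_2 = 2 / \|w\|_2$

[Figure: the hyperplane $w^T x + b = 0$ with the margin hyperplanes $w^T x_a + b = 1$ and $w^T x_b + b = -1$ a distance $\rho$ apart]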
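
The step from the scale constraint to the margin width can be written out as follows. This is a sketch that assumes $x_a$ and $x_b$ are the closest points on either side of the hyperplane and that their difference is parallel to $w$, as drawn in the figure.

```latex
\begin{aligned}
w^T x_a + b &= 1, \qquad w^T x_b + b = -1 \\
\Rightarrow\quad w^T (x_a - x_b) &= 2 \\
x_a - x_b &= \rho \, \frac{w}{\|w\|_2}
  \quad \text{(parallel to } w \text{, with length } \rho\text{)} \\
\Rightarrow\quad \rho \, \frac{w^T w}{\|w\|_2} &= \rho \, \|w\|_2 = 2
  \quad\Rightarrow\quad \rho = \frac{2}{\|w\|_2}
\end{aligned}
```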
12
Linear SVM Mathematically
  • Assume that all data is at least distance 1 from
    the hyperplane; then the following two
    constraints follow for a training set $\{(x_i, y_i)\}$:
    $w^T x_i + b \ge 1$ if $y_i = 1$
    $w^T x_i + b \le -1$ if $y_i = -1$
  • For support vectors, the inequality becomes an
    equality
  • Then, since each example's distance from the
    hyperplane is $r = y \, (w^T x + b) / \|w\|$
  • The margin is $\rho = 2 / \|w\|$
13
Geometric Margin
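  • Distance from example to the separator is
    $r = y \, (w^T x + b) / \|w\|$
  • Examples closest to the hyperplane are support
    vectors.
  • Margin $\rho$ of the separator is the width of
    separation between support vectors of classes.

[Figure: an example x at distance r from its projection x' on the separator]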
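
The geometric distance and the margin width can be checked numerically. The sketch below uses a hypothetical separator and three hypothetical points, scaled so that the closest points have a functional margin of 1.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def distance_to_separator(w, b, x, y):
    """Signed geometric distance r = y * (w^T x + b) / ||w||."""
    return y * (dot(w, x) + b) / math.sqrt(dot(w, w))


def margin_width(w):
    """Width rho = 2 / ||w||, valid when the closest points satisfy |w^T x + b| = 1."""
    return 2.0 / math.sqrt(dot(w, w))


# Hypothetical separator and points, for illustration only.
w, b = [0.4, 0.8], -2.2
data = [([2.0, 3.0], 1), ([1.0, 1.0], -1), ([2.0, 0.0], -1)]

distances = [distance_to_separator(w, b, x, y) for x, y in data]
closest = min(distances)

print(distances)
print(closest, margin_width(w))  # twice the closest distance equals the margin width
```

Unlike the functional margin, these distances do not change if `w` and `b` are both multiplied by the same positive constant.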
14
Linear SVM Mathematically
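  • To summarize: find $w$ and $b$ such that
    $\frac{1}{2} w^T w$ is minimized and, for all
    $(x_i, y_i)$, $y_i (w^T x_i + b) \ge 1$
  • Quadratic function
  • A quadratic function $f$ is a function of the form
    $f(x) = \frac{1}{2} x^T Q x + c^T x$

A necessary condition for a point $x$ to be a global minimizer is for it to
satisfy the Karush-Kuhn-Tucker (KKT)
conditions. The KKT conditions are also
sufficient when $f(x)$ is convex.
[Figure: a convex function]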
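
The hard-margin SVM fits the generic quadratic-programming form mentioned on this slide. The stacked variable $z$, the block matrix $Q$, and the constraint matrix $A$ are notation introduced here for illustration and do not come from the slides.

```latex
\begin{aligned}
z &= \begin{pmatrix} w \\ b \end{pmatrix}, \qquad
Q = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}, \qquad
c = 0 \\
f(z) &= \tfrac{1}{2} z^T Q z + c^T z = \tfrac{1}{2} w^T w \\
\text{subject to}\quad & A z \ge \mathbf{1},
\qquad \text{row } i \text{ of } A \text{ is } \; y_i \begin{pmatrix} x_i^T & 1 \end{pmatrix}
\end{aligned}
```

Because $Q$ is positive semidefinite, $f$ is convex, so a point that satisfies the KKT conditions is a global minimizer.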
15
Linear SVM Mathematically
  • Lagrange Multiplier
  • Differentiating
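
The two steps named on this slide are not spelled out in the transcript. A standard version of the derivation for the hard-margin problem, written with multipliers $\alpha_i \ge 0$ (a notation assumed here so that it matches the dual problem later in the deck), looks like this:

```latex
\begin{aligned}
L(w, b, \alpha) &= \tfrac{1}{2} w^T w
  - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] \\
\frac{\partial L}{\partial w} = 0
  &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i \\
\frac{\partial L}{\partial b} = 0
  &\;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
\text{substituting back:}\quad
Q(\alpha) &= \sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j
\end{aligned}
```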

16
An example of SVM
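
A small worked example, as a sketch: it uses a toy dataset of three hypothetical 2-D points and handles only the special case where the closest pair of opposite-class points are the sole support vectors. The final check confirms that this assumption holds for the data used; it is not a general SVM solver.

```python
import math
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_from_closest_pair(positives, negatives):
    """Place the separator midway between the closest opposite-class pair.

    This is the max-margin hyperplane if every other point still satisfies
    y * (w^T x + b) >= 1, which the caller must verify.
    """
    xa, xb = min(product(positives, negatives),
                 key=lambda pair: math.dist(pair[0], pair[1]))
    diff = [p - q for p, q in zip(xa, xb)]
    scale = 2.0 / dot(diff, diff)          # makes w^T (xa - xb) = 2
    w = [scale * d for d in diff]
    b = -dot(w, [p + q for p, q in zip(xa, xb)]) / 2.0
    return w, b


# Hypothetical toy data: one positive and two negative points.
positives = [(2.0, 3.0)]
negatives = [(1.0, 1.0), (2.0, 0.0)]

w, b = fit_from_closest_pair(positives, negatives)
labelled = [(x, 1) for x in positives] + [(x, -1) for x in negatives]
feasible = all(y * (dot(w, x) + b) >= 1 - 1e-9 for x, y in labelled)

print(w, b, feasible)
print("margin width:", 2.0 / math.sqrt(dot(w, w)))
```

When `feasible` is true, no other hyperplane can have a wider margin, because any feasible `w` must satisfy $w^T (x_a - x_b) \ge 2$ for the closest pair.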
17
Non-linear SVMs
  • Datasets that are linearly separable (with some
    noise) work out great
  • But what are we going to do if the dataset is
    just too hard?
  • How about mapping data to a higher-dimensional
    space?

[Figure: data on the x axis mapped to the (x, x²) plane]
18
Nonlinear SVMs
  • Project the linearly inseparable data to high
    dimensional space where it is linearly separable
    and then we can use linear SVM
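
As a minimal sketch of this idea, the snippet below maps one-dimensional points to $(x, x^2)$ and applies a linear rule in the new space. The data values and the rule's coefficients are invented for illustration.

```python
def feature_map(x):
    """Map a 1-D point to the 2-D feature space (x, x^2)."""
    return (x, x * x)


# Hypothetical 1-D data: the positive class sits between the negatives,
# so no single threshold on x separates the two classes.
data = [(-3.0, -1), (-2.0, -1), (-0.5, 1), (0.0, 1), (1.0, 1), (2.5, -1)]

# In feature space the linear rule w^T phi(x) + b with w = (0, -1), b = 2.5
# (that is, "x^2 < 2.5") separates them.
w, b = (0.0, -1.0), 2.5


def predict(x):
    phi = feature_map(x)
    score = w[0] * phi[0] + w[1] * phi[1] + b
    return 1 if score >= 0 else -1


print([predict(x) == y for x, y in data])
```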

19
[Figure: left, data that is not linearly separable; right, the same data made linearly separable in polar coordinates. Axes: angular degree (phase) against distance from center (radius).]
Need to transform the coordinates: polar
coordinates, or a kernel transformation into a higher
dimensional space (support vector machines).
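
The change to polar coordinates can be sketched like this. The ring-shaped toy data and the radius threshold of 3 are assumptions made for the example.

```python
import math


def to_polar(x, y):
    """Return (radius, angle in degrees) for a Cartesian point."""
    return math.hypot(x, y), math.degrees(math.atan2(y, x))


# Hypothetical data: class +1 lies near the centre, class -1 on an outer ring,
# so no straight line in (x, y) separates them.
inner = [(1.0 * math.cos(t), 1.0 * math.sin(t)) for t in (0.0, 2.0, 4.0)]
outer = [(5.0 * math.cos(t), 5.0 * math.sin(t)) for t in (1.0, 3.0, 5.0)]

# In polar coordinates a single threshold on the radius is a linear separator.
RADIUS_THRESHOLD = 3.0


def predict(x, y):
    radius, _angle = to_polar(x, y)
    return 1 if radius < RADIUS_THRESHOLD else -1


print([predict(*p) for p in inner], [predict(*p) for p in outer])
```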
20
Non-linear SVMs: Feature spaces
$\Phi: x \to \varphi(x)$
21
(cont'd)
  • Kernel functions and the kernel trick are used to
    transform data into a different linearly
    separable feature space

22
Soft Margin Classification
  • If the training set is not linearly separable,
    slack variables $\xi_i$ can be added to allow
    misclassification of difficult or noisy examples.
  • Allow some errors
  • Let some points be moved to where they belong, at
    a cost
  • Still, try to minimize training set errors, and
    to place hyperplane far from each class (large
    margin)

[Figure: slack variables $\xi_i$ and $\xi_j$ for two points on the wrong side of the margin]
23
Soft Margin Classification Mathematically
  • The old formulation:
    Find $w$ and $b$ such that $\Phi(w) = \frac{1}{2} w^T w$ is minimized
    and for all $(x_i, y_i)$: $y_i (w^T x_i + b) \ge 1$
  • The new formulation incorporating slack
    variables:
    Find $w$ and $b$ such that $\Phi(w) = \frac{1}{2} w^T w + C \sum_i \xi_i$ is
    minimized and for all $(x_i, y_i)$: $y_i (w^T x_i + b) \ge 1 - \xi_i$
    and $\xi_i \ge 0$ for all $i$
  • Parameter $C$ can be viewed as a way to control
    overfitting: a regularization term
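
For a fixed $w$ and $b$, the slack each point needs and the resulting objective value can be computed directly. This sketch does not optimize anything; the separator, the data, and the values of `C` are made up to show how a larger `C` makes the same violation more expensive.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, data):
    """Smallest xi_i >= 0 satisfying y_i (w^T x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]


def soft_margin_objective(w, b, data, C):
    """(1/2) w^T w + C * sum_i xi_i for a fixed (w, b)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, data))


# Hypothetical separator and data; the last point is a noisy positive example
# that sits on the negative side.
w, b = [0.4, 0.8], -2.2
data = [([2.0, 3.0], 1), ([1.0, 1.0], -1), ([2.0, 0.0], -1), ([1.0, 1.5], 1)]

print(slacks(w, b, data))
for C in (0.1, 1.0, 10.0):
    print(C, soft_margin_objective(w, b, data, C))
```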
24
Soft Margin Classification Solution
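  • The dual problem for soft margin classification:
    Find $\alpha_1 \dots \alpha_N$ such that
    $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$
    is maximized and
    (1) $\sum_i \alpha_i y_i = 0$
    (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
  • Neither slack variables $\xi_i$ nor their Lagrange
    multipliers appear in the dual problem!
  • Again, $x_i$ with non-zero $\alpha_i$ will be support
    vectors.
  • Solution to the dual problem is:
    $w = \sum_i \alpha_i y_i x_i$
    $b = y_k (1 - \xi_k) - w^T x_k$ where $k = \arg\max_k \alpha_k$
  • But $w$ is not needed explicitly for classification!
    $f(x) = \sum_i \alpha_i y_i x_i^T x + b$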
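
The dual objective and the decision function use the training points only through inner products, which the following sketch makes explicit. The multipliers are hand-picked for a hypothetical three-point training set rather than produced by a solver, and all slack values are taken to be zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, xs, ys):
    """Q(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(alpha)
    pairwise = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
                   for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * pairwise


def decision_function(alpha, xs, ys, b, x):
    """f(x) = sum_i alpha_i y_i x_i^T x + b, using only inner products."""
    return sum(a * y * dot(xi, x) for a, y, xi in zip(alpha, ys, xs)) + b


# Hypothetical training set with hand-picked multipliers (not from a solver).
xs = [(2.0, 3.0), (1.0, 1.0), (2.0, 0.0)]
ys = [1, -1, -1]
alpha = [0.4, 0.4, 0.0]      # the point with alpha = 0 is not a support vector
b = -2.2

assert abs(sum(a * y for a, y in zip(alpha, ys))) < 1e-12   # constraint (1)
w = [sum(a * y * xi[d] for a, y, xi in zip(alpha, ys, xs)) for d in range(2)]

print(w, dual_objective(alpha, xs, ys))
print([decision_function(alpha, xs, ys, b, x) for x in xs])
```

Replacing `dot(xi, x)` with a kernel function is the only change needed to move from a linear to a non-linear decision function.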
25
Classification with SVMs
  • Given a new point $(x_1, x_2)$, we can score its
    projection onto the hyperplane normal
  • In 2 dims: score $= w_1 x_1 + w_2 x_2 + b$.
  • I.e., compute score: $w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$
  • Set confidence threshold $t$.

Score > t: yes
Score < -t: no
Else: don't know
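
The three-way decision rule can be sketched as below. The separator, the threshold `t`, and the query points are hypothetical values chosen so that each outcome appears once.

```python
def decide(score, t):
    """Three-way decision: 'yes' above t, 'no' below -t, otherwise 'don't know'."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"


def score_2d(w, b, x):
    """score = w1*x1 + w2*x2 + b for a 2-D point."""
    return w[0] * x[0] + w[1] * x[1] + b


# Hypothetical separator, threshold, and query points.
w, b, t = (0.4, 0.8), -2.2, 0.5

for x in [(3.0, 3.0), (1.0, 1.0), (2.0, 2.0)]:
    print(x, decide(score_2d(w, b, x), t))
```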
26
Kernels
  • Why use kernels?
  • Make non-separable problem separable.
  • Map data into better representational space
  • Common kernels
  • Linear
  • Polynomial: $K(x, z) = (1 + x^T z)^d$
  • Radial basis function (infinite dimensional
    space)
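
The three kernels can be written as short functions. The polynomial form follows the slide; the RBF form $\exp(-\gamma \|x - z\|^2)$ and the default parameter values are assumptions, since the slide gives no formula for it. The explicit feature map `phi_quadratic` is added only to show that the degree-2 polynomial kernel equals an inner product in a higher-dimensional space.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, d=2):
    """K(x, z) = (1 + x^T z)^d."""
    return (1.0 + dot(x, z)) ** d


def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2); this gamma parameterization is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def phi_quadratic(x):
    """Explicit 2-D feature map whose inner product equals the d = 2 polynomial kernel."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


x, z = (1.0, 2.0), (3.0, -1.0)
print(polynomial_kernel(x, z, d=2), dot(phi_quadratic(x), phi_quadratic(z)))
print(linear_kernel(x, z), rbf_kernel(x, z))
```

The kernel evaluates the six-dimensional inner product without ever building the six-dimensional vectors, which is the point of the kernel trick.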

27
A problem with SVMs
  • Training an SVM using large data sets (5000
    samples) is a very difficult problem to approach
    without some kind of data or problem
    decomposition (Osuna, 1997)

28
Features for text
  • Good feature engineering can often markedly
    improve the performance of a text classifier
  • Use terms as features
  • Document zones
  • Upweighting document zones
  • Separate feature spaces for document zones
  • Connections to text summarization
  • Relevance signal
  • Cosine score
  • Title match
  • Query term proximity is often very indicative of
    a document being in topic, especially with longer
    documents and on the web
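
One simple way to upweight a zone is to count its terms more than once when building term-frequency features. The sketch below assumes a document split into `title` and `body` zones and a title weight of 2; the zone names and the weight are illustrative choices, not values from the slides.

```python
from collections import Counter


def zone_weighted_counts(zones, weights):
    """Sum term counts over zones, multiplying each zone's counts by its weight."""
    features = Counter()
    for zone, text in zones.items():
        weight = weights.get(zone, 1)
        for term in text.lower().split():
            features[term] += weight
    return features


# Hypothetical document and zone weights.
doc = {"title": "support vector machines",
       "body": "support vector machines maximize the margin"}
weights = {"title": 2, "body": 1}

print(zone_weighted_counts(doc, weights))
```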

29
Result ranking by machine learning
  • Classification problem vs. regression problem
  • Classification problem: a categorical variable is
    predicted
  • Regression problem: a real number is predicted
  • Ordinal regression: a ranking is predicted
  • The goal is to rank a set of documents for a
    query
  • Ranking SVM

30
Ranking SVM
  • Construct a vector of features
    for each document/query pair
  • For two documents, form the vector of feature
    differences
  • Other ranking methods
  • RankNet: uses a neural network for ranking
  • FRank: differs from RankNet in its cost function
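
The feature-difference construction for a ranking SVM can be sketched as follows. The feature values and relevance grades are invented, and labelling a pair +1 when the first document is more relevant follows the usual pairwise setup rather than anything stated on the slide.

```python
from itertools import combinations


def pairwise_differences(docs):
    """Turn (features, relevance) items for one query into (difference, label) pairs.

    The label is +1 if the first document should rank above the second, -1 if below;
    pairs with equal relevance carry no preference and are skipped.
    """
    pairs = []
    for (fi, ri), (fj, rj) in combinations(docs, 2):
        if ri == rj:
            continue
        diff = [a - b for a, b in zip(fi, fj)]
        pairs.append((diff, 1 if ri > rj else -1))
    return pairs


# Hypothetical features per document/query pair: (cosine score, title match).
docs_for_query = [([0.9, 1.0], 2), ([0.4, 0.0], 1), ([0.5, 1.0], 1)]

for diff, label in pairwise_differences(docs_for_query):
    print(diff, label)
```

A linear classifier trained on these difference vectors can then order any two documents for a query by the sign of its score on their difference.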