Title: Linear discriminant or classification functions
1. Linear discriminant (or classification) functions
Hwanjo Yu, POSTECH (Pohang University of Science and Technology), http://hwanjoyu.org
2. Linear discriminant function
- A linear function classifies an object X = <x1, x2, ..., xn> based on F(X) = b + w1·x1 + w2·x2 + ... + wn·xn = b + W·X (sketched in code below)
- b is the bias, wi is the weight of feature xi
- X is positive if F(X) > 0
- X is negative if F(X) < 0
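As a concrete illustration (not part of the original slides), a minimal Python sketch of such a linear classifier, with made-up weights and bias:

# A minimal sketch of a linear discriminant function:
# F(X) = b + sum_i w_i * x_i, classifying X as positive if F(X) > 0.

def linear_discriminant(x, w, b):
    """Return F(x) = b + w . x for feature vector x and weights w."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

def classify(x, w, b):
    """Label x as +1 if F(x) > 0, otherwise -1."""
    return 1 if linear_discriminant(x, w, b) > 0 else -1

# Example with illustrative (made-up) weights and bias:
w = [0.5, -1.0]
b = 0.2
print(classify([2.0, 0.5], w, b))   # F = 0.2 + 1.0 - 0.5 = 0.7 > 0  -> +1
print(classify([0.0, 1.0], w, b))   # F = 0.2 + 0.0 - 1.0 = -0.8 < 0 -> -1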
3. Linear vs. Quadratic vs. Polynomial
- Linear: a sum of weighted features
- Quadratic: also considers products of pairs of features
- Polynomial: considers products of multiple features (the degree p denotes how many features are multiplied together)
4. Linear vs. Quadratic vs. Polynomial
- Linear and quadratic functions are special cases of polynomial functions
- Linear → polynomial with degree p = 1
- Quadratic → polynomial with degree p = 2
5. Perceptron and Winnow
- Perceptron and Winnow: algorithms to learn a linear function
- Steps (a code sketch follows this list):
  1. Initialize F by setting w and b randomly
  2. For each training point xi, if F misclassifies xi, adjust w and b so that the new F classifies xi correctly
  3. Repeat Step 2 until F correctly classifies all the training points
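A minimal Python sketch of the loop described above. The specific update rule (w ← w + y·x, b ← b + y) and the deterministic initialization are assumptions, since the slides only say "adjust w and b":

# Minimal perceptron sketch with the standard additive update rule.
def perceptron(points, labels, max_epochs=100):
    """Learn w, b so that sign(b + w.x) matches the labels (+1/-1)."""
    n = len(points[0])
    w = [0.0] * n          # simple deterministic start instead of random
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(points, labels):
            f = b + sum(wi * xi for wi, xi in zip(w, x))
            if y * f <= 0:                       # misclassified
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b = b + y
                errors += 1
        if errors == 0:                          # all points classified correctly
            break
    return w, b

# Tiny linearly separable example:
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, b = perceptron(X, y)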
6. Perceptron and Winnow
- Nondeterministic
  - Several parameters to tune, such as the number of iterations and how much to adjust w
  - For the same training set, they could generate a different F depending on the data ordering
- Limited to linear functions: the number of weights w for a polynomial function is too large to train in this nondeterministic way
  - Inherently require O(n^p) time to learn a polynomial function of degree p
  - SVM can learn a polynomial function in O(n)
7. Support Vector Machine (SVM)
Hwanjo Yu, POSTECH (Pohang University of Science and Technology), http://hwanjoyu.org
8. Linear vs. Quadratic vs. Polynomial
- Linear: a sum of weighted features
- Quadratic: also considers products of pairs of features
- Polynomial: considers products of multiple features (the degree p denotes how many features are multiplied together)
9. Linear vs. Quadratic vs. Polynomial
10. Polynomial
- SVM learns a polynomial function with an arbitrary degree in linear time
[Figure: generalization behavior vs. polynomial degree — underfitting at low degrees, overfitting at high degrees]
11. Boundary of arbitrary shape
- SVM can learn a classification boundary of
arbitrary shape in a non-lazy fashion
12. Linear Support Vector Machines
- Find a linear hyperplane (decision boundary) that separates the data: F(X) = b + W·X
13. Linear Support Vector Machines
14. Linear Support Vector Machines
- Another possible solution
15. Linear Support Vector Machines
16. Linear Support Vector Machines
- Which one is better? B1 or B2?
- How do you define better?
17. Linear Support Vector Machines
- The distance from a point Xi to the hyperplane is |F(Xi)| / ||W||
- Find the hyperplane that maximizes the margin → B1 is better than B2
18. Linear Support Vector Machines
- Given a labeled dataset D = {(X1, y1), ..., (Xm, ym)}, where Xi is a data vector and yi is the class label (+1 or -1) of Xi
- D is linearly separable
  - if there exists a linear function F(X) = b + W·X that correctly classifies every data vector in D, that is,
  - if there exist b and W that satisfy yi(b + W·Xi) > 0 for all (Xi, yi) ∈ D, where b is a scalar and W is a weight vector
19. Linear Support Vector Machines
- A dataset D = {(X1, y1), ..., (Xm, ym)} is linearly separable if
  - there exist b and W that satisfy yi(b + W·Xi) > 0 for all (Xi, yi) ∈ D
- Question
  - If there exist b and W that satisfy yi(b + W·Xi) > 0 for all (Xi, yi) ∈ D,
  - do there then exist b and W that satisfy yi(b + W·Xi) ≥ 1 for all (Xi, yi) ∈ D?
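The extracted slides only pose the question; one standard way to see that the answer is yes is a rescaling argument, added here for completeness:

\[ \delta \;=\; \min_{i}\; y_i\,(b + W\cdot X_i) \;>\; 0 \quad \text{(minimum over the finite set } D\text{)} \]
\[ W' = W/\delta,\quad b' = b/\delta \quad\Longrightarrow\quad y_i\,(b' + W'\cdot X_i) \;\ge\; 1 \;\text{ for all } (X_i, y_i) \in D. \]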
20. Linear Support Vector Machines
21. Linear Support Vector Machines
- We want to maximize the margin 2 / ||W||
- Instead, we minimize ||W||² / 2
- But subject to the following constraints: yi(b + W·Xi) ≥ 1 for all (Xi, yi) ∈ D
- This is a constrained optimization problem
- Numerical approaches to solve it (e.g., quadratic programming); a small sketch follows this list
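As an illustration (assumed, not from the slides), the hard-margin primal can be handed to a generic constrained optimizer; in practice a dedicated QP solver is used instead. The data and variable layout below are made up:

# Hard-margin primal: minimize ||W||^2 / 2 subject to yi*(b + W.Xi) >= 1.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:-1]                       # v = [w1, ..., wn, b]
    return 0.5 * np.dot(w, w)        # minimize ||W||^2 / 2

constraints = [
    # yi * (b + w . xi) - 1 >= 0 for every training point
    {"type": "ineq", "fun": (lambda v, xi=xi, yi=yi: yi * (v[-1] + np.dot(v[:-1], xi)) - 1.0)}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b)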
22. Linear Support Vector Machines
- What if the problem is not linearly separable?
23. Linear Support Vector Machines
- What if the problem is not linearly separable?
- Introduce slack variables ξi
- Need to minimize ||W||² / 2 + C Σi ξi
- Subject to yi(b + W·Xi) ≥ 1 − ξi and ξi ≥ 0 for all i
24. Linear Support Vector Machines
- Primal form
  - Minimize ||W||² / 2 + C Σi ξi
  - Subject to yi(b + W·Xi) ≥ 1 − ξi, ξi ≥ 0 for all i
- To solve this, transform the primal to the dual (see the next slide)
25. Linear Support Vector Machines
- Dual form
  - Maximize Σ_{i=1}^{m} αi − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} αi αj yi yj (Xi·Xj)
  - Subject to Σ_{i=1}^{m} αi yi = 0 and 0 ≤ αi ≤ C
- W can be recovered by W = Σ_{i=1}^{m} αi yi Xi
26. Characteristics of the Solution
- Many of the αi are zero
  - w is a linear combination of a small number of data points
  - xi with non-zero αi are called support vectors (SV)
  - The decision boundary is determined only by the SVs
- For testing with a new data point z
  - Compute f(z) = Σ_{i ∈ SV} αi yi (xi·z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise (a small sketch follows this list)
  - Note that w need not be formed explicitly
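A minimal sketch (with made-up support vectors, labels, coefficients, and bias) of classifying a new point directly from the support vectors, without forming w explicitly:

import numpy as np

# Assumed example support vectors, labels, and dual coefficients:
sv       = np.array([[2.0, 2.0], [-1.0, -1.0]])   # support vectors xi
sv_y     = np.array([1.0, -1.0])                   # their labels yi
sv_alpha = np.array([0.5, 0.5])                    # their non-zero alphas
b = 0.0                                            # bias (assumed)

def decision(z):
    """f(z) = sum_i alpha_i * y_i * (x_i . z) + b"""
    return np.sum(sv_alpha * sv_y * (sv @ z)) + b

z = np.array([1.0, 3.0])
print("class 1" if decision(z) > 0 else "class 2")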
27. SVM model
- SVM model
  - Bias b, plus a list of support vectors and their coefficients α
- Where does α come from, and how does C affect the model?
- Why don't we compute w explicitly?
28. Nonlinear Support Vector Machines
- What if the decision boundary is not linear?
29. Nonlinear Support Vector Machines
- Transform data into a higher-dimensional space
30. Nonlinear Support Vector Machines
- A naive way
  - Transform data into a higher-dimensional space
  - Compute a linear boundary function in the new feature space
  - The boundary function becomes nonlinear in the original feature space → very time-consuming, though
- SVM: the kernel trick
  - Does all of this without explicitly transforming data into the higher-dimensional space
31. Nonlinear Support Vector Machines
- With a mapping φ(.) to the new feature space, the dual is the same as before with Xi·Xj replaced by φ(Xi)·φ(Xj): maximize Σ_{i=1}^{m} αi − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} αi αj yi yj φ(Xi)·φ(Xj), subject to Σ_{i=1}^{m} αi yi = 0 and 0 ≤ αi ≤ C
32. An Example for φ(.) and K(.,.)
- Suppose φ(.) is given as follows: φ(x) = φ(x1, x2) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2)
- An inner product in the feature space is φ(x)·φ(y) = (1 + x1y1 + x2y2)²
- So, if we define the kernel function as K(x, y) = (1 + x1y1 + x2y2)², there is no need to carry out φ(.) explicitly (a numeric check follows this list)
- This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
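A quick numeric check (illustrative, with made-up points) that the degree-2 polynomial kernel K(x, y) = (1 + x·y)² equals the inner product in the mapped feature space for the mapping φ above:

import math
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     math.sqrt(2) * x1, math.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     math.sqrt(2) * x1 * x2])

def K(x, y):
    return (1.0 + np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 4.0
print(K(x, y))                  # 4.0 -- same value, without computing phi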
33. Kernel Functions
- In practical use of SVM, the user specifies the kernel function; the transformation φ(.) is not explicitly stated
- Another view: a kernel function, being an inner product, is really a similarity measure between the objects
34. Examples of Kernel Functions
- Polynomial kernel with degree d: K(x, y) = (x·y + 1)^d
- Radial basis function (RBF) kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²))
  - Closely related to the kNN model and RBF neural networks
  - The feature space is infinite-dimensional
- Sigmoid with parameters κ and θ: K(x, y) = tanh(κ x·y + θ)
  - It does not satisfy the Mercer condition for all κ and θ
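Small illustrative implementations of the kernels listed above; the parameter names d, sigma, kappa, theta mirror the slide's notation, and the sample vectors are made up:

import numpy as np

def poly_kernel(x, y, d=2):
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x, y) + theta)

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))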
35. Modification Due to Kernel Function
- Change all inner products to kernel functions
- For training,
  - Original: maximize Σ_{i=1}^{m} αi − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} αi αj yi yj (Xi·Xj), subject to Σ_{i=1}^{m} αi yi = 0 and 0 ≤ αi ≤ C
  - With kernel function: maximize Σ_{i=1}^{m} αi − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} αi αj yi yj K(Xi, Xj), subject to the same constraints
36. Modification Due to Kernel Function
- For testing, the new data point z is classified as class 1 if f ≥ 0, and as class 2 if f < 0
  - Original: f(z) = Σ_{i ∈ SV} αi yi (Xi·z) + b
  - With kernel function: f(z) = Σ_{i ∈ SV} αi yi K(Xi, z) + b
37. More on Kernel Functions
- Since the training of SVM only requires the values K(xi, xj), there is no restriction on the form of xi and xj
  - xi can be a sequence or a tree, instead of a feature vector
  - K(xi, xj) is just a similarity measure comparing xi and xj
- For a test object z, the discriminant function is essentially a weighted sum of the similarities between z and a pre-selected set of objects (the support vectors)
38. More on Kernel Functions
- Not every similarity measure can be used as a kernel function, however
  - The kernel function needs to satisfy the Mercer condition, i.e., the function is positive semi-definite
  - This implies that the m-by-m kernel matrix, in which the (i,j)-th entry is K(xi, xj), is always positive semi-definite
  - This also means that the QP is convex and can be solved in polynomial time
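An illustrative check (assumed, not from the slides) that a kernel matrix built from a Mercer kernel is positive semi-definite: all of its eigenvalues are non-negative up to numerical tolerance. The data points are made up:

import numpy as np

X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [2.0, 2.0]])

def poly_kernel_matrix(X, d=2):
    G = X @ X.T                      # Gram matrix of inner products
    return (G + 1.0) ** d            # elementwise: K(xi, xj) = (xi.xj + 1)^d

K = poly_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)      # eigenvalues of the symmetric kernel matrix
print(eigvals)                       # all >= 0 (within floating-point error)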
39. Example
- Suppose we have 5 one-dimensional data points
  - x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 → y1=1, y2=1, y3=−1, y4=−1, y5=1
- We use the polynomial kernel of degree 2
  - K(x, y) = (xy + 1)²
- C is set to 100
- We first find αi (i = 1, ..., 5) by maximizing Σ_{i=1}^{5} αi − (1/2) Σ_{i=1}^{5} Σ_{j=1}^{5} αi αj yi yj (xi xj + 1)², subject to Σ αi yi = 0 and 0 ≤ αi ≤ 100
40. Example
- By using a QP solver, we get
  - α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
  - Note that the constraints are indeed satisfied
  - The support vectors are x2=2, x4=5, x5=6
- The discriminant function is
  - f(z) = 2.5·(1)·(2z + 1)² + 7.333·(−1)·(5z + 1)² + 4.833·(1)·(6z + 1)² + b = 0.6667 z² − 5.333 z + b
- b is recovered by solving f(2)=1, or by f(5)=−1, or by f(6)=1, as x2 and x5 lie on the line f(z)=1 and x4 lies on the line f(z)=−1
- All three give b=9 (a numeric verification follows this list)
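A small script (not from the slides) that re-evaluates the discriminant function above at the five training points with b = 9; it recovers f(2) ≈ 1, f(5) ≈ −1, f(6) ≈ 1 (approximately, since the αi are rounded) and the expected class assignments:

sv       = [2.0, 5.0, 6.0]          # support vectors x2, x4, x5
sv_y     = [1.0, -1.0, 1.0]         # their labels
sv_alpha = [2.5, 7.333, 4.833]      # their dual coefficients
b = 9.0

def f(z):
    return sum(a * y * (x * z + 1.0) ** 2 for a, y, x in zip(sv_alpha, sv_y, sv)) + b

for z in [1, 2, 4, 5, 6]:
    print(z, round(f(z), 3), "class 1" if f(z) > 0 else "class 2")
# x = 1, 2, 6 come out positive (class 1); x = 4, 5 come out negative (class 2)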
41. Example
[Figure: value of the discriminant function over the input axis; x = 1, 2, 6 fall on the class-1 side and x = 4, 5 on the class-2 side]
42. SVM model with RBF kernel
- As σ changes, the boundary shape changes
43. Tuning parameters in SVM
- Soft-margin parameter C
  - SVM becomes soft as C approaches zero
  - SVM becomes hard as C goes up
- Polynomial kernel parameter d in K(x, y) = (x·y + 1)^d
  - As d increases, the boundary becomes more complex
  - Start with d = 1 and increase d by 1
  - The generalization performance will stop improving at some point
- RBF kernel parameter σ in K(x, y) = exp(−||x − y||² / (2σ²))
  - No deterministic way to tune σ
  - Start with a small value like 10⁻⁶, and multiply by 10 at each iteration up to σ = 10⁶
  - Once you find a good range of σ, you can tune it more finely within that range (a grid-search sketch follows this list)
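A hedged sketch of the coarse-to-fine search described above, using scikit-learn's SVC (which wraps LIBSVM). Note that scikit-learn parameterizes the RBF kernel by gamma = 1 / (2σ²) rather than by σ, and the toy data below is made up:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative toy data; replace with a real training set.
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

param_grid = {
    "C": [0.01, 1, 100],                        # softer vs. harder margin
    "gamma": [10.0 ** k for k in range(-6, 7)]  # coarse sweep, 1e-6 .. 1e6
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # then refine the grid around these values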
44. SVM Implementation
- LIBSVM
  - Highly optimized implementation
  - Provides interfaces to other languages such as Python, Java, etc.
  - Supports various types of SVM such as multi-class classification, nu-SVM, one-class SVM, etc.
- SVM-light
  - One of the earliest implementations, thus also widely used
  - Less efficient and accurate than LIBSVM (according to my personal experience)
  - Supports Ranking SVM, which we will discuss later
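A minimal usage sketch: scikit-learn's SVC is built on top of LIBSVM, so training a kernel SVM from Python can look like the following. The settings mirror the worked example above (degree-2 polynomial kernel (x·y + 1)², C = 100), but exact numerical results may differ slightly:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])   # the 1-D example above
y = np.array([1, 1, -1, -1, 1])

# gamma=1.0 and coef0=1.0 make the kernel exactly (x.y + 1)^degree
model = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)
model.fit(X, y)
print(model.support_)          # indices of the support vectors
print(model.predict([[3.0]]))  # classify a new point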