Title: Linear discriminant or classification functions
1. Linear discriminant (or classification) functions
Hwanjo Yu, POSTECH (Pohang University of Science and Technology), http://hwanjoyu.org
2. Linear discriminant function
- A linear function classifies an object X = <x1, x2, ..., xn> based on F(X) = b + w1·x1 + w2·x2 + ... + wn·xn = b + W·X (sketched in code below)
- b is the bias, wi is the weight of feature xi
- X is positive if F(X) > 0
- X is negative if F(X) < 0
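As a concrete illustration (not part of the original slides), a minimal Python sketch of such a linear classifier, with made-up weights and bias:

# A minimal sketch of a linear discriminant function:
# F(X) = b + sum_i w_i * x_i, classifying X as positive if F(X) > 0.

def linear_discriminant(x, w, b):
    """Return F(x) = b + w . x for feature vector x and weights w."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

def classify(x, w, b):
    """Label x as +1 if F(x) > 0, otherwise -1."""
    return 1 if linear_discriminant(x, w, b) > 0 else -1

# Example with illustrative (made-up) weights and bias:
w = [0.5, -1.0]
b = 0.2
print(classify([2.0, 0.5], w, b))   # F = 0.2 + 1.0 - 0.5 = 0.7 > 0  -> +1
print(classify([0.0, 1.0], w, b))   # F = 0.2 + 0.0 - 1.0 = -0.8 < 0 -> -1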
3. Linear vs. Quadratic vs. Polynomial
- Linear: a sum of weighted features
- Quadratic: also considers products of pairs of features
- Polynomial: considers products of multiple features (the degree p denotes how many features are multiplied together)
4. Linear vs. Quadratic vs. Polynomial
- Linear and quadratic functions are special cases of polynomial functions
- Linear → polynomial with degree p = 1
- Quadratic → polynomial with degree p = 2
5. Perceptron and Winnow
- Perceptron and Winnow: algorithms to learn a linear function
- Steps (a code sketch follows this list):
  1. Initialize F by setting w and b randomly
  2. For each training point xi, if F misclassifies xi, adjust w and b so that the new F classifies xi correctly
  3. Repeat Step 2 until F correctly classifies all the training points
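A minimal Python sketch of the loop described above. The specific update rule (w ← w + y·x, b ← b + y) and the deterministic initialization are assumptions, since the slides only say "adjust w and b":

# Minimal perceptron sketch with the standard additive update rule.
def perceptron(points, labels, max_epochs=100):
    """Learn w, b so that sign(b + w.x) matches the labels (+1/-1)."""
    n = len(points[0])
    w = [0.0] * n          # simple deterministic start instead of random
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(points, labels):
            f = b + sum(wi * xi for wi, xi in zip(w, x))
            if y * f <= 0:                       # misclassified
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b = b + y
                errors += 1
        if errors == 0:                          # all points classified correctly
            break
    return w, b

# Tiny linearly separable example:
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, b = perceptron(X, y)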
6. Perceptron and Winnow
- Nondeterministic
  - Several parameters to tune, such as the number of iterations and how much to adjust w
  - For the same training set, they could generate a different F depending on the data ordering
- Limited to linear functions: the number of weights w for a polynomial function is too large to train in this nondeterministic way
  - Inherently require O(n^p) time to learn a polynomial function of degree p
  - SVM can learn a polynomial function in O(n)
7. Support Vector Machine (SVM)
Hwanjo Yu, POSTECH (Pohang University of Science and Technology), http://hwanjoyu.org
8. Linear vs. Quadratic vs. Polynomial
- Linear: a sum of weighted features
- Quadratic: also considers products of pairs of features
- Polynomial: considers products of multiple features (the degree p denotes how many features are multiplied together)
9. Linear vs. Quadratic vs. Polynomial
10. Polynomial
- SVM learns a polynomial function with an arbitrary degree in linear time
[Figure: generalization behavior vs. polynomial degree — underfitting at low degrees, overfitting at high degrees]
11. Boundary of arbitrary shape
- SVM can learn a classification boundary of
arbitrary shape in a non-lazy fashion
12. Linear Support Vector Machines
- Find a linear hyperplane (decision boundary) that separates the data: F(X) = b + W·X
13. Linear Support Vector Machines
14. Linear Support Vector Machines
- Another possible solution
15. Linear Support Vector Machines
16. Linear Support Vector Machines
- Which one is better? B1 or B2?
- How do you define better?
17. Linear Support Vector Machines
- The distance from a point Xi to the hyperplane is |F(Xi)| / ||W||
- Find the hyperplane that maximizes the margin → B1 is better than B2
18. Linear Support Vector Machines
- Given a labeled dataset D = {(X1, y1), ..., (Xm, ym)}, where Xi is a data vector and yi is the class label (+1 or -1) of Xi
- D is linearly separable
  - if there exists a linear function F(X) = b + W·X that correctly classifies every data vector in D, that is,
  - if there exist b and W that satisfy yi(b + W·Xi) > 0 for all (Xi, yi) ∈ D, where b is a scalar and W is a weight vector
19. Linear Support Vector Machines
- A dataset D = {(X1, y1), ..., (Xm, ym)} is linearly separable if
  - there exist b and W that satisfy yi(b + W·Xi) > 0 for all (Xi, yi) ∈ D
- Question
  - If there exist b and W that satisfy yi(b + W·Xi) > 0 for all (Xi, yi) ∈ D,
  - do there then exist b and W that satisfy yi(b + W·Xi) ≥ 1 for all (Xi, yi) ∈ D?
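The extracted slides only pose the question; one standard way to see that the answer is yes is a rescaling argument, added here for completeness:

\[ \delta \;=\; \min_{i}\; y_i\,(b + W\cdot X_i) \;>\; 0 \quad \text{(minimum over the finite set } D\text{)} \]
\[ W' = W/\delta,\quad b' = b/\delta \quad\Longrightarrow\quad y_i\,(b' + W'\cdot X_i) \;\ge\; 1 \;\text{ for all } (X_i, y_i) \in D. \]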
20. Linear Support Vector Machines
21. Linear Support Vector Machines
- We want to maximize the margin 2 / ||W||
- Instead, we minimize ||W||² / 2
- But subject to the following constraints: yi(b + W·Xi) ≥ 1 for all (Xi, yi) ∈ D
- This is a constrained optimization problem
- Numerical approaches to solve it (e.g., quadratic programming); a small sketch follows this list
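As an illustration (assumed, not from the slides), the hard-margin primal can be handed to a generic constrained optimizer; in practice a dedicated QP solver is used instead. The data and variable layout below are made up:

# Hard-margin primal: minimize ||W||^2 / 2 subject to yi*(b + W.Xi) >= 1.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:-1]                       # v = [w1, ..., wn, b]
    return 0.5 * np.dot(w, w)        # minimize ||W||^2 / 2

constraints = [
    # yi * (b + w . xi) - 1 >= 0 for every training point
    {"type": "ineq", "fun": (lambda v, xi=xi, yi=yi: yi * (v[-1] + np.dot(v[:-1], xi)) - 1.0)}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b)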
22. Linear Support Vector Machines
- What if the problem is not linearly separable?
23. Linear Support Vector Machines
- What if the problem is not linearly separable?
- Introduce slack variables ξi
- Need to minimize ||W||² / 2 + C Σi ξi
- Subject to yi(b + W·Xi) ≥ 1 − ξi and ξi ≥ 0 for all i
24. Linear Support Vector Machines
- Primal form
  - Minimize ||W||² / 2 + C Σi ξi
  - Subject to yi(b + W·Xi) ≥ 1 − ξi, ξi ≥ 0 for all i
- To solve this, transform the primal to the dual (see the next slide)
25. Linear Support Vector Machines
- Dual form
  - Maximize Σ_{i=1}^{m} αi − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} αi αj yi yj (Xi·Xj)
  - Subject to Σ_{i=1}^{m} αi yi = 0 and 0 ≤ αi ≤ C
- W can be recovered by W = Σ_{i=1}^{m} αi yi Xi
26. Characteristics of the Solution
- Many of the αi are zero
  - w is a linear combination of a small number of data points
  - xi with non-zero αi are called support vectors (SV)
  - The decision boundary is determined only by the SVs
- For testing with a new data point z
  - Compute f(z) = Σ_{i ∈ SV} αi yi (xi·z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise (a small sketch follows this list)
  - Note that w need not be formed explicitly
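A minimal sketch (with made-up support vectors, labels, coefficients, and bias) of classifying a new point directly from the support vectors, without forming w explicitly:

import numpy as np

# Assumed example support vectors, labels, and dual coefficients:
sv       = np.array([[2.0, 2.0], [-1.0, -1.0]])   # support vectors xi
sv_y     = np.array([1.0, -1.0])                   # their labels yi
sv_alpha = np.array([0.5, 0.5])                    # their non-zero alphas
b = 0.0                                            # bias (assumed)

def decision(z):
    """f(z) = sum_i alpha_i * y_i * (x_i . z) + b"""
    return np.sum(sv_alpha * sv_y * (sv @ z)) + b

z = np.array([1.0, 3.0])
print("class 1" if decision(z) > 0 else "class 2")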
27. SVM model
- SVM model
  - Bias b, plus a list of support vectors and their coefficients α
- Where does α come from, and how does C affect the model?
- Why don't we compute w explicitly?
28. Nonlinear Support Vector Machines
- What if the decision boundary is not linear?
29. Nonlinear Support Vector Machines
- Transform data into a higher-dimensional space
30. Nonlinear Support Vector Machines
- A naive way
  - Transform data into a higher-dimensional space
  - Compute a linear boundary function in the new feature space
  - The boundary function becomes nonlinear in the original feature space → very time-consuming, though
- SVM: the kernel trick
  - Does all of this without explicitly transforming data into the higher-dimensional space
31. Nonlinear Support Vector Machines
- With a mapping φ(.) to the new feature space, the dual is the same as before with Xi·Xj replaced by φ(Xi)·φ(Xj): maximize Σ_{i=1}^{m} αi − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} αi αj yi yj φ(Xi)·φ(Xj), subject to Σ_{i=1}^{m} αi yi = 0 and 0 ≤ αi ≤ C
32. An Example for φ(.) and K(.,.)
- Suppose φ(.) is given as follows: φ(x) = φ(x1, x2) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2)
- An inner product in the feature space is φ(x)·φ(y) = (1 + x1y1 + x2y2)²
- So, if we define the kernel function as K(x, y) = (1 + x1y1 + x2y2)², there is no need to carry out φ(.) explicitly (a numeric check follows this list)
- This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
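A quick numeric check (illustrative, with made-up points) that the degree-2 polynomial kernel K(x, y) = (1 + x·y)² equals the inner product in the mapped feature space for the mapping φ above:

import math
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     math.sqrt(2) * x1, math.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     math.sqrt(2) * x1 * x2])

def K(x, y):
    return (1.0 + np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 4.0
print(K(x, y))                  # 4.0 -- same value, without computing phi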
33. Kernel Functions
- In practical use of SVM, the user specifies the kernel function; the transformation φ(.) is not explicitly stated
- Another view: a kernel function, being an inner product, is really a similarity measure between the objects
34. Examples of Kernel Functions
- Polynomial kernel with degree d: K(x, y) = (x·y + 1)^d
- Radial basis function (RBF) kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²))
  - Closely related to the kNN model and RBF neural networks
  - The feature space is infinite-dimensional
- Sigmoid with parameters κ and θ: K(x, y) = tanh(κ x·y + θ)
  - It does not satisfy the Mercer condition for all κ and θ
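Small illustrative implementations of the kernels listed above; the parameter names d, sigma, kappa, theta mirror the slide's notation, and the sample vectors are made up:

import numpy as np

def poly_kernel(x, y, d=2):
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x, y) + theta)

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))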
35. Modification Due to Kernel Function
- Change all inner products to kernel functions
- For training,
  - Original: maximize Σ_{i=1}^{m} αi − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} αi αj yi yj (Xi·Xj), subject to Σ_{i=1}^{m} αi yi = 0 and 0 ≤ αi ≤ C
  - With kernel function: maximize Σ_{i=1}^{m} αi − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} αi αj yi yj K(Xi, Xj), subject to the same constraints
36. Modification Due to Kernel Function
- For testing, the new data point z is classified as class 1 if f ≥ 0, and as class 2 if f < 0
  - Original: f(z) = Σ_{i ∈ SV} αi yi (Xi·z) + b
  - With kernel function: f(z) = Σ_{i ∈ SV} αi yi K(Xi, z) + b
37. More on Kernel Functions
- Since the training of SVM only requires the values K(xi, xj), there is no restriction on the form of xi and xj
  - xi can be a sequence or a tree, instead of a feature vector
  - K(xi, xj) is just a similarity measure comparing xi and xj
- For a test object z, the discriminant function is essentially a weighted sum of the similarities between z and a pre-selected set of objects (the support vectors)
38. More on Kernel Functions
- Not every similarity measure can be used as a kernel function, however
  - The kernel function needs to satisfy the Mercer condition, i.e., the function is positive semi-definite
  - This implies that the m-by-m kernel matrix, in which the (i,j)-th entry is K(xi, xj), is always positive semi-definite
  - This also means that the QP is convex and can be solved in polynomial time
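An illustrative check (assumed, not from the slides) that a kernel matrix built from a Mercer kernel is positive semi-definite: all of its eigenvalues are non-negative up to numerical tolerance. The data points are made up:

import numpy as np

X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [2.0, 2.0]])

def poly_kernel_matrix(X, d=2):
    G = X @ X.T                      # Gram matrix of inner products
    return (G + 1.0) ** d            # elementwise: K(xi, xj) = (xi.xj + 1)^d

K = poly_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)      # eigenvalues of the symmetric kernel matrix
print(eigvals)                       # all >= 0 (within floating-point error)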
39. Example
- Suppose we have 5 one-dimensional data points
  - x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 → y1=1, y2=1, y3=−1, y4=−1, y5=1
- We use the polynomial kernel of degree 2
  - K(x, y) = (xy + 1)²
- C is set to 100
- We first find αi (i = 1, ..., 5) by maximizing Σ_{i=1}^{5} αi − (1/2) Σ_{i=1}^{5} Σ_{j=1}^{5} αi αj yi yj (xi xj + 1)², subject to Σ αi yi = 0 and 0 ≤ αi ≤ 100
40. Example
- By using a QP solver, we get
  - α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
  - Note that the constraints are indeed satisfied
  - The support vectors are x2=2, x4=5, x5=6
- The discriminant function is
  - f(z) = 2.5·(1)·(2z + 1)² + 7.333·(−1)·(5z + 1)² + 4.833·(1)·(6z + 1)² + b = 0.6667 z² − 5.333 z + b
- b is recovered by solving f(2)=1, or by f(5)=−1, or by f(6)=1, as x2 and x5 lie on the line f(z)=1 and x4 lies on the line f(z)=−1
- All three give b=9 (a numeric verification follows this list)
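A small script (not from the slides) that re-evaluates the discriminant function above at the five training points with b = 9; it recovers f(2) ≈ 1, f(5) ≈ −1, f(6) ≈ 1 (approximately, since the αi are rounded) and the expected class assignments:

sv       = [2.0, 5.0, 6.0]          # support vectors x2, x4, x5
sv_y     = [1.0, -1.0, 1.0]         # their labels
sv_alpha = [2.5, 7.333, 4.833]      # their dual coefficients
b = 9.0

def f(z):
    return sum(a * y * (x * z + 1.0) ** 2 for a, y, x in zip(sv_alpha, sv_y, sv)) + b

for z in [1, 2, 4, 5, 6]:
    print(z, round(f(z), 3), "class 1" if f(z) > 0 else "class 2")
# x = 1, 2, 6 come out positive (class 1); x = 4, 5 come out negative (class 2)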
41. Example
[Figure: value of the discriminant function over the input axis; x = 1, 2, 6 fall on the class-1 side and x = 4, 5 on the class-2 side]
42. SVM model with RBF kernel
- As σ changes, the boundary shape changes
43. Tuning parameters in SVM
- Soft-margin parameter C
  - SVM becomes soft as C approaches zero
  - SVM becomes hard as C goes up
- Polynomial kernel parameter d in K(x, y) = (x·y + 1)^d
  - As d increases, the boundary becomes more complex
  - Start with d = 1 and increase d by 1
  - The generalization performance will stop improving at some point
- RBF kernel parameter σ in K(x, y) = exp(−||x − y||² / (2σ²))
  - No deterministic way to tune σ
  - Start with a small value like 10⁻⁶, and multiply by 10 at each iteration up to σ = 10⁶
  - Once you find a good range of σ, you can tune it more finely within that range (a grid-search sketch follows this list)
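A hedged sketch of the coarse-to-fine search described above, using scikit-learn's SVC (which wraps LIBSVM). Note that scikit-learn parameterizes the RBF kernel by gamma = 1 / (2σ²) rather than by σ, and the toy data below is made up:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative toy data; replace with a real training set.
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

param_grid = {
    "C": [0.01, 1, 100],                        # softer vs. harder margin
    "gamma": [10.0 ** k for k in range(-6, 7)]  # coarse sweep, 1e-6 .. 1e6
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # then refine the grid around these values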
44. SVM Implementation
- LIBSVM
  - Highly optimized implementation
  - Provides interfaces to other languages such as Python, Java, etc.
  - Supports various types of SVM such as multi-class classification, nu-SVM, one-class SVM, etc.
- SVM-light
  - One of the earliest implementations, thus also widely used
  - Less efficient and accurate than LIBSVM (according to my personal experience)
  - Supports Ranking SVM, which we will discuss later
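A minimal usage sketch: scikit-learn's SVC is built on top of LIBSVM, so training a kernel SVM from Python can look like the following. The settings mirror the worked example above (degree-2 polynomial kernel (x·y + 1)², C = 100), but exact numerical results may differ slightly:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])   # the 1-D example above
y = np.array([1, 1, -1, -1, 1])

# gamma=1.0 and coef0=1.0 make the kernel exactly (x.y + 1)^degree
model = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)
model.fit(X, y)
print(model.support_)          # indices of the support vectors
print(model.predict([[3.0]]))  # classify a new point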