1
Linear discriminant (or classification) functions
Hwanjo Yu, POSTECH (Pohang University of Science and Technology)
http://hwanjoyu.org
2
Linear discriminant function
  • A linear function classifies an object X = <x1, x2, ...> based on F(X) = b + Σi wi xi
  • b is the bias, wi is the weight of feature xi
  • X is positive if F(X) > 0
  • X is negative if F(X) < 0

[Figure: positive and negative training points separated by the linear boundary F(X) = 0]
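A minimal sketch of this decision rule in Python (the weights and bias below are illustrative placeholders, not values from the slides):

```python
import numpy as np

# F(X) = b + sum_i w_i * x_i ; classify by the sign of F(X)
w = np.array([0.5, -1.2, 0.3])   # illustrative weights, one per feature
b = 0.1                          # illustrative bias

def classify(x):
    F = b + np.dot(w, x)
    return "positive" if F > 0 else "negative"

print(classify(np.array([1.0, 0.2, 2.0])))
```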
3
Linear vs Quadratic vs Polynomial
  • Linear: a sum of weighted features
  • Quadratic: also considers products of two features
  • Polynomial: considers products of multiple features (the degree p denotes the number of features in a product)

4
Linear vs Quadratic vs Polynomial
  • Linear and quadratic functions are special cases of polynomial functions
  • Linear → polynomial with degree p = 1
  • Quadratic → polynomial with degree p = 2

5
Perceptron and Winnow
  • Perceptron and Winnow: algorithms to learn a linear function
  • Steps
  • Initialize F by setting w and b randomly
  • For each training point xi: if F misclassifies xi, adjust w and b so that the new F correctly classifies xi
  • Repeat Step 2 until F correctly classifies all the training points (a sketch of the perceptron update appears below)
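The slides leave the update rule implicit; a minimal sketch of the perceptron variant is given below (the Winnow variant uses multiplicative rather than additive updates, and the learning rate eta is an assumed tuning parameter):

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_iters=100):
    """Learn w, b for F(x) = b + w . x; y entries are +1 or -1."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])      # Step 1: initialize w and b randomly
    b = rng.normal()
    for _ in range(max_iters):           # Step 3: repeat until no point is misclassified
        mistakes = 0
        for xi, yi in zip(X, y):         # Step 2: fix each misclassified point
            if yi * (b + w @ xi) <= 0:
                w += eta * yi * xi       # additive update (Winnow would multiply instead)
                b += eta * yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b
```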

6
Perceptron and Winnow
  • Nondeterministic
  • Several parameters to tune, such as the number of iterations and how much to adjust w
  • For the same training set, they could generate different F depending on the data ordering
  • Limited to linear functions: the feature space of a polynomial function is too large to train in this nondeterministic way
  • Inherently requires O(n^p) time to learn a polynomial function of degree p
  • SVM can learn a polynomial function in O(n)

7
Support Vector Machine (SVM)
Hwanjo Yu, POSTECH (Pohang University of Science and Technology)
http://hwanjoyu.org
8
Linear vs Quadratic vs Polynomial
  • Linear: a sum of weighted features
  • Quadratic: also considers products of two features
  • Polynomial: considers products of multiple features (the degree p denotes the number of features in a product)

9
Linear vs Quadratic vs Polynomial
10
Polynomial
  • SVM learns a polynomial function with an
    arbitrary degree in linear time

[Figure: model error versus polynomial degree, showing underfitting at low degrees and overfitting at high degrees]
11
Boundary of arbitrary shape
  • SVM can learn a classification boundary of
    arbitrary shape in a non-lazy fashion

12
Linear Support Vector Machines
  • Find a linear hyperplane (decision boundary) that will separate the data: F(X) = b + W · X

13
Linear Support Vector Machines
  • One Possible Solution

14
Linear Support Vector Machines
  • Another possible solution

15
Linear Support Vector Machines
  • Other possible solutions

16
Linear Support Vector Machines
  • Which one is better? B1 or B2?
  • How do you define better?

17
Linear Support Vector Machines
[Figure: the distance from a point Xi to the hyperplane is |F(Xi)| / ||W||]
  • Find the hyperplane that maximizes the margin → B1 is better than B2

18
Linear Support Vector Machines
  • Given a labeled dataset D = {(X1, y1), ..., (Xm, ym)}, where Xi is a data vector and yi is the class label (+1 or -1) of Xi
  • D is linearly separable
  • if there exists a linear function F(X) = b + W · X that correctly classifies every data vector in D, that is,
  • if there exist b and W satisfying yi(b + W · Xi) > 0 for all (Xi, yi) ∈ D, where b is a scalar and W is a weight vector

19
Linear Support Vector Machines
  • A dataset D = {(X1, y1), ..., (Xm, ym)} is linearly separable if
  • there exist b and W satisfying yi(b + W · Xi) > 0 for all (Xi, yi) ∈ D
  • Question
  • If there exist b and W satisfying yi(b + W · Xi) > 0 for all (Xi, yi) ∈ D,
  • do there then exist b and W satisfying yi(b + W · Xi) ≥ 1 for all (Xi, yi) ∈ D?

20
Linear Support Vector Machines
21
Linear Support Vector Machines
  • We want to maximize the margin
  • Instead, we minimize an equivalent objective
  • But subject to the following constraints (a reconstruction of the objective and constraints is sketched below)
  • This is a constrained optimization problem
  • Numerical approaches solve it (e.g., quadratic programming)
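The formulas on this slide appear only as images in the transcript; a standard reconstruction of the hard-margin formulation, consistent with the notation F(X) = b + W · X used earlier, is:

```latex
% Maximize the margin 2 / ||W|| ; equivalently, minimize ||W||^2 / 2
\max_{W,\,b}\ \frac{2}{\lVert W \rVert}
\quad\Longleftrightarrow\quad
\min_{W,\,b}\ \frac{1}{2}\lVert W \rVert^{2}
\qquad \text{subject to}\quad y_i\,(b + W \cdot X_i) \ge 1,\quad i = 1,\dots,m
```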

22
Linear Support Vector Machines
  • What if the problem is not linearly separable?

23
Linear Support Vector Machines
  • What if the problem is not linearly separable?
  • Introduce slack variables
  • Need to minimize
  • Subject to

24
Linear Support Vector Machines
  • Primal form (a reconstruction is sketched below)
  • Minimize
  • Subject to
  • To solve this, transform the primal to the dual (see the next slide)
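The primal formula is likewise shown only as an image; the standard soft-margin primal with slack variables ξi and trade-off parameter C (matching slides 22-23) is:

```latex
\min_{W,\,b,\,\xi}\ \frac{1}{2}\lVert W \rVert^{2} + C \sum_{i=1}^{m} \xi_i
\qquad \text{subject to}\quad y_i\,(b + W \cdot X_i) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m
```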

25
Linear Support Vector Machines
  • Dual form (a reconstruction is sketched below)
  • W can be recovered from the solution

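The dual and the recovery of W are shown only as images; the standard dual of the soft-margin problem, in the same notation, is:

```latex
\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\,(X_i \cdot X_j)
\qquad \text{subject to}\quad \sum_{i=1}^{m} \alpha_i y_i = 0,\quad 0 \le \alpha_i \le C
```

and W is recovered as

```latex
W = \sum_{i=1}^{m} \alpha_i\, y_i\, X_i
```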
26
Characteristics of the Solution
  • Many of the αi are zero
  • w is a linear combination of a small number of data points
  • xi with non-zero αi are called support vectors (SV)
  • The decision boundary is determined only by the SVs
  • For testing with a new data point z
  • Compute f(z) = Σi αi yi (Xi · z) + b and classify z as class 1 if the sum is positive, and as class 2 otherwise (a small sketch follows below)
  • Note that w need not be formed explicitly
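A minimal sketch of this decision rule; the support vectors, multipliers, and bias below are hypothetical values standing in for a trained model:

```python
import numpy as np

alphas = np.array([0.7, 1.2])          # non-zero multipliers of the support vectors (illustrative)
ys     = np.array([+1, -1])            # labels of the support vectors
xs     = np.array([[1.0, 2.0],         # the support vectors themselves
                   [3.0, 0.5]])
b      = -0.4                          # bias (illustrative)

def decide(z):
    # f(z) = sum_i alpha_i * y_i * (X_i . z) + b  -- w is never formed explicitly
    f = np.sum(alphas * ys * (xs @ z)) + b
    return +1 if f > 0 else -1

print(decide(np.array([2.0, 1.0])))
```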

27
SVM model
  • SVM model
  • Bias b, a list of support vectors, and their coefficients αi
  • Where is W, and how does C affect the model?
  • Why don't we compute w explicitly?

28
Nonlinear Support Vector Machines
  • What if the decision boundary is not linear?

29
Nonlinear Support Vector Machines
  • Transform data into a higher-dimensional space

30
Nonlinear Support Vector Machines
  • A naive way
  • Transform data into a higher-dimensional space
  • Compute a linear boundary function in the new feature space
  • The boundary function becomes nonlinear in the original feature space → very time consuming, though
  • SVM's kernel trick
  • Does all of this without explicitly transforming data into the higher-dimensional space

31
Nonlinear Support Vector Machines
  • Dual form: the same as before, with the inner products Xi · Xj replaced by f(Xi) · f(Xj)

32
An Example for f(.) and K(.,.)
  • Suppose f(.) is given as follows
  • An inner product in the new feature space is
  • So, if we define the kernel function as follows, there is no need to carry out f(.) explicitly
  • This use of a kernel function to avoid carrying out f(.) explicitly is known as the kernel trick

[Figure: mapping f(.) from the original feature space to the new feature space]
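The exact f(.) on this slide is not in the transcript; a common instance (assumed here) is the degree-2 polynomial kernel K(x, y) = (x · y + 1)^2 with feature map f(x) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2). The sketch below checks numerically that the kernel equals the inner product in the new feature space:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (2-D input assumed)
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    # Kernel trick: same value as phi(x) . phi(y), without forming phi explicitly
    return (np.dot(x, y) + 1) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)), K(x, y))   # both print the same number
```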
33
Kernel Functions
  • In practical use of SVM, the user specifies the kernel function; the transformation f(.) is not explicitly stated
  • Another view: the kernel function, being an inner product, is really a similarity measure between the objects

34
Examples of Kernel Functions
  • Polynomial kernel with degree d (see the sketches below)
  • Radial basis function (RBF) kernel with width σ
  • Closely related to the kNN model and to RBF neural networks
  • The feature space is infinite-dimensional
  • Sigmoid kernel with parameters κ and θ
  • It does not satisfy the Mercer condition for all κ and θ
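The kernel formulas on this slide are images; the sketches below assume the usual parameterizations of these three kernels:

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (x . y + 1)^d
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); width sigma
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    # K(x, y) = tanh(kappa * (x . y) + theta); not a Mercer kernel for all parameter values
    return np.tanh(kappa * np.dot(x, y) + theta)
```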

35
Modification Due to Kernel Function
  • Change all inner products to kernel functions
  • For training,

[Formulas: the original dual uses the inner products Xi · Xj; the kernelized dual replaces them with K(Xi, Xj)]
36
Modification Due to Kernel Function
  • For testing, the new data point z is classified as class 1 if f(z) ≥ 0, and as class 2 if f(z) < 0

[Formulas: the original decision function uses Xi · z; the kernelized one replaces it with K(Xi, z)]
37
More on Kernel Functions
  • Since the training of SVM only requires the values of K(xi, xj), there is no restriction on the form of xi and xj
  • xi can be a sequence or a tree, instead of a
    feature vector
  • K(xi, xj) is just a similarity measure comparing
    xi and xj
  • For a test object z, the discriminant function is essentially a weighted sum of the similarities between z and a pre-selected set of objects (the support vectors)

38
More on Kernel Functions
  • Not all similarity measures can be used as kernel functions, however
  • The kernel function needs to satisfy the Mercer condition, i.e., it must be positive semi-definite
  • This implies that the m-by-m kernel matrix, in which the (i,j)-th entry is K(xi, xj), is always positive semi-definite (a quick numerical check is sketched below)
  • This also means that the QP is convex and can be solved in polynomial time
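A quick numerical illustration of this property (a sketch, using the RBF kernel on random points): the kernel matrix has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 random 3-D points
sigma = 1.0

# Build the 20 x 20 kernel matrix K[i, j] = exp(-||xi - xj||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())                  # >= 0, up to numerical round-off
```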

39
Example
  • Suppose we have five 1-D data points
  • x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 → y1=1, y2=1, y3=-1, y4=-1, y5=1
  • We use the polynomial kernel of degree 2
  • K(x, y) = (xy + 1)^2
  • C is set to 100
  • We first find αi (i = 1, ..., 5) by solving the dual problem from slide 25

40
Example
  • By using a QP solver, we get
  • α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
  • Note that the constraints are indeed satisfied
  • The support vectors are x2=2, x4=5, x5=6
  • The discriminant function is f(z) = Σi αi yi (xi z + 1)^2 + b = 0.6667 z^2 - 5.333 z + b
  • b is recovered by solving f(2)=1, or f(5)=-1, or f(6)=1, as x2 and x5 lie on the line f(z)=1 and x4 lies on the line f(z)=-1
  • All three give b=9 (a numerical check is sketched below)
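A short numerical check of this worked example, plugging the data, the multipliers from the QP solver, and b = 9 into the discriminant function:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([+1, +1, -1, -1, +1])
alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])
b = 9.0

def K(u, v):
    # Polynomial kernel of degree 2 from slide 39
    return (u * v + 1) ** 2

def f(z):
    # Discriminant function: sum_i alpha_i * y_i * K(x_i, z) + b
    return np.sum(alpha * y * K(x, z)) + b

for z in x:
    print(z, round(f(z), 2))   # roughly +1 at x2, x5 and -1 at x4; positive for class 1, negative for class 2
```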

41
Example
[Figure: value of the discriminant function at x = 1, 2, 4, 5, 6; positive for class 1 (x = 1, 2, 6) and negative for class 2 (x = 4, 5)]
42
SVM model with RBF kernel
  • As delta (the RBF width) changes, the boundary shape changes

43
Tuning parameters in SVM
  • Soft-margin parameter C
  • SVM becomes soft as C approaches zero
  • SVM becomes hard as C grows
  • Polynomial kernel parameter d (the degree)
  • As d increases, the boundary becomes more complex
  • Start with d = 1 and increase d by 1
  • The generalization performance will stop improving at some point
  • RBF kernel parameter delta (the kernel width)
  • No deterministic way to tune delta
  • Start with a small value like 10^-6, and multiply by 10 at each iteration, up to delta = 10^6
  • Once you find a good range of delta, you can tune it more finely within that range (a simple search loop is sketched below)
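A minimal sketch of this coarse-to-fine search, using scikit-learn's SVC (which wraps LIBSVM); the toy dataset and the mapping from the width to SVC's gamma are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # toy data

best_width, best_score = None, -np.inf
for width in [10.0 ** k for k in range(-6, 7)]:           # 1e-6, 1e-5, ..., 1e6
    # SVC's gamma corresponds to 1 / (2 * width^2) for an RBF kernel of that width
    clf = SVC(kernel='rbf', C=100, gamma=1.0 / (2 * width ** 2))
    score = cross_val_score(clf, X, y, cv=5).mean()
    if score > best_score:
        best_width, best_score = width, score

print(best_width, best_score)   # then search more finely around best_width
```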

44
SVM Implementation
  • LIBSVM
  • Highly optimized implementation
  • Provides interfaces to other languages such as Python, Java, etc.
  • Supports various types of SVM, such as multi-class classification, nu-SVM, one-class SVM, etc.
  • SVM-light
  • One of the earliest implementations, and thus also widely used
  • Less efficient and accurate than LIBSVM (according to my personal experience)
  • Supports Ranking SVM, which we will discuss later (a minimal usage sketch follows below)
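As a usage illustration (not from the slides), here is a minimal sketch using scikit-learn's SVC, which wraps LIBSVM; its C, kernel, and gamma options correspond to LIBSVM's -c, -t, and -g parameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel='rbf', C=100, gamma=0.1)   # RBF kernel, soft-margin parameter C
clf.fit(X_tr, y_tr)

print("support vectors per class:", clf.n_support_)  # the model stores b, the SVs, and their coefficients
print("test accuracy:", clf.score(X_te, y_te))
```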