Classification - PowerPoint PPT Presentation

1
Classification
Yan Pan
2
Under and Over Fitting
3
Probability Theory
  • Non-negativity and unit measure
  • $0 \le p(y)$, $p(\Omega) = 1$, $p(\emptyset) = 0$
  • Conditional probability $p(y \mid x)$
  • $p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$
  • Bayes' Theorem
  • $p(y \mid x) = p(x \mid y)\, p(y) / p(x)$
  • Marginalization
  • $p(x) = \int_y p(x, y)\, dy$
  • Independence
  • $p(x_1, x_2) = p(x_1)\, p(x_2) \Leftrightarrow p(x_1 \mid x_2) = p(x_1)$
  • Chris Bishop, Pattern Recognition and Machine Learning

5
The Univariate Gaussian Density
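  • $p(x \mid \mu, \sigma) = \exp\left(-(x - \mu)^2 / 2\sigma^2\right) / (2\pi\sigma^2)^{1/2}$

[Figure: Gaussian density curve; horizontal axis marked at $\mu$, $\mu \pm 1\sigma$, $\mu \pm 2\sigma$ and $\mu \pm 3\sigma$]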
6
The Multivariate Gaussian Density
  • $p(x \mid \mu, \Sigma) = \exp\left(-\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\right) / \left((2\pi)^{D/2} |\Sigma|^{1/2}\right)$

7
The Beta Density
  • $p(\theta \mid a, b) = \theta^{a-1} (1 - \theta)^{b-1}\, \Gamma(a + b) / \left(\Gamma(a)\Gamma(b)\right)$

8
Probability Distribution Functions
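  • Bernoulli: single trial with probability of success $\theta$
  • $n \in \{0, 1\}$, $\theta \in [0, 1]$
  • $p(n \mid \theta) = \theta^n (1 - \theta)^{1-n}$
  • Binomial: $N$ iid Bernoulli trials with $n$ successes
  • $n \in \{0, 1, \ldots, N\}$, $\theta \in [0, 1]$
  • $p(n \mid N, \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N-n}$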

9
A Toy Example
  • We don't know whether a coin is fair or not. We are told that heads occurred $n$ times in $N$ coin flips.
  • We are asked to predict whether the next coin flip will result in a head or a tail.
  • Let $y$ be a binary random variable such that $y = 1$ represents the event that the next coin flip will be a head and $y = 0$ that it will be a tail.
  • We should predict heads if $p(y = 1 \mid n, N) > p(y = 0 \mid n, N)$

10
The Maximum Likelihood Approach
  • Let $p(y = 1 \mid n, N) = \theta$ and $p(y = 0 \mid n, N) = 1 - \theta$, so that we should predict heads if $\theta > 1/2$
  • How should we estimate $\theta$?
  • Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of $\theta$ that maximizes the likelihood of observing the data
  • $\theta_{ML} = \arg\max_\theta p(n \mid \theta) = \arg\max_\theta \binom{N}{n} \theta^n (1 - \theta)^{N-n}$
  • $= \arg\max_\theta\; n \log\theta + (N - n) \log(1 - \theta)$
  • $= n / N$
  • We should predict heads if $n > N/2$

11
The Maximum A Posteriori Approach
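  • We should choose the value of $\theta$ maximizing the posterior probability of $\theta$ conditioned on the data
  • We assume a
  • Binomial likelihood: $p(n \mid \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N-n}$
  • Beta prior: $p(\theta \mid a, b) = \theta^{a-1} (1 - \theta)^{b-1}\, \Gamma(a + b) / \left(\Gamma(a)\Gamma(b)\right)$
  • $\theta_{MAP} = \arg\max_\theta p(\theta \mid n, a, b) = \arg\max_\theta p(n \mid \theta)\, p(\theta \mid a, b)$
  • $= \arg\max_\theta\; \theta^n (1 - \theta)^{N-n}\, \theta^{a-1} (1 - \theta)^{b-1}$
  • $= (n + a - 1) / (N + a + b - 2)$, as if we saw an extra $a - 1$ heads and $b - 1$ tails
  • We should predict heads if $n > \tfrac{1}{2}(N + b - a)$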

12
The Bayesian Approach
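  • We should marginalize over $\theta$
  • $p(y = 1 \mid n, a, b) = \int_\theta p(y = 1 \mid n, \theta)\, p(\theta \mid a, b, n)\, d\theta$
  • $= \int_\theta \theta\, p(\theta \mid a, b, n)\, d\theta$
  • $= \int_\theta \theta\, \mathrm{Beta}(\theta \mid a + n,\, b + N - n)\, d\theta$
  • $= (n + a) / (N + a + b)$, as if we saw an extra $a$ heads and $b$ tails
  • We should predict heads if $n > \tfrac{1}{2}(N + b - a)$
  • The Bayesian and MAP predictions coincide in this case
  • In the very large data limit, both the Bayesian and MAP predictions coincide with the ML prediction ($n > N/2$)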
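
The three coin-flip estimates can be compared side by side. The following is a minimal sketch, not part of the original slides; the function names and the example values (7 heads in 10 flips, prior a = b = 2) are chosen only for illustration.

```python
def coin_estimates(n, N, a, b):
    """Return (ML, MAP, Bayesian) estimates of the probability of heads.

    n heads were observed in N flips; (a, b) are the Beta prior parameters.
    The MAP formula is the mode of Beta(a + n, b + N - n) and assumes a, b > 1.
    """
    theta_ml = n / N
    theta_map = (n + a - 1) / (N + a + b - 2)
    p_bayes = (n + a) / (N + a + b)  # mean of Beta(a + n, b + N - n)
    return theta_ml, theta_map, p_bayes


def predict_heads(p):
    """Predict heads when the estimated probability of heads exceeds 1/2."""
    return p > 0.5


# Illustrative values only: 7 heads in 10 flips with a Beta(2, 2) prior.
ml, map_, bayes = coin_estimates(n=7, N=10, a=2, b=2)
```

With these values the three estimates differ, but all of them lead to the same prediction (heads), and the gap between them shrinks as N grows.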

18
Classification
19
Binary Classification
20
Approaches to Classification
  • Memorization
      • Cannot deal with previously unseen data
      • Large-scale annotated data acquisition cost might be very high
  • Rule-based expert system
      • Dependent on the competence of the expert
      • Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
      • Rules might not transfer to similar problems
  • Learning from training data and prior knowledge
      • Focuses on generalization to novel data

21
Notation
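  • Training data
  • Set of $N$ labeled examples of the form $(x_i, y_i)$
  • Feature vector $x \in \mathbb{R}^D$. $X = [x_1\; x_2\; \ldots\; x_N]$
  • Label $y \in \{\pm 1\}$. $y = [y_1, y_2, \ldots, y_N]^\top$. $Y = \mathrm{diag}(y)$
  • Example: Gender Identification

[Figure: four examples labeled $(x_1, y_1 = 1)$, $(x_2, y_2 = 1)$, $(x_3, y_3 = 1)$, $(x_4, y_4 = -1)$]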
22
Binary Classification
23
Binary Classification
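[Figure: separating hyperplane $w^\top x + b = 0$ with normal vector $w$ and offset $b$; parameters $\theta = [w, b]$]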
24
Machine Learning from the Optimization View
  • Before we go into the details of classification
    and regression methods, we should take a close
    look at the objective functions of machine
    learning

25
Supervised Learning
26
Common Form of Supervised Learning Problems
  • Minimize the following objective function
  • Regularization term + Loss function
  • The regularization term controls the model complexity and helps avoid overfitting
  • The loss function measures the quality of the learned function, i.e. the prediction error on the training data

27
Ex.1 Linear Regression
  • $E(w) = \tfrac{1}{2} \sum_n (y_n - w^\top x_n)^2 + \tfrac{1}{2} \lambda\, w^\top w$

28
Ex.2 Logistic Regression (classification method)
  • $\mathcal{L}(w, b) = \tfrac{1}{2} \lambda\, w^\top w + \sum_i \log\left(1 + \exp(-y_i (b + w^\top x_i))\right)$

29
Ex.3 SVM
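  • $E(w) = \tfrac{1}{2} \lambda\, w^\top w + \sum_i \max(0,\, 1 - y_i w^\top x_i)$
  • Or
  • $E(w) = \tfrac{1}{2} \lambda\, w^\top w + \sum_i \max(0,\, 1 - y_i w^\top x_i)^2$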

30
How to measure error?
  • True label: $y_i$
  • Predicted value: $w^\top x_i$
  • Zero-one error: $I(y_i \ne w^\top x_i)$
  • Squared error: $(y_i - w^\top x_i)^2$
  • With labels in $\{-1, 1\}$, the margin: $y_i\, w^\top x_i$

31
Approximate the Zero-One Loss
  • Squared Error
  • Exponential Loss
  • Logistic Loss
  • Hinge Loss
  • Sigmoid Loss
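
The surrogate losses listed above can all be written as functions of the margin m = y w^T x. The sketch below is an illustration rather than the slide's own definitions: the exact form of the sigmoid loss and the convention that m = 0 counts as an error are assumptions.

```python
import math

# Each loss takes the margin m = y * w^T x, with the label y in {-1, +1}.


def zero_one_loss(m):
    # Assumption: a margin of exactly zero is counted as an error.
    return 1.0 if m <= 0 else 0.0


def squared_loss(m):
    # (y - w^T x)^2 equals (1 - m)^2 when y is +1 or -1.
    return (1.0 - m) ** 2


def exponential_loss(m):
    return math.exp(-m)


def logistic_loss(m):
    # log(1 + exp(-m)), rearranged so that exp never overflows.
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def hinge_loss(m):
    return max(0.0, 1.0 - m)


def sigmoid_loss(m):
    # Assumed form 1 / (1 + exp(m)), written with tanh to avoid overflow.
    return 0.5 * (1.0 - math.tanh(0.5 * m))
```

Evaluating these on a grid of margins shows the hinge, exponential and squared losses lying on or above the zero-one loss, and only the squared loss increasing again for confidently correct points (m > 1).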

32
Regularized Logistic Regression
[Figure] Zhu and Hastie, Kernel Logistic Regression and the Import Vector Machine, NIPS 01
36
Convex Functions
  • Convex $f$: $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
  • The Hessian $\nabla^2 f$ is always positive semi-definite
  • The tangent is always a lower bound to $f$

38
Gradient Descent
  • Iteration: $x_{n+1} = x_n - \eta_n \nabla f(x_n)$
  • Step size selection: Armijo rule
  • Stopping criterion: change in $f$ is minuscule

39
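Gradient Descent: Logistic Regression
  • $\mathcal{L}(w, b) = \tfrac{1}{2} \lambda\, w^\top w + \sum_i \log\left(1 + \exp(-y_i (b + w^\top x_i))\right)$
  • $\nabla_w \mathcal{L}(w, b) = \lambda w - \sum_i p(-y_i \mid x_i, w)\, y_i x_i$
  • $\nabla_b \mathcal{L}(w, b) = -\sum_i p(-y_i \mid x_i, w)\, y_i$
  • Beware of numerical issues while coding!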
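
One common numerical issue is overflow in exp when a margin is large and negative. The following sketch shows one way to evaluate the regularized logistic objective and its gradient stably. It uses plain Python lists; the helper names are hypothetical and not taken from the slides.

```python
import math


def log1pexp(z):
    """Stable log(1 + exp(z)): exp is only ever called on a non-positive value."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def lr_objective_and_grad(w, b, X, y, lam):
    """Regularized logistic loss and its gradient with respect to w and b.

    X is a list of feature lists and y a list of labels in {-1, +1}.
    """
    loss = 0.5 * lam * sum(wj * wj for wj in w)
    grad_w = [lam * wj for wj in w]
    grad_b = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (b + sum(wj * xj for wj, xj in zip(w, xi)))
        loss += log1pexp(-margin)
        p_wrong = sigmoid(-margin)  # p(-y_i | x_i, w)
        for j, xj in enumerate(xi):
            grad_w[j] -= p_wrong * yi * xj
        grad_b -= p_wrong * yi
    return loss, grad_w, grad_b
```

Comparing the returned gradient against a finite-difference estimate is a cheap way to catch sign errors before running an optimizer.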

40
Gradient Descent Algorithm
  • Input: x_0, objective f(x), tolerance ε, maximum number of iterations T
  • Output: x_star that minimizes f(x)
  • t = 0
  • while (t == 0 or (f(x_{t-1}) - f(x_t) >= ε and t < T)), e.g. T = 100000
      • g_t = gradient of f(x) at x_t
      • for (i = 10; i >= -6; i--)
          • s = 2^i
          • x_{t+1} = x_t - s g_t
          • if (f(x_{t+1}) < f(x_t)) break
      • t = t + 1
  • Output x_t
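
The algorithm above, with its step-size search over powers of two from 2^10 down to 2^-6, can be sketched in runnable form as follows. This is one reading of the pseudocode: the argument names and the extra exit when no step size decreases f are assumptions.

```python
def gradient_descent(f, grad, x0, eps=1e-10, max_iter=100000):
    """Minimize f by gradient descent with a search over step sizes 2^10 ... 2^-6.

    f maps a list of floats to a float; grad returns the gradient as a list.
    """
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        g = grad(x)
        improved = False
        for i in range(10, -7, -1):  # exponents 10, 9, ..., -6
            s = 2.0 ** i
            candidate = [xj - s * gj for xj, gj in zip(x, g)]
            f_candidate = f(candidate)
            if f_candidate < fx:  # accept the first step that decreases f
                improved = True
                break
        if not improved:
            break  # no tried step size helps, so stop (assumed behaviour)
        decrease = fx - f_candidate
        x, fx = candidate, f_candidate
        if decrease < eps:  # stopping criterion: change in f is minuscule
            break
    return x
```

For example, calling it on f(x) = (x_0 - 1)^2 + 2 (x_1 + 2)^2 from the origin returns a point close to (1, -2).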

41
Newton Methods
  • Iteration: $x_{n+1} = x_n - \eta_n H^{-1} \nabla f(x_n)$
  • Approximate $f$ by a 2nd order Taylor expansion
  • The error can now decrease quadratically

42
Newton Descent Algorithm
  • Input: x_0, objective f(x), tolerance ε, maximum number of iterations T
  • Output: x_star that minimizes f(x)
  • t = 0
  • while (t == 0 or (f(x_{t-1}) - f(x_t) >= ε and t < T)), e.g. T = 10
      • g_t = gradient of f(x) at x_t
      • h_t = Hessian matrix of f(x) at x_t
      • s = inverse matrix of h_t
      • x_{t+1} = x_t - s g_t
      • t = t + 1
  • Output x_t
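
A small runnable version of the Newton iteration, restricted to two variables so that the Hessian can be inverted in closed form, might look like the sketch below. It takes the full step (step size 1) and stops when the step becomes tiny. Both choices, and all names, are assumptions made for illustration.

```python
def newton_2d(grad, hess, x0, tol=1e-10, max_iter=10):
    """Newton's method for a function of two variables.

    grad(x, y) returns (g0, g1); hess(x, y) returns ((a, b), (c, d)).
    """
    x, y = x0
    for _ in range(max_iter):
        g0, g1 = grad(x, y)
        (a, b), (c, d) = hess(x, y)
        det = a * d - b * c
        # Newton step H^{-1} g, using the explicit inverse of a 2x2 matrix.
        step_x = (d * g0 - b * g1) / det
        step_y = (-c * g0 + a * g1) / det
        x, y = x - step_x, y - step_y
        if abs(step_x) + abs(step_y) < tol:
            break
    return x, y
```

On a quadratic such as f(x, y) = x^2 + xy + y^2 - 3x the second-order Taylor expansion is exact, so a single step from any start lands on the minimizer (2, -1).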

43
Quasi-Newton Methods
  • Computing and inverting the Hessian is expensive
  • Quasi-Newton methods can approximate $H^{-1}$ directly (LBFGS)
  • Iteration: $x_{n+1} = x_n - \eta_n B_n^{-1} \nabla f(x_n)$
  • Secant equation: $\nabla f(x_{n+1}) - \nabla f(x_n) = B_{n+1} (x_{n+1} - x_n)$
  • The secant equation does not fully determine $B$
  • LBFGS updates $B_{n+1}^{-1}$ using two rank-one matrices

44
Machine Learning Problems from the Probability
View
45
Bayes Decision Rule
  • Bayes decision rule
  • $p(y = +1 \mid x) > p(y = -1 \mid x)$ ? $y = +1$ : $y = -1$
  • $\Leftrightarrow\; p(y = +1 \mid x) > 1/2$ ? $y = +1$ : $y = -1$

46
Bayesian Approach
  • $p(y \mid x, X, Y) = \int_f p(y, f \mid x, X, Y)\, df$
  • $= \int_f p(y \mid f, x, X, Y)\, p(f \mid x, X, Y)\, df$
  • $= \int_f p(y \mid f, x)\, p(f \mid X, Y)\, df$
  • This integral is often intractable.
  • To solve it we can
  • Choose the distributions so that the solution is analytic (conjugate priors)
  • Approximate the true distribution $p(f \mid X, Y)$ by a simpler distribution (variational methods)
  • Sample from $p(f \mid X, Y)$ (MCMC)

47
Maximum A Posteriori (MAP)
  • $p(y \mid x, X, Y) = \int_f p(y \mid f, x)\, p(f \mid X, Y)\, df$
  • $= p(y \mid f_{MAP}, x)$ when $p(f \mid X, Y) = \delta(f - f_{MAP})$
  • The more training data there is, the better $p(f \mid X, Y)$ approximates a delta function
  • We can make predictions using a single function, $f_{MAP}$, and our focus shifts to estimating $f_{MAP}$.

48
MAP and Maximum Likelihood (ML)
  • $f_{MAP} = \arg\max_f p(f \mid X, Y)$
  • $= \arg\max_f p(X, Y \mid f)\, p(f) / p(X, Y)$
  • $= \arg\max_f p(X, Y \mid f)\, p(f)$
  • $f_{ML} \in \arg\max_f p(X, Y \mid f)$ (Maximum Likelihood)
  • Maximum Likelihood holds if
  • There is a lot of training data so that $p(X, Y \mid f) \gg p(f)$
  • Or if there is no prior knowledge so that $p(f)$ is uniform (improper)

49
IID Data
  • $f_{ML} = \arg\max_f p(X, Y \mid f)$
  • $= \arg\max_f \prod_i p(x_i, y_i \mid f)$
  • The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels.
  • In particular, $p(X, Y) \ne \prod_i p(x_i, y_i)$

50
Discriminative Methods: Logistic Regression
51
Discriminative Methods
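  • $\theta_{MAP} = \arg\max_\theta p(\theta) \prod_i p(x_i, y_i \mid \theta)$
  • We assume that
  • $p(\theta) = p(w)\, p(w')$
  • $p(x_i, y_i \mid \theta) = p(y_i \mid x_i, \theta)\, p(x_i \mid \theta)$
  • $= p(y_i \mid x_i, w)\, p(x_i \mid w')$
  • $\theta_{MAP} = \left[\arg\max_w p(w) \prod_i p(y_i \mid x_i, w)\right]$
  • $\left[\arg\max_{w'} p(w') \prod_i p(x_i \mid w')\right]$
  • It turns out that only $w$ plays a role in determining the posterior distribution
  • $p(y \mid x, X, Y) = p(y \mid x, \theta_{MAP}) = p(y \mid x, w_{MAP})$
  • where $w_{MAP} = \arg\max_w p(w) \prod_i p(y_i \mid x_i, w)$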

52
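Disc. Methods: Logistic Regression
  • $\theta_{MAP} = \arg\max_{w,b}\; p(w) \prod_i p(y_i \mid x_i, w)$
  • Regularized Logistic Regression
  • Gaussian prior: $p(w) \propto \exp(-\tfrac{1}{2} \lambda\, w^\top w)$
  • Logistic likelihood:
  • $p(y_i \mid x_i, w) = 1 / \left(1 + \exp(-y_i (b + w^\top x_i))\right)$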

53
Regularized Logistic Regression
  • $\theta_{MAP} = \arg\max_{w,b}\; p(w) \prod_i p(y_i \mid x_i, w)$
  • $= \arg\min_{w,b}\; \tfrac{1}{2} \lambda\, w^\top w + \sum_i \log\left(1 + \exp(-y_i (b + w^\top x_i))\right)$
  • Bad news: no closed form solution for $w$ and $b$
  • Good news: we have to minimize a convex function
  • We can obtain the global optimum
  • The function is smooth
  • Tom Minka, A comparison of numerical optimizers for LR (Matlab code)
  • Keerthi et al., A Fast Dual Algorithm for Kernel Logistic Regression, ML 05
  • Andrew and Gao, OWL-QN, ICML 07
  • Krishnapuram et al., SMLR, PAMI 05

61
Multi-class Logistic Regression
  • Multinomial Logistic Regression
  • 1-vs-All
      • Learn $L$ binary classifiers for an $L$ class problem
      • For the $l$th classifier, examples from class $l$ are +ve while examples from all other classes are -ve
      • Classify new points according to max probability
  • 1-vs-1
      • Learn $L(L-1)/2$ binary classifiers for an $L$ class problem by considering every class pair
      • Classify novel points by majority vote
      • Classify novel points by building a DAG
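
The two prediction rules can be sketched as follows, assuming the binary classifiers have already been trained. The containers `scorers` and `pair_classifiers` are hypothetical interfaces chosen for this illustration, and ties in the vote are broken arbitrarily.

```python
from collections import Counter
from itertools import combinations


def predict_one_vs_all(scorers, x):
    """scorers[l](x) returns the score (or probability) that x belongs to class l."""
    return max(range(len(scorers)), key=lambda l: scorers[l](x))


def predict_one_vs_one(pair_classifiers, num_classes, x):
    """pair_classifiers[(a, b)](x) returns the winning class, either a or b.

    One classifier per class pair votes; the class with the most votes wins.
    """
    votes = Counter(
        pair_classifiers[(a, b)](x)
        for a, b in combinations(range(num_classes), 2)
    )
    return votes.most_common(1)[0][0]
```

The vote evaluates all L(L-1)/2 pairwise classifiers, whereas the DAG variant eliminates one class per comparison and so needs only L-1 of them.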

62
Multi-class Logistic Regression
  • Assume
  • Non-linear multi-class classifier
  • Number of classes = $L$
  • Number of training points per class = $N$
  • Algorithm training time for $M$ points = $O(M^3)$
  • Classification time given $M$ training points = $O(M)$

63
Multi-class Logistic Regression
  • Multinomial Logistic Regression
      • Training time: $O(L^6 N^3)$
      • Classification time for a new point: $O(L^2 N)$
  • 1-vs-All
      • Training time: $O(L^4 N^3)$
      • Classification time for a new point: $O(L^2 N)$
  • 1-vs-1
      • Training time: $O(L^2 N^3)$
      • Majority vote classification time: $O(L^2 N)$
      • DAG classification time: $O(LN)$

64
Multinomial Logistic Regression
  • $\theta_{MAP} = \arg\max_{w,b}\; p(w) \prod_i p(y_i \mid x_i, w)$
  • Regularized Multinomial Logistic Regression
  • Gaussian prior:
  • $p(w) \propto \exp(-\tfrac{1}{2} \lambda \sum_l w_l^\top w_l)$
  • Multinomial logistic posterior:
  • $p(y_i = l \mid x_i, w) = e^{f_l(x_i)} / \sum_k e^{f_k(x_i)}$
  • where $f_k(x_i) = w_k^\top x_i + b_k$
  • Note that we have to learn an extra classifier by not explicitly enforcing $\sum_l p(y_i = l \mid x_i, w) = 1$

65
Multinomial Logistic Regression
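  • $\mathcal{L}(w, b) = \tfrac{1}{2} \lambda \sum_k w_k^\top w_k + \sum_i \left[\log\left(\sum_k e^{f_k(x_i)}\right) - \sum_k \delta_{k, y_i} f_k(x_i)\right]$
  • $\nabla_{w_k} \mathcal{L}(w, b) = \lambda w_k + \sum_i \left[p(y_i = k \mid x_i, w) - \delta_{k, y_i}\right] x_i$
  • $\nabla_{b_k} \mathcal{L}(w, b) = \sum_i \left[p(y_i = k \mid x_i, w) - \delta_{k, y_i}\right]$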
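
For a single training point, the loss term and the bracketed quantity p(y_i = k | x_i, w) - δ_{k,y_i} in the gradients can be computed as in the sketch below. It works directly on the scores f_k(x_i) and subtracts the largest score before exponentiating to avoid overflow. The function names are illustrative, not from the slides.

```python
import math


def softmax(scores):
    """Multinomial logistic posterior p(y = k | x) from the scores f_k(x)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nll_and_grad_scores(scores, label):
    """Per-example loss log(sum_k exp(f_k)) - f_label and its gradient in the scores.

    The k-th gradient entry is p(y = k | x) - [k == label]; it is the per-example
    term of the gradient for b_k, and multiplied by x it gives the term for w_k.
    """
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    loss = log_z - scores[label]
    probs = softmax(scores)
    grad = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return loss, grad
```

The gradient entries always sum to zero, which reflects the redundancy noted on the previous slide: adding the same constant to every score leaves the posterior unchanged.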

66
Multi-class Logistic Regression
67
Multi-class Logistic Regression
68
Multi-class Logistic Regression
69
Multi-class Logistic Regression
70
Multi-class Logistic Regression
71
Multi-class Logistic Regression
72
From Probabilities to Loss Functions
$\theta_{MAP} = \arg\min_{w,b}\; \tfrac{1}{2} \lambda\, w^\top w + \sum_i \log\left(1 + \exp(1 - y_i (b + w^\top x_i))\right)$
73
Support Vector Machines
74
Binary Classification
75
A Separating Hyperplane
76
Maximum Margin Hyperplane
Geometric intuition: choose the perpendicular bisector of the shortest line segment joining the convex hulls of the two classes.
77
SVM Notation
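[Figure: maximum margin hyperplane $w^\top x + b = 0$ with normal $w$ and offset $b$, supporting planes $w^\top x + b = -1$ and $w^\top x + b = 1$ passing through the support vectors, and margin $2 / \sqrt{w^\top w}$]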
78
Calculating the Margin
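  • Let $x^+$ be any point on the +ve supporting plane and $x^-$ the closest point on the -ve supporting plane
  • Margin $= \|x^+ - x^-\|$
  • $= \gamma \|w\|$ (since $x^+ = x^- + \gamma w$)
  • $= 2 \|w\| / \|w\|^2$ (assuming $\gamma = 2 / \|w\|^2$)
  • $= 2 / \|w\|$
  • $w^\top x^+ + b = +1$
  • $w^\top x^- + b = -1$
  • $\Rightarrow w^\top (x^+ - x^-) = 2 \Rightarrow \gamma\, w^\top w = 2 \Rightarrow \gamma = 2 / \|w\|^2$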

79
Hard Margin SVM Primal
  • Maximize $2 / \|w\|$
  • such that $w^\top x_i + b \ge +1$ if $y_i = +1$
  • $w^\top x_i + b \le -1$ if $y_i = -1$
  • Difficult to optimize directly
  • Convex Quadratic Program (QP) reformulation
  • Minimize $\tfrac{1}{2} w^\top w$
  • such that $y_i (w^\top x_i + b) \ge 1$
  • Convex QPs can be easy to optimize

80
Linearly Inseparable Data
  • Minimize $\tfrac{1}{2} w^\top w + C \cdot \#(\text{misclassified points})$
  • such that $y_i (w^\top x_i + b) \ge 1$ (for "good" points)
  • The optimization problem is NP-hard in general
  • Disastrous errors are penalized the same as near misses

81
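Inseparable Data: Hinge Loss
[Figure: hyperplane $w^\top x + b = 0$ with normal $w$, offset $b$, supporting planes $w^\top x + b = -1$ and $w^\top x + b = 1$, and margin $2 / \sqrt{w^\top w}$. Support vectors on the supporting planes have $\xi = 0$, a point inside the margin has $\xi < 1$, and a misclassified point has $\xi > 1$.]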
82
The C-SVM Primal Formulation
  • Minimize $\tfrac{1}{2} w^\top w + C \sum_i \xi_i$
  • such that $y_i (w^\top x_i + b) \ge 1 - \xi_i$
  • $\xi_i \ge 0$
  • The optimization is a convex QP
  • The globally optimal solution will be obtained
  • Number of variables = $D + N + 1$
  • Number of constraints = $2N$
  • Solvers can train on 800K points in 47K (sparse) dimensions in less than 2 minutes on a standard PC
  • Fan et al., LIBLINEAR, JMLR 08
  • Bordes et al., LaRank, ICML 07
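
For a fixed (w, b) the smallest slack satisfying both constraints is ξ_i = max(0, 1 - y_i (w^T x_i + b)), which is the hinge loss of point i. The sketch below evaluates the primal objective this way. It is an illustration with hypothetical names, not a solver.

```python
def svm_primal_objective(w, b, X, y, C):
    """Value of the C-SVM primal at (w, b), with each slack at its smallest feasible value.

    X is a list of feature lists and y a list of labels in {-1, +1}.
    Returns the objective and the list of slacks xi_i.
    """
    slacks = []
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        slacks.append(max(0.0, 1.0 - yi * score))
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks
```

Points with zero slack satisfy the hard-margin constraint; a slack between 0 and 1 marks a margin violator that is still on the correct side, and a slack above 1 marks a misclassified point.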

83
The C-SVM Dual Formulation
  • Maximize $\mathbf{1}^\top \alpha - \tfrac{1}{2} \alpha^\top Y K Y \alpha$
  • such that $\mathbf{1}^\top Y \alpha = 0$
  • $0 \le \alpha \le C$
  • $K$ is a kernel matrix such that $K_{ij} = K(x_i, x_j) = x_i^\top x_j$
  • $\alpha$ are the dual variables (Lagrange multipliers)
  • Knowing $\alpha$ gives us $w$ and $b$
  • The dual is also a convex QP
  • Number of variables = $N$
  • Number of constraints = $2N + 1$
  • Fan et al., LIBSVM, JMLR 05
  • Joachims, SVMLight

84
SVMs versus Regularized LR
Most of the SVM $\alpha$s are zero!
85
SVMs versus Regularized LR
Most of the SVM $\alpha$s are zero!
86
SVMs versus Regularized LR
Most of the SVM $\alpha$s are not zero
87
Duality
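  • Primal: $P = \min_x f_0(x)$
  • s. t. $f_i(x) \le 0$, $1 \le i \le N$
  • $h_i(x) = 0$, $1 \le i \le M$
  • Lagrangian: $L(x, \lambda, \nu) = f_0(x) + \sum_i \lambda_i f_i(x) + \sum_i \nu_i h_i(x)$
  • Dual: $D = \max_{\lambda, \nu} \min_x L(x, \lambda, \nu)$
  • s. t. $\lambda \ge 0$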

88
Duality
  • The Lagrange dual is always concave (even if the
    primal is not convex) and might be an easier
    problem to optimize
  • Weak duality: $P \ge D$
  • Always holds
  • Strong duality: $P = D$
  • Does not always hold
  • Usually holds for convex problems
  • Holds for the SVM QP

89
Karush-Kuhn-Tucker (KKT) Conditions
  • If strong duality holds, then for $x$, $\lambda$ and $\nu$ to be optimal the following KKT conditions must necessarily hold
  • Primal feasibility: $f_i(x) \le 0$ and $h_i(x) = 0$ for all $i$
  • Dual feasibility: $\lambda \ge 0$
  • Stationarity: $\nabla_x L(x, \lambda, \nu) = 0$
  • Complementary slackness: $\lambda_i f_i(x) = 0$
  • If $x$, $\lambda$ and $\nu$ satisfy the KKT conditions for a convex problem then they are optimal

90
SVM Duality
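  • Primal: $P = \min_{w, \xi, b}\; \tfrac{1}{2} w^\top w + C\, \mathbf{1}^\top \xi$
  • s. t. $Y(X^\top w + b\mathbf{1}) \ge \mathbf{1} - \xi$
  • $\xi \ge 0$
  • Lagrangian: $L(\alpha, \beta, w, \xi, b) = \tfrac{1}{2} w^\top w + C\, \mathbf{1}^\top \xi - \beta^\top \xi - \alpha^\top \left[Y(X^\top w + b\mathbf{1}) - \mathbf{1} + \xi\right]$
  • Dual: $D = \max_\alpha\; \mathbf{1}^\top \alpha - \tfrac{1}{2} \alpha^\top Y K Y \alpha$
  • s. t. $\mathbf{1}^\top Y \alpha = 0$
  • $0 \le \alpha \le C$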

91
SVM KKT Conditions
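  • Lagrangian: $L(\alpha, \beta, w, \xi, b) = \tfrac{1}{2} w^\top w + C\, \mathbf{1}^\top \xi - \beta^\top \xi - \alpha^\top \left[Y(X^\top w + b\mathbf{1}) - \mathbf{1} + \xi\right]$
  • Stationarity conditions
  • $\nabla_w L = 0 \Rightarrow w = X Y \alpha$ (Representer Theorem)
  • $\nabla_\xi L = 0 \Rightarrow C\mathbf{1} = \alpha + \beta$
  • $\nabla_b L = 0 \Rightarrow \alpha^\top Y \mathbf{1} = 0$
  • Complementary slackness conditions
  • $\alpha_i \left[y_i (x_i^\top w + b) - 1 + \xi_i\right] = 0$
  • $\beta_i \xi_i = 0$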

92
Hinge Loss and Sparseness in $\alpha$
  • From the stationarity and complementary slackness conditions it is easy to show that
  • $\alpha_i = 0 \Rightarrow x_i$ has been classified correctly and lies beyond its supporting hyperplane
  • $0 < \alpha_i < C \Rightarrow x_i$ is a support vector and lies on its supporting hyperplane
  • $\alpha_i = C \Rightarrow x_i$ has been misclassified or is a margin violator
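
Given the dual variables returned by a solver, the three cases above can be read off directly. The following is a small sketch; the tolerance `tol` is an assumption, needed because numerical solvers rarely return exact zeros or exact values of C.

```python
def describe_point(alpha_i, C, tol=1e-8):
    """Interpret one dual variable alpha_i of a trained C-SVM."""
    if alpha_i <= tol:
        return "classified correctly, beyond its supporting hyperplane"
    if alpha_i >= C - tol:
        return "misclassified or margin violator"
    return "support vector on its supporting hyperplane"


def sparsity(alphas, tol=1e-8):
    """Fraction of dual variables that are (numerically) zero."""
    return sum(1 for a in alphas if a <= tol) / len(alphas)
```

Only points with non-zero alpha enter w = XYα, so a high `sparsity` value means that predictions depend on few training points.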

93
Hinge Loss and Sparseness in $\alpha$
  • SVM $\alpha$s are sparse but LR $\alpha$s are not

94
Linearly Inseparable Data
  • This 1D dataset cannot be separated using a single hyperplane (threshold)
  • We need a non-linear decision boundary

[Figure: 1D dataset plotted along the $x$ axis]
95
Increasing Dimensionality Non-linearly
  • The dataset is now linearly separable in $\phi$ space

[Figure: the same data plotted against $\phi(x) = (x, x^2)$]
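
A small numeric illustration of the lifting follows, using made-up 1D data in which the positive points lie outside the interval [-1, 1] and the negative points inside it. The data values and the hyperplane are assumptions chosen for the example.

```python
def lift(x):
    """Map a scalar x to phi(x) = (x, x^2)."""
    return (x, x * x)


# Hypothetical 1D data: no single threshold on x separates the two classes.
xs = [-3.0, -2.0, -0.5, 0.0, 0.7, 2.5, 3.0]
ys = [+1, +1, -1, -1, -1, +1, +1]

# In phi space the hyperplane w^T phi(x) + b = 0 with w = (0, 1) and b = -2,
# i.e. the line x^2 = 2, separates them.
w, b = (0.0, 1.0), -2.0
predictions = [
    1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1
    for p in map(lift, xs)
]
```

Here `predictions` matches `ys` for every point, even though the labels change sign twice along the original x axis.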
96
The Kernel Trick
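  • Let the lifted training set be $(\phi(x_i), y_i)$
  • Define the kernel such that $K_{ij} = K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$
  • Primal: $P = \min_{w, \xi, b}\; \tfrac{1}{2} w^\top w + C\, \mathbf{1}^\top \xi$
  • s. t. $Y(\phi(X)^\top w + b\mathbf{1}) \ge \mathbf{1} - \xi$
  • $\xi \ge 0$
  • Dual: $D = \max_\alpha\; \mathbf{1}^\top \alpha - \tfrac{1}{2} \alpha^\top Y K Y \alpha$
  • s. t. $\mathbf{1}^\top Y \alpha = 0$
  • $0 \le \alpha \le C$
  • Classifier: $f(x) = \mathrm{sign}(\phi(x)^\top w + b) = \mathrm{sign}(\alpha^\top Y K(\cdot, x) + b)$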

97
The Kernel Trick
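  • Let $\phi(x) = [1, \sqrt{2} x_1, \ldots, \sqrt{2} x_D, x_1^2, \ldots, x_D^2, \sqrt{2} x_1 x_2, \ldots, \sqrt{2} x_1 x_D, \ldots, \sqrt{2} x_{D-1} x_D]^\top$
  • Define $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j) = (x_i^\top x_j + 1)^2$
  • Primal
      • Number of variables = $D_\phi + N + 1$, where $D_\phi$ is the dimension of $\phi(x)$
      • Number of constraints = $2N$
      • Number of flops for calculating $\phi(x)^\top w$ = $O(D^2)$
      • Number of flops for a degree 20 polynomial = $O(D^{20})$
  • Dual
      • Number of variables = $N$
      • Number of constraints = $2N + 1$
      • Number of flops for calculating $K_{ij}$ = $O(D)$
      • Number of flops for a degree 20 polynomial = $O(D)$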
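
The identity between the explicit degree-2 feature map and the kernel can be checked numerically. The sketch below builds φ(x) exactly as listed above and compares the inner product in the lifted space with (x^T z + 1)^2. The test vectors are arbitrary.

```python
import math
from itertools import combinations


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel."""
    root2 = math.sqrt(2.0)
    features = [1.0]
    features += [root2 * xi for xi in x]
    features += [xi * xi for xi in x]
    features += [root2 * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return features


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """K(x, z) = (x^T z + 1)^2, computed in O(D) without forming phi."""
    return (dot(x, z) + 1.0) ** 2


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]
lifted_value = dot(phi(x), phi(z))  # O(D^2) work in the lifted space
kernel_value = poly2_kernel(x, z)   # O(D) work in the original space
```

Both values agree up to rounding, while `phi(x)` already has 1 + 2D + D(D-1)/2 entries, which is the source of the O(D^2) primal cost.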

98
Some Popular Kernels
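  • Linear: $K(x_i, x_j) = x_i^\top \Sigma^{-1} x_j$
  • Polynomial: $K(x_i, x_j) = (x_i^\top \Sigma^{-1} x_j + c)^d$
  • Gaussian (RBF): $K(x_i, x_j) = \exp\left(-\sum_k \gamma_k (x_{ik} - x_{jk})^2\right)$
  • Chi-Squared: $K(x_i, x_j) = \exp\left(-\gamma\, \chi^2(x_i, x_j)\right)$
  • Sigmoid: $K(x_i, x_j) = \tanh(x_i^\top x_j - c)$
  • $\Sigma$ should be positive definite, $c \ge 0$, $\gamma \ge 0$ and $d$ should be a natural number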
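
Three of these kernels are sketched below in plain Python. The simplifications are assumptions made to keep the example short: Σ is taken to be the identity, and the RBF kernel uses one shared γ rather than one γ_k per dimension.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    # Assumes Sigma is the identity, so x^T Sigma^{-1} z reduces to x^T z.
    return dot(x, z)


def polynomial_kernel(x, z, c=1.0, d=2):
    # Same identity-Sigma assumption; c >= 0 and d a natural number.
    return (dot(x, z) + c) ** d


def rbf_kernel(x, z, gamma=1.0):
    # Assumes a single gamma >= 0 shared by all dimensions.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def gram_matrix(kernel, points):
    """Kernel matrix with entries K_ij = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]
```

The matrix returned by `gram_matrix` is the K that appears in the dual objective; it is symmetric, and for the RBF kernel its diagonal is all ones.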

100
Valid Kernels: Mercer's Theorem
  • Let $Z$ be a compact subset of $\mathbb{R}^D$ and $K$ a continuous symmetric function. Then $K$ is a kernel if
  • $\int_Z \int_Z f(x)\, K(x, z)\, f(z)\, dx\, dz \ge 0$
  • for all square-integrable real-valued functions $f$ on $Z$.
  • $K$ is a kernel if every finite symmetric matrix formed by evaluating $K$ on pairs of points from $Z$ is positive semi-definite

101
Operations on Kernels
  • The following operations result in valid kernels
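  • $K(x_i, x_j) = \sum_k \alpha_k K_k(x_i, x_j)$ ($\alpha_k \ge 0$)
  • $K(x_i, x_j) = \prod_k K_k(x_i, x_j)$
  • $K(x_i, x_j) = f(x_i)\, f(x_j)$ ($f: \mathbb{R}^D \to \mathbb{R}$)
  • $K(x_i, x_j) = p(K_1(x_i, x_j))$ ($p$ a polynomial with +ve coefficients)
  • $K(x_i, x_j) = \exp(K_1(x_i, x_j))$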
  • Kernels can be defined over graphs, sets,
    strings and many other interesting data structures

102
Kernels
  • Kernels should encode all our prior knowledge
    about feature similarities.
  • Kernel parameters can be chosen through cross
    validation or learnt (see Multiple Kernel
    Learning).
  • Non-linear kernels can sometimes boost
    classification performance tremendously.
  • Non-linear kernels are generally expensive (both
    during training and for prediction)

103
Polynomial Kernel of Degree 2
104
Polynomial Kernel of Degree 5
105
RBF Kernel
106
Exponential $\chi^2$ Kernel
107
Kernel Parameter Setting - Underfitting
108
Kernel Parameter Setting
109
Kernel Parameter Setting - Overfitting
110
Structured Output Prediction
  • Minimize over $f$: $\tfrac{1}{2} \|f\|^2 + C \sum_i \xi_i$
  • such that $f(x_i, y_i) \ge f(x_i, y) + \Delta(y_i, y) - \xi_i \quad \forall y \ne y_i$
  • $\xi_i \ge 0$
  • Prediction: $\arg\max_y f(x, y)$
  • This formulation minimizes the hinge on the loss $\Delta$ on the training set subject to regularization on $f$
  • Can be used to predict sets, graphs, etc. for suitable choices of $\Delta$
  • Taskar et al., Max-Margin Markov Networks, NIPS 03
  • Tsochantaridis et al., Large Margin Methods for Structured and Interdependent Output Variables, JMLR 05

111
Multi-Class SVM
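  • Minimize over $f$: $\tfrac{1}{2} \|f\|^2 + C \sum_i \xi_i$
  • such that $f(x_i, y_i) \ge f(x_i, y) + \Delta(y_i, y) - \xi_i \quad \forall y \ne y_i$
  • $\xi_i \ge 0$
  • Prediction: $\arg\max_y f(x, y)$
  • $\Delta(y_i, y) = 1 - \delta_{y_i, y}$
  • $f(x, y) = w^\top [\phi(x) \otimes \psi(y)] + b^\top \psi(y)$
  • $= w_y^\top \phi(x) + b_y$ (assuming $\psi(y) = e_y$)
  • Weston and Watkins, SVMs for Multi-Class Pattern Recognition, ESANN 99
  • Bordes et al., LaRank, ICML 07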

112
Multi-Class SVM
  • $\min_{w, b}\; \tfrac{1}{2} \sum_k w_k^\top w_k + C \sum_i \xi_i$
  • s. t. $w_{y_i}^\top \phi(x_i) + b_{y_i} \ge w_y^\top \phi(x_i) + b_y + 1 - \xi_i \quad \forall i,\; \forall y \ne y_i$
  • $\xi_i \ge 0$
  • Prediction: $\arg\max_y\; w_y^\top \phi(x) + b_y$
  • For $L$ classes, with $N$ points per class, the number of constraints is $N L^2$
  • Finding the exact solution for real world non-linear problems is often infeasible
  • In practice, we can obtain an approximate solution or switch to the 1-vs-All or 1-vs-1 formulations