Title: Multiple Kernel Learning
1. Multiple Kernel Learning
Manik Varma, Microsoft Research India
2. Object Categorization and Detection
3. Experimental Results: Caltech 101
- Adding the Gist kernel and some post-processing gives 98.2% [Bosch et al., submitted to IJCV].
4. Experimental Results: Caltech 256
5. Experimental Results: PASCAL VOC 2007
6. The C-SVM Primal Formulation
- Minimise $\tfrac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_i \xi_i$
- Subject to
  - $y_i(\mathbf{w}^t\phi(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
- where
  - $(\mathbf{x}_i, y_i)$ is the $i^{th}$ training point.
  - $C$ is the misclassification penalty.
- Decision function: $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^t\phi(\mathbf{x}) + b)$ (a minimal sketch follows).
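To make this concrete, here is a minimal sketch of training a C-SVM and evaluating its decision function with scikit-learn; the toy data, kernel choice, and value of C are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# C is the misclassification penalty from the primal objective.
clf = SVC(kernel="rbf", C=10.0)
clf.fit(X, y)

# decision_function returns w^t phi(x) + b; its sign is the predicted label.
scores = clf.decision_function(X)
assert np.array_equal(np.sign(scores), clf.predict(X))
```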
7. The C-SVM Dual Formulation
- Maximise $\mathbf{1}^t\boldsymbol{\alpha} - \tfrac{1}{2}\boldsymbol{\alpha}^t Y K Y \boldsymbol{\alpha}$
- Subject to
  - $\mathbf{1}^t Y \boldsymbol{\alpha} = 0$
  - $0 \le \boldsymbol{\alpha} \le C$
- where
  - $\boldsymbol{\alpha}$ are the Lagrange multipliers corresponding to the support vector coefficients.
  - $Y$ is a diagonal matrix such that $Y_{ii} = y_i$ (see the sketch below for recovering $f$ from $\boldsymbol{\alpha}$).
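The dual variables are what SVM solvers actually return. A small sketch (assuming a linear kernel and toy data as above) showing that the decision function can be rebuilt from the coefficients $y_i\alpha_i$ that scikit-learn exposes:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only.
K = X @ clf.support_vectors_.T               # kernel between all points and SVs
f = K @ clf.dual_coef_.ravel() + clf.intercept_[0]
assert np.allclose(f, clf.decision_function(X))
```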
8. Kernel Target Alignment
- Kernel Target Alignment [Cristianini et al. 2001]
- Alignment
  - $A(K_1, K_2) = \langle K_1, K_2\rangle / (\langle K_1, K_1\rangle \langle K_2, K_2\rangle)^{1/2}$
  - where $\langle K_1, K_2\rangle = \sum_i \sum_j K_1(\mathbf{x}_i, \mathbf{x}_j) K_2(\mathbf{x}_i, \mathbf{x}_j)$
- Ideal kernel: $K_{ideal} = \mathbf{y}\mathbf{y}^t$
- Alignment to the ideal kernel
  - $A(K) = \langle K, \mathbf{y}\mathbf{y}^t\rangle / (n \langle K, K\rangle^{1/2})$
- Optimal kernel
  - $K_{opt} = \sum_k d_k K_k$ where $K_k = \mathbf{v}_k \mathbf{v}_k^t$ (rank 1); computing the alignment is sketched below.
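The alignment is just a normalised Frobenius inner product, so it is a few lines of NumPy. This sketch (toy kernel and labels assumed) computes the pairwise alignment and the alignment to the ideal kernel $\mathbf{y}\mathbf{y}^t$:

```python
import numpy as np

def alignment(K1, K2):
    """A(K1, K2) = <K1, K2> / sqrt(<K1, K1><K2, K2>), Frobenius inner products."""
    inner = lambda A, B: np.sum(A * B)
    return inner(K1, K2) / np.sqrt(inner(K1, K1) * inner(K2, K2))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sign(X[:, 0])                       # toy labels
K = X @ X.T                                # toy linear kernel

# Alignment to the ideal kernel yy^t; since <yy^t, yy^t> = n^2, this
# reduces to the n * <K, K>^(1/2) normalisation on the slide.
print(alignment(K, np.outer(y, y)))
```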
9. Kernel Target Alignment
- Optimal alignment
  - $A(K_{opt}) = \sum_k d_k \langle \mathbf{v}_k, \mathbf{y}\rangle^2 / (n (\sum_k d_k^2)^{1/2})$
- Assume $\sum_k d_k^2 = 1$.
- Lagrangian
  - $L(\lambda, \mathbf{d}) = \sum_k d_k \langle \mathbf{v}_k, \mathbf{y}\rangle^2 + \lambda(\sum_k d_k^2 - 1)$
- Optimal weights: $d_k \propto \langle \mathbf{v}_k, \mathbf{y}\rangle^2$ (sketched below).
- Some generalisation bounds have been given, but the task is not directly related to classification.
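Taking the rank-one kernels $K_k = \mathbf{v}_k\mathbf{v}_k^t$ to be the eigenvectors of a base kernel (an assumption for illustration), the closed-form weights $d_k \propto \langle \mathbf{v}_k, \mathbf{y}\rangle^2$ give the alignment-maximising recombination:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sign(X[:, 0])
K = X @ X.T

# Eigenvectors of the base kernel supply the rank-one kernels v_k v_k^t.
_, V = np.linalg.eigh(K)

# Closed-form alignment-optimal weights: d_k proportional to <v_k, y>^2,
# normalised so that sum_k d_k^2 = 1.
d = (V.T @ y) ** 2
d /= np.linalg.norm(d)

K_opt = sum(dk * np.outer(vk, vk) for dk, vk in zip(d, V.T))
```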
10. Multiple Kernel Learning: SDP
[Figure: the $(d_1, d_2)$ weight space for $K = d_1 K_1 + d_2 K_2$. The region where $K \succeq 0$ can be searched with an SDP (Lanckriet et al.); outside it, finding the weights is NP hard and requires brute-force search.]
11. Multiple Kernel Learning: SDP
- Multiple Kernel Learning [Lanckriet et al. 2002]
- Minimise $\tfrac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_i \xi_i$
- Subject to
  - $y_i(\mathbf{w}^t\phi_d(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
  - $K = \sum_k d_k K_k$ is positive semi-definite
  - $\mathrm{trace}(K) = \text{constant}$
- The optimisation is an SDP (an SOCP if $d_k \ge 0$).
- Other loss functions are possible (square hinge, KTA; a KTA sketch follows).
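As one concrete instance of this family, here is a sketch of the KTA variant: maximise alignment with the ideal kernel over a PSD combination with fixed trace. Everything here is an assumption for illustration: cvxpy as the solver, the toy kernels, and the simplified objective (the fixed trace standing in for the $\langle K, K\rangle$ normalisation).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = np.sign(X[:, 0])

# Two base kernels; the weights d_k may be negative, so K >> 0 must be enforced.
K1 = X @ X.T
K1 = 0.5 * (K1 + K1.T)                       # make symmetry exact
K2 = np.exp(-0.5 * np.square(X[:, None] - X[None]).sum(-1))

d = cp.Variable(2)
K = d[0] * K1 + d[1] * K2

prob = cp.Problem(
    cp.Maximize(cp.sum(cp.multiply(np.outer(y, y), K))),  # <K, yy^t>
    [K >> 0, cp.trace(K) == 30.0],           # PSD and fixed-trace constraints
)
prob.solve()
print("weights:", d.value)
```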
12. Multiple Kernel Learning: Block l1
[Figure: the same $(d_1, d_2)$ weight space for $K = d_1 K_1 + d_2 K_2$. The block-l1 methods (Bach et al., Sonnenburg et al., Rakotomamonjy et al.) restrict the search to the simplex $\sum_k d_k = 1$, a subset of the SDP region $K \succeq 0$; the NP-hard region still requires brute-force search.]
13. MKL: Block l1 via SMO
- MKL with block-l1 regularisation and SMO [Bach et al. 2004]
- Minimise $\tfrac{1}{2}(\sum_k d_k \|\mathbf{w}_k\|_2)^2 + C\sum_i \xi_i + \tfrac{1}{2}\sum_k a_k^2 \|\mathbf{w}_k\|_2^2$
- Subject to
  - $y_i(\sum_k \mathbf{w}_k^t\phi_k(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
- Moreau-Yosida regularisation ensures differentiability (needed for SMO).
- Block-l1 regularisation ensures sparsity (of the kernel weights, enabling SMO).
- The optimisation is carried out via iterative SMO.
14. MKL: Block l1 via SILP
- MKL with block-l1 regularisation as a SILP [Sonnenburg et al. 2005]
- Minimise $\tfrac{1}{2}(\sum_k d_k \|\mathbf{w}_k\|_2)^2 + C\sum_i \xi_i$
- Subject to
  - $y_i(\sum_k \mathbf{w}_k^t\phi_k(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
  - $\sum_k d_k = 1$
- Iterative SILP-QP solution.
- Can solve a 10-million-point problem with 20 kernels.
- Generalises to regression, novelty detection, etc.
15. Other Formulations
- Hyperkernels [Ong and Smola 2002]
  - Learn a kernel per training point (not per class).
  - The SDP formulation was improved to an SOCP [Tsang and Kwok 2006].
- Boosting
  - Exp/Log loss over pairs of distances [Crammer et al. 2002]
  - LPBoost [Bi et al. 2004]
  - KernelBoost [Hertz et al. 2006]
- Multi-class MKL [Zien and Ong 2007]
16. Multiple Kernel Learning [Varma & Ray 2007]
[Figure: the $(d_1, d_2)$ weight space for $K = d_1 K_1 + d_2 K_2$ once more. Our formulation searches the whole region $d \ge 0$ (an SOCP region), which is larger than the block-l1 simplex yet still inside the tractable SDP region $K \succeq 0$; the NP-hard region requires brute-force search.]
17. Our Primal Formulation
- Minimise $\tfrac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_i \xi_i + \sum_k \sigma_k d_k$
- Subject to
  - $y_i(\mathbf{w}^t\phi_d(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
  - $d_k \ge 0$
  - $K(\mathbf{x}_i, \mathbf{x}_j) = \sum_k d_k K_k(\mathbf{x}_i, \mathbf{x}_j)$
- Admits a very efficient gradient-descent based solution.
- Similar to Rakotomamonjy et al. 2007, but more accurate since our search space is larger.
18. Final Algorithm
- Initialise $\mathbf{d}^0$ randomly.
- Repeat until convergence:
  - Form $K(\mathbf{x}, \mathbf{y}) = \sum_k d_k^n K_k(\mathbf{x}, \mathbf{y})$.
  - Use any SVM solver to solve the standard SVM problem with kernel $K$ and obtain $\boldsymbol{\alpha}$.
  - Update $d_k^{n+1} = d_k^n - \epsilon^n(\sigma_k - \tfrac{1}{2}\boldsymbol{\alpha}^t Y K_k Y \boldsymbol{\alpha})$.
  - Project $\mathbf{d}^{n+1}$ back onto the feasible set if it violates the constraints $\mathbf{d}^{n+1} \ge 0$ (the whole loop is sketched in code below).
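A minimal Python sketch of this loop, using scikit-learn's SVC with a precomputed kernel as the inner solver; the toy data, base kernels, step size, iteration count, and regulariser weights $\sigma_k$ are all assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# Base kernels: linear and RBF (illustrative choices).
Ks = [X @ X.T, np.exp(-0.5 * np.square(X[:, None] - X[None]).sum(-1))]
sigma = np.ones(len(Ks))             # regulariser weights sigma_k (assumed)
d = rng.uniform(0.1, 1.0, len(Ks))   # random initialisation d^0
eps = 1e-3                           # fixed step size (assumed)

for _ in range(100):
    # Form the combined kernel and solve the standard SVM problem.
    K = sum(dk * Kk for dk, Kk in zip(d, Ks))
    svm = SVC(C=10.0, kernel="precomputed").fit(K, y)

    # Recover alpha_i * y_i on the support vectors, padded to full length.
    ay = np.zeros(len(y))
    ay[svm.support_] = svm.dual_coef_.ravel()

    # Gradient step: d_k <- d_k - eps * (sigma_k - 1/2 alpha^t Y K_k Y alpha),
    # where alpha^t Y K_k Y alpha = ay @ K_k @ ay.
    grad = sigma - 0.5 * np.array([ay @ Kk @ ay for Kk in Ks])
    d = np.maximum(d - eps * grad, 0.0)          # project onto d >= 0

print("learnt kernel weights:", d)
```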
19. Kernel Generalizations
- The learnt kernel can now have any functional form as long as
  - $\nabla_d K(\mathbf{d})$ exists and is continuous, and
  - $K(\mathbf{d})$ is strictly positive definite for feasible $\mathbf{d}$.
- For example, $K(\mathbf{d}) = \sum_k d_{k0} \prod_l \exp(-d_{kl}\, \chi_l^2)$ (sketched below).
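A sketch of one such generalised kernel, a weighted sum over products of exponentials. Per-feature squared Euclidean distances stand in here for the per-feature $\chi_l^2$ distances, which is an assumption for illustration:

```python
import numpy as np

def generalized_kernel(X, d0, D):
    """K = sum_k d0[k] * prod_l exp(-D[k, l] * dist_l^2).

    X: (n, L) data; d0: (m,) mixture weights; D: (m, L) per-feature decay rates.
    """
    # dist2[l] is the (n, n) matrix of squared distances in feature l
    # (assumed stand-in for the chi-squared distances on the slide).
    dist2 = np.square(X.T[:, :, None] - X.T[:, None, :])   # (L, n, n)
    return sum(
        d0[k] * np.exp(-(D[k, :, None, None] * dist2).sum(0))
        for k in range(len(d0))
    )

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = generalized_kernel(X, d0=np.array([1.0, 0.5]), D=rng.uniform(0.1, 1.0, (2, 3)))
```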
20. Regularization Generalizations
- Any regulariser can be used as long as it has a continuous first derivative with respect to $\mathbf{d}$.
- We can now put Gaussian rather than Laplacian priors on the kernel weights.
- We can, once again, have negative weights.
21. Loss Function Generalizations
- The loss function can be generalised to handle (a regression sketch follows the list):
  - Regression.
  - Novelty detection (1-class SVM).
  - Multi-class classification.
  - Ordinal regression.
  - Ranking.
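Because the inner solver is treated as a black box, swapping the loss only means swapping that solver. A sketch (illustrative data and kernel weights assumed) of reusing a learnt combined kernel for regression via scikit-learn's SVR:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
t = X[:, 0] + 0.1 * rng.normal(size=60)      # real-valued targets

# Any fixed kernel weights d (e.g. from the MKL loop above) define K.
d = np.array([0.7, 0.3])
Ks = [X @ X.T, np.exp(-0.5 * np.square(X[:, None] - X[None]).sum(-1))]
K = sum(dk * Kk for dk, Kk in zip(d, Ks))

# The epsilon-insensitive loss replaces the hinge loss; the wrapper is unchanged.
reg = SVR(kernel="precomputed", C=10.0, epsilon=0.1).fit(K, t)
print(reg.predict(K[:5]))                    # kernel rows of test vs training points
```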
22. Penalties
- Our formulation is no longer convex.
- Somehow, this does not seem to make much of a difference in practice.
- Furthermore, early termination can sometimes improve results.
23. Regression on Hot or Not Training Data
[Figure: example training images from the Hot or Not website with their user ratings, e.g. 6.5, 7.3, 7.7, 8.7, 9.4.]
24. Predict Hotness
25. Learning Discriminative Object Features/Pixels