Title: Multiple Kernel Learning
1. Multiple Kernel Learning
Manik Varma, Microsoft Research India
2. Object Categorization and Detection
3. Experimental Results: Caltech 101
- Adding the Gist kernel and some post-processing gives 98.2% [Bosch et al., submitted to IJCV].
4. Experimental Results: Caltech 256
5. Experimental Results: PASCAL VOC 2007
6. The C-SVM Primal Formulation
- Minimise $\tfrac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_i \xi_i$
- Subject to
  - $y_i(\mathbf{w}^t\phi(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
- where
  - $(\mathbf{x}_i, y_i)$ is the $i^{th}$ training point.
  - $C$ is the misclassification penalty.
- Decision function: $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^t\phi(\mathbf{x}) + b)$ (a minimal sketch follows).
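To make this concrete, here is a minimal sketch of training a C-SVM and evaluating its decision function with scikit-learn; the toy data, kernel choice, and value of C are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# C is the misclassification penalty from the primal objective.
clf = SVC(kernel="rbf", C=10.0)
clf.fit(X, y)

# decision_function returns w^t phi(x) + b; its sign is the predicted label.
scores = clf.decision_function(X)
assert np.array_equal(np.sign(scores), clf.predict(X))
```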
7. The C-SVM Dual Formulation
- Maximise $\mathbf{1}^t\boldsymbol{\alpha} - \tfrac{1}{2}\boldsymbol{\alpha}^t Y K Y \boldsymbol{\alpha}$
- Subject to
  - $\mathbf{1}^t Y \boldsymbol{\alpha} = 0$
  - $0 \le \boldsymbol{\alpha} \le C$
- where
  - $\boldsymbol{\alpha}$ are the Lagrange multipliers corresponding to the support vector coefficients.
  - $Y$ is a diagonal matrix such that $Y_{ii} = y_i$ (see the sketch below for recovering $f$ from $\boldsymbol{\alpha}$).
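The dual variables are what SVM solvers actually return. A small sketch (assuming a linear kernel and toy data as above) showing that the decision function can be rebuilt from the coefficients $y_i\alpha_i$ that scikit-learn exposes:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only.
K = X @ clf.support_vectors_.T               # kernel between all points and SVs
f = K @ clf.dual_coef_.ravel() + clf.intercept_[0]
assert np.allclose(f, clf.decision_function(X))
```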
8. Kernel Target Alignment
- Kernel Target Alignment [Cristianini et al. 2001]
- Alignment
  - $A(K_1, K_2) = \langle K_1, K_2\rangle / (\langle K_1, K_1\rangle \langle K_2, K_2\rangle)^{1/2}$
  - where $\langle K_1, K_2\rangle = \sum_i \sum_j K_1(\mathbf{x}_i, \mathbf{x}_j) K_2(\mathbf{x}_i, \mathbf{x}_j)$
- Ideal kernel: $K_{ideal} = \mathbf{y}\mathbf{y}^t$
- Alignment to the ideal kernel
  - $A(K) = \langle K, \mathbf{y}\mathbf{y}^t\rangle / (n \langle K, K\rangle^{1/2})$
- Optimal kernel
  - $K_{opt} = \sum_k d_k K_k$ where $K_k = \mathbf{v}_k \mathbf{v}_k^t$ (rank 1); computing the alignment is sketched below.
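The alignment is just a normalised Frobenius inner product, so it is a few lines of NumPy. This sketch (toy kernel and labels assumed) computes the pairwise alignment and the alignment to the ideal kernel $\mathbf{y}\mathbf{y}^t$:

```python
import numpy as np

def alignment(K1, K2):
    """A(K1, K2) = <K1, K2> / sqrt(<K1, K1><K2, K2>), Frobenius inner products."""
    inner = lambda A, B: np.sum(A * B)
    return inner(K1, K2) / np.sqrt(inner(K1, K1) * inner(K2, K2))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sign(X[:, 0])                       # toy labels
K = X @ X.T                                # toy linear kernel

# Alignment to the ideal kernel yy^t; since <yy^t, yy^t> = n^2, this
# reduces to the n * <K, K>^(1/2) normalisation on the slide.
print(alignment(K, np.outer(y, y)))
```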
9. Kernel Target Alignment
- Optimal alignment
  - $A(K_{opt}) = \sum_k d_k \langle \mathbf{v}_k, \mathbf{y}\rangle^2 / (n (\sum_k d_k^2)^{1/2})$
- Assume $\sum_k d_k^2 = 1$.
- Lagrangian
  - $L(\lambda, \mathbf{d}) = \sum_k d_k \langle \mathbf{v}_k, \mathbf{y}\rangle^2 + \lambda(\sum_k d_k^2 - 1)$
- Optimal weights: $d_k \propto \langle \mathbf{v}_k, \mathbf{y}\rangle^2$ (sketched below).
- Some generalisation bounds have been given, but the task is not directly related to classification.
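Taking the rank-one kernels $K_k = \mathbf{v}_k\mathbf{v}_k^t$ to be the eigenvectors of a base kernel (an assumption for illustration), the closed-form weights $d_k \propto \langle \mathbf{v}_k, \mathbf{y}\rangle^2$ give the alignment-maximising recombination:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sign(X[:, 0])
K = X @ X.T

# Eigenvectors of the base kernel supply the rank-one kernels v_k v_k^t.
_, V = np.linalg.eigh(K)

# Closed-form alignment-optimal weights: d_k proportional to <v_k, y>^2,
# normalised so that sum_k d_k^2 = 1.
d = (V.T @ y) ** 2
d /= np.linalg.norm(d)

K_opt = sum(dk * np.outer(vk, vk) for dk, vk in zip(d, V.T))
```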
10. Multiple Kernel Learning: SDP
[Figure: the $(d_1, d_2)$ weight space for $K = d_1 K_1 + d_2 K_2$. The region where $K \succeq 0$ can be searched with an SDP (Lanckriet et al.); outside it, finding the weights is NP hard and requires brute-force search.]
11. Multiple Kernel Learning: SDP
- Multiple Kernel Learning [Lanckriet et al. 2002]
- Minimise $\tfrac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_i \xi_i$
- Subject to
  - $y_i(\mathbf{w}^t\phi_d(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
  - $K = \sum_k d_k K_k$ is positive semi-definite
  - $\mathrm{trace}(K) = \text{constant}$
- The optimisation is an SDP (an SOCP if $d_k \ge 0$).
- Other loss functions are possible (square hinge, KTA; a KTA sketch follows).
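As one concrete instance of this family, here is a sketch of the KTA variant: maximise alignment with the ideal kernel over a PSD combination with fixed trace. Everything here is an assumption for illustration: cvxpy as the solver, the toy kernels, and the simplified objective (the fixed trace standing in for the $\langle K, K\rangle$ normalisation).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = np.sign(X[:, 0])

# Two base kernels; the weights d_k may be negative, so K >> 0 must be enforced.
K1 = X @ X.T
K1 = 0.5 * (K1 + K1.T)                       # make symmetry exact
K2 = np.exp(-0.5 * np.square(X[:, None] - X[None]).sum(-1))

d = cp.Variable(2)
K = d[0] * K1 + d[1] * K2

prob = cp.Problem(
    cp.Maximize(cp.sum(cp.multiply(np.outer(y, y), K))),  # <K, yy^t>
    [K >> 0, cp.trace(K) == 30.0],           # PSD and fixed-trace constraints
)
prob.solve()
print("weights:", d.value)
```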
12. Multiple Kernel Learning: Block l1
[Figure: the same $(d_1, d_2)$ weight space for $K = d_1 K_1 + d_2 K_2$. The block-l1 methods (Bach et al., Sonnenburg et al., Rakotomamonjy et al.) restrict the search to the simplex $\sum_k d_k = 1$, a subset of the SDP region $K \succeq 0$; the NP-hard region still requires brute-force search.]
13. MKL: Block l1 via SMO
- MKL with block-l1 regularisation and SMO [Bach et al. 2004]
- Minimise $\tfrac{1}{2}(\sum_k d_k \|\mathbf{w}_k\|_2)^2 + C\sum_i \xi_i + \tfrac{1}{2}\sum_k a_k^2 \|\mathbf{w}_k\|_2^2$
- Subject to
  - $y_i(\sum_k \mathbf{w}_k^t\phi_k(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
- Moreau-Yosida regularisation ensures differentiability (needed for SMO).
- Block-l1 regularisation ensures sparsity (of the kernel weights, enabling SMO).
- The optimisation is carried out via iterative SMO.
14. MKL: Block l1 via SILP
- MKL with block-l1 regularisation as a SILP [Sonnenburg et al. 2005]
- Minimise $\tfrac{1}{2}(\sum_k d_k \|\mathbf{w}_k\|_2)^2 + C\sum_i \xi_i$
- Subject to
  - $y_i(\sum_k \mathbf{w}_k^t\phi_k(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
  - $\sum_k d_k = 1$
- Iterative SILP-QP solution.
- Can solve a 10-million-point problem with 20 kernels.
- Generalises to regression, novelty detection, etc.
15. Other Formulations
- Hyperkernels [Ong and Smola 2002]
  - Learn a kernel per training point (not per class).
  - The SDP formulation was improved to an SOCP [Tsang and Kwok 2006].
- Boosting
  - Exp/Log loss over pairs of distances [Crammer et al. 2002]
  - LPBoost [Bi et al. 2004]
  - KernelBoost [Hertz et al. 2006]
- Multi-class MKL [Zien and Ong 2007]
16. Multiple Kernel Learning [Varma & Ray 2007]
[Figure: the $(d_1, d_2)$ weight space for $K = d_1 K_1 + d_2 K_2$ once more. Our formulation searches the whole region $d \ge 0$ (an SOCP region), which is larger than the block-l1 simplex yet still inside the tractable SDP region $K \succeq 0$; the NP-hard region requires brute-force search.]
17. Our Primal Formulation
- Minimise $\tfrac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_i \xi_i + \sum_k \sigma_k d_k$
- Subject to
  - $y_i(\mathbf{w}^t\phi_d(\mathbf{x}_i) + b) \ge 1 - \xi_i$
  - $\xi_i \ge 0$
  - $d_k \ge 0$
  - $K(\mathbf{x}_i, \mathbf{x}_j) = \sum_k d_k K_k(\mathbf{x}_i, \mathbf{x}_j)$
- Admits a very efficient gradient-descent based solution.
- Similar to Rakotomamonjy et al. 2007, but more accurate since our search space is larger.
18. Final Algorithm
- Initialise $\mathbf{d}^0$ randomly.
- Repeat until convergence:
  - Form $K(\mathbf{x}, \mathbf{y}) = \sum_k d_k^n K_k(\mathbf{x}, \mathbf{y})$.
  - Use any SVM solver to solve the standard SVM problem with kernel $K$ and obtain $\boldsymbol{\alpha}$.
  - Update $d_k^{n+1} = d_k^n - \epsilon^n(\sigma_k - \tfrac{1}{2}\boldsymbol{\alpha}^t Y K_k Y \boldsymbol{\alpha})$.
  - Project $\mathbf{d}^{n+1}$ back onto the feasible set if it violates the constraints $\mathbf{d}^{n+1} \ge 0$ (the whole loop is sketched in code below).
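A minimal Python sketch of this loop, using scikit-learn's SVC with a precomputed kernel as the inner solver; the toy data, base kernels, step size, iteration count, and regulariser weights $\sigma_k$ are all assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# Base kernels: linear and RBF (illustrative choices).
Ks = [X @ X.T, np.exp(-0.5 * np.square(X[:, None] - X[None]).sum(-1))]
sigma = np.ones(len(Ks))             # regulariser weights sigma_k (assumed)
d = rng.uniform(0.1, 1.0, len(Ks))   # random initialisation d^0
eps = 1e-3                           # fixed step size (assumed)

for _ in range(100):
    # Form the combined kernel and solve the standard SVM problem.
    K = sum(dk * Kk for dk, Kk in zip(d, Ks))
    svm = SVC(C=10.0, kernel="precomputed").fit(K, y)

    # Recover alpha_i * y_i on the support vectors, padded to full length.
    ay = np.zeros(len(y))
    ay[svm.support_] = svm.dual_coef_.ravel()

    # Gradient step: d_k <- d_k - eps * (sigma_k - 1/2 alpha^t Y K_k Y alpha),
    # where alpha^t Y K_k Y alpha = ay @ K_k @ ay.
    grad = sigma - 0.5 * np.array([ay @ Kk @ ay for Kk in Ks])
    d = np.maximum(d - eps * grad, 0.0)          # project onto d >= 0

print("learnt kernel weights:", d)
```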
19. Kernel Generalizations
- The learnt kernel can now have any functional form as long as
  - $\nabla_d K(\mathbf{d})$ exists and is continuous, and
  - $K(\mathbf{d})$ is strictly positive definite for feasible $\mathbf{d}$.
- For example, $K(\mathbf{d}) = \sum_k d_{k0} \prod_l \exp(-d_{kl}\, \chi_l^2)$ (sketched below).
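A sketch of one such generalised kernel, a weighted sum over products of exponentials. Per-feature squared Euclidean distances stand in here for the per-feature $\chi_l^2$ distances, which is an assumption for illustration:

```python
import numpy as np

def generalized_kernel(X, d0, D):
    """K = sum_k d0[k] * prod_l exp(-D[k, l] * dist_l^2).

    X: (n, L) data; d0: (m,) mixture weights; D: (m, L) per-feature decay rates.
    """
    # dist2[l] is the (n, n) matrix of squared distances in feature l
    # (assumed stand-in for the chi-squared distances on the slide).
    dist2 = np.square(X.T[:, :, None] - X.T[:, None, :])   # (L, n, n)
    return sum(
        d0[k] * np.exp(-(D[k, :, None, None] * dist2).sum(0))
        for k in range(len(d0))
    )

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = generalized_kernel(X, d0=np.array([1.0, 0.5]), D=rng.uniform(0.1, 1.0, (2, 3)))
```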
20. Regularization Generalizations
- Any regulariser can be used as long as it has a continuous first derivative with respect to $\mathbf{d}$.
- We can now put Gaussian rather than Laplacian priors on the kernel weights.
- We can, once again, have negative weights.
21. Loss Function Generalizations
- The loss function can be generalised to handle (a regression sketch follows the list):
  - Regression.
  - Novelty detection (1-class SVM).
  - Multi-class classification.
  - Ordinal regression.
  - Ranking.
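Because the inner solver is treated as a black box, swapping the loss only means swapping that solver. A sketch (illustrative data and kernel weights assumed) of reusing a learnt combined kernel for regression via scikit-learn's SVR:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
t = X[:, 0] + 0.1 * rng.normal(size=60)      # real-valued targets

# Any fixed kernel weights d (e.g. from the MKL loop above) define K.
d = np.array([0.7, 0.3])
Ks = [X @ X.T, np.exp(-0.5 * np.square(X[:, None] - X[None]).sum(-1))]
K = sum(dk * Kk for dk, Kk in zip(d, Ks))

# The epsilon-insensitive loss replaces the hinge loss; the wrapper is unchanged.
reg = SVR(kernel="precomputed", C=10.0, epsilon=0.1).fit(K, t)
print(reg.predict(K[:5]))                    # kernel rows of test vs training points
```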
22. Penalties
- Our formulation is no longer convex.
- Somehow, this does not seem to make much of a difference in practice.
- Furthermore, early termination can sometimes improve results.
23. Regression on Hot or Not Training Data
[Figure: example training images from the Hot or Not website with their user ratings, e.g. 6.5, 7.3, 7.7, 8.7, 9.4.]
24. Predict Hotness
25. Learning Discriminative Object Features/Pixels