Multiple Kernel Learning - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Multiple Kernel Learning


1
Multiple Kernel Learning
Manik Varma Microsoft Research India
2
Object Categorization and Detection
3
Experimental Results Caltech 101
  • Adding the Gist Kernel and some post-processing
    gives 98.2 Bosch et al. IJCV submitted

4
Experimental Results Caltech 256
5
Experimental Results PASCAL VOC2007
6
The C-SVM Primal Formulation
  • Minimise ½ wᵀw + C Σᵢ ξᵢ
  • Subject to
  • yᵢ (wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • where
  • (xᵢ, yᵢ) is the ith training point.
  • C is the misclassification penalty.
  • Decision function f(x) = sign(wᵀφ(x) + b) (a scikit-learn sketch follows)

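Below is a minimal sketch of this formulation in use with scikit-learn; the toy data, the RBF feature map φ and the parameter values are illustrative assumptions, not part of the slides.

```python
# Minimal sketch (not from the slides): training a C-SVM and evaluating the
# decision function f(x) = sign(w'phi(x) + b) via scikit-learn. The toy data
# and the choice of an RBF feature map are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + X[:, 1])             # labels y_i in {-1, +1}

clf = SVC(C=1.0, kernel="rbf", gamma=1.0)  # C is the misclassification penalty
clf.fit(X, y)
print(clf.decision_function(X[:3]))        # w'phi(x) + b before taking the sign
print(clf.predict(X[:3]))                  # sign(w'phi(x) + b)
```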
7
The C-SVM Dual Formulation
  • Maximise 1ᵀα − ½ αᵀYKYα
  • Subject to
  • 1ᵀYα = 0
  • 0 ≤ α ≤ C
  • where
  • α are the Lagrange multipliers corresponding to the support vector coefficients
  • Y is a diagonal matrix such that Yᵢᵢ = yᵢ (checked numerically below)

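A small numpy check of these dual quantities, assuming scikit-learn's convention that `dual_coef_` stores yᵢαᵢ for the support vectors; the data and kernel are again made up for illustration.

```python
# Sketch: recover alpha from a trained SVC and evaluate the dual objective
# 1'a - 0.5 a'YKYa, plus the constraints 1'Ya = 0 and 0 <= a <= C.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + X[:, 1])

C = 1.0
clf = SVC(C=C, kernel="rbf", gamma=1.0).fit(X, y)

alpha = np.zeros(len(X))                         # Lagrange multipliers
alpha[clf.support_] = np.abs(clf.dual_coef_[0])  # dual_coef_ holds y_i * alpha_i
Y = np.diag(y)                                   # Y_ii = y_i
K = rbf_kernel(X, X, gamma=1.0)

dual = alpha.sum() - 0.5 * alpha @ Y @ K @ Y @ alpha
print("dual objective:", dual)
print("1'Y alpha =", y @ alpha)                  # should be ~0 at optimality
print("0 <= alpha <= C:", bool(np.all((alpha >= 0) & (alpha <= C + 1e-9))))
```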
8
Kernel Target Alignment
  • Kernel Target Alignment Cristianini et al. 2001
  • Alignment
  • A(K₁,K₂) = ⟨K₁,K₂⟩ / (⟨K₁,K₁⟩⟨K₂,K₂⟩)^½
  • where ⟨K₁,K₂⟩ = Σᵢ Σⱼ K₁(xᵢ,xⱼ) K₂(xᵢ,xⱼ)
  • Ideal kernel K_ideal = yyᵀ
  • Alignment to the ideal kernel
  • A(K) = ⟨K, yyᵀ⟩ / (n ⟨K,K⟩^½)
  • Optimal kernel (computed below)
  • K_opt = Σₖ dₖKₖ where Kₖ = vₖvₖᵀ (rank 1)

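The alignment formulas translate directly into a few lines of numpy; the linear kernel and the random labels below are placeholders, not data from the talk.

```python
# Sketch of the alignment computations, using the Frobenius inner product
# <K1,K2> = sum_ij K1(xi,xj) K2(xi,xj). Toy kernel and labels are assumed.
import numpy as np

def frobenius(K1, K2):
    return np.sum(K1 * K2)

def alignment(K1, K2):
    # A(K1,K2) = <K1,K2> / sqrt(<K1,K1> <K2,K2>)
    return frobenius(K1, K2) / np.sqrt(frobenius(K1, K1) * frobenius(K2, K2))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = np.sign(rng.normal(size=30))

K = X @ X.T                    # a linear kernel, purely illustrative
K_ideal = np.outer(y, y)       # the ideal kernel yy'
n = len(y)

# Alignment to the ideal kernel: A(K) = <K, yy'> / (n sqrt(<K,K>))
print(alignment(K, K_ideal))
print(frobenius(K, K_ideal) / (n * np.sqrt(frobenius(K, K))))  # same value
```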
9
Kernel Target Alignment
  • Kernel Target Alignment
  • Optimal alignment
  • A(K_opt) = Σₖ dₖ⟨vₖ,y⟩² / (n (Σₖ dₖ²)^½)
  • Assume Σₖ dₖ² = 1.
  • Lagrangian
  • L(λ,d) = Σₖ dₖ⟨vₖ,y⟩² − λ(Σₖ dₖ² − 1)
  • Optimal weights dₖ ∝ ⟨vₖ,y⟩² (see below)
  • Some generalisation bounds have been given, but the task is not directly related to classification.

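A sketch of these closed-form weights, under the assumption that the vₖ are eigenvectors of a base kernel, so the rank-1 kernels Kₖ = vₖvₖᵀ are mutually orthogonal; all data here is synthetic.

```python
# Sketch: weights d_k proportional to <v_k, y>^2, with v_k taken (as an
# assumption) to be the eigenvectors of a base kernel matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = np.sign(rng.normal(size=30))

K = X @ X.T
eigvals, V = np.linalg.eigh(K)        # columns of V are the v_k

d = (V.T @ y) ** 2                    # d_k proportional to <v_k, y>^2
d /= np.linalg.norm(d)                # enforce sum_k d_k^2 = 1

# The learnt kernel K_opt = sum_k d_k v_k v_k'.
K_opt = sum(d_k * np.outer(v_k, v_k) for d_k, v_k in zip(d, V.T))
print("first few weights:", np.round(d[:5], 3))
```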
10
Multiple Kernel Learning SDP
[Diagram: the (d₁, d₂) plane for K = d₁K₁ + d₂K₂, showing the Σₖ dₖ = 1 line, the K ⪰ 0 region handled by the SDP of Lanckriet et al., and the NP-hard region requiring brute-force search.]
11
Multiple Kernel Learning SDP
  • Multiple Kernel Learning Lanckriet et al. 2002
  • Minimise ½ wᵀw + C Σᵢ ξᵢ
  • Subject to
  • yᵢ (wᵀφ_d(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • K = Σₖ dₖKₖ is positive semi-definite
  • trace(K) = constant (these two constraints are illustrated below)
  • Optimisation is an SDP (SOCP if dₖ ≥ 0).
  • Other loss functions are possible (square hinge, KTA).

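The SDP itself is not reproduced here; the sketch below only illustrates the two constraints named on the slide (a positive semi-definite combined kernel and a fixed trace), with made-up base kernels and weights.

```python
# Not the SDP itself; just a numpy sketch of the constraints on the slide:
# K = sum_k d_k K_k must stay positive semi-definite and trace(K) is fixed.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))

kernels = [linear_kernel(X), rbf_kernel(X, gamma=0.5), rbf_kernel(X, gamma=2.0)]
d = np.array([0.5, 1.0, -0.2])        # weights may be negative in the general SDP

K = sum(dk * Kk for dk, Kk in zip(d, kernels))
K *= 25.0 / np.trace(K)               # rescale so trace(K) equals a chosen constant

eigmin = np.linalg.eigvalsh(K).min()
print("PSD:", eigmin >= -1e-9)        # with d_k >= 0 this is automatic (SOCP case)
```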
12
Multiple Kernel Learning Block l1
[Diagram: the same (d₁, d₂) plane; the block-ℓ1 methods of Bach et al., Sonnenburg et al. and Rakotomamonjy et al. work on the simplex Σₖ dₖ = 1, inside the K ⪰ 0 (SDP) region; the NP-hard region still requires brute-force search.]
13
MKL Block l1 SMO
  • MKL Block l1 SMO Bach et al. 2004
  • Min ½ (Σₖ dₖ‖wₖ‖₂)² + C Σᵢ ξᵢ + ½ Σₖ aₖ²‖wₖ‖₂²
  • Subject to
  • yᵢ (Σₖ wₖᵀφₖ(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • Moreau-Yosida regularisation ensures differentiability (for SMO)
  • Block l1 regularisation ensures sparsity (for kernel weights and SMO)
  • Optimisation is carried out via iterative SMO.

14
MKL Block l1 SILP
  • MKL Block l1 SILP Sonnenburg et al. 2005
  • Min ½ (Σₖ dₖ‖wₖ‖₂)² + C Σᵢ ξᵢ
  • Subject to
  • yᵢ (Σₖ wₖᵀφₖ(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • Σₖ dₖ = 1
  • Iterative SILP-QP solution.
  • Solves a 10M point problem with 20 kernels.
  • Generalises to regression, novelty detection, etc.

15
Other Formulations
  • Hyperkernels Ong and Smola 2002
  • Learn a kernel per training point (not per
    class)
  • SDP formulation improved to SOCP Tsang and Kwok
    2006.
  • Boosting
  • Exp/Log loss over pairs of distances Crammer et
    al. 2002
  • LPBoost Bi et al. 2004
  • KernelBoost Hertz et al. 2006
  • Multi-class MKL Zien and Ong 2007.

16
Multiple Kernel Learning Varma Ray 07
[Diagram: the (d₁, d₂) plane again; the formulation of Varma and Ray works over the whole d ≥ 0 (SOCP) region, which is larger than the simplex and sits inside the K ⪰ 0 (SDP) region; the NP-hard region still requires brute-force search.]
17
Our Primal Formulation
  • Minimise ½ wᵀw + C Σᵢ ξᵢ + Σₖ σₖdₖ
  • Subject to
  • yᵢ (wᵀφ_d(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • dₖ ≥ 0
  • K(xᵢ,xⱼ) = Σₖ dₖ Kₖ(xᵢ,xⱼ)
  • Very efficient gradient descent based solution.
  • Similar to Rakotomamonjy et al. 2007 but more accurate as our search space is larger.

18
Final Algorithm
  • Initialise d⁰ randomly.
  • Repeat until convergence:
  • Form K(x,y) = Σₖ dₖⁿ Kₖ(x,y).
  • Use any SVM solver to solve the standard SVM problem with kernel K and obtain α.
  • Update dₖⁿ⁺¹ = dₖⁿ − εₙ(σₖ − ½ αᵀYKₖYα).
  • Project dⁿ⁺¹ back onto the feasible set if it does not satisfy the constraint dⁿ⁺¹ ≥ 0 (sketched in code below).

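A compact sketch of this loop, assuming the regulariser Σₖ σₖdₖ from the primal formulation above and using scikit-learn's SVC with a precomputed kernel as the inner "any SVM solver" step; the data, base kernels, step size and iteration count are all illustrative choices rather than those used in the talk.

```python
# Projected-gradient sketch of the alternating MKL scheme on this slide.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.sign(X[:, 0] - X[:, 1])

base_kernels = [linear_kernel(X), rbf_kernel(X, gamma=0.5), rbf_kernel(X, gamma=2.0)]
sigma = np.ones(len(base_kernels))            # per-kernel penalties sigma_k
d = rng.uniform(0.1, 1.0, len(base_kernels))  # initialise d randomly
step, C = 0.01, 1.0

for it in range(50):
    # Form K = sum_k d_k K_k and solve the standard SVM problem with it.
    K = sum(dk * Kk for dk, Kk in zip(d, base_kernels))
    svm = SVC(C=C, kernel="precomputed").fit(K, y)

    # Recover the signed multipliers Y*alpha from the fitted model.
    y_alpha = np.zeros(len(y))
    y_alpha[svm.support_] = svm.dual_coef_[0]

    # Gradient step: d_k <- d_k - step * (sigma_k - 0.5 * a'Y K_k Y a),
    # then project back onto the feasible set d_k >= 0.
    grad = sigma - 0.5 * np.array([y_alpha @ Kk @ y_alpha for Kk in base_kernels])
    d = np.maximum(d - step * grad, 0.0)

print("learnt kernel weights:", np.round(d, 3))
```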
19
Kernel Generalizations
  • The learnt kernel can now have any functional form as long as
  • ∇_d K(d) exists and is continuous.
  • K(d) is strictly positive definite for feasible d.
  • For example, K(d) = Σₖ dₖ₀ Πₗ exp(−dₖₗ χₗ²) (implemented below)

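As an illustration only, the function below implements a kernel of this general form; using squared per-feature differences as a stand-in for χₗ², and all shapes and values, are assumptions made here, the point being simply that K(d) is smooth in d.

```python
# Sketch of a kernel with the generalised form K(d) = sum_k d_k0 prod_l exp(-d_kl chi2_l).
import numpy as np

def generalised_kernel(X, d):
    """X: (n, L) features; d: (K, L+1) with d[k, 0] = d_k0 and d[k, 1:] = d_kl."""
    n, L = X.shape
    # chi2[l] holds an (n, n) matrix of per-feature distances (here squared diffs).
    chi2 = [(X[:, l, None] - X[None, :, l]) ** 2 for l in range(L)]
    K = np.zeros((n, n))
    for dk in d:
        prod = np.ones((n, n))
        for l in range(L):
            prod *= np.exp(-dk[1 + l] * chi2[l])
        K += dk[0] * prod
    return K

X = np.random.default_rng(0).normal(size=(10, 3))
d = np.abs(np.random.default_rng(1).normal(size=(2, 4)))
print(generalised_kernel(X, d).shape)
```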
20
Regularization Generalizations
  • Any regularizer can be used as long as it has a continuous first derivative with respect to d.
  • We can now put Gaussian rather than Laplacian priors on the kernel weights.
  • We can, once again, have negative weights.

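For example (illustrative values only), switching priors just changes the regularisation gradient that is fed into the update step of the algorithm above.

```python
# Sketch: a Laplacian prior corresponds to the l1-style penalty sum_k sigma_k d_k
# (with d_k >= 0); a Gaussian prior corresponds to an l2 penalty. Both have a
# continuous gradient in d, which is all the gradient step needs.
import numpy as np

sigma = np.ones(3)
d = np.array([0.3, -0.1, 0.8])    # negative weights are now allowed

laplacian_grad = sigma            # gradient of sum_k sigma_k d_k (d_k >= 0 case)
gaussian_grad = sigma * d         # gradient of 0.5 * sum_k sigma_k d_k^2
print(laplacian_grad, gaussian_grad)
```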
21
Loss Function Generalizations
  • The loss function can be generalized to handle
  • Regression.
  • Novelty detection (1 class SVM).
  • Multi-class classification.
  • Ordinal Regression.
  • Ranking (see the solver sketch below).

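In the alternating scheme, these variants amount to swapping the inner solver while the kernel-weight update stays the same; the scikit-learn stand-ins below are one possible choice for illustration, not the implementation used in the talk.

```python
# Sketch: different losses via different inner solvers, each taking the
# precomputed kernel K produced by the MKL loop above.
from sklearn.svm import SVR, OneClassSVM, SVC

svm_regression = SVR(kernel="precomputed")        # epsilon-insensitive regression
svm_novelty = OneClassSVM(kernel="precomputed")   # novelty detection (1-class SVM)
svm_multiclass = SVC(kernel="precomputed")        # multi-class classification
```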
22
Penalties
  • Our formulation is no longer convex.
  • In practice, this does not seem to make much of a difference.
  • Furthermore, early termination can sometimes improve results.

23
Regression on Hot or Not Training Data
[Training images with "Hot or Not" ratings: 7.3, 6.5, 9.4, 7.5, 7.7, 7.7, 6.5, 6.9, 7.4, 8.7]
24
Predict Hotness

25
Learning Discriminative Object Features/Pixels