Title: Sparse Approximations to Bayesian Gaussian Processes
1 Sparse Approximations to Bayesian Gaussian Processes
- Matthias Seeger
- University of Edinburgh
2 Collaborators
- Neil Lawrence (Sheffield)
- Chris Williams (Edinburgh)
- Ralf Herbrich (MSR Cambridge)
3 Overview of the Talk
- Gaussian processes and approximations
- Understanding sparse schemes as likelihood approximations
- Two schemes and their relationships
- Fast greedy selection for the projected latent
variables scheme (GP regression)
4 Why Sparse Approximations?
- GPs lead to very powerful Bayesian methods for function fitting, classification, etc. Yet (almost) nobody uses them!
- Reason: horrible scaling, O(n^3)
- If sparse approximations work, there is a host of
applications, e.g. as building blocks in Bayesian
networks, etc.
5 Gaussian Process Models
- Gaussian prior (dense), kernel K
- Target y is separated by the latent u from all other variables: inference becomes a finite-dimensional problem (model sketched below)
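A minimal sketch of the model being referred to, in standard GP notation (not taken from the slide):

\[ u(\cdot) \sim \mathcal{GP}(0, K), \qquad u_i = u(x_i), \qquad y_i \mid u_i \sim P(y_i \mid u_i) \]

The likelihood couples each target y_i to the process only through u_i, so the posterior over the finite vector u = (u_1, ..., u_n) is all that inference and prediction require.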
6 Parameterisation
- Data D = {(x_i, y_i)}, i = 1, ..., n. Latent outputs u = (u_1, ..., u_n)
- Approximate the posterior process P(u(.) | D) by a GP Q(u(.) | D) (one standard parameterisation is sketched below)
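One standard form of such a parameterisation, as used in ADF/EP-style GP approximations (a sketch, not taken from the slide):

\[ Q(u \mid D) \propto P(u) \prod_{i=1}^{n} \tilde{t}_i(u_i), \qquad \tilde{t}_i(u_i) = \exp\bigl( b_i u_i - \tfrac{1}{2} \pi_i u_i^2 \bigr) \]

Each likelihood term is replaced by an unnormalised Gaussian "site", so Q(u | D) is again a Gaussian, with O(n) site parameters (b_i, pi_i).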
7 GP Approximations
- Most (non-MCMC) GP approximations use this representation
- Exact computation of Q(u | D) is intractable and needs to be approximated
- Attractive for sparse approximations: sequential fitting of Q(u | D) to P(u | D)
8 Assumed Density Filtering
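A sketch of the standard ADF step (not the slide's own equations): likelihood terms are absorbed one at a time, and the result is projected back onto a Gaussian by moment matching,

\[ \hat{P}(u) \propto P(y_i \mid u_i)\, Q_{\mathrm{old}}(u), \qquad Q_{\mathrm{new}} = \operatorname{argmin}_{Q\ \mathrm{Gaussian}} \mathrm{KL}\bigl( \hat{P} \,\|\, Q \bigr) \]

with the new marginal moments available in closed form from Z_i = \int P(y_i \mid u_i)\, Q_{\mathrm{old}}(u_i)\, du_i:

\[ \mu_i^{\mathrm{new}} = \mu_i + \varsigma_i \, \frac{\partial \log Z_i}{\partial \mu_i}, \qquad \varsigma_i^{\mathrm{new}} = \varsigma_i + \varsigma_i^2 \, \frac{\partial^2 \log Z_i}{\partial \mu_i^2} \]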
9 Towards Sparsity
- ADF: Bayesian online [Opper]. Multiple updates: cavity method [Opper, Winther], EP [Minka]
- Generalizations: EP [Minka], ADATAP [Csato, Opper, Winther; COW]
- Sequential updates are suitable for sparse online or greedy methods
10 Likelihood Approximations
- Active set I ⊂ {1, ..., n}, |I| = d << n
- Several sparse schemes can be understood as likelihood approximations (schematic form below)
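Schematically (a hedged reading of the slides, not their own formula): the full likelihood, which involves all n latents, is replaced by a term depending on u only through the active subset u_I,

\[ P(y \mid u) \approx t(u_I) \quad\Longrightarrow\quad Q(u \mid D) \propto P(u)\, t(u_I) \]

so all information from the data enters the posterior through the d = |I| active latents.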
11 Likelihood Approximations (II)
[Figure: graphical model with inputs x_1, ..., x_4, latents u_1, ..., u_4, and targets y_1, ..., y_4; active set I = {2, 3}]
12 Likelihood Approximations (III)
- For such sparse schemes:
- O(d^2) parameters at most
- Prediction in O(d^2), O(d) for the mean only
- Approximations to the marginal likelihood (variational lower bound, ADATAP [COW]), PAC bounds [Seeger], etc., become cheap as well!
13 Two Schemes
- IVM [Lawrence, Seeger, Herbrich; LSH]: ADF with fast greedy forward selection
- Sparse Greedy GPR [Smola, Bartlett; SB]: greedy, expensive. Can be sped up: Projected Latent Variables [Seeger, Lawrence, Williams]. More general: sparse batch ADATAP [COW]
- Not here: Sparse Online GP [Csato, Opper]
14 Informative Vector Machine
- ADF, stopped after d inclusions (could do deletions, exchanges)
- Fast greedy forward selection using criteria known from active learning (a regression-case sketch follows below)
- Faster than the SVM on hard MNIST binary tasks, yet probabilistic (error bars, etc.)
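A minimal illustrative sketch (not from the slides) of IVM-style greedy forward selection for the Gaussian-noise regression case, scoring candidates by the entropy reduction of their own marginal; the classification case uses ADF site updates for the non-Gaussian likelihood instead, and all names below are illustrative:

    import numpy as np

    def ivm_greedy_regression(K, y, noise_var, d):
        # Greedy forward selection of d active points for GP regression.
        # The posterior covariance of the latents given the active targets y_I is
        #   A = K - K[:, I] (K[I, I] + noise_var * I)^(-1) K[I, :],
        # kept implicitly as A = K - M^T M, where M = L^(-1) K[I, :] and
        # L is the Cholesky factor of K[I, I] + noise_var * I.
        n = K.shape[0]
        var = np.diag(K).copy()      # current marginal posterior variances of u
        M = np.zeros((d, n))         # one row added per inclusion
        beta = np.zeros(d)           # L^(-1) y_I, used for the posterior mean
        active = []

        for step in range(d):
            # Entropy reduction for candidate i is 0.5 * log(1 + var[i] / noise_var),
            # so the point with the largest current marginal variance wins.
            scores = var.copy()
            scores[active] = -np.inf
            j = int(np.argmax(scores))
            active.append(j)

            # Incremental (rank-one) update of the representation.
            l = M[:step, j]                              # = L^(-1) K[I, j]
            lam = np.sqrt(var[j] + noise_var)            # new Cholesky pivot
            new_row = (K[j, :] - l @ M[:step, :]) / lam  # new row of M
            beta[step] = (y[j] - l @ beta[:step]) / lam
            M[step, :] = new_row
            var -= new_row ** 2                          # O(n) marginal update

        post_mean = M.T @ beta       # posterior mean of the latents u
        return active, post_mean, var

Each inclusion costs O(n d) for the rank-one update, so d inclusions cost O(n d^2) in total, matching the scaling quoted in the conclusions.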
15 Why So Simple?
- Locality property of ADF: the marginal Q_new(u_i) follows in O(1) from Q(u_i)
- Locality property plus Gaussianity: simple relations between Q(u_i) and Q_new(u_i) allow fast evaluation of differential criteria
16 KL-Optimal Projections
17 KL-Optimal Projections (II)
- For Gaussian likelihood (sketch below)
- Can be used online or batch
- A bit unfortunate: we use the relative entropy both ways around!
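The ingredient used in the Gaussian case is standard conditioning under the GP prior (a sketch, not the slide's own formula): the non-active latents u_R are replaced by their conditional prior mean given u_I,

\[ \mathrm{E}[u_R \mid u_I] = K_{RI}\, K_I^{-1}\, u_I \]

which is exactly the projection appearing in the projected-latent-variables likelihood on the next slide.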
18 Projected Latent Variables
- Full GPR samples u_I ~ P(u_I), u_R ~ P(u_R | u_I), y ~ N(y | u, σ^2 I)
- Instead: y ~ N(y | E[u | u_I], σ^2 I). Latent variables u_R are replaced by their projections in the likelihood [SB] (without this interpretation)
- Note: sparse batch ADATAP [COW] is more general (non-Gaussian likelihoods)
19 Fast Greedy Selections
- With this likelihood approximation, typical forward selection criteria (MAP [SB], differential entropy / info-gain [LSH]) are too expensive
- Problem: upon inclusion, the latent u_i is coupled with all targets y
- Cheap criterion: ignore most couplings for the score evaluation (not for the inclusion itself!)
20 Yet Another Approximation
- To score x_i, we approximate Q_new(u | D) after inclusion of i by a form that drops most of the new couplings (the cheap criterion of the previous slide)
21 Fast Greedy Selections (II)
- Leads to O(1) criteria. Cost of searching over all remaining points is dominated by the cost of an inclusion
- Can easily be generalized to allow for couplings between u_i and some targets, if desired
- Can be done for sparse batch ADATAP as well
22 Marginal Likelihood
- The marginal likelihood is given below (sketch)
- Can be optimized efficiently w.r.t. σ and the kernel parameters: O(n d (d + p)) per gradient, p the number of parameters
- Keep I fixed during line searches, reselect it for new search directions
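As a sketch (not the slide's own expression): integrating u_I ~ N(0, K_I) out of the projected model of slide 18, y ~ N(K_{.I} K_I^{-1} u_I, σ^2 I), gives

\[ \log P(y) = \log \mathcal{N}\bigl( y \mid 0,\; K_{\cdot I} K_I^{-1} K_{I \cdot} + \sigma^2 I \bigr) \]

a Gaussian with rank-d-plus-diagonal covariance, which is why each gradient w.r.t. σ and the p kernel parameters costs only O(n d (d + p)).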
23 Conclusions
- Most sparse approximations can be understood as likelihood approximations
- Several schemes available, all O(n d^2), yet the constants do matter here!
- Fast information-theoretic criteria are effective for classification. Extension to active learning is straightforward
24 Conclusions (II)
- Missing: experimental comparison, esp. to test the effectiveness of marginal likelihood optimization
- Extensions:
- C classes: easy in O(n d^2 C^2), maybe in O(n d^2 C)
- Integrate with Bayesian networks [Friedman, Nachman]