Title: Graphical model software for machine learning
1. Graphical model software for machine learning
- Kevin Murphy
- University of British Columbia
December 2005
2. Outline
- Discriminative models for iid data
- Beyond iid data: conditional random fields
- Beyond supervised learning: generative models
- Beyond optimization: Bayesian models
3. Supervised learning as Bayesian inference
[Figure: plate-notation model with training pairs (X_1, Y_1), ..., (X_n, Y_n), ..., (X_N, Y_N) on a plate of size N, and a test pair (X, Y) whose Y is unknown.]
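In equations (a standard rendering of the Bayesian picture, with $w$ denoting the parameters), prediction for the test case integrates over the parameters, using the posterior given the training data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$:

$$P(y \mid x, \mathcal{D}) = \int P(y \mid x, w)\, p(w \mid \mathcal{D})\, dw$$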
4. Supervised learning as optimization
[Figure: the same plate-notation model as the previous slide, with training pairs (X_1, Y_1), ..., (X_N, Y_N) on a plate of size N and a test pair (X, Y) whose Y is unknown.]
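By contrast, the optimization view (again a standard rendering, not the slide's own notation) fits a point estimate and plugs it in:

$$\hat{w} = \arg\max_{w} \sum_{n=1}^{N} \log P(y_n \mid x_n, w), \qquad P(y \mid x) \approx P(y \mid x, \hat{w})$$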
5. Example: logistic regression
- Let y_n ∈ {1, ..., C} be given by a softmax (written out below)
- Maximize the conditional log likelihood
- Max-margin solution
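For reference, the standard softmax model and its conditional log likelihood (with class weight vectors $w_c$) are:

$$P(y_n = c \mid x_n, w) = \frac{\exp(w_c^{\top} x_n)}{\sum_{c'=1}^{C} \exp(w_{c'}^{\top} x_n)}, \qquad \ell(w) = \sum_{n=1}^{N} \log P(y_n \mid x_n, w)$$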
6. Outline
- Discriminative models for iid data
- Beyond iid data: conditional random fields
- Beyond supervised learning: generative models
- Beyond optimization: Bayesian models
7. 1D chain CRFs for sequence labeling
A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, y_n ∈ {1, ..., C}^m; the factored form is written out below the figure.
[Figure: chain-structured CRF over labels Y_n1, Y_n2, ..., Y_nm, with edge potentials ψ_ij between adjacent labels and local evidence φ_i linking each label to the input X_n.]
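Written out (a standard form matching the figure's labels), the chain CRF is

$$P(y \mid x_n) = \frac{1}{Z(x_n)} \prod_{i=1}^{m} \phi_i(y_i, x_n) \prod_{i=1}^{m-1} \psi_{i,i+1}(y_i, y_{i+1}, x_n),$$

where $\phi_i$ is the local evidence, $\psi$ the edge potential, and $Z(x_n)$ the partition function.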
8. 2D lattice CRFs for pixel labeling
A conditional random field (CRF) is a discriminative model of P(y | x). The edge potentials ψ_ij are image-dependent.
9. 2D lattice MRFs for pixel labeling
A Markov random field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using potential functions ψ_ij(y_i, y_j), and we also have a per-pixel generative model of observations, the local evidence P(x_i | y_i); normalization is by a partition function. The joint model is written out below.
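Putting these terms together (a standard form consistent with the description above), the joint model is

$$P(y, x) = \frac{1}{Z} \prod_{\langle i,j \rangle} \psi_{ij}(y_i, y_j) \prod_{i} P(x_i \mid y_i),$$

where $P(x_i \mid y_i)$ is the local evidence, $\psi_{ij}$ the potential function, and $Z$ the partition function.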
10. Tree-structured CRFs
- Used in parts-based object detection
- Y_i is the location of part i in the image
[Figure: tree-structured model over face parts: eyeL, eyeR, nose, mouth.]
Fischler & Elschlager, "The representation and matching of pictorial structures," PAMI '73; Felzenszwalb & Huttenlocher, "Pictorial Structures for Object Recognition," IJCV '05.
11. General CRFs
- In general, the graph may have arbitrary structure, e.g. for collective web page classification: nodes = URLs, edges = hyperlinks
- The potentials are in general defined on cliques, not just edges
12. Factor graphs
- Square nodes = factors (potentials)
- Round nodes = random variables
- Graph structure = bipartite
13. Potential functions
- For the local evidence, we can use a discriminative classifier (trained iid)
- For the edge compatibilities, we can use a maxent/log-linear form, using pre-defined features (sketched below)
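A typical maxent/log-linear edge potential, with generic feature functions $f_k$ and weights $\lambda_k$ (placeholder names, not taken from the slide), is:

$$\psi_{ij}(y_i, y_j; x) = \exp\Big( \sum_{k} \lambda_k\, f_k(y_i, y_j, x) \Big)$$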
14. Restricted potential functions
- For some applications (esp. in vision), we often use a Potts model of the form sketched below
- We can generalize this for ordered labels (e.g. discretization of continuous states)
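A common Potts parameterization (the slide's exact form may differ) rewards neighbouring labels for agreeing:

$$\psi_{ij}(y_i, y_j) = \exp\big( \beta \, \mathbb{I}[y_i = y_j] \big), \qquad \beta > 0,$$

and the generalization to ordered labels replaces the indicator with a (typically truncated) penalty on $|y_i - y_j|$, e.g. truncated linear or quadratic.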
16. Learning CRFs
- If the log likelihood is the usual sum over cliques (with parameters tied across cliques),
- then the gradient is: features minus expected features (a standard form of both is written out below)
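In symbols, with parameters $\theta$ tied across cliques $c$ and clique feature vectors $f_c$ (a standard log-linear form consistent with the annotations above):

$$\ell(\theta) = \sum_{n} \Big( \sum_{c} \theta^{\top} f_c\big(y_n^{(c)}, x_n\big) - \log Z(x_n, \theta) \Big),$$

$$\nabla \ell(\theta) = \sum_{n} \sum_{c} \Big( f_c\big(y_n^{(c)}, x_n\big) - \mathbb{E}_{P(y^{(c)} \mid x_n, \theta)}\big[ f_c(y^{(c)}, x_n) \big] \Big).$$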
17. Learning CRFs
- Given the gradient ∇ℓ, one can find the global optimum using first- or second-order optimization methods (a wiring sketch follows this list), such as:
- Conjugate gradient
- Limited-memory BFGS
- Stochastic meta-descent (SMD)
- The bottleneck is computing the expected features needed for the gradient
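As an illustration only (not code from any package mentioned in these slides), here is a minimal Python sketch of wiring such a gradient into an off-the-shelf L-BFGS optimizer; crf_nll_and_grad is a hypothetical user-supplied function returning the negative conditional log likelihood and its gradient (features minus expected features, computed by inference):

import numpy as np
from scipy.optimize import minimize

def fit_crf(crf_nll_and_grad, data, dim):
    # crf_nll_and_grad(w, data) -> (negative log likelihood, gradient); hypothetical.
    w0 = np.zeros(dim)                                  # start from zero weights
    result = minimize(crf_nll_and_grad, w0, args=(data,),
                      method="L-BFGS-B", jac=True)      # jac=True: fun returns (f, grad)
    return result.x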
18. Exact inference
- For 1D chains, one can compute the marginals P(y_i, y_{i+1} | x) exactly in O(N K^2) time using belief propagation (BP, i.e. the forwards-backwards algorithm); a sketch follows this list
- For restricted potentials (e.g. ψ_ij a function of the label difference y_i − y_j), one can do this in O(N K) time using FFT-like tricks
- This can be generalized to trees
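A minimal NumPy sketch of the forwards-backwards (sum-product) recursion on one chain; the array names phi (m-by-K local evidence) and psi (K-by-K shared edge potential) are illustrative assumptions, not from any package mentioned in these slides:

import numpy as np

def chain_marginals(phi, psi):
    # phi[i, k] = local evidence phi_i(y_i = k); psi[k, l] = edge potential psi(y_i = k, y_{i+1} = l)
    m, K = phi.shape
    alpha = np.zeros((m, K))                  # forwards messages
    beta = np.zeros((m, K))                   # backwards messages
    alpha[0] = phi[0] / phi[0].sum()          # normalize each step to avoid underflow
    for i in range(1, m):
        alpha[i] = phi[i] * (alpha[i - 1] @ psi)
        alpha[i] /= alpha[i].sum()
    beta[m - 1] = 1.0
    for i in range(m - 2, -1, -1):
        beta[i] = psi @ (phi[i + 1] * beta[i + 1])
        beta[i] /= beta[i].sum()
    marg = alpha * beta                       # unnormalized node marginals
    return marg / marg.sum(axis=1, keepdims=True)

Each of the m steps costs O(K^2), which gives the O(N K^2) total quoted above.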
19. Sum-product vs max-product
- We use sum-product to compute the marginal probabilities needed for learning
- We use max-product to find the most probable assignment (Viterbi decoding); a sketch follows this list
- We can also compute max-marginals
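The corresponding max-product (Viterbi) decoder, under the same assumed phi/psi conventions as the previous sketch, working in log space for numerical stability:

import numpy as np

def chain_viterbi(phi, psi):
    # Returns the most probable label sequence argmax_y P(y | x) for one chain.
    m, K = phi.shape
    log_phi, log_psi = np.log(phi), np.log(psi)
    delta = np.zeros((m, K))                  # best log score of any path ending in each state
    back = np.zeros((m, K), dtype=int)        # backpointers
    delta[0] = log_phi[0]
    for i in range(1, m):
        scores = delta[i - 1][:, None] + log_psi        # (previous state, current state)
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) + log_phi[i]
    y = np.zeros(m, dtype=int)
    y[-1] = delta[-1].argmax()
    for i in range(m - 2, -1, -1):            # trace the backpointers
        y[i] = back[i + 1, y[i + 1]]
    return y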
20. Complexity of exact inference
In general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering). For chains and trees, w = 2. For n × n lattices, w = O(n).
21. Approximate sum-product
Algorithm                    Potential (pairwise)   Time
BP (exact iff tree)          General                O(N K^2 I)
BP + FFT (exact iff tree)    Restricted             O(N K I)
Generalized BP               General                O(N K^(2c) I), c = cluster size
Gibbs                        General                O(N K I)
Swendsen-Wang                General                O(N K I)
Mean field                   General                O(N K I)
(N = num nodes, K = num states, I = num iterations)
22. Approximate max-product
Algorithm                           Potential (pairwise)   Time
BP (exact iff tree)                 General                O(N K^2 I)
BP + DT (exact iff tree)            Restricted             O(N K I)
Generalized BP                      General                O(N K^(2c) I), c = cluster size
Graph cuts (exact iff K = 2)        Restricted             O(N^2 K I)?
ICM (iterated conditional modes)    General                O(N K I)
SLS (stochastic local search)       General                O(N K I)
(N = num nodes, K = num states, I = num iterations)
23. Learning intractable CRFs
- We can use approximate inference and hope the gradient is good enough
- If we use max-product, we are doing Viterbi training (cf. the perceptron rule)
- Or we can use other techniques, such as pseudo-likelihood, which does not need inference
24. Pseudo-likelihood
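The standard definition (a sketch of the usual form) replaces the joint conditional likelihood by a product of per-node conditionals, each depending only on a node's neighbours $\mathcal{N}(i)$, so no global partition function is needed:

$$\ell_{\mathrm{PL}}(\theta) = \sum_{n} \sum_{i} \log P\big(y_{ni} \mid y_{n,\mathcal{N}(i)}, x_n, \theta\big)$$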
25. Software for inference and learning in 1D CRFs
- Various packages:
- Mallet (McCallum et al.): Java
- crf.sourceforge.net (Sarawagi, Cohen): Java
- My code: Matlab (just a toy, not integrated with BNT)
- Ben Taskar says he will soon release his Max-Margin Markov net code (which uses LP for inference and QP for learning)
- Nothing standard; emphasis on NLP apps
26. Software for inference in general CRFs/MRFs
- Max-product: C code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al.: "A comparative study of energy minimization methods for MRFs," Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother
- Sum-product for Gaussian MRFs: GMRFlib, C code by Håvard Rue (exact inference)
- Sum-product: various other ad hoc pieces:
- My Matlab BP code (MRF2)
- Rivasseau's C code for BP, Gibbs, tree-sampling (factor graphs)
- Meltzer's C code for BP, GBP, Gibbs, MF (Lattice2)
27. Software for learning general MRFs/CRFs
- Hardly any!
- Parise's Matlab code (approx. gradient, pseudo-likelihood, CD, etc.)
- My Matlab code (IPF, approx. gradient; just a toy, not integrated with BNT)
28. Structure of ideal toolbox
[Block diagram with modules: generator/GUI/file, train, test data, infer, decision engine, performance, decision, visualize, summarize, utilities.]
29. Structure of BNT
[Block diagram mapping BNT onto the same modules: models specified as graphs + CPDs and data as cell arrays with node IDs (generator/GUI/file contributions by LeRay and Shan); train via EM / structural EM using BP, jtree or MCMC; infer via jtree or variable elimination; decision engine via LIMIDs (policy); outputs as arrays, Gaussians or samples, and N = 1 (MAP); plus visualize and summarize.]
30. Outline
- Discriminative models for iid data
- Beyond iid data: conditional random fields
- Beyond supervised learning: generative models
- Beyond optimization: Bayesian models
31. Unsupervised learning: why?
- Labeling data is time-consuming.
- Often not clear what label to use.
- Complex objects often not describable with a single discrete label.
- Humans learn without labels.
- Want to discover novel patterns/structure.
32. Unsupervised learning: what?
- Clusters (e.g. GMM)
- Low-dim manifolds (e.g. PCA)
- Graph structure (e.g. biology, social networks)
- Features (e.g. maxent models of language and texture)
- Objects (e.g. sprite models in vision)
33. Unsupervised learning of objects from video
Frey & Jojic; Williams & Titsias; et al.
34. Unsupervised learning issues
- Objective function not as obvious as in supervised learning; usually try to maximize likelihood (a measure of data compression).
- Local minima (non-convex objective).
- Uses inference as a subroutine (can be slow, though no worse than discriminative learning).
35. Unsupervised learning: how?
- Construct a generative model (e.g. a Bayes net); a toy example follows this list.
- Perform inference.
- May have to use approximations such as maximum likelihood and BP.
- Cannot use max likelihood for model selection.
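As a concrete, purely illustrative instance of fitting a generative model by (approximate) maximum likelihood, here is a minimal NumPy EM sketch for a spherical Gaussian mixture (the clustering example from slide 32); none of this is code from the packages below:

import numpy as np

def gmm_em(X, K, n_iter=50, seed=0):
    # EM for a mixture of K spherical Gaussians; X is an (N, D) data matrix.
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                     # mixing weights
    mu = X[rng.choice(N, K, replace=False)]      # means initialized at random data points
    var = np.full(K, X.var())                    # per-component spherical variances
    for _ in range(n_iter):
        # E step: responsibilities r[n, k] = P(z_n = k | x_n, current parameters)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)            # squared distances, (N, K)
        log_r = np.log(pi) - 0.5 * D * np.log(2 * np.pi * var) - 0.5 * d2 / var
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the expected sufficient statistics
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (D * Nk)
    return pi, mu, var

Each iteration alternates inference over the hidden cluster labels (E step) with parameter re-estimation (M step), i.e. the "construct a generative model, then perform inference" recipe above.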
36. A comparison of BN software
www.ai.mit.edu/murphyk/Software/Bayes/bnsoft.html
37. Popular BN software
- BNT (Matlab)
- Intel's PNL (C)
- Hugin (commercial)
- Netica (commercial)
- GMTK (free .exe from Jeff Bilmes)
38. Outline
- Discriminative models for iid data
- Beyond iid data: conditional random fields
- Beyond supervised learning: generative models
- Beyond optimization: Bayesian models
39. Bayesian inference: why?
- It is optimal.
- It can easily incorporate prior knowledge (esp. useful for small-n, large-p problems).
- It properly reports confidence in its output (useful for combining estimates, and for risk-averse applications).
- It separates models from algorithms.
40. Bayesian inference: how?
- Since we want to integrate, we cannot use max-product (the integral is spelled out below).
- Since the unknown parameters are continuous, we cannot use sum-product.
- But we can use EP (expectation propagation), which is similar to BP.
- We can also use variational inference.
- Or MCMC (e.g. Gibbs sampling).
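In symbols (a standard rendering, with $\theta$ for the unknown parameters), prediction requires the integral

$$P(y \mid x, \mathcal{D}) = \int P(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta,$$

which EP and variational inference approximate by fitting a tractable form (e.g. Gaussian or fully factorized) to $p(\theta \mid \mathcal{D})$, and which MCMC approximates by averaging over posterior samples: $P(y \mid x, \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} P(y \mid x, \theta^{(s)})$.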
41. General purpose Bayesian software
- BUGS (Gibbs sampling)
- VIBES (variational message passing)
- Minka and Winn's toolbox (infer.net)
42. Structure of ideal Bayesian toolbox
[Block diagram with the same modules as the ideal toolbox on slide 28: generator/GUI/file, train, test data, infer, decision engine, performance, decision, visualize, summarize, utilities.]