Title: Graphical model software for machine learning
1. Graphical model software for machine learning
- Kevin Murphy
- University of British Columbia
December 2005
2. Outline
- Discriminative models for iid data
- Beyond iid data: conditional random fields
- Beyond supervised learning: generative models
- Beyond optimization: Bayesian models
3. Supervised learning as Bayesian inference
[Figure: plate-notation model with training pairs (X_1, Y_1), ..., (X_n, Y_n), ..., (X_N, Y_N) on a plate of size N, and a test pair (X, Y) whose Y is unknown.]
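In equations (a standard rendering of the Bayesian picture, with $w$ denoting the parameters), prediction for the test case integrates over the parameters, using the posterior given the training data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$:

$$P(y \mid x, \mathcal{D}) = \int P(y \mid x, w)\, p(w \mid \mathcal{D})\, dw$$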
4. Supervised learning as optimization
[Figure: the same plate-notation model as the previous slide, with training pairs (X_1, Y_1), ..., (X_N, Y_N) on a plate of size N and a test pair (X, Y) whose Y is unknown.]
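By contrast, the optimization view (again a standard rendering, not the slide's own notation) fits a point estimate and plugs it in:

$$\hat{w} = \arg\max_{w} \sum_{n=1}^{N} \log P(y_n \mid x_n, w), \qquad P(y \mid x) \approx P(y \mid x, \hat{w})$$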
5. Example: logistic regression
- Let y_n ∈ {1, ..., C} be given by a softmax (written out below)
- Maximize the conditional log likelihood
- Max-margin solution
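For reference, the standard softmax model and its conditional log likelihood (with class weight vectors $w_c$) are:

$$P(y_n = c \mid x_n, w) = \frac{\exp(w_c^{\top} x_n)}{\sum_{c'=1}^{C} \exp(w_{c'}^{\top} x_n)}, \qquad \ell(w) = \sum_{n=1}^{N} \log P(y_n \mid x_n, w)$$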
6. Outline
- Discriminative models for iid data
- Beyond iid data: conditional random fields
- Beyond supervised learning: generative models
- Beyond optimization: Bayesian models
7. 1D chain CRFs for sequence labeling
A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, y_n ∈ {1, ..., C}^m; the factored form is written out below the figure.
[Figure: chain-structured CRF over labels Y_n1, Y_n2, ..., Y_nm, with edge potentials ψ_ij between adjacent labels and local evidence φ_i linking each label to the input X_n.]
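Written out (a standard form matching the figure's labels), the chain CRF is

$$P(y \mid x_n) = \frac{1}{Z(x_n)} \prod_{i=1}^{m} \phi_i(y_i, x_n) \prod_{i=1}^{m-1} \psi_{i,i+1}(y_i, y_{i+1}, x_n),$$

where $\phi_i$ is the local evidence, $\psi$ the edge potential, and $Z(x_n)$ the partition function.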
8. 2D lattice CRFs for pixel labeling
A conditional random field (CRF) is a discriminative model of P(y | x). The edge potentials ψ_ij are image-dependent.
9. 2D lattice MRFs for pixel labeling
A Markov random field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using potential functions ψ_ij(y_i, y_j), and we also have a per-pixel generative model of observations, the local evidence P(x_i | y_i); normalization is by a partition function. The joint model is written out below.
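Putting these terms together (a standard form consistent with the description above), the joint model is

$$P(y, x) = \frac{1}{Z} \prod_{\langle i,j \rangle} \psi_{ij}(y_i, y_j) \prod_{i} P(x_i \mid y_i),$$

where $P(x_i \mid y_i)$ is the local evidence, $\psi_{ij}$ the potential function, and $Z$ the partition function.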
10. Tree-structured CRFs
- Used in parts-based object detection
- Y_i is the location of part i in the image
[Figure: tree-structured model over face parts: eyeL, eyeR, nose, mouth.]
Fischler & Elschlager, "The representation and matching of pictorial structures," PAMI '73; Felzenszwalb & Huttenlocher, "Pictorial Structures for Object Recognition," IJCV '05.
11. General CRFs
- In general, the graph may have arbitrary structure, e.g. for collective web page classification: nodes = URLs, edges = hyperlinks
- The potentials are in general defined on cliques, not just edges
12. Factor graphs
- Square nodes = factors (potentials)
- Round nodes = random variables
- Graph structure = bipartite
13. Potential functions
- For the local evidence, we can use a discriminative classifier (trained iid)
- For the edge compatibilities, we can use a maxent/log-linear form, using pre-defined features (sketched below)
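A typical maxent/log-linear edge potential, with generic feature functions $f_k$ and weights $\lambda_k$ (placeholder names, not taken from the slide), is:

$$\psi_{ij}(y_i, y_j; x) = \exp\Big( \sum_{k} \lambda_k\, f_k(y_i, y_j, x) \Big)$$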
14. Restricted potential functions
- For some applications (esp. in vision), we often use a Potts model of the form sketched below
- We can generalize this for ordered labels (e.g. discretization of continuous states)
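A common Potts parameterization (the slide's exact form may differ) rewards neighbouring labels for agreeing:

$$\psi_{ij}(y_i, y_j) = \exp\big( \beta \, \mathbb{I}[y_i = y_j] \big), \qquad \beta > 0,$$

and the generalization to ordered labels replaces the indicator with a (typically truncated) penalty on $|y_i - y_j|$, e.g. truncated linear or quadratic.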
16. Learning CRFs
- If the log likelihood is the usual sum over cliques (with parameters tied across cliques),
- then the gradient is: features minus expected features (a standard form of both is written out below)
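In symbols, with parameters $\theta$ tied across cliques $c$ and clique feature vectors $f_c$ (a standard log-linear form consistent with the annotations above):

$$\ell(\theta) = \sum_{n} \Big( \sum_{c} \theta^{\top} f_c\big(y_n^{(c)}, x_n\big) - \log Z(x_n, \theta) \Big),$$

$$\nabla \ell(\theta) = \sum_{n} \sum_{c} \Big( f_c\big(y_n^{(c)}, x_n\big) - \mathbb{E}_{P(y^{(c)} \mid x_n, \theta)}\big[ f_c(y^{(c)}, x_n) \big] \Big).$$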
17. Learning CRFs
- Given the gradient ∇ℓ, one can find the global optimum using first- or second-order optimization methods (a wiring sketch follows this list), such as:
- Conjugate gradient
- Limited-memory BFGS
- Stochastic meta-descent (SMD)
- The bottleneck is computing the expected features needed for the gradient
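As an illustration only (not code from any package mentioned in these slides), here is a minimal Python sketch of wiring such a gradient into an off-the-shelf L-BFGS optimizer; crf_nll_and_grad is a hypothetical user-supplied function returning the negative conditional log likelihood and its gradient (features minus expected features, computed by inference):

import numpy as np
from scipy.optimize import minimize

def fit_crf(crf_nll_and_grad, data, dim):
    # crf_nll_and_grad(w, data) -> (negative log likelihood, gradient); hypothetical.
    w0 = np.zeros(dim)                                  # start from zero weights
    result = minimize(crf_nll_and_grad, w0, args=(data,),
                      method="L-BFGS-B", jac=True)      # jac=True: fun returns (f, grad)
    return result.x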
18. Exact inference
- For 1D chains, one can compute the marginals P(y_i, y_{i+1} | x) exactly in O(N K^2) time using belief propagation (BP, i.e. the forwards-backwards algorithm); a sketch follows this list
- For restricted potentials (e.g. ψ_ij a function of the label difference y_i − y_j), one can do this in O(N K) time using FFT-like tricks
- This can be generalized to trees
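A minimal NumPy sketch of the forwards-backwards (sum-product) recursion on one chain; the array names phi (m-by-K local evidence) and psi (K-by-K shared edge potential) are illustrative assumptions, not from any package mentioned in these slides:

import numpy as np

def chain_marginals(phi, psi):
    # phi[i, k] = local evidence phi_i(y_i = k); psi[k, l] = edge potential psi(y_i = k, y_{i+1} = l)
    m, K = phi.shape
    alpha = np.zeros((m, K))                  # forwards messages
    beta = np.zeros((m, K))                   # backwards messages
    alpha[0] = phi[0] / phi[0].sum()          # normalize each step to avoid underflow
    for i in range(1, m):
        alpha[i] = phi[i] * (alpha[i - 1] @ psi)
        alpha[i] /= alpha[i].sum()
    beta[m - 1] = 1.0
    for i in range(m - 2, -1, -1):
        beta[i] = psi @ (phi[i + 1] * beta[i + 1])
        beta[i] /= beta[i].sum()
    marg = alpha * beta                       # unnormalized node marginals
    return marg / marg.sum(axis=1, keepdims=True)

Each of the m steps costs O(K^2), which gives the O(N K^2) total quoted above.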
19. Sum-product vs max-product
- We use sum-product to compute the marginal probabilities needed for learning
- We use max-product to find the most probable assignment (Viterbi decoding); a sketch follows this list
- We can also compute max-marginals
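The corresponding max-product (Viterbi) decoder, under the same assumed phi/psi conventions as the previous sketch, working in log space for numerical stability:

import numpy as np

def chain_viterbi(phi, psi):
    # Returns the most probable label sequence argmax_y P(y | x) for one chain.
    m, K = phi.shape
    log_phi, log_psi = np.log(phi), np.log(psi)
    delta = np.zeros((m, K))                  # best log score of any path ending in each state
    back = np.zeros((m, K), dtype=int)        # backpointers
    delta[0] = log_phi[0]
    for i in range(1, m):
        scores = delta[i - 1][:, None] + log_psi        # (previous state, current state)
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) + log_phi[i]
    y = np.zeros(m, dtype=int)
    y[-1] = delta[-1].argmax()
    for i in range(m - 2, -1, -1):            # trace the backpointers
        y[i] = back[i + 1, y[i + 1]]
    return y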
20. Complexity of exact inference
In general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering). For chains and trees, w = 2. For n × n lattices, w = O(n).
21. Approximate sum-product
Algorithm                    Potential (pairwise)   Time
BP (exact iff tree)          General                O(N K^2 I)
BP + FFT (exact iff tree)    Restricted             O(N K I)
Generalized BP               General                O(N K^(2c) I), c = cluster size
Gibbs                        General                O(N K I)
Swendsen-Wang                General                O(N K I)
Mean field                   General                O(N K I)
(N = num nodes, K = num states, I = num iterations)
22. Approximate max-product
Algorithm                           Potential (pairwise)   Time
BP (exact iff tree)                 General                O(N K^2 I)
BP + DT (exact iff tree)            Restricted             O(N K I)
Generalized BP                      General                O(N K^(2c) I), c = cluster size
Graph cuts (exact iff K = 2)        Restricted             O(N^2 K I)?
ICM (iterated conditional modes)    General                O(N K I)
SLS (stochastic local search)       General                O(N K I)
(N = num nodes, K = num states, I = num iterations)
23. Learning intractable CRFs
- We can use approximate inference and hope the gradient is good enough
- If we use max-product, we are doing Viterbi training (cf. the perceptron rule)
- Or we can use other techniques, such as pseudo-likelihood, which does not need inference
24. Pseudo-likelihood
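The standard definition (a sketch of the usual form) replaces the joint conditional likelihood by a product of per-node conditionals, each depending only on a node's neighbours $\mathcal{N}(i)$, so no global partition function is needed:

$$\ell_{\mathrm{PL}}(\theta) = \sum_{n} \sum_{i} \log P\big(y_{ni} \mid y_{n,\mathcal{N}(i)}, x_n, \theta\big)$$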
25. Software for inference and learning in 1D CRFs
- Various packages:
- Mallet (McCallum et al.): Java
- crf.sourceforge.net (Sarawagi, Cohen): Java
- My code: Matlab (just a toy, not integrated with BNT)
- Ben Taskar says he will soon release his Max-Margin Markov net code (which uses LP for inference and QP for learning)
- Nothing standard; emphasis on NLP apps
26. Software for inference in general CRFs/MRFs
- Max-product: C code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al.: "A comparative study of energy minimization methods for MRFs," Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother
- Sum-product for Gaussian MRFs: GMRFlib, C code by Håvard Rue (exact inference)
- Sum-product: various other ad hoc pieces:
- My Matlab BP code (MRF2)
- Rivasseau's C code for BP, Gibbs, tree-sampling (factor graphs)
- Meltzer's C code for BP, GBP, Gibbs, MF (Lattice2)
27. Software for learning general MRFs/CRFs
- Hardly any!
- Parise's Matlab code (approx. gradient, pseudo-likelihood, CD, etc.)
- My Matlab code (IPF, approx. gradient; just a toy, not integrated with BNT)
28. Structure of ideal toolbox
[Block diagram with modules: generator/GUI/file, train, test data, infer, decision engine, performance, decision, visualize, summarize, utilities.]
29. Structure of BNT
[Block diagram mapping BNT onto the same modules: models specified as graphs + CPDs and data as cell arrays with node IDs (generator/GUI/file contributions by LeRay and Shan); train via EM / structural EM using BP, jtree or MCMC; infer via jtree or variable elimination; decision engine via LIMIDs (policy); outputs as arrays, Gaussians or samples, and N = 1 (MAP); plus visualize and summarize.]
30. Outline
- Discriminative models for iid data
- Beyond iid data: conditional random fields
- Beyond supervised learning: generative models
- Beyond optimization: Bayesian models
31. Unsupervised learning: why?
- Labeling data is time-consuming.
- Often not clear what label to use.
- Complex objects often not describable with a single discrete label.
- Humans learn without labels.
- Want to discover novel patterns/structure.
32. Unsupervised learning: what?
- Clusters (e.g. GMM)
- Low-dim manifolds (e.g. PCA)
- Graph structure (e.g. biology, social networks)
- Features (e.g. maxent models of language and texture)
- Objects (e.g. sprite models in vision)
33. Unsupervised learning of objects from video
Frey & Jojic; Williams & Titsias; et al.
34. Unsupervised learning issues
- Objective function not as obvious as in supervised learning; usually try to maximize likelihood (a measure of data compression).
- Local minima (non-convex objective).
- Uses inference as a subroutine (can be slow, though no worse than discriminative learning).
35. Unsupervised learning: how?
- Construct a generative model (e.g. a Bayes net); a toy example follows this list.
- Perform inference.
- May have to use approximations such as maximum likelihood and BP.
- Cannot use max likelihood for model selection.
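As a concrete, purely illustrative instance of fitting a generative model by (approximate) maximum likelihood, here is a minimal NumPy EM sketch for a spherical Gaussian mixture (the clustering example from slide 32); none of this is code from the packages below:

import numpy as np

def gmm_em(X, K, n_iter=50, seed=0):
    # EM for a mixture of K spherical Gaussians; X is an (N, D) data matrix.
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                     # mixing weights
    mu = X[rng.choice(N, K, replace=False)]      # means initialized at random data points
    var = np.full(K, X.var())                    # per-component spherical variances
    for _ in range(n_iter):
        # E step: responsibilities r[n, k] = P(z_n = k | x_n, current parameters)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)            # squared distances, (N, K)
        log_r = np.log(pi) - 0.5 * D * np.log(2 * np.pi * var) - 0.5 * d2 / var
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the expected sufficient statistics
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (D * Nk)
    return pi, mu, var

Each iteration alternates inference over the hidden cluster labels (E step) with parameter re-estimation (M step), i.e. the "construct a generative model, then perform inference" recipe above.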
36. A comparison of BN software
www.ai.mit.edu/murphyk/Software/Bayes/bnsoft.html
37. Popular BN software
- BNT (Matlab)
- Intel's PNL (C)
- Hugin (commercial)
- Netica (commercial)
- GMTK (free .exe from Jeff Bilmes)
38. Outline
- Discriminative models for iid data
- Beyond iid data: conditional random fields
- Beyond supervised learning: generative models
- Beyond optimization: Bayesian models
39. Bayesian inference: why?
- It is optimal.
- It can easily incorporate prior knowledge (esp. useful for small-n, large-p problems).
- It properly reports confidence in its output (useful for combining estimates, and for risk-averse applications).
- It separates models from algorithms.
40. Bayesian inference: how?
- Since we want to integrate, we cannot use max-product (the integral is spelled out below).
- Since the unknown parameters are continuous, we cannot use sum-product.
- But we can use EP (expectation propagation), which is similar to BP.
- We can also use variational inference.
- Or MCMC (e.g. Gibbs sampling).
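In symbols (a standard rendering, with $\theta$ for the unknown parameters), prediction requires the integral

$$P(y \mid x, \mathcal{D}) = \int P(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta,$$

which EP and variational inference approximate by fitting a tractable form (e.g. Gaussian or fully factorized) to $p(\theta \mid \mathcal{D})$, and which MCMC approximates by averaging over posterior samples: $P(y \mid x, \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} P(y \mid x, \theta^{(s)})$.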
41. General purpose Bayesian software
- BUGS (Gibbs sampling)
- VIBES (variational message passing)
- Minka and Winn's toolbox (infer.net)
42. Structure of ideal Bayesian toolbox
[Block diagram with the same modules as the ideal toolbox on slide 28: generator/GUI/file, train, test data, infer, decision engine, performance, decision, visualize, summarize, utilities.]