Title: Christopher M. Bishop
1 Latent Variables, Mixture Models and EM
Microsoft Research, Cambridge
BCS Summer School, Exeter, 2003
2 Overview
- K-means clustering
- Gaussian mixtures
- Maximum likelihood and EM
- Probabilistic graphical models
- Latent variables; EM revisited
- Bayesian Mixtures of Gaussians
- Variational Inference
- VIBES
3 Old Faithful
4 Old Faithful Data Set
(Figure: scatter plot of duration of eruption (minutes) against time between eruptions (minutes))
5 K-means Algorithm
- Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype
- Initialize prototypes, then iterate between two phases:
  - E-step: assign each data point to the nearest prototype
  - M-step: update prototypes to be the cluster means
- Simplest version is based on Euclidean distance
  - re-scale the Old Faithful data
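A minimal NumPy sketch of this two-phase procedure (the function name, random initialization, and toy data below are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    """Minimal K-means: alternate the E-step (assign) and M-step (update prototypes)."""
    rng = np.random.default_rng(rng)
    # Initialize prototypes to K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each data point to its nearest prototype (squared Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # M-step: set each prototype to the mean of the points assigned to it
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: assignments can no longer change
            break
        mu = new_mu
    return mu, z

# Example on standardized 2-D data (a stand-in for the re-scaled Old Faithful measurements)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in [(-1.0, -1.0), (1.0, 1.0)]])
prototypes, assignments = kmeans(X, K=2, rng=0)
```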
6-14 (figure-only slides, no transcript)
15 Responsibilities
- Responsibilities assign data points to clusters, such that each data point is assigned to exactly one cluster
- Example: 5 data points and 3 clusters
16 K-means Cost Function
17 Minimizing the Cost Function
- E-step: minimize w.r.t. the responsibilities
  - assigns each data point to the nearest prototype
- M-step: minimize w.r.t. the prototypes
  - each prototype is set to the mean of the points in that cluster
- Convergence guaranteed since there is a finite number of possible settings for the responsibilities
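The cost function on slide 16 and the two minimization steps on slide 17 were shown as equations that are missing from the transcript; they are presumably the standard K-means distortion measure and its minimizers (notation r_nk for binary responsibilities and mu_k for prototypes is assumed):

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2,
\qquad r_{nk} \in \{0,1\}, \quad \sum_{k} r_{nk} = 1

\text{E-step:}\quad r_{nk} =
\begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}
\qquad
\text{M-step:}\quad \boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\,\mathbf{x}_n}{\sum_n r_{nk}}
```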
18-19 (figure-only slides, no transcript)
20 Limitations of K-means
- Hard assignments of data points to clusters: a small shift of a data point can flip it to a different cluster
- Not clear how to choose the value of K
- Solution: replace the hard clustering of K-means with soft, probabilistic assignments
- Represents the probability distribution of the data as a Gaussian mixture model
21 The Gaussian Distribution
- Multivariate Gaussian
- Define the precision to be the inverse of the covariance
- In one dimension
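The densities this slide refers to (the equations were not transcribed) are the standard forms, with D the dimensionality and the precision defined as stated:

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}}
    \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\},
\qquad \boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1}

\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}
```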
22 Likelihood Function
- Data set
- Assume the observed data points are generated independently
- Viewed as a function of the parameters, this is known as the likelihood function
23 Maximum Likelihood
- Set the parameters by maximizing the likelihood function
- Equivalently, maximize the log likelihood
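For a data set X = {x_1, ..., x_N} of independent observations from a single Gaussian, the likelihood and log likelihood referred to on slides 22-23 are presumably:

```latex
p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}),
\qquad
\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
```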
24 Maximum Likelihood Solution
- Maximizing w.r.t. the mean gives the sample mean
- Maximizing w.r.t. the covariance gives the sample covariance
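The resulting estimators (equations omitted from the transcript) are the familiar ones:

```latex
\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N}
  (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}
```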
25 Bias of Maximum Likelihood
- Consider the expectations of the maximum likelihood estimates under the Gaussian distribution
- The maximum likelihood solution systematically under-estimates the covariance
- This is an example of over-fitting
26 Intuitive Explanation of Over-fitting
27 Unbiased Variance Estimate
- Clearly we can remove the bias by re-scaling the maximum likelihood estimate, since this gives an unbiased estimator (see below)
- Arises naturally in a Bayesian treatment (see later)
- For an infinite data set the two expressions are equal
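The expectations on slide 25 and the corrected estimator on slide 27 are presumably the standard results:

```latex
\mathbb{E}[\boldsymbol{\mu}_{\mathrm{ML}}] = \boldsymbol{\mu},
\qquad
\mathbb{E}[\boldsymbol{\Sigma}_{\mathrm{ML}}] = \frac{N-1}{N}\, \boldsymbol{\Sigma}

\widetilde{\boldsymbol{\Sigma}} = \frac{1}{N-1} \sum_{n=1}^{N}
  (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}
\quad\Longrightarrow\quad
\mathbb{E}[\widetilde{\boldsymbol{\Sigma}}] = \boldsymbol{\Sigma}
```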
28 Gaussian Mixtures
- Linear super-position of Gaussians
- Normalization and positivity constrain the mixing coefficients
- Can interpret the mixing coefficients as prior probabilities
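The mixture density and the constraints on the mixing coefficients referred to here are presumably:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad
0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1
```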
29 Example: Mixture of 3 Gaussians
30 Contours of Probability Distribution
31 Surface Plot
32 Sampling from the Gaussian
- To generate a data point:
  - first pick one of the components, with probability given by its mixing coefficient
  - then draw a sample from that component
- Repeat these two steps for each new data point
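A minimal NumPy sketch of this two-step ancestral sampling procedure (the function name and the toy parameters are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def sample_gmm(pi, mus, Sigmas, n_samples, rng=None):
    """Ancestral sampling from a Gaussian mixture: pick a component, then sample from it."""
    rng = np.random.default_rng(rng)
    # Step 1: choose a component for each new point, with probability given by the mixing coefficients
    ks = rng.choice(len(pi), size=n_samples, p=pi)
    # Step 2: draw each point from its chosen Gaussian component
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks

# Example: a 3-component mixture in 2-D (parameters chosen arbitrarily for illustration)
pi = [0.5, 0.3, 0.2]
mus = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigmas = [np.eye(2) * s for s in (1.0, 0.5, 0.2)]
X, labels = sample_gmm(pi, mus, Sigmas, n_samples=500, rng=0)
```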
33 Synthetic Data Set
34 Fitting the Gaussian Mixture
- We wish to invert this process: given the data set, find the corresponding parameters
  - mixing coefficients
  - means
  - covariances
- If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
- Problem: the data set is unlabelled
- We shall refer to the labels as latent (hidden) variables
35 Synthetic Data Set Without Labels
36 Posterior Probabilities
- We can think of the mixing coefficients as prior probabilities for the components
- For a given value of x we can evaluate the corresponding posterior probabilities, called responsibilities
- These are given by Bayes' theorem (see below)
37 Posterior Probabilities (colour coded)
38 Posterior Probability Map
39 Maximum Likelihood for the GMM
- The log likelihood function takes the form shown below
- Note: the sum over components appears inside the logarithm
- There is no closed-form solution for maximum likelihood
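The GMM log likelihood referred to here is:

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```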
40 Over-fitting in Gaussian Mixture Models
- Singularities in the likelihood function when a component collapses onto a data point: the likelihood diverges as that component's variance shrinks to zero
- The likelihood function gets larger as we add more components (and hence parameters) to the model
  - not clear how to choose the number K of components
41 Problems and Solutions
- How to maximize the log likelihood
  - solved by the expectation-maximization (EM) algorithm
- How to avoid singularities in the likelihood function
  - solved by a Bayesian treatment
- How to choose the number K of components
  - also solved by a Bayesian treatment
42 EM Algorithm: Informal Derivation
- Let us proceed by simply differentiating the log likelihood
- Setting the derivative with respect to each mean equal to zero gives an expression for that mean as the responsibility-weighted mean of the data
43 EM Algorithm: Informal Derivation
- Similarly for the covariances
- For the mixing coefficients, use a Lagrange multiplier (to enforce normalization), which gives the updates below
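The update equations these two slides arrive at (the equations themselves are missing from the transcript) are presumably the standard ones, writing gamma(z_nk) for the responsibility of component k for data point x_n:

```latex
\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
                      {\sum_{j} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},
\qquad
N_k = \sum_{n=1}^{N} \gamma(z_{nk})

\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})
  (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}},
\qquad
\pi_k = \frac{N_k}{N}
```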
44 EM Algorithm: Informal Derivation
- The solutions are not closed form since they are coupled
- Suggests an iterative scheme for solving them:
  - Make initial guesses for the parameters
  - Alternate between the following two stages:
    - E-step: evaluate the responsibilities
    - M-step: update the parameters using the ML results
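A minimal NumPy sketch of this alternating scheme, under simple assumptions (random initialization, a fixed number of iterations, and no safeguards against the singularities discussed on slide 40); the function and variable names are illustrative:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density evaluated at each row of X."""
    D = X.shape[1]
    diff = X - mu
    inv_Sigma = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** D * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, inv_Sigma, diff))

def em_gmm(X, K, n_iters=100, rng=None):
    """EM for a Gaussian mixture: alternate E-step (responsibilities) and M-step (ML updates)."""
    N, D = X.shape
    rng = np.random.default_rng(rng)
    # Initial guesses: uniform mixing coefficients, random means, identity covariances
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: evaluate the responsibilities gamma[n, k]
        weighted = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters using the maximum likelihood results
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return pi, mu, Sigma, gamma
```

A practical implementation would also monitor the log likelihood for convergence rather than running a fixed number of iterations.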
45-50 (figure-only slides, no transcript)
51 Digression: Probabilistic Graphical Models
- Graphical representation of a probabilistic model
- Each variable corresponds to a node in the graph
- Links in the graph denote relations between variables
- Motivation:
  - visualization of models and motivation for new models
  - graphical determination of conditional independence
  - complex calculations (inference) performed using graphical operations (e.g. forward-backward for HMMs)
- Here we consider directed graphs
52 Example: 3 Variables
- General distribution over 3 variables
- Apply the product rule of probability twice: p(x_1, x_2, x_3) = p(x_3 | x_1, x_2) p(x_2 | x_1) p(x_1)
- Express as a directed graph
53 General Decomposition Formula
- Joint distribution is a product of conditionals, each conditioned on its parent nodes
- Example
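The general formula (not transcribed from the slide) is presumably, with pa_k denoting the parents of node x_k in the directed graph:

```latex
p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)
```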
54 EM: Latent Variable Viewpoint
- Binary latent variables describing which component generated each data point
- Conditional distribution of the observed variable
- Prior distribution of the latent variables
- Marginalizing over the latent variables we obtain the Gaussian mixture distribution
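In the usual 1-of-K notation (assumed here; the equations were not transcribed), with binary latent variables z_k:

```latex
p(z_k = 1) = \pi_k,
\qquad
p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z}) \, p(\mathbf{x} \mid \mathbf{z})
  = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
```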
55 Expected Value of Latent Variable
56 Graphical Representation of GMM
57 Complete and Incomplete Data
(Figures: the complete data set and the incomplete data set)
58 Graph for Complete-Data Model
59 Latent Variable View of EM
- If we knew the values of the latent variables, we would maximize the complete-data log likelihood, which gives a trivial closed-form solution (fit each component to the corresponding set of data points)
- We don't know the values of the latent variables
- However, for given parameter values we can compute the expected values of the latent variables
60 Expected Complete-Data Log Likelihood
- Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
- Use these to evaluate the responsibilities
- Consider the expected complete-data log likelihood, where the responsibilities are computed using the guessed parameter values
- We are implicitly filling in the latent variables with our best guess
- Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results
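The quantity being maximized here (equation not transcribed) is presumably:

```latex
\mathbb{E}_{\mathbf{Z}}\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})
    \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```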
61 K-means Revisited
- Consider a GMM with common covariances
- Take the limit in which the common variance goes to zero
- Responsibilities become binary
- Expected complete-data log likelihood then corresponds (up to scale and an additive constant) to the negative of the K-means cost function
62 EM in General
- Consider an arbitrary distribution q over the latent variables
- The following decomposition always holds (see below)
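The decomposition and its two terms (omitted from the transcript) are presumably, for parameters theta and latent variables Z:

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p)

\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})},
\qquad
\mathrm{KL}(q \,\|\, p) = - \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}
```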
63 Decomposition
64 Optimizing the Bound
- E-step: maximize the bound with respect to q
  - equivalent to minimizing the KL divergence
  - sets q equal to the posterior distribution
- M-step: maximize the bound with respect to the parameters
  - equivalent to maximizing the expected complete-data log likelihood
- Each EM cycle must increase the incomplete-data likelihood unless already at a (local) maximum
65 E-step
66 M-step
67 Bayesian Inference
- Include prior distributions over the parameters
- Advantages in using conjugate priors
- Example: consider a single Gaussian over one variable
  - assume the variance is known and the mean is unknown
  - likelihood function for the mean
- Choose a Gaussian prior for the mean
68 Bayesian Inference for a Gaussian
- Posterior (proportional to the product of prior and likelihood) will then also be Gaussian (see below)
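Assuming the setup of the previous slide (known variance sigma^2, Gaussian prior N(mu | mu_0, sigma_0^2), and N observations), the posterior referred to here is presumably:

```latex
p(\mu \mid \mathbf{X}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2),
\qquad
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}},
\qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
```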
69 Bayesian Inference for a Gaussian
70 Bayesian Mixture of Gaussians
- Conjugate priors for the parameters:
  - Dirichlet prior for the mixing coefficients
  - Normal-Wishart prior for the means and precisions, where the Wishart distribution is given below
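Up to normalizing constants (which were not transcribed), these priors take the forms:

```latex
\mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \pi_k^{\alpha_k - 1},
\qquad
\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu) \propto
  |\boldsymbol{\Lambda}|^{(\nu - D - 1)/2}
  \exp\!\left\{ -\tfrac{1}{2} \mathrm{Tr}\!\left( \mathbf{W}^{-1} \boldsymbol{\Lambda} \right) \right\}
```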
71 Graphical Representation
- Parameters and latent variables appear on an equal footing
72 Variational Inference
- As with many Bayesian models, exact inference for the mixture of Gaussians is intractable
- Approximate Bayesian inference has traditionally been based on Laplace's method (a local Gaussian approximation to the posterior) or Markov chain Monte Carlo
- Variational inference is an alternative, broadly applicable, deterministic approximation scheme
73 General View of Variational Inference
- Consider again the previous decomposition, but where the posterior is over all latent variables and parameters
- Maximizing the bound over q would give the true posterior distribution, but this is intractable by definition
74 Factorized Approximation
- Goal: choose a family of distributions which are
  - sufficiently flexible to give a good posterior approximation
  - sufficiently simple to remain tractable
- Here we consider factorized distributions
- No further assumptions are required!
- Optimal solution for one factor, keeping the remainder fixed (see below)
- Coupled solutions, so initialize and then cyclically update
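The factorization and the general solution for a single factor (equations omitted from the transcript) are presumably:

```latex
q(\mathbf{Z}) = \prod_{i} q_i(\mathbf{Z}_i),
\qquad
\ln q_j^{\star}(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}\!\left[ \ln p(\mathbf{X}, \mathbf{Z}) \right] + \mathrm{const}
```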
75 Lower Bound
- Can also be evaluated
- Useful for maths/code verification
- Also useful for model comparison
76-77 (figure-only slides, no transcript)
78 Illustration: Univariate Gaussian
- Likelihood function
- Conjugate priors
- Factorized variational distribution
79 Variational Posterior Distribution
80 Initial Configuration
81 After Updating
82 After Updating
83 Converged Solution
84 Exact Solution
- For this very simple example there is an exact solution
- The expected precision can be written in closed form
- Compare with the earlier maximum likelihood solution
85 Variational Mixture of Gaussians
- Assume a factorized posterior distribution
- Gives an optimal solution in which the factor over the mixing coefficients is a Dirichlet, and the factor over the means and precisions is a Normal-Wishart
86 Sufficient Statistics
- The variational updates depend on the data only through responsibility-weighted statistics (see below)
- Small computational overhead compared to maximum-likelihood EM
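The statistics referred to on this slide are presumably the responsibility-weighted counts, means and covariances (with r_nk denoting the responsibilities):

```latex
N_k = \sum_{n=1}^{N} r_{nk},
\qquad
\bar{\mathbf{x}}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, \mathbf{x}_n,
\qquad
\mathbf{S}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}
  (\mathbf{x}_n - \bar{\mathbf{x}}_k)(\mathbf{x}_n - \bar{\mathbf{x}}_k)^{\mathrm{T}}
```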
87 Variational Equations for GMM
88 Bound vs. K for Old Faithful Data
89 Bayesian Model Complexity
90 Sparse Bayes for Gaussian Mixture
- Instead of comparing different values of K, start with a large value and prune out excess components
- Treat the mixing coefficients as parameters, and maximize the marginal likelihood (Corduneanu and Bishop, AIStats 2001)
- Gives simple re-estimation equations for the mixing coefficients; interleave with the variational updates
91-92 (figure-only slides, no transcript)
93 General Variational Framework
- Currently, for each new model we must:
  - derive the variational update equations
  - write application-specific code to find the solution
- Both stages are time consuming and error prone
- Can we build a general-purpose inference engine which automates these procedures?
94 Lower Bound for GMM
95 VIBES
- Variational Inference for Bayesian Networks
- Bishop and Winn (1999)
- A general inference engine using variational methods
- Models specified graphically
96 Example: Mixtures of Bayesian PCA
97 Solution
98 Local Computation in VIBES
- A key observation is that, in the general solution, the update for a particular node (or group of nodes) depends only on the other nodes in its Markov blanket
- Permits a local, object-oriented implementation
99 Shared Hyper-parameters
100 Take-home Messages
- Bayesian mixture of Gaussians
  - no singularities
  - determines the optimal number of components
- Variational inference
  - effective solution for the Bayesian GMM
  - optimizes a rigorous bound
  - little computational overhead compared to EM
- VIBES
  - rapid prototyping of probabilistic models
  - graphical specification
101 Viewgraphs, tutorials and publications available from
- http://research.microsoft.com/cmbishop