Title: Christopher M. Bishop
1 Latent Variables, Mixture Models and EM
Microsoft Research, Cambridge
BCS Summer School, Exeter, 2003
2 Overview
- K-means clustering
- Gaussian mixtures
- Maximum likelihood and EM
- Latent variables: EM revisited
- Bayesian Mixtures of Gaussians
3 Old Faithful
4 Old Faithful Data Set
[Scatter plot: duration of eruption (minutes) vs. time between eruptions (minutes)]
5 K-means Algorithm
- Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype
- Initialize the prototypes, then iterate between two phases
- E-step: assign each data point to its nearest prototype
- M-step: update each prototype to be the mean of its cluster
- Simplest version is based on Euclidean distance (a minimal code sketch follows below)
- re-scale the Old Faithful data
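The two-phase update described on this slide can be written in a few lines of NumPy. This is an illustrative sketch, not Bishop's own code: the data array `X`, the number of clusters `K`, and the function name `kmeans` are assumptions for the example.

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    """Minimal K-means: alternate hard assignments (E-step) and mean updates (M-step)."""
    rng = np.random.default_rng(rng)
    # Initialize prototypes with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest prototype (Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        r = d2.argmin(axis=1)
        # M-step: set each prototype to the mean of the points assigned to it
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, r
```

For the Old Faithful example, `X` would be the two-column (duration, waiting time) array, standardized before clustering as the last bullet suggests.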
15 Responsibilities
- Responsibilities \( r_{nk} \in \{0, 1\} \) assign data points to clusters, such that \( \sum_k r_{nk} = 1 \) for each data point
- Example: 5 data points and 3 clusters
16 K-means Cost Function
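The cost function itself appeared as an equation image on the original slide; it is the standard K-means distortion measure, written in terms of the binary responsibilities \( r_{nk} \) and the prototypes \( \boldsymbol{\mu}_k \):

```latex
\[
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^{2}
\]
```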
17 Minimizing the Cost Function
- E-step: minimize J w.r.t. the responsibilities; this assigns each data point to its nearest prototype
- M-step: minimize J w.r.t. the prototypes; this sets each prototype to the mean of the points in its cluster (the updates are written out below)
- Convergence is guaranteed, since there is only a finite number of possible settings for the responsibilities
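Written out (the equations were images on the original slide), the two minimization steps are:

```latex
\[
% E-step: for fixed prototypes, assign each point to its nearest prototype
r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^{2} \\
0 & \text{otherwise}
\end{cases}
\qquad
% M-step: setting the derivative of J w.r.t. each prototype to zero gives the cluster mean
\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk}\,\mathbf{x}_n}{\sum_{n} r_{nk}}
\]
```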
19 Limitations of K-means
- Hard assignments of data points to clusters: a small shift of a data point can flip it to a different cluster
- Not clear how to choose the value of K
- Solution: replace the hard clustering of K-means with soft, probabilistic assignments
- Represents the probability distribution of the data as a Gaussian mixture model
20 The Gaussian Distribution
- Multivariate Gaussian (density written out below)
- Define the precision to be the inverse of the covariance, \( \boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1} \)
- In one dimension the precision is the reciprocal of the variance, \( \lambda = 1/\sigma^{2} \)
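The density referred to above (an image on the original slide) is the standard multivariate Gaussian in \( D \) dimensions:

```latex
\[
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2}\,\lvert\boldsymbol{\Sigma}\rvert^{1/2}}
    \exp\!\left\{ -\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}
                  \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \right\}
\]
```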
21 Likelihood Function
- Data set \( \mathbf{X} = \{ \mathbf{x}_1, \ldots, \mathbf{x}_N \} \)
- Assume the observed data points are generated independently
- Viewed as a function of the parameters, this product of densities is known as the likelihood function
22 Maximum Likelihood
- Set the parameters by maximizing the likelihood function
- Equivalently, maximize the log likelihood
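For a single Gaussian, the likelihood and log likelihood referred to on these two slides (equation images in the original) take the standard form:

```latex
\[
p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}),
\qquad
\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
\]
```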
23 Maximum Likelihood Solution
- Maximizing w.r.t. the mean gives the sample mean
- Maximizing w.r.t. the covariance gives the sample covariance
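Written out (equation images in the original), the maximum likelihood estimates are:

```latex
\[
\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}
  (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathsf T}
\]
```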
24 Gaussian Mixtures
- Linear super-position of Gaussians (written out below)
- Normalization and positivity require \( 0 \le \pi_k \le 1 \) and \( \sum_{k} \pi_k = 1 \)
- Can interpret the mixing coefficients as prior probabilities
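The mixture density (an equation image in the original) is:

```latex
\[
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
\]
```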
25 Example: Mixture of 3 Gaussians
26 Contours of Probability Distribution
27 Surface Plot
28 Sampling from the Gaussian
- To generate a data point:
- first pick one of the components with probability \( \pi_k \)
- then draw a sample from that component
- Repeat these two steps for each new data point (a minimal sampling sketch follows below)
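Ancestral sampling from a Gaussian mixture, as described above, can be sketched as follows. This is an illustrative sketch; the names `sample_gmm`, `pis`, `mus`, and `Sigmas` are assumed mixture parameters for the example (mixing coefficients, means, covariances).

```python
import numpy as np

def sample_gmm(pis, mus, Sigmas, n_samples, rng=None):
    """Ancestral sampling: pick a component by its mixing coefficient, then sample from it."""
    rng = np.random.default_rng(rng)
    K = len(pis)
    # Step 1: choose a component index for every sample, with probability pi_k
    ks = rng.choice(K, size=n_samples, p=pis)
    # Step 2: draw each sample from the chosen Gaussian component
    samples = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return samples, ks
```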
29 Synthetic Data Set
30 Fitting the Gaussian Mixture
- We wish to invert this process: given the data set, find the corresponding parameters
- mixing coefficients
- means
- covariances
- If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
- Problem: the data set is unlabelled
- We shall refer to the labels as latent (hidden) variables
31 Synthetic Data Set Without Labels
32 Posterior Probabilities
- We can think of the mixing coefficients as prior probabilities for the components
- For a given value of \( \mathbf{x} \) we can evaluate the corresponding posterior probabilities, called responsibilities
- These are given by Bayes' theorem (written out below)
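The responsibility formula (an equation image in the original) is:

```latex
\[
\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x})
  = \frac{\pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
         {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
\]
```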
33 Posterior Probabilities (colour coded)
34 Posterior Probability Map
35 Maximum Likelihood for the GMM
- The log likelihood function takes the form shown below
- Note: the sum over components appears inside the logarithm
- There is no closed-form solution for maximum likelihood
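The GMM log likelihood (an equation image in the original) is:

```latex
\[
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \,
      \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
\]
```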
36 Problems and Solutions
- How to maximize the log likelihood?
- solved by the expectation-maximization (EM) algorithm
- How to avoid singularities in the likelihood function?
- solved by a Bayesian treatment
- How to choose the number K of components?
- also solved by a Bayesian treatment
37 EM Algorithm: Informal Derivation
- Let us proceed by simply differentiating the log likelihood
- Setting the derivative with respect to the mean \( \boldsymbol{\mu}_k \) equal to zero and rearranging gives the update written out below, which is simply the responsibility-weighted mean of the data
38 EM Algorithm: Informal Derivation
- Similarly for the covariances
- For the mixing coefficients, use a Lagrange multiplier (to enforce \( \sum_k \pi_k = 1 \)) to give the update below
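The resulting update equations for slides 37 and 38 (equation images in the original) are the standard GMM maximum-likelihood conditions, written in terms of the responsibilities \( \gamma(z_{nk}) \) and the effective counts \( N_k = \sum_n \gamma(z_{nk}) \):

```latex
\[
\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,\mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,
   (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf T},
\qquad
\pi_k = \frac{N_k}{N}
\]
```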
39 EM Algorithm: Informal Derivation
- The solutions are not in closed form, since they are coupled through the responsibilities
- This suggests an iterative scheme for solving them (a minimal sketch follows below):
- Make initial guesses for the parameters
- Alternate between the following two stages:
- E-step: evaluate the responsibilities
- M-step: update the parameters using the ML results
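The E/M alternation just described can be sketched in NumPy and SciPy. This is a minimal illustrative implementation, not Bishop's code: it assumes `X` is an (N, D) data array and `K` the chosen number of components, and it omits the safeguards against collapsing (singular) components that slide 36 alludes to.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, rng=None):
    """Fit a Gaussian mixture by EM: alternate responsibility evaluation and ML updates."""
    rng = np.random.default_rng(rng)
    N, D = X.shape
    # Initial guesses: uniform mixing coefficients, random data points as means, shared covariance
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters using the current responsibilities
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N
    return pis, mus, Sigmas, gamma
```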
46 EM: Latent Variable Viewpoint
- Binary latent variables \( z_{nk} \in \{0, 1\} \) describing which component generated each data point
- Conditional distribution of the observed variable
- Prior distribution of the latent variables
- Marginalizing over the latent variables, we recover the Gaussian mixture (written out below)
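Written out (the equations were images in the original), the standard latent-variable formulation uses a 1-of-K binary vector \( \mathbf{z} \):

```latex
\[
p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k},
\qquad
p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K}
   \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k},
\qquad
p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z})
   = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
\]
```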
47 Expected Value of Latent Variable
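The body of this slide was an equation image; the standard result it refers to is that the posterior expectation of each latent indicator is the corresponding responsibility:

```latex
\[
\mathbb{E}[z_{nk} \mid \mathbf{x}_n] = p(z_{nk}=1 \mid \mathbf{x}_n) = \gamma(z_{nk})
\]
```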
48 Complete and Incomplete Data
[Two panels: the complete data set (points with their latent component labels) and the incomplete data set (labels unobserved)]
49 Latent Variable View of EM
- If we knew the values of the latent variables, we would maximize the complete-data log likelihood, which gives a trivial closed-form solution (fit each component to the corresponding set of data points)
- We don't know the values of the latent variables
- However, for given parameter values we can compute the expected values of the latent variables
50 Expected Complete-Data Log Likelihood
- Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
- Use these to evaluate the responsibilities
- Consider the expected complete-data log likelihood (written out below), where the responsibilities are computed using the guessed parameters
- We are implicitly filling in the latent variables with our best guess
- Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results
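The expected complete-data log likelihood referred to above (an equation image in the original) has the standard form:

```latex
\[
\mathbb{E}_{\mathbf{Z}}\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid
    \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \right]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})
     \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid
       \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
\]
```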
51 EM in General
- Consider an arbitrary distribution \( q(\mathbf{Z}) \) over the latent variables
- The following decomposition always holds (written out below), where the two terms are a lower bound \( \mathcal{L}(q, \boldsymbol{\theta}) \) and the Kullback-Leibler divergence \( \mathrm{KL}(q \,\|\, p) \)
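The decomposition (equation images in the original) is the standard one underlying EM; since the KL divergence is non-negative, \( \mathcal{L}(q, \boldsymbol{\theta}) \) is a lower bound on the incomplete-data log likelihood:

```latex
\[
\ln p(\mathbf{X} \mid \boldsymbol{\theta})
  = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p),
\quad\text{where}\quad
\mathcal{L}(q, \boldsymbol{\theta})
  = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})},
\qquad
\mathrm{KL}(q \,\|\, p)
  = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}
\]
```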
52 Decomposition
53 Optimizing the Bound
- E-step: maximize \( \mathcal{L}(q, \boldsymbol{\theta}) \) with respect to \( q(\mathbf{Z}) \)
- equivalent to minimizing the KL divergence
- sets \( q(\mathbf{Z}) \) equal to the posterior distribution
- M-step: maximize the bound with respect to the parameters \( \boldsymbol{\theta} \)
- equivalent to maximizing the expected complete-data log likelihood
- Each EM cycle must increase the incomplete-data likelihood, unless it is already at a (local) maximum
54 E-step
55 M-step