
1
Latent Variables, Mixture Models and EM
  • Christopher M. Bishop

Microsoft Research, Cambridge
BCS Summer School, Exeter, 2003
2
Overview
  • K-means clustering
  • Gaussian mixtures
  • Maximum likelihood and EM
  • Latent variables and EM revisited
  • Bayesian Mixtures of Gaussians

3
Old Faithful
4
Old Faithful Data Set
[Scatter plot: duration of eruption (minutes) vs. time between eruptions (minutes)]
5
K-means Algorithm
  • Goal: represent a data set in terms of K clusters,
    each of which is summarized by a prototype
  • Initialize the prototypes, then iterate between two
    phases
  • E-step: assign each data point to its nearest
    prototype
  • M-step: update prototypes to be the cluster means
  • Simplest version is based on Euclidean distance;
    the Old Faithful data are first re-scaled
    (a minimal sketch of the algorithm follows below)
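The slides contain no code; as a rough illustration of the E-step/M-step cycle, here is a minimal NumPy sketch (function and variable names are mine, not from the presentation):

```python
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    """Minimal K-means: X is an (N, D) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes by picking K distinct data points at random
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # E-step: assign each point to its nearest prototype (Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        assign = d2.argmin(axis=1)
        # M-step: set each prototype to the mean of the points assigned to it
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign
```

For the Old Faithful example, X would be the re-scaled (standardized) two-column data and K = 2.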

6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
(No Transcript)
15
Responsibilities
  • Responsibilities assign data points to clusters,
    such that each data point belongs to exactly one
    cluster (see the constraint below)
  • Example: 5 data points and 3 clusters
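The constraint appeared only as an equation image; in the standard notation, with binary responsibilities r_nk:

```latex
r_{nk} \in \{0, 1\}, \qquad \sum_{k=1}^{K} r_{nk} = 1 \quad \text{for each data point } n
```

For the 5-point, 3-cluster example, one illustrative assignment (values made up for illustration) would be

```latex
R = \begin{pmatrix}
1 & 0 & 0 \\
0 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 0 & 1
\end{pmatrix}
```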

16
K-means Cost Function
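The cost function itself was shown only as an image; the standard sum-of-squares distortion is:

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2
```
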
17
Minimizing the Cost Function
  • E-step: minimize J w.r.t. the responsibilities
  • assigns each data point to its nearest prototype
  • M-step: minimize J w.r.t. the prototypes
  • gives: each prototype set to the mean of the points
    in that cluster (see the updates below)
  • Convergence is guaranteed since there is a finite
    number of possible settings for the
    responsibilities
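The updates, shown as images on the slide, take the standard form:

```latex
\text{E-step:}\quad r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\qquad
\text{M-step:}\quad \boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\,\mathbf{x}_n}{\sum_n r_{nk}}
```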

18
(No Transcript)
19
Limitations of K-means
  • Hard assignments of data points to clusters: a
    small shift of a data point can flip it to a
    different cluster
  • Not clear how to choose the value of K
  • Solution: replace the hard clustering of K-means
    with soft, probabilistic assignments
  • Represent the probability distribution of the
    data as a Gaussian mixture model

20
The Gaussian Distribution
  • Multivariate Gaussian (see below)
  • Define the precision to be the inverse of the
    covariance
  • In one dimension
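The densities on this slide were images; in standard form:

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2} \lvert \boldsymbol{\Sigma} \rvert^{1/2}}
    \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}
    \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\},
\qquad \boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1}

\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}
```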

21
Likelihood Function
  • Data set of N observations
  • Assume the observed data points are generated
    independently
  • Viewed as a function of the parameters, this is
    known as the likelihood function (see below)
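The likelihood was shown as an image; for a data set X = {x_1, ..., x_N} of independent observations:

```latex
p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
```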

22
Maximum Likelihood
  • Set the parameters by maximizing the likelihood
    function
  • Equivalently, maximize the log likelihood (see
    below)
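Written out, the log likelihood for a single Gaussian is:

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln\lvert\boldsymbol{\Sigma}\rvert
    - \frac{1}{2}\sum_{n=1}^{N} (\mathbf{x}_n-\boldsymbol{\mu})^{\mathsf T}
      \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n-\boldsymbol{\mu})
```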

23
Maximum Likelihood Solution
  • Maximizing w.r.t. the mean gives the sample
    mean
  • Maximizing w.r.t. the covariance gives the sample
    covariance
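The resulting estimators, shown as images on the slide, are the standard ones:

```latex
\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}
  (\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})^{\mathsf T}
```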

24
Gaussian Mixtures
  • Linear superposition of Gaussians
  • Normalization and positivity require constraints
    on the mixing coefficients (see below)
  • Can interpret the mixing coefficients as prior
    probabilities
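In the standard notation, the mixture model and the constraints on the mixing coefficients are:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad
0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1
```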

25
Example: Mixture of 3 Gaussians
26
Contours of Probability Distribution
27
Surface Plot
28
Sampling from the Gaussian
  • To generate a data point:
  • first pick one of the components with probability
    given by its mixing coefficient
  • then draw a sample from that component
  • Repeat these two steps for each new data point
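No code appears in the slides; a minimal NumPy sketch of this ancestral-sampling procedure might look as follows (names and the example parameters are my own):

```python
import numpy as np

def sample_gmm(pi, mus, Sigmas, n_samples, seed=0):
    """Draw n_samples points from a Gaussian mixture with weights pi."""
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for _ in range(n_samples):
        # First pick a component k with probability pi[k] ...
        k = rng.choice(len(pi), p=pi)
        # ... then draw a sample from that Gaussian component
        X.append(rng.multivariate_normal(mus[k], Sigmas[k]))
        labels.append(k)
    return np.array(X), np.array(labels)

# Example: a 2-D mixture of 3 Gaussians (parameters chosen arbitrarily)
pi = [0.5, 0.3, 0.2]
mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
Sigmas = [np.eye(2)] * 3
X, z = sample_gmm(pi, mus, Sigmas, n_samples=500)
```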

29
Synthetic Data Set
30
Fitting the Gaussian Mixture
  • We wish to invert this process: given the data
    set, find the corresponding parameters
  • mixing coefficients
  • means
  • covariances
  • If we knew which component generated each data
    point, the maximum likelihood solution would
    involve fitting each component to the
    corresponding cluster
  • Problem: the data set is unlabelled
  • We shall refer to the labels as latent (hidden)
    variables

31
Synthetic Data Set Without Labels
32
Posterior Probabilities
  • We can think of the mixing coefficients as prior
    probabilities for the components
  • For a given value of x we can evaluate the
    corresponding posterior probabilities, called
    responsibilities
  • These are given from Bayes' theorem by the
    expression below
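The expression, shown as an image on the slide, is the usual Bayes' theorem ratio:

```latex
\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x})
  = \frac{\pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
         {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```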

33
Posterior Probabilities (colour coded)
34
Posterior Probability Map
35
Maximum Likelihood for the GMM
  • The log likelihood function takes the form shown
    below
  • Note: the sum over components appears inside the
    log
  • There is no closed-form solution for maximum
    likelihood
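The log likelihood, shown as an image on the slide, is:

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K}
      \pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```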

36
Problems and Solutions
  • How to maximize the log likelihood?
  • solved by the expectation-maximization (EM)
    algorithm
  • How to avoid singularities in the likelihood
    function?
  • solved by a Bayesian treatment
  • How to choose the number K of components?
  • also solved by a Bayesian treatment

37
EM Algorithm Informal Derivation
  • Let us proceed by simply differentiating the log
    likelihood
  • Setting the derivative with respect to the mean
    equal to zero gives the update below, which is
    simply the weighted mean of the data
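In the standard notation, with responsibilities gamma(z_nk) as defined earlier, the update is:

```latex
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n,
\qquad
N_k = \sum_{n=1}^{N} \gamma(z_{nk})
```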

38
EM Algorithm Informal Derivation
  • Similarly for the covariances
  • For the mixing coefficients, use a Lagrange
    multiplier to give the result below
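The corresponding updates, shown as images on the slide, are:

```latex
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})
    (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf T},
\qquad
\pi_k = \frac{N_k}{N}
```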

39
EM Algorithm Informal Derivation
  • The solutions are not in closed form since they
    are coupled
  • This suggests an iterative scheme for solving them
  • Make initial guesses for the parameters
  • Alternate between the following two stages (a
    sketch follows below)
  • E-step: evaluate responsibilities
  • M-step: update parameters using the ML results
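Again no code appears in the slides; a minimal NumPy/SciPy sketch of this iterative scheme might look as follows (names, initialization choices and the use of scipy.stats are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """Minimal EM for a Gaussian mixture; X is an (N, D) array."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initial guesses: random means, identity covariances, uniform weights
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.array([np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: evaluate responsibilities gamma[n, k]
        gamma = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters using the ML results
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return pi, mu, Sigma, gamma
```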

40
(No Transcript)
41
(No Transcript)
42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
EM Latent Variable Viewpoint
  • Binary latent variables describe which component
    generated each data point
  • Conditional distribution of the observed variable
  • Prior distribution of the latent variables
  • Marginalizing over the latent variables, we obtain
    the mixture distribution (see below)
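The distributions on this slide were images; in the standard 1-of-K notation with latent vector z:

```latex
p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K}
    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k},
\qquad
p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}

p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z})
  = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
```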

47
Expected Value of Latent Variable
  • From Bayes' theorem (see below)
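Because z_k is binary, its expectation equals the posterior probability, i.e. the responsibility:

```latex
\mathbb{E}[z_k \mid \mathbf{x}] = p(z_k = 1 \mid \mathbf{x})
  = \frac{\pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
         {\sum_{j} \pi_j \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
  = \gamma(z_k)
```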

48
Complete and Incomplete Data
[Figures: the complete data set (component labels known) and the incomplete data set (labels unknown)]
49
Latent Variable View of EM
  • If we knew the values of the latent variables,
    we would maximize the complete-data log
    likelihood (see below), which gives a trivial
    closed-form solution (fit each component to the
    corresponding set of data points)
  • We don't know the values of the latent variables
  • However, for given parameter values we can
    compute the expected values of the latent
    variables
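The complete-data log likelihood, shown as an image on the slide, takes the standard form:

```latex
\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})
  = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}
    \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```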

50
Expected Complete-Data Log Likelihood
  • Suppose we make a guess for the parameter
    values (means, covariances and mixing
    coefficients)
  • Use these to evaluate the responsibilities
  • Consider the expected complete-data log likelihood
    (see below), where the responsibilities are
    computed using the guessed parameter values
  • We are implicitly filling in the latent variables
    with our best guess
  • Keeping the responsibilities fixed and maximizing
    with respect to the parameters gives the previous
    results
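The expected complete-data log likelihood, shown as an image on the slide, replaces each z_nk with its expectation (the responsibility), evaluated at the guessed parameters:

```latex
\mathbb{E}_{\mathbf{Z}}\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})
    \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```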

51
EM in General
  • Consider an arbitrary distribution q(Z) over the
    latent variables
  • The following decomposition always holds, where
    the two terms are defined below
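The decomposition, shown as an image on the slide, is the standard one (with parameters collectively denoted by theta):

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p)

\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z})
    \ln\!\left\{ \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} \right\},
\qquad
\mathrm{KL}(q \,\|\, p) = -\sum_{\mathbf{Z}} q(\mathbf{Z})
    \ln\!\left\{ \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})} \right\}
```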

52
Decomposition
53
Optimizing the Bound
  • E-step: maximize the bound with respect to q(Z)
  • equivalent to minimizing the KL divergence
  • sets q(Z) equal to the posterior distribution
  • M-step: maximize the bound with respect to the
    parameters
  • equivalent to maximizing the expected
    complete-data log likelihood
  • Each EM cycle must increase the incomplete-data
    likelihood unless already at a (local) maximum

54
E-step
55
M-step