Transcript and Presenter's Notes

Title: Christopher M. Bishop


1
Latent Variables, Mixture Models and EM
  • Christopher M. Bishop

Microsoft Research, Cambridge
BCS Summer School, Exeter, 2003
2
Overview
  • K-means clustering
  • Gaussian mixtures
  • Maximum likelihood and EM
  • Probabilistic graphical models
  • Latent variables: EM revisited
  • Bayesian Mixtures of Gaussians
  • Variational Inference
  • VIBES

3
Old Faithful
4
Old Faithful Data Set
Time between eruptions (minutes)
Duration of eruption (minutes)
5
K-means Algorithm
  • Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype
  • Initialize the prototypes, then iterate between two phases:
  • E-step: assign each data point to its nearest prototype
  • M-step: update each prototype to be the mean of the points assigned to it
  • Simplest version is based on Euclidean distance, so first re-scale the Old Faithful data (a code sketch follows below)
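
A minimal NumPy sketch of the two-phase iteration just described; the function name, initialization choice and stopping rule are illustrative assumptions, not taken from the slides.

    import numpy as np

    def kmeans(X, K, n_iters=100, seed=0):
        """Minimal K-means. X has shape (N, D); returns prototypes (K, D) and assignments (N,)."""
        rng = np.random.default_rng(seed)
        prototypes = X[rng.choice(len(X), size=K, replace=False)]  # initialize from random data points
        for _ in range(n_iters):
            # E-step: assign each data point to its nearest prototype (squared Euclidean distance)
            d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
            assign = d2.argmin(axis=1)
            # M-step: move each prototype to the mean of the points assigned to it
            new_prototypes = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                       else prototypes[k] for k in range(K)])
            if np.allclose(new_prototypes, prototypes):
                break
            prototypes = new_prototypes
        return prototypes, assign

    # For the Old Faithful data, re-scale each column first, e.g. X = (X - X.mean(0)) / X.std(0)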

6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
(No Transcript)
15
Responsibilities
  • Responsibilities r_nk assign data points to clusters such that each data point belongs to exactly one cluster
  • Example: 5 data points and 3 clusters

16
K-means Cost Function
17
Minimizing the Cost Function
  • E-step: minimize the cost function with respect to the responsibilities
  • this assigns each data point to its nearest prototype
  • M-step: minimize the cost function with respect to the prototypes
  • this sets each prototype to the mean of the points in its cluster
  • Convergence is guaranteed since there is only a finite number of possible settings for the responsibilities (see the equations below)
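
In standard notation (following Bishop's Pattern Recognition and Machine Learning), the cost function and the two minimization steps referred to above are

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \,\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2

    \text{E-step:}\quad r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}
    \qquad
    \text{M-step:}\quad \boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\,\mathbf{x}_n}{\sum_n r_{nk}}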

18
(No Transcript)
19
(No Transcript)
20
Limitations of K-means
  • Hard assignments of data points to clusters: a small shift of a data point can flip it to a different cluster
  • Not clear how to choose the value of K
  • Solution: replace the hard clustering of K-means with soft, probabilistic assignments
  • This represents the probability distribution of the data as a Gaussian mixture model

21
The Gaussian Distribution
  • Multivariate Gaussian (see below)
  • Define the precision to be the inverse of the covariance
  • In 1-dimension
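
In standard notation, the multivariate and one-dimensional densities referred to above are

    \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})
      = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}
        \exp\!\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\},
      \qquad \boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1}

    \mathcal{N}(x\mid\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}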

22
Likelihood Function
  • Data set X = {x_1, ..., x_N}
  • Assume the observed data points are generated independently
  • Viewed as a function of the parameters, this is known as the likelihood function

23
Maximum Likelihood
  • Set the parameters by maximizing the likelihood
    function
  • Equivalently maximize the log likelihood

24
Maximum Likelihood Solution
  • Maximizing w.r.t. the mean gives the sample
    mean
  • Maximizing w.r.t. the covariance gives the sample covariance
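
In standard notation, the maximum likelihood solutions referred to above are

    \boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n,
    \qquad
    \boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}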

25
Bias of Maximum Likelihood
  • Consider the expectations of the maximum
    likelihood estimates under the Gaussian
    distribution
  • The maximum likelihood solution systematically
    under-estimates the covariance
  • This is an example of over-fitting

26
Intuitive Explanation of Over-fitting
27
Unbiased Variance Estimate
  • Clearly we can remove the bias by re-scaling the estimate (dividing by N - 1 instead of N), since this gives an unbiased estimate of the variance (see below)
  • Arises naturally in a Bayesian treatment (see later)
  • For an infinite data set the two expressions are equal
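
In the one-dimensional case the standard results are

    \mathbb{E}[\mu_{\mathrm{ML}}] = \mu,
    \qquad
    \mathbb{E}[\sigma^2_{\mathrm{ML}}] = \frac{N-1}{N}\,\sigma^2,
    \qquad
    \tilde{\sigma}^2 = \frac{N}{N-1}\,\sigma^2_{\mathrm{ML}} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n-\mu_{\mathrm{ML}})^2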

28
Gaussian Mixtures
  • Linear super-position of Gaussians (see below)
  • Normalization and positivity require constraints on the mixing coefficients
  • Can interpret the mixing coefficients as prior probabilities
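
In standard notation, the mixture density and the constraints on the mixing coefficients are

    p(\mathbf{x}) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k),
    \qquad 0 \le \pi_k \le 1,
    \qquad \sum_{k=1}^{K}\pi_k = 1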

29
Example: Mixture of 3 Gaussians
30
Contours of Probability Distribution
31
Surface Plot
32
Sampling from the Gaussian
  • To generate a data point:
  • first pick one of the components with probability given by its mixing coefficient
  • then draw a sample from that component's Gaussian
  • Repeat these two steps for each new data point (a code sketch follows below)
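
A minimal NumPy sketch of this ancestral sampling procedure; the function name and arguments are illustrative assumptions.

    import numpy as np

    def sample_gmm(pi, means, covs, n_samples, seed=0):
        """Draw n_samples points from a Gaussian mixture with mixing coefficients pi."""
        rng = np.random.default_rng(seed)
        # Step 1: pick a component for each sample with probability pi_k
        ks = rng.choice(len(pi), size=n_samples, p=pi)
        # Step 2: draw a sample from the chosen Gaussian component
        X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])
        return X, ks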

33
Synthetic Data Set
34
Fitting the Gaussian Mixture
  • We wish to invert this process: given the data set, find the corresponding parameters
  • mixing coefficients
  • means
  • covariances
  • If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
  • Problem: the data set is unlabelled
  • We shall refer to the labels as latent (i.e. hidden) variables

35
Synthetic Data Set Without Labels
36
Posterior Probabilities
  • We can think of the mixing coefficients as prior probabilities for the components
  • For a given value of x we can evaluate the corresponding posterior probabilities, called responsibilities
  • These are given by Bayes' theorem as (see below)
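
In standard notation, Bayes' theorem gives the responsibilities as

    \gamma_k(\mathbf{x}) \equiv p(k\mid\mathbf{x})
      = \frac{\pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}
             {\sum_{j=1}^{K}\pi_j\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}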

37
Posterior Probabilities (colour coded)
38
Posterior Probability Map
39
Maximum Likelihood for the GMM
  • The log likelihood function takes the form shown below
  • Note: the sum over components appears inside the log
  • There is no closed-form solution for the maximum likelihood parameters
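
In standard notation, the log likelihood with the sum over components inside the logarithm is

    \ln p(\mathbf{X}\mid\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma})
      = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\right\}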

40
Over-fitting in Gaussian Mixture Models
  • Singularities arise in the likelihood function when a component collapses onto a single data point: as its variance shrinks towards zero, the likelihood diverges
  • Likelihood function gets larger as we add more components (and hence parameters) to the model
  • so it is not clear how to choose the number K of components

41
Problems and Solutions
  • How to maximize the log likelihood?
  • solved by the expectation-maximization (EM) algorithm
  • How to avoid singularities in the likelihood function?
  • solved by a Bayesian treatment
  • How to choose the number K of components?
  • also solved by a Bayesian treatment

42
EM Algorithm Informal Derivation
  • Let us proceed by simply differentiating the log likelihood
  • Setting the derivative with respect to the mean of the k-th component equal to zero gives an update in which each mean is simply the responsibility-weighted mean of the data (see below)

43
EM Algorithm Informal Derivation
  • Similarly for the covariances
  • For the mixing coefficients, use a Lagrange multiplier (to enforce that they sum to one) to give the updates below
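
In standard notation, the resulting update equations are

    N_k = \sum_{n=1}^{N}\gamma(z_{nk}),
    \qquad
    \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,\mathbf{x}_n,
    \qquad
    \boldsymbol{\Sigma}_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,(\mathbf{x}_n-\boldsymbol{\mu}_k)(\mathbf{x}_n-\boldsymbol{\mu}_k)^{\mathrm{T}},
    \qquad
    \pi_k = \frac{N_k}{N}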

44
EM Algorithm Informal Derivation
  • The solutions are not closed form since they are coupled
  • Suggests an iterative scheme for solving them (a code sketch follows below)
  • Make initial guesses for the parameters
  • Alternate between the following two stages:
  • E-step: evaluate the responsibilities using the current parameters
  • M-step: update the parameters using the ML results
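
A minimal NumPy/SciPy sketch of this iterative scheme; the initialization, iteration count and the small regularization term added to the covariances are illustrative assumptions, not from the slides.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iters=100, seed=0):
        """EM for a Gaussian mixture. Returns mixing coefficients, means, covariances, responsibilities."""
        N, D = X.shape
        rng = np.random.default_rng(seed)
        # Initial guesses for the parameters
        pi = np.full(K, 1.0 / K)
        means = X[rng.choice(N, size=K, replace=False)]
        covs = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
        for _ in range(n_iters):
            # E-step: evaluate responsibilities with the current parameters (Bayes' theorem)
            dens = np.stack([pi[k] * multivariate_normal.pdf(X, means[k], covs[k])
                             for k in range(K)], axis=1)                  # shape (N, K)
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate parameters using the responsibility-weighted ML formulas
            Nk = gamma.sum(axis=0)                                        # effective counts per component
            means = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - means[k]
                covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
            pi = Nk / N
        return pi, means, covs, gamma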

45
(No Transcript)
46
(No Transcript)
47
(No Transcript)
48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
Digression: Probabilistic Graphical Models
  • Graphical representation of a probabilistic model
  • Each variable corresponds to a node in the graph
  • Links in the graph denote relations between
    variables
  • Motivation
  • visualization of models and motivation for new
    models
  • graphical determination of conditional
    independence
  • complex calculations (inference) performed using
    graphical operations (e.g. forward-backward for
    HMM)
  • Here we consider directed graphs

52
Example: 3 Variables
  • General distribution over 3 variables
  • Apply product rule of probability twice
  • Express as a directed graph

53
General Decomposition Formula
  • Joint distribution is product of conditionals,
    conditioned on parent nodes
  • Example (see below)
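
In standard notation, the three-variable factorization and the general decomposition are

    p(a,b,c) = p(c\mid a,b)\,p(b\mid a)\,p(a),
    \qquad
    p(x_1,\ldots,x_K) = \prod_{k=1}^{K} p(x_k\mid \mathrm{pa}_k)

where \mathrm{pa}_k denotes the set of parents of node x_k in the directed graph.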

54
EM Latent Variable Viewpoint
  • Binary latent variables describing which component generated each data point
  • Conditional distribution of the observed variable given the latent variables
  • Prior distribution of the latent variables
  • Marginalizing over the latent variables we obtain the Gaussian mixture (see below)
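
With a 1-of-K binary latent vector z for each data point, the distributions referred to above take the standard forms

    p(\mathbf{z}) = \prod_{k=1}^{K}\pi_k^{z_k},
    \qquad
    p(\mathbf{x}\mid\mathbf{z}) = \prod_{k=1}^{K}\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)^{z_k},
    \qquad
    p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\,p(\mathbf{x}\mid\mathbf{z})
                  = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)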

55
Expected Value of Latent Variable
  • From Bayes' theorem, the expected value of the latent variable z_nk is the responsibility γ(z_nk)

56
Graphical Representation of GMM
57
Complete and Incomplete Data
complete
incomplete
58
Graph for Complete-Data Model
59
Latent Variable View of EM
  • If we knew the values for the latent variables,
    we would maximize the complete-data log
    likelihoodwhich gives a trivial closed-form
    solution (fit each component to the corresponding
    set of data points)
  • We don't know the values of the latent variables
  • However, for given parameter values we can
    compute the expected values of the latent
    variables

60
Expected Complete-Data Log Likelihood
  • Suppose we make a guess for the parameter
    values (means, covariances and mixing
    coefficients)
  • Use these to evaluate the responsibilities
  • Consider the expected complete-data log likelihood, where the responsibilities are computed using the guessed parameter values
  • We are implicitly filling in the latent variables with their best guess
  • Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results

61
K-means Revisited
  • Consider a GMM with common covariances equal to a small multiple of the identity
  • Take the limit as this common variance goes to zero
  • Responsibilities become binary (hard assignments to the nearest mean)
  • Expected complete-data log likelihood becomes (up to constants) the negative of the K-means cost function

62
EM in General
  • Consider an arbitrary distribution q(Z) over the latent variables
  • The following decomposition always holds, where the bound and the KL divergence are defined below
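
In standard notation the decomposition is

    \ln p(\mathbf{X}\mid\boldsymbol{\theta}) = \mathcal{L}(q,\boldsymbol{\theta}) + \mathrm{KL}(q\,\|\,p)

    \mathcal{L}(q,\boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z})\,\ln\frac{p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta})}{q(\mathbf{Z})},
    \qquad
    \mathrm{KL}(q\,\|\,p) = -\sum_{\mathbf{Z}} q(\mathbf{Z})\,\ln\frac{p(\mathbf{Z}\mid\mathbf{X},\boldsymbol{\theta})}{q(\mathbf{Z})}

Since the KL divergence is non-negative, L(q, θ) is a lower bound on the log likelihood.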

63
Decomposition
64
Optimizing the Bound
  • E-step: maximize the bound with respect to q(Z)
  • equivalent to minimizing the KL divergence
  • sets q(Z) equal to the posterior distribution over the latent variables
  • M-step: maximize the bound with respect to the parameters
  • equivalent to maximizing the expected complete-data log likelihood
  • Each EM cycle must increase the incomplete-data likelihood unless it is already at a (local) maximum

65
E-step
66
M-step
67
Bayesian Inference
  • Include prior distributions over parameters
  • Advantages in using conjugate priors
  • Example: consider a single Gaussian over one variable
  • assume the variance is known and the mean is unknown
  • likelihood function for the mean
  • Choose a Gaussian prior for the mean

68
Bayesian Inference for a Gaussian
  • Posterior (proportional to the product of prior and likelihood) will then also be Gaussian, with mean and variance given below
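
With a Gaussian prior N(μ | μ_0, σ_0^2), known variance σ^2 and N observations, the standard result is

    p(\mu\mid\mathbf{x}) = \mathcal{N}(\mu\mid\mu_N,\sigma_N^2),
    \qquad
    \mu_N = \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\,\mu_{\mathrm{ML}},
    \qquad
    \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}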

69
Bayesian Inference for a Gaussian
70
Bayesian Mixture of Gaussians
  • Conjugate priors for the parameters:
  • Dirichlet prior for the mixing coefficients
  • Normal-Wishart prior for the means and precisions, where the Dirichlet and Wishart distributions are given below
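
The standard forms of these conjugate distributions are

    \mathrm{Dir}(\boldsymbol{\pi}\mid\boldsymbol{\alpha}) \propto \prod_{k=1}^{K}\pi_k^{\alpha_k-1},
    \qquad
    \mathcal{W}(\boldsymbol{\Lambda}\mid\mathbf{W},\nu) = B(\mathbf{W},\nu)\,
        |\boldsymbol{\Lambda}|^{(\nu-D-1)/2}\exp\!\left\{-\tfrac{1}{2}\mathrm{Tr}(\mathbf{W}^{-1}\boldsymbol{\Lambda})\right\}

where B(W, ν) is a normalization constant.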

71
Graphical Representation
  • Parameters and latent variables appear on equal
    footing

72
Variational Inference
  • As with many Bayesian models, exact inference for
    the mixture of Gaussians is intractable
  • Approximate Bayesian inference has traditionally been based on Laplace's method (a local Gaussian approximation to the posterior) or Markov chain Monte Carlo
  • Variational Inference is an alternative, broadly
    applicable deterministic approximation scheme

73
General View of Variational Inference
  • Consider again the previous decomposition, but where the posterior is over all latent variables and parameters
  • Maximizing the bound over q would give the true posterior distribution, but this is intractable by definition

74
Factorized Approximation
  • Goal: choose a family of distributions which are
  • sufficiently flexible to give a good posterior approximation
  • sufficiently simple to remain tractable
  • Here we consider factorized distributions
  • No further assumptions are required!
  • Optimal solution for one factor, keeping the remainder fixed (see below)
  • The solutions are coupled, so initialize and then cyclically update
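
For a factorization q(Z) = Π_i q_i(Z_i), the standard result for the optimal factor with the others held fixed is

    \ln q_j^{\star}(\mathbf{Z}_j) = \mathbb{E}_{i\neq j}\!\left[\ln p(\mathbf{X},\mathbf{Z})\right] + \mathrm{const}

where the expectation is taken with respect to all factors q_i with i ≠ j.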

75
Lower Bound
  • Can also be evaluated
  • Useful for maths/code verification
  • Also useful for model comparison

76
(No Transcript)
77
(No Transcript)
78
Illustration: Univariate Gaussian
  • Likelihood function
  • Conjugate priors
  • Factorized variational distribution

79
Variational Posterior Distribution
  • where

80
Initial Configuration
81
After Updating
82
After Updating
83
Converged Solution
84
Exact Solution
  • For this very simple example there is an exact
    solution
  • Expected precision given by
  • Compare with earlier maximum likelihood solution

85
Variational Mixture of Gaussians
  • Assume a factorized posterior distribution
  • Gives an optimal solution in which the factor over the mixing coefficients is a Dirichlet distribution and the factor over the means and precisions is Normal-Wishart
86
Sufficient Statistics
  • Small computational overhead compared to maximum
    likelihood EM

87
Variational Equations for GMM
88
Bound vs. K for Old Faithful Data
89
Bayesian Model Complexity
90
Sparse Bayes for Gaussian Mixture
  • Instead of comparing different values of K, start with a large value and prune out excess components
  • Treat the mixing coefficients as parameters, and maximize the marginal likelihood [Corduneanu and Bishop, AISTATS 2001]
  • Gives simple re-estimation equations for the mixing coefficients, interleaved with the variational updates

91
(No Transcript)
92
(No Transcript)
93
General Variational Framework
  • Currently, for each new model, we:
  • derive the variational update equations
  • write application-specific code to find the solution
  • Both stages are time-consuming and error-prone
  • Can we build a general-purpose inference engine which automates these procedures?

94
Lower Bound for GMM
95
VIBES
  • Variational Inference for Bayesian Networks
  • Bishop and Winn (1999)
  • A general inference engine using variational
    methods
  • Models specified graphically

96
Example: Mixtures of Bayesian PCA
97
Solution
98
Local Computation in VIBES
  • A key observation is that, in the general solution, the update for a particular node (or group of nodes) depends only on the other nodes in its Markov blanket
  • Permits a local, object-oriented implementation

99
Shared Hyper-parameters
100
Take-home Messages
  • Bayesian mixture of Gaussians
  • no singularities
  • determines optimal number of components
  • Variational inference
  • effective solution for Bayesian GMM
  • optimizes rigorous bound
  • little computational overhead compared to EM
  • VIBES
  • rapid prototyping of probabilistic models
  • graphical specification

101
Viewgraphs, tutorials and publications available from
  • http://research.microsoft.com/cmbishop