580.691 Learning Theory - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: 580.691 Learning Theory


1
580.691 Learning Theory
Reza Shadmehr
EM and expected complete log-likelihood
Mixture of Experts
Identification of a linear dynamical system
2
The log likelihood of the unlabeled data
Hidden variable
Measured variable
The unlabeled data
In the last lecture we assumed that, in the M step, we knew the posterior probabilities, and we found the derivatives of the log-likelihood with respect to mu and sigma in order to maximize the log-likelihood. Today we take a more general approach that expresses both the E and M steps in terms of the log-likelihood.
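The equations on this slide did not survive the transcript. As a rough sketch of what they express, the Python below computes the log-likelihood of unlabeled data under a Gaussian mixture, with the component label z as the hidden variable and x as the measured variable (the function and variable names are illustrative, not the slide's notation):

```python
import numpy as np
from scipy.stats import norm

def unlabeled_log_likelihood(x, pis, mus, sigmas):
    """Log-likelihood of unlabeled 1-D data under a Gaussian mixture.

    The hidden variable z (the component label) is marginalized out:
    log p(x) = sum_n log sum_i pi_i N(x_n | mu_i, sigma_i^2).
    """
    x = np.asarray(x, dtype=float)
    # Likelihood of each point under each weighted component, shape (N, m).
    comp = np.array([pi * norm.pdf(x, mu, sig)
                     for pi, mu, sig in zip(pis, mus, sigmas)]).T
    return np.sum(np.log(comp.sum(axis=1)))
```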
3
A more general formulation of EM: the expected complete log-likelihood
The real data is not labeled. But for now,
assume that someone labeled it, resulting in the
complete data.
Complete log-likelihood
Expected complete log-likelihood
In EM, in the E step we fix theta and maximize the expected complete log-likelihood by setting the expected values of our hidden variables z to the posterior probabilities. In the M step, we fix the expected values of z and maximize the expected complete log-likelihood with respect to the parameters theta.
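As a sketch of the quantities named above, written in generic mixture-model notation rather than the slide's own symbols:

```latex
% Complete log-likelihood: each point x^{(n)} carries an indicator z_i^{(n)}
\ell_c(\theta) = \sum_n \sum_i z_i^{(n)}
  \left[ \log \pi_i + \log p\!\left(x^{(n)} \mid \theta_i\right) \right]

% Expected complete log-likelihood: replace z_i^{(n)} by its expected value,
% i.e. the posterior probability of component i given x^{(n)}
\left\langle \ell_c(\theta) \right\rangle = \sum_n \sum_i
  \left\langle z_i^{(n)} \right\rangle
  \left[ \log \pi_i + \log p\!\left(x^{(n)} \mid \theta_i\right) \right],
\qquad
\left\langle z_i^{(n)} \right\rangle = p\!\left(z_i^{(n)} = 1 \mid x^{(n)}, \theta\right)
```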
4
A more general formulation of EM Expected
complete log likelihood
In the M step, we fix expected value of z and try
to maximize the expected complete log-likelihood
by setting parameters theta.
Expected complete log-likelihood
5
(No Transcript)
6
Function to maximize
The value of pi that maximizes this function is one. But that's not interesting, because we also have another constraint: the sum of the priors should be one. So we want to maximize this function subject to the constraint that the sum of the pi_i is 1.
constraint
7
Function to maximize
Function to minimize
constraint
We have three such equations, one for each pi_i. If we add the equations together, we get
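The equations themselves are missing from the transcript; a standard reconstruction of this constrained maximization, in generic mixture notation, is:

```latex
% Maximize the pi-dependent part of the expected complete log-likelihood
% subject to sum_i pi_i = 1, using a Lagrange multiplier lambda:
J(\pi, \lambda) = \sum_n \sum_i \left\langle z_i^{(n)} \right\rangle \log \pi_i
  - \lambda \left( \sum_i \pi_i - 1 \right)

\frac{\partial J}{\partial \pi_i} =
  \frac{1}{\pi_i} \sum_n \left\langle z_i^{(n)} \right\rangle - \lambda = 0
  \quad\Rightarrow\quad
  \pi_i = \frac{1}{\lambda} \sum_n \left\langle z_i^{(n)} \right\rangle

% Adding the equations (one per component) and using sum_i pi_i = 1 and
% sum_i <z_i^(n)> = 1 gives lambda = N, so
\pi_i = \frac{1}{N} \sum_n \left\langle z_i^{(n)} \right\rangle
```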
8
EM algorithm: summary
We begin with a guess about the mixture parameters.
The E step: calculate the expected complete log-likelihood. In the mixture example, this reduces to just computing the posterior probabilities.
The M step: maximize the expected complete log-likelihood with respect to the model parameters theta.
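A minimal Python sketch of this E/M loop for a one-dimensional Gaussian mixture (illustrative only; the function and variable names are not from the slides):

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(x, m, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture with m components."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Begin with a guess about the mixture parameters.
    pis = np.full(m, 1.0 / m)
    mus = rng.choice(x, m, replace=False)
    sigmas = np.full(m, x.std())

    for _ in range(n_iter):
        # E step: posterior probability of each component for each point.
        lik = np.array([pi * norm.pdf(x, mu, sig)
                        for pi, mu, sig in zip(pis, mus, sigmas)]).T  # (N, m)
        post = lik / lik.sum(axis=1, keepdims=True)

        # M step: maximize the expected complete log-likelihood w.r.t. theta.
        Nk = post.sum(axis=0)
        pis = Nk / len(x)
        mus = (post * x[:, None]).sum(axis=0) / Nk
        sigmas = np.sqrt((post * (x[:, None] - mus) ** 2).sum(axis=0) / Nk)

    return pis, mus, sigmas
```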
9
Selecting number of mixture components
A simple idea that helps with the selection of the number of mixture components is to form a cost that depends on both the log-likelihood of the data and the number of parameters used in the model. As the number of parameters increases, the log-likelihood increases. We want a cost that balances the change in the log-likelihood against the cost of adding parameters. A common technique is to choose the number of mixture components m that minimizes the description length.
The effective number of parameters in the model
Minimize the description length
Maximum likelihood estimate of parameters for m
mixture components
Number of data points
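Written out in its standard form (the slide's own equation is not in the transcript), the description-length criterion collects exactly these quantities:

```latex
% Choose m to minimize the description length:
\mathrm{DL}(m) = -\log p\!\left(\text{data} \mid \hat{\theta}_m\right)
  + \frac{d_m}{2} \log n
% \hat{\theta}_m : maximum-likelihood estimate for m mixture components
% d_m           : effective number of parameters in the model
% n             : number of data points
```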
10
Mixture of Experts
The data set (x, y) is clearly non-linear, but we can break it up into two linear problems. We will try to switch from one expert to the other at around x0.
(Figure: the data with the fits of Expert 1 and Expert 2, and the moderator's output, the conditional probability of choosing expert 2.)
11
We have observed a sequence of data points (x, y), and believe that it was generated by the process shown to the right. Note that y depends on both x (which we can measure) and z, which is hidden from us. For example, the dependence of y on x might be a simple linear model, but conditioned on z, where z is a multinomial.
The Moderator (gating network)
When there are only two experts, the moderator
can be a logistic function
When there are multiple experts, the moderator
can be a soft-max function
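A small Python sketch of both forms of moderator, logistic for two experts and softmax for several. The parameterization g_i(x) = softmax(eta_i^T x) is an assumption standing in for the slide's missing equations:

```python
import numpy as np

def logistic_moderator(x, eta):
    """Two-expert gate: P(choose expert 2 | x) = 1 / (1 + exp(-eta^T x))."""
    return 1.0 / (1.0 + np.exp(-x @ eta))

def softmax_moderator(x, etas):
    """Multi-expert gate: g_i(x) = exp(eta_i^T x) / sum_j exp(eta_j^T x).

    x    : (N, d) inputs (include a column of ones for a bias term)
    etas : (d, m) one parameter vector per expert
    """
    a = x @ etas                       # (N, m) activations
    a -= a.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)
```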
12
Based on our hypothesis, we should have the following distribution of observed data.
A key quantity is the posterior probability of the latent variable z.
Parameters of the moderator
Parameters of the expert
Posterior probability that the observed y
belongs to the i-th expert.
Note that the posterior probability for the i-th expert is updated based on how probable the observed data y was for this expert. In a way, the expression tells us how strongly we should assign the observed data y to expert i.
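In symbols (a reconstruction in generic mixture-of-experts notation, since the transcript drops the slide's equation), the posterior for the i-th expert is:

```latex
h_i^{(n)} \;=\; P\!\left(z^{(n)} = i \mid x^{(n)}, y^{(n)}\right)
  \;=\; \frac{ g_i\!\left(x^{(n)}\right)\,
               p\!\left(y^{(n)} \mid x^{(n)}, \theta_i\right) }
             { \sum_j g_j\!\left(x^{(n)}\right)\,
               p\!\left(y^{(n)} \mid x^{(n)}, \theta_j\right) }
% g_i(x): moderator output, the prior probability of expert i given x
% p(y | x, theta_i): likelihood of the observed y under expert i
```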
13
Output of the i-th expert
Output of the moderator
Parameters of the i-th expert
Output of the whole network
Suppose there are two experts (m = 2). For a given value of x, the two regressions each give us a Gaussian distribution centered at their mean. Therefore, for each value of x, we have a bimodal probability distribution for y. We have a mixture distribution in the output space y for each input value of x.
The log-likelihood of the observed data.
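A sketch of the missing expressions, assuming linear-Gaussian experts with weights w_i and variances sigma_i^2 as in the regression picture above (the notation is a reconstruction, not the slide's own):

```latex
% Output of the whole network: a mixture over experts for each x
p\!\left(y \mid x, \theta\right)
  = \sum_{i=1}^{m} g_i(x)\,
    \mathcal{N}\!\left(y \;\middle|\; w_i^{\top} x,\; \sigma_i^{2}\right)

% Log-likelihood of the observed data
\ell(\theta) = \sum_{n} \log \sum_{i=1}^{m} g_i\!\left(x^{(n)}\right)\,
  \mathcal{N}\!\left(y^{(n)} \;\middle|\; w_i^{\top} x^{(n)},\; \sigma_i^{2}\right)
```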
14
The complete log-likelihood for the mixture of
experts problem
The complete data
Complete log-likelihood
Expected complete log-likelihood (assuming that
someone had given us theta)
15
The E step for the mixture of experts problem
In the E step, we begin by assuming that we have
theta. To compute the expected complete
log-likelihood, all we need are the posterior
probabilities.
The posterior for each expert depends on the
likelihood that the observed data y came from
that expert.
16
The M step for the mixture of experts problem
the moderator
Exactly the same as the IRLS cost function. We
find first and second derivatives and find a
learning rule
The moderator learns from the posterior
probability.
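A minimal sketch of such a Newton/IRLS-style update for a two-expert logistic moderator, with the E-step posteriors as soft targets (the function name and parameterization are assumptions, not the slides' notation):

```python
import numpy as np

def moderator_m_step(X, h2, eta, n_newton=10):
    """Newton/IRLS-style M-step update for a two-expert logistic moderator.

    Maximizes sum_n [h2_n log g_n + (1 - h2_n) log(1 - g_n)], i.e. the
    moderator is fit to the posterior probabilities h2 as soft targets.

    X   : (N, d) inputs (include a bias column)
    h2  : (N,)  posterior probability of expert 2 for each point
    eta : (d,)  current gating parameters
    """
    for _ in range(n_newton):
        g = 1.0 / (1.0 + np.exp(-X @ eta))    # moderator output
        grad = X.T @ (h2 - g)                 # first derivative
        W = g * (1.0 - g)                     # IRLS weights
        H = X.T @ (X * W[:, None])            # negative of the second derivative
        eta = eta + np.linalg.solve(H, grad)  # Newton step
    return eta
```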
17
The M step for the mixture of experts problem: the weights of the expert
A weighted least-squares problem.
Expert i learns from the observed data point y, weighted by the posterior probability that the error came from that expert.
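A minimal sketch of the weighted least-squares solve for one expert, with the posteriors as weights (names are illustrative):

```python
import numpy as np

def expert_m_step(X, y, h_i):
    """Weighted least squares for expert i.

    Each data point is weighted by h_i, the posterior probability that the
    point belongs to expert i, so the expert fits the data it is responsible
    for:  w_i = (X^T H X)^{-1} X^T H y,  with H = diag(h_i).
    """
    XtH = X.T * h_i                          # (d, N): columns scaled by h_i
    w_i = np.linalg.solve(XtH @ X, XtH @ y)  # solve the normal equations
    return w_i
```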
18
The M step for the mixture of experts problem: the variance of the expert
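The slide's equation is not in the transcript; the standard weighted update it refers to, under the linear-Gaussian expert assumption used above, is:

```latex
% Maximizing the expected complete log-likelihood with respect to
% sigma_i^2 gives a posterior-weighted residual variance:
\hat{\sigma}_i^{2} \;=\;
  \frac{ \sum_n h_i^{(n)} \left( y^{(n)} - w_i^{\top} x^{(n)} \right)^{2} }
       { \sum_n h_i^{(n)} }
```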
19
Parameter Estimation for Linear Dynamical Systems using EM
Objective: to find the parameters A, B, C, Q, and R of a linear dynamical system from a set of data that includes inputs u and outputs y.
We need to find the expected complete log-likelihood.
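For reference, a standard statement of the model (the slide's equations are not in the transcript); here the hidden state is written x_t, the E step estimates the posterior over the states (typically via a Kalman smoother), and the M step re-estimates A, B, C, Q, R:

```latex
% State-space model with hidden state x_t, input u_t, output y_t:
x_{t+1} = A\,x_t + B\,u_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q)
\\
y_t = C\,x_t + v_t, \qquad\qquad\;\;\; v_t \sim \mathcal{N}(0, R)
```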
20
(No Transcript)
21
(No Transcript)
22
Posterior estimate of state and variance
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)