Title: Maximum Likelihood
1. Maximum Likelihood
2. Much estimation theory is presented in a rather ad hoc fashion. Minimising squared errors seems a good idea, but why not minimise the absolute error or the cube of the absolute error? The answer is that there is an underlying approach which justifies a particular minimisation strategy conditional on certain assumptions. This is the maximum likelihood principle.
3. The idea is to assume a particular model with unknown parameters; we can then define the probability of observing a given event conditional on a particular set of parameter values. We have observed a set of outcomes in the real world, so it is possible to choose the set of parameters which is most likely to have produced the observed results. This is maximum likelihood. In most cases it is both consistent and efficient, and it provides a standard against which to compare other estimation techniques.
4. An example. Suppose we sample a set of goods for quality and find 5 defective items in a sample of 10. What is our estimate of the proportion of bad items in the whole population? Intuitively, of course, it is 50%. Formally, if the population proportion of bad items is \pi, the probability of finding B bad items in a sample of size n is

P(B) = \binom{n}{B} \pi^{B} (1-\pi)^{n-B}
5. If the true proportion is \pi = 0.1 then P = 0.0015; if it equals 0.2 then P = 0.0254, and so on, so we could search for the most likely value. Or we can solve the problem analytically: the first-order condition for a maximum of \log P with respect to \pi is

\frac{\partial \log P}{\partial \pi} = \frac{B}{\pi} - \frac{n - B}{1 - \pi} = 0 \quad\Rightarrow\quad \hat{\pi} = \frac{B}{n}
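The numerical search mentioned above is easy to carry out directly. A minimal sketch in Python (mine, not part of the original notes; the names are illustrative) evaluates the binomial likelihood of the 5-out-of-10 sample over a grid of candidate proportions:

from math import comb

def binom_likelihood(p, n=10, b=5):
    # Probability of observing b defective items in a sample of n
    # when the population proportion of defectives is p.
    return comb(n, b) * p**b * (1 - p)**(n - b)

# Grid search over candidate proportions.
grid = [i / 100 for i in range(1, 100)]
likelihoods = [(p, binom_likelihood(p)) for p in grid]
p_hat, l_max = max(likelihoods, key=lambda pair: pair[1])
print(p_hat, l_max)

The grid maximum sits at p = 0.5, matching the analytic result B/n.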
6. So the maximum likelihood estimate of the population proportion of bad items is 0.5. This basic procedure can be applied in many cases: once we can define the probability (density) function for a particular event, we have a general estimation strategy.
7. A general statement. Consider a sample (X_1, ..., X_n) which is drawn from a probability distribution P(X|A), where A is a vector of parameters. If the Xs are independent, each with probability density function P(X_i|A), the joint probability of the whole set is

L(A) = \prod_{i=1}^{n} P(X_i \mid A)

This is the likelihood function, and it may be maximised with respect to A to give the maximum likelihood estimates.
8. It is often convenient to work with the log of the likelihood function,

\log L(A) = \sum_{i=1}^{n} \log P(X_i \mid A)

The advantage of this approach is that it is extremely general, but if the model is misspecified the estimates may be particularly sensitive to that misspecification.
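As a concrete illustration of how general the recipe is (my example, not from the notes): assume an i.i.d. sample from an exponential distribution with rate lam, build the log-likelihood, and maximise it numerically; the answer should match the textbook estimate 1/mean(x).

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.8, 1.3, 0.2, 2.1, 0.9, 1.7])   # some observed data (illustrative)

def neg_log_likelihood(lam):
    # log f(x_i | lam) = log(lam) - lam * x_i for an exponential model
    return -np.sum(np.log(lam) - lam * x)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method='bounded')
print(res.x, 1 / x.mean())   # the two should agree closely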
9. The likelihood function for the general non-linear model. If Y is a vector of n endogenous variables generated by a (possibly non-linear) model with jointly normal errors, then the likelihood function for one period is the multivariate normal density of that period's errors.
10. Dropping some constants and taking logs, if the covariance structure is constant over time and has zero off-diagonal elements, this reduces to single-equation OLS (a single-equation sketch is given below).
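A single-equation sketch of this reduction (my notation, assuming i.i.d. normal errors with constant variance \sigma^2):

\log L(\theta, \sigma^2) = -\frac{T}{2}\log 2\pi - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\bigl(y_t - f(x_t,\theta)\bigr)^2

For any fixed \sigma^2, maximising this over \theta is the same as minimising \sum_t (y_t - f(x_t,\theta))^2, i.e. (non-linear) least squares.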
11. Two important matrices. The efficient score matrix: this is made up of the first derivatives of the log-likelihood at each point in time. It is a measure of the dispersion of the maximum likelihood estimate.
12. The information matrix: this is defined as (minus the expectation of) the Hessian, the matrix of second derivatives of the log-likelihood. It is a measure of how 'pointy' the likelihood function is. The variance of the parameters is given either by the inverse Hessian or by the outer product of the score matrix.
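In symbols (my notation, for a log-likelihood \log L(\theta) = \sum_t \log l_t(\theta)):

s_t(\theta) = \frac{\partial \log l_t(\theta)}{\partial \theta}, \qquad I(\theta) = -E\left[\frac{\partial^2 \log L(\theta)}{\partial \theta \, \partial \theta'}\right]

and the variance of \hat{\theta} can be estimated either by I(\hat{\theta})^{-1} or by the outer-product form \bigl(\sum_t s_t(\hat{\theta}) s_t(\hat{\theta})'\bigr)^{-1}.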
13. The Cramer-Rao lower bound. This is an important theorem which establishes the efficiency of the ML estimate relative to all other estimators. The Cramer-Rao lower bound is the smallest theoretical variance which can be achieved. ML attains this bound (asymptotically), so any other estimation technique can at best only equal it. The formal statement is the Cramer-Rao inequality, given below.
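The standard statement (my notation), for any unbiased estimator \tilde{\theta} of \theta:

\operatorname{Var}(\tilde{\theta}) \;\ge\; I(\theta)^{-1}

where I(\theta) is the information matrix defined above; the variance of the ML estimator approaches this bound asymptotically.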
14. Concentrating the likelihood function. Suppose we split the parameter vector into two sub-vectors, \theta = (\theta_1, \theta_2). Now suppose we knew \theta_1; then sometimes we can derive a formula for the ML estimate of \theta_2, say \hat{\theta}_2 = h(\theta_1). Then we could write the likelihood function as L^{c}(\theta_1) = L(\theta_1, h(\theta_1)); this is the concentrated likelihood function. This process is often very useful in practical estimation as it reduces the number of parameters which need to be estimated.
15. An example of concentrating the LF. The likelihood function for a standard single-variable normal non-linear model is the normal log-likelihood given above. We can concentrate this with respect to the variance as follows: write down the first-order condition for a maximum with respect to the variance.
16. This implies that the ML estimate of the variance is the average squared residual; substituting it back in gives the concentrated log-likelihood (the algebra is sketched below).
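A sketch of the algebra (my notation, assuming i.i.d. normal errors e_t(\theta) = y_t - f(x_t,\theta)):

\log L(\theta,\sigma^2) = -\frac{T}{2}\log 2\pi - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_t e_t(\theta)^2

\frac{\partial \log L}{\partial \sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_t e_t(\theta)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2(\theta) = \frac{1}{T}\sum_t e_t(\theta)^2

\log L^{c}(\theta) = \text{const} - \frac{T}{2}\log\!\left(\frac{1}{T}\sum_t e_t(\theta)^2\right)

so maximising the concentrated likelihood over \theta alone is again equivalent to minimising the sum of squared residuals.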
17. Prediction error decomposition. We assumed above that the observations were independent. This will not generally be true, especially in the presence of lagged dependent variables. However, the prediction error decomposition allows us to extend standard ML procedures to dynamic models. From the basic definition of conditional probability, P(A, B) = P(A|B) P(B), and this may be applied directly to the likelihood function.
18. The first term is then the conditional probability of the latest observation given all past values. We can condition the second term in the same way, and so on, so that the likelihood is built from a series of one-step-ahead prediction errors conditional on actual lagged Y.
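Written out (my notation, for a sample Y_1, ..., Y_T):

P(Y_T, Y_{T-1}, \ldots, Y_1) = P(Y_T \mid Y_{T-1}, \ldots, Y_1)\, P(Y_{T-1} \mid Y_{T-2}, \ldots, Y_1) \cdots P(Y_1)

\log L = \sum_{t=1}^{T} \log P(Y_t \mid Y_{t-1}, \ldots, Y_1)

so each term is the density of the one-step-ahead prediction error for period t.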
19. Testing hypotheses. If a restriction on a model is acceptable, this means that the reduction in the likelihood value caused by imposing the restriction is not 'significant'. This gives us a very general basis for constructing hypothesis tests, but to implement the tests we need some definite metric to judge them against, i.e. what counts as significant.
20. [Figure: the likelihood function L, marking the unrestricted maximum Lu and the restricted value LR.]
21. Consider how the likelihood function changes as we move around the parameter space. We can evaluate this by taking a Taylor series expansion around the ML point; at the maximum the first-derivative term is of course zero.
22. So the change in the log-likelihood as we move away from the maximum is approximately a quadratic form in the second derivatives. It is possible to demonstrate that twice the fall in the log-likelihood is asymptotically distributed as \chi^2(m), where m is the number of restrictions.
23. And so this gives us a metric for judging the significance of likelihood-based tests.
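In the usual notation (mine), with \log L_u the unrestricted maximum and \log L_r the maximum subject to m restrictions:

LR = 2\,(\log L_u - \log L_r) \;\sim\; \chi^2(m) \quad \text{asymptotically}

so a restriction is rejected when twice the drop in the log-likelihood exceeds the relevant \chi^2 critical value.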
24. Three test procedures. To construct the basic test we need an estimate of the likelihood value at the unrestricted point and at the restricted point, and we compare these two. There are three ways of deriving this. The likelihood ratio test: we simply estimate the model twice, once unrestricted and once restricted, and compare the two likelihood values. The Wald test: this estimates only the unrestricted point and uses an estimate of the second derivatives to 'guess' at the restricted point; standard 't' tests are a form of Wald test. The Lagrange multiplier test: this estimates only the restricted model and again uses an estimate of the second derivatives to 'guess' at the unrestricted point.
25. [Figure: the likelihood function L, marking the unrestricted maximum Lu and the restricted value LR.] If the likelihood function were quadratic then LR = LM = W. In general, however, W \ge LR \ge LM.
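For reference, the standard forms of the three statistics (textbook notation, not necessarily that of the original slides), with g(\theta) = 0 the m restrictions, \hat{\theta} the unrestricted and \tilde{\theta} the restricted estimate, s(\cdot) the score and I(\cdot) the information matrix:

LR = 2\,\bigl(\log L(\hat{\theta}) - \log L(\tilde{\theta})\bigr), \qquad W = g(\hat{\theta})'\,\bigl[G\, I(\hat{\theta})^{-1} G'\bigr]^{-1} g(\hat{\theta}), \qquad LM = s(\tilde{\theta})'\, I(\tilde{\theta})^{-1}\, s(\tilde{\theta})

where G = \partial g/\partial \theta'. All three are asymptotically \chi^2(m) under the null.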
26. A special form of the LM test. The LM test can be calculated in a particularly convenient way under certain circumstances. The general form of the LM test is the score form given above. Now suppose (as in a regression model with normal errors) that a subset of the parameters, \theta_1, is fixed according to a set of restrictions g = 0, with G the derivative of this restriction.
27. Working through the score and the information matrix for this case, the LM test becomes a quadratic form in the restricted residuals and G.
28. This may be interpreted as TR^2 from a regression of e on G, where e is the vector of residuals from the restricted model, G contains the derivatives of the restrictions, T is the sample size and R^2 is the coefficient of determination of that auxiliary regression. This form is used in many tests for serial correlation, heteroskedasticity, functional form, etc.
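One standard way the TR^2 form arises (a textbook sketch, not necessarily the exact derivation in the slides): with normal errors the score is proportional to G'e/\sigma^2 and the information matrix to G'G/\sigma^2, so

LM = \frac{e'G\,(G'G)^{-1}G'e}{\hat{\sigma}^2}, \qquad \hat{\sigma}^2 = \frac{e'e}{T} \;\Rightarrow\; LM = T\,\frac{e'G(G'G)^{-1}G'e}{e'e} = T R^2

which is T times the (uncentred) R^2 from regressing e on G.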
29. An example: serial correlation. Suppose the errors of a regression model follow an autoregressive process of order m. The restriction that the autoregressive coefficients are all zero may be tested with an LM test as follows: estimate the model without serial correlation; save the residuals u; then regress u on the original regressors and on m lags of u. TR^2 from this auxiliary regression is an LM(m) test for serial correlation.
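A sketch of this procedure in Python (my code; the data and variable names are illustrative):

import numpy as np

def lm_serial_correlation(y, X, m):
    # Breusch-Godfrey-style LM(m) test; X is assumed to contain a constant column.
    # Step 1: estimate the restricted model (no serial correlation) and save residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta
    n = len(u)

    # Step 2: regress u on X and m lags of u (the first m observations are dropped).
    lags = np.column_stack([u[m - j - 1: n - j - 1] for j in range(m)])
    Z = np.column_stack([X[m:], lags])
    gamma, *_ = np.linalg.lstsq(Z, u[m:], rcond=None)
    resid = u[m:] - Z @ gamma

    # TR^2 from the auxiliary regression, asymptotically chi^2(m) under the null.
    T = n - m
    tss = np.sum((u[m:] - u[m:].mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / tss
    return T * r2

# Illustrative use with simulated data:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=200)
print(lm_serial_correlation(y, X, m=2))   # compare with the chi^2(2) 5% critical value, 5.99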
30. Quasi maximum likelihood. ML rests on the assumption that the errors follow a particular distribution (OLS is only ML if the errors are normal, etc.). What happens if we make the wrong assumption? White (1982), Econometrica, 50(1), p. 1, demonstrates that, under very broad assumptions about the misspecification of the error process, ML is still a consistent estimator. The estimation is then referred to as quasi maximum likelihood. But the covariance matrix is no longer the standard ML one; instead it is given by a 'sandwich' of the Hessian and the outer product of the scores (see below). Generally we may construct valid Wald and LM tests by using this corrected covariance matrix, but the LR test is invalid as it works directly from the value of the likelihood function.
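The robust (sandwich) covariance usually quoted in this context, in my notation with H the Hessian of the log-likelihood and S the matrix of scores:

\operatorname{Var}(\hat{\theta}) = H^{-1}\,(S'S)\,H^{-1}

which collapses to the usual -H^{-1} (equivalently (S'S)^{-1}) when the distributional assumption is correct, by the information matrix equality.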
31. Numerical optimisation. In simple cases (e.g. OLS) we can calculate the maximum likelihood estimates analytically, but in many cases we cannot; we then resort to numerical optimisation of the likelihood function. This amounts to 'hill climbing' in parameter space. There are many algorithms, and many computer programmes implement these for you, but it is useful to understand the broad steps of the procedure.
32.
- 1. Set an arbitrary initial set of parameters.
- 2. Determine a direction of movement.
- 3. Determine a step length to move.
- 4. Examine some termination criteria and either stop or go back to 2.
(A minimal code sketch of this loop follows below.)
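A minimal sketch of these four steps in Python (my code, not from the notes), reusing the exponential log-likelihood from the earlier sketch and using a numerical derivative in place of the analytic score:

import numpy as np

x = np.array([0.8, 1.3, 0.2, 2.1, 0.9, 1.7])

def log_lik(lam):
    return np.sum(np.log(lam) - lam * x)

# 1. arbitrary starting value
lam = 0.1
for it in range(200):
    # 2. direction of movement: numerical first derivative (gradient)
    h = 1e-6
    grad = (log_lik(lam + h) - log_lik(lam - h)) / (2 * h)
    # 3. step length: a small fixed step scaled by the gradient
    step = 0.01 * grad
    # 4. termination criterion: stop when the step is negligible
    if abs(step) < 1e-8:
        break
    lam += step

print(lam, 1 / x.mean())   # should be close to the analytic ML estimate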
33. [Figure: hill-climbing steps on the likelihood surface L towards the maximum Lu.]
34. Important classes of maximisation techniques. Gradient methods: these base the direction of movement on the first derivatives of the LF with respect to the parameters. Often the step length is also determined by (an approximation to) the second derivatives, so a typical updating rule is

\theta_{i+1} = \theta_i - \lambda_i H_i^{-1} g_i

where g_i is the gradient, H_i the (approximate) Hessian and \lambda_i the step length. These methods include Newton, quasi-Newton, scoring, steepest descent, Davidon-Fletcher-Powell, BHHH, etc.
35. Derivative-free techniques: these do not use derivatives, and so they are less efficient but more robust to extreme non-linearities, e.g. Powell's method or the non-linear simplex. All of these techniques can be sensitive to starting values and 'tuning' parameters.
36. Some special LFs. Qualitative response models: these are cases where we have only partial information (insects and poison) in one form or another. We assume an underlying continuous model for y, but we only observe certain limited information, e.g. an indicator z equal to 1 or 0 depending on the value of y.
37. We can then group the data into the two groups (z = 1 and z = 0) and form a likelihood function from the probabilities of each outcome, where F is a particular cumulative distribution function, e.g. the standard normal cumulative function (probit model) or the logistic function (logit model).
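The usual form of this likelihood (textbook notation, with F evaluated at the systematic part x_i'\beta of the underlying model):

L(\beta) = \prod_{i: z_i = 1} F(x_i'\beta) \prod_{i: z_i = 0} \bigl(1 - F(x_i'\beta)\bigr), \qquad \log L(\beta) = \sum_i \Bigl[ z_i \log F(x_i'\beta) + (1 - z_i)\log\bigl(1 - F(x_i'\beta)\bigr) \Bigr]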
38. ARCH and GARCH. These are an important class of models which have time-varying variances. Suppose the error variance in period t depends on past squared errors (and, for GARCH, on its own past values); then the likelihood function for this model is a specialisation of the general normal LF with a time-varying variance.
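A common way of writing this (my notation, for a simple ARCH(1) case):

y_t = x_t'\beta + \varepsilon_t, \qquad \varepsilon_t \mid \Omega_{t-1} \sim N(0, h_t), \qquad h_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2

\log L = \text{const} - \frac{1}{2}\sum_t \left[ \log h_t + \frac{\varepsilon_t^2}{h_t} \right]

A GARCH(1,1) variance would add a lagged variance term, h_t = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \beta_1 h_{t-1}.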
39. An alternative approach: the method of moments. A widely used technique in estimation is the Generalised Method of Moments (GMM), which is an extension of the standard method of moments. The idea here is that if we have random drawings from an unknown probability distribution, then the sample statistics we calculate will converge in probability to constants, and these constants will be functions of the unknown parameters of the distribution. If we want to estimate k of these parameters, we compute k statistics (or moments) whose probability limits are known functions of the parameters.
40. These k sample moments are set equal to the functions of the parameters which generate them, and the resulting system is inverted to recover the parameters.
41. A simple example. Suppose the first moment (the mean) of the distribution is a known function of a single parameter. The observed moment from a sample of n observations is simply the sample mean; setting the sample mean equal to the theoretical mean and inverting gives the method-of-moments estimate of the parameter.
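The particular distribution used on the original slide is not recoverable here, so as a purely illustrative stand-in take an exponential distribution with parameter \lambda, for which E(X) = 1/\lambda:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{x} = \frac{1}{\hat{\lambda}} \;\Rightarrow\; \hat{\lambda} = \frac{1}{\bar{x}}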
42. Method of moments estimation (MM). This is a direct extension of the method of moments into a much more useful setting. The idea here is that we have a model which implies certain things about the distribution or covariances of the variables and the errors, so we know what some moments of the distribution should be. We then invert the model to give us estimates of the unknown parameters which match the theoretical moments for a given sample. So suppose we have a model with k parameters \theta, and we have k conditions (or moments) which should be met by the model.
43. We then approximate the expectation of each moment condition with its sample counterpart and invert the system for \theta.
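In a standard notation (mine), with g a vector of k moment functions implied by the model:

E\bigl[g(x_t, \theta)\bigr] = 0, \qquad \frac{1}{T}\sum_{t=1}^{T} g(x_t, \hat{\theta}) = 0

The k sample equations are solved for the k unknown parameters \hat{\theta}.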
44. Examples: OLS. In OLS estimation we make the assumption that the regressors (Xs) are orthogonal to the errors, so the expected cross-product of each regressor with the error is zero. The sample analogue for each x_i is that the average cross-product of x_i with the residuals is zero, and so the method-of-moments estimator in this case is the value of \beta which simultaneously solves these equations. This will be identical to the OLS estimate (a sketch is given below).
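Writing this out (textbook notation):

E\bigl[x_{it}(y_t - x_t'\beta)\bigr] = 0 \quad \text{for each regressor } i, \qquad \frac{1}{T}\sum_{t=1}^{T} x_t (y_t - x_t'\hat{\beta}) = 0 \;\Rightarrow\; \hat{\beta} = (X'X)^{-1}X'y

which is exactly the OLS estimator.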
45. Maximum likelihood as an MM estimator. In maximum likelihood we have a general likelihood function, and this will be maximised when k first-order conditions are met: the expected score is zero. This gives rise to k sample conditions, namely that the sample average of the scores is zero. Simultaneously solving these equations for \theta gives the MM equivalent of maximum likelihood.
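In symbols (my notation):

E\!\left[\frac{\partial \log l_t(\theta)}{\partial \theta}\right] = 0, \qquad \frac{1}{T}\sum_{t=1}^{T} \frac{\partial \log l_t(\hat{\theta})}{\partial \theta} = 0

so treating the k score equations as moment conditions reproduces the ML estimator.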
46. Generalised Method of Moments (GMM). In the previous examples there are as many moment conditions as unknown parameters, so the parameters are uniquely and exactly determined. If there were fewer moment conditions we would not be able to solve them for a unique set of parameters (the model would be under-identified). If there are more moment conditions than parameters then all the conditions cannot be met at the same time; the model is over-identified and we have GMM estimation. Basically, if we cannot satisfy all the conditions at the same time we have to trade them off against each other, making them all as close to zero as possible simultaneously. We therefore need a criterion function to minimise.
47. Suppose we have k parameters but L moment conditions, with L > k. Then we need to make all L moments as small as possible simultaneously. One way is a weighted least squares criterion: the weighted sum of squares of the moments. This gives a consistent estimator for any positive definite weighting matrix A (provided A is not a function of \theta).
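A standard way of writing the criterion (my notation), with \bar{m}(\theta) = \frac{1}{T}\sum_t g(x_t,\theta) the vector of L sample moments:

\hat{\theta} = \arg\min_{\theta}\; \bar{m}(\theta)'\, A\, \bar{m}(\theta)

for some positive definite L \times L weighting matrix A.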
48. The optimal A. If any positive definite weighting matrix gives a consistent estimator, the resulting estimators clearly cannot all be equally efficient, so what is the optimal choice of A? Hansen (1982) established the basic properties of the optimal A and showed how to construct the covariance of the parameter estimates. The optimal A is simply the inverse of the covariance matrix of the moment conditions (just as GLS weights by the inverse of the error covariance); it is written out below.
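In symbols (standard notation, not necessarily that of the slides), with S = \operatorname{Var}\bigl(\sqrt{T}\,\bar{m}(\theta_0)\bigr) the covariance matrix of the moment conditions:

A^{*} = S^{-1}, \qquad \hat{\theta} = \arg\min_{\theta}\; \bar{m}(\theta)'\, S^{-1}\, \bar{m}(\theta)

In practice S is replaced by a consistent estimate built from a first-stage set of parameter estimates.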
49. The parameters which solve this criterion function are consistent and asymptotically normal, with a covariance matrix built from G, the matrix of derivatives of the moments with respect to the parameters, and from the covariance of the moments, both evaluated at the true parameter value (see below).
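The standard statement of these properties (textbook notation, under the optimal weighting matrix):

\sqrt{T}\,(\hat{\theta} - \theta_0) \;\xrightarrow{d}\; N\!\bigl(0,\; (G' S^{-1} G)^{-1}\bigr), \qquad G = E\!\left[\frac{\partial g(x_t,\theta_0)}{\partial \theta'}\right]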
50. Conclusion
- Both ML and GMM are very flexible estimation strategies.
- They are equivalent ways of approaching the same problem in many instances.