Title: Maximum Likelihood
1. Maximum Likelihood
2. Much estimation theory is presented in a rather ad hoc fashion. Minimising squared errors seems a good idea, but why not minimise the absolute error or the cube of the absolute error? The answer is that there is an underlying approach which justifies a particular minimisation strategy conditional on certain assumptions. This is the maximum likelihood principle.
3. The idea is to assume a particular model with unknown parameters; we can then define the probability of observing a given event conditional on a particular set of parameter values. We have observed a set of outcomes in the real world, so it is possible to choose the set of parameters which is most likely to have produced the observed results. This is maximum likelihood. In most cases it is both consistent and efficient, and it provides a standard against which to compare other estimation techniques.
4. An example. Suppose we sample a set of goods for quality and find 5 defective items in a sample of 10. What is our estimate of the proportion of bad items in the whole population? Intuitively, of course, it is 50%. Formally, if the population proportion of bad items is \pi, the probability of finding B bad items in a sample of size n is

P(B) = \binom{n}{B} \pi^{B} (1-\pi)^{n-B}
5. If the true proportion is \pi = 0.1 then P = 0.0015; if it equals 0.2 then P = 0.0254, and so on, so we could search for the most likely value. Or we can solve the problem analytically: the first-order condition for a maximum of \log P with respect to \pi is

\frac{\partial \log P}{\partial \pi} = \frac{B}{\pi} - \frac{n - B}{1 - \pi} = 0 \quad\Rightarrow\quad \hat{\pi} = \frac{B}{n}
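The numerical search mentioned above is easy to carry out directly. A minimal sketch in Python (mine, not part of the original notes; the names are illustrative) evaluates the binomial likelihood of the 5-out-of-10 sample over a grid of candidate proportions:

from math import comb

def binom_likelihood(p, n=10, b=5):
    # Probability of observing b defective items in a sample of n
    # when the population proportion of defectives is p.
    return comb(n, b) * p**b * (1 - p)**(n - b)

# Grid search over candidate proportions.
grid = [i / 100 for i in range(1, 100)]
likelihoods = [(p, binom_likelihood(p)) for p in grid]
p_hat, l_max = max(likelihoods, key=lambda pair: pair[1])
print(p_hat, l_max)

The grid maximum sits at p = 0.5, matching the analytic result B/n.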
6. So the maximum likelihood estimate of the population proportion of bad items is 0.5. This basic procedure can be applied in many cases: once we can define the probability (density) function for a particular event, we have a general estimation strategy.
7. A general statement. Consider a sample (X_1, ..., X_n) which is drawn from a probability distribution P(X|A), where A is a vector of parameters. If the Xs are independent, each with probability density function P(X_i|A), the joint probability of the whole set is

L(A) = \prod_{i=1}^{n} P(X_i \mid A)

This is the likelihood function, and it may be maximised with respect to A to give the maximum likelihood estimates.
8. It is often convenient to work with the log of the likelihood function,

\log L(A) = \sum_{i=1}^{n} \log P(X_i \mid A)

The advantage of this approach is that it is extremely general, but if the model is misspecified the estimates may be particularly sensitive to that misspecification.
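As a concrete illustration of how general the recipe is (my example, not from the notes): assume an i.i.d. sample from an exponential distribution with rate lam, build the log-likelihood, and maximise it numerically; the answer should match the textbook estimate 1/mean(x).

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.8, 1.3, 0.2, 2.1, 0.9, 1.7])   # some observed data (illustrative)

def neg_log_likelihood(lam):
    # log f(x_i | lam) = log(lam) - lam * x_i for an exponential model
    return -np.sum(np.log(lam) - lam * x)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method='bounded')
print(res.x, 1 / x.mean())   # the two should agree closely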
9. The likelihood function for the general non-linear model. If Y is a vector of n endogenous variables generated by a (possibly non-linear) model with jointly normal errors, then the likelihood function for one period is the multivariate normal density of that period's errors.
10. Dropping some constants and taking logs, if the covariance structure is constant over time and has zero off-diagonal elements, this reduces to single-equation OLS (a single-equation sketch is given below).
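A single-equation sketch of this reduction (my notation, assuming i.i.d. normal errors with constant variance \sigma^2):

\log L(\theta, \sigma^2) = -\frac{T}{2}\log 2\pi - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\bigl(y_t - f(x_t,\theta)\bigr)^2

For any fixed \sigma^2, maximising this over \theta is the same as minimising \sum_t (y_t - f(x_t,\theta))^2, i.e. (non-linear) least squares.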
11. Two important matrices. The efficient score matrix: this is made up of the first derivatives of the log-likelihood at each point in time. It is a measure of the dispersion of the maximum likelihood estimate.
12. The information matrix: this is defined as (minus the expectation of) the Hessian, the matrix of second derivatives of the log-likelihood. It is a measure of how 'pointy' the likelihood function is. The variance of the parameters is given either by the inverse Hessian or by the outer product of the score matrix.
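In symbols (my notation, for a log-likelihood \log L(\theta) = \sum_t \log l_t(\theta)):

s_t(\theta) = \frac{\partial \log l_t(\theta)}{\partial \theta}, \qquad I(\theta) = -E\left[\frac{\partial^2 \log L(\theta)}{\partial \theta \, \partial \theta'}\right]

and the variance of \hat{\theta} can be estimated either by I(\hat{\theta})^{-1} or by the outer-product form \bigl(\sum_t s_t(\hat{\theta}) s_t(\hat{\theta})'\bigr)^{-1}.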
13. The Cramer-Rao lower bound. This is an important theorem which establishes the efficiency of the ML estimate relative to all other estimators. The Cramer-Rao lower bound is the smallest theoretical variance which can be achieved. ML attains this bound (asymptotically), so any other estimation technique can at best only equal it. The formal statement is the Cramer-Rao inequality, given below.
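The standard statement (my notation), for any unbiased estimator \tilde{\theta} of \theta:

\operatorname{Var}(\tilde{\theta}) \;\ge\; I(\theta)^{-1}

where I(\theta) is the information matrix defined above; the variance of the ML estimator approaches this bound asymptotically.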
14. Concentrating the likelihood function. Suppose we split the parameter vector into two sub-vectors, \theta = (\theta_1, \theta_2). Now suppose we knew \theta_1; then sometimes we can derive a formula for the ML estimate of \theta_2, say \hat{\theta}_2 = h(\theta_1). Then we could write the likelihood function as L^{c}(\theta_1) = L(\theta_1, h(\theta_1)); this is the concentrated likelihood function. This process is often very useful in practical estimation as it reduces the number of parameters which need to be estimated.
15. An example of concentrating the LF. The likelihood function for a standard single-variable normal non-linear model is the normal log-likelihood given above. We can concentrate this with respect to the variance as follows: write down the first-order condition for a maximum with respect to the variance.
16. This implies that the ML estimate of the variance is the average squared residual; substituting it back in gives the concentrated log-likelihood (the algebra is sketched below).
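A sketch of the algebra (my notation, assuming i.i.d. normal errors e_t(\theta) = y_t - f(x_t,\theta)):

\log L(\theta,\sigma^2) = -\frac{T}{2}\log 2\pi - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_t e_t(\theta)^2

\frac{\partial \log L}{\partial \sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_t e_t(\theta)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2(\theta) = \frac{1}{T}\sum_t e_t(\theta)^2

\log L^{c}(\theta) = \text{const} - \frac{T}{2}\log\!\left(\frac{1}{T}\sum_t e_t(\theta)^2\right)

so maximising the concentrated likelihood over \theta alone is again equivalent to minimising the sum of squared residuals.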
17. Prediction error decomposition. We assumed above that the observations were independent. This will not generally be true, especially in the presence of lagged dependent variables. However, the prediction error decomposition allows us to extend standard ML procedures to dynamic models. From the basic definition of conditional probability, P(A, B) = P(A|B) P(B), and this may be applied directly to the likelihood function.
18. The first term is then the conditional probability of the latest observation given all past values. We can condition the second term in the same way, and so on, so that the likelihood is built from a series of one-step-ahead prediction errors conditional on actual lagged Y.
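Written out (my notation, for a sample Y_1, ..., Y_T):

P(Y_T, Y_{T-1}, \ldots, Y_1) = P(Y_T \mid Y_{T-1}, \ldots, Y_1)\, P(Y_{T-1} \mid Y_{T-2}, \ldots, Y_1) \cdots P(Y_1)

\log L = \sum_{t=1}^{T} \log P(Y_t \mid Y_{t-1}, \ldots, Y_1)

so each term is the density of the one-step-ahead prediction error for period t.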
19. Testing hypotheses. If a restriction on a model is acceptable, this means that the reduction in the likelihood value caused by imposing the restriction is not 'significant'. This gives us a very general basis for constructing hypothesis tests, but to implement the tests we need some definite metric to judge them against, i.e. what counts as significant.
20. [Figure: the likelihood function L, marking the unrestricted maximum Lu and the restricted value LR.]
21. Consider how the likelihood function changes as we move around the parameter space. We can evaluate this by taking a Taylor series expansion around the ML point; at the maximum the first-derivative term is of course zero.
22. So the change in the log-likelihood as we move away from the maximum is approximately a quadratic form in the second derivatives. It is possible to demonstrate that twice the fall in the log-likelihood is asymptotically distributed as \chi^2(m), where m is the number of restrictions.
23. And so this gives us a metric for judging the significance of likelihood-based tests.
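In the usual notation (mine), with \log L_u the unrestricted maximum and \log L_r the maximum subject to m restrictions:

LR = 2\,(\log L_u - \log L_r) \;\sim\; \chi^2(m) \quad \text{asymptotically}

so a restriction is rejected when twice the drop in the log-likelihood exceeds the relevant \chi^2 critical value.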
24. Three test procedures. To construct the basic test we need an estimate of the likelihood value at the unrestricted point and at the restricted point, and we compare these two. There are three ways of deriving this. The likelihood ratio test: we simply estimate the model twice, once unrestricted and once restricted, and compare the two likelihood values. The Wald test: this estimates only the unrestricted point and uses an estimate of the second derivatives to 'guess' at the restricted point; standard 't' tests are a form of Wald test. The Lagrange multiplier test: this estimates only the restricted model and again uses an estimate of the second derivatives to 'guess' at the unrestricted point.
25. [Figure: the likelihood function L, marking the unrestricted maximum Lu and the restricted value LR.] If the likelihood function were quadratic then LR = LM = W. In general, however, W \ge LR \ge LM.
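For reference, the standard forms of the three statistics (textbook notation, not necessarily that of the original slides), with g(\theta) = 0 the m restrictions, \hat{\theta} the unrestricted and \tilde{\theta} the restricted estimate, s(\cdot) the score and I(\cdot) the information matrix:

LR = 2\,\bigl(\log L(\hat{\theta}) - \log L(\tilde{\theta})\bigr), \qquad W = g(\hat{\theta})'\,\bigl[G\, I(\hat{\theta})^{-1} G'\bigr]^{-1} g(\hat{\theta}), \qquad LM = s(\tilde{\theta})'\, I(\tilde{\theta})^{-1}\, s(\tilde{\theta})

where G = \partial g/\partial \theta'. All three are asymptotically \chi^2(m) under the null.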
26. A special form of the LM test. The LM test can be calculated in a particularly convenient way under certain circumstances. The general form of the LM test is the score form given above. Now suppose (as in a regression model with normal errors) that a subset of the parameters, \theta_1, is fixed according to a set of restrictions g = 0, with G the derivative of this restriction.
27. Working through the score and the information matrix for this case, the LM test becomes a quadratic form in the restricted residuals and G.
28. This may be interpreted as TR^2 from a regression of e on G, where e is the vector of residuals from the restricted model, G contains the derivatives of the restrictions, T is the sample size and R^2 is the coefficient of determination of that auxiliary regression. This form is used in many tests for serial correlation, heteroskedasticity, functional form, etc.
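One standard way the TR^2 form arises (a textbook sketch, not necessarily the exact derivation in the slides): with normal errors the score is proportional to G'e/\sigma^2 and the information matrix to G'G/\sigma^2, so

LM = \frac{e'G\,(G'G)^{-1}G'e}{\hat{\sigma}^2}, \qquad \hat{\sigma}^2 = \frac{e'e}{T} \;\Rightarrow\; LM = T\,\frac{e'G(G'G)^{-1}G'e}{e'e} = T R^2

which is T times the (uncentred) R^2 from regressing e on G.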
29. An example: serial correlation. Suppose the errors of a regression model follow an autoregressive process of order m. The restriction that the autoregressive coefficients are all zero may be tested with an LM test as follows: estimate the model without serial correlation; save the residuals u; then regress u on the original regressors and on m lags of u. TR^2 from this auxiliary regression is an LM(m) test for serial correlation.
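A sketch of this procedure in Python (my code; the data and variable names are illustrative):

import numpy as np

def lm_serial_correlation(y, X, m):
    # Breusch-Godfrey-style LM(m) test; X is assumed to contain a constant column.
    # Step 1: estimate the restricted model (no serial correlation) and save residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta
    n = len(u)

    # Step 2: regress u on X and m lags of u (the first m observations are dropped).
    lags = np.column_stack([u[m - j - 1: n - j - 1] for j in range(m)])
    Z = np.column_stack([X[m:], lags])
    gamma, *_ = np.linalg.lstsq(Z, u[m:], rcond=None)
    resid = u[m:] - Z @ gamma

    # TR^2 from the auxiliary regression, asymptotically chi^2(m) under the null.
    T = n - m
    tss = np.sum((u[m:] - u[m:].mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / tss
    return T * r2

# Illustrative use with simulated data:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=200)
print(lm_serial_correlation(y, X, m=2))   # compare with the chi^2(2) 5% critical value, 5.99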
30. Quasi maximum likelihood. ML rests on the assumption that the errors follow a particular distribution (OLS is only ML if the errors are normal, etc.). What happens if we make the wrong assumption? White (1982), Econometrica, 50(1), p. 1, demonstrates that, under very broad assumptions about the misspecification of the error process, ML is still a consistent estimator. The estimation is then referred to as quasi maximum likelihood. But the covariance matrix is no longer the standard ML one; instead it is given by a 'sandwich' of the Hessian and the outer product of the scores (see below). Generally we may construct valid Wald and LM tests by using this corrected covariance matrix, but the LR test is invalid as it works directly from the value of the likelihood function.
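The robust (sandwich) covariance usually quoted in this context, in my notation with H the Hessian of the log-likelihood and S the matrix of scores:

\operatorname{Var}(\hat{\theta}) = H^{-1}\,(S'S)\,H^{-1}

which collapses to the usual -H^{-1} (equivalently (S'S)^{-1}) when the distributional assumption is correct, by the information matrix equality.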
31. Numerical optimisation. In simple cases (e.g. OLS) we can calculate the maximum likelihood estimates analytically, but in many cases we cannot; we then resort to numerical optimisation of the likelihood function. This amounts to 'hill climbing' in parameter space. There are many algorithms, and many computer programmes implement these for you, but it is useful to understand the broad steps of the procedure.
32.
- 1. Set an arbitrary initial set of parameters.
- 2. Determine a direction of movement.
- 3. Determine a step length to move.
- 4. Examine some termination criteria and either stop or go back to 2.
(A minimal code sketch of this loop follows below.)
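A minimal sketch of these four steps in Python (my code, not from the notes), reusing the exponential log-likelihood from the earlier sketch and using a numerical derivative in place of the analytic score:

import numpy as np

x = np.array([0.8, 1.3, 0.2, 2.1, 0.9, 1.7])

def log_lik(lam):
    return np.sum(np.log(lam) - lam * x)

# 1. arbitrary starting value
lam = 0.1
for it in range(200):
    # 2. direction of movement: numerical first derivative (gradient)
    h = 1e-6
    grad = (log_lik(lam + h) - log_lik(lam - h)) / (2 * h)
    # 3. step length: a small fixed step scaled by the gradient
    step = 0.01 * grad
    # 4. termination criterion: stop when the step is negligible
    if abs(step) < 1e-8:
        break
    lam += step

print(lam, 1 / x.mean())   # should be close to the analytic ML estimate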
33. [Figure: hill-climbing steps on the likelihood surface L towards the maximum Lu.]
34. Important classes of maximisation techniques. Gradient methods: these base the direction of movement on the first derivatives of the LF with respect to the parameters. Often the step length is also determined by (an approximation to) the second derivatives, so a typical updating rule is

\theta_{i+1} = \theta_i - \lambda_i H_i^{-1} g_i

where g_i is the gradient, H_i the (approximate) Hessian and \lambda_i the step length. These methods include Newton, quasi-Newton, scoring, steepest descent, Davidon-Fletcher-Powell, BHHH, etc.
35. Derivative-free techniques: these do not use derivatives, and so they are less efficient but more robust to extreme non-linearities, e.g. Powell's method or the non-linear simplex. All of these techniques can be sensitive to starting values and 'tuning' parameters.
36. Some special LFs. Qualitative response models: these are cases where we have only partial information (insects and poison) in one form or another. We assume an underlying continuous model for y, but we only observe certain limited information, e.g. an indicator z equal to 1 or 0 depending on the value of y.
37. We can then group the data into the two groups (z = 1 and z = 0) and form a likelihood function from the probabilities of each outcome, where F is a particular cumulative distribution function, e.g. the standard normal cumulative function (probit model) or the logistic function (logit model).
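The usual form of this likelihood (textbook notation, with F evaluated at the systematic part x_i'\beta of the underlying model):

L(\beta) = \prod_{i: z_i = 1} F(x_i'\beta) \prod_{i: z_i = 0} \bigl(1 - F(x_i'\beta)\bigr), \qquad \log L(\beta) = \sum_i \Bigl[ z_i \log F(x_i'\beta) + (1 - z_i)\log\bigl(1 - F(x_i'\beta)\bigr) \Bigr]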
38. ARCH and GARCH. These are an important class of models which have time-varying variances. Suppose the error variance in period t depends on past squared errors (and, for GARCH, on its own past values); then the likelihood function for this model is a specialisation of the general normal LF with a time-varying variance.
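A common way of writing this (my notation, for a simple ARCH(1) case):

y_t = x_t'\beta + \varepsilon_t, \qquad \varepsilon_t \mid \Omega_{t-1} \sim N(0, h_t), \qquad h_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2

\log L = \text{const} - \frac{1}{2}\sum_t \left[ \log h_t + \frac{\varepsilon_t^2}{h_t} \right]

A GARCH(1,1) variance would add a lagged variance term, h_t = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \beta_1 h_{t-1}.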
39. An alternative approach: the method of moments. A widely used technique in estimation is the Generalised Method of Moments (GMM), which is an extension of the standard method of moments. The idea here is that if we have random drawings from an unknown probability distribution, then the sample statistics we calculate will converge in probability to constants, and these constants will be functions of the unknown parameters of the distribution. If we want to estimate k of these parameters, we compute k statistics (or moments) whose probability limits are known functions of the parameters.
40. These k sample moments are set equal to the functions of the parameters which generate them, and the resulting system is inverted to recover the parameters.
41. A simple example. Suppose the first moment (the mean) of the distribution is a known function of a single parameter. The observed moment from a sample of n observations is simply the sample mean; setting the sample mean equal to the theoretical mean and inverting gives the method-of-moments estimate of the parameter.
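The particular distribution used on the original slide is not recoverable here, so as a purely illustrative stand-in take an exponential distribution with parameter \lambda, for which E(X) = 1/\lambda:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{x} = \frac{1}{\hat{\lambda}} \;\Rightarrow\; \hat{\lambda} = \frac{1}{\bar{x}}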
42. Method of moments estimation (MM). This is a direct extension of the method of moments into a much more useful setting. The idea here is that we have a model which implies certain things about the distribution or covariances of the variables and the errors, so we know what some moments of the distribution should be. We then invert the model to give us estimates of the unknown parameters which match the theoretical moments for a given sample. So suppose we have a model with k parameters \theta, and we have k conditions (or moments) which should be met by the model.
43. We then approximate the expectation of each moment condition with its sample counterpart and invert the system for \theta.
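In a standard notation (mine), with g a vector of k moment functions implied by the model:

E\bigl[g(x_t, \theta)\bigr] = 0, \qquad \frac{1}{T}\sum_{t=1}^{T} g(x_t, \hat{\theta}) = 0

The k sample equations are solved for the k unknown parameters \hat{\theta}.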
44. Examples: OLS. In OLS estimation we make the assumption that the regressors (Xs) are orthogonal to the errors, so the expected cross-product of each regressor with the error is zero. The sample analogue for each x_i is that the average cross-product of x_i with the residuals is zero, and so the method-of-moments estimator in this case is the value of \beta which simultaneously solves these equations. This will be identical to the OLS estimate (a sketch is given below).
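Writing this out (textbook notation):

E\bigl[x_{it}(y_t - x_t'\beta)\bigr] = 0 \quad \text{for each regressor } i, \qquad \frac{1}{T}\sum_{t=1}^{T} x_t (y_t - x_t'\hat{\beta}) = 0 \;\Rightarrow\; \hat{\beta} = (X'X)^{-1}X'y

which is exactly the OLS estimator.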
45. Maximum likelihood as an MM estimator. In maximum likelihood we have a general likelihood function, and this will be maximised when k first-order conditions are met: the expected score is zero. This gives rise to k sample conditions, namely that the sample average of the scores is zero. Simultaneously solving these equations for \theta gives the MM equivalent of maximum likelihood.
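In symbols (my notation):

E\!\left[\frac{\partial \log l_t(\theta)}{\partial \theta}\right] = 0, \qquad \frac{1}{T}\sum_{t=1}^{T} \frac{\partial \log l_t(\hat{\theta})}{\partial \theta} = 0

so treating the k score equations as moment conditions reproduces the ML estimator.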
46. Generalised Method of Moments (GMM). In the previous examples there are as many moment conditions as unknown parameters, so the parameters are uniquely and exactly determined. If there were fewer moment conditions we would not be able to solve them for a unique set of parameters (the model would be under-identified). If there are more moment conditions than parameters then all the conditions cannot be met at the same time; the model is over-identified and we have GMM estimation. Basically, if we cannot satisfy all the conditions at the same time we have to trade them off against each other, making them all as close to zero as possible simultaneously. We therefore need a criterion function to minimise.
47. Suppose we have k parameters but L moment conditions, with L > k. Then we need to make all L moments as small as possible simultaneously. One way is a weighted least squares criterion: the weighted sum of squares of the moments. This gives a consistent estimator for any positive definite weighting matrix A (provided A is not a function of \theta).
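A standard way of writing the criterion (my notation), with \bar{m}(\theta) = \frac{1}{T}\sum_t g(x_t,\theta) the vector of L sample moments:

\hat{\theta} = \arg\min_{\theta}\; \bar{m}(\theta)'\, A\, \bar{m}(\theta)

for some positive definite L \times L weighting matrix A.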
48. The optimal A. If any positive definite weighting matrix gives a consistent estimator, the resulting estimators clearly cannot all be equally efficient, so what is the optimal choice of A? Hansen (1982) established the basic properties of the optimal A and showed how to construct the covariance of the parameter estimates. The optimal A is simply the inverse of the covariance matrix of the moment conditions (just as GLS weights by the inverse of the error covariance); it is written out below.
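In symbols (standard notation, not necessarily that of the slides), with S = \operatorname{Var}\bigl(\sqrt{T}\,\bar{m}(\theta_0)\bigr) the covariance matrix of the moment conditions:

A^{*} = S^{-1}, \qquad \hat{\theta} = \arg\min_{\theta}\; \bar{m}(\theta)'\, S^{-1}\, \bar{m}(\theta)

In practice S is replaced by a consistent estimate built from a first-stage set of parameter estimates.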
49. The parameters which solve this criterion function are consistent and asymptotically normal, with a covariance matrix built from G, the matrix of derivatives of the moments with respect to the parameters, and from the covariance of the moments, both evaluated at the true parameter value (see below).
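The standard statement of these properties (textbook notation, under the optimal weighting matrix):

\sqrt{T}\,(\hat{\theta} - \theta_0) \;\xrightarrow{d}\; N\!\bigl(0,\; (G' S^{-1} G)^{-1}\bigr), \qquad G = E\!\left[\frac{\partial g(x_t,\theta_0)}{\partial \theta'}\right]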
50. Conclusion
- Both ML and GMM are very flexible estimation strategies.
- They are equivalent ways of approaching the same problem in many instances.