Transcript and Presenter's Notes

Title: Estimating parameters from data
1
Estimating parameters from data
  • Gil McVean, Department of Statistics
  • Thursday 13th February 2009

2
Questions to ask
  • How can I estimate model parameters from data?
  • What should I worry about when choosing between
    estimators?
  • Is there some optimal way of estimating
    parameters from data?
  • How can I compare different parameter values?
  • How should I make statements about certainty
    regarding estimates and hypotheses?

3
Motivating example I
  • I conduct an experiment where I measure the
    weight of 100 mice that were exposed to a normal
    diet and 50 mice exposed to a high-energy diet
  • I want to estimate the expected gain in weight
    due to the change in diet

(Chart: weight distributions for the normal-diet and high-calorie-diet groups)
4
Motivating example II
  • I observe the co-segregation of two traits (e.g.
    a visible trait and a genetic marker) in a cross
  • I want to estimate the recombination rate between
    the two markers

5
Parameter estimation
  • We can formulate most questions in statistics in
    terms of making statements about underlying
    parameters
  • We want to devise a framework for estimating
    those parameters and making statements about our
    certainty
  • In this lecture we will look at several different
    approaches to making such statements
  • Moment estimators
  • Likelihood
  • Bayesian estimation

6
Moment estimation
  • You have already come across one way of
    estimating parameter values: moment methods
  • In such techniques parameter values are found
    that match sample moments (mean, variance, etc.)
    to those expected
  • E.g. for random variables X1, X2, ... sampled
    from a N(μ, σ²) distribution (the moment
    estimators are written out below)
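The estimating equations themselves appeared only as an image on the slide; a minimal reconstruction, assuming the usual method-of-moments estimators for the normal, is:

    \hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
    \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2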

7
Example: fitting a gamma distribution
  • The gamma distribution is parameterised by a
    shape parameter, α, and a rate parameter, β
  • The mean of the distribution is α/β and the
    variance is α/β²
  • We can fit a gamma distribution by matching the
    first two sample moments (see the sketch below)

(Histogram: alkaline phosphatase measurements in 2019 mice, with fitted gamma parameters α = 4.03, β = 0.14)
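A minimal sketch of the moment fit in Python. The measurements themselves are not in the transcript, so the data below are simulated stand-ins; only the two lines matching the moments reflect the method on the slide.

    import numpy as np

    rng = np.random.default_rng(1)
    # Simulated stand-in for the alkaline phosphatase data (real values not in the transcript)
    x = rng.gamma(shape=4.0, scale=1 / 0.14, size=2019)

    m, v = x.mean(), x.var()
    alpha_hat = m ** 2 / v   # from mean = alpha/beta and variance = alpha/beta^2
    beta_hat = m / v
    print(alpha_hat, beta_hat)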
8
Bias
  • Although the moment method looks sensible, it can
    lead to biased estimators
  • In the previous example, estimates of both
    parameters are upwardly biased
  • Bias is measured by the difference between the
    expected estimate and the truth (see the formula
    after this list)
  • However, bias is not the only thing to worry
    about
  • For example, the value of the first observation
    is an unbiased estimator of the mean for a Normal
    distribution. However, it is a rubbish estimator
  • We also need to worry about the variance of an
    estimator
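As a formula (not spelled out on the slide), for an estimator \hat{\theta} of a parameter \theta:

    \mathrm{Bias}(\hat{\theta}) = \mathrm{E}[\hat{\theta}] - \theta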

9
Example: estimating the population mutation rate
  • In population genetics, a parameter of interest
    is the population-scaled mutation rate
  • There are two common estimators for this
    parameter (both written out below)
  • The average number of differences between two
    sequences
  • The total number of polymorphic sites in the
    sample divided by a constant that is
    approximately the log of the sample size
  • Which is better?
  • The first estimator has larger variance than the
    second suggesting that it is an inferior
    estimator
  • It is actually worse than this: it is not even
    guaranteed to converge on the truth as the sample
    size gets infinitely large
  • A property called consistency
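The two estimators appeared only as formulas on the slide; their standard forms, for a sample of n sequences with S polymorphic sites and d_{ij} the number of differences between sequences i and j, are:

    \hat{\theta}_{\pi} = \binom{n}{2}^{-1} \sum_{i<j} d_{ij}, \qquad
    \hat{\theta}_{W} = \frac{S}{a_n}, \quad a_n = \sum_{i=1}^{n-1}\frac{1}{i} \approx \log n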

10
The bias-variance trade-off
  • Some estimators may be biased
  • Some estimators may have large variance
  • Which is better?
  • A simple way of combining both metrics is to
    consider the mean-squared error of an estimator
    (defined below)
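The mean-squared error and its standard decomposition (the formula is not in the transcript):

    \mathrm{MSE}(\hat{\theta}) = \mathrm{E}\big[(\hat{\theta}-\theta)^2\big]
    = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2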

11
Example
  • Consider two ways of estimating the variance of a
    Normal distribution from a sample (written out below)
  • The second estimator is unbiased, but the first
    estimator has lower MSE
  • Actually, there is a third estimator, which is
    even more biased than the first, but which has
    even lower MSE
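The estimators were shown only as formulas; a reconstruction assuming the usual candidates, which divide the sum of squared deviations by n, n − 1 and n + 1 respectively:

    \hat{\sigma}^2_{1} = \frac{1}{n}\sum_{i}(X_i-\bar{X})^2, \qquad
    \hat{\sigma}^2_{2} = \frac{1}{n-1}\sum_{i}(X_i-\bar{X})^2, \qquad
    \hat{\sigma}^2_{3} = \frac{1}{n+1}\sum_{i}(X_i-\bar{X})^2

For normal data the n − 1 version is unbiased, while the n and n + 1 versions trade bias for lower variance; the n + 1 version has the smallest MSE of the three.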

12
Least squares estimation
  • A commonly-used approach to fitting models to
    data is called least squares estimation
  • This attempts to minimise the sum of the squares
    of the residuals (see the sketch after this list)
  • A residual is the difference between an observed
    and a fitted value
  • An important point to remember is that minimising
    the sum of squared residuals is not the only thing
    to worry about when fitting a model
  • Over-fitting, in particular, is a risk
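A minimal least-squares sketch in Python, fitting a straight line to simulated data (the data and the linear model are illustrative assumptions, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)   # simulated observations

    # Design matrix for an intercept-plus-slope model
    X = np.column_stack([np.ones_like(x), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimises the sum of squared residuals
    residuals = y - X @ beta_hat
    print(beta_hat, (residuals ** 2).sum())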

13
Problems with moment estimation
  • It is not always possible to exactly match sample
    moments with their expectation
  • It is not clear when using moment methods how
    much of the information in the data about the
    parameters is being used
    Often, not much
  • Why should MSE be the best way of measuring the
    value of an estimator?

14
Is there an optimal way to estimate parameters?
  • For any model the maximum information about model
    parameters is obtained by considering the
    likelihood function
  • The likelihood function is proportional to the
    probability of observing the data given a
    specified parameter value
  • One natural choice for point estimation of
    parameters is the maximum likelihood estimate,
    the parameter values that maximise the
    probability of observing the data
  • The maximum likelihood estimate (mle) has some
    useful properties (though it is not always optimal
    in every sense)

15
An intuitive view on likelihood
16
An example
  • Suppose we have data generated from a Poisson
    distribution. We want to estimate the parameter
    of the distribution
  • The probability of observing a particular value
    is given by the Poisson probability mass function
    (see below)
  • If we have observed a series of iid Poisson RVs
    we obtain the joint likelihood by multiplying the
    individual probabilities together
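The formulas appeared as images on the slide; the standard Poisson forms are:

    P(X = x \mid \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \qquad
    L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}
    \;\propto\; e^{-n\lambda}\,\lambda^{\sum_i x_i}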

17
Comments
  • Note in the likelihood function the factorials
    have disappeared. This is because they provide a
    constant that does not influence the relative
    likelihood of different values of the parameter
  • It is usual to work with the log likelihood
    rather than the likelihood. Note that maximising
    the log likelihood is equivalent to maximising
    the likelihood
  • We can find the mle of the parameter analytically

  • Take the natural log of the likelihood function
  • Find where the derivative of the log likelihood
    is zero
  • Note that here the mle is the same as the moment
    estimator (the derivation is sketched below)
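A reconstruction of the algebra, which was shown as images on the slide:

    \ell(\lambda) = \log L(\lambda) = -n\lambda + \Big(\sum_i x_i\Big)\log\lambda + \text{const}, \qquad
    \frac{d\ell}{d\lambda} = -n + \frac{\sum_i x_i}{\lambda} = 0
    \;\Rightarrow\; \hat{\lambda} = \bar{x}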
18
Sufficient statistics
  • In this example we could write the likelihood as
    a function of a simple summary of the data: the
    mean
  • This is an example of a sufficient statistic.
    These are statistics that contain all information
    about the parameter(s) under the specified model
  • For example, suppose we have a series of iid
    normal RVs: here the likelihood depends on the
    data only through the sample mean and the mean
    square (see the sketch below)
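A sketch of why, assuming the usual normal likelihood and writing \overline{x} and \overline{x^2} for the sample mean and mean square:

    L(\mu,\sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\Big(-\frac{(x_i-\mu)^2}{2\sigma^2}\Big)
    = (2\pi\sigma^2)^{-n/2}
    \exp\!\Big(-\frac{n\big(\overline{x^2} - 2\mu\overline{x} + \mu^2\big)}{2\sigma^2}\Big)

so the likelihood depends on the data only through (\overline{x}, \overline{x^2}), which are therefore jointly sufficient for (\mu, \sigma^2).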
19
Properties of the maximum likelihood estimate
  • The maximum likelihood estimate can be found
    either analytically or by numerical maximisation
  • The mle is consistent in that it converges to the
    truth as the sample size gets infinitely large
  • The mle is asymptotically efficient in that it
    achieves the minimum possible variance (the
    Cramér-Rao Lower Bound) as n → ∞
  • However, the mle is often biased for finite
    sample sizes
  • For example, the mle for the variance parameter
    in a normal distribution is the sample variance
    with divisor n, which is biased (see below)
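The standard result, for iid N(μ, σ²) observations (not spelled out in the transcript):

    \hat{\sigma}^2_{\mathrm{mle}} = \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2, \qquad
    \mathrm{E}\big[\hat{\sigma}^2_{\mathrm{mle}}\big] = \frac{n-1}{n}\,\sigma^2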

20
Comparing parameter estimates
  • Obtaining a point estimate of a parameter is just
    one problem in statistical inference
  • We might also like to ask how good different
    parameter values are
  • One way of comparing parameters is through
    relative likelihood
  • For example, suppose we observe counts of 12, 22,
    14 and 8 from a Poisson process
  • The maximum likelihood estimate is the sample
    mean, 14. The relative likelihood of any other
    value is its likelihood divided by the likelihood
    at 14 (computed in the sketch below)
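A minimal sketch computing the relative likelihood over a grid of parameter values for these counts (illustrative code, not taken from the slides):

    import numpy as np
    from scipy.stats import poisson

    counts = np.array([12, 22, 14, 8])
    lam_grid = np.linspace(5, 30, 251)

    # Log-likelihood for each candidate lambda, summed over the four counts
    loglik = np.array([poisson.logpmf(counts, lam).sum() for lam in lam_grid])
    rel_lik = np.exp(loglik - loglik.max())   # relative likelihood, equal to 1 at the mle
    print(lam_grid[rel_lik.argmax()])         # 14.0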

21
Using relative likelihood
  • The relative likelihood and log likelihood
    surfaces are shown below

22
Interval estimation
  • In most cases the chance that the point estimate
    you obtain for a parameter is actually the
    correct one is zero
  • We can generalise the idea of point estimation to
    interval estimation
  • Here, rather than estimating a single value of a
    parameter we estimate a region of parameter space
  • We make the inference that the parameter of
    interest lies within the defined region
  • The coverage of an interval estimator is the
    fraction of times the parameter actually lies
    within the interval
  • The idea of interval estimation is intimately
    linked to the notion of confidence intervals

23
Example
  • Suppose I'm interested in estimating the mean of
    a normal distribution with known variance of 1
    from a sample of 10 observations
  • I construct an interval estimator of the form
    sample mean ± a
  • The coverage properties of this estimator vary
    with a: choosing a to be 0.62 gives coverage of
    95% (see the simulation sketch below)
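A minimal simulation sketch checking the coverage of the interval "sample mean ± 0.62" (the true mean is set to 0 here, an arbitrary choice that does not affect the coverage):

    import numpy as np

    rng = np.random.default_rng(0)
    n, a, true_mean = 10, 0.62, 0.0
    reps = 100_000

    means = rng.normal(loc=true_mean, scale=1.0, size=(reps, n)).mean(axis=1)
    covered = np.abs(means - true_mean) <= a
    print(covered.mean())   # approximately 0.95, since 1.96 / sqrt(10) is about 0.62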
24
Confidence intervals
  • It is a short step from here to the notion of
    confidence intervals
  • We find an interval estimator of the parameter
    that, for any value of the parameter that might
    be possible, has the desired coverage properties
  • We then apply this interval estimator to our
    observed data to get a confidence interval
  • We can guarantee that among repeat performances
    of the same experiment the true value of the
    parameter would be in this interval 95% of the
    time
  • We cannot say "there is a 95% chance of the true
    parameter being in this interval"

25
Example: confidence intervals for the normal distribution
  • Creating confidence intervals for the mean of
    normal distributions is relatively easy because
    the coverage properties of interval estimators do
    not depend on the mean (for a fixed variance)
  • For example, the interval estimator below has 95%
    coverage for any mean
  • As you'll see later, there is an intimate link
    between confidence intervals and hypothesis
    testing
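The estimator itself is not in the transcript; presumably it is the standard known-variance interval

    \bar{X} \pm 1.96\,\frac{\sigma}{\sqrt{n}}

whose coverage is 95% whatever the true mean, because \sqrt{n}(\bar{X}-\mu)/\sigma is standard normal.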

26
Example: confidence intervals for the exponential distribution
  • For most distributions, the coverage properties
    of an estimator will depend on the true
    underlying parameter
  • However, we can make use of the CLT to make
    confidence intervals for means
  • For example, for the exponential distribution
    with different means, the graph showed the
    coverage properties of the interval estimator
    with n = 100 (a simulation sketch follows below)
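A minimal sketch, assuming the interval estimator on the slide was the CLT-based interval "sample mean ± 1.96 × sample mean / √n" (for the exponential distribution the standard deviation equals the mean):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps, true_mean = 100, 50_000, 2.0

    means = rng.exponential(scale=true_mean, size=(reps, n)).mean(axis=1)
    half_width = 1.96 * means / np.sqrt(n)   # plug-in estimate of the standard error
    covered = np.abs(means - true_mean) <= half_width
    print(covered.mean())   # close to, though typically a little below, 0.95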

27
Confidence intervals and likelihood
  • Thanks to the CLT there is another useful result
    that allows us to define confidence intervals
    from the log-likelihood surface
  • Specifically, the set of parameter values for
    which the log-likelihood is not more than 1.92
    less than its maximum will define an approximate
    95% confidence interval (see below)
  • In the limit of large sample size the
    likelihood-ratio test statistic is approximately
    chi-squared distributed under the null
  • This is a very useful result, but shouldn't be
    assumed to hold
  • i.e. check with simulation
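Where the 1.92 comes from: twice the drop in log-likelihood is compared with the 95% point of a chi-squared distribution on one degree of freedom, 3.84, so the set

    \Big\{\theta : \ell(\hat{\theta}) - \ell(\theta) \le \tfrac{1}{2}\chi^2_{1,\,0.95} = 1.92 \Big\}

is an approximate 95% confidence interval.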

28
Bayesian estimators
  • As you may notice, the notion of a confidence
    interval is very hard to grasp and has remarkably
    little connection to the data that you have
    collected
  • It seems much more natural to attempt to make
    statements about which parameter values are
    likely given the data you have collected
  • To put this on a rigorous probabilistic footing
    we want to make statements about the probability
    (density) of any particular parameter value given
    our data
  • We use Bayes' theorem:

    P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta)\,P(\theta)}{P(\text{data})}

    i.e. posterior = likelihood × prior / normalising constant
29
Bayes estimators
  • The single most important conceptual difference
    between Bayesian statistics and frequentist
    statistics is the notion that the parameters you
    are interested in are themselves random variables
  • This notion is encapsulated in the use of a
    subjective prior for your parameters
  • Remember that to construct a confidence interval
    we have to define the set of possible parameter
    values
  • A prior does the same thing, but also gives a
    weight to different values

30
Example coin tossing
  • I toss a coin twice and observe two heads
  • I want to perform inference about the probability
    of obtaining a head on a single throw for the
    coin in question
  • The point estimate/MLE for the probability is 1.0
    yet I have a very strong prior belief that the
    answer is 0.5
  • Bayesian statistics forces the researcher to be
    explicit about prior beliefs but, in return, can
    be very specific about what information has been
    gained by performing the experiment (a sketch of
    the update follows below)
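A minimal sketch of the update, assuming a Beta(α, β) prior on the head probability p (the prior used in the lecture is not in the transcript); after observing two heads in two tosses,

    p \sim \mathrm{Beta}(\alpha,\beta), \quad P(\text{HH}\mid p) = p^2
    \;\Rightarrow\; p \mid \text{HH} \sim \mathrm{Beta}(\alpha+2,\ \beta)

For example, a strong prior centred on 0.5 such as Beta(50, 50) gives a posterior mean of 52/102 ≈ 0.51, barely moved by the two tosses, whereas the mle is 1.0.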

31
The posterior
  • Bayesian inference about parameters is contained
    in the posterior distribution
  • The posterior can be summarised in various ways,
    for example by the posterior mean or by a
    credible interval

(Chart: prior and posterior densities, with the posterior mean and a credible interval marked)
32
Bayesian inference and the notion of shrinkage
  • The notion of shrinkage is that you can obtain
    better estimates by assuming a certain degree of
    similarity among the things you want to estimate
  • Practically, this means two things
  • Borrowing information across observations
  • Penalising inferences that are very different
    from anything else
  • The notion of shrinkage is implicit in the use of
    priors in Bayesian statistics
  • There are also forms of frequentist inference
    where shrinkage is used
  • But NOT MLE