Title: Missing Data and Random Effect Modelling
1. Lecture 20
- Missing Data and Random Effect Modelling
2. Lecture Contents
- What is missing data?
- Simple ad-hoc methods.
- Types of missing data (MCAR, MAR, MNAR).
- Principled methods.
- Multiple imputation.
- Methods that respect the random effect structure.
- Thanks to James Carpenter (LSHTM) for many
slides!!
3. Dealing with missing data
- Why is this necessary?
- Missing data are common.
- However, they are usually inadequately handled in both epidemiological and experimental research.
- For example, Wood et al. (2004) reviewed 71 recently published BMJ, JAMA, Lancet and NEJM papers:
- 89% had partly missing outcome data.
- In the 37 trials with repeated outcome measures, 46% performed complete case analysis.
- Only 21% reported a sensitivity analysis.
4. What do we mean by missing data?
- Missing data are observations that we intended to make but did not make. For example, an individual may respond to only some questions in a survey, or may not respond at all to a particular wave of a longitudinal survey. In the presence of missing data, our goal remains to make inferences that apply to the population targeted by the complete sample, i.e. the goal is the same as if we had seen the complete data.
- However, both drawing inferences and performing the analysis are now more complex. We will see that we need to make assumptions in order to draw inferences, and then use an appropriate computational approach for the analysis.
- We will avoid adopting computationally simple solutions (such as analysing only the complete cases, or carrying the last observation forward in a longitudinal study), which generally lead to misleading inferences.
5. What are missing data?
- In practice the data consist of (a) the observations actually made (where '?' denotes a missing observation), and (b) the pattern of missing values (a short sketch for computing this pattern follows the tables):

Data (observed values; '?' denotes missing):
Unit  Var1  Var2  Var3  Var4  Var5  Var6  Var7
1     1     2     3.4   4.5   ?     10    1.2
2     1     3     ?     ?     B     12    ?
3     2     ?     2.6   ?     C     15    0

Pattern of missing values (1 = observed, 0 = missing):
Unit  Var1  Var2  Var3  Var4  Var5  Var6  Var7
1     1     1     1     1     0     1     1
2     1     1     0     0     1     1     0
3     1     0     1     0     1     1     1
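- (Illustration, not from the original slides.) The missingness pattern can be tabulated directly from a data frame; here is a minimal Python/pandas sketch using the example values above (the column names v1-v7 are placeholders).

```python
import numpy as np
import pandas as pd

# The small example data set from the slide; '?' entries are np.nan
data = pd.DataFrame({
    "v1": [1, 1, 2],
    "v2": [2, 3, np.nan],
    "v3": [3.4, np.nan, 2.6],
    "v4": [4.5, np.nan, np.nan],
    "v5": [np.nan, "B", "C"],
    "v6": [10, 12, 15],
    "v7": [1.2, np.nan, 0],
})

# Missingness pattern: 1 = observed, 0 = missing
pattern = data.notna().astype(int)
print(pattern)
```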
6. Inferential Framework
- When it comes to analysis, whether we adopt a frequentist or a Bayesian approach, the likelihood is central.
- In these slides, for convenience, we discuss the issues from a frequentist perspective, although we often use appropriate Bayesian computational strategies to approximate frequentist analyses.
7. Classical Approach
- The actual sampling process involves the
'selection' of the missing values, as well as the
units. So to complete the process of inference in
a justifiable way we need to take this into
account.
8. Bayesian Framework
- Posterior belief ∝ prior belief × likelihood.
- Here, the likelihood is a measure of comparative support for different models given the data. It requires a model for the observed data, and as with classical inference this must involve aspects of the way in which the missing data have been selected (i.e. the missingness mechanism).
9. What do we mean by valid inference when we have missing data?
- We have already noted that missing data are observations we intended to make but did not. Thus, the sampling process now involves both the selection of the units, AND ALSO the process by which observations become missing: the missingness mechanism.
- It follows that for valid inference we need to take account of the missingness mechanism.
- By valid inference in a frequentist framework we mean that the quantities we calculate from the data have the usual properties. In other words, estimators are consistent, confidence intervals attain nominal coverage, p-values are correct under the null hypothesis, and so on.
10. Assumptions
- We distinguish between item and unit nonresponse (missingness). For item missingness, values can be missing on response (i.e. outcome) variables and/or on explanatory (i.e. design/covariate/exposure/confounder) variables.
- Missing data can affect the properties of estimators (for example, means, percentages, percentiles, variances, ratios, regression parameters and so on). Missing data can also affect inferences, i.e. the properties of tests and confidence intervals, and Bayesian posterior distributions.
- A critical determinant of these effects is the way in which the probability of an observation being missing (the missingness mechanism) depends on other variables (measured or not) and on its own value.
- In contrast with the sampling process, which is usually known, the missingness mechanism is usually unknown.
11. Assumptions
- The data alone cannot usually definitively tell us the sampling process.
- Likewise, the missingness pattern, and its relationship to the observations, cannot definitively identify the missingness mechanism.
- The additional assumptions needed to allow the observed data to be the basis of inferences that would have been available from the complete data can usually be expressed in terms of either
- 1. the relationship between the selection of missing observations and the values they would have taken, or
- 2. the statistical behaviour of the unseen data.
- These additional assumptions are not subject to assessment from the data under analysis: their plausibility cannot be definitively determined from the data at hand.
12. Assumptions
- The issues surrounding the analysis of data sets with missing values therefore centre on assumptions. We have to
- 1. decide which assumptions are reasonable and sensible in any given setting (contextual/subject-matter information will be central to this),
- 2. ensure that the assumptions are transparent,
- 3. explore the sensitivity of inferences/conclusions to the assumptions, and
- 4. understand which assumptions are associated with particular analyses.
13. Getting computation out of the way
- The above implies that it is sensible to use approaches that make weak assumptions, and to seek computational strategies to implement them. However, computationally simple strategies are often adopted that make strong assumptions which are subsequently hard to justify.
- Classic examples are completers analysis (i.e. only including units with fully observed data in the analysis) and last observation carried forward. The latter is sometimes advocated in longitudinal studies, and replaces a unit's unseen observations at a particular wave with their last observed values, irrespective of the time that has elapsed between the two waves.
14. Conclusions (1)
- Missing data introduce an element of ambiguity into statistical analysis that is different from traditional sampling imprecision. While sampling imprecision can be reduced by increasing the sample size, this will usually only increase the number of missing observations! As discussed in the preceding sections, the issues surrounding the analysis of incomplete datasets turn out to centre on assumptions and computation.
- The assumptions concern the relationship between the reason for the missing data (i.e. the process, or mechanism, by which the data become missing) and the observations themselves (both observed and unobserved).
- Unlike, say, in regression, where we can use the residuals to check the assumption of normality, these assumptions cannot be verified from the data at hand.
- Sensitivity analysis, where we explore how our conclusions change as we change the assumptions, therefore has a central role in the analysis of missing data.
15. Simple, ad-hoc methods and their shortcomings
- In contrast to principled methods, these usually create a single 'complete' dataset, which is then analysed as if it were the fully observed data.
- Unless certain, fairly strong, assumptions are true, the answers are invalid.
- We briefly review the following methods:
- Analysis of completers only.
- Imputation of simple mean.
- Imputation of regression mean.
- Creating an extra category.
16. Completers analysis
- The data below have one missing observation, on variable 2 for unit 10.
- Completers analysis deletes all units with incomplete data from the analysis (here unit 10); a minimal sketch follows the table.

Unit  Variable 1  Variable 2
1     3.4         5.67
2     3.9         4.81
3     2.6         4.93
4     1.9         6.21
5     2.2         6.83
6     3.3         5.61
7     1.7         5.45
8     2.4         4.94
9     2.8         5.73
10    3.6         ?
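- (Illustration, not from the original slides.) A minimal Python/pandas sketch of completers analysis on the table above; the column names v1 and v2 are placeholders.

```python
import numpy as np
import pandas as pd

# Example data from this slide; unit 10's value of variable 2 is missing.
df = pd.DataFrame({
    "v1": [3.4, 3.9, 2.6, 1.9, 2.2, 3.3, 1.7, 2.4, 2.8, 3.6],
    "v2": [5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73, np.nan],
})

# Completers (complete case) analysis: drop every unit with any missing value.
completers = df.dropna()
print(len(completers))    # 9 of the 10 units remain
print(completers.mean())  # estimates computed from the completers only
```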
17. What's wrong with completers analysis?
- It is inefficient.
- It is problematic in regression when covariate values are missing and models with several sets of explanatory variables need to be compared. Either we keep changing the size of the data set as we add/remove explanatory variables with missing observations, or we use the (potentially very small, and unrepresentative) subset of the data with no missing values.
- When the missing observations are not a completely random selection of the data, a completers analysis will give biased estimates and invalid inferences.
18. Simple mean imputation
- We replace missing data with the arithmetic average of the observed data for that variable. In the table of 10 cases above, this is 5.58 (a minimal sketch follows this list).
- Why not?
- This approach is clearly inappropriate for categorical variables.
- It does not lead to proper estimates of measures of association or regression coefficients. Rather, associations tend to be diluted.
- In addition, variances will be wrongly estimated (typically underestimated) if the imputed values are treated as real. Thus inferences will be wrong too.
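- (Illustration, not from the original slides.) A minimal sketch of simple mean imputation on the table above, showing how treating the imputed value as real shrinks the estimated variance.

```python
import numpy as np
import pandas as pd

# Example data from the completers-analysis slide; unit 10's value of
# variable 2 is missing.
df = pd.DataFrame({
    "v1": [3.4, 3.9, 2.6, 1.9, 2.2, 3.3, 1.7, 2.4, 2.8, 3.6],
    "v2": [5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73, np.nan],
})

observed_mean = df["v2"].mean()           # 5.58, the mean of the 9 observed values
imputed = df["v2"].fillna(observed_mean)  # unit 10 set to 5.58

# Treating the imputed value as if it were real shrinks the estimated variance:
print(df["v2"].var(), imputed.var())
```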
19. Regression mean imputation
- Here, we use the completers to calculate the regression of the incomplete variable on the other, complete, variables. Then we substitute the predicted mean for each unit with a missing value. In this way we use information from the joint distribution of the variables to make the imputation.
- To perform regression imputation, we first regress variable 2 on variable 1 (note, it doesn't matter which of these is the 'response' in the model of interest). In our example, we use simple linear regression:
- V2 = a + β V1 + e.
- Using units 1-9, we find that a = 6.56 and β = -0.366, so the regression relationship is
- Expected value of V2 = 6.56 - 0.366 V1.
- For unit 10, this gives
- 6.56 - 0.366 x 3.6 = 5.24 (a sketch reproducing this calculation follows).
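- (Illustration, not from the original slides.) A minimal numpy sketch reproducing the regression imputation above; it recovers a ≈ 6.56, β ≈ -0.366 and the imputed value ≈ 5.24.

```python
import numpy as np

v1 = np.array([3.4, 3.9, 2.6, 1.9, 2.2, 3.3, 1.7, 2.4, 2.8, 3.6])
v2 = np.array([5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73, np.nan])

obs = ~np.isnan(v2)                        # units 1-9 are the completers
beta, a = np.polyfit(v1[obs], v2[obs], 1)  # slope ~ -0.366, intercept ~ 6.56

v2_imputed = v2.copy()
v2_imputed[~obs] = a + beta * v1[~obs]     # unit 10 imputed as ~5.24
print(a, beta, v2_imputed[-1])
```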
20. Regression mean imputation: why/why not?
- Regression mean imputation can generate unbiased estimates of means, associations and regression coefficients in a much wider range of settings than simple mean imputation.
- However, one important problem remains. The variability of the imputations is too small, so the estimated precision of regression coefficients will be wrong and inferences will be misleading.
21. Creating an extra category
- When a categorical variable has missing values it is common practice to add an extra 'missing value' category. In the example below, the missing values, denoted '?', have been given the category 3.

Unit  Variable 1  Variable 2
1     3.4         1
2     3.9         1
3     2.6         1
4     1.9         1
5     2.2         ? -> 3
6     3.3         2
7     1.7         2
8     2.4         2
9     2.8         ? -> 3
10    3.6         ? -> 3
22. Creating an extra category
- This is bad practice because:
- the impact of this strategy depends on how the missing values are divided among the real categories, and on how the probability of a value being missing depends on other variables,
- very dissimilar classes can be lumped into one group,
- severe bias can arise, in any direction, and
- when used to stratify for adjustment (or to correct for confounding) the completed categorical variable will not do its job properly.
23. Some notation
- The data: we denote the data we intended to collect by Y, and we partition this into
- Y = (Yo, Ym),
- where Yo is observed and Ym is missing. Note that some variables in Y may be outcomes/responses, and some may be explanatory variables/covariates. Depending on the context these may all refer to one unit, or to an entire dataset.
- Missing value indicator: corresponding to every observation Y, there is a missing value indicator R, defined as
- R = 1 if Y is observed, and R = 0 otherwise.
24. Missing value mechanism
- The key question for analyses with missing data is: under what circumstances, if any, do the analyses we would perform if the data set were fully observed lead to valid answers? As before, 'valid' means that effects and their SEs are consistently estimated, tests have the correct size, and so on, so that inferences are correct.
- The answer depends on the missing value mechanism.
- This is the probability that a set of values is missing given the values taken by the observed and missing observations, which we denote by
- Pr(R | Yo, Ym).
25. Examples of missing value mechanisms
- 1. The chance of non-response to questions about income usually depends on the person's income.
- 2. Someone may not be at home for an interview because they are at work.
- 3. The chance of a subject leaving a clinical trial may depend on their response to treatment.
- 4. A subject may be removed from a trial if their condition is insufficiently controlled.
26. Missing Completely At Random (MCAR)
- Suppose the probability of an observation being missing does not depend on observed or unobserved measurements. In mathematical terms, we write this as
- Pr(R | Yo, Ym) = Pr(R).
- Then we say that the observation is Missing Completely At Random, which is often abbreviated to MCAR. Note that in a sample survey setting MCAR is sometimes called uniform non-response.
- If data are MCAR, then consistent results with missing data can be obtained by performing the analyses we would have used had there been no missing data, although there will generally be some loss of information. In practice this means that, under MCAR, the analysis of only those units with complete data gives valid inferences.
27. Missing At Random (MAR)
- After considering MCAR, a second question naturally arises: what are the most general conditions under which a valid analysis can be done using only the observed data, and no information about the missing value mechanism, Pr(R | Yo, Ym)? The answer is: when, given the observed data, the missingness mechanism does not depend on the unobserved data. Mathematically,
- Pr(R | Yo, Ym) = Pr(R | Yo).
- This is termed Missing At Random, abbreviated MAR.
28. Missing Not At Random (MNAR)
- When neither MCAR nor MAR holds, we say the data are Missing Not At Random, abbreviated MNAR. In the likelihood setting (see the end of the previous section) the missingness mechanism is termed non-ignorable.
- What this means is: even accounting for all the available observed information, the reason for observations being missing still depends on the unseen observations themselves.
- To obtain valid inference, a joint model of both Y and R is required (that is, a joint model of the data and the missingness mechanism).
- A small simulation contrasting the three mechanisms is sketched below.
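- (Illustration, not from the original slides.) A minimal simulation sketch, under assumed parameter values, contrasting the three mechanisms: the complete-case mean is approximately unbiased under MCAR but biased under MAR and MNAR.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two correlated measurements per unit; we will make y2 partly missing.
y1 = rng.normal(0, 1, n)
y2 = 0.5 * y1 + rng.normal(0, 1, n)   # true mean of y2 is 0

def complete_case_mean(y2, p_missing):
    """Mean of y2 among the units whose y2 happens to be observed."""
    observed = rng.uniform(size=n) > p_missing
    return y2[observed].mean()

# MCAR: the missingness probability is constant (depends on nothing).
mcar = complete_case_mean(y2, np.full(n, 0.3))

# MAR: missingness depends only on the fully observed y1.
mar = complete_case_mean(y2, 1 / (1 + np.exp(-y1)))

# MNAR: missingness depends on the (possibly unseen) y2 itself.
mnar = complete_case_mean(y2, 1 / (1 + np.exp(-y2)))

print(f"true mean 0.00 | MCAR {mcar:.3f} | MAR {mar:.3f} | MNAR {mnar:.3f}")
# Only the MCAR complete-case mean is (approximately) unbiased here.
```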
29. MNAR (continued)
- Unfortunately:
- We cannot tell from the data at hand whether the missing observations are MCAR, MAR or MNAR (although we can distinguish between MCAR and MAR).
- In the MNAR setting it is very rare to know the appropriate model for the missingness mechanism.
- Hence the central role of sensitivity analysis: we must explore how our inferences vary under assumptions of MAR and MNAR, and under various models. Unfortunately, this is often easier said than done, especially under the time and budgetary constraints of many applied projects.
30. Principled methods
- These all have the following in common:
- No attempt is made to replace a missing value directly, i.e. we do not pretend to 'know' the missing values.
- Rather, available information (from the observed data and other contextual considerations) is combined with assumptions not dependent on the observed data.
- This is used to
- either generate statistical information about each missing value, e.g. distributional information (given what we have observed, the missing observation has a normal distribution with mean a and variance b, where the parameters can be estimated from the data),
- and/or generate information about the missing value mechanism.
31. Principled methods
- The great range of ways in which these can be done leads to the plethora of approaches to missing values. Here are some broad classes of approach:
- Wholly model based methods.
- Simple stochastic imputation.
- Multiple stochastic imputation.
- Weighted methods (not covered here).
32. Wholly model based methods
- A full statistical model is written down for the complete data.
- Analysis (whether frequentist or Bayesian) is based on the likelihood.
- Assumptions must be made about the missing data mechanism:
- If it is assumed MCAR or MAR, no explicit model is needed for it.
- Otherwise this model must be included in the overall formulation.
- Such likelihood analyses require some form of integration (averaging) over the missing data. Depending on the setting this can be done implicitly or explicitly, directly or indirectly, analytically or numerically. The statistical information on the missing data is contained in the model. Examples of this would be the use of linear mixed models under MAR in SAS PROC MIXED or MLwiN (an illustrative sketch follows this list).
- We will examine this in the practical.
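- (Illustration, not from the original slides.) A minimal sketch of a likelihood-based analysis of simulated longitudinal data with missing responses, using Python/statsmodels as a stand-in for SAS PROC MIXED or MLwiN; the variable names and parameter values are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical longitudinal data: 50 subjects, 4 waves, ~20% of responses missing.
rng = np.random.default_rng(2)
subjects = np.repeat(np.arange(50), 4)
time = np.tile(np.arange(4), 50)
y = 1.0 + 0.5 * time + rng.normal(0, 1, 50)[subjects] + rng.normal(0, 1, 200)
y[rng.uniform(size=200) < 0.2] = np.nan

long = pd.DataFrame({"subject": subjects, "time": time, "y": y}).dropna()

# Random-intercept model fitted by (RE)ML: maximising the observed-data
# likelihood uses every observed response, unlike completers analysis,
# and gives valid inferences under MAR (for missing responses).
result = smf.mixedlm("y ~ time", data=long, groups=long["subject"]).fit()
print(result.summary())
```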
33. Simple stochastic imputation
- Instead of replacing a value with a mean, a random draw is made from some suitable distribution.
- Provided the distribution is chosen appropriately, consistent estimators can be obtained from methods that would work with the whole data set.
- This is very important in the large survey setting, where draws are made from units with complete data that are 'similar' to the one with missing values (donors).
- There are many variations on this hot-deck approach (a small sketch follows this list).
- Implicitly they use non-parametric estimates of the distribution of the missing data, and typically need very large samples.
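- (Illustration, not from the original slides.) A minimal hot-deck sketch: each missing value is replaced by a random draw from observed 'donor' values within the same group. The variables region and income, and the values, are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "income": [21.0, 25.0, np.nan, 40.0, np.nan, 38.0],
})

def hot_deck(group):
    """Fill missing values in a group by sampling from its observed donors."""
    donors = group.dropna().to_numpy()
    filled = group.copy()
    filled[filled.isna()] = rng.choice(donors, size=filled.isna().sum())
    return filled

df["income_imputed"] = df.groupby("region")["income"].transform(hot_deck)
print(df)
```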
34. Simple stochastic imputation
- Although the resulting estimators can behave well, for precision (and inference) account must be taken of the source of the imputations (i.e. there is no 'extra' data). This implies that the usual complete-data estimators of precision can't be used. Thus, for each particular class of estimator (e.g. mean, ratio, percentile), each type of imputation has an associated variance estimator, which may be design based (i.e. using the sampling structure of the survey), model based, or model assisted (i.e. using some additional modelling assumptions). These variance estimators can be very complicated and are not convenient to generalize.
35. Multiple (stochastic) imputation
- This is very similar to the single stochastic imputation method, except that there are many ways in which the draws can be made (e.g. hot-deck non-parametric, model based). The crucial difference is that, instead of completing the data once, the imputation process is repeated a small number of times (typically 5-10). Provided the draws are done properly, variance estimation (and hence constructing valid inferences) is much more straightforward.
- The observed variability among the estimates from each imputed data set is used in modifying the complete-data estimates of precision. In this way, valid inferences are obtained under missing at random (a sketch of the combination rules follows this slide).
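- (Illustration, not from the original slides.) A minimal sketch of combining results across imputed data sets with Rubin's rules: total variance = within-imputation variance + (1 + 1/m) x between-imputation variance. The numerical values are hypothetical.

```python
import numpy as np

# Point estimate and variance from the analysis of each of m imputed data sets
# (hypothetical values for m = 5 imputations).
estimates = np.array([1.02, 0.95, 1.10, 0.99, 1.05])
variances = np.array([0.040, 0.038, 0.042, 0.041, 0.039])
m = len(estimates)

pooled_estimate = estimates.mean()
within = variances.mean()        # average within-imputation variance
between = estimates.var(ddof=1)  # variance of the estimates between imputations
total_variance = within + (1 + 1 / m) * between

print(pooled_estimate, np.sqrt(total_variance))  # pooled estimate and its SE
```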
36. Why do multiple imputation?
- One of the main problems with single stochastic imputation methods is the need to develop appropriate variance formulae for each different setting. Multiple imputation attempts to provide a procedure that can obtain appropriate measures of precision relatively simply in (almost) any setting.
- It was developed by Rubin in a survey setting (where it feels very natural) but has more recently been used much more widely.
37. Missing data and random effects models
- In the practical we will consider two approaches:
- Model based MCMC estimation of a multivariate response model.
- Generating multiple imputations from this model (using MCMC) that can then be used to fit further models using any estimation method.
38. Information on the practical
- The practical introduces multivariate Normal (MVN) models in MLwiN using MCMC.
- Two education datasets are used:
- Firstly, a two-response dataset of components of the GCSE science exams, for which we consider model based approaches.
- Secondly, a six-response dataset from Hungary, for which we consider multiple imputation.
39. Other approaches to missing data
- IGLS estimation of MVN models is available in MLwiN. Here the algorithm treats the MVN model as a special case of a univariate Normal model, and so there are no overheads for missing data (assuming MAR).
- WinBUGS has great flexibility with missing data. The MLwiN-to-WinBUGS interface will allow you to fit the same model based approach as in the practical.
- It can, however, also be used to incorporate imputation models as part of the model.
40. Plug for www.missingdata.org.uk
- James Carpenter has developed MLwiN macros that perform multiple imputation using MCMC.
- These build on the MCMC features in the practical, but run an imputation model independent of the actual model of interest.
- See www.missingdata.org.uk for further details, including variants of these slides and WinBUGS practicals.