Title:
1Counting chickens and other tales Using
random effect models and MCMC estimation in
applied statistics research.
- Dr William J. Browne
- School of Mathematical Sciences
- University of Nottingham
2Outline
- Background to my research, random effect and
multilevel models and MCMC estimation. - Random effect models for complex data structures
including artificial insemination and Danish
chicken examples. - Multivariate random effect models and great tit
nesting behaviour. - Sample size calculations for complex random
effect models. - Multilevel modelling of mastitis incidence.
- Other research and future work.
3Background
- 1995-1998 PhD in Statistics, University of
Bath. - Applying MCMC methods to multilevel models.
- 1998-2003 Postdoctoral research positions at
the Centre for Multilevel Modelling at the
Institute of Education, London. - 2003-2006 Lecturer in Statistics at University
of Nottingham. - 2006- Associate professor of Statistics at
University of Nottingham. - Research interests
- Multilevel modelling, complex random effect
modelling, applied statistics, Bayesian
statistics and MCMC estimation.
4Random effect models
- Models that account for the underlying structure
in the dataset. - Originally developed for nested structures
(multilevel models), for example in education,
pupils nested within schools. - An extension of linear modelling with the
inclusion of random effects. - A typical 2-level model is
- Here i might index pupils and j index schools.
- Alternatively in another example i might index
cows and j index herds. - The important thing is that the model and
statistical methods used are the same!
5Estimation Methods for Multilevel Models
- Due to additional random effects no simple matrix
formulae exist for finding estimates in
multilevel models. - Two alternative approaches exist
- Iterative algorithms e.g. IGLS, RIGLS, that
alternate between estimating fixed and random
effects until convergence. Can produce ML and
REML estimates. - Simulation-based Bayesian methods e.g. MCMC that
attempt to draw samples from the posterior
distribution of the model. - One possible computer program to use for
multilevel models which incorporates both
approaches is MLwiN.
6MLwiN
- Software package designed specifically for
fitting multilevel models. - Developed by a team led by Harvey Goldstein and
Jon Rasbash at the Institute of Education in
London over past 15 years or so. Earlier
incarnations ML2, ML3, MLN. - Originally contained classical estimation
methods (IGLS) for fitting models. - MLwiN launched in 1998 also included MCMC
estimation. - My role in the team was as developer of the MCMC
functionality in MLwiN in my time at Bath and
during 4.5 years at the IOE. - Note MLwiN core team relocated to Bristol in
2005.
7MCMC Algorithm
- Consider the 2-level normal response model
- MCMC algorithms usually work in a Bayesian
framework and so we need to add prior
distributions for the unknown parameters. - Here there are 4 sets of unknown parameters
- We will add prior distributions
8MCMC Algorithm (2)
- One possible MCMC algorithm for this model then
involves simulating in turn from the 4 sets of
conditional distributions. Such an algorithm is
known as Gibbs Sampling. MLwiN uses Gibbs
sampling for all normal response models. - Firstly we set starting values for each group of
unknown parameters, - Then sample from the following conditional
distributions, firstly - To get .
9MCMC Algorithm (3)
- We next sample from
- to get , then
- to get , then finally
- To get . We have then updated all of the
unknowns in the model. The process is then simply
repeated many times, each time using the
previously generated parameter values to generate
the next set
10Burn-in and estimates
- Burn-in It is general practice to throw away the
first n values to allow the Markov chain to
approach its equilibrium distribution namely the
joint posterior distribution of interest. These
iterations are known as the burn-in. - Finding Estimates We continue generating values
at the end of the burn-in for another m
iterations. These m values are then averaged to
give point estimates of the parameter of
interest. Posterior standard deviations and other
summary measures can also be obtained from the
chains.
11So why use MCMC?
- Often gives better (in terms of bias) estimates
for non-normal responses (see Browne and Draper,
2006). - Gives full posterior distribution so interval
estimates for derived quantities are easy to
produce. - Can easily be extended to more complex problems
as we will see next. - Potential downside 1 Prior distributions
required for all unknown parameters. - Potential downside 2 MCMC estimation is much
slower than the IGLS algorithm. - For more information see my book MCMC Estimation
in MLwiN Browne (2003).
12Extension 1 Cross-classified models
For example, schools by neighbourhoods. Schools
will draw pupils from many different
neighbourhoods and the pupils of a neighbourhood
will go to several schools. No pure hierarchy can
be found and pupils are said to be contained
within a cross-classification of schools by
neighbourhoods
nbhd 1 nbhd 2 Nbhd 3
School 1 xx x
School 2 x x
School 3 xx x
School 4 x xxx
13Notation
With hierarchical models we use a subscript
notation that has one subscript per level and
nesting is implied reading from the left. For
example, subscript pattern ijk denotes the ith
level 1 unit within the jth level 2 unit within
the kth level 3 unit. If models become
cross-classified we use the term classification
instead of level. With notation that has one
subscript per classification, that also captures
the relationship between classifications,
notation can become very cumbersome. We propose
an alternative notation introduced in Browne et
al. (2001) that only has a single subscript no
matter how many classifications are in the model.
14Single subscript notation
We write the model as
where classification 2 is neighbourhood and
classification 3 is school. Classification 1
always corresponds to the classification at which
the response measurements are made, in this case
pupils. For pupils 1 and 11 equation (1) becomes
15Classification diagrams
In the single subscript notation we lose
information about the relationship (crossed or
nested) between classifications. A useful way of
conveying this information is with the
classification diagram. Which has one node per
classification and nodes linked by arrows have a
nested relationship and unlinked nodes have a
crossed relationship.
School
Neighbourhood
Pupil
Cross-classified structure where pupils from a
school come from many neighbourhoods and pupils
from a neighbourhood attend several schools.
Nested structure where schools are contained
within neighbourhoods
16Example Artificial insemination by donor
1901 women 279 donors 1328 donations 12100
ovulatory cycles response is whether conception
occurs in a given cycle
In terms of a unit diagram
Or a classification diagram
17Model for artificial insemination data
We can write the model as
Results
Note cross-classified models can be fitted in
IGLS but are far easier to fit using MCMC
estimation.
18Extension 2 Multiple membership models
- When level 1 units are members of more than one
higher level unit we describe a model for such
data as a multiple membership model. - For example,
- Pupils change schools/classes and each
school/class has an effect on pupil outcomes. - Patients are seen by more than one nurse during
the course of their treatment.
19Notation
Note that nurse(i) now indexes the set of nurses
that treat patient i and w(2)i,j is a weighting
factor relating patient i to nurse j. For
example, with four patients and three nurses, we
may have the following weights
20Classification diagrams for multiple membership
relationships
Double arrows indicate a multiple membership
relationship between classifications.
We can mix multiple membership, crossed and
hierarchical structures in a single model.
21Example involving nesting, crossing and multiple
membership Danish chickens
Production hierarchy 10,127 child flocks
725
houses 304 farms
Breeding hierarchy 10,127 child flocks 200 parent
flocks
As a unit diagram
As a classification diagram
22Model and results
Response is cases of salmonella Note multiple
membership models can be fitted in IGLS and this
model/dataset represents roughly the most complex
model that the method can handle. Such models are
far easier to fit using MCMC estimation.
23Random effect modelling of great tit nesting
behaviour
- An extension of cross-classified models to
multivariate responses. - Collaborative research with Richard Pettifor
(Institute of Zoology, London), and Robin
McCleery and Ben Sheldon (University of Oxford).
24Wytham woods great tit dataset
- A longitudinal study of great tits nesting in
Wytham Woods, Oxfordshire. - 6 responses 3 continuous 3 binary.
- Clutch size, lay date and mean nestling mass.
- Nest success, male and female survival.
- Data 4165 nesting attempts over a period of 34
years. - There are 4 higher-level classifications of the
data female parent, male parent, nestbox and
year.
25Data background
The data structure can be summarised as follows
Note there is very little information on each
individual male and female bird but we can get
some estimates of variability via a random
effects model.
26Diagrammatic representation of the dataset.
27Univariate cross-classified random effect
modelling
- For each of the 6 responses we will firstly fit a
univariate model, normal responses for the
continuous variables and probit regression for
the binary variables. For example using notation
of Browne et al. (2001) and letting response yi
be clutch size
28Estimation
- We use MCMC estimation in MLwiN and choose
diffuse priors for all parameters. - We run 3 MCMC chains from different starting
points for 250k iterations each (500k for binary
responses) and use the Gelman-Rubin diagnostic to
decide burn-in length. - We compared results with the equivalent classical
model using the Genstat software package and got
broadly similar results. - We fit all four higher classifications and do not
consider model comparison.
29Clutch Size
Here we see that the average clutch size is just
below 9 eggs with large variability between
female birds and some variability between years.
Male birds and nest boxes have less impact.
30Lay Date (days after April 1st)
Here we see that the mean lay date is around the
end of April/beginning of May. The biggest driver
of lay date is the year which is probably
indicating weather differences. There is some
variability due to female birds but little impact
of nest box and male bird.
31Nestling Mass
Here the response is the average mass of the
chicks in a brood at 10 days old. Note here lots
of the variability is unexplained and both
parents are equally important.
32Human example
Helena Jayne Browne Born 22nd May 2006 Birth
Weight 8lb 0oz
Sarah Victoria Browne Born 20th July 2004 Birth
Weight 6lb 6oz
Fathers birth weight 9lb 13oz, Mothers birth
weight 6lb 8oz
33Nest Success
Here we define nest success as one of the ringed
nestlings captured in later years. The value 0.01
for ß corresponds to around a 50 success rate.
Most of the variability is explained by the
Binomial assumption with the bulk of the
over-dispersion mainly due to yearly differences.
34Male Survival
Here male survival is defined as being observed
breeding in later years. The average probability
is 0.334 and there is very little over-dispersion
with differences between years being the main
factor. Note the actual response is being
observed breeding in later years and so the real
probability is higher!
35Female survival
Here female survival is defined as being observed
breeding in later years. The average probability
is 0.381 and again there isnt much
over-dispersion with differences between
nestboxes and years being the main factors.
36Multivariate modelling of the great tit dataset
- We now wish to combine the six univariate models
into one big model that will also account for the
correlations between the responses. - We choose a MV Normal model and use latent
variables (Chib and Greenburg, 1998) for the 3
binary responses that take positive values if the
response is 1 and negative values if the response
is 0. - We are then left with a 6-vector for each
observation consisting of the 3 continuous
responses and 3 latent variables. The latent
variables are estimated as an additional step in
the MCMC algorithm and for identifiability the
elements of the level 1 variance matrix that
correspond to their variances are constrained to
equal 1.
37Multivariate Model
Here the vector valued response is decomposed
into a mean vector plus random effects for each
classification.
Inverse Wishart priors are used for each of the
classification variance matrices. The values are
based on considering overall variability in each
response and assuming an equal split for the 5
classifications.
38Use of the multivariate model
- The multivariate model was fitted using an MCMC
algorithm programmed into the MLwiN package which
consists of Gibbs sampling steps for all
parameters apart from the level 1 variance matrix
which requires Metropolis sampling (see Browne
2006). - The multivariate model will give variance
estimates in line with the 6 univariate models. - In addition the covariances/correlations at each
level can be assessed to look at how correlations
are partitioned.
39Partitioning of covariances
40Correlations from a 1-level model
- If we ignore the structure of the data and
consider it as 4165 independent observations we
get the following correlations
CS LD NM NS MS
LD -0.30 X X X X
NM -0.09 -0.06 X X X
NS 0.20 -0.22 0.16 X X
MS 0.02 -0.02 0.04 0.07 X
FS -0.02 -0.02 0.06 0.11 0.21
Note correlations in bold are statistically
significant i.e. 95 credible interval doesnt
contain 0.
41Correlations in full model
CS LD NM NS MS
LD N, F, O -0.30 X X X X
NM F, O -0.09 F, O -0.06 X X X
NS Y, F 0.20 N, F, O -0.22 O 0.16 X X
MS - 0.02 - -0.02 - 0.04 Y 0.07 X
FS F, O -0.02 F, O -0.02 - 0.06 Y, F 0.11 Y, O 0.21
Key Blue ve, Red ve Y year, N nestbox, F
female, O - observation
42Pairs of antagonistic covariances at different
classifications
- There are 3 pairs of antagonistic correlations
i.e. correlations with different signs at
different classifications - LD NM Female 0.20 Observation -0.19
- Interpretation Females who generally lay late,
lay heavier chicks but the later a particular
bird lays the lighter its chicks. - CS FS Female 0.48 Observation -0.20
- Interpretation Birds that lay larger clutches
are more likely to survive but a particular bird
has less chance of surviving if it lays more
eggs. - LD FS Female -0.67 Observation 0.11
- Interpretation Birds that lay early are more
likely to survive but for a particular bird the
later they lay the better!
43Sample size calculations in random effect models
- A current ESRC grant (2006-2009) that funds a
postdoc (Mousa Golalizadeh). - The grant will consider the problem of deciding
on how much data to collect for a research
question taking account of the likely structure
in the collected data (See later slides). - The grant will also investigate how various MCMC
algorithm developments perform in practice when
applied to real datasets. - Finally we will investigate when complex models
are identifiable in the presence of sparse data.
44Background
- Many quantitative research questions are of the
form of a hypothesis A has a significant effect
on B. - To answer such a question data is collected that
allows the researcher to (hopefully) test whether
statistically A has a significant effect on B.
(In fact we aim to reject the hypothesis that A
doesnt significantly affect B). - A test is performed and either the researcher is
happy and A indeed has a significant effect on B
or is left wondering why the data collected do
not back up their hypothesis. Is the hypothesis
false or was the data not sufficient? - The sufficiency of the data is the motivation for
sample size calculations.
45Example
- Suppose I have the research question Are
Welshmen on average taller than 175 cms? - I now need to get hold of a random sample of n
Welshmen and measure each of their heights. - I make some statistical assumption about the
distribution of the heights of Welshmen e.g. that
they come from a Normal distribution. - I might like to check this assumption by plotting
a histogram of the data. - I can then form a statistical hypothesis test and
test whether indeed Welshmen are taller than
175cms. - I need to decide how big to make n, my sample of
Welshmen.
46Hypothesis Testing
- Let us assume our null hypothesis is that the
average height of Welshmen (µ) is 175cm. - So we test H0µ175 vs HAµgt175 (or alternatively
H0?0 vs HA?gt0 where ?µ-175) - In practice we calculate from our sample its mean
( ) and standard deviation (s2) and use these
along with n to form a test statistic which we
can compare with the distribution assumed under
H0
47Type I and Type II errors
- No hypothesis test is perfect and there is always
the possibility of errors - P(Type I error) a significance level or size
- P(Type II error) ß, 1-ß is the power of the
test. - In general we fix a to some value e.g. 0.05, 0.01
then 1-ß depends on our sample size.
Truth Truth
H0 True H0 False
Decision Reject H0 Type I error Correct
Decision Accept H0 Correct Type II error
48Example hypothesis test
- Let us assume that in reality our sample mean is
180cms and the population standard deviation (sd)
is 5cms (known). - We can then form a test statistic as follows
- Note here that for small n and unknown sd we
should use a student-t distribution rather than
Normal. - For a 1-sided Z test we wish Z gt 1.645 and so
we need our sample to be of size 3 to reject H0,
using a student-t distribution increases this to
5. (Here a0.05) - However if the sample mean had been only 176cms
then we would need n gt (1.6455)2 68 Welshmen
to reject H0
49Power calculations
- Our last slide in some sense is backwards as we
cannot get from a given sample mean to choosing a
sample size! - What we do instead is use different terminology
and play God! - We will choose an effect size, ? which will
represent a guess at the increase in the sample
mean for Welshmen. - There then exists an (approximate) formula that
links four quantities, size (a), power (1-ß),
effect size (?) and sample size (n) - Note that the standard error (SE) of ? is a
function of n and s the population sd which is
assumed known. - We can now evaluate one of these quantities
conditional on the others e.g. what sample size
is required given a,1-ß and ??
Here RHS is sum of cases H0 true and H0 false.
50Welsh height example
- Here we have looked at two examples with effect
sizes 5 and 1 respectively. Assume s takes the
value 5 and so let us suppose we take a sample of
size 25 Welshmen. - Then
- Case 1 5/(5/v25)1.645z1-ß,z1-ß3.355
- ß0.9996
- Case 2 1/(5/ v25)1.645z1-ß,z1-ß-0.645
- ß0.25946
- So here a sample of 25 Welshmen from a population
with mean 180cms would almost always result in
rejecting H0, - but if the population mean is 176cms then only
26 of such samples would be rejected. - We can plot curves of how power increases with
sample size as shown in the next slide.
51Power curve for Welshmen example
- Here we see the two power curves for the two
scenarios
52Extending the idea
- The simple formula
can - be used in many situations and hypothesis tests.
- To generalise the idea we assume that ? is an
effect size associated with a statistic that we
wish to compare with a (null) hypothesized value
of 0. - The complication occurs in finding a formula for
the standard error for the statistic and relating
this formula to the sample size, n. - We will next consider an alternative approach
before returning to look at how the above
approach extends to multilevel models.
53The use of simulation
- In reality our (hoped for) research path will be
as follows - Construct research question -gt Form null
hypothesis that we believe false -gt Collect
appropriate data -gt Reject hypothesis therefore
proving our research question. - Assuming what we believe in our research question
is correct and hence null hypothesis is false we
can still be let down by not collecting enough
data. - The idea behind using simulation is to simulate
the data gathering process (assuming we know the
right answer) many times and see how often we can
reject the null hypothesis. The percentage of
rejected null hypotheses (via simulation) will
then estimate power.
54Simulation in our example
- Consider our Welsh height example case 2 where we
believe Welshmen have a mean height of 176cms
(and sd 5cms) and we are testing the hypothesis
H0µ175cms, and we consider a sample size 25. - Then we generate N samples (e.g. 5000) of size
25, - and for each sample
form a lower bound for the confidence interval of
the form - . This we compare with
the value 175 and the proportion greater than 175
is an estimate of the power of the test. - We can repeat this exercise for different sample
sizes and form a power curve.
55Power curve comparison
Note simulation curve is a good approximation of
the theoretical curve although there are some
minor (Monte Carlo) errors even with 5000
simulations per sample size.
56Advantages/Disadvantages
- Theoretical approach is quick when the formula
can be derived. - Approximations for more complex situations exist
which are equally quick. - Simulation approach generalizes to more
situations but is much slower and we may need
large numbers of simulations per scenario to get
accurate power estimates.
57What happens with multilevel data?
- We will here mainly consider 2-level models and
take as our application area education, so we
have students nested within schools. - When deciding on a sampling scheme we have many
choices - How many schools, N ?
- How many pupils per school, nj ?
- Should we collect the same size sample from each
school ? - Our decision will depend on which parameter we
wish to estimate in the model.
58Education Example
- For motivation we considered a two level dataset
with exam marks measured for each student in a
collection of schools. In fact this dataset
exists and has 4915 students in 96 schools. - Our hypothesis of interest is that the exam mark
for an average student is gt 20 (null hypothesis
20) which with such a large sample results in the
null hypothesis being rejected for our particular
data. - If we fit the following multilevel model to the
data we get the estimates given - If we treat these estimates as population values,
we are interested in what power for testing our
hypothesis results from various combinations of N
and nj
59Design effect formula
- If we assume balance then with n pupils in each
of N schools for our simple model (and only this
simple model) the following formula holds - Design effect 1 (n-1)? where ? is the
intra-class correlation. - So if we know the simple random sample size
required for a given power we need to multiply
this by the design effect. - For example our data has ?16.205/(16.205139.367)
0.104 - So for schools of size 10 pupils we would need
190.1041.94 times as many students (in total)
to get the same power. - For this model (and this model only) we could
therefore perform our power calculations assuming
simple random sampling from a population with
variance 155.572 and scale up the sample required
based on the design. - So
- And for schools of size 10 we require
1.94338.4657 pupils which we can round up to 66
schools.
60Simulating multilevel designs
- The process here is similar to the earlier
example except that we need to simulate from a
multilevel model and fit the models using MLwiN
(Rasbash, Browne et al. 2000). - To this end we will write macro code in the MLwiN
macro language to perform the task. - The MLwiN macro language allows datasets to be
simulated, models to be set up and run using
various algorithms and results collected. - It has the advantage of performing all the
operations in one package but programming in the
macro language is not for the faint hearted!
61Simulation continued
- We will perform simulations for schools of 10
pupils where number of schools (N) ranges from 5
to 70. For each N, 5000 datasets are generated. - For each dataset we need to generate 10N level 1
residuals with variance 139.367, N level 2
residuals with variance 16.205 and add these
residuals up correctly with the fixed effect
estimate 21.685. - MLwiN has commands to generate random Normally
distributed observations but also has a SIMU
command which given a model is set up and
estimates given will simulate from it directly
making life easier. - For each simulated dataset we fit the variance
components model using the RIGLS algorithm. For
small numbers of level 2 units we may have
estimation difficulties but MLwiN has an ERROR 0
command which simply ignores such problems. - Note it is also important to ensure the command
BATCH 1 is included else MLwiN may only run RIGLS
for 1 iteration for each model!!
62Comparison of formula/simulations
- The following graph compares the design effect
formula to the simulation approach
63Multilevel mastitis modelling!
- Martin Green has been successful in obtaining 4
years of funding from Wellcome to come and work
with me. - The project is entitled Use of Bayesian
statistical methods to investigate farm
management strategies, cow traits and
decision-making in the prevention of clinical and
sub-clinical mastitis in dairy cows. - Martin is a specialist farm animal veterinary
surgeon and has recently also been appointed to a
chair in Nottinghams new vet school. - He completed a PhD in 2004 at the University of
Warwick in veterinary epidemiology and used MCMC
to fit binary response multilevel models in both
MLwiN and WinBUGS to look at various factors that
affect clinical mastitis in dairy cows.
64Wellcome Fellowship
- In the 4 years of the grant we are fitting random
effect models to a large dataset that Martin has
been collecting in an earlier Milk Development
Council funded grant. - In particular we are looking at how farm
management practices, environmental conditions
and cow characteristics influence the risk of
mastitis during the dry period. - We aim to get both interesting applied results
and also some novel statistical methodology from
the grant and MCMC will again play a big part. - From a statistical point of view we are looking
at model fit diagnostics and model comparison in
binary response random effect models.
65Other applications
- Current PhD student projects
- Kelly Handley Statistical Analysis of Mass
Spectrometry data. - Chris Brignell Statistical Shape Analysis in
Brain Imaging. - Various MMath /BSc. projects looking at
applications of multilevel modelling of share
prices, educational data, house prices and
disease mapping.
66Future Work
- Co-Investigator on BBSRC grant application (joint
between Nottingham and Bristol vet schools)
entitled An investigation of Molecular
Characteristics, Infection Patterns and
Prevention of E. Coli and S. uberis Mastitis in
UK Dairy Herds. (Main investigators Andrew
Bradley Martin Green). - Have been investigating MCMC algorithms for
structured multivariate Normal models which are
what in reality the IGLS algorithm fits. This
model family also includes multilevel time series
models. I have been investigating these models
with an MMath. student and I will have a visitor
from Turkey who has government funding to visit
me and investigate the models. - I am a named collaborator on the ESRC research
node held by Kelvyn Jones and the multilevel team
in Bristol.
67References
- Browne, W.J. (2003). MCMC Estimation in MLwiN.
London Institute of Education, University of
London. - Browne, W.J. (2006). MCMC Estimation of
constrained variance matrices with applications
in multilevel modelling. Computational Statistics
and Data Analysis. 50 1655-1677. - Browne, W.J. and Draper D. (2006). A Comparison
of Bayesian and likelihood methods for fitting
multilevel models (with discussion). Bayesian
Analysis. 1 473-550. - Browne, W.J., Goldstein, H. and Rasbash, J.
(2001). Multiple membership multiple
classification (MMMC) models. Statistical
Modelling 1 103-124. - Chib, S. and Greenburg, E. (1998). Analysis of
multivariate probit models. Biometrika 85,
347-361. - Rasbash, J., Browne, W.J., Goldstein, H., Yang,
M., Plewis, I., Healy, M., Woodhouse, G.,Draper,
D., Langford, I., Lewis, T. (2000). A Users
Guide to MLwiN, Version 2.1, London Institute of
Education, University of London.