Title: Clustered or Multilevel Data
1. Clustered or Multilevel Data
- What are clustered or multilevel data?
- Why are multilevel data common in outcomes research?
- What methods of analysis are available?
- What are random versus fixed effects?
- How does the N at each level affect model choice?
- How does the study question affect model choice?
2. What are clustered data?
- Gathering individual observations into larger groups does not create clustered data
- Individual observations from a simple random sample are never clustered
- Clustering is a result of sampling/design
- It usually arises from stages/levels in obtaining the individual units of observation
3. Examples of Clustered Data
- Litters of puppies
- Pieces of leaves (several per leaf)
- Intervention on institutions (e.g., schools)
- TB cases and their contacts
- Survey stratified by county and census tract
- A sample of physicians and their patients
- Repeated measurements on individuals
4. Clustered or Multilevel Data
[Diagram: level 2 units 1, 2, and 3, each with its level 1 observations nested beneath it]
- Level 2 (cluster): physicians, schools, census tracts, leaves
- Level 1 (individual observation): patients, students, residents, leaf samples
5. Cluster analysis is a different topic: it finds clusters in data
[Figure: scatter of data points, each marked x]
6. Repeated Measures are also a Type of Clustered or Multilevel Data
[Diagram: Persons 1, 2, and 3, each with several observations at different times nested beneath]
- Level 2 (cluster): individual subjects
- Level 1 (individual observation): observations at different times
7. Multilevel Data Are Common in Outcomes Research
- Secondary data sets are often multilevel
- Patients clustered within physicians clustered within hospitals or clinics (hospital discharges)
- National health surveys (NHIS, NHANES) are stratified probability surveys
- Health interventions often randomize institutions or geographic areas
- Health policy changes are applied at the geographic or institutional level
8. Characteristics of Clustered Data
- Measurements within clusters are correlated (e.g., measures on the same person are more alike than measurements across persons)
- Variables can be measured at each level
- The variance of the outcome can be attributed to each level
- Standard statistical models and tests, which assume independent observations, are incorrect
9. Effects of Clustered Data
- The assumptions of independence and equal variance of standard statistics do not hold
- Standard errors for statistical testing will be incorrect
- Regression models cannot be fit using methods that assume independence of observations
- For example, the ordinary least squares calculation of the regression line is incorrect
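A small simulation can illustrate why the standard errors come out wrong. The sketch below is not from the PORT study: the cluster count (20), cluster size (10), and variance split (0.3 between, 0.7 within) are all hypothetical values chosen for illustration. It compares the naive standard error of the overall mean, computed as if all observations were independent, with the spread the mean actually shows across repeated clustered samples.

```python
import random
import statistics

def simulate_clustered_sample(rng, n_clusters=20, n_per_cluster=10,
                              between_sd=0.3 ** 0.5, within_sd=0.7 ** 0.5):
    """Draw one clustered sample: every member of a cluster shares a random shift."""
    values = []
    for _ in range(n_clusters):
        cluster_shift = rng.gauss(0.0, between_sd)  # shared by the whole cluster
        for _ in range(n_per_cluster):
            values.append(cluster_shift + rng.gauss(0.0, within_sd))
    return values

def compare_standard_errors(n_replications=1000, seed=1):
    """Return (average naive SE of the mean, actual SD of the mean across samples)."""
    rng = random.Random(seed)
    sample_means = []
    naive_ses = []
    for _ in range(n_replications):
        sample = simulate_clustered_sample(rng)
        sample_means.append(statistics.mean(sample))
        # Naive SE: pretends all observations are independent.
        naive_ses.append(statistics.stdev(sample) / len(sample) ** 0.5)
    return statistics.mean(naive_ses), statistics.stdev(sample_means)

naive_se, actual_se = compare_standard_errors()
print(f"average naive SE of the mean:         {naive_se:.3f}")
print(f"actual SD of the mean across samples: {actual_se:.3f}")
```

Under these assumed settings the naive standard error is noticeably smaller than the actual variability of the mean, so tests based on it would be too optimistic.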
10. Example of Multilevel Data with a Linear Outcome Variable
- PORT study of type II diabetes patients' satisfaction with medical care
- Outcome: score from 14 questionnaire items
- Sample of 70 physicians (level 2 sample)
- Sample of 1492 patients (level 1 sample)
- Mean 21.3 patients per physician
- Range from 5 to 45 patients per physician
- Two levels of covariates considered
- Physician years in practice, specialty (level 2)
- Patient age, gender (level 1)
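11. Clustered/Multi-level Data Variance
Outcome = Patient Satisfaction Score
Level 2: Physicians (N = 70)
Level 1: Patients (N = 1492)
[Diagram: three example physicians with their patients' scores]
- MD1 (mean = 81): patient scores 85, 77, 81
- MD2 (mean = 58): patient scores 55, 61
- MD3 (mean = 74): patient scores 79, 68, 74, 75
Variance in the patient score divides into two parts: (1) the variance between physicians, $\sigma^2_B$, and (2) the variance within the physicians, $\sigma^2_W$. So the total variance = $\sigma^2_B + \sigma^2_W$.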
12. Intraclass Correlation Coefficient
The intraclass correlation coefficient (ICC) is a measure of the correlation among the individual observations within the clusters.
It is calculated as the ratio of the between-cluster variance to the total variance:
ICC = $\sigma^2_B / (\sigma^2_B + \sigma^2_W)$
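13. Intraclass Correlation Coefficient (ICC)
[Diagram: the same three physicians, with every patient score equal to the physician's mean]
- MD1 (mean = 81): patient scores 81, 81, 81
- MD2 (mean = 58): patient scores 58, 58
- MD3 (mean = 74): patient scores 74, 74, 74, 74
Take the extreme case where each MD's patients all have the same score, so there is no variance within the physicians ($\sigma^2_W = 0$). Then ICC = $\sigma^2_B / (\sigma^2_B + \sigma^2_W) = \sigma^2_B / (\sigma^2_B + 0) = 1$: perfect correlation within the clusters.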
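The variance split and the ICC can be checked by hand on the nine example scores. The sketch below is a descriptive calculation only, not the estimator a mixed-model routine would use: it divides every sum of squares by the total number of observations so that total = between + within holds exactly. The assignment of scores to MD1, MD2, and MD3 is inferred from the physician means shown on the slides, and the function names are made up for this example.

```python
from statistics import fmean

def variance_components(clusters):
    """Split the variance of all scores into between- and within-cluster parts.

    `clusters` maps a cluster label to its list of scores. All three variances
    divide by the total number of observations (descriptive, not unbiased).
    """
    all_scores = [s for scores in clusters.values() for s in scores]
    n_total = len(all_scores)
    grand_mean = fmean(all_scores)
    between = sum(len(scores) * (fmean(scores) - grand_mean) ** 2
                  for scores in clusters.values()) / n_total
    within = sum((s - fmean(scores)) ** 2
                 for scores in clusters.values() for s in scores) / n_total
    total = sum((s - grand_mean) ** 2 for s in all_scores) / n_total
    return between, within, total

def icc(clusters):
    """Between-cluster variance as a share of the total variance."""
    between, within, _ = variance_components(clusters)
    return between / (between + within)

slide_scores = {"MD1": [85, 77, 81], "MD2": [55, 61], "MD3": [79, 68, 74, 75]}
extreme_case = {"MD1": [81, 81, 81], "MD2": [58, 58], "MD3": [74, 74, 74, 74]}

between, within, total = variance_components(slide_scores)
print(f"between = {between:.2f}, within = {within:.2f}, total = {total:.2f}")
print(f"ICC, slide scores: {icc(slide_scores):.2f}")
print(f"ICC, extreme case: {icc(extreme_case):.2f}")
```

With only three physicians the first ICC (about 0.85) says nothing about the real study; the point is that the extreme case gives exactly 1 because the within-physician variance is zero.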
14. Methods of Analyzing Multilevel Data
- Method (1): Use a single measure per cluster (e.g., mean satisfaction score) as the outcome variable
- Method (2): Fit a model with indicator variables for each cluster (minus one)
- Method (3): Fit a regression model with generalized estimating equations (GEE model)
- Method (4): Fit a fixed effects conditional regression model
- Method (5): Fit a random effects regression model
15. Choice of Analysis Model: Two Main Considerations
- What is the research question?
- How many observations are there at level 2, and how many level 1 observations are there per level 2 observation?
16. Choice of Analysis Model: The Research Question
- Question 1: What is the relationship of patient age to the MD satisfaction score? (level 1 predictor)
- Question 2: What is the relationship between MD years in practice and the score? (level 2 predictor)
- Question 3: How much variation is there in the mean satisfaction score between MDs, adjusted for level 1 and level 2 predictors? (level 2 variance)
17. Method (1): Use mean satisfaction score for each physician as outcome
- Single measure for each cluster
- Simple, easy to understand
- Loses information and power (N = 70, not 1492)
- Ignores that the cluster means have different variances when clusters are different sizes
- No individual-level variables except as mean values (e.g., mean patient age)
- Only answers question 2 (MD years in practice), although mean patient age can be used as a predictor
18. Method (2): Use dummy variable for each MD
- A dummy variable represents each MD's effect
- Treats each MD effect as equally well estimated, but some of the clusters are small (N = 5, 7, 8, etc.)
- If we had 70 MDs and only 200 patients, 69 dummy variables would use up too many degrees of freedom
- If we had only 10 MDs, it would be a good choice
- Can only answer question 1 (relationship of patient age to satisfaction score)
19. Method (3): Regression with Generalized Estimating Equations (GEE)
- Estimates regression coefficients and variance separately to account for clustering
- Gives the population-average effect of age on satisfaction (marginal model)
- The analyst indicates the correlation structure within the clusters
- Answers questions 1 and 2 but not 3
- Variation in patient satisfaction between MDs is not modeled separately
20. Specifying Correlation within Clusters for GEE Model
- The most common assumption is one correlation coefficient for all pairs of observations within the clusters, called compound symmetry or exchangeable correlation structure
- Other assumptions about the correlation are possible (e.g., correlation weakens with time/distance)
- The GEE regression will give good estimates of the predictor coefficients even if the specified correlation structure is incorrect; use the robust standard errors (SEs) so that the tests remain valid
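To make the two kinds of working correlation concrete, the sketch below builds each as a small matrix for a cluster of four observations. The function names and the value rho = 0.5 are hypothetical. The decaying form shown (rho raised to the distance between observations, often called first-order autoregressive) is only one common example of a correlation that weakens with time; the slides do not say which form was intended.

```python
def exchangeable_correlation(n_obs, rho):
    """One common correlation `rho` for every pair of observations in a cluster."""
    return [[1.0 if i == j else rho for j in range(n_obs)] for i in range(n_obs)]

def decaying_correlation(n_obs, rho):
    """Correlation rho ** |i - j|: weaker the further apart two observations are."""
    return [[rho ** abs(i - j) for j in range(n_obs)] for i in range(n_obs)]

for name, matrix in [("exchangeable", exchangeable_correlation(4, 0.5)),
                     ("decaying", decaying_correlation(4, 0.5))]:
    print(name)
    for row in matrix:
        print("  " + " ".join(f"{value:.3f}" for value in row))
```

The exchangeable matrix suits patients within a physician, where no pair is "closer" than another; a decaying matrix is more natural for repeated measurements on one person.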
21. Method (4): Use Conditional Regression Model with Fixed Effects
- Looks within each MD to model the association between patient age and the score
- No coefficient for MD (conditioned out)
- Good choice if the number of MDs is large relative to the number of patients (70 MDs, 200 patients)
- Matched pairs are analyzed with conditional regression
- Answers question 1, but not 2 and 3
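For a linear outcome, one common way to "look within each MD" is to subtract each MD's own mean age and mean score from that MD's patients and then fit the slope on the centered values (the within estimator). In a linear model this gives the same age slope as the dummy-variable approach of Method (2). The sketch below assumes that reading of the method. The scores are the nine example scores from the earlier slides, but the patient ages are invented for illustration.

```python
from statistics import fmean

# (patient age, satisfaction score) pairs per MD; the ages are hypothetical.
patients_by_md = {
    "MD1": [(60, 85), (50, 77), (55, 81)],
    "MD2": [(40, 55), (50, 61)],
    "MD3": [(70, 79), (50, 68), (60, 74), (60, 75)],
}

def slope(pairs):
    """Least squares slope of y on x for a list of (x, y) pairs."""
    mean_x = fmean(x for x, _ in pairs)
    mean_y = fmean(y for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx

def center_within_clusters(clusters):
    """Subtract each cluster's own mean age and mean score from its patients."""
    centered = []
    for pairs in clusters.values():
        mean_x = fmean(x for x, _ in pairs)
        mean_y = fmean(y for _, y in pairs)
        centered.extend((x - mean_x, y - mean_y) for x, y in pairs)
    return centered

pooled_pairs = [pair for pairs in patients_by_md.values() for pair in pairs]
pooled_slope = slope(pooled_pairs)                             # ignores clustering
within_slope = slope(center_within_clusters(patients_by_md))  # within-MD only
print(f"pooled slope:    {pooled_slope:.3f}")
print(f"within-MD slope: {within_slope:.3f}")
```

The two slopes differ because the pooled fit mixes differences between physicians with differences among each physician's own patients; centering removes the physician-level part, which is also why MD-level predictors such as years in practice cannot be estimated this way.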
22. Method (5): Use a Random Effects Regression Model
- Predictor variables for both individual and cluster level variables
- Models variance associated with MD separately from variance within the clusters in patient satisfaction
- Improves the estimate of MD effect by treating MD mean scores as a random sample of scores
- Only model that answers all 3 questions
23. Fixed versus Random Effects
- Effects are random when the levels are a sample of a larger population
- They have variation because they are sampled: another sample would give different data
- Effects are fixed if they represent all possible levels/members of a population
- E.g., male/female, treatment groups, all the regions of the U.S.
24. Fixed versus Random Effects
- Effects can often be considered fixed or random depending on the research question
- If you want to generalize from the sample of doctors to other doctors, you would consider the doctors as a random effect
- If the doctors in your sample are the only ones you care about, you could consider doctors as a fixed effect
25. Random Effects Illustrated from the PORT Diabetes Study
- In the MD satisfaction score example, begin by ignoring predictors such as the patient's age and the physician's number of years in practice
- The overall mean patient satisfaction score for all 1492 patients was 67.7 (SD = 23.5)
- Separate means calculated for each physician's patients ranged from 53.4 to 87.1
26. Random Effects: MD Score
- Consider the satisfaction score as composed of two parts: the overall mean ($\mu$) plus or minus the difference from that overall mean of the mean score for each physician ($\alpha_j$)
- Each MD's difference, $\alpha_j$, is a random effect because the 70 MDs represent a sample of possible MDs
- If we sampled another 70, the $\alpha_j$s would be different
27. A Simple Random Effects Model
- If we add a term for error associated with each individual patient, the model is
- $y_{ij} = \mu + \alpha_j + e_{ij}$,
- where $\mu$ = overall mean, $\alpha_j$ = difference for MD $j$, and $e_{ij}$ = individual error
- The model says there is random variation from the mean score at the level of MDs (level 2) plus variation at the level of patients (level 1)
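One way to see what this model says is to generate data from it and then recover the two variances. The sketch below simulates scores as overall mean plus an MD effect plus a patient error, then estimates the between-MD and within-MD variances with the one-way ANOVA method of moments for equal-sized clusters. The overall mean (67.7) is from the slides, the between-MD variance (24.4) is borrowed from the final model slide purely for illustration, and the within-MD variance (527.85, i.e. 23.5 squared minus 24.4) is an assumption. The simulated design of 2000 MDs with 10 patients each is hypothetical and much larger than the study, so that the estimates are stable.

```python
import random
from statistics import fmean

def simulate_scores(rng, n_mds, n_patients, overall_mean, between_sd, within_sd):
    """Generate y_ij = overall mean + MD effect_j + patient error_ij."""
    data = []
    for _ in range(n_mds):
        md_effect = rng.gauss(0.0, between_sd)       # level 2 random effect
        data.append([overall_mean + md_effect + rng.gauss(0.0, within_sd)
                     for _ in range(n_patients)])    # level 1 errors
    return data

def anova_variance_components(data):
    """Method-of-moments estimates (between, within) for equal-sized clusters."""
    n_clusters = len(data)
    n_per = len(data[0])
    grand_mean = fmean(y for cluster in data for y in cluster)
    cluster_means = [fmean(cluster) for cluster in data]
    ms_between = (n_per * sum((m - grand_mean) ** 2 for m in cluster_means)
                  / (n_clusters - 1))
    ms_within = (sum((y - m) ** 2
                     for cluster, m in zip(data, cluster_means) for y in cluster)
                 / (n_clusters * (n_per - 1)))
    return (ms_between - ms_within) / n_per, ms_within

rng = random.Random(42)
scores = simulate_scores(rng, n_mds=2000, n_patients=10, overall_mean=67.7,
                         between_sd=24.4 ** 0.5, within_sd=527.85 ** 0.5)
between_var, within_var = anova_variance_components(scores)
print(f"estimated between-MD variance: {between_var:.1f}")
print(f"estimated within-MD variance:  {within_var:.1f}")
```

Mixed-model software would normally estimate these variances by maximum likelihood rather than by this moment formula, and would handle unequal cluster sizes.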
28. What does the random effects model do?
- Actual MD means vary from 53.4 to 87.1 and patient N for each MD varies from 5 to 45. Thus, the actual MD means are not very stable.
- The random effects model assumes MD mean scores are from an underlying normal distribution
- It uses the information from all the MDs and the characteristics of a normal distribution to estimate the true $\alpha_j$s
29. Estimating the Random Effects
- In our example from the PORT study, raw means range from 53.4 to 87.1
- Ordinary least squares estimates range from 54.0 to 87.9 (term for each MD, ANCOVA)
- The random effects estimates of the mean patient scores by MD ranged from 60.4 to 78.6; their SD was 4.94
- So the random effects estimates are closer to the overall mean
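The pull toward the overall mean can be sketched with the standard shrinkage formula for a random intercept model with known variances: the raw MD mean gets weight between-variance / (between-variance + within-variance / n), so small clusters are pulled in more. This is an illustration under assumptions, not a reproduction of the PORT estimates. The within-MD variance below is assumed (23.5 squared minus 24.4), the between-MD variance is borrowed from the final model slide, and the slides do not say how many patients the MD with the highest raw mean actually had.

```python
def shrunken_mean(raw_mean, n_patients, overall_mean, between_var, within_var):
    """Pull a cluster's raw mean toward the overall mean.

    The weight on the raw mean is between_var / (between_var + within_var / n):
    small clusters have noisier means, so they are pulled in more strongly.
    """
    weight = between_var / (between_var + within_var / n_patients)
    return overall_mean + weight * (raw_mean - overall_mean)

OVERALL_MEAN = 67.7   # from the slides
BETWEEN_VAR = 24.4    # borrowed from the final model slide, for illustration
WITHIN_VAR = 527.85   # assumption: 23.5 ** 2 - 24.4

for n in (5, 45):
    estimate = shrunken_mean(87.1, n, OVERALL_MEAN, BETWEEN_VAR, WITHIN_VAR)
    print(f"raw mean 87.1 with {n:2d} patients -> shrunken estimate {estimate:.1f}")
```

A raw mean based on 5 patients moves most of the way back to 67.7, while the same raw mean based on 45 patients moves much less.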
30. Adding MD and Patient Predictors to the Simple Model
- We want to examine the effect of patient's age (level 1 variable) and MD years in practice (level 2) on the satisfaction score
- Specify a regression model with 2 predictor variables and a random effect for the MD
- The score for each MD is modeled both by adjusting for patient's age and MD years in practice and by modeling the distribution of MD mean scores
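Written out, a model of this kind might look as follows. The notation is an assumption chosen for illustration: $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are the coefficients for patient age and MD years in practice, $\alpha_j$ is the random effect for MD $j$, and $e_{ij}$ is the error for patient $i$ of MD $j$.

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{age}_{ij} + \beta_2\,\mathrm{years}_{j} + \alpha_j + e_{ij},
\qquad \alpha_j \sim N(0, \sigma^2_B), \quad e_{ij} \sim N(0, \sigma^2_W)
```

Age carries both subscripts because it varies from patient to patient, while years in practice carries only $j$ because it is the same for every patient of a given MD.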
31. Final Random (or Mixed) Effects Regression Model
- Positive association with patient age ($\beta$ = 0.15, p = 0.003; satisfaction score goes up with age)
- No association with MD years in practice (p = 0.69)
- Significant variance (24.4) in satisfaction score by MD (random effect)
32. Summary
- Clustered data should not be analyzed with standard statistical methods and tests
- Reduction of outcome and predictors to one value per cluster is an option but loses information
- Choice of remaining methods (dummy variables, conditional regression, GEE, or random effects) depends on the research question and on the number of observations at each level
33. Summary
- Research questions affect choice of method
- If you only care about predictors, GEE models are a good alternative
- If the question is about variation between clusters (level 2 variable), a model that produces random effects estimates is needed
- The number of clusters has to be large enough to estimate a random effect (N = 30)
- A small number of clusters can be handled with dummy variables
34. Data Set for Homework
- CA hospitals CABG registry
- Patients (N = 28,555) clustered within hospitals (N = 80)
- Binary outcome: alive/dead after 30 days
- Patient-level characteristics and hospital characteristics
- Use Stata to answer the questions; syntax for the models is supplied