Clustered or Multilevel Data

Slides: 35

Provided by: dennis61

1
Clustered or Multilevel Data

What are clustered or multilevel data?
Why are multilevel data common in outcomes
research?
What methods of analysis are available?
What are random versus fixed effects?
How does the N at each level affect model choice?
How does the study question affect model choice?

2
What are clustered data?

Gathering individual observations into larger
groups does not create clustered data
Individual observations from a simple, random
sample are never clustered
Clustering is a result of sampling/design
Usually from stages/levels in obtaining the
individual units of observation

3
Examples of Clustered Data

Litters of puppies
Pieces of leaves (several per leaf)
Intervention on institutions (eg, schools)
TB cases and their contacts
Survey stratified by county and census tract
A sample of physicians and their patients
Repeated measurements on individuals

4
Clustered or Multilevel Data
[Figure: two-level hierarchy]
Level 2 (cluster): physicians, schools, census tracts, leaves (Level 2 units 1, 2, 3)
Level 1 (individual observation): patients, students, residents, leaf samples
5
Cluster analysis is a different topic: it finds
clusters in data
[Figure: scatter of points marked x]
6
Repeated Measures are also a Type of Clustered or
Multilevel Data
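[Figure: repeated measures as a two-level hierarchy]
Level 2 (cluster): individual subjects (Person 1, Person 2, Person 3)
Level 1 (individual observation): observations at different times
Observation labels in the figure: 1,1 2,1 3,1 1,2 2,2 1,3 2,3 3,3 4,3 (with a "Time 1" label)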
7
Multilevel Data Are Common in Outcomes Research

Secondary data sets are often multilevel
Patients clustered within physicians clustered
within hospitals or clinics (hospital discharges)
National health surveys (NHIS, NHANES) are
stratified probability surveys
Health interventions often randomize institutions
or geographic areas
Health policy changes are applied at geographic
or institutional level

8
Characteristics of Clustered Data

Measurements within clusters are correlated (eg,
measures on same person are more alike than
measurements across persons)
Variables can be measured at each level
The variance of the outcome can be attributed to
each level
Standard statistical models and tests, which assume
independent observations, are incorrect

9
Effects of Clustered Data

The assumptions of independence and equal
variance of standard statistics do not hold
Standard errors for statistical testing will be
incorrect
Regression models cannot be fit using methods
that assume independence of observations
For example, ordinary least squares standard errors
for the regression line are incorrect

10
Example of Multilevel Data with a Linear Outcome
Variable

PORT study of type II diabetes patients'
satisfaction with medical care
Outcome score from 14 questionnaire items
Sample of 70 physicians (level 2 sample)
Sample of 1492 patients (level 1 sample)
Mean 21.3 patients per physician
Range from 5 to 45 patients per physician
Two levels of covariates considered
Physician years in practice, specialty (level 2)
Patient age, gender (level 1)

11
Clustered/Multi-level Data Variance
Outcome: Patient Satisfaction Score
[Figure: two-level diagram]
Level 2: Physicians (N = 70), e.g. MD1 mean = 81, MD2 mean = 58, MD3 mean = 74
Level 1: Patients (N = 1492), example scores 79, 55, 61, 68, 74, 75, 85, 77, 81
Variance in the patient score divides into two
parts: (1) the variance between physicians,
$\sigma^2_B$, and (2) the variance within the physicians,
$\sigma^2_W$. So the total variance is $\sigma^2_B + \sigma^2_W$.
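The split into a between-physician part and a within-physician part can be checked numerically. The following is a minimal Python sketch, assuming made-up scores for three hypothetical physicians (the grouping and values are illustrative, not PORT data). It verifies the sum-of-squares identity behind the split: total SS = between SS + within SS.

```python
from statistics import mean

# Hypothetical satisfaction scores grouped by physician (not PORT data).
scores_by_md = {
    "MD1": [85.0, 77.0, 81.0],
    "MD2": [55.0, 61.0, 58.0],
    "MD3": [79.0, 68.0, 75.0],
}

all_scores = [y for ys in scores_by_md.values() for y in ys]
grand_mean = mean(all_scores)

# Between-physician sum of squares: spread of MD means around the grand mean.
ss_between = sum(
    len(ys) * (mean(ys) - grand_mean) ** 2 for ys in scores_by_md.values()
)
# Within-physician sum of squares: spread of patients around their own MD mean.
ss_within = sum(
    (y - mean(ys)) ** 2 for ys in scores_by_md.values() for y in ys
)
# Total sum of squares: spread of all patients around the grand mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_scores)

print(ss_between, ss_within, ss_total)
```

In this toy data most of the spread sits between physicians, which is the situation in which ignoring the clustering does the most damage.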
12
Intraclass Correlation Coefficient
The intraclass correlation coefficient (ICC) is a
measure of the correlation among the individual
observations within the clusters
It is calculated by the ratio of the between
cluster variance to the total variance
$\mathrm{ICC} = \sigma^2_B / (\sigma^2_B + \sigma^2_W)$
13
Intraclass Correlation Coefficient (ICC)
[Figure: MD1 mean = 81, MD2 mean = 58, MD3 mean = 74; every patient score shown is 58, 74, or 81]
Take the extreme case where each MD's patients
have the same score: no variance within the
physicians. So $\mathrm{ICC} = \sigma^2_B / (\sigma^2_B + \sigma^2_W) = \sigma^2_B / (\sigma^2_B + 0) = 1$:
perfect correlation within the
clusters.
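One common way to estimate the ICC from data is the one-way ANOVA estimator. It is sketched below in Python as an illustration, not as the method used in the PORT analysis. The function name `icc_oneway` and the toy data are assumptions. The extreme case above, where every patient has the MD's mean score, should give an ICC of 1.

```python
from statistics import mean

def icc_oneway(groups):
    """One-way ANOVA estimate of the ICC; groups is a list of clusters (lists of scores)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(y for g in groups for y in g)

    # Mean squares between and within clusters.
    ms_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    ms_within = sum((y - mean(g)) ** 2 for g in groups for y in g) / (n_total - k)

    # n0 is an "average" cluster size that adjusts for unequal cluster sizes.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)

    var_between = max((ms_between - ms_within) / n0, 0.0)  # truncate at zero
    var_within = ms_within
    return var_between / (var_between + var_within)

# Extreme case: every patient has the MD's mean score, so within variance is 0.
extreme = [[81, 81, 81], [58, 58, 58], [74, 74, 74]]
print(icc_oneway(extreme))  # 1.0

# Hypothetical scores with some within-MD spread give an ICC below 1.
mixed = [[85, 77, 81], [55, 61, 58], [79, 68, 75]]
print(icc_oneway(mixed))
```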
14
Methods of Analyzing Multilevel Data

Use a single measure per cluster (e.g., mean
satisfaction score) as the outcome variable
Fit a model with indicator variables for each
cluster (minus one)
Fit a regression model with generalized
estimating equations (GEE model)
Fit a fixed effects conditional regression model
Fit a random effects regression model

15
Choice of Analysis Model: Two Main Considerations

What is the research question?
How many observations are there at level 2, and
how many level 1 observations are there per level
2 observation?

16
Choice of Analysis Model: The Research Question

What is the relationship of patient age to the MD
satisfaction score? (level 1 predictor)
What is the relationship between MD years in
practice and the score? (level 2 predictor)
How much variation is there in the mean
satisfaction score between MDs adjusted for level
1 and level 2 predictors? (level 2 variance)

17
Method (1) Use mean satisfaction score for each
physician as outcome

Single measure for each cluster
simple, easy to understand
loses information, power (N = 70, not 1492)
ignores that cluster means have different
variances when clusters are different sizes
no individual level variables except as mean
values (eg, mean patient age)
Only answers question 2 (MD years in practice)
although can use mean patient age
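The reduction to one row per physician can be sketched as follows. The records are hypothetical (not PORT data) and the field names are assumptions. Note how the analysis N drops from the number of patients to the number of physicians, and how patient age survives only as a cluster mean.

```python
from statistics import mean

# Hypothetical patient-level records: (md_id, patient_age, satisfaction_score).
patients = [
    ("MD1", 62, 85.0), ("MD1", 48, 77.0), ("MD1", 70, 81.0),
    ("MD2", 55, 55.0), ("MD2", 41, 61.0),
    ("MD3", 66, 79.0), ("MD3", 59, 68.0), ("MD3", 73, 75.0), ("MD3", 50, 74.0),
]

# Group rows by physician.
by_md = {}
for md_id, age, score in patients:
    by_md.setdefault(md_id, []).append((age, score))

# One row per physician: cluster size, mean patient age, mean score.
cluster_rows = [
    {
        "md_id": md_id,
        "n_patients": len(rows),
        "mean_age": mean(age for age, _ in rows),
        "mean_score": mean(score for _, score in rows),
    }
    for md_id, rows in by_md.items()
]

for row in cluster_rows:
    print(row)
```

The `n_patients` column is kept because means from small clusters are noisier than means from large ones, which an unweighted analysis of `mean_score` would ignore.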

18
Method (2) Use dummy variable for each MD

Dummy variable represents each MD effect
treats each MD effect as equally well estimated
but some of the clusters are small (N = 5, 7, 8, etc.)
If we had 70 MDs and only 200 patients, 69 dummy
variables would use up too many degrees of
freedom
If we had only 10 MDs, it would be a good choice
Can only answer question 1 (relationship of
patient age to satisfaction score)
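The degrees-of-freedom point can be made concrete with a little arithmetic. The sketch below uses hypothetical helper names (`dummy_code`, `residual_df`). It builds the "minus one" indicator columns and counts the residual degrees of freedom, assuming an intercept and patient age as the one other predictor.

```python
def dummy_code(md_ids):
    """Return (reference level, dummy column names, one row of 0/1 indicators per observation)."""
    levels = sorted(set(md_ids))
    reference, others = levels[0], levels[1:]
    rows = [[1 if md == level else 0 for level in others] for md in md_ids]
    return reference, others, rows

def residual_df(n_obs, n_clusters, n_other_predictors):
    """Observations minus the intercept, the cluster dummies, and the other predictors."""
    n_params = 1 + (n_clusters - 1) + n_other_predictors
    return n_obs - n_params

reference, columns, rows = dummy_code(["MD1", "MD2", "MD3", "MD2", "MD1"])
print(reference, columns, rows)

# The two scenarios from the slide, with patient age as the other predictor.
print(residual_df(1492, 70, 1))  # 1421
print(residual_df(200, 70, 1))   # 129
```

With 1492 patients the 69 dummies cost little. With 200 patients they take up more than a third of the available degrees of freedom.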

19
Method (3) Regression with Generalized
Estimating Equations (GEE)

Estimates regression coefficients and variance
separately to account for clustering
Gives population average effect of age on
satisfaction (marginal model)
Analyst indicates correlation structure within
the clusters
Answers questions 1 and 2 but not 3
Variation in patient satisfaction between MDs is
not modeled separately

20
Specifying Correlation within Clusters for GEE
Model

Most common assumption is one correlation
coefficient for all pairs of observations within
the clusters, called compound symmetry or
exchangeable correlation structure
Other assumptions about the correlation are
possible (eg, correlation weakens with
time/distance)
The GEE regression will give good estimates of the
predictor coefficients even if the correlation
specified is incorrect, provided you use the robust
standard errors
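The two kinds of working correlation mentioned above can be written out as small matrices. This sketch only builds the matrices and does not fit a GEE model. The function names are assumptions, and the first-order autoregressive form is used here as one common example of a correlation that weakens with time.

```python
def exchangeable(n, rho):
    """n x n working correlation with the same rho for every pair in a cluster."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]

def ar1(n, rho):
    """n x n working correlation that weakens with separation: rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]

# A cluster of 4 observations with an illustrative rho of 0.3.
for row in exchangeable(4, 0.3):
    print(row)
for row in ar1(4, 0.3):
    print(row)
```

Exchangeable suits patients within a physician, where no pair is "closer" than any other. A decaying structure is more natural for repeated measurements on one person.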

21
Method (4) Use Conditional Regression Model with
Fixed Effects

Looks within each MD to model the association
between patient age and the score
No coefficient for MD (conditioned out)
Good choice if the number of MDs is large relative to
the number of patients (70 MDs, 200 patients)
Matched pairs are analyzed with conditional
regression
Answers question 1, but not 2 and 3
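For a linear outcome, one way to "look within each MD" is to subtract each MD's own mean from age and from score before computing the slope, so the MD level drops out. A minimal sketch with hypothetical data follows. The function name `within_slope` is an assumption, and this illustrates the idea rather than the exact procedure used in the study.

```python
from statistics import mean

def within_slope(clusters):
    """Slope of y on x using only within-cluster deviations; clusters are lists of (x, y)."""
    sxy = sxx = 0.0
    for rows in clusters:
        x_bar = mean(x for x, _ in rows)
        y_bar = mean(y for _, y in rows)
        for x, y in rows:
            sxy += (x - x_bar) * (y - y_bar)
            sxx += (x - x_bar) ** 2
    return sxy / sxx

# Hypothetical (patient age, score) pairs for three physicians.
# Within each MD the score rises 0.5 per year of age; the MDs differ in level.
clusters = [
    [(40, 60.0), (50, 65.0), (60, 70.0)],
    [(45, 80.0), (55, 85.0)],
    [(50, 50.0), (60, 55.0), (70, 60.0)],
]
print(within_slope(clusters))  # 0.5
```

Because every MD-level quantity is constant within a cluster, it cancels in the subtraction. That is why this approach cannot estimate the effect of MD years in practice (question 2).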

22
Method (5) Use a Random Effects Regression Model

Predictor variables for both individual and
cluster level variables
Models variance associated with MD separately
from variance within the clusters in patient
satisfaction
Improves estimate of MD effect by treating MD
mean scores as random sample of scores
Only model that answers all 3 questions

23
Fixed versus Random Effects

Effects are random when the levels are a sample
of a larger population
have variation because they are sampled: another sample
would give different data
Effects are fixed if they represent all possible
levels/members of a population
eg, male/female, treatment groups, all the
regions of the U.S.

24
Fixed versus Random Effects

Effects can often be considered fixed or random
depending on the research question
If you want to generalize from the sample of
doctors to other doctors, you would consider the
doctors as a random effect
If the doctors in your sample are the only ones
you care about, you could consider doctors as a
fixed effect

25
Random Effects Illustrated from the PORT Diabetes
Study

In the MD satisfaction score example, begin by
ignoring predictors such as the patient's age and
the physician's number of years in practice
The overall mean patient satisfaction score for
all 1492 patients was 67.7 (SD = 23.5)
Separate means calculated for each physician's
patients ranged from 53.4 to 87.1

26
Random Effects: MD Score

Consider the satisfaction score as composed of
two parts: the overall mean ($\mu$) plus or minus the
difference from that overall mean of the mean
score for each physician ($\alpha_j$)
Each MD's difference, $\alpha_j$, is a random effect
because the 70 MDs represent a sample of
possible MDs.
If we sampled another 70, the $\alpha_j$s would be
different

27
A Simple Random Effects Model

If we add a term for error associated with each
individual patient, the model is
$y_{ij} = \mu + \alpha_j + e_{ij}$,
where $\mu$ = overall mean, $\alpha_j$ = difference for MD $j$,
and $e_{ij}$ = individual error
Model says there is random variation from the
mean score at the level of MDs (level 2) plus
variation at the level of patients (level 1)
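To see what this model implies, data can be simulated from it: one random draw per MD (level 2) plus one random draw per patient (level 1) around an overall mean. In the sketch below, the names `mu` and `alpha_j` stand for the overall mean and the MD difference, both random terms are assumed normal, and all parameter values are illustrative assumptions rather than estimates from the study.

```python
import random
from statistics import mean

def simulate(n_mds, patients_per_md, mu, sd_md, sd_patient, seed=0):
    """Draw scores y_ij = mu + alpha_j + e_ij with normal alpha_j and e_ij."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_mds):
        alpha_j = rng.gauss(0.0, sd_md)  # level 2: one draw per MD
        scores = [
            mu + alpha_j + rng.gauss(0.0, sd_patient)  # level 1: one draw per patient
            for _ in range(patients_per_md)
        ]
        data.append(scores)
    return data

# All parameter values here are illustrative assumptions.
data = simulate(n_mds=70, patients_per_md=20, mu=67.7, sd_md=5.0, sd_patient=23.0)
md_means = [mean(scores) for scores in data]
print(min(md_means), max(md_means))
```

Changing `seed` plays the role of "sampling another 70 MDs": the MD means come out different each time.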

28
What does the random effects model do?

Actual MD means vary from 53.4 to 87.1 and
patient N for each MD varies from 5 to 45. Thus,
actual MD means are not very stable.
Random effects model assumes MD mean scores are
from an underlying normal distribution
It uses the information from all the MDs and the
characteristics of a normal distribution to
estimate the true $\alpha_j$s

29
Estimating the Random Effects

In our example from the PORT study, raw means
range from 53.4 to 87.1
Ordinary least squares estimates range from 54.0 to
87.9 (term for each MD, ANCOVA)
The random effects estimates of the mean patient
scores by MD ranged from 60.4 to 78.6; their SD
was 4.94.
So the random effects estimates are closer to the overall mean
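The pull toward the overall mean can be illustrated with the standard shrinkage formula for a random-intercept model when the variance components are treated as known. Each raw MD mean is weighted by how reliably it is estimated, which depends on the number of patients. The variance values below are assumptions chosen for illustration, not estimates from the PORT study, and the function name is hypothetical.

```python
def shrunken_mean(raw_mean, n, overall_mean, var_between, var_within):
    """Random-intercept estimate of a cluster mean, treating the variances as known."""
    # Reliability of the raw mean: near 1 for large clusters, near 0 for small ones.
    weight = var_between / (var_between + var_within / n)
    return overall_mean + weight * (raw_mean - overall_mean)

# Assumed values for illustration only (not estimates from the PORT study).
overall_mean, var_between, var_within = 67.7, 25.0, 525.0

# The same raw mean of 87.1 for an MD with 5 patients and an MD with 45 patients.
small = shrunken_mean(87.1, 5, overall_mean, var_between, var_within)
large = shrunken_mean(87.1, 45, overall_mean, var_between, var_within)
print(round(small, 1), round(large, 1))
```

The MD with only 5 patients is pulled much further toward the overall mean than the MD with 45, which matches the narrower range of the random effects estimates reported above.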

30
Adding MD and Patient Predictors to the Simple
Model

We want to examine the effect of patient's age
(level 1 variable) and MD years in practice
(level 2) on the satisfaction score
Specify a regression model with 2 predictor
variables and a random effect for the MD
Score for each MD is modeled both by adjusting
for patients' age and MD years in practice and by
modeling the distribution of MD mean scores
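Written out, the model described above might look like the following. The notation is assumed here rather than taken from the slides: $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are the coefficients for patient age and MD years in practice, $\alpha_j$ is the random effect for MD $j$, $e_{ij}$ is the patient-level error, and both random terms are taken to be normal with the between-MD and within-MD variances $\sigma^2_B$ and $\sigma^2_W$.

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{age}_{ij} + \beta_2\,\mathrm{years}_j + \alpha_j + e_{ij},
\qquad \alpha_j \sim N(0, \sigma^2_B), \quad e_{ij} \sim N(0, \sigma^2_W)
```

The subscripts show the levels: age varies by patient within MD ($ij$), while years in practice varies only by MD ($j$).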

31
Final Random (or Mixed) Effects Regression Model

Positive association with patient age ($\beta$ = 0.15,
p = 0.003; satisfaction score goes up with age)
No association with MD years in practice (p = 0.69)
Significant variance (24.4) in satisfaction score
by MD (random effect)

32
Summary

Clustered data should not be analyzed with
standard statistical methods and tests
Reduction of outcome and predictors to one value
per cluster is an option but loses information
Choice of remaining methods (dummy variables,
conditional regression, GEE, or random effects)
depends on the research question and on the
number of observations at each level

33
Summary

Research questions affect choice of method
if you only care about predictors, GEE models are a
good alternative
if the question is about variation between clusters
(level 2 variance), a model that produces random
effects estimates is needed
Number of clusters has to be large enough to
estimate a random effect (N of about 30 or more)
Small number of clusters can be handled with
dummy variables

34
Data Set for Homework