Clustered or Multilevel Data - PowerPoint PPT Presentation

Provided by: dennis61
1
Clustered or Multilevel Data
  • What are clustered or multilevel data?
  • Why are multilevel data common in outcomes
    research?
  • What methods of analysis are available?
  • What are random versus fixed effects?
  • How does the N at each level affect model choice?
  • How does the study question affect model choice?

2
What are clustered data?
  • Gathering individual observations into larger
    groups does not create clustered data
  • Individual observations from a simple, random
    sample are never clustered
  • Clustering is a result of sampling/design
  • Usually from stages/levels in obtaining the
    individual units of observation

3
Examples of Clustered Data
  • Litters of puppies
  • Pieces of leaves (several per leaf)
  • Intervention on institutions (eg, schools)
  • TB cases and their contacts
  • Survey stratified by county and census tract
  • A sample of physicians and their patients
  • Repeated measurements on individuals

4
Clustered or Multilevel Data
[Diagram: Level 2 (cluster) units 1, 2, and 3, e.g., physicians, schools, census tracts, leaves. Beneath each Level 2 unit are its Level 1 (individual observation) units, e.g., patients, students, residents, leaf samples.]
5
Cluster analysis is a different topic: it finds
clusters in data
[Figure: scatter of points (x) illustrating groupings found by cluster analysis]
6
Repeated Measures are also a Type of Clustered or
Multilevel Data
[Diagram: Level 2 (cluster) is the individual subject (Person 1, Person 2, Person 3). Level 1 (individual observation) is the set of observations at different times (Time 1 onward) for each person.]
7
Multilevel Data Are Common in Outcomes Research
  • Secondary data sets are often multilevel
  • Patients clustered within physicians clustered
    within hospitals or clinics (hospital discharges)
  • National health surveys (NHIS, NHANES) are
    stratified probability surveys
  • Health interventions often randomize institutions
    or geographic areas
  • Health policy changes are applied at geographic
    or institutional level

8
Characteristics of Clustered Data
  • Measurements within clusters are correlated (eg,
    measures on same person are more alike than
    measurements across persons)
  • Variables can be measured at each level
  • The variance of the outcome can be attributed to
    each level
  • Standard statistical models and tests, which
    assume independent observations, are incorrect

9
Effects of Clustered Data
  • The assumptions of independence and equal
    variance of standard statistics do not hold
  • Standard errors for statistical testing will be
    incorrect
  • Regression models cannot be fit using methods
    that assume independence of observations
  • For example, ordinary least squares calculation
    of the regression line is incorrect

10
Example of Multilevel Data with a Linear Outcome
Variable
  • PORT study of type II diabetes patients'
    satisfaction with medical care
  • Outcome score from 14 questionnaire items
  • Sample of 70 physicians (level 2 sample)
  • Sample of 1492 patients (level 1 sample)
  • Mean 21.3 patients per physician
  • Range from 5 to 45 patients per physician
  • Two levels of covariates considered
  • Physician years in practice, specialty (level 2)
  • Patient age, gender (level 1)

11
Clustered/Multi-level Data Variance
Outcome: Patient Satisfaction Score
[Diagram: Level 2, physicians (N = 70), shown as MD1 (mean = 81), MD2 (mean = 58), and MD3 (mean = 74). Level 1, patients (N = 1492), shown as individual scores (79, 55, 61, 68, 74, 75, 85, 77, 81) beneath the physicians.]
Variance in the patient score divides into two
parts: (1) the variance between physicians,
$\sigma^2_B$, and (2) the variance within the physicians,
$\sigma^2_W$. So the total variance is $\sigma^2_B + \sigma^2_W$.
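The split into between-physician and within-physician variance can be checked numerically. The sketch below is only an illustration: the scores are hypothetical (chosen so the three MD means are 81, 58, and 74), and it uses a simple descriptive decomposition with size-weighted population variances rather than the model-based estimates a mixed model would report.

```python
from statistics import fmean, pvariance


def decompose_variance(scores_by_md):
    """Split the total variance of all scores into between- and within-cluster parts.

    Population variances weighted by cluster size are used, so that
    between + within equals the total variance exactly.
    """
    groups = list(scores_by_md.values())
    all_scores = [s for scores in groups for s in scores]
    n_total = len(all_scores)
    grand_mean = fmean(all_scores)
    between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups) / n_total
    within = sum(len(g) * pvariance(g) for g in groups) / n_total
    return between, within, pvariance(all_scores)


# Hypothetical patient scores for three physicians (not the PORT data).
scores_by_md = {"MD1": [85, 77, 81], "MD2": [55, 61, 58], "MD3": [68, 74, 80]}
between, within, total = decompose_variance(scores_by_md)
print(between, within, total)
```

In this toy data most of the variance lies between physicians; real data such as the PORT study typically show the opposite pattern.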
12
Intraclass Correlation Coefficient
The intraclass correlation coefficient (ICC) is a
measure of the correlation among the individual
observations within the clusters.
It is calculated as the ratio of the between-cluster
variance to the total variance:
$\mathrm{ICC} = \sigma^2_B / (\sigma^2_B + \sigma^2_W)$
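As a minimal sketch of the ratio just described, the function below takes a between-cluster and a within-cluster variance and returns the ICC. The function name and the input values are illustrative assumptions, not figures from the study.

```python
def icc(var_between, var_within):
    """Intraclass correlation: between-cluster share of the total variance."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_between / total


# Hypothetical variance components: a quarter of the variance is between clusters.
print(icc(25.0, 75.0))
```

When the within-cluster variance is zero the function returns 1, and when the between-cluster variance is zero it returns 0.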
13
Intraclass Correlation Coefficient (ICC)
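[Diagram: MD1 (mean = 81), MD2 (mean = 58), and MD3 (mean = 74), with every patient score equal to that MD's mean.]
Take the extreme case where each MD's patients
have the same score: no variance within the
physicians. So, $\mathrm{ICC} = \sigma^2_B / (\sigma^2_B + \sigma^2_W) = \sigma^2_B / (\sigma^2_B + 0) = 1$:
perfect correlation within the
clusters.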
14
Methods of Analyzing Multilevel Data
  • Use a single measure per cluster (e.g., mean
    satisfaction score) as the outcome variable
  • Fit a model with indicator variables for each
    cluster (minus one)
  • Fit a regression model with generalized
    estimating equations (GEE model)
  • Fit a fixed effects conditional regression model
  • Fit a random effects regression model

15
Choice of Analysis Model Two Main Considerations
  • What is the research question?
  • How many observations are there at level 2, and
    how many level 1 observations are there per level
    2 observation?

16
Choice of Analysis Model The Research Question
  • What is the relationship of patient age to the MD
    satisfaction score? (level 1 predictor)
  • What is the relationship between MD years in
    practice and the score? (level 2 predictor)
  • How much variation is there in the mean
    satisfaction score between MDs adjusted for level
    1 and level 2 predictors? (level 2 variance)

17
Method (1) Use mean satisfaction score for each
physician as outcome
  • Single measure for each cluster
  • simple, easy to understand
  • loses information, power (N = 70, not 1492)
  • ignores that the cluster means have different
    variances if clusters are different sizes
  • no individual level variables except as mean
    values (eg, mean patient age)
  • Only answers question 2 (MD years in practice)
    although can use mean patient age
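A rough sketch of what Method (1) does to the data: each physician's patients are collapsed to one mean, and the patient-level rows disappear. The record layout and values here are hypothetical. The cluster size is returned alongside each mean only to make visible that means based on few patients are less precise.

```python
from collections import defaultdict
from statistics import fmean


def cluster_means(records):
    """Collapse (cluster_id, value) records to one (mean, n) pair per cluster."""
    grouped = defaultdict(list)
    for cluster_id, value in records:
        grouped[cluster_id].append(value)
    return {cid: (fmean(vals), len(vals)) for cid, vals in grouped.items()}


# Hypothetical (MD, satisfaction score) records.
records = [("MD1", 85), ("MD1", 77), ("MD1", 81), ("MD2", 55), ("MD2", 61)]
means = cluster_means(records)
print(means)
```

Five patient records become two rows, which is the loss of information and power noted above.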

18
Method (2) Use dummy variable for each MD
  • Dummy variable represents each MD effect
  • treats each MD effect as equally well estimated,
    but some of the clusters are small (N = 5, 7, 8, etc.)
  • If we had 70 MDs and only 200 patients, 69 dummy
    variables would use up too many degrees of
    freedom
  • If we had only 10 MDs, it would be a good choice
  • Can only answer question 1 (relationship of
    patient age to satisfaction score)
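To make the "indicator variables for each cluster (minus one)" idea concrete, here is a small sketch that builds the 0/1 columns by hand. The function name is hypothetical, and the choice of the first cluster (in sorted order) as the reference category is an assumption for illustration; statistical packages let the analyst pick the reference.

```python
def cluster_dummies(cluster_ids):
    """Return (column_names, rows) of 0/1 indicators, omitting one reference cluster."""
    levels = sorted(set(cluster_ids))
    kept = levels[1:]  # the first level is the reference category
    rows = [[1 if cid == level else 0 for level in kept] for cid in cluster_ids]
    return kept, rows


# Hypothetical MD identifiers for four patients.
columns, rows = cluster_dummies(["MD1", "MD2", "MD3", "MD2"])
print(columns)
print(rows)

# With 70 distinct MDs the design gains 69 indicator columns.
many_columns, _ = cluster_dummies([f"MD{i:02d}" for i in range(70)])
print(len(many_columns))
```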

19
Method (3) Regression with Generalized
Estimating Equations (GEE)
  • Estimates regression coefficients and variance
    separately to account for clustering
  • Gives population average effect of age on
    satisfaction (marginal model)
  • Analyst indicates correlation structure within
    the clusters
  • Answers questions 1 and 2 but not 3
  • Variation in patient satisfaction between MDs is
    not modeled separately

20
Specifying Correlation within Clusters for GEE
Model
  • Most common assumption is one correlation
    coefficient for all pairs of observations within
    the clusters, called compound symmetry or
    exchangeable correlation structure
  • Other assumptions about the correlation are
    possible (eg, correlation weakens with
    time/distance)
  • The GEE regression will give good estimates of
    predictor coefficients even if the correlation
    specified is incorrect, provided you use the
    robust standard errors (SEs)
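The working correlation structures mentioned above can be written out as small matrices. This is an illustrative sketch only: the function names are hypothetical, and the first-order autoregressive (AR(1)) form is just one common way to let correlation weaken with time or distance, chosen here as an assumption.

```python
def exchangeable(n, rho):
    """n x n working correlation: 1 on the diagonal, the same rho for every pair."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def ar1(n, rho):
    """n x n working correlation that decays as rho ** |i - j| with separation."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


for row in exchangeable(3, 0.3):
    print(row)
for row in ar1(3, 0.5):
    print(row)
```

The exchangeable form suits patients within a physician, where no pair is "closer" than another; a decaying form is more natural for repeated measurements over time.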

21
Method (4) Use Conditional Regression Model with
Fixed Effects
  • Looks within each MD to model the association
    between patient age and the score
  • No coefficient for MD (conditioned out)
  • Good choice if number of MDs is large relative to
    number of patients (70 MDs, 200 patients)
  • Matched pairs are analyzed with conditional
    regression
  • Answers question 1, but not 2 and 3
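For a continuous outcome, "looking within each MD" amounts to subtracting each MD's own means from age and score before estimating the slope (often called the within estimator). The sketch below assumes that linear case and uses hypothetical data; for a binary outcome a conditional logistic model would be needed instead.

```python
from collections import defaultdict
from statistics import fmean


def within_slope(records):
    """Slope of y on x using only within-cluster variation.

    records: iterable of (cluster_id, x, y) tuples.
    """
    grouped = defaultdict(list)
    for cid, x, y in records:
        grouped[cid].append((x, y))
    sxy = 0.0
    sxx = 0.0
    for pairs in grouped.values():
        x_bar = fmean([x for x, _ in pairs])
        y_bar = fmean([y for _, y in pairs])
        for x, y in pairs:
            sxy += (x - x_bar) * (y - y_bar)
            sxx += (x - x_bar) ** 2
    if sxx == 0:
        raise ValueError("x does not vary within any cluster")
    return sxy / sxx


# Hypothetical (MD, patient age, score): the MDs differ in level, not in slope.
records = [("A", 40, 58), ("A", 60, 62), ("B", 30, 76), ("B", 50, 80)]
slope = within_slope(records)
print(slope)
```

Because each MD's level is subtracted out, no MD coefficient is estimated, and a predictor that is constant within an MD (such as years in practice) drops out entirely.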

22
Method (5) Use a Random Effects Regression Model
  • Predictor variables for both individual and
    cluster level variables
  • Models variance associated with MD separately
    from variance within the clusters in patient
    satisfaction
  • Improves estimate of MD effect by treating MD
    mean scores as random sample of scores
  • Only model that answers all 3 questions

23
Fixed versus Random Effects
  • Effects are random when the levels are a sample
    of a larger population
  • they have variation because they are sampled:
    another sample would give different data
  • Effects are fixed if they represent all possible
    levels/members of a population
  • e.g., male/female, treatment groups, all the
    regions of the U.S.

24
Fixed versus Random Effects
  • Effects can often be considered fixed or random
    depending on the research question
  • If you want to generalize from the sample of
    doctors to other doctors, you would consider the
    doctors as a random effect
  • If the doctors in your sample are the only ones
    you care about, you could consider doctors as a
    fixed effect

25
Random Effects Illustrated from the PORT Diabetes
Study
  • In the MD satisfaction score example, begin by
    ignoring predictors such as the patient's age and
    the physician's number of years in practice
  • The overall mean patient satisfaction score for
    all 1492 patients was 67.7 (SD = 23.5)
  • Separate means calculated for each physician's
    patients ranged from 53.4 to 87.1

26
Random Effects MD Score
  • Consider the satisfaction score as composed of
    two parts: the overall mean ($\mu$) plus or minus the
    difference from that overall mean of the mean
    score for each physician ($\alpha_j$)
  • Each MD's difference, $\alpha_j$, is a random effect
    because the 70 MDs represent a sample of
    possible MDs.
  • If we sampled another 70, the $\alpha_j$ values would be
    different

27
A Simple Random Effects Model
  • If we add a term for error associated with each
    individual patient, the model is
  • $y_{ij} = \mu + \alpha_j + e_{ij}$,
  • where $\mu$ = overall mean, $\alpha_j$ = difference for MD $j$,
    and $e_{ij}$ = individual error for patient $i$ of MD $j$
  • Model says there is random variation from the
    mean score at the level of MDs (level 2) plus
    variation at the level of patients (level 1)
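One way to see what this model says is to simulate from it: draw a difference for each MD, then add patient-level error around that MD's level. The sketch below is illustrative. The overall mean of 67.7 and the MD-level SD of 4.94 are taken from the PORT figures in this presentation, while the patient-level SD of 23.0, the equal cluster size of 21, the normal distributions, and all names (including `alpha_j` for the MD difference) are assumptions.

```python
import random


def simulate_scores(n_mds, patients_per_md, mu, sd_between, sd_within, seed=0):
    """Simulate (md_index, score) pairs from score = mu + alpha_j + e_ij."""
    rng = random.Random(seed)
    data = []
    for j in range(n_mds):
        alpha_j = rng.gauss(0.0, sd_between)  # level 2: this MD's difference
        for _ in range(patients_per_md):
            e_ij = rng.gauss(0.0, sd_within)  # level 1: individual patient error
            data.append((j, mu + alpha_j + e_ij))
    return data


# 70 MDs with 21 patients each; sd_within = 23.0 is an assumed value.
data = simulate_scores(70, 21, mu=67.7, sd_between=4.94, sd_within=23.0, seed=1)
print(len(data))
```

Changing the seed plays the role of "sampling another 70 MDs": the MD differences come out different each time.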

28
What does the random effects model do?
  • Actual MD means vary from 53.4 to 87.1 and
    patient N for each MD varies from 5 to 45. Thus,
    actual MD means are not very stable.
  • Random effects model assumes MD mean scores are
    from an underlying normal distribution
  • It uses the information from all the MDs and the
    characteristics of a normal distribution to
    estimate the true $\alpha_j$ values

29
Estimating the Random Effects
  • In our example from the PORT study, raw means
    range 53.4 to 87.1
  • Ordinary least squares estimates range from 54.0
    to 87.9 (term for each MD, ANCOVA)
  • The random effects estimates of the mean patient
    scores by MD ranged from 60.4 to 78.6; their SD
    was 4.94.
  • so the random effects estimates are closer to the
    overall mean
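The pull toward the overall mean can be sketched with the standard shrinkage weight for a random-intercept model when the variance components are treated as known. This is an illustration, not the study's actual computation: the overall mean (67.7), the highest raw mean (87.1), and the between-MD variance (24.4, reported for the final model) come from this presentation, while the within-MD variance of 528.0, the pairing of that raw mean with a particular cluster size, and the function name are assumptions.

```python
def shrunken_mean(raw_mean, n_j, overall_mean, var_between, var_within):
    """Pull a cluster's raw mean toward the overall mean.

    The weight on the raw mean grows with cluster size n_j, so small
    clusters are shrunk more than large ones.
    """
    weight = var_between / (var_between + var_within / n_j)
    return overall_mean + weight * (raw_mean - overall_mean)


# Same raw mean, two hypothetical cluster sizes (the smallest and largest in the study).
small = shrunken_mean(87.1, 5, overall_mean=67.7, var_between=24.4, var_within=528.0)
large = shrunken_mean(87.1, 45, overall_mean=67.7, var_between=24.4, var_within=528.0)
print(small, large)
```

An MD with only 5 patients is moved much closer to 67.7 than an MD with 45 patients and the same raw mean, which is why the range of the random effects estimates is narrower than the range of the raw means.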

30
Adding MD and Patient Predictors to the Simple
Model
  • We want to examine the effect of patient's age
    (level 1 variable) and MD years in practice
    (level 2) on the satisfaction score
  • Specify a regression model with 2 predictor
    variables and a random effect for the MD
  • Score for each MD is modeled both by adjusting
    for patient's age and MD years in practice and by
    modeling the distribution of MD mean scores

31
Final Random (or Mixed) Effects Regression Model
  • Positive association with patient age ($\beta = 0.15$,
    $p = 0.003$; satisfaction score goes up with age)
  • No association with MD years in practice ($p = 0.69$)
  • Significant variance (24.4) in satisfaction score
    by MD (random effect)

32
Summary
  • Clustered data should not be analyzed with
    standard statistical methods and tests
  • Reduction of outcome and predictors to one value
    per cluster is an option but loses information
  • Choice of remaining methods (dummy variables,
    conditional regression, GEE, or random effects)
    depends on the research question and on the
    number of observations at each level

33
Summary
  • Research questions affect choice of method
  • if you only care about predictors, GEE models are a
    good alternative
  • if question is about variation between clusters
    (level 2 variable), a model that produces random
    effects estimates is needed
  • Number of clusters has to be large enough to
    estimate a random effect (N of roughly 30)
  • Small number of clusters can be handled with
    dummy variables

34
Data Set for Homework
  • CA hospitals CABG registry
  • Patients (N = 28,555) clustered within hospitals
    (N = 80)
  • Binary outcome: alive/dead after 30 days
  • Patient level characteristics and hospital
    characteristics
  • Use Stata to answer questions; syntax for the
    models is supplied