Title: Assessing Intervention Fidelity in RCTs: Concepts and Methods
1. Assessing Intervention Fidelity in RCTs: Concepts and Methods
- Panelists
- David S. Cordray, PhD
- Chris Hulleman, PhD
- Joy Lesnick, PhD
- Vanderbilt University
- Presentation for the IES Research Conference
- Washington, DC
- June 12, 2008
2. Overview
- Session planned as an integrated set of presentations
- We'll begin with
  - Definitions and distinctions
  - A conceptual foundation for assessing fidelity in RCTs, a special case
- Two examples of assessing implementation fidelity
  - Chris Hulleman will illustrate an assessment for an intervention with a single core component
  - Joy Lesnick illustrates additional considerations when fidelity assessment is applied to intervention models with multiple program components
- Issues for the future
- Questions and discussion
3. Definitions and Distinctions
4. Dimensions of Intervention Fidelity
- There is little consensus on what is meant by the term "intervention fidelity."
- But Dane & Schneider (1998) identify five aspects:
  - Adherence/compliance: program components are delivered/used/received as prescribed
  - Exposure: amount of program content delivered to/received by participants
  - Quality of delivery: theory-based ideal in terms of processes and content
  - Participant responsiveness: engagement of the participants, and
  - Program differentiation: unique features of the intervention are distinguishable from other programs (including the counterfactual)
5. Distinguishing Implementation Assessment from Implementation Fidelity Assessment
- Two models of intervention implementation:
  - A purely descriptive model
    - Answers the question: what transpired as the intervention was put in place (implemented)?
  - An a priori intervention model, with explicit expectations about implementation of core program components
    - Fidelity is the extent to which the realized intervention (tTx) is faithful to the pre-stated intervention model (TTx)
    - Fidelity = TTx - tTx
- We emphasize this second model
6. What to Measure?
- Adherence to the intervention model:
  1. Essential or core components (activities, processes)
  2. Necessary, but not unique to the theory/model: activities, processes, and structures supporting the essential components of T
  3. Ordinary features of the setting, shared with the counterfactual group (C)
- Essential/core and necessary components are the priority parts of fidelity assessment.
7. An Example of Core Components: Bransford's HPL Model of Learning and Instruction
- John Bransford et al. (1999) postulate that a strong learning environment entails a combination of:
  - Knowledge-centered
  - Learner-centered
  - Assessment-centered, and
  - Community-centered components.
- Alene Harris developed an observation system (the VOS) that registered novel (the components above) and traditional pedagogy in classes.
- The next slide focuses on the prevalence of Bransford's recommended pedagogy.
8. Challenge-based Instruction in Treatment and Control Courses: The VaNTH Observation System (VOS)
[Figure: percentage of course time using challenge-based instructional strategies. Adapted from Cox & Cordray, in press.]
9. Implications
- Fidelity can be assessed even when there is no known benchmark (e.g., the 10 Commandments)
- In practice, interventions can be a mixture of components with strong, weak, or no benchmarks
- Control conditions can include core intervention components due to:
  - Contamination
  - Business as usual (BAU) containing shared components at different levels
  - Similar theories or models of action
- But to index fidelity, we need to measure components within the control condition
10. Linking Intervention Fidelity Assessment to Contemporary Models of Causality
- Rubin's Causal Model
  - The true causal effect of X is (YiTx - YiC)
  - RCT methodology is the best approximation to the true effect
- Fidelity assessment within RCT-based causal analysis entails examining the difference between causal components in the intervention and counterfactual conditions.
- Differencing causal conditions can be characterized as the achieved relative strength of the contrast:
  - Achieved Relative Strength (ARS) = tTx - tC
- ARS is a default index of fidelity
11. [Figure: Achieved Relative Strength = .15 vs. Expected Relative Strength = .25]
12. In Practice
- Identify core components in both groups
  - e.g., via a Model of Change
- Establish benchmarks for TTx and TC
- Measure core components to derive tTx and tC
  - e.g., via a Logic Model based on the Model of Change
- With multiple components and multiple methods of assessment, achieved relative strength needs to be:
  - Standardized, and
  - Combined across:
    - Multiple indicators
    - Multiple components
    - Multiple levels (HLM-wise)
- We turn to our examples.
13. Assessing Implementation Fidelity in the Lab and in Classrooms: The Case of a Motivation Intervention
- Chris S. Hulleman
- Vanderbilt University
14. The Theory of Change
[Figure: path model linking Manipulated Relevance, Perceived Utility Value, Interest, and Performance.]
Adapted from Hulleman (2008); Hulleman, Godes, Hendricks, & Harackiewicz (2008); Hulleman & Harackiewicz (2008); Hulleman, Hendricks, & Harackiewicz (2007); Eccles et al. (1983); Wigfield & Eccles (2002)
15. Methods

| | Laboratory | Classroom |
|---|---|---|
| Sample | N = 107 undergraduates | N = 182 ninth-graders; 13 classes, 8 teachers, 3 high schools |
| Task | Mental multiplication technique | Biology, Physical Science, Physics |
| Treatment manipulation | Write about how the mental math technique is relevant to your life. | Pick a topic from science class and write about how it relates to your life. |
| Control manipulation | Write a description of a picture from the learning notebook. | Pick a topic from science class and write a summary of what you have learned. |
| Number of manipulations | 1 | 2-8 |
| Length of study | 1 hour | 1 semester |
| Dependent variable | Perceived utility value | Perceived utility value |
16. Motivational Outcome
[Figure: treatment vs. control comparison on the motivational outcome; g = 0.05 (p = .67)]
17. Fidelity Measurement and Achieved Relative Strength
- Simple intervention: one core component
- Intervention fidelity:
  - Defined as quality of participant responsiveness
  - Rated on a scale from 0 (none) to 3 (high)
  - 2 independent raters, 88% agreement
18. Quality of Responsiveness

| Quality of Responsiveness | Lab C: N | % | Lab Tx: N | % | Class C: N | % | Class Tx: N | % |
|---|---|---|---|---|---|---|---|---|
| 0 | 47 | 100 | 7 | 11 | 86 | 96 | 38 | 41 |
| 1 | 0 | 0 | 15 | 24 | 4 | 4 | 40 | 43 |
| 2 | 0 | 0 | 29 | 46 | 0 | 0 | 14 | 15 |
| 3 | 0 | 0 | 12 | 19 | 0 | 0 | 0 | 0 |
| Total | 47 | 100 | 63 | 100 | 90 | 100 | 92 | 100 |
| Mean | 0.00 | | 1.73 | | 0.04 | | 0.74 | |
| SD | 0.00 | | 0.90 | | 0.21 | | 0.71 | |
19. Indexing Fidelity
- Absolute
  - Compare observed fidelity (tTx) to the absolute or maximum level of fidelity (TTx)
- Average
  - Mean level of observed fidelity (tTx)
- Binary
  - Yes/no treatment receipt based on fidelity scores
  - Requires selection of a cut-off value
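As a sketch, all three indices can be computed directly from per-participant ratings. The counts below reproduce the classroom data from the Quality of Responsiveness table; the cut-off of 2 for the binary index is our assumption (it is consistent with the binary values reported on the later ARS slide).

```python
# Illustrative sketch: absolute, average, and binary fidelity indices from
# quality-of-responsiveness ratings (0 = none .. 3 = high). The counts below
# reproduce the classroom data shown earlier; the binary cut-off of 2 is an
# assumption on our part.

def fidelity_indices(ratings, max_score=3, cutoff=2):
    n = len(ratings)
    average = sum(ratings) / n                      # mean observed fidelity
    absolute = average / max_score                  # observed vs. maximum possible
    binary = sum(r >= cutoff for r in ratings) / n  # proportion of "compliers"
    return average, absolute, binary

tx = [0] * 38 + [1] * 40 + [2] * 14   # classroom treatment ratings
c = [0] * 86 + [1] * 4                # classroom control ratings

print([round(x, 2) for x in fidelity_indices(tx)])  # [0.74, 0.25, 0.15]
print([round(x, 2) for x in fidelity_indices(c)])   # [0.04, 0.01, 0.0]
```

The treatment mean of 0.74 and control mean of 0.04 match the table above; the absolute and binary values match the indices reported later for the classroom sample.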
20. Fidelity Indices

| Conceptual | | Laboratory | Classroom |
|---|---|---|---|
| Absolute | Tx | | |
| | C | | |
| Average | Tx | 1.73 | 0.74 |
| | C | 0.00 | 0.04 |
| Binary | Tx | | |
| | C | | |
21. Indexing Fidelity as Achieved Relative Strength
- Intervention strength = Treatment - Control
- Achieved Relative Strength (ARS) Index:
  - Standardized difference in the fidelity index across Tx and C
  - Based on Hedges' g (Hedges, 2007)
  - Corrected for clustering in the classroom (ICCs from .01 to .08)
22. Average ARS Index

ARS = [(ȳTx - ȳC) / ST] × [1 - 3/(4N - 9)] × √(1 - 2(n - 1)ρ/(N - 2))

(group difference) × (sample size adjustment) × (clustering adjustment)

- Where:
  - ȳTx = mean for group 1 (tTx)
  - ȳC = mean for group 2 (tC)
  - ST = pooled within-groups standard deviation
  - nTx = treatment sample size
  - nC = control sample size
  - n = average cluster size
  - ρ = intra-class correlation (ICC)
  - N = total sample size
23. Absolute and Binary ARS Indices

ARS = [(pTx - pC) / ST] × [1 - 3/(4N - 9)] × √(1 - 2(n - 1)ρ/(N - 2))

(group difference) × (sample size adjustment) × (clustering adjustment)

- Where:
  - pTx = proportion for the treatment group (tTx)
  - pC = proportion for the control group (tC)
  - nTx = treatment sample size
  - nC = control sample size
  - n = average cluster size
  - ρ = intra-class correlation (ICC)
  - N = total sample size
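A sketch of how the Average ARS index can be computed, following the three labeled pieces on these slides (group difference, sample size adjustment, clustering adjustment). The exact algebra is our reconstruction of Hedges (2007), and the ICC in the example is assumed from the reported .01-.08 range.

```python
import math

def achieved_relative_strength(mean_tx, mean_c, sd_pooled,
                               n_tx, n_c, cluster_size, icc):
    """Cluster-corrected standardized mean difference (after Hedges, 2007)."""
    N = n_tx + n_c
    group_difference = (mean_tx - mean_c) / sd_pooled
    sample_size_adj = 1 - 3 / (4 * N - 9)  # Hedges' small-sample correction
    clustering_adj = math.sqrt(1 - 2 * (cluster_size - 1) * icc / (N - 2))
    return group_difference * sample_size_adj * clustering_adj

# Classroom Average index: means and SDs from the earlier table;
# ICC assumed at .04 (the reported range was .01-.08).
n_tx, n_c = 92, 90
sd_pooled = math.sqrt(((n_tx - 1) * 0.71**2 + (n_c - 1) * 0.21**2)
                      / (n_tx + n_c - 2))
ars = achieved_relative_strength(0.74, 0.04, sd_pooled,
                                 n_tx, n_c, cluster_size=14, icc=0.04)
print(round(ars, 2))  # ~1.32, in line with the classroom Average ARS reported later
```

With these inputs the result is approximately 1.32, consistent with the classroom Average ARS in the deck's later table.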
24. Average ARS Index
[Figure: treatment-strength scale (ratings 0-3; 0-100%) showing the benchmark TTx above the observed tTx, and the observed tC above the benchmark TC, with the gaps labeled "infidelity." Classroom average difference: (0.74) - (0.04) = 0.70.]
25. Achieved Relative Strength Indices

| Index | Group | Observed Fidelity: Lab | Observed Fidelity: Class | Lab vs. Class Contrast (Lab - Class) |
|---|---|---|---|---|
| Absolute | Tx | 0.58 | 0.25 | |
| | C | 0.00 | 0.01 | |
| | g | 1.72 | 0.80 | 0.92 |
| Average | Tx | 1.73 | 0.74 | |
| | C | 0.00 | 0.04 | |
| | g | 2.52 | 1.32 | 1.20 |
| Binary | Tx | 0.65 | 0.15 | |
| | C | 0.00 | 0.00 | |
| | g | 1.88 | 0.80 | 1.08 |
26. Linking Achieved Relative Strength to Outcomes
27. Sources of Infidelity in the Classroom
- Student behaviors were nested within teacher behaviors:
  - Teacher dosage
  - Frequency of responsiveness
- Student and teacher behaviors were used to predict treatment fidelity (i.e., quality of responsiveness).
28. Sources of Infidelity: Multi-level Analyses
- Part I: Baseline Analyses
  - Identified the amount of residual variability in fidelity due to students and teachers.
  - Due to missing data, we estimated a 2-level model (153 students, 6 teachers)
  - Student: Yij = β0j + β1j(TREATMENT)ij + rij
  - Teacher: β0j = γ00 + u0j
  - Teacher: β1j = γ10 + u1j
29. Sources of Infidelity: Multi-level Analyses
- Part II: Explanatory Analyses
  - Predicted residual variability in fidelity (quality of responsiveness) with frequency of responsiveness and teacher dosage
  - Student: Yij = β0j + β1j(TREATMENT)ij + β2j(RESPONSE FREQUENCY)ij + rij
  - Teacher: β0j = γ00 + u0j
  - Teacher: β1j = γ10 + γ11(TEACHER DOSAGE)j + u1j
  - Teacher: β2j = γ20 + γ21(TEACHER DOSAGE)j + u2j
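A model of this general shape can be fit, for example, with statsmodels' MixedLM. This is an illustrative sketch only: the data below are simulated, all variable names are our own, and the original analysis used HLM, not this code.

```python
# Illustrative sketch: a two-level model of the same general shape as above,
# fit with statsmodels on simulated data (not the study's data or software).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_teachers, per = 6, 25
teacher = np.repeat(np.arange(n_teachers), per)
dosage = np.repeat(rng.uniform(0, 1, n_teachers), per)   # teacher-level predictor
treatment = rng.integers(0, 2, teacher.size)             # student-level indicator
resp_freq = rng.poisson(2, teacher.size)                 # student-level predictor
u0 = np.repeat(rng.normal(0, 0.3, n_teachers), per)      # teacher random intercepts
quality = (0.2 + u0 + treatment * (0.4 + 0.5 * dosage)
           + 0.05 * resp_freq + rng.normal(0, 0.4, teacher.size))

df = pd.DataFrame(dict(teacher=teacher, dosage=dosage, treatment=treatment,
                       resp_freq=resp_freq, quality=quality))

# Treatment and response-frequency slopes moderated by teacher dosage,
# with a random intercept per teacher (random slopes omitted for stability).
model = smf.mixedlm("quality ~ treatment + resp_freq"
                    " + treatment:dosage + resp_freq:dosage",
                    data=df, groups="teacher")
result = model.fit()
print(result.summary())
```

The cross-level interactions (`treatment:dosage`, `resp_freq:dosage`) play the role of γ11 and γ21 in the slide's equations.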
30. Sources of Infidelity: Multi-level Analyses

| Variance Component | Baseline: Residual Variance | Baseline: % of Total | Explanatory: Residual Variance | Explanatory: % Reduction |
|---|---|---|---|---|
| Level 1 (Student) | 0.15437 | 52% | 0.15346 | < 1% |
| Level 2 (Teacher) | 0.13971 | 48% | 0.04924 | 65% |
| Total | 0.29408 | | 0.20270 | |

p < .001.
31. Case Summary
- The motivational intervention was more effective in the lab (g = 0.45) than in the field (g = 0.05).
- Using 3 indices of fidelity and, in turn, achieved relative treatment strength revealed that:
  - Classroom fidelity < lab fidelity
  - Achieved relative strength was about 1 SD less in the classroom than in the laboratory
  - Differences in achieved relative strength paralleled differences in the motivational outcome, especially in the lab
  - Sources of infidelity were teacher (not student) factors
32. Assessing Fidelity of Interventions with Multiple Components: A Case of Assessing Preschool Interventions
33. What Do We Mean by Multiple Components in Preschool Literacy Programs?
- How do you define preschool instruction?
  - Academic content, materials, student-teacher interactions, student-student interactions, physical development, schedules and routines, assessment, family involvement, etc.
- How would you measure implementation?
- Preschool interventions:
  - Are made up of components (e.g., sets of activities and processes) that can be thought of as constructs
  - These constructs vary in meaning across actors (e.g., developers, implementers, researchers)
  - They are of varying levels of importance within the intervention, and
  - These constructs are made up of smaller parts that need to be assessed.
- Multiple components make assessing fidelity more challenging
34. Overview
- Four areas of consideration when assessing fidelity of programs with multiple components:
  1. Specifying multiple components
  2. Major variations in program components
  3. The ABCs of item and scale construction
  4. Aggregating indices
- One caveat: very unusual circumstances
- Goal of this work:
  - To build on the extensive evaluation work that had already been completed, and to use the case study to provide a framework for future efforts to measure fidelity of implementation.
35. 1. Specifying Multiple Components
- Our process:
  - Extensive review of program materials
  - Potentially hundreds of components
- How many indicators do we need to assess fidelity?
36. 1. Specifying Multiple Components
Constructs → Sub-Constructs → Facets → Elements → Indicators
37. Grain Size Is Important
- Conceptual differences between programs may occur at micro levels
- Empirical differences between program implementations may occur at more macro levels
- Theoretically expected differences vs. empirically observed differences
- Conceptual differences between programs must be identified at the smallest grain size at the outset, although empirical differences may only be detectable at higher, more macro levels once implemented
38. 2. Major Variations in Program Components
- One program often has some combination of these different types of components:
  - Scripted (highly structured) activities
  - Unscripted (unstructured) activities
  - Nesting of activities:
    - Micro-level (discrete) activities
    - Macro-level (extended) activities
- What you're trying to measure will influence how to measure it -- and how often it needs to be measured.
39. 2. Major Variations in Program Components

| Type of Program Component | Example from the Case Study | Implications | Abs | Avg | Bin | ARS |
|---|---|---|---|---|---|---|
| Scripted (highly structured) activities | In the first treatment condition, four scripted literacy circles are required. | There are known criteria for assessing fidelity: fidelity is the difference between the expected and observed values (TTx - tTx). | Yes | Yes | ? | Yes |
| Unscripted (unstructured) activities | In the second treatment condition, literacy circles are required, but the specific content of those group meetings is not specified. | There are no known criteria for assessing fidelity; we can only record what was done (tTx), or compare it to the control. | No? | Yes? | ? | Yes |

- Abs = Absolute fidelity index: what happened compared to what should have happened (the highest standard).
- Avg = Magnitude or exposure level: indicates what happened, but it's not very meaningful -- how do we know if a level is good or bad?
- Bin = Binary complier: can we set a benchmark to determine whether or not a program component was successfully implemented? >30%, for example? Is that realistic? Meaningful?
- ARS = Difference in magnitude between Tx and C (relative strength): is there enough difference to warrant a treatment effect?
40. [Image: dots under a microscope -- what is it?]
41. [Image: Starry Night, Vincent van Gogh, 1889]
42. We Must Measure the Trees and Also the Forest
- Micro-level (discrete) activities:
  - Depending on the condition, daily activities (i.e., whole group time, small group time, center activities) may be scripted or unscripted, and take place within the larger structure of the theme under study.
- Macro-level (extended) activities:
  - The month-long thematic unit (structured in the treatment condition, unstructured in the control) is the underlying extended structure within which scripted or unscripted micro activities take place.
- In multi-component programs, many activities are nested within larger activity structures. This nesting has implications for fidelity analysis: what to measure and how to measure it.
43. 3. The ABCs of Item and Scale Construction
- Aim for one-to-one correspondence of indicators to the component of interest
- Balance items across components
- Coverage and quality are more important than the quantity of items
44. Aim for One-to-One Correspondence

[Figure: difference between T and C (Oral Language): T = 1.80 (0.32) vs. C = 1.36 (0.32), ARS ES = 1.38; T = 3.45 (0.87) vs. C = 2.26 (0.57), ARS ES = 1.62]

- Example of more than one component being assessed in one item:
  - Does the teacher talk with children throughout the day, modeling correct grammar, teaching new vocabulary, and asking questions to encourage children to express their ideas in words? (Yes/No)
- Example of one component being measured in each item:
  - Teacher provides an environment wherein students can talk about what they are doing.
  - Teacher listens attentively to students' discussions and responses.
  - Teacher models and/or encourages students to ask questions during class discussions.

Data for the case study come from an evaluation conducted by Dale Farran, Mark Lipsey, Carol Bilbrey, et al.
45. Balance Items Across Components

| Literacy Content | Items | α |
|---|---|---|
| Oral language | 20 | 0.95 |
| Language, comprehension, and response to text | 7 | 0.70 |
| Book and print awareness | 2 | 0.80 |
| Phonemic awareness | 3 | 0.68 |
| Letter and word recognition | 7 | 0.76 |
| Writing | 6 | 0.67 |
| Literacy Processes | | |
| Thematic studies | 4 | 0.62 |
| Structured literacy circles | 2 | 0.62 |

- How many items are needed for each scale?
- Oral language is over-represented
- Scales with α < 0.80 are not reliable
46. Coverage and Quality More Important Than Quantity

| Literacy Content | Items | α |
|---|---|---|
| Oral language | 20 | 0.95 |
| Language, comprehension, and response to text | 7 | 0.70 |
| Book and print awareness | 2 | 0.80 |
| Phonemic awareness | 3 | 0.68 |
| Letter and word recognition | 7 | 0.76 |
| Writing | 6 | 0.67 |
| Literacy Processes | | |
| Thematic studies | 4 | 0.62 |
| Structured literacy circles | 2 | 0.62 |

- Two scales each have 2 items, but very different levels of reliability
- How many items are needed for each scale?
  - Oral language: 20 items. We randomly selected items and recalculated alpha:
    - 10 items: α = 0.92
    - 8 items: α = 0.90
    - 6 items: α = 0.88
    - 5 items: α = 0.82
    - 4 items: α = 0.73
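Cronbach's alpha, used in the tables above, can be computed directly from an item-response matrix. The simulated data below (not the study's) illustrates the slide's point: dropping items from a coherent scale lowers alpha.

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = observations, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Simulated 20-item scale with one common underlying factor (illustrative only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
items = latent + rng.normal(size=(500, 20))

print(round(cronbach_alpha(items), 2))         # full 20-item scale: high alpha
print(round(cronbach_alpha(items[:, :5]), 2))  # a 5-item subset: lower alpha
```

With these simulation settings the full scale comes out around .95 and the 5-item subset noticeably lower, mirroring the pattern in the slide.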
47. 4. Aggregating Indices
- To weight or not to weight? How do we decide?
- Possibilities:
  - Theory
  - Consensus
  - Time spent
- Case study example: 2 levels of aggregation, within and between:
  - Unit-weight within facet: Instruction → Content → Literacy
  - Hypothetical weight across sub-construct: Instruction → Content
48. [Diagram: aggregation roadmap ("You are here") -- unit weight within facets, theory weight across sub-constructs; how to weight?]
49. Aggregating Indices
- Unit-weight within facet: Instruction → Content → Literacy

| Literacy Content | Average Fidelity Index: Tx | Average Fidelity Index: C | Absolute Fidelity Index: Tx (%) | Absolute Fidelity Index: C (%) | ARS Fidelity Index (Average) | ARS Fidelity Index (Absolute) |
|---|---|---|---|---|---|---|
| Oral language | 1.82 | 1.40 | 91 | 70 | 1.36 | 0.53 |
| Language, comprehension, and response to text | 1.74 | 1.37 | 87 | 69 | 1.45 | 0.44 |
| Book and print awareness | 1.91 | 1.39 | 96 | 70 | 1.38 | 0.73 |
| Phonemic awareness | 1.73 | 1.48 | 87 | 74 | 0.74 | 0.32 |
| Letter and word recognition | 1.75 | 1.36 | 88 | 68 | 1.91 | 0.50 |
| Writing | 1.68 | 1.37 | 84 | 69 | 1.22 | 0.34 |
| Average (unit weighting) | 1.77 | 1.38 | 89 | 75 | 1.34 | 0.48 |

Note: clustering is ignored.
50. Aggregating Indices
- Theory-weight across sub-construct (hypothetical): Instruction → Content

| Content | Treatment | Control | Hypothetical Weight |
|---|---|---|---|
| Literacy | 1.77 | 1.38 | 40% |
| Math | 1.51 | 1.80 | 5% |
| Social and Personal Development | 1.79 | 1.58 | 35% |
| Scientific Thinking | 1.57 | 1.71 | 5% |
| Social Studies | 1.84 | 1.41 | 5% |
| Creative Arts | 1.66 | 1.32 | 5% |
| Physical Development | 1.45 | 1.50 | 3% |
| Technology | 1.45 | 1.57 | 2% |
| Total | | | 100% |
| Unweighted average | 1.63 | 1.53 | |
| Weighted average | 1.74 | 1.49 | |
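The unweighted and theory-weighted averages in the table can be reproduced with a small sketch; the weights are the slide's hypothetical percentages.

```python
# Sketch: unit- vs. theory-weighted aggregation across sub-constructs,
# using the hypothetical weights from the table above.
content = {
    # sub-construct: (treatment mean, control mean, weight in %)
    "Literacy":                        (1.77, 1.38, 40),
    "Math":                            (1.51, 1.80, 5),
    "Social and Personal Development": (1.79, 1.58, 35),
    "Scientific Thinking":             (1.57, 1.71, 5),
    "Social Studies":                  (1.84, 1.41, 5),
    "Creative Arts":                   (1.66, 1.32, 5),
    "Physical Development":            (1.45, 1.50, 3),
    "Technology":                      (1.45, 1.57, 2),
}

def unweighted(values):
    return sum(values) / len(values)

def weighted(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

tx = [t for t, _, _ in content.values()]
c = [cv for _, cv, _ in content.values()]
w = [wt for _, _, wt in content.values()]

print(round(unweighted(tx), 2), round(unweighted(c), 2))    # 1.63 1.53
print(round(weighted(tx, w), 2), round(weighted(c, w), 2))  # 1.74 1.49
```

The results match the table: weighting Literacy and Social/Personal Development heavily pulls the treatment mean up and the control mean down.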
51. [Diagram: aggregation roadmap revisited ("You are here") -- unit weight, theory weight; how to weight?]
52. Key Points and Future Issues
- Identification and measurement, at a minimum, should include model-based core and necessary components
- Collaboration among researchers, developers, and implementers is essential for specifying:
  - Intervention models
  - Core and essential components
  - Benchmarks for TTx (e.g., an educationally meaningful dose: what level of X is needed to instigate change?), and
  - Tolerable adaptation
53. Points and Issues
- Fidelity assessment serves two roles:
  - Establishing the average causal difference between conditions, and
  - Using fidelity measures to assess the effects of variation in implementation on outcomes.
- We should minimize infidelity and weak ARS:
  - Pre-experimental assessment of TTx in the counterfactual condition: is TTx > TC?
  - Build operational models with positive implementation drivers
  - Post-experimental (re)specification of the intervention, for example:
    - MAP: ARS = .3 (planned professional development) + .6 (planned use of data for differentiated instruction)
54. Points and Issues
- What does an ARS of 1.20 mean?
- We need experience and a normative framework
  - Cohen defined a small effect on outcomes as 0.20, medium as 0.50, and large as 0.80
  - Over time, similar norms may emerge for ARS