Title: Comparing Results from RCTs and Quasi-Experiments That Share the Same Intervention Group
1. Comparing Results from RCTs and Quasi-Experiments That Share the Same Intervention Group
- Thomas D. Cook, Northwestern University
2. Why RCTs are to be preferred
- Statistical theory regarding expectations
- Relative advantage over other bias-free methods, e.g., regression discontinuity (RDD) and instrumental variables (IV)
- Ad hoc theory and research on implementation
- Privileged credibility in science and policy
- Claim that non-experimental alternatives routinely fail to produce similar causal estimates
3. Dissimilar Estimates
- Come from empirical studies comparing experimental and non-experimental results on the same topic
- The strongest are within-study comparisons
- These take an experiment, throw out the control group, and substitute a non-equivalent comparison group
- Given that the intervention group is a constant, this is a test of the different control groups
4. Within-Study Comparison Literature
- 20 studies, mostly in job training. Of the 14 in job training, reviews contend:
- (1) No study produces a clearly similar causal estimate, including Dehejia and Wahba
- (2) Some design and analysis features are associated with less bias, but bias remains
- (3) The average of the experiments is not different from the average of the non-experiments, but be careful here and note that the variance of the effect sizes differs by design type
5. Brief History of the Literature on Within-Study Comparisons
- LaLonde; Fraker and Maynard
- 12 subsequent studies in job training
- Extension to examples in education in the USA and social welfare in Mexico, never yet reviewed
6. Policy Consequences
- Department of Labor, as early as 1985
- Health and Human Services, job training and beyond
- National Academy of Sciences
- Institute of Education Sciences
- Do within-study comparisons deserve all this?
7. We Will
- Deconstruct the non-experiment and compare experimental estimates to:
- 1. Regression-discontinuity estimates
- 2. Estimates from difference-in-differences (fixed effects) designs
- Ask: Is the general conclusion about the inadequacy of non-experiments true across at least these different kinds of non-experiment?
8. Criteria of a Good Within-Study Comparison Design
- 1. Variation in mode of assignment, random or not
- 2. No third variables correlated with both assignment and outcome, e.g., measurement
- 3. Randomized experiment properly executed
- 4. Quasi-experiment a good instance of its type
- 5. Both design types estimate the same causal entity, e.g., LATE in regression discontinuity
- 6. Acceptable criteria of correspondence between design types: effect sizes seem similar and do not formally differ, patterns of statistical significance do not differ, etc.
9. Experiments vs. Regression-Discontinuity Design Studies
10. Three Known Within-Study Comparisons of Experiments and R-D
- Aiken, West et al. (1998): R-D study, experiment, LATE analysis, results
- Buddelmeyer and Skoufias (2003): R-D study, experiment, LATE analysis, results
- Black, Galdo, and Smith (2005): R-D study, experiment, LATE analysis, results
(A minimal sketch of the R-D estimator follows below.)
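To make the R-D comparisons above concrete, here is a minimal sketch of a local linear R-D estimate of the LATE at the cutoff. It is illustrative only: the data, cutoff, and bandwidth are hypothetical, not taken from any of the three studies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
# Hypothetical assignment variable (e.g., a need score), cutoff at 0;
# units scoring below the cutoff receive the treatment.
score = rng.normal(0, 1, n)
t = (score < 0).astype(int)
y = 0.5 * t + 0.8 * score + rng.normal(0, 1, n)
df = pd.DataFrame({"y": y, "score": score, "t": t})

# Local linear regression within a bandwidth around the cutoff, with
# separate slopes on each side; the coefficient on t estimates the
# local average treatment effect (LATE) at the cutoff.
bw = 0.5
local = df[df["score"].abs() < bw]
fit = smf.ols("y ~ t + score + t:score", data=local).fit()
print(fit.params["t"])
```

Because the estimate is local to the cutoff, correspondence with an experiment requires re-expressing the experimental estimate as a LATE for the same subpopulation, which is the "LATE analysis" step in the three studies above.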
11. Comments on R-D vs. Experiments
- Cumulative correspondence demonstrated over three cases
- Is this theoretically trivial, though?
- Is it pragmatically significant, given variation in implementation in both the experiment and the R-D?
- As an existence proof, it belies the over-generalized argument that non-experiments don't work
- As a practical issue, does it mean we should support RDD when treatments are assigned by need or merit?
- It emboldens us to deconstruct the non-experiment further
12. Experiments vs. Difference-in-Differences
- The most frequent non-experimental design by far, across many fields of study
- Also modal in within-study comparisons in job training, so it provides the major basis for the past opinion that non-experiments are routinely biased
- We review 3 studies with comparable estimates
- 14 job training studies with dissimilar estimates
- 2 education examples with dissimilar estimates
13. Bloom et al.
- Bloom et al. (2002, 2005); job training is the topic
- Experiment: 11 sites; 8 pre-intervention earnings waves, 20 post
- Non-experiment: 5 within-state comparisons, 4 of them within-city; all comparison subjects enrolled in welfare
- We present only the control/comparison contrast because the treatment time series is a constant
14. The Issue Is
- Is there an overall difference between control groups randomly or non-randomly formed?
- If yes, can statistical controls (OLS, IV including Heckman models, propensity scores, random growth models) eliminate this difference?
- Tested 10 modes, but only one longitudinal
- Why we treat this as difference-in-differences (d-in-d) rather than interrupted time series (ITS); a minimal d-in-d sketch follows below
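As a reference point for the analyses discussed here, a minimal difference-in-differences sketch. The data and variable names are hypothetical; this is the generic estimator, not Bloom et al.'s actual specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
# Hypothetical person-period data: comp = 1 for the non-randomly
# formed comparison group, post = 1 for post-intervention waves.
comp = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
earnings = 100 + 10 * comp + 5 * post - 3 * comp * post + rng.normal(0, 5, n)
df = pd.DataFrame({"earnings": earnings, "comp": comp, "post": post})

# The coefficient on comp:post is the difference-in-differences
# estimate: it nets out the fixed group difference and the common
# time trend, leaving the differential change across groups.
fit = smf.ols("earnings ~ comp * post", data=df).fit()
print(fit.params["comp:post"])
```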
15. Bloom et al. Results
16. Bloom et al. Results (continued)
17. Implications of Bloom et al.
- Averaging across the 4 within-city sites showed no difference; this was also true if the 5th, between-city site was added
- Selecting within-city comparisons obviated the need for statistical adjustments for non-equivalence; design alone did it
- Bloom et al. tested the differential effects of statistical adjustments in between-state comparisons, where there were large differences
- None worked, and none did better than OLS
18. Aiken et al. (1998) Revisited
- The experiment: recall that the sample was selected from a narrow range of test-score values
- Quasi-experiment: sample selection limited to students who registered late or could not be found in summer, but who scored in the same range as the experiment
- No differences between experiment and non-experiment on test scores or pretest writing tests
- Measurement identical in experiment and non-experiment
19. Results for Aiken et al.
- Standardized writing test: .59 and .57, both significant
- Rated essay: .06 and .16, both non-significant
- High degree of comparability in statistical test results and effect size estimates
20. Implications of Aiken et al.
- As with Bloom et al., careful selection of the sample gets close correspondence on important observables
- Little need for statistical adjustment when non-equivalence is limited only to unobservables
- Statistical adjustment is minor compared to using the sampling design to construct initial correspondence
21. What happens if there is an initial selection difference?
- Shadish, Luellen, and Clark (2006)
22. Figure 1: Design of Shadish et al. (2006)
- N = 445 undergraduate psychology students take pretests and are then randomly assigned to one of two arms
- Randomized experiment (n = 235): randomly assigned to mathematics training (n = 119) or vocabulary training (n = 116)
- Nonrandomized experiment (n = 210): self-selected into mathematics training (n = 79) or vocabulary training (n = 131)
- All participants measured on both mathematics and vocabulary outcomes
23. What's Special in Shadish et al.
- Variation in mode of assignment
- Most other factors held constant through the first random assignment: population, measures, activity patterns
- A good experiment? Pretests; short term, with little attrition; no chance for contamination
- A good quasi-experiment? Selection process, quality of measurement, analysis, and the role of Rosenbaum
24. Results: Shadish et al.
25. Implications of Shadish et al.
- Here the sampling design produced non-equivalent groups on observables, unlike Bloom
- Here the statistical adjustments worked when computed as propensity scores (see the sketch after this list)
- However, there is big overlap in experimental and non-experimental scores due to the first-stage random assignment, making the propensity scores more valid
- Extensive, unusually valid measurement of a relatively simple, though not homogeneous, selection process
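For readers who want the mechanics, a minimal propensity-score sketch using inverse-probability weighting. This is a generic illustration with hypothetical data, not Shadish et al.'s actual specification, which rested on richly measured selection covariates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 500
# Hypothetical pretest covariates that drive self-selection into
# treatment, analogous to the measured selection process above.
x1, x2 = rng.normal(size=(2, n))
t = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * x1 - 0.5 * x2))))
y = 1.0 * t + 0.6 * x1 + 0.4 * x2 + rng.normal(0, 1, n)

# Estimate each unit's propensity to select treatment from the
# observed covariates, then reweight both groups to the full sample.
X = np.column_stack([x1, x2])
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
w = np.where(t == 1, 1 / ps, 1 / (1 - ps))

# Weighted mean difference estimates the average treatment effect,
# valid only if selection depends solely on the measured covariates.
ate = np.average(y[t == 1], weights=w[t == 1]) - np.average(y[t == 0], weights=w[t == 0])
print(ate)
```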
26. Limitations of Shadish et al.
- What about more complex settings?
- What about more complex selection processes?
- What about OLS and other analyses?
- This is not a unique test of propensity scores!
27. Examining Within-Study Comparison Studies with Different Results
- The Bulk of the Job Training Comparisons
- Two Examples from Education
28. Earliest Job Training Studies, Adding to the Smith/Todd Critique
- Mode of assignment clearly varied
- We assume the RCT was implemented reasonably well
- But third-variable irrelevancies were not controlled, especially location and measurement, given the dependence on matching from extant data sets
- Large initial differences between the randomly and non-randomly formed comparison groups
- Reliance on statistical adjustment, rather than initial design, to reduce selection bias
29. Recent Educational Examples
30. Agodini and Dynarski (2004)
- Drop-out prevention experiment, 16 middle/high schools
- Individual students, likely dropouts, were randomly assigned within schools (16 replicates)
- Quasi-experiment: students matched from 2 quite different sources, middle-school controls in another study and national NELS data
- Matching on individual and school demographic factors
- 4 outcomes examined, and so also in the non-experiment
- 128 propensity-score analyses (16 x 4 x 2), computed basically from demographic background variables
31. Results
- Balanced matches were obtained in only 29 of 128 cases (a balance-check sketch follows below)
- Why was quality matching so rare? In the non-experiment, the groups hardly overlap: the treatment group includes high and middle schools, but the comparisons are middle school only or from a very non-local national data set
- Mixed pattern of outcome correspondences in the 29 cases with computable propensity scores. Not good
- OLS did as well as propensity scores
32. Critique
- Who would design a quasi-experiment this way? Is a mediocre non-experiment being compared to a good experiment?
- An alternative design might have been:
- 1. Regression discontinuity
- 2. Local comparison schools, with the same selection mechanism used to select similar comparison students
- 3. Use of multi-year prior achievement data
33. Wilde and Hollister (2005)
- The experiment: reducing class size in 11 sites; no pretest used at the individual level
- Quasi-experimental design: individuals in reduced classes matched to individual cases from the other 10 sites
- Propensity scores mostly demographic
- The analysis treats each site as a separate experiment
- And so 11 replicates, each comparing an experimental and a non-experimental effect size
34. Results
- Low level of correspondence between experimental and non-experimental effect sizes across the 11 sites
- So at each site it makes a causal difference whether the experiment or the quasi-experiment is used
- When aggregated across sites, the results are closer: experimental .68, non-experimental 1.07
- But they still reliably differ
35. Critique
- Who would design a quasi-experiment on this topic without a pretest on the same scale as the outcome?
- Who would design it with these controls?
- Instead, select controls from one or more schools matched on prior achievement history
- Again, a good experiment is being compared to a bad quasi-experiment
- Who would treat this as 11 separate experiments rather than a more stable pooled experiment? Even in the authors' own analysis, the pooled results are much more congruent
36. The Hypothesis Is That...
- The job training and educational examples that produce different conclusions from the experiment are examples of poor quasi-experimental design
- To compare a good experiment to a poor quasi-experiment is to confound a design type with the quality of its implementation, a logical fallacy
- But I reach this conclusion ex post facto, knowing the randomized experimental results in advance
37. Big Conclusions
- R-D has given results not much different from the experiment in three of three cases
- Simpler quasi-experiments tend to give the same results as the experiment if (a) populations are matched in the sampling design, as in the Bloom and Aiken studies, or (b) the selection model is carefully conceptualized and measured, as in Shadish et al.
38. What I Am Not Concluding
- That a well-designed quasi-experiment is as good as an experiment. They differ in:
- Number and transparency of assumptions
- Statistical power
- Knowledge of implementation
- Social and political acceptance
- If you have the option, do an experiment, because you can rarely put right by statistics what you have messed up by design
39. What I Am Suggesting You Consider
- Whether this should be a unit on RCTs or on quality causal studies
- Whether you want to do RDD studies in cases where an experiment is not possible because resources are distributed otherwise
- Whether you want to do quasi-experiments when group matching on the pretest is possible, as in many school-level interventions
40. More Contentiously, If...
- The selection process can be conceptualized, observed, and measured very well
- An abbreviated ITS analysis is possible, as in Bloom et al.
- The instinct to avoid quasi-experiments is correct, but it reduces the scope of the causal issues that can be examined
43. Shadish, Luellen, and Clark (2006)
44. Shadish, Luellen, and Clark (2006) (continued)
45. Results: Aiken et al.
- Pretest values on SAT/CAT and 2 writing measures
- Measurement framework the same
- Pretest ACTs and writing: ns, experiment vs. non-experiment
- OLS tests
- Results for the writing test: .59 and .57, significant
- Results for the essay: .06 and .16, ns
46. Bloom et al. Revisited
- Analysis at the individual level
- Within city, within welfare-to-work center, same measurement design
- Absolute bias: yes
- Average bias: none across the 5 within-state sites, even without statistical tests
- Average bias limited to a small site and the non-within-city site (Detroit vs. Grand Rapids)
47. Correspondence Criteria
- Random error, so no exact agreement expected
- Shared pattern of statistical significance relative to zero
- The two effect sizes not statistically different (see the sketch below)
- Comparable magnitude of estimates
- One as a percent of the other
- Indulgence, common sense, and a mix
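For the "not statistically different" criterion, a minimal sketch of the usual large-sample test for a difference between two independent effect sizes. The effect sizes echo the Aiken et al. writing-test values, but the standard errors are hypothetical illustration values.

```python
import numpy as np
from scipy import stats

# Experimental and non-experimental effect sizes with standard
# errors (the SEs here are hypothetical, for illustration only).
es_exp, se_exp = 0.59, 0.12
es_nonexp, se_nonexp = 0.57, 0.15

# Large-sample z test on the difference between two independent
# effect-size estimates.
z = (es_exp - es_nonexp) / np.sqrt(se_exp**2 + se_nonexp**2)
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(z, p)  # small |z|, large p: the estimates do not reliably differ
```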
48. Our Research Issues
- Deconstructing the non-experiment: do experimental and non-experimental effect sizes correspond differently for R-D, for ITS, and for simple non-equivalent designs?
- How far can we generalize results about the invalidity of non-experiments beyond job training?
- Do these within-study comparison studies bear the weight ascribed to them in evaluation policy at DoL and IES?