Title: Threats to the Validity of Measures of Achievement Gains
1. Threats to the Validity of Measures of Achievement Gains
- Laura Hamilton and Daniel McCaffrey, RAND Corporation
- Daniel Koretz, Harvard University
- November 8, 2005
2. Growth Measures are Becoming More Common in State Accountability Systems
- NCLB is primarily not a growth-based approach to accountability, other than through safe harbor
- Many states supplement NCLB with growth-based measures
  - California's Academic Performance Index
  - Massachusetts' Performance and Improvement ratings
- U.S. Department of Education has recently expressed willingness to explore growth measures
3. Today's Presentation Examines Threats to Validity of Growth Measures
- Background: How growth is measured
- Framework for validating measures of change
- Threats to validity
  - Dimensionality
  - Score inflation
- Implications
4. Growth Metrics Come in Several Forms
- Cohort to cohort (CTC)
  - E.g., the average for this year's fifth graders compared to last year's fifth graders
- Quasi-longitudinal
  - E.g., the average for this year's fifth graders compared to last year's fourth graders
- True longitudinal or individual growth (IG)
  - E.g., the average of the individual gains for this year's fifth graders
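To make these distinctions concrete, below is a minimal sketch of how the three metrics might be computed; the dataset, column names, and score values are hypothetical and chosen only for illustration.

    # Sketch of the three growth metrics on a toy dataset (all values hypothetical)
    import pandas as pd

    df = pd.DataFrame({
        "student_id": [1, 2, 3, 1, 2, 4, 5, 6],
        "year":       [2004, 2004, 2004, 2005, 2005, 2005, 2004, 2005],
        "grade":      [4, 4, 4, 5, 5, 5, 5, 5],
        "score":      [210, 225, 198, 232, 247, 215, 240, 250],
    })

    # Cohort to cohort (CTC): this year's fifth graders vs. last year's fifth graders
    ctc = (df.query("grade == 5 and year == 2005")["score"].mean()
           - df.query("grade == 5 and year == 2004")["score"].mean())

    # Quasi-longitudinal: this year's fifth graders vs. last year's fourth graders
    quasi = (df.query("grade == 5 and year == 2005")["score"].mean()
             - df.query("grade == 4 and year == 2004")["score"].mean())

    # Individual growth (IG): mean of within-student gains for students tested in both years
    wide = df.pivot_table(index="student_id", columns="year", values="score")
    ig = (wide[2005] - wide[2004]).dropna().mean()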
5. Individual Growth Models are Generally Preferred
- Address problems stemming from changes in student populations over time
- Can yield biased estimates if students with incomplete data are different from other students
- Provide better information to inform decisions about individual students or groups of students
- CTC changes provide little information for stable schools
6. All Growth Models Require Assumptions about Consistency of Constructs Measured
- Users of information from growth models assume the construct remains constant
- For CTC models, nature of achievement and test content in a single grade should not change
- For IG models, nature of achievement and constructs measured should not change as students progress through school
- Assumption of consistency is violated to varying degrees depending on features of models, tests, and curriculum
7. Consistency is One Aspect of Validity
- Validity applies to inferences, not just to tests
- Growth modeling raises concerns about validity of inferences about change
- Need to understand what users infer from change scores
  - These inferences might vary by group (e.g., parents, school administrators)
- Match between what is inferred and what is actually measured is critical to validity
8. Framework for Validating Measures of Change
- Validation of change scores has focused mainly on comparing trends between scores on two tests or on correlations between alternate measures
- These traditional approaches do not address degree of match between tests or nonuniformity of changes within a test
- Koretz, McCaffrey, and Hamilton (2001) developed a framework for validating tests under high-stakes conditions, with a focus on measuring change
9. Framework Addresses Nonuniformity of Gains Within a Test
- Test scores and inferences are considered in terms of specific performance elements
  - Substantive elements represent the domain of interest
  - Non-substantive elements are irrelevant to the domain of interest
- Performance elements are associated with weights
  - Weights are typically not explicit
  - Some may be unintentional
- Validity requires close match between test weights and inference weights
10. A Simple Linear Model for Test Scores
- If we assume performance elements are additive, a student's score in year t is the weighted sum Σ_j λ_jt θ_jt, where θ_jt denotes the student's performance on element j in year t and λ_jt is the test weight
- The inference about a score assumes it is also a weighted sum of elements but might use different weights
  - Some weights can be zero
11. Several Factors Undermine Validity of Inferences About Change
- Changing nature of sample in CTC models
  - Differences in characteristics of students included at different time points undermine comparability
  - We do not address this problem here
- Dimensionality: changes in performance elements and their weights
- Score inflation: special case of dimensionality problem stemming from increases in scores that do not match increases in achievement
12. Dimensionality
- Tests typically assess multiple performance elements
  - Test specifications or maps to standards provide explicit information about performance elements
  - But implicit and unintended elements are also likely to affect performance
- We use the term dimensionality broadly to cover all types of performance elements
- Users' inferences are also likely to be multidimensional
- Empirical unidimensionality is not sufficient to conclude dimensionality is not a problem
13. Dimensionality Affects Inferences about Influences on Achievement
- Analyses of NELS:88 math and science assessments examine relationships among achievement, student background, and school and classroom experiences using subscales of the achievement measure
- For example, gender differences in science depend on what is measured
  - Difference is larger on items that require out-of-school knowledge or spatial reasoning
  - Focus on total score or on publisher-developed test specifications masks this difference
- Similar findings for relationships with other student characteristics and school experiences
14. Dimensionality is Relevant to Value-Added Modeling
- Subscales from a single mathematics achievement test produce dramatically different results
  - Study used Procedures and Problem Solving subscores from the Stanford Achievement Test
  - Variation within teachers across subscores was as large as or larger than variation across teachers
- Results suggest that decisions about teacher or school effectiveness depend strongly on the outcome measure
- Changes in weights given to subscores could affect estimates of teacher or school effectiveness
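As an illustration of that last point, the sketch below shows how shifting the weight placed on two subscores can reorder estimated teacher effects; the teachers and mean subscore gains are hypothetical.

    # Sketch: re-weighting two subscores can reorder teacher rankings
    # (teachers and mean subscore gains are hypothetical)
    gains = {"teacher_A": (8.0, 2.0),   # (procedures, problem_solving)
             "teacher_B": (4.0, 6.0),
             "teacher_C": (5.0, 5.0)}

    for w_proc in (1.0, 0.5, 0.0):      # weight on the procedures subscore
        w_ps = 1.0 - w_proc
        composite = {t: w_proc * g[0] + w_ps * g[1] for t, g in gains.items()}
        ranking = sorted(composite, key=composite.get, reverse=True)
        print(f"weight on procedures = {w_proc:.1f}: {ranking}")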
15. The Effects of Different Weightings of Computation and Problem Solving Scores on Teacher Effects
16. Threats Stem from Changing Performance Weights or Mismatch with Inference Weights
- Many performance elements are likely to be inadvertent and non-substantive; most measures of change will not be fully aligned with users' inferences
17. Threats Stem from Changing Performance Weights or Mismatch with Inference Weights (continued)
- Sensitivity of test items to instruction is likely to vary across grades and across performance elements within the test, resulting in changing weights and/or incorrect inferences about educator effectiveness
- When tests measure multiple elements, weights that change over time can contribute to gain scores independent of any gains on the performance elements
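A worked numeric example of that last point, with hypothetical elements and weights: even with no change in performance, shifting test weights alone can produce an apparent gain.

    # Sketch: shifting test weights can create an apparent gain with zero true growth
    # (elements, performance values, and weights are hypothetical)
    theta  = {"algebra": 0.6, "geometry": 0.4}   # constant across both years
    lam_y1 = {"algebra": 0.3, "geometry": 0.7}   # year-1 test weights
    lam_y2 = {"algebra": 0.7, "geometry": 0.3}   # year-2 test weights

    score_y1 = sum(lam_y1[j] * theta[j] for j in theta)  # 0.46
    score_y2 = sum(lam_y2[j] * theta[j] for j in theta)  # 0.54
    gain = score_y2 - score_y1                           # +0.08 despite no change in theta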
18. Implications for CTC and IG Models Vary
- Most CTC models use the same test or parallel test forms from one year to the next
  - Test weights and inference weights will tend to remain reasonably constant over time
  - But performance elements might differ in their sensitivity to instruction
- IG models face additional problem of changes in dimensionality and instructional sensitivity across grades
  - Problem is likely to be most severe for far-apart grade levels and for subjects in which the curriculum is not cumulative
19. Score Inflation
- Score inflation refers to increases in test scores that are not matched by increases in the underlying achievement construct the test was intended to measure
- Score inflation represents a special case of dimensionality-related problems
20. Score Inflation is Common in High-Stakes Testing Contexts
- Analyses of high-stakes test scores show gains in those scores are not matched by gains on other tests of the same content
- Discrepancies in trends on high- and low-stakes tests suggest gains on high-stakes tests do not accurately reflect gains in the underlying achievement the test was intended to measure
21. Example of Score Inflation
[Figure: Mathematics test scores. Source: Koretz, Linn, Dunbar, & Shepard, 1991]
22. Variation in Teachers' Responses to Tests Leads to Variation in Inflation
- Teachers respond to high-stakes testing in ways that are intended to maximize score increases
  - Placing more emphasis on tested topics than on untested topics, even when the latter are relevant to users' inferences
  - Focusing on "bubble kids" (those just below the cut score)
  - Coaching on item styles, prompts, or rubrics (aspects of the test that are incidental to the domain being tested)
- Many of these actions inflate scores by producing test-score gains that are larger than the gains in the broader achievement domain
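A rough numeric sketch of how reallocating instruction toward tested material can produce inflation; the weights and gains below are hypothetical.

    # Sketch: emphasis on tested topics raises the test score more than the broader
    # domain improves (weights and gains are hypothetical)
    w_test   = 0.9   # share of the test devoted to tested topics
    w_domain = 0.5   # share of the inference's domain covered by tested topics
    gain_tested, gain_untested = 10.0, 0.0   # gains after reallocating instruction

    test_gain   = w_test * gain_tested + (1 - w_test) * gain_untested       # 9.0
    domain_gain = w_domain * gain_tested + (1 - w_domain) * gain_untested   # 5.0
    inflation   = test_gain - domain_gain    # 4.0 points not matched by domain growth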
23. Recent Surveys Suggest Teachers' Practices are Influenced by Tests
- Data from surveys of teachers in California, Georgia, and Pennsylvania
- Most teachers report increased focus on standards and on content emphasized on tests
- More than half of elementary teachers report increasing time spent on test-taking strategies
- Approximately 25% of teachers say they focus more on students near the proficient cut score
- Responses tend to be stronger in math than in science
24. Score Inflation Exacerbates Inconsistencies in Test and Inference Weights
25. Threats Stemming from Score Inflation
- Problems arising from inflation are similar to those arising from dimensionality
- Inflation occurs when students make substantial gains on elements that might or might not have large inference weights, but fail to make gains on other elements that have high inference weights
- Inflation threatens the validity of inferences about gains in achievement when achievement is measured using high-stakes tests
26. Implications for CTC and IG Models
- Most research on score inflation has focused on CTC measures
- Evidence suggests score inflation is large in the first few years of test implementation but eventually plateaus
- Even if inflation lessens over time, inferences about change should be limited to tested material; change scores provide no information about untested material
- IG models can be affected by variation in inflation across grades; plateau effects might never occur
27. Improving the Validity of Inferences about Change
- Users of test-score information need to recognize that measuring change is not necessarily the same as measuring growth
- Test developers should make their measures as resistant to inflation as possible
- Future research should address dimensionality and score inflation in the context of CTC and IG measures
28. Summary
- Test scores and inferences depend on multiple performance elements
- Valid inferences require consistency between inference and test weights
  - Inconsistency implies that changes in scores could be unrelated to the performance elements of interest
- Score inflation
  - CTC susceptible to errors from growth on non-substantive or restricted set of elements
    - Effects likely to plateau
  - IG susceptible to changes in elements or content across grades
  - Can have big impact on growth and related measures