Title: Threats to the Validity of Measures of Achievement Gains
1. Threats to the Validity of Measures of Achievement Gains
- Laura Hamilton and Daniel McCaffrey, RAND Corporation
- Daniel Koretz, Harvard University
- November 8, 2005
2. Growth Measures are Becoming More Common in State Accountability Systems
- NCLB is primarily not a growth-based approach to accountability, other than through safe harbor
- Many states supplement NCLB with growth-based measures
  - California's Academic Performance Index
  - Massachusetts' Performance and Improvement ratings
- U.S. Department of Education has recently expressed willingness to explore growth measures
3. Today's Presentation Examines Threats to Validity of Growth Measures
- Background: How growth is measured
- Framework for validating measures of change
- Threats to validity
  - Dimensionality
  - Score inflation
- Implications
4. Growth Metrics Come in Several Forms
- Cohort to cohort (CTC)
  - E.g., the average for this year's fifth graders compared to last year's fifth graders
- Quasi-longitudinal
  - E.g., the average for this year's fifth graders compared to last year's fourth graders
- True longitudinal or individual growth (IG)
  - E.g., the average of the individual gains for this year's fifth graders
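To make these distinctions concrete, below is a minimal sketch of how the three metrics might be computed; the dataset, column names, and score values are hypothetical and chosen only for illustration.

    # Sketch of the three growth metrics on a toy dataset (all values hypothetical)
    import pandas as pd

    df = pd.DataFrame({
        "student_id": [1, 2, 3, 1, 2, 4, 5, 6],
        "year":       [2004, 2004, 2004, 2005, 2005, 2005, 2004, 2005],
        "grade":      [4, 4, 4, 5, 5, 5, 5, 5],
        "score":      [210, 225, 198, 232, 247, 215, 240, 250],
    })

    # Cohort to cohort (CTC): this year's fifth graders vs. last year's fifth graders
    ctc = (df.query("grade == 5 and year == 2005")["score"].mean()
           - df.query("grade == 5 and year == 2004")["score"].mean())

    # Quasi-longitudinal: this year's fifth graders vs. last year's fourth graders
    quasi = (df.query("grade == 5 and year == 2005")["score"].mean()
             - df.query("grade == 4 and year == 2004")["score"].mean())

    # Individual growth (IG): mean of within-student gains for students tested in both years
    wide = df.pivot_table(index="student_id", columns="year", values="score")
    ig = (wide[2005] - wide[2004]).dropna().mean()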
5. Individual Growth Models are Generally Preferred
- Address problems stemming from changes in student populations over time
- Can yield biased estimates if students with incomplete data are different from other students
- Provide better information to inform decisions about individual students or groups of students
- CTC changes provide little information for stable schools
6. All Growth Models Require Assumptions about Consistency of Constructs Measured
- Users of information from growth models assume the construct remains constant
- For CTC models, nature of achievement and test content in a single grade should not change
- For IG models, nature of achievement and constructs measured should not change as students progress through school
- Assumption of consistency is violated to varying degrees depending on features of models, tests, and curriculum
7. Consistency is One Aspect of Validity
- Validity applies to inferences, not just to tests
- Growth modeling raises concerns about validity of inferences about change
- Need to understand what users infer from change scores
  - These inferences might vary by group (e.g., parents, school administrators)
- Match between what is inferred and what is actually measured is critical to validity
8. Framework for Validating Measures of Change
- Validation of change scores has focused mainly on comparing trends between scores on two tests or on correlations between alternate measures
- These traditional approaches do not address degree of match between tests or nonuniformity of changes within a test
- Koretz, McCaffrey, and Hamilton (2001) developed a framework for validating tests under high-stakes conditions, with a focus on measuring change
9. Framework Addresses Nonuniformity of Gains Within a Test
- Test scores and inferences are considered in terms of specific performance elements
  - Substantive elements represent the domain of interest
  - Non-substantive elements are irrelevant to the domain of interest
- Performance elements are associated with weights
  - Weights are typically not explicit
  - Some may be unintentional
- Validity requires close match between test weights and inference weights
10. A Simple Linear Model for Test Scores
- If we assume performance elements are additive, a student's score in year t is the weighted sum Σ_j λ_jt θ_jt, where θ_jt denotes the student's performance on element j in year t and λ_jt is the test weight
- The inference about a score assumes it is also a weighted sum of elements but might use different weights
  - Some weights can be zero
11. Several Factors Undermine Validity of Inferences About Change
- Changing nature of sample in CTC models
  - Differences in characteristics of students included at different time points undermine comparability
  - We do not address this problem here
- Dimensionality: changes in performance elements and their weights
- Score inflation: special case of dimensionality problem stemming from increases in scores that do not match increases in achievement
12. Dimensionality
- Tests typically assess multiple performance elements
  - Test specifications or maps to standards provide explicit information about performance elements
  - But implicit and unintended elements are also likely to affect performance
- We use the term dimensionality broadly to cover all types of performance elements
- Users' inferences are also likely to be multidimensional
- Empirical unidimensionality is not sufficient to conclude dimensionality is not a problem
13. Dimensionality Affects Inferences about Influences on Achievement
- Analyses of NELS:88 math and science assessments examine relationships among achievement, student background, and school and classroom experiences using subscales of the achievement measure
- For example, gender differences in science depend on what is measured
  - Difference is larger on items that require out-of-school knowledge or spatial reasoning
  - Focus on total score or on publisher-developed test specifications masks this difference
- Similar findings for relationships with other student characteristics and school experiences
14. Dimensionality is Relevant to Value-Added Modeling
- Subscales from a single mathematics achievement test produce dramatically different results
  - Study used Procedures and Problem Solving subscores from the Stanford Achievement Test
  - Variation within teachers across subscores was as large as or larger than variation across teachers
- Results suggest that decisions about teacher or school effectiveness depend strongly on the outcome measure
- Changes in weights given to subscores could affect estimates of teacher or school effectiveness
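As an illustration of that last point, the sketch below shows how shifting the weight placed on two subscores can reorder estimated teacher effects; the teachers and mean subscore gains are hypothetical.

    # Sketch: re-weighting two subscores can reorder teacher rankings
    # (teachers and mean subscore gains are hypothetical)
    gains = {"teacher_A": (8.0, 2.0),   # (procedures, problem_solving)
             "teacher_B": (4.0, 6.0),
             "teacher_C": (5.0, 5.0)}

    for w_proc in (1.0, 0.5, 0.0):      # weight on the procedures subscore
        w_ps = 1.0 - w_proc
        composite = {t: w_proc * g[0] + w_ps * g[1] for t, g in gains.items()}
        ranking = sorted(composite, key=composite.get, reverse=True)
        print(f"weight on procedures = {w_proc:.1f}: {ranking}")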
15. The Effects of Different Weightings of Computation and Problem Solving Scores on Teacher Effects
16. Threats Stem from Changing Performance Weights or Mismatch with Inference Weights
- Many performance elements are likely to be inadvertent and non-substantive; most measures of change will not be fully aligned with users' inferences
17. Threats Stem from Changing Performance Weights or Mismatch with Inference Weights (continued)
- Sensitivity of test items to instruction is likely to vary across grades and across performance elements within the test, resulting in changing weights and/or incorrect inferences about educator effectiveness
- When tests measure multiple elements, weights that change over time can contribute to gain scores independent of any gains on the performance elements
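A worked numeric example of that last point, with hypothetical elements and weights: even with no change in performance, shifting test weights alone can produce an apparent gain.

    # Sketch: shifting test weights can create an apparent gain with zero true growth
    # (elements, performance values, and weights are hypothetical)
    theta  = {"algebra": 0.6, "geometry": 0.4}   # constant across both years
    lam_y1 = {"algebra": 0.3, "geometry": 0.7}   # year-1 test weights
    lam_y2 = {"algebra": 0.7, "geometry": 0.3}   # year-2 test weights

    score_y1 = sum(lam_y1[j] * theta[j] for j in theta)  # 0.46
    score_y2 = sum(lam_y2[j] * theta[j] for j in theta)  # 0.54
    gain = score_y2 - score_y1                           # +0.08 despite no change in theta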
18. Implications for CTC and IG Models Vary
- Most CTC models use the same test or parallel test forms from one year to the next
  - Test weights and inference weights will tend to remain reasonably constant over time
  - But performance elements might differ in their sensitivity to instruction
- IG models face additional problem of changes in dimensionality and instructional sensitivity across grades
  - Problem is likely to be most severe for far-apart grade levels and for subjects in which the curriculum is not cumulative
19. Score Inflation
- Score inflation refers to increases in test scores that are not matched by increases in the underlying achievement construct the test was intended to measure
- Score inflation represents a special case of dimensionality-related problems
20. Score Inflation is Common in High-Stakes Testing Contexts
- Analyses of high-stakes test scores show gains in those scores are not matched by gains on other tests of the same content
- Discrepancies in trends on high- and low-stakes tests suggest gains on high-stakes tests do not accurately reflect gains in the underlying achievement the test was intended to measure
21. Example of Score Inflation
[Figure: Mathematics test scores. Source: Koretz, Linn, Dunbar, & Shepard, 1991]
22. Variation in Teachers' Responses to Tests Leads to Variation in Inflation
- Teachers respond to high-stakes testing in ways that are intended to maximize score increases
  - Placing more emphasis on tested topics than on untested topics, even when the latter are relevant to users' inferences
  - Focusing on "bubble kids" (those just below the cut score)
  - Coaching on item styles, prompts, or rubrics (aspects of the test that are incidental to the domain being tested)
- Many of these actions inflate scores by producing test-score gains that are larger than the gains in the broader achievement domain
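A rough numeric sketch of how reallocating instruction toward tested material can produce inflation; the weights and gains below are hypothetical.

    # Sketch: emphasis on tested topics raises the test score more than the broader
    # domain improves (weights and gains are hypothetical)
    w_test   = 0.9   # share of the test devoted to tested topics
    w_domain = 0.5   # share of the inference's domain covered by tested topics
    gain_tested, gain_untested = 10.0, 0.0   # gains after reallocating instruction

    test_gain   = w_test * gain_tested + (1 - w_test) * gain_untested       # 9.0
    domain_gain = w_domain * gain_tested + (1 - w_domain) * gain_untested   # 5.0
    inflation   = test_gain - domain_gain    # 4.0 points not matched by domain growth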
23. Recent Surveys Suggest Teachers' Practices are Influenced by Tests
- Data from surveys of teachers in California, Georgia, and Pennsylvania
- Most teachers report increased focus on standards and on content emphasized on tests
- More than half of elementary teachers report increasing time spent on test-taking strategies
- Approximately 25% of teachers say they focus more on students near the proficient cut score
- Responses tend to be stronger in math than in science
24. Score Inflation Exacerbates Inconsistencies in Test and Inference Weights
25. Threats Stemming from Score Inflation
- Problems arising from inflation are similar to those arising from dimensionality
- Inflation occurs when students make substantial gains on elements that might or might not have large inference weights, but fail to make gains on other elements that have high inference weights
- Inflation threatens the validity of inferences about gains in achievement when achievement is measured using high-stakes tests
26. Implications for CTC and IG Models
- Most research on score inflation has focused on CTC measures
- Evidence suggests score inflation is large in the first few years of test implementation but eventually plateaus
- Even if inflation lessens over time, inferences about change should be limited to tested material; change scores provide no information about untested material
- IG models can be affected by variation in inflation across grades; plateau effects might never occur
27. Improving the Validity of Inferences about Change
- Users of test-score information need to recognize that measuring change is not necessarily the same as measuring growth
- Test developers should make their measures as resistant to inflation as possible
- Future research should address dimensionality and score inflation in the context of CTC and IG measures
28. Summary
- Test scores and inferences depend on multiple performance elements
- Valid inferences require consistency between inference and test weights
  - Inconsistency implies that changes in scores could be unrelated to the performance elements of interest
- Score inflation
  - CTC susceptible to errors from growth on non-substantive or restricted set of elements
    - Effects likely to plateau
  - IG susceptible to changes in elements or content across grades
  - Can have big impact on growth and related measures