Title: Change and Stability in Educational Assessment: An Oxymoron?
1. Change and Stability in Educational Assessment: An Oxymoron?
- Alina A. von Davier
- Educational Testing Service
- April 24th, 2008
- Institute of Educational Assessors
- National Conference
2. Change in Educational Assessment
- Many major educational assessments have been shaping society for decades
- Maintaining a test that is in sync with the curriculum, shifting demographics, and society's fluid expectations requires adapting the testing instruments
- Examples of long-term assessments: SAT I, GRE, TOEFL
3. Stability in Educational Assessment
- The questions are:
  - Whether the testing instrument can be improved or adapted without changing the meaning of the reported scores
  - How to relate scores on the old and new versions of the test (linking)
  - What claims the score linking might or might not support
4. Outline
- Present an overview of the educational assessment transition process
- Discuss statistical and policy implications of the process components
- Outline several techniques for investigating the relationship between the test scores (linking) when tests undergo major changes
5. Frequently Asked Questions
- What are the consequences of change for score meaning?
- Why should we care about whether score meaning is changed? How will that affect the decisions we want to make?
- How do we find out if score meaning has been affected?
- What can we do to preserve score meaning across changes?
6. [Diagram: components of the assessment transition process]
- Change Purpose
- Communication & Accountability
- Test Purpose
- Change Specification
- Tools & Processes
- Constraints & Quality
7. Communication & Accountability
- "Accountability is the glue which holds the examinations and testing system together." David Gee, from the Keynote Address at IEA in 2007
- "The problem with communication ... is the illusion that it has been accomplished." George Bernard Shaw
8. Communication & Accountability (cont.)
- The Assessment Agency (AA) should communicate the goals and requirements to the Testing Agency (TA)
- In turn, the TA should provide suggestions, solutions, and risk assessment
- The TA chooses the tools and processes and validates them with the AA
9. Running Example
- The AA plans a change in the administration mode for a national end-of-course test
- The change is to administer the test on computer (CB) instead of using paper and pencil (PP)
- The change was recommended by the experts in education, the Board of the Assessment, etc.
- The AA and a chosen TA begin the dialog
11. Test Purpose
- "The results of the test will be used for a myriad of purposes ranging from helping students to improve their learning, assessing school performance, influencing house prices, and, yes, even rating the effectiveness of politicians. It is important to ensure that the data outputs from assessment are not put to too many diverse purposes." David Gee, from the Keynote Address at IEA in 2007
12. Test Purpose (cont.)
- The purposes of the existing test should be carefully discussed by the AA and TA
- An assessment should be constructed to support a primary use
- Additional purposes need to be explicit and considered
13. Test Purpose: Examples
- Survey assessments
  - Group-level reporting
  - Low-stakes for the test takers, high-stakes for states/countries
- Summative/achievement tests
  - Individual reporting
  - Precision at all score/ability levels, including at cut-score(s)
  - High-stakes for the test takers and for schools/districts
- End-of-course tests
  - Individual reporting
  - Precision at all score/ability levels and at cut-score(s)
  - High-stakes for the test takers
14. Test Purpose (cont.)
- Licensure and certification tests
  - Individual reporting
  - Precision at cut-score(s)
  - High-stakes for the test takers
- Placement or locator tests
  - Individual reporting
  - Age-independent tests
  - Multiple cut-scores
  - Potentially high-stakes for the test takers
- Formative assessments
  - Individual reporting
  - Multiple cut-scores
  - Low-stakes for the test takers (increasing stakes for the test takers)
15. Running Example (cont.)
- The AA communicates the main purpose of the test and describes it in detail
  - End-of-course test
  - Individual reporting
  - Precision at all ability levels and at cut-score(s)
  - High-stakes for the test takers
- The AA communicates other uses of the test
  - School accountability
  - Teacher evaluation
17. Change Purpose
- "At some point early in the redesign process, the organization must make a conscious decision about what is most important in the test revision. All the revisions and data collections should be guided by this redesign principle." Liu & Walker, 2007.
18. Change Purpose (cont.)
- Administration mode
- Content
- Populations
- Format
19. Running Example (cont.)
- The change is to administer the test on computer instead of using paper and pencil
- The other aspects of the test should not change
21. Change Specifications & Claims: Standard 4.16 (AERA, APA, NCME, 1999)
- "If test specifications are changed from one version of a test to a subsequent version, such changes should be identified in the test manual, and an indication should be given that converted scores for the two versions may not be strictly equivalent. When substantial changes in test specifications occur, either scores should be reported on a new scale or a clear statement should be provided to alert users that the scores are not directly comparable with those on earlier versions of the test." Standards for Educational and Psychological Testing
22. Change Specifications & Claims (cont.)
- Claims about content comparability (expert judgment, factor structure, correlations, validity)
- Claims about score comparability (equating, linking, concordance, reliability, or standard setting)
- Equating: it should be a matter of indifference to a test taker which version of the test he or she takes
23. Change Specifications & Claims (cont.)
- Claims about the scale scores
- The meaning of the reporting scale persists over time even though norms change -> Rescaling?
- New/additional linkings to other existing tests
24. Running Example (cont.)
- The content of the test should not change
- The test should measure the same construct
- The scores should be interchangeable: it should be a matter of indifference to a test taker whether the test is CB or PP (equating)
- The meaning of the scores should be preserved because test users want to make the same inferences (scaling)
- The precision at all ability/score levels should be maintained
26. Constraints and Quality
- Cost
- Testing time
- Technology availability
- Security issues
- Item disclosure policy
- Inability to pretest items
27. Constraints and Quality (cont.)
- Time for reporting scores
- Request for Constructed Response (CR) items
- Ability to mark the CR items
- Rater reliability, training, availability of raters
- Sample sizes for the special studies for the tests
- Accessibility (for, say, students with disabilities or ESL students)
28. Running Example (cont.)
- The test will be taken on computers with monitors no smaller than 12 inches; the keyboard should be of a standard size
- The number of items displayed on one screen is the same as the number of items per page on the paper test
- A sufficiently large sample of motivated test takers should be available for the special studies
- The computer-based test should be as reliable as the paper-based test
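
One way to check the reliability constraint above is to compare an internal-consistency index across the two modes. The sketch below assumes item-level scores are available for each mode and uses Cronbach's alpha, which is an illustrative choice not named in the slides; the function name and the data layout are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix.

    item_scores: list of rows, one row per test taker,
    each row a list of item scores (same length for all rows).
    """
    n_items = len(item_scores[0])
    # Variance of each item across test takers
    item_vars = [
        pvariance([row[j] for row in item_scores]) for j in range(n_items)
    ]
    # Variance of the total (sum) scores
    total_var = pvariance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: 5 test takers, 4 dichotomous items per mode
pp_scores = [[1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0]]
cb_scores = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 1, 0]]
print("alpha PP:", round(cronbach_alpha(pp_scores), 3))
print("alpha CB:", round(cronbach_alpha(cb_scores), 3))
```

In practice the two alphas would be compared together with their sampling error, since a small difference on a small sample says little about the modes.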
29. Running Example (cont.)
- The test should not be speeded
- The items should be as difficult as before
- All schools at the national level should have the needed technology available
- All students should have been trained to take a test on a computer
- Training materials have been made available
30. Running Example (cont.)
- Ability to write and mark the open-ended questions on the computer
- The raters have received the proper training
- The ability to manage the data, merge the files, and clean the data on the computer
- Special accommodations
32. Tools & Processes: Overview
- Special studies
  - Data collection design
  - Equity and fairness
- Linking versus rescaling
- Statistical methods and their assumptions
- Software
33. Tools & Processes: Special Studies
- Data collection
  - Sample size
  - Sample representativeness
  - Attrition (in a CB vs. PP study, one builds a randomized assignment to one or the other mode, but some disappointed students assigned to PP might drop out of the assessment; Eignor, 2007)
  - Design choices: counterbalanced, equivalent groups, common items
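
The randomized assignment and the attrition concern above can be sketched as follows. This is a minimal illustration, assuming a counterbalanced design with two order groups; the group labels, function names, and the list of completers are hypothetical.

```python
import random


def assign_counterbalanced(student_ids, seed=0):
    """Randomly split students into two order groups of (nearly) equal size."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "CB_then_PP": sorted(shuffled[:half]),
        "PP_then_CB": sorted(shuffled[half:]),
    }


def attrition_rate(assigned, completed):
    """Share of assigned students who did not complete the study."""
    completed = set(completed)
    dropped = [s for s in assigned if s not in completed]
    return len(dropped) / len(assigned)


groups = assign_counterbalanced(range(1, 11), seed=42)
# Hypothetical completers: students 3 and 7 dropped out
completers = [s for s in range(1, 11) if s not in (3, 7)]
for name, members in groups.items():
    print(name, members, "attrition:", attrition_rate(members, completers))
```

Comparing attrition across the groups is a quick check on whether dropout is related to the assigned mode or order, which would undermine the randomization.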
34. Tools & Processes: Special Studies (cont.)
- Equity and fairness
  - Check the claims at the item and test level, for example by computing DIF (differential item functioning) indices and equating population invariance indices with respect to the background variables of interest
  - Choose appropriate methods and check their assumptions
  - Interpretation of results and recommendations
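
As an illustration of an item-level DIF index, the sketch below computes the Mantel-Haenszel common odds ratio and its transformation to the delta metric (MH D-DIF = -2.35 ln(alpha)). The slides do not say which DIF index is used, so this choice is an assumption; the counts and the function name are hypothetical.

```python
import math


def mantel_haenszel_dif(strata):
    """Mantel-Haenszel DIF for one item.

    strata: list of tuples (ref_right, ref_wrong, focal_right, focal_wrong),
    one tuple per matched total-score level.
    Returns (alpha_MH, MH_D_DIF).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    alpha = num / den
    # Negative D-DIF values indicate the item is relatively harder
    # for the focal group at matched ability.
    return alpha, -2.35 * math.log(alpha)


# Hypothetical counts at two matched score levels
example = [(40, 10, 30, 20), (20, 30, 10, 40)]
alpha, d_dif = mantel_haenszel_dif(example)
print(f"alpha_MH = {alpha:.3f}, MH D-DIF = {d_dif:.3f}")
```

The sketch omits the significance test and the standard error that normally accompany the index when items are flagged.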
35. Tools & Processes: Linking versus Rescaling
- "Rescaling might be a psychometric issue, but decisions about whether to rescale are seldom made by psychometricians." R. Brennan, 2007, p. 174.
36. Tools & Processes: Linking vs. Rescaling (cont.)
- When is linking not enough?
- Adapted example (inspired by Golub-Smith, 2005)
  - The scale for an achievement test was set many years ago on the cohort that took the test at that time
  - The mean of the scale was set to 400 for both Verbal and Math
  - Meanwhile, the test-taker population and the mode changed
  - The mean of the current norm is 320 for Verbal and 470 for Math
  - So a test taker who scores 400 on both Verbal and Math today is actually above average on Verbal and below average on Math, though tradition may suggest that the person's score is average
  - In a case like this, psychometricians usually suggest that a rescaling is needed
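
The arithmetic behind this example can be made explicit. The means (400, 320, 470) come from the slide; the standard deviation of 100 is a purely hypothetical value added only to express the gaps in SD units, and the function name is illustrative.

```python
ORIGINAL_SCALE_MEAN = 400  # mean set when the scale was established
current_norm_means = {"Verbal": 320, "Math": 470}
ASSUMED_SD = 100  # hypothetical: the slides give no standard deviation


def position_vs_current_norm(score, section):
    """Return (point difference, difference in assumed-SD units)
    between a score and the current norm mean for that section."""
    diff = score - current_norm_means[section]
    return diff, diff / ASSUMED_SD


for section in current_norm_means:
    diff, z = position_vs_current_norm(ORIGINAL_SCALE_MEAN, section)
    print(f"{section}: {diff:+d} points relative to the current mean "
          f"({z:+.2f} SD under the assumed SD)")
```

A score of 400 thus sits 80 points above today's Verbal mean and 70 points below today's Math mean, even though the original scale treats 400 as average on both.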
37. Tools & Processes: Statistical Methods and Their Assumptions
- "To a man with a hammer, any problem is a nail"
- The solution should be chosen through a data-centered approach rather than through a particular "hammer"
- The methods should support the claims
38. Tools & Processes: Methods and Assumptions (cont.)
- Usually, the set of assumptions needed for a particular method needs to be checked against the constraints
- The TA needs to assess the risk of not meeting one or more assumptions and needs to provide alternative suggestions and methods
39. Tools & Processes: Methods, Classical Test Theory-Based
- Traditional linear and equipercentile methods
- Kernel equating method
- Diagnostic and accuracy methods
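
A minimal sketch of the two traditional methods, assuming two samples of observed scores from equivalent groups. Linear equating matches means and standard deviations; the equipercentile function here is a simplified version that matches mid-percentile ranks with linear interpolation and omits the presmoothing and continuization steps used operationally. All function names and data are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, x_scores, y_scores):
    """Map score x on form X to the form-Y scale by matching mean and SD."""
    slope = pstdev(y_scores) / pstdev(x_scores)
    return mean(y_scores) + slope * (x - mean(x_scores))


def percentile_rank(x, scores):
    """Mid-percentile rank: share below x plus half the share equal to x."""
    below = sum(s < x for s in scores)
    equal = sum(s == x for s in scores)
    return (below + 0.5 * equal) / len(scores)


def equipercentile_equate(x, x_scores, y_scores):
    """Form-Y score with the same percentile rank as x has on form X.

    Simplified: interpolates linearly between sorted Y scores.
    """
    p = percentile_rank(x, x_scores)
    ys = sorted(y_scores)
    # The i-th sorted Y score (0-based) has mid-percentile rank (i + 0.5) / n
    pos = p * len(ys) - 0.5
    pos = min(max(pos, 0.0), len(ys) - 1.0)
    lo = int(pos)
    hi = min(lo + 1, len(ys) - 1)
    return ys[lo] + (pos - lo) * (ys[hi] - ys[lo])


# Hypothetical samples: form Y is uniformly 5 points easier than form X
x_sample = [10, 12, 15, 18, 20]
y_sample = [15, 17, 20, 23, 25]
print(linear_equate(18, x_sample, y_sample))
print(equipercentile_equate(18, x_sample, y_sample))
```

When the two score distributions differ only in location and spread, the two methods agree; they diverge when the distributions differ in shape, which is one reason the diagnostic and accuracy methods listed above matter.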
40. Tools & Processes: Methods, IRT-Based
- The choice of the appropriate IRT models
- The choice of the estimation methods (related to the software below)
- The choice of linking method (concurrent calibration, separate calibration with the Stocking & Lord method)
- The choice of equating method (IRT true-score, IRT observed-score, or linear transformation)
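
To illustrate what linking after separate calibration involves, the sketch below uses the mean/sigma method, a simpler moment-based alternative swapped in for the Stocking & Lord method named on the slide (which minimizes differences between test characteristic curves and needs numerical optimization). It assumes difficulty estimates for the same common items from two separate calibrations; names and values are hypothetical.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_new, b_base):
    """Constants A, B such that theta_base = A * theta_new + B,
    estimated from common-item difficulties on the two scales."""
    A = pstdev(b_base) / pstdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B


def rescale_item(a, b, A, B):
    """Put a new-scale item (discrimination a, difficulty b) on the base scale."""
    return a / A, A * b + B


# Hypothetical common-item difficulties from the two calibrations
b_new_scale = [-1.0, 0.0, 1.0]
b_base_scale = [-1.5, 0.5, 2.5]

A, B = mean_sigma_constants(b_new_scale, b_base_scale)
print("A =", round(A, 3), "B =", round(B, 3))
print(rescale_item(1.2, 0.0, A, B))
```

Once the new-form parameters are on the base scale, IRT true-score or observed-score equating can be carried out with both forms expressed in the same metric.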
41. Tools & Processes: High-Quality Software
- Appropriate for the chosen method
- Estimation methods
- License and availability
- Training and interface
42. Running Example (cont.)
- Design: a sufficiently large and representative sample of motivated test takers who take both versions of the test in different orders (counterbalanced design)
- Exploratory analyses: correlations, distribution characteristics, order-effect analyses, factor analyses, DIF
- If the tests are similar in difficulty, if the factorial structure is the same, if the items function similarly, if there are no particular order effects, if the tests are similarly reliable, and if there are no subgroup effects, THEN equating is possible
43. Running Example (cont.)
- Which equating method?
- If possible, one would choose a method based on the same principles as the scoring method for the PP test
- Say, classical test theory-based:
  - Choose appropriate models to fit the data
  - Choose an equating function that a) preserves the same pass/fail proportion and similar distributions at the highest scores as in the previous administrations, b) has the smallest standard errors, c) shows no dependence on the subgroups
44. Running Example (cont.)
- Investigate the scaling method
- If everything goes well, recommend the use of the newly produced conversion table
- If the analyses support the claims, then the scores can be used interchangeably
- In this case, there is no need to rescale
46. Discussion
- The dialectic between the stability of the reporting scale and continuous changes in various aspects of the testing instruments of state or national assessments is a potential basis for informative analyses, not only for test practitioners but also for policy makers.
47. Discussion (cont.)
- Nowadays, as more standardized testing is used to assess competencies in different domains nationally and internationally, we are also discovering more challenges in ensuring that the process and the results are fair and accurate.