Change and Stability in Educational Assessment: an Oxymoron?

1
Change and Stability in Educational Assessment: an Oxymoron?

Alina A. von Davier
Educational Testing Service
April 24th, 2008
Institute of Educational Assessors
National Conference

2
Change in Educational Assessment

There are many major educational assessments that have been shaping society for decades
Maintaining a test that is in sync with the curriculum, shifting demographics, and society's fluid expectations requires adapting the testing instruments
Examples of long-term assessments: SAT I, GRE, TOEFL

3
Stability in Educational Assessment

The questions are:
  Whether the testing instrument can be improved or adapted without changing the meaning of the reported scores
  How to relate scores on the old and new versions of the test (linking)
  What claims the score linking might or might not support

4
Outline

Present an overview of the educational assessment transition process
Discuss statistical and policy implications of the process components
Outline several techniques for investigating the relationship between the test scores (linking) when tests undergo major changes

5
Frequently Asked Questions

What are the consequences of change for score meaning?
Why should we care about whether score meaning is changed? How will that affect the decisions we want to make?
How do we find out if score meaning has been affected?
What can we do to preserve score meaning across changes?

6
Change Purpose
Communication & Accountability
Test Purpose
Change Specification
Tools & Processes
Constraints & Quality
7
Communication & Accountability

"Accountability is the glue which holds the examinations and testing system together." (David Gee, from the Keynote Address at IEA in 2007)
"The problem with communication ... is the illusion that it has been accomplished." (George Bernard Shaw)

8
Communication & Accountability (cont.)

The Assessment Agency (AA) should communicate the goals and requirements to the Testing Agency (TA)
In turn, the TA should provide suggestions, solutions, and risk assessment
The TA chooses the tools and processes and validates them with the AA

9
Running Example

The AA plans a change in the administration mode for a national end-of-course test
The change is to administer the test on computer (CB) instead of using paper-and-pencil (PP)
The change was recommended by the experts in education, the Board of the Assessment, etc.
The AA and a chosen TA begin the dialog

10
Change Purpose
Communication & Accountability
Test Purpose
Change Specification
Tools & Processes
Constraints & Quality
11
Test Purpose

"The results of the test will be used for a myriad of purposes ranging from helping students to improve their learning, assessing school performance, influencing house prices, and, yes, even rating the effectiveness of politicians. It is important to ensure that the data outputs from assessment are not put to too many diverse purposes." (David Gee, from the Keynote Address at IEA in 2007)

12
Test Purpose (cont.)

The purposes of the existing test should be carefully discussed by the AA and TA
An assessment should be constructed to support a primary use
Additional purposes need to be explicit and considered

13
Test Purpose: Examples

Survey assessments
  Group-level reporting
  Low-stakes for the test takers, high-stakes for states/countries
Summative/achievement tests
  Individual reporting
  Precision at all score/ability levels, including at cut-score(s)
  High-stakes for the test takers and for schools/districts
End-of-course tests
  Individual reporting
  Precision at all score/ability levels and at cut-score(s)
  High-stakes for the test takers

14
Test Purpose (cont.)

Licensure and certification tests
  Individual reporting
  Precision at cut-score(s)
  High-stakes for the test takers
Placement or locator tests
  Individual reporting
  Age-independent tests
  Multiple cut-scores
  Potentially high-stakes for the test takers
Formative assessments
  Individual reporting
  Multiple cut-scores
  Low-stakes for the test takers (increasing stakes for the test takers)

15
Running Example (cont.)

The AA communicates the main purpose of the test and describes it in detail
  End-of-course test
  Individual reporting
  Precision at all ability levels and at cut-score(s)
  High-stakes for the test takers
The AA communicates other uses of the test
  School accountability
  Teacher evaluation

16
Change Purpose
Communication & Accountability
Test Purpose
Change Specification
Tools & Processes
Constraints & Quality
17
Change Purpose

"At some point early in the redesign process, the organization must make a conscious decision about what is most important in the test revision ... All the revisions and data collections should be guided by this redesign principle." (Liu & Walker, 2007)

18
Change Purpose (cont.)

Administration mode
Content
Populations
Format

19
Running Example (cont.)

The change is to administer the test on computer instead of using paper-and-pencil
The other aspects of the test should not change

20
Change Purpose
Communication & Accountability
Test Purpose
Change Specification
Tools & Processes
Constraints & Quality
21
Change Specifications & Claims: Standard 4.16 (AERA, APA, NCME, 1999)

"If test specifications are changed from one version of a test to a subsequent version, such changes should be identified in the test manual, and an indication should be given that converted scores for the two versions may not be strictly equivalent. When substantial changes in test specifications occur, either scores should be reported on a new scale or a clear statement should be provided to alert users that the scores are not directly comparable with those on earlier versions of the test." (Standards for Educational and Psychological Testing)

22
Change Specifications & Claims (cont.)

Claims about content comparability (expert judgment, factor structure, correlations, validity)
Claims about score comparability (equating, linking, concordance, reliability, or standard setting)
Equating: it should be a matter of indifference to a test taker which version of the test he or she takes

23
Change Specifications & Claims (cont.)

Claims about the scale scores
The meaning of the reporting scale persists over time even though norms change -> Rescaling?
New/additional linkings to other existing tests

24
Running Example (cont.)

The content of the test should not change
The test should measure the same construct
The scores should be interchangeable: it should be a matter of indifference to a test taker whether the test is CB or PP (equating)
The meaning of the scores should be preserved because test users want to make the same inferences (scaling)
The precision at all ability/score levels should be maintained

25
Change Purpose
Communication & Accountability
Test Purpose
Change Specification
Tools & Processes
Constraints & Quality
26
Constraints and Quality

Cost
Testing time
Technology availability
Security issues
Item disclosure policy
Inability to pretest items

27
Constraints and Quality (cont.)

Time for reporting scores
Request for Constructed Response (CR) items
Ability to mark the CR items
Rater reliability, training, availability of raters
Sample sizes for the special studies for the tests
Accessibility (for, say, students with disabilities or ESL students)

28
Running Example (cont.)

The test will be taken on computers with monitors of no less than 12 inches; the keyboard should be of a standard size
The number of items displayed on one screen is the same as the number of items on a page of the test taken on paper
A sufficiently large sample of motivated test takers should be available for the special studies
The computer-based test should be as reliable as the paper-based test

29
Running Example (cont.)

The test should not be speeded
The items should be as difficult as before
All schools at the national level should have the needed technology available
All students should have been trained to take a test on a computer
Training materials have been made available

30
Running Example (cont.)

Ability to write and mark the open-ended questions on the computer
The raters have received the proper training
The ability to manage the data, merge the files, and clean the data on the computer
Special accommodations

31
Change Purpose
Communication & Accountability
Test Purpose
Change Specification
Tools & Processes
Constraints & Quality
32
Tools & Processes: Overview

Special studies
  Data collection design
  Equity and fairness
Linking versus rescaling
Statistical methods and their assumptions
Software

33
Tools & Processes: Special Studies

Data collection
  Sample size
  Sample representativeness
  Attrition (in a CB vs. PP study, one builds a randomized assignment to one mode or the other, but some disappointed students assigned to PP might drop out of the assessment; Eignor, 2007)
  Design choices: counterbalanced, equivalent, common items
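The random assignment and the attrition concern above can be sketched in a few lines. This is a minimal illustration, not the procedure used in any actual study: the function names, the group labels, and the student IDs are all hypothetical, and a counterbalanced design is assumed.

```python
import random


def assign_counterbalanced(student_ids, seed=0):
    """Randomly split students into two groups that take the two modes in opposite orders."""
    rng = random.Random(seed)          # fixed seed so the assignment is reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "CB_then_PP": sorted(shuffled[:half]),
        "PP_then_CB": sorted(shuffled[half:]),
    }


def attrition_rate(assigned, completed):
    """Share of assigned students who did not complete the assessment."""
    completed = set(completed)
    dropped = sum(1 for s in assigned if s not in completed)
    return dropped / len(assigned)


# Hypothetical student IDs 1..10 (illustration only).
groups = assign_counterbalanced(range(1, 11), seed=2008)
for name, members in groups.items():
    print(name, members)
```

Comparing `attrition_rate` across the two groups is one simple way to see whether drop-out depends on the assigned mode or order.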

34
Tools & Processes: Special Studies (cont.)

Equity and fairness
  Check the claims at the item and test level, for example by computing DIF indices and equating population-invariance indices with respect to the background variables of interest
  Choose appropriate methods and check their assumptions
  Interpretation of results and recommendations
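One widely used item-level DIF index is the Mantel-Haenszel statistic. The slides do not say which index was intended, so the sketch below is an assumption chosen for illustration; the counts are made up, and the function name is hypothetical. Test takers are matched on total score, and each score level contributes one 2x2 table.

```python
import math


def mantel_haenszel_d_dif(strata):
    """Mantel-Haenszel common odds ratio and its delta-scale transform (MH D-DIF).

    strata: one tuple per matched score level, in the order
    (reference correct, reference wrong, focal correct, focal wrong).
    Assumes at least one stratum has nonzero wrong/correct cross-products.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue                   # skip empty score levels
        num += a * d / n
        den += b * c / n
    alpha = num / den                  # common odds ratio across score levels
    return alpha, -2.35 * math.log(alpha)


# Made-up counts: both groups answer correctly at the same rate in each stratum.
no_dif = [(40, 10, 20, 5), (30, 30, 15, 15)]
# Made-up counts: the reference group does better at the same score level.
with_dif = [(40, 10, 15, 10)]

print(mantel_haenszel_d_dif(no_dif))
print(mantel_haenszel_d_dif(with_dif))
```

An odds ratio near 1 (D-DIF near 0) suggests the item functions similarly for both groups; a negative D-DIF indicates the item is relatively harder for the focal group.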

35
Tools & Processes: Linking versus Rescaling

"Rescaling might be a psychometric issue, but decisions about whether to rescale are seldom made by psychometricians." (R. Brennan, 2007, p. 174)

36
Tools & Processes: Linking vs. Rescaling (cont.)

When is linking not enough?
Adapted example (inspired by Golub-Smith, 2005):
  The scale for an achievement test was set many years ago on the cohort that took the test at that time
  The mean of the scale was set to 400 for both Verbal and Math
  Meanwhile, the test-taker population and the mode changed
  The mean of the current norm is 320 for Verbal and 470 for Math
  So a test taker who scores 400 on both Verbal and Math today is actually above average on Verbal and below average on Math, though tradition may suggest that the person's score is average
In a case like this, psychometricians usually suggest that a rescaling is needed
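The arithmetic behind this example can be made explicit. The sketch below uses only the numbers given on the slide (original scale mean of 400, current norm means of 320 and 470); the function name is illustrative.

```python
# Numbers taken from the slide; the function name is illustrative only.
ORIGINAL_SCALE_MEAN = 400
CURRENT_NORM_MEAN = {"Verbal": 320, "Math": 470}


def standing_vs_current_norm(score, section):
    """Difference between a score and the current norm mean for that section."""
    return score - CURRENT_NORM_MEAN[section]


for section in ("Verbal", "Math"):
    diff = standing_vs_current_norm(ORIGINAL_SCALE_MEAN, section)
    label = "above" if diff > 0 else "below" if diff < 0 else "at"
    print(f"{section}: a score of 400 is {abs(diff)} points {label} the current mean")
```

A score that used to mean "average" on both sections now sits 80 points above the Verbal mean and 70 points below the Math mean, which is the drift that motivates rescaling.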

37
Tools & Processes: Statistical Methods and Their Assumptions

"To a man with a hammer, any problem is a nail"
The solution should be chosen through a data-centered approach rather than through a particular hammer
The methods should support the claims

38
Tools & Processes: Methods and Assumptions (cont.)

Usually, the set of assumptions needed for a particular method needs to be checked against the constraints
The TA needs to assess the risk of not meeting one or more assumptions and needs to provide alternative suggestions and methods

39
Tools & Processes: Methods, Classical Test Theory-based

Traditional linear and equipercentile methods
Kernel equating method
Diagnostic and accuracy methods
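As an illustration of the first family of methods, linear equating maps a score x on form X to the scale of form Y by matching means and standard deviations: y = mean_Y + (sd_Y / sd_X)(x - mean_X). The sketch below assumes two randomly equivalent groups and uses made-up scores; the function name is hypothetical, and equipercentile and kernel equating are not shown.

```python
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Map score x on form X to the scale of form Y by matching mean and SD."""
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Made-up raw scores from two randomly equivalent groups (illustration only).
pp_scores = [10, 12, 14, 16, 18]   # paper-and-pencil form
cb_scores = [11, 14, 17, 20, 23]   # computer-based form

# Express a CB score of 20 on the PP scale.
print(linear_equate(20, cb_scores, pp_scores))
```

In this toy data a CB score of 20 lies one step above the CB mean, so it maps to roughly 16, the score one step above the PP mean.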

40
Tools & Processes: Methods, IRT-based

The choice of the appropriate IRT models
The choice of the estimation methods (related to the software below)
The choice of linking method (concurrent calibration, separate calibration with the Stocking & Lord method)
The choice of equating (IRT true-score, IRT observed-score, or linear transformation)
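To make the last choice concrete, here is a minimal sketch of IRT true-score equating. It assumes a two-parameter logistic (2PL) model and item parameters that are already on a common scale after linking; neither assumption comes from the slides, and the item parameters and function names are invented for illustration.

```python
import math


def tcc(theta, items):
    """Test characteristic curve: expected raw score at ability theta under the 2PL model.

    items: list of (discrimination a, difficulty b) pairs, with a > 0.
    """
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in items)


def true_score_equate(score_x, items_x, items_y, lo=-8.0, hi=8.0):
    """Find theta where TCC_X(theta) equals score_x by bisection, then return TCC_Y(theta).

    Assumes 0 < score_x < number of items on form X, so that a solution exists.
    """
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if tcc(mid, items_x) < score_x:
            lo = mid                   # expected score too low: move ability up
        else:
            hi = mid
    return tcc((lo + hi) / 2.0, items_y)


# Invented item parameters on a common scale (illustration only).
form_x = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
form_y = [(1.0, -0.5), (1.0, 0.5), (1.0, 1.5)]   # uniformly harder items

print(true_score_equate(1.5, form_x, form_y))
```

Because every item on form Y is harder, a true score of 1.5 on form X corresponds to a lower true score on form Y.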

41
Tools & Processes: High-quality Software

Appropriate for the chosen method
Estimation methods
License and availability
Training and interface

42
Running Example (cont.)

Design: a sufficiently large and representative sample of motivated test takers who take both versions of the test in different orders (counterbalanced design)
Exploratory analyses: correlations, distribution characteristics, order-effect analyses, factor analyses, DIF
If the tests are similar in difficulty, the factorial structure is the same, the items function similarly, there are no particular order effects, the tests are similarly reliable, and there are no subgroup effects, THEN equating is possible

43
Running Example (cont.)

Which equating method?
If possible, one would choose a method based on the same principles as the scoring method for the PP test
Say, classical test theory-based: choose appropriate models to fit the data; choose an equating function that a) preserves the same pass/fail proportion and similar distributions at the highest scores as in the previous administrations, b) has the smallest standard errors, c) shows no dependence on the subgroups
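Criterion a) can be checked directly once a candidate conversion is available. The sketch below compares the pass rate of converted CB scores with the pass rate of a previous PP administration; the scores, the cut-score, the conversion table, and the function names are all made up for illustration, and criteria b) and c) are not covered.

```python
def pass_rate(scores, cut_score):
    """Proportion of scores at or above the cut-score."""
    return sum(1 for s in scores if s >= cut_score) / len(scores)


def pass_rate_shift(cb_raw, conversion, pp_scores, cut_score):
    """Pass rate of converted CB scores minus pass rate of the earlier PP scores."""
    converted = [conversion[x] for x in cb_raw]
    return pass_rate(converted, cut_score) - pass_rate(pp_scores, cut_score)


# Made-up data (illustration only).
pp_scores = [10, 12, 14, 16, 18]                 # previous PP administration
cb_raw = [11, 14, 17, 20, 23]                    # new CB administration
cut = 14

identity = {x: x for x in cb_raw}                # no conversion at all
candidate = {11: 10, 14: 12, 17: 14, 20: 16, 23: 18}

print(pass_rate_shift(cb_raw, identity, pp_scores, cut))
print(pass_rate_shift(cb_raw, candidate, pp_scores, cut))
```

A shift near zero means the candidate conversion preserves the pass/fail proportion; using the raw CB scores without conversion would pass a larger share of test takers in this toy data.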

44
Running Example (cont.)

Investigate the scaling method
If everything goes well, recommend the use of the newly produced conversion table
If the analyses support the claims, then the scores can be used interchangeably
In this case, there is no need to rescale

45
Change Purpose
Communication & Accountability
Test Purpose
Change Specification
Tools & Processes
Constraints & Quality
46
Discussion

The dialectic of the stability of the reporting
scale and continuous changes of various aspects
of the testing instruments of state or national
assessments is a potential basis for informative
analyses not only for the test practitioners, but
also for the policy makers.

47
Discussion (cont.)

Nowadays, when more standardized testing is used for assessing competencies in different domains nationally and internationally, we are also discovering more challenges in ensuring that the process and the results are fair and accurate.
