Equating And Scaling - PowerPoint PPT Presentation

Slides: 54
Provided by: michaeln153


Equating And Scaling
The Goal of this Session Is
  • To give you a general idea of:
    • What equating is
    • Why we need to do it
    • How it works
  • As part of this, we will discuss:
    • Different equating models
    • A quick review of IRT
    • Scaling

Why Equate?
  • Why would the average raw scores on a test
    administered in 2006 and 2007 differ?
    • The 2006 form is harder than the 2007 form
    • This year's students are better prepared than
      last year's were
    • Both

Why Equate?
  • Equating allows us to determine the extent to
    which:
    • one test is harder than the other (which is
      usually the case)
    • one group is more able (i.e., has more of the
      construct of interest) than the other (also
      usually the case)
  • This enables us to ensure that well-prepared
    examinees get higher scores than the less
    well-prepared, regardless of the test they took

Why Equate?
  • If we gave both groups the same test, we could
    directly compare their performance, but this is
    not practical because of:
    • Security
    • Release of items
  • This is where equating items come in: a subset
    of items administered in both tests

Why Equate?
  • The performance on the equating items is used to
    compare student ability across the two groups
  • We can use this information to determine to what
    extent the difference in performance is due to
    one group being better prepared than the other
  • Once we know that, we can then determine how much
    harder or easier one test is than the other and
    adjust so that scores based on the two tests can
    be compared directly
  • Result: the tests are equated

Equating Items
  • In order to ensure that this is done accurately,
    the equating items should:
    • Have good psychometric properties
    • Be parallel to the overall test in:
      • Content
      • MC vs. CR items
      • Passage length
      • Graphics
      • etc.

Equating Items
  • The difficulty of the non-equating items in each
    test can vary, within reason. However:
  • We don't want the feel of the assessment to
    change differentially for different subgroups of
    students. As one example:
    • If one test is harder than another, lower
      performing students may be more frustrated on the
      harder test
    • If one test is easier than another, higher
      performing students may be more bored and
      unmotivated on the easier test
  • Either of these could result in differences in
    performance on the two tests that are unrelated
    to the construct of interest

Equating Items
  • In addition, equating items should not be changed
    in any way from one administration to the next
  • Again, any change in the item (wording, location
    within the test, response options, etc.) can
    cause a change in student performance that is
    unrelated to the construct of interest

Equating Models
  • Classical test theory models (CTT)
  • Item response theory models (IRT)
  • Internal anchor (counts toward student scores)
  • External anchor (doesn't count toward student
    scores)
    • Intact, separate anchor test
    • Embedded anchor test

Equating Models
  • CTT models are concerned with estimating the
    relationships of the anchor test with each total
    test, and the anchor test in group 1 with the
    anchor test in group 2.
  • IRT models focus on estimating the relationship
    of each item with the underlying trait (θ) that
    is being measured.

Equating Models
[Figure: classical test theory equating diagram linking Test 1, Anchor 1, Anchor 2, and Test 2]
  • Group 1 (2006)
    • Total test score: 30.6
    • Score on equating items: 14.2
  • Group 2 (2007)
    • Total test score: 38.6
    • Score on equating items: 15.5
  • Based on their performance on the equating items,
    we know that Group 2 is a bit higher performing,
    but their total score on the test is quite a bit
    higher, which suggests that the 2007 test is
    easier than the 2006 test
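
The arithmetic behind that comparison can be sketched as follows. The four means come from the slide. The proportional projection is a deliberately crude assumption added only for illustration and is not one of the CTT equating methods, which also use the variances and covariances of the anchor and total scores.

```python
# Means reported on the slide.
total_2006, anchor_2006 = 30.6, 14.2
total_2007, anchor_2007 = 38.6, 15.5

# Difference on the common items reflects the groups, not the forms.
anchor_gain = anchor_2007 - anchor_2006
# Difference on the total tests mixes group ability and form difficulty.
total_gain = total_2007 - total_2006

# ASSUMPTION (illustration only): ability raises total scores in the same
# proportion as anchor scores. Operational CTT equating does not work this way.
projected_total_2007 = total_2006 * (anchor_2007 / anchor_2006)
form_effect = total_2007 - projected_total_2007

print(f"gain on equating items: {anchor_gain:.1f}")
print(f"gain on total test:     {total_gain:.1f}")
print(f"gain left unexplained by ability (crude): {form_effect:.1f}")
```

Under this rough assumption, most of the difference in total scores is left over after accounting for the stronger 2007 group, which is what points to an easier 2007 form.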

Equating Models
  • CTT models are well known, commonly used, and
    relatively easy computationally
  • IRT models have a shorter history and are
    computationally difficult, but they have certain
    advantages that make their use desirable
  • At MP, we use IRT equating models almost
    exclusively

Basics of Item Response Theory
  • Why Use IRT?
  • Review of IRT
  • The Item Characteristic Curve (ICC)
  • The Test Characteristic Curve (TCC)

Why Use IRT?
  • Advantages over CTT:
    • IRT allows us to calculate an estimate of student
      ability (θ), not just observe how a particular
      student performs on a particular test
    • IRT uses the same theta scale to describe
      students and items; this has certain advantages
    • It provides more sophisticated information that
      (depending on the specific model used) takes into
      consideration various characteristics of the item

  • IRT describes the interaction between examinees
    and test items
  • In the simplest case, the probability of a correct
    response is a function of ability and item
    difficulty
  • As more sophisticated models are used, other item
    characteristics are taken into consideration as
    well
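
A minimal sketch of how such a model turns ability and item characteristics into a probability of a correct response. It assumes the three-parameter logistic (3PL) form, which the slides do not name explicitly. The parameter names `a`, `b`, and `c` are conventional labels for discrimination, difficulty, and guessing, chosen here for illustration.

```python
import math

def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under an assumed 3PL model.

    theta: examinee ability
    a: item discrimination (steepness of the curve)
    b: item difficulty (location of the curve on the theta scale)
    c: guessing parameter (lower asymptote of the curve)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# With a = 1 and c = 0 only difficulty matters (the simplest case).
print(icc_3pl(theta=0.0, b=0.0))
# Adding a guessing floor raises the probability for the same examinee.
print(icc_3pl(theta=0.0, b=0.0, c=0.2))
```

Plotting this function against theta for one item gives that item's characteristic curve.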

The Basics
Item Difficulty
Item Discrimination
Item Guessing
A Test is Made up of Many ICCs
For a given examinee with ability (θ) = 1.0
  • The expected score on the total test is equal to
    the sum of the probabilities for each item on the
    test: 0.82 + 0.48 + 0.98 + 0.99 + 0.82 + 0.35 ≈ 4.4

  • Summation of ICCs
  • Describes the relationship between ability and
    expected performance on the whole test
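
The summation can be sketched in a few lines. The 3PL form and the six sets of item parameters below are assumptions made up for illustration; they are not the items shown in the presentation.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under an assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical (a, b, c) parameters for a six-item test.
ITEMS = [
    (1.2, -1.0, 0.20),
    (0.8,  0.0, 0.20),
    (1.5,  0.5, 0.25),
    (1.0,  1.0, 0.20),
    (0.9, -0.5, 0.15),
    (1.1,  1.5, 0.20),
]

def tcc(theta, items=ITEMS):
    """Expected raw score: the sum of the item probabilities at this ability."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items)

for theta in (-2.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}  expected raw score = {tcc(theta):.2f}")
```

Because every ICC rises with ability, the TCC does too, which is what later allows it to be read in the other direction.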

TCC is the sum of the ICCs
Is It Really That Simple?
  • Polytomous Items
  • Parameter Estimation
    • Item Parameters
    • Person Parameters
  • Various IRT Models
  • Examinee-Model Fit

So What Does This Do For Us?
  • Using the TCC, we can estimate the total test
    score for a student at a given level of ability
  • In actuality, however, this isn't what we want to
    do: we already know the students' total raw
    scores; what we don't know is their ability.
  • Fortunately, once we have the ICCs and TCC, we
    can go the other way: we can estimate ability
    based on a student's observed total test score.
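
Going "the other way" amounts to finding the ability at which the TCC equals the observed raw score. The sketch below does this by bisection, which is one simple choice and not necessarily what operational software uses. The 3PL form and the item parameters are hypothetical.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under an assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected raw score at a given ability."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items)

def theta_from_raw(raw, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Find the ability whose expected raw score equals `raw`, by bisection."""
    if not tcc(lo, items) < raw < tcc(hi, items):
        raise ValueError("raw score is outside the range the TCC can reach")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < raw:
            lo = mid  # expected score too low, so ability must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical (a, b, c) item parameters, for illustration only.
ITEMS = [(1.2, -1.0, 0.2), (0.8, 0.0, 0.2), (1.5, 0.5, 0.25), (1.0, 1.0, 0.2)]

print(round(theta_from_raw(2.5, ITEMS), 3))
```

Raw scores at or beyond the ends of the curve (a perfect score, or a score below the guessing floor) have no finite ability estimate, so this sketch raises an error for them.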

So What Does This Have to Do with Equating?
  • Back in 2006, we established the relationship
    between the total test and student ability using
    the theta scale
  • Using the equating items, we can put the 2007
    test on the same scale

How Do We Do This?
  • Estimate item parameters (i.e., calibrate the
    items) for 2006 test
  • Estimate item parameters for 2007 test, fixing
    the parameters for the equating items to their
    2006 values
  • This forces the ability estimates for 2007 to
    be on the same scale as those for 2006
  • As a result, we will get the same ability
    estimate for a student regardless of which test
    they took

2006 and 2007 TCCs on the Same Scale
Typical Equating Process
  • Selecting Equating Items
  • IRT Calibrations/equating
  • Determining scores for reporting (scaling)

Selecting Equating Items
  • Initial Selection:
    • Test questions from last year's test are included
      in this year's test
    • The total points from equating items should be at
      least 40% of the total points on the test
    • The distribution of the items across different
      relevant categories is similar to that of the
      whole test
    • Each item should be in about the same position
      this year and last year

Selecting Equating Items
  • We also do some statistical checks to look for
    items that are functioning very differently in
    2007 than they did in 2006, relative to the rest
    of the equating items
  • If we find those, we will exclude them from use
    as equating items
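
The slides do not say which statistical checks are used. As one possible illustration, the sketch below compares each equating item's difficulty estimate across the two years and flags items whose shift is an outlier relative to the other equating items, using a robust z-score. The method, the threshold, and the difficulty values are all assumptions made for this example.

```python
from statistics import median

def flag_drifting_items(b_2006, b_2007, threshold=3.0):
    """Return items whose difficulty shift is an outlier among the equating items."""
    shifts = {item: b_2007[item] - b_2006[item] for item in b_2006}
    centre = median(shifts.values())
    # Median absolute deviation, rescaled so it is comparable to an SD.
    mad = median(abs(s - centre) for s in shifts.values())
    scale = 1.4826 * mad if mad > 0 else 1e-9
    return sorted(
        item for item, s in shifts.items()
        if abs(s - centre) / scale > threshold
    )

# Hypothetical difficulty estimates from separate calibrations of each year.
B_2006 = {"A": -1.0, "B": -0.5, "C": 0.0, "D": 0.4, "E": 1.1}
B_2007 = {"A": -0.9, "B": -0.45, "C": 0.1, "D": 1.4, "E": 1.15}

print(flag_drifting_items(B_2006, B_2007))
```

Centering on the median shift keeps an overall difference between the two calibrations from being mistaken for drift in individual items.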

Item Calibrations
  • We talked about this earlier, remember?
  • Estimate parameters for 2006 items
  • Estimate parameters for 2007 items, fixing the
    values for the equating items
  • Voilà: the same ability estimate for students,
    regardless of which test they took!

  • It does not really make sense to report scores on
    the raw score metric
  • Equated raw scores do not equal the number of
    points the student achieved on that test, but
    rather the number of points that the student
    would be expected to achieve on the equated-to
    test
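
A sketch of how an equated raw score could be obtained once both forms are on the same theta scale: convert the raw score to an ability through one form's TCC, then read off the expected score on the other form's TCC. The item parameters for both forms are hypothetical, with the 2007 form made easier on purpose, and the guessing parameter is set to zero for simplicity.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under an assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected raw score at a given ability."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items)

def theta_from_raw(raw, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Invert the TCC by bisection (assumes raw is within the reachable range)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def equate_raw(raw_2007, form_2007, form_2006):
    """Expected 2006 raw score for an examinee with this 2007 raw score."""
    theta = theta_from_raw(raw_2007, form_2007)
    return tcc(theta, form_2006)

# Hypothetical (a, b, c) parameters on a common scale; 2007 is the easier form.
FORM_2006 = [(1.0, -0.5, 0.0), (1.0, 0.0, 0.0), (1.0, 0.5, 0.0),
             (1.0, 1.0, 0.0), (1.0, 1.5, 0.0)]
FORM_2007 = [(1.0, -1.5, 0.0), (1.0, -1.0, 0.0), (1.0, -0.5, 0.0),
             (1.0, 0.0, 0.0), (1.0, 0.5, 0.0)]

for raw in (1.0, 2.0, 3.0, 4.0):
    print(raw, round(equate_raw(raw, FORM_2007, FORM_2006), 2))
```

The equated values are generally not whole numbers, which is one reason they are awkward to report directly.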

  • Similarly, it does not really make sense to
    report scores on the theta metric
  • While psychometricians are quite fond of theta
    scores, they have some unfortunate
    characteristics (decimal and negative values)
    that would make them alarming to most test users
  • (Note: "they" in the previous sentence refers to
    the theta scores)

  • It does make sense to report scores on an
    arbitrary scale that has no inherent meaning.
  • The meaning of the scale is defined by the
  • Scaled scores are typically a linear
    transformation of ability estimates
  • Example of a linear transformation:
    • Scaled score = (Ability × Slope) + Intercept
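
The transformation can be written directly as code. The slope and intercept below are hypothetical values picked so that the scaled scores are positive whole-number-like values; the slides do not give the constants actually used.

```python
SLOPE = 10.0      # hypothetical: scaled-score points per unit of theta
INTERCEPT = 50.0  # hypothetical: scaled score assigned to theta = 0

def scaled_score(theta, slope=SLOPE, intercept=INTERCEPT):
    """Linear transformation of an ability estimate onto the reporting scale."""
    return theta * slope + intercept

for theta in (-1.5, 0.0, 1.2):
    print(f"theta = {theta:+.1f}  scaled score = {scaled_score(theta):.0f}")
```

With these constants the decimal and negative theta values become ordinary-looking positive scores, while the ordering of examinees is unchanged.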

  • This appears to be pretty simple, but, like most
    things, scaling is more complicated than it
    appears at first

Issues in Scaling
  • Endpoints:
    • If one test is more difficult than the other, the
      highest possible raw score on the harder test
      ought to result in a higher scaled score than the
      top score on the easier test.
    • However, top and bottom scores may be truncated so
      that a student who gets one or more items wrong
      may still receive the top scaled score, or a
      student who gets some items right may still
      receive the lowest scaled score.
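
Truncation at the endpoints can be sketched as a simple clamp. The scale limits of 20 and 80 and the untruncated scores are hypothetical.

```python
LOWEST, HIGHEST = 20, 80  # hypothetical reporting-scale limits

def truncate(scaled, lowest=LOWEST, highest=HIGHEST):
    """Clamp a scaled score to the reporting scale's endpoints."""
    return max(lowest, min(highest, scaled))

# Hypothetical untruncated scaled scores for five examinees.
for untruncated in (12, 18, 55, 84, 91):
    print(untruncated, "->", truncate(untruncated))
```

Here two different high scores are reported as the same top value, and two different low scores as the same bottom value, which is the effect described above.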

Issues in Scaling
  • Number of points:
    • Should be sufficient to differentiate examinees.
    • Should not be more than the number of raw score
      points
  • Cut points:
    • If more than two cut points are used and each
      cut point is a pre-determined scaled score, the
      scale will be non-linear. In this case taking
      averages is questionable.

Issues in Scaling
  • Scale compression and/or expansion:
    • Occurs if cut points are very close together on
      the theta scale and far apart on the scaled score
      scale, or vice versa
    • You can have compression in one part of the scale
      and expansion in another part
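
One way to picture compression and expansion is a piecewise-linear scale in which cut points are pinned to pre-determined scaled scores. The theta values and scaled scores below are hypothetical, chosen so that two cut points sit close together on the theta scale but far apart on the reporting scale.

```python
# Hypothetical (theta, scaled score) pairs. The three middle pairs play the
# role of cut points fixed in advance on the reporting scale.
KNOTS = [(-4.0, 0.0), (-1.0, 40.0), (-0.8, 60.0), (1.0, 80.0), (4.0, 100.0)]

def piecewise_scaled(theta, knots=KNOTS):
    """Interpolate linearly between adjacent knots; clamp outside the range."""
    if theta <= knots[0][0]:
        return knots[0][1]
    for (t0, s0), (t1, s1) in zip(knots, knots[1:]):
        if theta <= t1:
            return s0 + (theta - t0) * (s1 - s0) / (t1 - t0)
    return knots[-1][1]

def segment_slopes(knots=KNOTS):
    """Scaled-score points per unit of theta within each segment."""
    return [(s1 - s0) / (t1 - t0) for (t0, s0), (t1, s1) in zip(knots, knots[1:])]

print([round(s, 1) for s in segment_slopes()])
print(round(piecewise_scaled(-0.9), 1))
```

A steep segment stretches small ability differences into large scaled-score differences (expansion), and a shallow segment does the opposite (compression). Because the slope changes from segment to segment, equal scaled-score differences do not mean equal ability differences across the scale.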

Determining Scaled Scores
[Raw Score to Scaled Score conversion display not reproduced]