1
Equating And Scaling
2
The Goal of this Session Is
  • To give you a general idea of
  • What equating is
  • Why we need to do it
  • How it works
  • As part of this, we will discuss
  • Different equating models
  • A quick review of IRT
  • Scaling

3
Why Equate?
  • Why would the average raw scores on a test
    administered in 2006 and 2007 differ?

4
Why Equate?
  • Why would the average raw scores on a test in
    2006 and 2007 differ?
  • The 2006 form is harder than the 2007 form
  • This year's students are better prepared than
    last year's were
  • Both

5
Why Equate?
  • Equating allows us to determine the extent to
    which
  • one test is harder than the other (which is
    usually the case)
  • one group is more able (i.e., has more of the
    construct of interest) than the other (also
    usually the case)
  • This enables us to ensure that well-prepared
    examinees get higher scores than the less
    well-prepared, regardless of the test they took

6
Why Equate?
  • If we gave both groups the same test, we could
    directly compare their performance, but this is
    not practical
  • Security
  • Release of items
  • This is where equating items come in: a subset
    of items administered in both tests

7
Why Equate?
  • The performance on the equating items is used to
    compare student ability across the two groups
  • We can use this information to determine to what
    extent the difference in performance is due to
    one group being better prepared than the other
  • Once we know that, we can then determine how much
    harder or easier one test is than the other and
    adjust so that scores based on the two tests can
    be compared directly
  • → the tests are equated

8
Equating Items
  • In order to ensure that this is done accurately,
    the equating items should have the following
    characteristics
  • Good psychometric properties
  • Be parallel to the overall test
  • Content
  • MC vs. CR items
  • Passage length
  • Graphics
  • etc.

9
Equating Items
  • The difficulty of the non-equating items in each
    test can vary, within reason. However:
  • We don't want the feel of the assessment to
    change differentially for different subgroups of
    students. As one example:
  • If one test is harder than another, lower
    performing students may be more frustrated on the
    harder test
  • If one test is easier than another, higher
    performing students may be more bored and
    unmotivated on the easier test
  • Either of these could result in differences in
    performance on the two tests that are unrelated
    to the construct of interest

10
Equating Items
  • In addition, equating items should not be changed
    in any way from one administration to the next
  • Again, any change in the item (wording, location
    within the test, response options, etc.) can
    cause a change in student performance that is
    unrelated to the construct of interest

MUST!
11
Equating Models
  • Classical test theory models (CTT)
  • Item response theory models (IRT)
  • Internal anchor (counts toward student scores)
  • External anchor (doesn't count toward student
    scores)
  • Intact, separate anchor test
  • Embedded anchor test

12
Equating Models
  • CTT models are concerned with estimating the
    relationships of the anchor test with each total
    test, and the anchor test in group 1 with the
    anchor test in group 2.
  • IRT models focus on estimating the relationship
    of each item with the underlying trait (θ) that
    is being measured.

13
Equating Models
[Diagram: classical test theory equating. The difficulty of Test 1 and Test 2 is linked through Anchor 1 and Anchor 2, which place both forms on a common ability scale.]
14
Example
  • Group 1 (2006)
  • Total test score: 30.6
  • Score on equating items: 14.2
  • Group 2 (2007)
  • Total test score: 38.6
  • Score on equating items: 15.5
  • Based on their performance on the equating items,
    we know that Group 2 is a bit higher performing,
    but their total score on the test is quite a bit
    higher, which suggests that the 2007 test is
    easier (see the sketch below).
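As a rough illustration of this logic, here is a minimal sketch in Python. The test lengths and the proportional projection of the anchor gap onto the total-score metric are assumptions made for illustration; operational CTT methods (e.g., Tucker or Levine equating) use regression-based adjustments instead.

    # A minimal sketch of slide 14's reasoning: split the observed difference
    # in total scores into an ability component (signaled by the anchor) and
    # a form-difficulty component. Test lengths and the proportional scaling
    # below are assumptions for illustration only.
    anchor_pts, total_pts = 20, 50          # hypothetical maximum points

    g1_total, g1_anchor = 30.6, 14.2        # Group 1 (2006)
    g2_total, g2_anchor = 38.6, 15.5        # Group 2 (2007)

    # Project the anchor difference onto the total-score metric.
    ability_gap = (g2_anchor - g1_anchor) * (total_pts / anchor_pts)
    observed_gap = g2_total - g1_total
    form_gap = observed_gap - ability_gap   # what's left is form easiness

    print(f"Ability difference: ~{ability_gap:.1f} total-score points")
    print(f"2007 form easier by: ~{form_gap:.1f} points")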

15
Equating Models
  • CTT models are well known, commonly used, and are
    relatively easy computationally
  • IRT models have a shorter history and are
    computationally difficult, but they have certain
    advantages that make their use desirable
  • At MP, we use IRT equating models almost
    exclusively

16
Basics of Item Response Theory
  • Why Use IRT?
  • Review of IRT
  • The Item Characteristic Curve (ICC)
  • The Test Characteristic Curve (TCC)

17
Why Use IRT?
  • Advantages over CTT
  • IRT allows us to calculate an estimate of student
    ability (θ), not just observe how a particular
    student performs on a particular test
  • IRT uses the same theta scale to describe
    students and items; this has certain advantages
  • It provides more sophisticated information that
    (depending on the specific model used) takes into
    consideration various characteristics of the item

18
The ICC
  • Describes the interaction between examinees and
    test items
  • In the simplest case, the probability of a
    correct response is a function of the difference
    between examinee ability and item difficulty
  • As more sophisticated models are used, other item
    characteristics are taken into consideration as
    well

19
The Basics
20
The Basics
21
Item Difficulty
22
Item Discrimination
23
Item Guessing
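Slides 21-23 correspond to the three item parameters of the three-parameter logistic (3PL) model: difficulty (b), discrimination (a), and guessing (c). The deck does not name a specific model, so treat this as a sketch under that assumption, with illustrative parameter values:

    import numpy as np

    # 3PL item characteristic curve: the probability that an examinee at
    # ability theta answers the item correctly. a = discrimination,
    # b = difficulty, c = guessing (lower asymptote).
    def icc_3pl(theta, a, b, c):
        return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

    # Illustrative values: a moderately discriminating item of average
    # difficulty with a 20% guessing floor.
    print(icc_3pl(np.linspace(-3, 3, 7), a=1.2, b=0.0, c=0.2))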
24-29
A Test is Made up of Many ICCs
[Graphs: the test's ICCs are added to the plot one at a time across these slides]
30
For a given examinee with ability (θ) = 1.0
31
For a given examinee with ability (θ) = 1.0
  • The expected score on the total test is equal to
    the sum of the probabilities for each item on the
    test: 0.82 + 0.48 + 0.98 + 0.99 + 0.82 + 0.35 = 4.44

32
The TCC
  • Summation of ICCs
  • Describes the relationship between ability and
    expected performance on the whole test
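A minimal sketch of the TCC, reusing the 3PL ICC from the earlier sketch; the six sets of item parameters are hypothetical, not the ones behind the slide-31 figures:

    import numpy as np

    def icc_3pl(theta, a, b, c):
        return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

    # Hypothetical (a, b, c) parameters for a six-item test.
    items = [(1.0, -1.5, 0.20), (0.8, 1.0, 0.20), (1.3, -2.0, 0.25),
             (1.1, -2.5, 0.20), (0.9, -1.2, 0.20), (1.4, 1.6, 0.20)]

    # The TCC: the expected raw score at a given ability is the sum of
    # the item probabilities at that ability (cf. slide 31).
    def tcc(theta):
        return sum(icc_3pl(theta, a, b, c) for a, b, c in items)

    print(round(tcc(1.0), 2))  # expected raw score at theta = 1.0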

33-34
TCC is the sum of the ICCs
[Graph: the individual ICCs and their sum, the TCC]
35
Is It Really That Simple?
  • Polytomous Items
  • Parameter Estimation
  • Item Parameters
  • Person Parameters
  • Various IRT Models
  • Examinee-Model Fit

36
So What Does This Do For Us?
  • Using the TCC, we can estimate the total test
    score for a student at a given level of ability
  • In actuality, however, this isn't what we want to
    do; we already know the students' total raw
    scores. What we don't know is their ability.
  • Fortunately, once we have the ICCs and TCC, we
    can go the other way: we can estimate ability
    based on a student's observed total test score,
    as sketched below.
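A minimal sketch of "going the other way": invert the TCC with a root-finder so that the expected raw score matches the observed one. The 2PL item parameters are hypothetical, and raw scores of zero or perfect would need special handling.

    import numpy as np
    from scipy.optimize import brentq

    # Hypothetical 2PL (a, b) parameters for a five-item test.
    items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.2), (0.9, -0.3)]

    def tcc(theta):
        return sum(1.0 / (1.0 + np.exp(-a * (theta - b))) for a, b in items)

    # Find the theta whose expected raw score equals the observed raw score.
    # (Scores of 0 or the maximum lie outside the TCC's open range.)
    def theta_from_raw(raw):
        return brentq(lambda t: tcc(t) - raw, -6.0, 6.0)

    print(round(theta_from_raw(3.0), 2))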

37
So What Does This Have to Do with Equating?
  • Back in 2006, we established the relationship
    between the total test and student ability using
    the theta scale
  • Using the equating items, we can put the 2007
    test on the same scale

38
How Do We Do This?
  • Estimate item parameters (i.e., calibrate the
    items) for the 2006 test
  • Estimate item parameters for the 2007 test,
    fixing the parameters for the equating items to
    their 2006 values
  • This forces the ability estimates for 2007 to
    be on the same scale as those for 2006
  • As a result, we will get the same ability
    estimate for a student regardless of which test
    they took
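The fixed-parameter calibration the slide describes requires full IRT estimation software. As a stand-in that makes the "same scale" idea concrete, here is a sketch of mean-sigma linking, a common alternative that derives a linear transformation from the equating items' difficulty estimates in two separate calibrations; the b-values are hypothetical.

    import numpy as np

    # Difficulty estimates for the same equating items from two separate
    # calibrations (hypothetical values).
    b_2006 = np.array([-0.5, 0.2, 0.9, 1.4])
    b_2007 = np.array([-0.8, -0.1, 0.6, 1.1])

    # Mean-sigma linking: match the mean and SD of the anchor difficulties.
    slope = b_2006.std() / b_2007.std()
    intercept = b_2006.mean() - slope * b_2007.mean()

    def to_2006_scale(theta_2007):
        """Place a 2007 ability estimate on the 2006 theta scale."""
        return slope * theta_2007 + intercept

    print(round(to_2006_scale(0.0), 2))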

39
2006 and 2007 TCCs on the Same Scale
40
Typical Equating Process
  • Selecting Equating Items
  • IRT Calibrations/equating
  • Determining scores for reporting (scaling)

41
Selecting Equating Items
  • Initial Selection
  • Test questions from last year's test are included
    in this year's test
  • The total points from equating items should be at
    least 40% of the total points on the test
  • The distribution of the items across different
    relevant categories is similar to that of the
    whole test
  • Each item should be in about the same position
    this year and last year

42
Selecting Equating Items
  • We also do some statistical checks to look for
    items that are functioning very differently in
    2007 than they did in 2006, relative to the rest
    of the equating items
  • If we find those, we will exclude them from use
    as equating items
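One simple version of such a check, sketched below: compare each equating item's p-value (proportion correct) across years and flag items whose change is an outlier relative to the other equating items. The values and the flagging threshold are assumptions; operational screening criteria vary.

    import numpy as np

    # Hypothetical p-values for five equating items in each year.
    p_2006 = np.array([0.72, 0.55, 0.81, 0.43, 0.66])
    p_2007 = np.array([0.74, 0.38, 0.80, 0.45, 0.68])

    # Standardize the year-to-year changes and flag outliers.
    change = p_2007 - p_2006
    z = (change - change.mean()) / change.std(ddof=1)
    flagged = np.where(np.abs(z) > 1.5)[0]        # threshold is an assumption
    print("Exclude from equating set:", flagged)  # item 1 drifted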

43
Item Calibrations
  • We talked about this earlier, remember?
  • Estimate parameters for 2006 items
  • Estimate parameters for 2007 items, fixing the
    values for the equating items
  • Voilà: the same ability estimate for students,
    regardless of which test they took!

44
Scaling
  • It does not really make sense to report scores on
    the raw score metric
  • Equated raw scores do not equal the number of
    points the student achieved on that test, but
    rather the number of points that the student
    would be expected to achieve on the "equated-to"
    test

45
Scaling
  • Similarly, it does not really make sense to
    report scores on the theta metric
  • While psychometricians are quite fond of theta
    scores, they have some unfortunate
    characteristics (decimal and negative values)
    that would make them alarming to most test users
  • (Note: "they" in the previous sentence refers to
    the theta scores)

46
Scaling
  • It does make sense to report scores on an
    arbitrary scale that has no inherent meaning.
  • The meaning of the scale is defined by the
    assessment
  • Scaled scores are typically a linear
    transformation of ability estimates
  • Example of a linear transformation
  • (Ability × Slope) + Intercept
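A minimal sketch of such a transformation, with hypothetical choices: the slope and intercept are picked so that θ = -3 maps to a scaled score of 200 and θ = +3 to 800, with truncation at the endpoints (see "Issues in Scaling" on the next slides).

    # Hypothetical scale: theta in [-3, 3] maps linearly onto [200, 800].
    LO_T, HI_T = -3.0, 3.0
    LO_S, HI_S = 200, 800

    slope = (HI_S - LO_S) / (HI_T - LO_T)     # 100 points per theta unit
    intercept = LO_S - slope * LO_T           # 500

    def scaled_score(theta):
        raw = theta * slope + intercept
        # Truncate at the scale endpoints (cf. "Issues in Scaling").
        return int(round(min(max(raw, LO_S), HI_S)))

    print(scaled_score(1.0))   # 600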

47
Scaling
  • This appears to be pretty simple, but, like most
    things, scaling is more complicated than it
    appears at first

48
Issues in Scaling
  • Endpoints
  • If one test is more difficult than the other, the
    highest possible raw score on the harder test
    ought to result in a higher scaled score than the
    top score on the easier test.
  • However, top and bottom scores may be truncated
    so that a student who gets one or more items
    wrong may still receive the top scaled score, or
    a student who gets some items right may still
    receive the lowest scaled score.

49
Issues in Scaling
  • Number of points
  • Should be sufficient to differentiate examinees.
  • Should not be more than the number of raw score
    points.
  • Cut points
  • If more than two cut points are used and each cut
    point is a pre-determined scaled score, the scale
    will be non-linear. In this case, taking averages
    is questionable (see the sketch below).
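A sketch of why pre-set cut points make the scale non-linear: with hypothetical theta cuts pinned to fixed scaled scores, the transformation becomes piecewise linear, so each segment has its own slope and averaging scaled scores across segments no longer corresponds to averaging ability.

    import numpy as np

    # Hypothetical: scale endpoints plus three cut points, each pinned to
    # a pre-determined scaled score.
    theta_points  = [-3.0, -0.8, 0.1, 1.5, 3.0]
    scaled_points = [200, 400, 500, 650, 800]

    # Piecewise-linear interpolation: a different slope on each segment.
    def scaled_score(theta):
        return float(np.interp(theta, theta_points, scaled_points))

    # The same theta gap yields different scaled-score gaps in different
    # regions of the scale (compression/expansion, cf. slide 50).
    print(scaled_score(0.1) - scaled_score(-0.8))  # 100 points over 0.9 theta
    print(scaled_score(3.0) - scaled_score(1.5))   # 150 points over 1.5 theta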

50
Issues in Scaling
  • Scale compression and/or expansion
  • If cut points are very close together on the
    theta scale and far apart on the scaled score
    scale, or vice versa
  • You can have compression in one part of the scale
    and expansion in another part

51-53
Determining Scaled Scores
[Graphs: conversion from Raw Score (x-axis) to Scaled Score (y-axis)]