1
Approaches to Evaluating Test Items
Traditional
and
Not-So-Traditional
Training Workshop
FALL MAPAC MEETING: TECHNICAL ISSUES IN SELECTION
JOHN JAY COLLEGE OF CRIMINAL JUSTICE, THE CITY UNIVERSITY OF NEW YORK
NEW YORK CITY, SEPTEMBER 29 - OCTOBER 1, 1999
2
TOPICS
  • Traditional methods of evaluating the
    effectiveness of test items
  • Multiple-Choice Test Items
  • Not-so-traditional approaches to evaluating
    not-so-traditional test items
  • Short Answer Essay Test Items
  • Written Simulation Exercise Items

3
Considerations
  • What are the issues to consider when evaluating
    the effectiveness of test items in an
    examination?
  • What is a test item designed to do and how do we
    know when a test item is not doing its job?
  • What are good item analysis results and what
    does good item analysis look like?
  • Do good test items always generate good item
    analysis results?
  • Does bad item analysis indicate that the item
    was bad?

4
The Purpose of the Test Item
  • To make distinctions!!
  • Nothing More and Nothing Less.

5
Item Analysis -
  • A basic tool for evaluating whether a test item
    is performing this function, i.e., is the item
    making distinctions?
  • Item analysis is a general term for a variety of
    statistical methods of evaluating items.
  • For Multiple-Choice test items, item analysis
    generally involves a tabular presentation of the
    way candidates answered a given item.

6
Item Analysis
Some basic definitions
  • P = item difficulty or, more accurately, the % of candidates getting the answer correct.
  • H = think of it as a grouping of high(er) scoring candidates taking the test.
  • L = think of it as a grouping of low(er) scoring candidates taking the test.
  • The H / L split is typically a median split around a criterion measure, e.g., job performance, total-test score, part-test score, etc.
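
To make the tabular idea concrete, here is a minimal Python sketch with entirely hypothetical data: a median split on total-test score defines the H and L groups, and the table counts each group's answers per option.

```python
from collections import Counter

# Each candidate: (total-test score, answer chosen on this item).
# All numbers are hypothetical, for illustration only.
candidates = [(42, "A"), (38, "A"), (35, "C"), (31, "B"),
              (29, "A"), (27, "D"), (24, "B"), (20, "C")]
key = "A"  # keyed (correct) answer

scores = sorted(s for s, _ in candidates)
median = scores[len(scores) // 2]  # median split around total-test score

high = Counter(ans for s, ans in candidates if s >= median)  # H group
low = Counter(ans for s, ans in candidates if s < median)    # L group

p = sum(ans == key for _, ans in candidates) / len(candidates)
print(f"P (item difficulty) = {p:.2f}")
for option in "ABCD":
    mark = "*" if option == key else " "
    print(f"{option}{mark}  H={high[option]}  L={low[option]}")
```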

7
Item Analysis
  • iri is a point-biserial correlation between the test item and the criterion measure, multiplied by the item's standard deviation
  • In English, iri is the degree to which the item is measuring the same thing as the criterion measure. iri is sometimes referred to as the item reliability index or item discrimination index
  • iri values between .10 and .20 indicate that the item is working in the same direction as the other items
  • iri values above .20 are working very well, such as in the above example

8
A momentary side-bar… just how does one compute the iri anyway, and why should we know how it is computed?
iri = r_pb × s_i = [(M+ − M) / S] × √(Pg / (1 − Pg)) × s_i

Where:
Pg = percent of the group getting the item correct
M+ = subtest mean for candidates in the group who got the item correct
M = subtest mean for all candidates
S = subtest standard deviation for all candidates
s_i = item standard deviation
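
As a minimal computational sketch (hypothetical data; variable names follow the definitions above): with s_i taken as the binomial √(Pg(1 − Pg)), the product r_pb × s_i simplifies algebraically to Pg(M+ − M)/S, which is what the code computes.

```python
import statistics

item = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]               # 1 = correct, 0 = incorrect
subtest = [45, 41, 30, 38, 28, 36, 25, 27, 40, 22]  # subtest scores

pg = sum(item) / len(item)                           # Pg: proportion correct
m_plus = statistics.mean(s for s, i in zip(subtest, item) if i == 1)
m_all = statistics.mean(subtest)                     # M: mean for all candidates
sd_all = statistics.pstdev(subtest)                  # S: SD for all candidates

iri = pg * (m_plus - m_all) / sd_all                 # Pg * (M+ - M) / S
print(f"Pg = {pg:.2f}, M+ = {m_plus:.2f}, M = {m_all:.2f}, iri = {iri:.3f}")
```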
9
iri Estimate Formula
  • It is not often easy to determine M+, the subtest mean for candidates in the group who got the item correct
  • The estimation formula does not replicate the full calculation, but can be used as a quick estimate

10
Using item analysis, you can tell a lot about an item without even looking at the test item itself… BUT, BE CAREFUL...
  • And the answer key for item 1 is…?
  • Does the item appear to be fulfilling its life's mission, i.e., to make distinctions?
  • How are the wrong answer choices (distracters) working?
  • But why the cautionary "be careful"?
  • Handout (A) - Example of two item analysis reports

11
A Look at Item Analysis -- Sans the Item
  • And the answer key for item 8 is…?
  • Does the item appear to be fulfilling its life's mission, i.e., to make distinctions?
  • How are the wrong answer choices (distracters) working?
  • But why the cautionary "be careful"?
  • Now what would you think if I told you that item 8 was mis-keyed? The correct key for the item is D, not A.

12
A Look at Item Analysis -- Sans the Item
  • Now, based again only on the item analysis, the answer key for item 8 is…?
  • Does the item appear to be fulfilling its life's mission, i.e., to make distinctions?
  • How are the wrong answer choices (distracters) working?
  • Therefore, the cautionary "be careful."

13
A Look at Item Analysis -- Sans the Item
  • Relying too heavily on a review of the item analysis alone to evaluate a test item is fraught with danger… sounds obvious, but you might be surprised.
  • Due to the computational formulas used for iri, high-low numbers are strongly biased in favor of the selected key. In most cases, a change in the identified key will make a significant difference in the high-low output for that particular item, while not significantly changing the high-low output for other items in the subtest.
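
A minimal sketch of that implication, under the assumption that per-candidate responses and subtest scores are available: re-score the item under each possible key and compare the discrimination values; a mis-keyed item will typically discriminate best under its true key.

```python
import statistics

# Hypothetical data: each candidate's chosen option and subtest score.
responses = ["D", "D", "A", "D", "B", "A", "C", "D", "B", "C"]
subtest = [45, 41, 30, 38, 28, 25, 24, 40, 22, 27]

def discrimination(key):
    """Pg * (M+ - M) / S, the iri simplification sketched earlier."""
    hits = [s for r, s in zip(responses, subtest) if r == key]
    if not hits or len(hits) == len(subtest):
        return 0.0
    pg = len(hits) / len(subtest)
    return pg * (statistics.mean(hits) - statistics.mean(subtest)) / statistics.pstdev(subtest)

for key in "ABCD":
    print(key, round(discrimination(key), 3))  # the true key stands out
```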

14
Enough Said.
Therefore, just be careful when looking at
numbers. Numbers alone may deceive you...
15
An Item Doesn't Exist in a Vacuum
Like most other things in life and nature, an
item must be evaluated in its context.
  • The other items in the test or subtest grouping.
  • Is it a homogeneous grouping of items that is
    intended to measure the same trait?
  • Or is it a heterogeneous grouping of items that
    is designed to measure a broad range of traits or
    knowledges?
  • Answers to these questions will suggest what type
    of iri (item reliability index) one would expect
    to get...

16
An Item Doesn't Exist in a Vacuum
Like most other things in life and nature, an
item must be evaluated in its context.
  • The candidates in the competition who are responding to the items: is it a homogeneous group or a heterogeneous group?
  • Is it a homogeneous grouping of candidates such
    as a promotional candidate field having the same
    or similar expertise in the trait being measured?
  • Or is it a heterogeneous grouping of candidates
    such as a sample of high school graduates having
    a varying degree of expertise in the trait being
    measured?
  • Again, answers to these questions will suggest
    what type of iri (item reliability index) one
    would expect to get...

17
Enough with the Introduction Stuff, Let's get to
work
  • How can item analysis be used to evaluate whether
    or not test items have differential item
    functioning or have impact?
  • For purposes of this exercise, the referent "differential item functioning" or "impact" at the item level is defined as a finding of observable differences in difficulty levels between gender and/or ethnic groups.

18
Handout (B) - Two Approaches for Consideration
  • One to be used as an indicator of difference (think barometer).
  • The other to be used as an indicator of the significance of differences.
  • Method One - Using Difference in Item Mean
    Difficulties to Evaluate Score Differences
    between Candidate Groups
  • Method Two - The Chi Square Approach to Evaluate
    Score Differences between Candidate Groups

19
Method One
  • Refer to Handout (B) Attachments B-1, B-2 and B-3

Step 1
  • For each item, compute the average item difficulty for Whites (PW) and Blacks (PB).
  • (1PW = ?, 1PB = ?; 2PW = ?, 2PB = ?; etc.)
  • Example - item 1
  • 1PW = 0.63
  • 1PB = 0.53

20
Method One (continued)
Step 2
  • Refer to Handout (B) Attachment B-3

Subtract the item difficulty for Blacks from the item difficulty for Whites to obtain an ethnic difference rating for each item (itemEDIF).
  • Example - item 1
  • 1EDIF = 1PW - 1PB
  • 1EDIF = 0.63 - 0.53
  • 1EDIF = 0.10
  • Example - item 2
  • 2EDIF = 2PW - 2PB
  • 2EDIF = 0.82 - 0.60
  • 2EDIF = 0.22

21
Method One (continued)
Step 3
  • Refer to Handout (B) Attachment B-3

Compute the mean item difficulty level across all items in the subtest for Whites (mean PW), then compute the mean item difficulty level across all items in the subtest for Blacks (mean PB).
  • Example
  • mean PW = 9.42/15 = 0.628 = mean item difficulty for Whites
  • mean PB = 7.57/15 = 0.505 = mean item difficulty for Blacks

22
Method One (continued)
Step 4
  • Refer to Handout (B) Attachment B-3

Compute an ethnic group difference mean statistic (EGDIF) by subtracting the subtest mean difficulty level for Blacks from the subtest mean difficulty level for Whites.
  • Example
  • EGDIF = (mean PW) - (mean PB)
  • EGDIF = 0.628 - 0.505
  • EGDIF = 0.123, the mean difference between White and Black candidates

23
Now, go back to your computations in Step 2
Step 5
  • Flag any item whose ethnic item difference statistic (itemEDIF) exceeds the ethnic group difference mean statistic computed in Step 4 (EGDIF).
  • Example Item 1 does not exceed the ethnic group
    difference mean statistic, but items 2, 6, 7, 8,
    9, 13, 14 and 15 do...
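
Pulling Steps 1 through 5 together, a minimal sketch: the item 1 and item 2 difficulties and the subtest means are the values shown above; a real run would loop over all 15 items in Attachment B-3.

```python
# Method One, Steps 1-5, on the two items shown in the examples above.
p_white = {1: 0.63, 2: 0.82}  # Step 1: item difficulty (P) by group
p_black = {1: 0.53, 2: 0.60}

edif = {i: p_white[i] - p_black[i] for i in p_white}  # Step 2: itemEDIF

mean_w, mean_b = 0.628, 0.505  # Step 3: subtest mean difficulties (given)
egdif = mean_w - mean_b        # Step 4: EGDIF = 0.123

flagged = [i for i, d in edif.items() if d > egdif]   # Step 5: flag items
print(f"EGDIF = {egdif:.3f}, flagged items: {flagged}")  # item 2 (0.22 > 0.123)
```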

24
Assumptions and Implications
  • This method of identifying items which have
    apparent ethnic differences in results allows any
    existing main effect differences between groups
    to remain.
  • This method defines differential item functioning
    as any difference greater than existing main
    effect differences.
  • This method assumes that the main effect
    differences may reflect real underlying
    differences between groups rather than
    differential item functioning.
  • This method does not allow for error variance,
    i.e., the possibility that flagged items would
    differ upon readministration.

25
Method One continued...
  • Now what do we do?
  • This method of evaluating potential differential item functioning or impact can be viewed as a barometer… the higher the number, the greater the pressure.
  • Now you must look at the test items to attempt to determine what is contributing to these results.
  • It would be good to invite a sensitivity review
    by representatives of the protected classes and
    ask them to suggest reasons why DIF appears to be
    operating in the item.
  • If you are pre-testing your questions, you may
    wish to exclude some items from the examination
    that seriously exceed average differences.

26
Method Two - The Chi Square Approach to Evaluate
Score Differences between Candidate Groups
  • Simply described, Chi Square is a nifty little
    tool that evaluates the significance of the
    differences in response patterns between what we
    would expect (if in fact there were no
    differences between groups) and what we observe.

27
Method Two continued...
Example: Let's look at Handout (B) Attachments B-1 and B-2… specifically, the item analysis for item 1.
No. Whites = 766
  • What is the frequency of White Candidates
    selecting the correct answer?
  • 484 (simply add High and Low candidates on A)

28
Method Two continued...
No. Whites = 766
  • What is the % of White candidates selecting the correct answer?
  • 63%
  • What is the % of White candidates NOT selecting the correct answer?
  • 37%

29
Method Two continued...
  • Now, if there were no differences in the response patterns between White and Black candidates on this item, we would expect approximately 63% of the Black candidates, or 92 out of 146, to select the correct answer, and approximately 37% of the Black candidates, or 54 out of 146, to get the item wrong… that is what we would expect, but what did we observe?

30
Method Two continued...
No. Blacks = 146
  • What is the frequency of Black candidates selecting the correct answer?
  • 77 (simply add High and Low candidates on A)
  • What is the % of Black candidates selecting the correct answer?
  • 53%

31
Method Two continued...
No. Blacks = 146
  • What is the frequency of Black candidates NOT selecting the correct answer?
  • 69 (simply subtract the number of Black candidates selecting the keyed answer from the total number of Black candidates)
  • What is the % of Black candidates NOT selecting the correct answer?
  • 47%

Is the observed key and non-key response pattern difference of 10% between White and Black candidates significant?
32
Chi-Square
Is relatively easy to compute and can help us get
a handle on the answer.
  • While viewing the steps that follow, keep in mind that the intention is to demonstrate how the Chi Square technique can be used, NOT how to compute the Chi Square statistic… that would be another workshop...

Chi Square Formula:
χ² = Σ (fo − fe)² / fe
where fo = observed frequency and fe = expected frequency
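
The formula translates directly into a short sketch, with observed and expected cell frequencies passed as parallel lists:

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum over cells of (fo - fe)**2 / fe.

    For a two-by-two table, pass the four cell frequencies in the
    same order in both lists.
    """
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
```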
33
Chi-Square
  • Remember, when thinking about our question, we
    are comparing the key and non-key response
    pattern of the Black candidates in this case, to
    the key and non-key response pattern of White
    candidates
  • Therefore, what we are summing (Σ) is the difference in response patterns of Black and White candidates selecting the key and the response patterns of Black and White candidates NOT selecting the key.

34
Let's recap the numbers and start to fill in the
blanks...
  • If there were no differences in response patterns between the White and Black candidate groups, what would be the Expected Frequency (fe) for Blacks selecting the key?... 92 (0.63 × 146)… and what would be the Expected Frequency (fe) for Blacks NOT selecting the key? 54 (0.37 × 146)
  • What is the frequency of Black candidates
    selecting the correct answer?
  • 77 (simply add High and Low candidates on A)
  • What is the frequency of Black candidates NOT
    selecting the correct answer?
  • 69 (simply subtract the total number of Black
    candidates selecting the key answer from the
    total number of Black candidates)

35
Let's recap the numbers and start to fill in the
blanks...
36
Method Two continued...
  • To enter the χ² table of significance, we need to determine the degrees of freedom (df) in the two-by-two Chi Square table that we are using: df = (rows − 1)(columns − 1) = (1)(1) = 1
  • The tabled χ² at the .05 level of significance and 1 df = 3.841
  • The obtained χ² = 6.613 is greater than 3.841, and therefore we would reject the null hypothesis that there is no difference between the response patterns of White and Black candidates on this item.
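
A minimal sketch reproducing the worked example, using the rounded expected frequencies (92 and 54) from the recap:

```python
observed = [77, 69]      # fo: Blacks selecting / not selecting the key
expected = [92.0, 54.0]  # fe: from the White pattern (63% / 37% of 146)

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
print(f"chi2 = {chi2:.3f}")        # about 6.61, matching the slide
print("reject H0:", chi2 > 3.841)  # tabled chi-square, alpha = .05, df = 1
```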

37
Think back to Method One (difference in item mean
difficulty).
  • Using that method, item 1 was not flagged… now, using Method Two (Chi Square test of significance), we find that there is a difference in response patterns between White and Black candidates
  • Why?

38
Remember the assumptions for Method One?
  • This method of identifying items which have
    apparent ethnic differences in results allows any
    existing main effect differences between groups
    to remain.
  • This method defines differential item functioning
    as any difference greater than existing main
    effect differences.
  • This method assumes that the main effect
    differences may reflect real underlying
    differences between groups rather than
    differential item functioning.

The Chi Square method treats differential item
functioning as a significant difference in
performance between the groups.
39
Where do we go from here?
Let's look at some non-multiple-choice item formats
40
Let's look at ...
Item Analysis for Short-Answer Essay Format Items
  • Before we look at the item analysis for essays, let's first look at a few short-answer essay items
  • Correction Captain Essay Items

41
Rating Standards
42
Reminders… as though we need them
  • Item analysis generally involves a tabular
    presentation of the way candidates answered a
    given item
  • The purpose of a test is to make distinctions
  • The question remains, "How do we know when a test item is doing its job, be it multiple-choice format or essay format?"
  • Let's make a first pass at attempting to answer
    this question for a set of Short-Answer Essay
    format items...

43
Descriptive Statistics
In-Basket Item Means Using Original Rating
Standards - in Descending Level of Difficulty
44
Let's look at Item Analysis...
for a few specific short-answer essay items
Remember, item analysis generally involves a
tabular presentation of the way candidates
answered a given item
45
Remember what we said about the purpose of a test
item?
  • To Make Distinctions
  • A review of the tabular presentation of results will help us assess whether or not the essay item is fulfilling its life's mission
  • Let's take a look...

46
Crosstab Report for Item 2
  • Refer to Handout (C) Attachment C-1

47
Crosstab Report for Item 8X
  • Refer to Handout (C) Attachment C-1

48
Rating continued...
  • A comparison of the candidate performance
    reflected in each of the item crosstab tables
    strongly supports an argument that the original
    standards were too high for this candidate group.
  • Therefore, a change from a criterion-based rating standard to a performance-based standard was considered.
  • In this context, the referent "criterion" is used to reflect the fact that the candidates were rated against criteria established by our Subject Matter Experts.
  • The candidates simply did not achieve expected
    performance.

49
Rating continued...
  • During the rating process, several raters
    observed that the existing standards also did not
    serve to sufficiently distinguish among
    candidates who were listing varying numbers of
    primary and secondary actions.
  • For example, in item 2, to obtain a score of 1, a candidate needed to list 2 primary actions; to get a score of 2, a candidate needed to list a total of 6 actions, 3 of which must be primary actions.
  • Raters indicated that there were a number of candidates who listed several actions but did not achieve the rating standard needed for a 2.
  • The same was observed for other items as well.
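
To make the original item 2 standard concrete, a minimal sketch; the function name and inputs are illustrative, not the actual rating procedure from the handout.

```python
def score_item_2(primary: int, secondary: int) -> int:
    """Original item 2 standard as described above (illustrative only)."""
    total = primary + secondary
    if primary >= 3 and total >= 6:  # score of 2: 6 actions, 3+ primary
        return 2
    if primary >= 2:                 # score of 1: at least 2 primary actions
        return 1
    return 0

# Several actions listed, yet still not a 2 -- the raters' observation:
print(score_item_2(primary=2, secondary=3))  # 1
```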

50
Crosstab Report for Revised Item 2
  • Refer to Handout (C) Attachment C-2

51
Crosstab Report for Revised Item 8X
  • Refer to Handout (C) Attachment C-2

52
Revised Rating Standards
53
Revised Results Item 2
Original Standards
Revised Standards
54
Revised Results Item 8X
Original Standards
Revised Standards
55
Descriptive Statistics - Revised Standards in
Descending Order of Difficulty
56
Let's look at Item Analysis for Written
Simulation Format Items
  • Before we look at the item analysis for written
    simulation items, let's first look at a few
    written simulation items
  • Correction Captain Written Simulation Exercise

57
Here we go again with Reminders
  • Item analysis generally involves a tabular
    presentation of the way candidates answered a
    given item
  • The purpose of the test is to make distinctions
  • The question remains, "How do we know when a test item is doing its job, be it multiple choice, essay or written simulation?"
  • Let's make a first pass at attempting to answer
    this question for a set of Simulation items

58
Simulation Problem: Item Review
59
Let's look at Item Analysis
For a few specific Written Simulation Exercise
Items
  • A review of the tabular presentation of results will also help us assess whether or not the written simulation exercise item is fulfilling its life's mission...

60
Simulation Problem: Item Review
  • Refer to Handout (D)

61
Simulation Problem: Item Review
62
Simulation Problem: Item Review
63
Simulation Problem: Item Review
64
Simulation Problem: Item Review
65
Simulation Problem: Item Review
66
Simulation Problem: Item Review
67
Item Analysis - The Bridge between a Test Item
and our understanding of its functioning
Although simplistic in its appearance, Item
Analysis can be a powerful tool to assist us in
our evaluation of a Test Item (whatever its
format) and help the examiner determine its
relative contribution to making distinctions
among candidates.
68
Bet you wish you were here about now...