Title: Approaches to Evaluating Test Items
1 Approaches to Evaluating Test Items
Traditional and Not-So-Traditional
Training Workshop
Fall MAPAC Meeting: Technical Issues in Selection
John Jay College of Criminal Justice, The City University of New York
New York City, September 29 - October 1, 1999
2 Topics
- Traditional methods of evaluating the effectiveness of test items
  - Multiple-Choice Test Items
- Not-so-traditional approaches to evaluating not-so-traditional test items
  - Short-Answer Essay Test Items
  - Written Simulation Exercise Items
3 Considerations
- What are the issues to consider when evaluating the effectiveness of test items in an examination?
- What is a test item designed to do, and how do we know when a test item is not doing its job?
- What are good item analysis results, and what does good item analysis look like?
- Do good test items always generate good item analysis results?
- Does bad item analysis indicate that the item was bad?
4 The Purpose of the Test Item
- To make distinctions!!
- Nothing More and Nothing Less.
5 Item Analysis
- A basic tool for evaluating whether a test item is performing this function, i.e., is the item making distinctions?
- Item analysis is a general term for a variety of statistical methods of evaluating items.
- For Multiple-Choice test items, item analysis generally involves a tabular presentation of the way candidates answered a given item.
6 Item Analysis
Some basic definitions
- P: item difficulty or, more accurately, the % of candidates getting the answer correct
- H: think of it as a grouping of high(er)-scoring candidates taking the test
- L: think of it as a grouping of low(er)-scoring candidates taking the test
- The H/L split is typically a median split around a criterion measure, e.g., job performance, total-test score, part-test score, etc.
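None of these quantities requires special software. A minimal sketch in Python, using invented response data (the scores and answers below are hypothetical, purely for illustration), computes P and a median-split H/L comparison on total-test score as the criterion:

```python
# Basic item-analysis quantities: P (proportion correct) and an H/L
# median split on total-test score. All data here are hypothetical.
scores = [12, 18, 9, 15, 20, 7, 14, 16, 11, 19]   # total-test scores
correct = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]          # 1 = answered this item correctly

p = sum(correct) / len(correct)                    # item difficulty P

median = sorted(scores)[len(scores) // 2]          # simple median split
high = [c for s, c in zip(scores, correct) if s >= median]   # H group
low = [c for s, c in zip(scores, correct) if s < median]     # L group

print(f"P = {p:.2f}")
print(f"H group proportion correct = {sum(high) / len(high):.2f}")
print(f"L group proportion correct = {sum(low) / len(low):.2f}")
```

An item making distinctions should show the H group answering correctly more often than the L group, as this one does.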
7 Item Analysis
- iri is a point-biserial correlation between the test item and the criterion measure, multiplied by the item's standard deviation
- In English, iri is the degree to which the item is measuring the same thing as the criterion measure. iri is sometimes referred to as the item reliability index or item discrimination index
- iris between .10 and .20 indicate that the item is working in the same direction as the other items
- iris above .20 are working very well, as in the example above
8 A momentary side-bar: just how does one compute the iri anyway, and why should we know how it is computed?

iri = r_pbis x s_i, where r_pbis = ((M_p - M_t) / s_t) x sqrt(P_g / (1 - P_g))

Where:
- P_g = percent of the group getting the item correct
- M_p = subtest mean for candidates in the group who got the item correct
- M_t = subtest mean for all candidates
- s_t = subtest standard deviation for all candidates
- s_i = item standard deviation, sqrt(P_g(1 - P_g))

Since s_i = sqrt(P_g(1 - P_g)), the product simplifies to iri = P_g(M_p - M_t) / s_t.
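The computation can be sketched directly from those definitions. The subtest scores below are hypothetical; the code computes iri both ways (point-biserial times item standard deviation, and the simplified form) to show they agree:

```python
import statistics

# Hypothetical subtest scores and item responses, for illustration only.
subtest = [12, 18, 9, 15, 20, 7, 14, 16, 11, 19]
correct = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]

p_g = sum(correct) / len(correct)                   # proportion correct
m_p = statistics.mean(s for s, c in zip(subtest, correct) if c == 1)
m_t = statistics.mean(subtest)                      # subtest mean, all candidates
s_t = statistics.pstdev(subtest)                    # subtest SD, all candidates
s_i = (p_g * (1 - p_g)) ** 0.5                      # item standard deviation

r_pbis = ((m_p - m_t) / s_t) * (p_g / (1 - p_g)) ** 0.5
iri = r_pbis * s_i                                  # point-biserial x item SD

# The algebra collapses to the simpler equivalent form:
iri_simple = p_g * (m_p - m_t) / s_t
print(round(iri, 4), round(iri_simple, 4))
```

Both forms give the same value, which here is well above .20, i.e., an item that is working very well by the rule of thumb above.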
9 iri Estimate Formula
- It is not often easy to determine the subtest mean for candidates in the group who got the item correct
- The estimation formula does not replicate the full calculation, but can be used as a quick estimate
10 Using item analysis, you can tell a lot about an item without even looking at the test item itself... BUT, BE CAREFUL...
- And the answer key for item (1) is...?
- Does the item appear to be fulfilling its life's mission, i.e., to make distinctions?
- How are the wrong answer choices (distracters) working?
- But why the cautionary "be careful"?
- Handout (A): Example of two item analysis reports
11 A Look at Item Analysis -- Sans the Item
- And the answer key for item 8 is...?
- Does the item appear to be fulfilling its life's mission, i.e., to make distinctions?
- How are the wrong answer choices (distracters) working?
- But why the cautionary "be careful"?
- Now what would you think if I told you that item 8 was mis-keyed? The correct key for the item is D, not A.
12 A Look at Item Analysis -- Sans the Item
- Now, based again on only the item analysis, the answer key for item 8 is...?
- Does the item appear to be fulfilling its life's mission, i.e., to make distinctions?
- How are the wrong answer choices (distracters) working?
- Therefore, the cautionary "be careful".
13 A Look at Item Analysis -- Sans the Item
- Too heavy a reliance upon a review of the item analysis alone for the purposes of evaluating a test item is fraught with danger... sounds obvious, but you might be surprised
- Due to the computational formulas used for iri, high-low numbers are strongly biased in favor of the selected key. In most cases, a change in the identified key will make a significant difference in the high-low output for that particular item, while not significantly changing the high-low output for other items in the subtest.
14 Enough Said.
Therefore, just be careful when looking at
numbers. Numbers alone may deceive you...
15 An Item Doesn't Exist in a Vacuum
Like most other things in life and nature, an
item must be evaluated in its context.
- The other items in the test or subtest grouping:
- Is it a homogeneous grouping of items that is intended to measure the same trait?
- Or is it a heterogeneous grouping of items that is designed to measure a broad range of traits or knowledges?
- Answers to these questions will suggest what type of iri (item reliability index) one would expect to get...
16 An Item Doesn't Exist in a Vacuum
Like most other things in life and nature, an
item must be evaluated in its context.
- The candidates in the competition who are responding to the items: is it a homogeneous group or a heterogeneous group?
- Is it a homogeneous grouping of candidates, such as a promotional candidate field having the same or similar expertise in the trait being measured?
- Or is it a heterogeneous grouping of candidates, such as a sample of high school graduates having varying degrees of expertise in the trait being measured?
- Again, answers to these questions will suggest what type of iri (item reliability index) one would expect to get...
17 Enough with the Introduction Stuff, Let's Get to Work
- How can item analysis be used to evaluate whether or not test items have differential item functioning or have impact?
- For purposes of this exercise, the referent "differential item functioning" or "impact" at the item level is defined as a finding of observable differences in levels of difficulty among gender and/or ethnic groups.
18 Handout (B) - Two Approaches for Consideration
- One to be used as an indicator of difference (think barometer)
- The other as an indicator of significance in differences.
- Method One - Using Differences in Item Mean Difficulties to Evaluate Score Differences between Candidate Groups
- Method Two - The Chi Square Approach to Evaluate Score Differences between Candidate Groups
19 Method One
- Refer to Handout (B), Attachments B-1, B-2 and B-3

Step 1
- For each item, compute the average item difficulty for Whites (PW) and Blacks (PB).
- (1PW = ?, 1PB = ?; 2PW = ?, 2PB = ?; etc.)
- Example - item 1
- 1PW = 0.63
- 1PB = 0.53
20 Method One (continued)
Step 2
- Refer to Handout (B), Attachment B-3
Subtract the item difficulties for Whites and Blacks for each item to obtain an average ethnic difference rating between Whites and Blacks for each item (itemEDIF).
- Example - item 1
- 1EDIF = 1PW - 1PB = 0.63 - 0.53 = 0.10
- Example - item 2
- 2EDIF = 2PW - 2PB = 0.82 - 0.60 = 0.22
21 Method One (continued)
Step 3
- Refer to Handout (B), Attachment B-3
Compute the mean item difficulty level for all items in the subtest for Whites (P̄W), then compute the mean item difficulty level for all items in the subtest for Blacks (P̄B).
- Example
- P̄W = 9.42/15 = 0.628 = mean item difficulty for Whites
- P̄B = 7.57/15 = 0.505 = mean item difficulty for Blacks
22 Method One (continued)
Step 4
- Refer to Handout (B), Attachment B-3
Compute an ethnic group difference mean statistic (EGDIF) by subtracting the subtest mean difficulty level for Blacks from the subtest mean difficulty level for Whites.
- Example
- EGDIF = P̄W - P̄B = 0.628 - 0.505 = 0.123, the mean difference between White and Black candidates
23 Now, go back to your computations in Step 2
Step 5
- Flag any item where its ethnic item difference statistic (itemEDIF) exceeds the ethnic group difference mean statistic computed in Step 4 (EGDIF).
- Example: Item 1 does not exceed the ethnic group difference mean statistic, but items 2, 6, 7, 8, 9, 13, 14 and 15 do...
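Method One's five steps can be sketched end-to-end in a few lines. Items 1 and 2 below use the difficulties from the worked example (0.63/0.53 and 0.82/0.60); the other three items are hypothetical, added only to make the flagging step do something:

```python
# Method One: flag items whose White-Black difficulty difference (EDIF)
# exceeds the mean group difference (EGDIF). Items 3-5 are hypothetical.
pw = [0.63, 0.82, 0.70, 0.55, 0.61]   # PW: item difficulties for Whites
pb = [0.53, 0.60, 0.65, 0.50, 0.48]   # PB: item difficulties for Blacks

edif = [w - b for w, b in zip(pw, pb)]          # Step 2: per-item differences
mean_pw = sum(pw) / len(pw)                     # Step 3: mean difficulty, Whites
mean_pb = sum(pb) / len(pb)                     #         mean difficulty, Blacks
egdif = mean_pw - mean_pb                       # Step 4: group difference mean

flagged = [i + 1 for i, d in enumerate(edif) if d > egdif]   # Step 5
print(f"EGDIF = {egdif:.3f}, flagged items: {flagged}")
```

As in the slides, item 1 (EDIF = 0.10) falls below the group mean difference and is not flagged, while item 2 (EDIF = 0.22) is.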
24 Assumptions and Implications
- This method of identifying items which have apparent ethnic differences in results allows any existing main effect differences between groups to remain.
- This method defines differential item functioning as any difference greater than existing main effect differences.
- This method assumes that the main effect differences may reflect real underlying differences between groups rather than differential item functioning.
- This method does not allow for error variance, i.e., the possibility that flagged items would differ upon readministration.
25 Method One continued...
- Now what do we do?
- This method of evaluating potential differential item functioning or impact can be viewed as a barometer... the higher the number, the greater the pressure.
- Now you must look at the test items to attempt to determine what is contributing to these results
- It would be good to invite a sensitivity review by representatives of the protected classes and ask them to suggest reasons why DIF appears to be operating in the item.
- If you are pre-testing your questions, you may wish to exclude from the examination some items that seriously exceed average differences.
26 Method Two - The Chi Square Approach to Evaluate Score Differences between Candidate Groups
- Simply described, Chi Square is a nifty little
tool that evaluates the significance of the
differences in response patterns between what we
would expect (if in fact there were no
differences between groups) and what we observe.
27 Method Two continued...
Example: Let's look at Handout (B), Attachments B-1 and B-2... specifically the item analysis for item 1.
No. Whites = 766
- What is the frequency of White Candidates selecting the correct answer?
- 484 (simply add the High and Low candidates on A)
28 Method Two continued...
No. Whites = 766
- What is the % of White Candidates selecting the correct answer?
- 63%
- What is the % of White Candidates NOT selecting the correct answer?
- 37%
29 Method Two continued...
- Now, if there were no differences in the response patterns between White and Black candidates on this item, we would expect approximately 63% of the Black candidates, or 92 out of 146, to select the correct answer, and approximately 37% of the Black candidates, or 54 out of 146, to get the item wrong... that is what we would expect, but what did we observe?
30 Method Two continued...
No. Blacks = 146
- What is the frequency of Black candidates selecting the correct answer?
- 77 (simply add the High and Low candidates on A)
- What is the % of Black candidates selecting the correct answer?
- 53%
31 Method Two continued...
No. Blacks = 146
- What is the frequency of Black candidates NOT selecting the correct answer?
- 69 (simply subtract the number of Black candidates selecting the key answer from the total number of Black candidates)
- What is the % of Black candidates NOT selecting the correct answer?
- 47%

Is the observed key and non-key response pattern difference of 10 percentage points between White and Black candidates significant?
32 Chi-Square
Is relatively easy to compute and can help us get a handle on the answer.
- While viewing the steps that follow, keep in mind that the intention is to demonstrate how the Chi Square technique can be used, NOT how to compute the Chi Square statistical technique... that would be another workshop...

Chi Square Formula: X² = Σ [(fo - fe)² / fe]
Where fo = Observed Frequency and fe = Expected Frequency
33 Chi-Square
- Remember, when thinking about our question, we are comparing the key and non-key response pattern of the Black candidates, in this case, to the key and non-key response pattern of the White candidates
- Therefore, what we are summing (Σ) is the difference in response patterns of Black and White candidates selecting the key and the response patterns of Black and White candidates NOT selecting the key.
34 Let's recap the numbers and start to fill in the blanks...
- If there were no differences in response patterns between the White and Black candidate groups, what would be the Expected Frequency (fe) for Blacks selecting the key? ... 92 (0.63 x 146); and what would be the Expected Frequency (fe) for Blacks NOT selecting the key? ... 54 (0.37 x 146)
- What is the frequency of Black candidates selecting the correct answer?
- 77 (simply add the High and Low candidates on A)
- What is the frequency of Black candidates NOT selecting the correct answer?
- 69 (simply subtract the number of Black candidates selecting the key answer from the total number of Black candidates)
35 Let's recap the numbers and start to fill in the blanks...
36 Method Two continued...
- To enter the X² table of significance, we need to determine the degrees of freedom (df) in the two-by-two Chi Square table that we are using: df = (rows - 1)(columns - 1) = (1)(1) = 1
- The tabled X² at the .05 level of significance and 1 df = 3.841
- The obtained X² = 6.613 is greater than 3.841, and therefore we reject the null hypothesis that there is no difference between the response patterns of Whites and Blacks on this item.
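The slide's arithmetic is easy to check. A minimal sketch, using the rounded expected frequencies of 92 and 54 from the recap (which yields an obtained value matching the slide's 6.613 to rounding):

```python
# Chi Square for item 1: observed vs expected Black-candidate response
# pattern, with expected frequencies taken from the White proportions.
fo = [77, 69]        # observed: Blacks selecting / not selecting the key
fe = [92, 54]        # expected under "no difference" (0.63 x 146, 0.37 x 146)

chi_square = sum((o - e) ** 2 / e for o, e in zip(fo, fe))
critical = 3.841     # tabled X^2 at the .05 level of significance, df = 1

print(f"X^2 = {chi_square:.3f}, reject null: {chi_square > critical}")
```

Since the obtained value exceeds 3.841, the null hypothesis of no difference in response patterns is rejected, as the slide concludes.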
37 Think back to Method One (difference in item mean difficulty).
- Using that method, item 1 was not flagged... now, using Method Two (Chi Square test of significance), we find that there is a difference in response patterns between White and Black candidates
- Why?
38 Remember the assumptions for Method One?
- This method of identifying items which have apparent ethnic differences in results allows any existing main effect differences between groups to remain.
- This method defines differential item functioning as any difference greater than existing main effect differences.
- This method assumes that the main effect differences may reflect real underlying differences between groups rather than differential item functioning.

The Chi Square method treats differential item functioning as a significant difference in performance between the groups.
39 Where do we go from here?
Let's look at some non-multiple choice item
formats
40 Let's look at ...
Item Analysis for Short-Answer Essay Format Items
- Before we look at the item analysis for essays, let's first look at the short-answer essay items themselves
- Correction Captain Essay Items
41 Rating Standards
42 Reminders... as though we need them
- Item analysis generally involves a tabular presentation of the way candidates answered a given item
- The purpose of a test is to make distinctions
- The question remains: How do we know when a test item is doing its job, be it multiple-choice format or essay format?
- Let's make a first pass at attempting to answer this question for a set of Short-Answer Essay format items...
43 Descriptive Statistics
In-Basket Item Means Using Original Rating Standards - in Descending Order of Difficulty
44 Let's look at Item Analysis...
for a few specific short-answer essay items
Remember, item analysis generally involves a tabular presentation of the way candidates answered a given item
45 Remember what we said about the purpose of a test item?
- To Make Distinctions
- A review of the tabular presentation of results will help us assess whether or not the essay item is fulfilling its life's mission
- Let's take a look...
46 Crosstab Report for Item 2
- Refer to Handout (C) Attachment C-1
47 Crosstab Report for Item 8X
- Refer to Handout (C) Attachment C-1
48 Rating continued...
- A comparison of the candidate performance reflected in each of the item crosstab tables strongly supports an argument that the original standards were too high for this candidate group.
- Therefore, a change from a criterion-based rating standard to a performance-based standard was considered.
- In this context, the referent "criterion" is used to reflect the fact that the candidates were rated against criteria established by our Subject Matter Experts.
- The candidates simply did not achieve expected performance.
49 Rating continued...
- During the rating process, several raters observed that the existing standards also did not serve to sufficiently distinguish among candidates who were listing varying numbers of primary and secondary actions.
- For example, in item 2, to obtain a score of 1, a candidate needed to list 2 primary actions; to get a score of 2, a candidate needed to list a total of 6 actions, 3 of which had to be primary actions.
- Raters indicated that there were a number of candidates who were listing several actions, but did not achieve the rating standard needed for a 2.
- This observation was experienced in other items as well.
50 Crosstab Report for Revised Item 2
- Refer to Handout (C) Attachment C-2
51 Crosstab Report for Revised Item 8X
- Refer to Handout (C) Attachment C-2
52 Revised Rating Standards
53 Revised Results - Item 2
Original Standards
Revised Standards
54 Revised Results - Item 8X
Original Standards
Revised Standards
55 Descriptive Statistics - Revised Standards, in Descending Order of Difficulty
56 Let's look at Item Analysis for Written Simulation Format Items
- Before we look at the item analysis for written simulation items, let's first look at a few written simulation items
- Correction Captain Written Simulation Exercise
57 Here we go again with Reminders
- Item analysis generally involves a tabular presentation of the way candidates answered a given item
- The purpose of the test is to make distinctions
- The question remains: How do we know when a test item is doing its job, be it multiple choice, essay or written simulation?
- Let's make a first pass at attempting to answer this question for a set of Simulation items
58 Simulation Problem - Item Review
59 Let's look at Item Analysis
For a few specific Written Simulation Exercise Items
- A review of the tabular presentation of results will also help us assess whether or not the written simulation exercise item is fulfilling its life's mission...
60 Simulation Problem - Item Review
61 Simulation Problem - Item Review
62 Simulation Problem - Item Review
63 Simulation Problem - Item Review
64 Simulation Problem - Item Review
65 Simulation Problem - Item Review
66 Simulation Problem - Item Review
67 Item Analysis - The Bridge between a Test Item and our Understanding of its Functioning
Although simplistic in its appearance, item analysis can be a powerful tool to assist us in our evaluation of a test item (whatever its format) and help the examiner determine its relative contribution to making distinctions among candidates.
68 Bet you wish you were here about now...