Title: Test Items and Item Analysis
1. Test Items and Item Analysis
- Psy 427
- Cal State Northridge
- Andrew Ainsworth PhD
2. Item Formats
- Dichotomous Format
- Two alternatives
- True/False
- MMPI-2, MMPI-A
- Polytomous or Polychotomous Format
- More than two alternatives
- Multiple choice
- Psy 427 midterm, SAT, GRE
3. Item Formats
- Distractors
- Incorrect choices on a polychotomous test
- Best to have three or four
- BUT: one study (Sidick, Barrett, & Doverspike, 1994) found equivalent validity and reliability for a test with two distractors (three alternatives) as for one with four distractors (five alternatives).
- SO, best might be to have two to four (further study is needed)
4. Should you guess on polytomous tests?
- Depends: is there a correction for guessing?
- With a correction for guessing, the corrected score is R - W/(n - 1), where:
- R is the number correct
- W is the number incorrect
- n is the number of choices per item
- If there is no correction for guessing, guess away.
- If there is a correction for guessing, it is better to leave some blank (unless you can beat the odds).
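The standard correction formula above can be sketched in a few lines of Python; the test in the example (40 items, 5 choices) is hypothetical:

```python
def corrected_score(R, W, n):
    """Correct a raw score for guessing: each wrong answer costs 1/(n-1).

    R: number correct, W: number incorrect (blanks are ignored),
    n: number of choices per item.
    """
    return R - W / (n - 1)

# Hypothetical 40-item, 5-choice test
print(corrected_score(20, 10, 5))   # 20 - 10/4 = 17.5
# A pure guesser expects R = 8, W = 32, netting zero after correction
print(corrected_score(8, 32, 5))    # 8 - 32/4 = 0.0
```

Under this scoring, random guessing gains nothing in expectation, which is why leaving items blank can be the better bet.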
5. Other Test Items
- Likert scales
- On a rating scale of 1-5, 1-6, 1-7, etc., where
- 1 = strongly disagree
- 2 = moderately disagree
- 3 = mildly disagree
- 4 = mildly agree
- 5 = moderately agree
- 6 = strongly agree
- rate the following statements.
6. Other Test Items
- Likert scales
- Even vs. odd number of choices
- An even number prevents fence-sitting
- An odd number allows people to be neutral
- Likert items are VERY popular measurement items in psychology.
- Technically ordinal, but often assumed continuous if there are 5 or more choices
- With that assumption we can calculate means, factor analyze, etc.
7. Other Test Items
- Category format
- Like Likert, but with MANY more categories
- e.g., a 10-point scale
- Best if used with anchors
- Research supports the use of 7-point to 21-point scales
8. Other Test Items
- Visual Analogue Scale
- A line anchored at each end, e.g., "No Headache" at one end and "Worst Headache" at the other
- Also used in research
- dials, knobs
- time sampling
9. Checklists & Q-Sorts
- Both used in qualitative research as well as quantitative research
- Checklists
- Present a list of words (adjectives)
- Have the person choose whether to endorse each item
- Can determine perceptions of concepts using checklists.
10. Checklists & Q-Sorts
- Adjective Checklists (from http://www.encyclopedia.com/doc/1O87-AdjectiveCheckList.html)
- In psychometrics, any list of adjectives that can be marked as applicable or not applicable
- to oneself
- to one's ideal self
- to another person, OR
- to some other entity or concept.
11. Checklists & Q-Sorts
- Checklists
- When written with initial uppercase letters (ACL), the term denotes more specifically a measure consisting of a list of 300 adjectives, from "absent-minded" to "zany"
- Selected by the US psychologist Harrison G. Gough (born 1921) and introduced as a commercial test in 1952.
- The test yields 24 scores, including measures of personal adjustment, self-confidence, self-control, lability, counselling readiness, some response styles, and 15 personality needs, such as achievement, dominance, and endurance.
12. Checklists & Q-Sorts
- Q-Sorts
- Introduced by William Stephenson in 1935
- PhD in physics (1926); PhD in psychology (1929)
- Student of Charles Spearman
- Goal: to get a quantitative description of a person's perceptions of a concept
- Process: give the subject a pile of numbered cards and have them sort the cards into piles
- Piles represent graded degrees of description (most descriptive to least descriptive).
13. Checklists & Q-Sorts
- Q-Sorts
- A means of self-evaluation of a client's current status
- The Q-Sort consists of a number of cards, often as many as 40 or 50, even 100 items, each consisting of a single trait, belief, or behavior.
- The goal is to sort these cards into one of five columns ranging from statements such as "very much like me" to "not at all like me."
- There are typically a specific number of cards allowed for each column, forcing the client to balance the cards evenly.
- Examples: California Q-sort, Attachment Q-sort
14. Example Q-sort
15. California Q-Sort
16. Attachment Q-sort
- Attachment Q-sort distribution (number of items per pile designated)
17. Item Analysis
- Methods used to evaluate test items.
- What are good items?
- Techniques:
- Item Difficulty (or easiness)
- Discriminability
- Extreme Group
- Item/Total Correlation
- Item Characteristic Curves
- Item Response Theory
- Criterion-Referenced Testing
18. Item Difficulty
- The proportion of people who get a particular item correct, or who endorse an item (if there is no correct response, e.g., the MMPI)
- Often thought of as the item's easiness because it is based on the number correct/endorsed
19. Item Difficulty
- The difficulty can be given as a proportion, or it can be standardized into a Z-value
20. Item Difficulty
- For example, an item with a difficulty of .84
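One common convention for the standardization (assumed here; the slides do not spell it out) maps the proportion correct p to the z-value with a fraction p of the normal curve above it, so easy items get negative z-values:

```python
from statistics import NormalDist  # standard library

def difficulty_z(p):
    """Standardize item difficulty: the z-value with a fraction p of the
    normal curve above it, i.e., z = inverse-normal of (1 - p)."""
    return NormalDist().inv_cdf(1 - p)

print(round(difficulty_z(0.84), 2))  # easy item: about -0.99
print(round(difficulty_z(0.16), 2))  # hard item: about 0.99
```

A difficulty of .50 maps to z = 0 under this convention.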
21. Difficult Item (p = .35)
- If you are taking a criterion-referenced test in a social psychology course and you need to score a 92 in order to get an A, the criterion is:
- Social Psychology
- Scoring a 92
- Getting an A
- Not enough info.
22. Difficult Item (p = .35)
23. Moderate Item (p = .51)
- The correlation between X and Y is .54. X has an SD of 1.2 and Y has an SD of 5.4. What is the regression coefficient (b) when Y is predicted from X?
- .12
- 2.43
- .375
- .45
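As a worked check of this item (not part of the original slides): for simple regression of Y on X, the slope is the correlation rescaled by the ratio of the standard deviations.

```latex
b = r \cdot \frac{s_Y}{s_X} = .54 \times \frac{5.4}{1.2} = .54 \times 4.5 = 2.43
```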
24. Moderate Item (p = .51)
25. Easy Item (p = 1.00)
- For the following set of data: 5, 9, 5, 5, 2, 4, the mean is:
- 4
- 5
- 4.5
- 6
26. Easy Item (p = 1.00)
27. Optimum Difficulty
- Mathematically, half-way between chance and 100%.
- Steps (assuming a 5-choice test):
- Find half-way between 100% and chance:
- 1 - .2 = .8; .8/2 = .4
- Add this value to chance alone:
- .4 + .2 = .6
- Alternately: (Chance + 1.0) / 2 = optimum difficulty
- A good test will have difficulty values between .30 and .70
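The steps above can be sketched in a couple of lines (taking chance, as the slide does, to be 1 divided by the number of choices):

```python
def optimum_difficulty(n_choices):
    """Optimum item difficulty: half-way between chance and 1.0."""
    chance = 1.0 / n_choices
    return (chance + 1.0) / 2

print(optimum_difficulty(5))  # (.2 + 1.0) / 2 = 0.6
print(optimum_difficulty(2))  # true/false: (.5 + 1.0) / 2 = 0.75
```

Note that for true/false items the optimum sits at .75, above the .30 to .70 range quoted for a good test overall.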
28. Discriminability
- Can be defined in 2 ways:
- How well does each item distinguish (discriminate) between individuals who score high and low on the test as a whole (i.e., on the trait of interest)?
- Or simply: how well is each item related to the trait (e.g., loadings in factor analysis)?
- 1 and 2 are really the same: the more an item is related to the trait, the better it can distinguish high- and low-scoring individuals
29. Discriminability
- Extreme Group Method
- First:
- Identify two extreme groups
- Top third vs. bottom third
- Second:
- Compute the difficulty for the top group
- Compute the difficulty for the bottom group
- Compute the difference between the top difficulty and the bottom difficulty
- Result: the Discriminability Index
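A minimal sketch of the extreme-group method, with hypothetical data: `item_scores` holds each examinee's 0/1 result on one item, and `total_scores` the corresponding test totals.

```python
def discriminability_index(item_scores, total_scores):
    """Extreme-group method: item difficulty (proportion correct) in the
    top third of total scorers minus its difficulty in the bottom third."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    third = len(order) // 3
    bottom, top = order[:third], order[-third:]
    p_top = sum(item_scores[i] for i in top) / len(top)
    p_bottom = sum(item_scores[i] for i in bottom) / len(bottom)
    return p_top - p_bottom

# Hypothetical data for 9 examinees
item = [0, 0, 1, 0, 1, 1, 1, 1, 1]
totals = [10, 12, 14, 20, 22, 24, 30, 32, 35]
print(round(discriminability_index(item, totals), 2))  # 1.0 - 0.33 = 0.67
```

A positive index means high scorers pass the item more often than low scorers, which is what a discriminating item should show.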
31. Discriminability
- Item/Total Correlation
- Let the total test score stand in for the trait of interest (a roughly estimated factor of sorts)
- Correlate each item with the total test score; items with higher item/total correlations are more discriminating
- These correlations are like rough factor loadings
32. Discriminability
- Point-Biserial Method
- If you have dichotomously scored items (e.g., the MMPI) or items with a correct answer
- Correlate each item's score (correct/incorrect) with the total test score.
- One dichotomous variable (correct/incorrect) correlated with one continuous variable (total score) is a Point-Biserial correlation
- Measures discriminability
33. Discriminability
34. Discriminability
- The discrimination can be standardized into a Z-value as well
35. Discriminability
36. Discriminability
37. Selecting Items
- Using Difficulty and Discrimination together
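One simple way to combine the two criteria, using the deck's .30 to .70 difficulty range; the discrimination cutoff of .30 is an illustrative assumption, not a rule from the slides:

```python
def keep_item(difficulty, discrimination,
              p_range=(0.30, 0.70), min_disc=0.30):
    """Retain an item if its difficulty is moderate and its
    discrimination index is adequate (cutoffs are illustrative)."""
    return p_range[0] <= difficulty <= p_range[1] and discrimination >= min_disc

print(keep_item(0.55, 0.45))  # True: moderate difficulty, discriminates well
print(keep_item(0.95, 0.45))  # False: too easy
print(keep_item(0.55, 0.05))  # False: does not discriminate
```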
38. Item Characteristic Curves
- A graph of the proportion of people getting each item correct, compared to total scores on the test.
- Ideally, lower test scores should go along with lower proportions of people getting a particular item correct.
- Ideally, higher test scores should go along with higher proportions of people getting a particular item correct.
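An empirical curve of this kind can be sketched by banding examinees on total score and computing the proportion passing the item within each band; the data are hypothetical, and this is only a rough empirical version of the curves shown on the slides:

```python
from collections import defaultdict

def icc_points(item, totals, n_bins=4):
    """Proportion passing an item within equal-width bands of total score."""
    lo, hi = min(totals), max(totals)
    width = (hi - lo) / n_bins or 1          # guard against all-equal totals
    bins = defaultdict(list)
    for x, t in zip(item, totals):
        b = min(int((t - lo) / width), n_bins - 1)
        bins[b].append(x)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

item = [0, 0, 0, 1, 0, 1, 1, 1]
totals = [5, 8, 11, 14, 17, 20, 23, 26]
print(icc_points(item, totals))  # {0: 0.0, 1: 0.5, 2: 0.5, 3: 1.0}
```

A rising pattern from low to high bands, as here, is what a good item should show.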
39-65. Item Characteristic Curves (figure slides)
66. Other Evaluation Techniques
- Item Response Theory
- Viewing item response curves at different levels of difficulty
- Looks at the standard error at different ranges of the trait you are trying to measure
- More on this in the next topic
67. Other Evaluation Techniques
- Criterion-Referenced Tests
- Instead of comparing a score on a test or scale to other respondents' scores, we can compare each individual to what they should have scored.
- Requires that there is a set objective in order to assess whether the objective has been met
- E.g., in intro stats, students should learn how to run an independent-samples t-test; a criterion-referenced test could be used to test this. This needs to be demonstrated before moving on to another objective.
68. Other Evaluation Techniques
- Criterion-Referenced Tests
- To evaluate CRT items:
- Give the test to 2 groups: one exposed to the material and one that has not seen the material
- Distribute the scores for the test in a frequency polygon
- The antimode (least frequent value) represents the cut score between those who were exposed to the material and those who weren't
- Scores above the cut score are assumed to have mastered the material, and vice versa
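A sketch of finding that antimode cut score for integer-valued scores; the exact procedure (take the two frequency peaks, then the least frequent value between them) fills in implementation details the slides leave open, and the data are hypothetical:

```python
from collections import Counter

def antimode_cut(scores):
    """Cut score at the antimode: the least frequent value between the
    two peaks of a (roughly bimodal) integer score distribution."""
    counts = Counter(scores)
    # two most frequent score values, taken as the distribution's peaks
    peaks = sorted(sorted(counts, key=counts.get, reverse=True)[:2])
    # Counter returns 0 for unseen values, so gaps count as least frequent
    return min(range(peaks[0] + 1, peaks[1]), key=lambda v: counts[v])

# Hypothetical scores: unexposed group clusters near 4, exposed near 10
scores = [3, 4, 4, 4, 5, 5, 6, 9, 10, 10, 10, 11]
print(antimode_cut(scores))  # 7: the gap between the two groups
```

Scores above 7 would then be classified as having mastered the material.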
69. Criterion-Referenced Test
70. Other Evaluation Techniques
- Criterion-Referenced Tests
- Often used with Mastery-style learning
- Once a student indicates they've mastered the material, he/she moves on to the next module of material
- If they do not pass the cut score for mastery, they receive more instruction until they can master the material