Title: Interpretation: How to Use Psychometrics
Slide 1: Interpretation: How to Use Psychometrics
Slide 2: A Different Format
- Previous talks were generally about one topic
- Today's presentation: Where does this stuff come up at MP, outside of the psychos?
- A little bit of info on several different things
Slide 3: The goals
- Understand various psychometric analyses as they arise in day-to-day work
- See which stats are used in different applications
- Answer questions
Slide 4: Topics Covered
- Things you'd find in a key verification file
  - Classical stats (p-values, point-biserials)
- Things you'd find at a form pulling
  - IRT stats (TCCs, TIFs)
- Things you'd find in a technical manual
  - All sorts of info
- A question you'd hear at a standard setting
  - IRT
Slide 5: 1. Key Verification Files
- Purpose: To check the correctness of answer keys (MC items)
- A list of items whose stats are unusual or merit further investigation
- Items identified based on their p-values and/or point-biserials
Slide 6
- P-value: The proportion of students answering an item correctly
  - How easy is the item?
- Point-biserial: The correlation between item score and total score
  - If you do well on the item, do you tend to do well on the test?
  - (Both stats are sketched in code below)
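A minimal sketch of how these two stats might be computed, assuming `responses` is a students-by-items matrix of 0/1 item scores (the names here are illustrative, not from the talk):

```python
import numpy as np

def classical_item_stats(responses):
    """Per-item p-values and point-biserials from a 0/1 score matrix."""
    total = responses.sum(axis=1)              # each student's raw score
    p_values = responses.mean(axis=0)          # proportion correct per item
    point_biserials = np.array([
        np.corrcoef(responses[:, i], total)[0, 1]
        for i in range(responses.shape[1])
    ])
    return p_values, point_biserials
```

Operational programs often correlate the item with the total excluding that item (a corrected point-biserial); the uncorrected version shown here is slightly inflated.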
Slide 7: When might we be alarmed?
- Not many kids are picking the right answer
  - The p-value is low (less than .25)
- Low-performing kids are doing better on the item than high-performing kids
  - The point-biserial is low (less than .15)
- and/or
- If an incorrect answer choice has strange stats
Slide 8: Distractor Stats
- Distractor p-value: The proportion of students picking the distractor (say, choice C when the correct answer is B)
  - How popular is choice C?
  - Flag the item if the distractor p-value is higher than .3
- Distractor point-biserial: The correlation between picking the distractor and total test score
  - If you picked C, how well did you tend to do on the test?
  - Flag the item if the distractor PBS is positive (both checks are sketched in code below)
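A sketch of the distractor checks using the flagging rules above, assuming `choices` holds each student's selected letter for one item and `total` holds their raw scores (hypothetical names and thresholds taken from the slide):

```python
import numpy as np

def flag_distractors(choices, key, total):
    """Return distractors whose p-value or point-biserial looks suspicious."""
    flagged = []
    for option in np.unique(choices):
        if option == key:
            continue
        picked = (choices == option).astype(float)
        d_p = picked.mean()                        # distractor p-value
        d_pbs = np.corrcoef(picked, total)[0, 1]   # distractor point-biserial
        if d_p > 0.30 or d_pbs > 0:                # rules of thumb from above
            flagged.append((option, round(d_p, 2), round(d_pbs, 2)))
    return flagged
```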
Slide 9: An Operational Example
- A recent item had the following stats:
  - Key: D
  - P-value: 0.10
  - Point-biserial: -0.02
  - P-value for C: 0.60
  - Point-biserial for C: 0.20
- So the key was wrong? Nope
Slide 10: How Can That Happen?
- An example: What is the definition of the word "travesty"?
  - A: Mockery
  - B: Injustice
  - C: Bellybutton
  - D: Some even stupider answer than bellybutton
- Actual definition: Any grotesque or debased likeness or imitation
- The correct answer is A, but the phrase "travesty of justice" threw off the high-performing students
Slide 11: To sum up
- Psychometrics can help us identify items whose keys need to be checked
- Stats used:
  - P-values
  - Point-biserials
  - Distractor p-values and point-biserials
- P-values and point-biserials should be relatively high; distractor values should be relatively low
- The key usually turns out to be right, but that's OK
Slide 12: 2. Form Pulling
- Context: We are choosing items for next year's exam
- Clients like to look at psychometric info when picking items (e.g., MCAS)
- We know the stats ahead of time because items were field-tested
- Relevant stats: Test Characteristic Curves (TCCs), raw score cut points, Test Information Functions (TIFs)
Slide 13
- This stuff relates to Item Response Theory (IRT)
- The TCC is a plot that tells you the expected raw score for each value of ability (denoted theta)
- As ability increases, expected raw score increases (a minimal sketch follows)
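The talk doesn't name the IRT model, but under a 3PL model (an assumption here) the TCC is just the sum of the item response curves:

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct answer (the model choice is an assumption)."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def tcc(thetas, a, b, c):
    """Expected raw score at each ability value: sum of item probabilities."""
    return p_3pl(np.asarray(thetas)[:, None], a, b, c).sum(axis=1)

# Five hypothetical MC items (parameters made up for illustration)
a = np.array([1.0, 0.8, 1.2, 0.9, 1.1])
b = np.array([-1.0, -0.3, 0.0, 0.6, 1.2])
c = np.full(5, 0.2)
expected_raw = tcc(np.linspace(-3, 3, 61), a, b, c)  # plot this for the TCC
```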
Slide 14: Example of a TCC (5 Items)
[Figure: TCC for a 5-item test, plotting expected raw score against theta]
Slide 15: Raw Score Cut Points
- Suppose the test has 4 performance levels: Below Basic, Basic, Proficient, Advanced
- How many points do you need in order to reach the Basic level? Proficient? Advanced?
- Example: Test goes from 0 to 72. Need 35 to reach Basic, 51 to reach Proficient, 63 to reach Advanced
- Standard Setting often tells us theta cut points; clients want to know raw score cuts
Slide 16: Using the TCC to find a cut point
- Suppose the theta cut is 0.4
- Find the expected raw score at 0.4 using the TCC. It is 3.3
- The cut is placed between 3 and 4 (see the sketch below)
Slide 17: Test Information Functions
- TIFs tell us the test's precision at each level of ability
- The higher the curve, the more precision
- Easy items give us precision for low values of theta. Similarly:
  - Hard items give precision at high values
  - Medium items give precision at medium values (a sketch follows)
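Reusing `p_3pl` from the TCC sketch, the TIF is the sum of the item information functions; under the 3PL (again an assumption) this is:

```python
import numpy as np

def item_info_3pl(theta, a, b, c):
    """3PL item information function."""
    p = p_3pl(theta, a, b, c)
    return (1.7 * a) ** 2 * ((p - c) / (1 - c)) ** 2 * (1 - p) / p

def tif(thetas, a, b, c):
    """Test information at each ability value: sum of item informations."""
    return item_info_3pl(np.asarray(thetas)[:, None], a, b, c).sum(axis=1)
```

Easy items (low b) push the peak of this curve toward low theta and hard items push it toward high theta, which is what the bullets above describe.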
Slide 18: Example of a TIF
[Figure: TIF plotting test information against theta]
Slide 19: Why does the client care?
- It is often desired that next year's forms be similar to this year's forms
- Make sure the tests have the correct difficulty (TCC, raw score cut points) and precision (TIF)
- Match the TCCs, cut points, and TIFs of the two years
Slide 20: Why should the forms be similar?
- Theoretically, we should be able to account for differences through equating (Liz)
- However, we want the student experience to be similar from year to year
  - Don't want to give an easy test to the Class of '07 and a hard test to the Class of '08
  - Don't want to make this year's test less precise than last year's
Slide 21: Example: 2007 MCAS, Grade 10 Math
- The proposed 2007 TCC was lower than last year's
- Solution: Replace some hard items with easy items
Slide 22: Example, Continued
- The proposed 2007 TIF had less info at low abilities, more info at high abilities
- Solution:
  - Replace some hard items with easy items
  - Use hard items with lower PBS, easy items with higher PBS
Slide 23: Example, Continued
- The proposed 2007 raw score cuts were lower than the 2006 raw score cuts
- Solution: Replace some hard items with easy items
Slide 24: Guide to making changes
- Some rules of thumb for different problems
Slide 25: To sum up
- Item Response Theory is useful in form pulling
- TCCs, raw score cuts, and TIFs are often examined
- Proposed values should be similar to the current year's
  - Tests shouldn't be too easy or hard
  - Tests should be informative but not too informative
- It's helpful to know how we can change these things based on item stats
Slide 26: 3. Technical Manuals
- Things in technical manuals vary from program to program
- Often see some of the following:
  - P-values and point-biserials (thanks Louis!)
  - Test reliabilities (thanks Louis!)
  - TCCs and TIFs (thanks Mike!)
  - DIF (thanks Won!)
  - Standard Setting (thanks Liz and Abdullah!)
  - Equating (thanks in advance Liz!)
  - Inter-rater reliability (thanks for nothing!)
  - Decision consistency and accuracy (ditto)
Slide 27: Technical Manuals: P-Values and Point-Biserials
- You'll often see a table like this
Slide 28: Technical Manuals: Reliabilities (and other stats)
- Louis said: Reliability is the correlation between scores on parallel forms. Higher reliability → greater consistency
- You'll often see a table like this (a sketch of a common reliability estimate follows)
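Technical manuals rarely report an actual parallel-forms correlation; what usually appears is an internal-consistency estimate such as Cronbach's alpha (naming alpha here is an assumption, not something the talk specifies). A minimal sketch, assuming `scores` is a students-by-items matrix of item scores:

```python
import numpy as np

def cronbach_alpha(scores):
    """Internal-consistency reliability estimate from an item-score matrix."""
    k = scores.shape[1]                            # number of items
    item_var = scores.var(axis=0, ddof=1).sum()    # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of raw scores
    return k / (k - 1) * (1 - item_var / total_var)
```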
Slide 29: Technical Manuals: TCCs and TIFs
- Give the TCC and TIF of each grade / content area
Slide 30: Technical Manuals: DIF
- Won said: An item has DIF if the probability of getting the item right is dependent on group membership (e.g., gender, ethnic group)
- Measured Progress uses a method called the Standardized P-Difference
- Comparing groups:
  - Male-Female
  - White-Black
  - White-Hispanic
- Minimum 200 examinees in each group
Slide 31: DIF, Continued
- Items fall into one of three categories based on the standardized p-difference (sketched in code below):
  - A: between -0.05 and 0.05 (negligible DIF)
  - B: between -0.10 and -0.05, or between 0.05 and 0.10 (low DIF)
  - C: outside the -0.10 to 0.10 range (high DIF)
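A sketch of the Standardized P-Difference and the A/B/C rule above. The talk doesn't give the exact operational weighting, so here the focal-minus-reference difference in item p-values is computed at each raw-score level and weighted by the focal-group count at that level (a common standardization, stated as an assumption):

```python
import numpy as np

def standardized_p_difference(item, total, group, ref, focal):
    """Weighted average of focal-minus-reference p-value differences,
    taken within each raw-score level."""
    num = den = 0.0
    for s in np.unique(total):
        f = (total == s) & (group == focal)
        r = (total == s) & (group == ref)
        if f.sum() == 0 or r.sum() == 0:
            continue                     # no comparison possible at this level
        w = f.sum()                      # weight by focal-group count
        num += w * (item[f].mean() - item[r].mean())
        den += w
    return num / den

def dif_category(spd):
    """Apply the A/B/C rule from the list above."""
    if abs(spd) <= 0.05:
        return "A"                       # negligible
    if abs(spd) <= 0.10:
        return "B"                       # low
    return "C"                           # high
```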
Slide 32: DIF, Continued
- You may see a table like this
Slide 33: Technical Manuals: Standard Setting and Equating
- Liz and Abdullah discussed Standard Setting
- In technical manuals, you'll often see:
  - Report / summary of the standard setting process
  - Info about panelists (how many, who they are)
  - What method was used (e.g., Bookmark / Body of Work)
  - Cut points
  - Info about panelist evaluations
- Equating: Come next week and find out!
Slide 34: Inter-rater reliability
- When constructed-response items are rated by multiple scorers, how well do the raters agree?
- The more agreement, the better
- Exact agreement: What % of the time do they give the same score?
- Adjacent agreement: What % of the time are they off by 1? (Both rates are sketched below)
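A minimal sketch of the two agreement rates, given two raters' scores on the same set of responses (the data are made up):

```python
import numpy as np

def rater_agreement(r1, r2):
    """Exact and adjacent agreement rates between two raters."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    exact = np.mean(r1 == r2)                 # same score
    adjacent = np.mean(np.abs(r1 - r2) == 1)  # off by exactly one point
    return exact, adjacent

exact, adjacent = rater_agreement([2, 3, 1, 4, 0], [2, 2, 1, 4, 1])
print(f"Exact: {exact:.0%}, adjacent: {adjacent:.0%}")  # Exact: 60%, adjacent: 40%
```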
Slide 35: Decision Accuracy and Consistency: Introduction
- For most programs, there are four achievement levels, e.g., Below Basic, Basic, Proficient, Advanced
- Decision accuracy: the degree to which observed categorizations match true categorizations
- Decision consistency: the degree to which observed categorizations match those of a parallel form
Slide 36: Intuitive examples of accuracy
- TRUE LEVEL: Proficient; OBSERVED LEVEL: Proficient
  - DIAGNOSIS: ACCURATE (GOOD)
- TRUE LEVEL: Proficient; OBSERVED LEVEL: Below Basic
  - DIAGNOSIS: INACCURATE (BAD). False negative
- TRUE LEVEL: Basic; OBSERVED LEVEL: Advanced
  - DIAGNOSIS: INACCURATE (BAD). False positive
Slide 37: Intuitive examples of consistency
- OBSERVED LEVEL, Form 1: Basic; OBSERVED LEVEL, Form 2: Basic
  - DIAGNOSIS: CONSISTENT (GOOD)
- OBSERVED LEVEL, Form 1: Basic; OBSERVED LEVEL, Form 2: Advanced
  - DIAGNOSIS: INCONSISTENT (BAD)
Slide 38: Decision Accuracy and Consistency: Introduction
- Livingston and Lewis (1995) proposed a method of estimating decision accuracy/consistency
- For most programs, many stats are computed. We will give an example of each
- The stats are all based on joint distributions
  - A joint distribution gives the proportion of times that 2 things both happen
  - What proportion of students are truly Basic and are observed as Below Basic?
Slide 39: Joint Distribution: True/Observed Achievement Levels
[Table: joint distribution of true status by observed status; not transcribed]
Slide 40: Joint Distribution: Observed/Observed Achievement Levels
- Overall consistency: 0.6574
[Table: joint distribution of observed status on Form 1 by observed status on Form 2; not transcribed]
Slide 41: Indices Conditional upon Level
- Proportion of students correctly classified, given true level
- Proportion of students consistently classified by a parallel form, given observed level
Slide 42: Indices at Cut Points
- Accuracy and consistency at a specified cut point
- Accuracy: What is the chance that a student is classified on the correct side of a cut point?
- Consistency: What is the chance that a student is classified on the same side of a cut point twice? (See the sketch below)
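Once a joint distribution like those on the previous slides has been estimated (e.g., via the Livingston and Lewis method), all of these indices are simple sums over its cells. A sketch with a made-up 4x4 true-by-observed table:

```python
import numpy as np

# Hypothetical joint distribution: rows = true level, columns = observed
# level (Below Basic, Basic, Proficient, Advanced); entries sum to 1.
joint = np.array([
    [0.15, 0.04, 0.00, 0.00],
    [0.04, 0.20, 0.05, 0.00],
    [0.00, 0.05, 0.25, 0.03],
    [0.00, 0.00, 0.03, 0.16],
])

overall_accuracy = np.trace(joint)                  # matching classifications
by_true_level = np.diag(joint) / joint.sum(axis=1)  # accuracy given true level

cut = 2                                             # cut between Basic and Proficient
accuracy_at_cut = joint[:cut, :cut].sum() + joint[cut:, cut:].sum()
# Swap "true" for a second observed form and the same sums give overall
# consistency and consistency at each cut point.
```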
Slide 43: To sum up
- Lots of stuff in technical manuals
- Both classical test theory material (p-values, point-biserials, reliabilities) and IRT material (TCCs, TIFs, equating) are important to understand
- Hopefully, these seminars have helped familiarize you with their contents
Slide 44: 4. Standard Setting
- Comes up all the time outside Psychoville
- Should be a perfect topic for this talk, but...
- Liz and Abdullah already did a wonderful job
Slide 45: 4. Standard Setting
- Standard Setting is the process of recommending cut scores between achievement levels:
  - Advanced (A)
  - Proficient (P)
  - Below Proficient (BP)
  - Failing (F)
- Focus on one FAQ in Bookmark:
  - How do we determine the arrangement of items in the ordered item booklets?
Slide 46: Brief Review of Bookmark
- Each panelist makes use of the ordered item booklet (OIB)
- Items in the OIB are presented from easiest to hardest; one page per MC item
- The panelist's job is to place a bookmark in the OIB for each cut
- For a given cut, where do panelists place a bookmark?
  - Where they think borderline students would no longer have a 2/3 chance (or better) of a correct answer
- Abdullah said cut points are derived from bookmark placements
Slide 48: A Very Frequently Asked Question
- First, an FMC: "You messed up the order of the items!"
- Then, the FAQ: "Well, how did you determine the order?"
- Important: Order is based on actual student performance
- We use the concept of IRT
Slide 49: Two MC items: Which is easier?
[Figure: item characteristic curves for an easier item and a harder item]
Slide 50: Depending on the IRT model, this issue can become quite complex
Slide 51: An Intuitive Explanation
- An easy item: An item that even low-ability students get right a high proportion of the time
  - That is, students with small theta values tend to get it right
- Which item has the smallest theta value corresponding to a high probability of a correct answer?
  - How high a probability? Use 2/3 for consistency
- IN SUM: The easiest item is the one with the smallest theta corresponding to p = 2/3
  - The hardest item has the largest theta corresponding to p = 2/3
Slide 52: Use the 2/3 Criterion
- Easiest to hardest: orange, green, red, purple, blue
- Corresponding thetas: -0.6, 0.2, 0.3, 0.8, 1.2 (the ordering is sketched in code below)
[Figure: five item characteristic curves, with the theta at which each reaches p = 2/3 marked]
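Under the 3PL sketch from earlier, the theta at which an item reaches a 2/3 chance of a correct answer has a closed form, so ordering an OIB reduces to a sort (all parameters hypothetical):

```python
import numpy as np

def rp67_theta(a, b, c, rp=2/3):
    """Theta where a 3PL item's correct-answer probability equals rp.
    Solves c + (1-c)/(1+exp(-1.7a(theta-b))) = rp; requires rp > c."""
    return b + np.log((rp - c) / (1 - rp)) / (1.7 * a)

thresholds = rp67_theta(a, b, c)      # reuses the five items defined above
oib_order = np.argsort(thresholds)    # easiest (smallest theta) to hardest
```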
Slide 53: How about polytomous items?
- A polytomous item is one that has more than 2 possible scores
- MC items are dichotomous (0/1), not polytomous
- Example of a polytomous OR item: scored 0, 1, 2, 3, 4
- Such an OR item is in the OIB four times, once for each score point (1, 2, 3, 4)
- Where do you put this item's 4 pages in the OIB?
Slide 54: Incorporating Polytomous Items
- Just as with dichotomous items, we use IRT
- What theta do you need to have a 2/3 chance of getting a 1 or better? 2 or better? 3 or better? 4?
- The theta must increase as the score increases
- Suppose the results are -0.4, 0.4, 0.6, 1.8
- Merging with the five MC items from before (thetas -0.6, 0.2, 0.3, 0.8, 1.2), the OIB order becomes: orange (-0.6), score 1 (-0.4), green (0.2), red (0.3), score 2 (0.4), score 3 (0.6), purple (0.8), blue (1.2), score 4 (1.8) (a numeric sketch follows)
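For the polytomous pages the same idea applies, but P(score >= k) generally has no closed form. A sketch under a generalized partial credit model (the model and all parameters are assumptions, not from the talk), finding each threshold numerically:

```python
import numpy as np

def gpcm_probs(theta, a, steps):
    """Category probabilities for scores 0..m under a GPCM item."""
    z = np.cumsum(np.concatenate(([0.0], 1.7 * a * (theta - steps))))
    ez = np.exp(z - z.max())           # stabilized softmax over categories
    return ez / ez.sum()

def rp67_theta_poly(a, steps, k, rp=2/3):
    """Smallest theta (on a grid) where P(score >= k) reaches rp."""
    for theta in np.linspace(-4, 4, 8001):
        if gpcm_probs(theta, a, steps)[k:].sum() >= rp:
            return theta
    return np.nan

steps = np.array([-1.0, 0.0, 0.5, 1.5])          # hypothetical 0-4 item
pages = [rp67_theta_poly(1.0, steps, k) for k in (1, 2, 3, 4)]
# The four thresholds increase with the score point, so each page slots
# into the OIB wherever its theta falls among the MC items' thetas.
```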