1
Interpretation: How to Use Psychometrics
2
A Different Format
  • Previous talks were generally about one topic
  • Today's presentation: Where does this stuff come up at MP, outside of the psychos?
  • A little bit of info on several different things

3
The goals
  • Understand various psychometric analyses as they
    arise in day-to-day work
  • See which stats are used in different
    applications
  • Answer questions

4
Topics Covered
  • Things you'd find in a key verification file
  • Classical stats (p-values, point-biserials)
  • Things you'd find at a form pulling
  • IRT stats (TCCs, TIFs)
  • Things you'd find in a technical manual
  • All sorts of info
  • A question you'd hear at a standard setting
  • IRT

5
1. Key Verification Files
  • Purpose: To check the correctness of answer keys (MC items)
  • A list of items whose stats are unusual or merit
    further investigation
  • Items identified based on their p-values and/or
    point-biserials

6
  • P-value: The proportion of students answering an item correctly
  • How easy is the item?
  • Point-biserial: The correlation between item score and total score
  • If you do well on the item, do you tend to do
    well on the test?
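Both statistics are simple to compute from scored response data. Below is a minimal sketch (not Measured Progress code; the data and names are invented) using numpy; note that some programs use a corrected point-biserial that removes the item from the total score.

```python
import numpy as np

def p_value(item_scores):
    """Proportion of students answering the item correctly."""
    return float(np.mean(item_scores))

def point_biserial(item_scores, total_scores):
    """Correlation between the 0/1 item score and the total test score."""
    return float(np.corrcoef(item_scores, total_scores)[0, 1])

# Invented data: six students' scores on one MC item and on the whole test
item = np.array([1, 0, 1, 1, 0, 1])
totals = np.array([9, 3, 8, 7, 2, 6])
print(p_value(item))                 # 0.67 -> a fairly easy item
print(point_biserial(item, totals))  # strongly positive -> high scorers tend to get it right
```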

7
When might we be alarmed?
  • Not many kids are picking the right answer
  • The p-value is low (less than .25)
  • Low-performing kids are doing better on the item
    than high-performing kids
  • The point-biserial is low (less than .15)
  • and/or
  • If an incorrect answer choice has strange stats

8
Distractor Stats
  • Distractor p-value: The proportion of students picking the distractor (say, choice C when the correct answer is B)
  • How popular is choice C?
  • Flag item if distractor p-value is higher than .3
  • Distractor point-biserial: The correlation between picking the distractor and total test score
  • If you picked C, how well did you tend to do on
    the test?
  • Flag item if distractor PBS is positive
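The flagging rules from the last two slides can be written as a small screening routine. This is an illustrative sketch only, using the thresholds stated above (.25, .15, .3, and a positive distractor point-biserial); the function names are made up, not an actual MP tool.

```python
import numpy as np

def distractor_stats(choices, distractor, total_scores):
    """P-value and point-biserial for one answer choice (e.g., 'C')."""
    picked = (np.asarray(choices) == distractor).astype(float)
    return picked.mean(), np.corrcoef(picked, total_scores)[0, 1]

def flag_reasons(p, pbs, distractor_ps, distractor_pbs):
    """Reasons an MC item would be listed in a key verification file."""
    reasons = []
    if p < 0.25:
        reasons.append("low p-value")
    if pbs < 0.15:
        reasons.append("low point-biserial")
    if any(dp > 0.30 for dp in distractor_ps):
        reasons.append("very popular distractor")
    if any(dr > 0 for dr in distractor_pbs):
        reasons.append("distractor with positive point-biserial")
    return reasons

# The operational example on the next slide would be flagged on all four counts:
print(flag_reasons(p=0.10, pbs=-0.02, distractor_ps=[0.60], distractor_pbs=[0.20]))
```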

9
An Operational Example
  • A recent item had the following stats:
  • Key: D
  • P-value: 0.10
  • Point-biserial: -0.02
  • P-value for C: 0.60
  • Point-biserial for C: 0.20
  • So the key was wrong? Nope

10
How Can That Happen?
  • An example: What is the definition of the word 'travesty'?
  • A: Mockery
  • B: Injustice
  • C: Bellybutton
  • D: Some even stupider answer than bellybutton
  • Actual definition: Any grotesque or debased likeness or imitation
  • The correct answer is A, but 'travesty of justice' threw off the high-performing students

11
To sum up
  • Psychometrics can help us identify items whose
    keys need to be checked
  • Stats used
  • P-values
  • Point-biserials
  • Distractor p-values and point-biserials
  • P-values and point-biserials should be relatively high, distractor values should be relatively low
  • The key usually turns out to be right, but that's OK

12
2. Form Pulling
  • Context: We are choosing items for next year's exam
  • Clients like to look at psychometric info when
    picking items (e.g., MCAS)
  • We know the stats ahead of time because items
    were field-tested
  • Relevant stats Test Characteristic Curves
    (TCCs), raw score cut points, Test Information
    Functions (TIFs)

13
  • This stuff relates to Item Response Theory (IRT)
  • TCC is a plot that tells you the expected raw
    score for each value of ability (denoted theta)
  • As ability increases, expected raw score increases
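A minimal sketch of how a TCC is built up from item response functions. The slides don't name the IRT model, so the 3PL form and the five (a, b, c) parameter sets below are assumptions for illustration only.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct answer on one MC item under the 3PL model."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Expected raw score at ability theta = sum of the item probabilities."""
    return sum(p_correct_3pl(theta, a, b, c) for a, b, c in items)

# Five invented items, each given as (a, b, c)
items = [(1.0, -1.0, 0.2), (0.8, -0.3, 0.2), (1.2, 0.0, 0.2),
         (0.9, 0.5, 0.2), (1.1, 1.2, 0.2)]
for th in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(th, round(tcc(th, items), 2))   # expected raw score rises with theta
```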

14
Example of a TCC: 5 Items
15
Raw Score Cut Points
  • Suppose a test has 4 performance levels: Below Basic, Basic, Proficient, Advanced
  • How many points do you need in order to reach the Basic level? Proficient? Advanced?
  • Example: Test goes from 0 to 72. Need 35 to reach Basic, 51 to reach Proficient, 63 to reach Advanced
  • Standard Setting often tells us theta cut points; clients want to know raw score cuts

16
Using the TCC to find a cut point
  • Suppose the theta cut is 0.4
  • Find the expected raw score at 0.4 using the TCC; it is 3.3
  • The cut is placed between 3 and 4
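The lookup described above, reusing tcc() and items from the previous sketch. The rule for turning 3.3 into a raw-score cut (round up to the next attainable score) is my reading of "placed between 3 and 4", not a documented MP convention.

```python
import math

theta_cut = 0.4
expected_raw = tcc(theta_cut, items)      # the slide's example value is 3.3; the invented items give a similar number
raw_score_cut = math.floor(expected_raw) + 1   # cut falls between 3 and 4 -> need 4 points
print(round(expected_raw, 2), raw_score_cut)
```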

17
Test Information Functions
  • TIFs tell us the test precision at each level of
    ability
  • The higher the curve, the more precision
  • Easy items give us precision at low values of theta. Similarly:
  • Hard items give precision at high values
  • Medium items give precision at medium values
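A sketch of a TIF under the same assumed 3PL items as the TCC sketch, using the standard 3PL item-information formula; the higher the curve at a given theta, the more precisely the test measures students at that ability.

```python
import numpy as np

def item_info_3pl(theta, a, b, c, D=1.7):
    """Fisher information contributed by one 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def tif(theta, items):
    """Test information = sum of the item information functions."""
    return sum(item_info_3pl(theta, a, b, c) for a, b, c in items)

items = [(1.0, -1.0, 0.2), (0.8, -0.3, 0.2), (1.2, 0.0, 0.2),
         (0.9, 0.5, 0.2), (1.1, 1.2, 0.2)]   # same invented items as before
for th in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(th, round(tif(th, items), 2))   # higher value = more precision at that theta
```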

18
Example of a TIF
19
Why does the client care?
  • It is often desired that next year's forms are similar to this year's forms
  • Make sure tests have the correct difficulty (TCC, RS cut points) and precision (TIF)
  • Match the TCCs, cut points, and TIFs of the two years

20
Why should the forms be similar?
  • Theoretically, we should be able to account for
    differences through equating (Liz)
  • However, we want the student experience to be similar from year to year
  • Don't want to give an easy test to the Class of '07 and a hard test to the Class of '08
  • Don't want to make this year's test less precise than last year's

21
Example: 2007 MCAS, Grade 10 Math
  • Proposed 2007 TCC was lower than last year's
  • Solution: Replace some hard items with easy items

22
Example, Continued
  • Proposed 2007 TIF had less info at low abilities,
    more info at high abilities
  • Solution:
  • Replace some hard items with easy items
  • Use hard items with lower PBS, easy items with
    higher PBS

23
Example, Continued
  • Proposed 2007 raw score cuts lower than 2006 raw
    score cuts
  • Solution: Replace some hard items with easy items

24
Guide to making changes
  • Some rules-of-thumb for different problems

25
To sum up
  • Item Response Theory is useful in form pulling
  • TCCs, raw score cuts, TIFs are often examined
  • Proposed values should be similar to the current year's
  • Tests shouldn't be too easy or hard
  • Tests should be informative, but not too informative
  • It's helpful to know how we can change these things based on item stats

26
3. Technical Manuals
  • Things in Technical Manuals vary from program to
    program
  • Often see some of the following
  • P-values and point-biserials (thanks Louis!)
  • Test reliabilities (thanks Louis!)
  • TCCs and TIFs (thanks Mike!)
  • DIF (thanks Won!)
  • Standard Setting (thanks Liz and Abdullah!)
  • Equating (thanks in advance Liz!)
  • Inter-rater reliability (thanks for nothing!)
  • Decision consistency and accuracy (ditto)

27
Technical Manuals: P-Values and Point-Biserials
  • You'll often see a table like this

28
Technical Manuals: Reliabilities (and Other Stats)
  • Louis said: Reliability is the correlation between scores on parallel forms. Higher reliability → greater consistency
  • You'll often see a table like this
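A tiny sketch of the definition quoted above: reliability estimated as the correlation between total scores on two parallel forms (the scores below are invented).

```python
import numpy as np

# Invented total scores for the same eight students on two parallel forms
form1 = np.array([52, 38, 61, 45, 70, 33, 58, 49])
form2 = np.array([50, 41, 59, 47, 68, 30, 60, 46])
reliability = np.corrcoef(form1, form2)[0, 1]
print(round(reliability, 2))   # near 1.0 -> scores are highly consistent
```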

29
Technical Manuals: TCCs and TIFs
  • Give the TCC and TIF of each grade / content area

30
Technical Manuals: DIF
  • Won said: An item has DIF if the probability of getting the item right is dependent on group membership (e.g., gender, ethnic group)
  • Measured Progress uses a method called the Standardized P-Difference
  • Comparing groups:
  • Male-Female
  • White-Black
  • White-Hispanic
  • Minimum of 200 examinees in each group

31
DIF, Continued
  • Items are classified by the size of the standardized p-difference:
  • A: between -0.05 and 0.05 (negligible)
  • B: between -0.10 and -0.05, or between 0.05 and 0.10 (low)
  • C: outside the -0.10 to 0.10 range (high)
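The deck names the statistic and the A/B/C cutoffs but not the computation; the sketch below follows the usual standardized p-difference idea (compare focal- and reference-group p-values within matched total-score groups, weighting by the focal group), so treat the details as a reconstruction rather than MP's exact procedure.

```python
import numpy as np

def standardized_p_difference(item, total, group, ref, focal):
    """Focal-minus-reference difference in item p-values, averaged across
    matched total-score groups and weighted by focal-group counts."""
    item, total, group = map(np.asarray, (item, total, group))
    diffs, weights = [], []
    for s in np.unique(total):
        ref_mask = (total == s) & (group == ref)
        foc_mask = (total == s) & (group == focal)
        if ref_mask.any() and foc_mask.any():
            diffs.append(item[foc_mask].mean() - item[ref_mask].mean())
            weights.append(foc_mask.sum())
    return float(np.average(diffs, weights=weights))

def dif_category(spd):
    """A = negligible, B = low, C = high, using the cutoffs above."""
    if abs(spd) <= 0.05:
        return "A"
    return "B" if abs(spd) <= 0.10 else "C"
```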
32
DIF, Continued
  • You may see a table like this

33
Technical Manuals: Standard Setting and Equating
  • Liz and Abdullah discussed Standard Setting
  • In technical manuals, you'll often see:
  • Report / summary of the standard setting process
  • Info about panelists (how many, who they are)
  • What method was used (e.g., bookmark / Body of Work)
  • Cut points
  • Info about panelist evaluations
  • Equating: Come next week and find out!

34
Inter-rater reliability
  • When constructed-response items are rated by
    multiple scorers, how well do raters agree?
  • The more agreement, the better
  • Exact agreement: What % of the time do they give the same score?
  • Adjacent agreement: What % of the time are they off by 1?
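A short sketch of exact and adjacent agreement for two raters scoring the same set of constructed responses (the ratings are invented).

```python
import numpy as np

rater1 = np.array([3, 2, 4, 1, 3, 2, 0, 4])
rater2 = np.array([3, 2, 3, 1, 4, 2, 1, 4])
gap = np.abs(rater1 - rater2)
exact = np.mean(gap == 0)      # % of papers given the same score
adjacent = np.mean(gap == 1)   # % of papers where the raters differ by exactly 1
print(f"exact {exact:.0%}, adjacent {adjacent:.0%}")
```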

35
Decision Accuracy and Consistency Introduction
  • For most programs, four achievement levels, e.g.,
    Below Basic, Basic, Proficient, Advanced
  • Decision accuracy: the degree to which observed categorizations match true categorizations
  • Decision consistency: the degree to which observed categorizations match those of a parallel form

36
Intuitive examples of accuracy
  • TRUE LEVEL: Proficient
  • OBSERVED LEVEL: Proficient
  • DIAGNOSIS: ACCURATE (GOOD)
  • TRUE LEVEL: Proficient
  • OBSERVED LEVEL: Below Basic
  • DIAGNOSIS: INACCURATE (BAD). False negative
  • TRUE LEVEL: Basic
  • OBSERVED LEVEL: Advanced
  • DIAGNOSIS: INACCURATE (BAD). False positive

37
Intuitive examples of consistency
  • OBSERVED LEVEL, Form 1: Basic
  • OBSERVED LEVEL, Form 2: Basic
  • DIAGNOSIS: CONSISTENT (GOOD)
  • OBSERVED LEVEL, Form 1: Basic
  • OBSERVED LEVEL, Form 2: Advanced
  • DIAGNOSIS: INCONSISTENT (BAD)

38
Decision Accuracy and Consistency Introduction
  • Livingston and Lewis (1995) proposed a method of estimating decision accuracy and consistency
  • For most programs, many stats are computed. We will give an example of each
  • The stats are all based on joint distributions
  • A joint distribution gives the proportion of
    times that 2 things both happen.
  • What proportion of students are truly Basic and
    are observed as Below Basic?

39
Joint Distribution: True/Observed Achievement Levels
  • Overall accuracy: 0.7484

[Table: proportion of students at each combination of true status and observed status]
40
Joint Distribution: Observed/Observed Achievement Levels
  • Overall consistency: 0.6574

[Table: proportion of students at each combination of observed status on Form 1 and observed status on Form 2]
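A sketch of how the two overall indices come out of joint distributions like the ones summarized above. The 4x4 matrices are invented (the real tables are not reproduced here); the point is that each overall index is the sum of the diagonal.

```python
import numpy as np

# Invented joint distribution: true status (rows) x observed status (columns),
# level order Below Basic, Basic, Proficient, Advanced; proportions sum to 1
true_vs_observed = np.array([
    [0.17, 0.04, 0.00, 0.00],
    [0.05, 0.25, 0.05, 0.00],
    [0.00, 0.05, 0.23, 0.03],
    [0.00, 0.00, 0.03, 0.10],
])

# Invented joint distribution: observed status on Form 1 x observed status on Form 2
form1_vs_form2 = np.array([
    [0.15, 0.05, 0.01, 0.00],
    [0.05, 0.20, 0.06, 0.01],
    [0.01, 0.06, 0.19, 0.05],
    [0.00, 0.01, 0.05, 0.10],
])

overall_accuracy = np.trace(true_vs_observed)     # proportion on the diagonal
overall_consistency = np.trace(form1_vs_form2)
print(overall_accuracy, overall_consistency)
```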
41
Indices Conditional upon Level
  • Proportion of students correctly classified,
    given true level
  • Proportion of students consistently classified by
    parallel form, given observed level

42
Indices at Cut Points
  • Accuracy and consistency at a specified cut point
  • Accuracy: What is the chance that a student is classified on the correct side of a cut point?
  • Consistency: What is the chance that a student is classified on the same side of a cut point twice?
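Continuing from the invented matrices in the previous sketch, here is one way to compute the conditional and cut-point indices described on the last two slides (levels indexed 0-3 in the order Below Basic, Basic, Proficient, Advanced).

```python
import numpy as np

# Accuracy conditional on true level: P(correctly classified | true level)
conditional_accuracy = np.diag(true_vs_observed) / true_vs_observed.sum(axis=1)

# Accuracy at a cut point: chance a student lands on the correct side of the cut.
# Example: the Basic/Proficient cut, i.e., levels 0-1 below it and 2-3 at or above it.
cut = 2
accuracy_at_cut = (true_vs_observed[:cut, :cut].sum() +
                   true_vs_observed[cut:, cut:].sum())

# Consistency at the same cut, from the Form 1 x Form 2 matrix
consistency_at_cut = (form1_vs_form2[:cut, :cut].sum() +
                      form1_vs_form2[cut:, cut:].sum())

print(np.round(conditional_accuracy, 2), round(accuracy_at_cut, 2), round(consistency_at_cut, 2))
```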

43
To sum up
  • Lots of stuff in technical manuals
  • Both classical test theory material (p-values, point-biserials, reliabilities) and IRT material (TCCs, TIFs, equating) are important to understand
  • Hopefully, these seminars have helped familiarize
    you with their contents

44
4. Standard Setting
  • Comes up all the time outside Psychoville
  • Should be a perfect topic for this talk, but
  • Liz and Abdullah already did a wonderful job

45
4. Standard Setting
  • Standard Setting is the process of recommending
    cut scores between achievement levels
  • Advanced (A)
  • Proficient (P)
  • Below Proficient (BP)
  • Failing (F)
  • Focus on one FAQ about the bookmark method:
  • How do we determine the arrangement of items in the ordered item booklets?

46
Brief Review of Bookmark
  • Each panelist makes use of the ordered item
    booklet
  • Items in the OIB are presented from easiest to
    hardest. One page per MC item
  • The panelist's job is to place a bookmark in the OIB for each cut
  • For a given cut, where do panelists place a
    bookmark?
  • Where they think borderline students would no
    longer have a 2/3 chance (or better) of a correct
    answer
  • Abdullah said cut points are derived from
    bookmark placements

47
(No Transcript)
48
A Very Frequently-Asked Question
  • First, an FMC: 'You messed up the order of the items!'
  • Then, the FAQ: 'Well, how did you determine the order?'
  • Important: Order is based on actual student performance
  • We use the concept of IRT

49
Two MC items: Which is easier?

[Figure: response curves for the two items, labeled 'Easier item' and 'Harder item']
50
Depending on the IRT model, this issue can become quite complex
51
An Intuitive Explanation
  • An easy item: an item that even low-ability students get right a high proportion of the time
  • That is, students with small theta values tend to get it right
  • Which item has the smallest theta value corresponding to a high probability of a correct answer?
  • How high a probability? Use 2/3 for consistency
  • IN SUM: The easiest item is the one with the smallest theta corresponding to p = 2/3
  • The hardest has the largest theta corresponding to p = 2/3
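A sketch of the 2/3 rule: for each item, find the theta at which the probability of a correct answer reaches 2/3, then sort. The closed-form solution below assumes a 2PL-style item (a 3PL item would need a numeric solve), and the (a, b) values are made up to reproduce the ordering on the next slide.

```python
import numpy as np

def theta_at_two_thirds(a, b, D=1.7):
    """Theta where P(correct) = 2/3 for a 2PL item: solve 1/(1+exp(-D*a*(t-b))) = 2/3."""
    return b + np.log(2) / (D * a)

items = {"orange": (1.0, -0.9), "green": (1.1, -0.1), "red": (0.9, 0.0),
         "purple": (1.0, 0.4), "blue": (1.2, 0.9)}   # invented (a, b) pairs

order = sorted(items, key=lambda name: theta_at_two_thirds(*items[name]))
print(order)   # easiest first: ['orange', 'green', 'red', 'purple', 'blue']
```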

52
Use the 2/3 Criterion
  • Easiest to hardest: orange, green, red, purple, blue
  • Thetas: -0.6, 0.2, 0.3, 0.8, 1.2

53
How about polytomous items?
  • A polytomous item is one that has more than 2 possible scores
  • MC items are dichotomous (0/1), not polytomous
  • Example of a polytomous OR item: scored 0, 1, 2, 3, 4
  • Such an OR item is in the OIB four times, once for each score point 1, 2, 3, 4
  • Where do you put this item's 4 pages in the OIB?

54
Incorporating polytomous items
  • Just as with dichotomous items, we use IRT
  • What theta do you need to have a 2/3 chance of getting a 1 or better? 2 or better? 3 or better? 4?
  • The theta must increase as the score increases
  • Suppose the results are -0.4, 0.4, 0.6, 1.8
  • The four pages are then placed among the MC items (easiest to hardest: orange, green, red, purple, blue; thetas -0.6, 0.2, 0.3, 0.8, 1.2) in order of theta
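A sketch of merging the four OR pages into the ordered item booklet: each page carries the theta at which a borderline student has a 2/3 chance of earning that score or better, and all pages (MC and OR) are sorted together. The thetas below are the ones quoted on the last two slides.

```python
# Pages already in the OIB (MC items) and the four pages of the 0-4 OR item,
# each tagged with its 2/3-criterion theta
mc_pages = [("orange MC", -0.6), ("green MC", 0.2), ("red MC", 0.3),
            ("purple MC", 0.8), ("blue MC", 1.2)]
or_pages = [("OR score 1", -0.4), ("OR score 2", 0.4),
            ("OR score 3", 0.6), ("OR score 4", 1.8)]

oib_order = sorted(mc_pages + or_pages, key=lambda page: page[1])
for name, theta in oib_order:
    print(f"{theta:5.1f}  {name}")   # easiest page first, hardest last
```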
