Title: Introduction to Modern Measurement: SEM and IRT
1. Introduction to Modern Measurement: SEM and IRT
- Paul Crane, MD MPH
- Rich Jones, ScD
- Friday Harbor 2006
2. Outline
- Intro and definitions
- Factor analysis and link to IRT
- Parametric IRT models
- Information curves and rational test design
- Model assumptions
- Applications in cognition
3. Use of modern measurement tools
- Educational testing since 1968; most educational tests are built using this framework: SATs, GREs, MCATs, LSATs, and other high-stakes tests
- Increasing in psychology in general
- In the medical arena, increasing use in patient-reported outcomes (health-related quality of life, depression, etc.): PROMIS
- Most people who have used these tools on
cognitive functioning tests are in this room
4. Definitions of measurement
- "The assignment of numerals to objects or events according to some rule" (Stevens, 1946)
- "The numerical estimation and expression of the magnitude of one quantity relative to another" (Michell, 1997)
- These definitions imply different perspectives on what we do with test data
5. Purposes of measurement
- (After Kirshner and Guyatt)
- Discriminative purpose
- Evaluative purpose
- Predictive purpose
- Statistical properties desired differ according to the purpose of measurement
- Tests are often used for multiple purposes, without necessarily documenting appropriateness for the intended purpose
- We'll come back to this
6. Latent traits or abilities
- The underlying thing we are trying to measure with a test
- Can't be directly observed (hence "latent")
- Causes observable behavior (such as responses on a test)
- A task of psychometrics is to determine levels of latent traits or abilities based on responses to test items
7Factor analysis
- Factor analysis history intimately involved with
history of psychometrics (Spearman, Thurstone) - Analyze a covariance (correlation) matrix
- Exploratory factor analysis / principal
components analysis - Identify a factor that explains most of the
variance in the matrix - Extract the factor and determine whats left
(residual matrix) - (Rotate)
- Repeat
8. CFA
- Theory driven
- Some relationships between specific factors and indicators are specified to be 0
- Fit is always worse than EFA (which can be proven to have optimal fit)
- Single-factor CFA is very useful
9. Picture of single-factor CFA
[Path diagram: a single latent trait with arrows to Items 1-6]
10. Relationship between CFA and IRT
- "I have long believed that a very general version of the common factor model supplies a rigorous and unified treatment of the major concepts and techniques of test theory. The implicit unifying principle throughout the following pages is a general nonlinear common factor model, which includes item response models as special cases, and includes also the (linear) common factor model as an approximation."
- Roderick McDonald, Test Theory: A Unified Treatment (1999), p. x
11. Single common factor math
- X1 = λ1·F + E1
- X2 = λ2·F + E2
- X3 = λ3·F + E3
- ...
- Item 1's response (X1) is the sum of a common part (the loading λ1 times the amount of the factor F) plus a unique part (E1); see the simulation sketch below
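As a concrete illustration of these equations, here is a minimal simulation sketch in Python (numpy only); the loadings and sample size are hypothetical, not from any study discussed here.

```python
# Simulate the single common factor model: X_i = lambda_i * F + E_i
import numpy as np

rng = np.random.default_rng(42)
n = 1000                                # subjects (hypothetical)
loadings = np.array([0.8, 0.7, 0.6])    # lambda_1..lambda_3 (hypothetical)

F = rng.normal(0.0, 1.0, n)             # the common factor
E = rng.normal(0.0, 1.0, (n, 3))        # unique parts, one column per item
X = F[:, None] * loadings + E           # each item = common part + unique part

# Implied item intercorrelations are driven by products of the loadings
print(np.corrcoef(X, rowvar=False).round(2))
```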
12. Enhanced picture of single-factor CFA
[Path diagram: factor F with loadings λ1-λ6 pointing to indicators X1-X6, each indicator with its own unique term E1-E6]
13. Dichotomous / categorical items
- Our picture is of continuous predictors X1-X6
- In practice, items are not continuous; they are categorical or dichotomous
- This has implications for the unique parts (E1-E6)
- A new parameter is needed for the threshold(s)
14. (Tetrachoric and polychoric)
- Maps a continuous underlying level (X*j) to a dichotomous indicator (Xj)
- Requires a threshold parameter τj
- If X*j > τj, Xj = 1
- If X*j ≤ τj, Xj = 0
- Observed data for Xj and Xk are used to estimate X*j and X*k, and the correlation between X*j and X*k is the tetrachoric correlation; a simulation sketch follows this list
- Extension to >2 categories: the polychoric correlation
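A small simulation makes the threshold idea concrete: the ordinary correlation of the dichotomized items understates the latent correlation, which is why the tetrachoric estimator is needed. All values below (ρ, thresholds, sample size) are hypothetical.

```python
# Dichotomize two correlated latent variables at thresholds tau_j, tau_k
import numpy as np

rng = np.random.default_rng(7)
rho = 0.6                                   # latent correlation we hope to recover
latent = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=5000)

tau_j, tau_k = 0.5, -0.2                    # threshold parameters (hypothetical)
Xj = (latent[:, 0] > tau_j).astype(int)     # Xj = 1 if X*_j > tau_j, else 0
Xk = (latent[:, 1] > tau_k).astype(int)

# The Pearson correlation of the 0/1 items is attenuated relative to rho = 0.6;
# a tetrachoric estimate computed from the 2x2 table would be close to 0.6
print(np.corrcoef(Xj, Xk)[0, 1].round(2))
```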
15. (Tetrachoric and polychoric picture)
Thanks, Rich!
16. Item response models
- Introduce a new character: θ (theta)
- We met θ as F before
- The (level of the) underlying trait (ability) measured by the test
- Starts with dichotomous (polytomous) items rather than with a continuous correlation matrix, but ends up in the same place
17. (Nonparametric IRT)
- Monotonic increasing relationship between theta and item responses
- Obtains ordinal relationships among test takers
- Can use software to determine whether the shapes of the curves look parametric
- Sijtsma and Molenaar, Introduction to Nonparametric Item Response Theory (2002)
- MSP5 for Windows
- (I am aware of only one non-parametric paper published in medical settings. Thanks, Rich)
18. Parametric IRT
- (Misquoting Box) "All parameterizations are wrong, but some are useful."
- Cumulative normal vs. logistic: the logistic has won because of ease of computation; there is no practical difference
- Number of parameters: up to 4PL models (!)
- Extensions to polytomous items vary
19. 1PL item characteristic curves
20. 1PL (aka Rasch model)
- The single parameter for the item is item difficulty (b)
- P(Y = 1 | θ, b) = exp(θ - b) / [1 + exp(θ - b)]
- Mathematically the same to write it as [1 + exp(-(θ - b))]^(-1)
- The distance between a subject's latent trait level θ and the item's difficulty level b determines the probability of endorsing the item (or getting the item right); see the sketch below
- All of the loadings for all of the items are fixed (λ1 = λ2 = ... = λk)
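The 1PL formula translates directly into code; this sketch just evaluates the equation above (the parameter values are illustrative).

```python
import numpy as np

def p_1pl(theta, b):
    """1PL: P(Y = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# A subject one logit above an item's difficulty endorses it with p ~ 0.73
print(p_1pl(theta=1.0, b=0.0).round(2))
```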
21. 2PL item characteristic curves
222PL
- A second parameter is added to account for
varying strengths of relationship between items
and ?. Known as discrimination (a). - P(Y1?,a,b) exp(a(?-b))/1exp(a(?-b))
- Relationship between ? and b still drives item
responses - The a parameter allows our loadings to vary
- The constant D (1.702) is often included in
formulas. This constant makes the logistic
curves approximate the normal ogive curves - P(Y1?,a,b) exp(Da(?-b))/1exp(Da(?-b))
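The same sketch extended with discrimination a and the scaling constant D; the two items compared here are hypothetical, chosen to show how a changes the steepness of the curve near θ = b.

```python
import numpy as np

def p_2pl(theta, a, b, D=1.702):
    """2PL: P(Y = 1 | theta, a, b) = exp(D*a*(theta - b)) / (1 + exp(D*a*(theta - b)))."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

# Same theta and b; the high-discrimination item is pushed much closer to 0 or 1
print(p_2pl(theta=0.5, a=2.0, b=0.0).round(2),   # steep curve
      p_2pl(theta=0.5, a=0.5, b=0.0).round(2))   # shallow curve
```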
23. 1PL vs. 2PL
- Vociferous debates in the educational testing literature, related to "specific objectivity" from the Rasch literature
- (This has not been an issue in medicine)
- The difficulty parameter is MUCH more important for subject scores than the discrimination parameter
- θ scores estimated from the 1PL and 2PL models are incredibly highly correlated
- The 2PL model provides additional insight into how good the items may be
24. (3PL and 4PL)
- The 3PL incorporates a guessing parameter: even subjects with very low ability will have a non-zero probability of picking the correct answer at random on a multiple-choice test
- P(Y = 1 | θ, a, b, c) = c + (1 - c)·exp(Da(θ - b)) / [1 + exp(Da(θ - b))]
- The 4PL incorporates an "attractive distractor" parameter: even subjects with very high ability may be distracted by a nearly correct alternative
- Neither model is relevant for tests that do not have multiple-choice response formats (a 3PL sketch follows)
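A sketch of the 3PL formula from this slide; the c = 0.25 value is just the chance level for a hypothetical four-option multiple-choice item.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.702):
    """3PL: P = c + (1 - c) * exp(D*a*(theta - b)) / (1 + exp(D*a*(theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Even a very low-ability subject stays at the guessing floor of about 0.25
print(p_3pl(theta=-4.0, a=1.0, b=0.0, c=0.25).round(2))
```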
25. (Polytomous IRT)
- 2PL extension: the Graded Response Model (GRM), Samejima (1969), parallel to the proportional odds model for ordinal logistic regression
- 1PL extensions: the Partial Credit Model (PCM), the Generalized Partial Credit Model (GPCM), and the Rating Scale Model (RSM)
26. Reliability vs. information
- Reliability is a key feature of classical test theory
- McDonald's omega, Cronbach's alpha, KR-20 (Kuder-Richardson formula 20)
- Provides a single number for the proportion of the total score that is "true"
- Assumes measurement error is constant across a test
- In IRT, the focus shifted to information
- Information is analogous to measurement precision
- It varies according to item parameters
27. Information - intuitive
- If there are a lot of hard items, there will be relatively more measurement precision for individuals with large θ scores
- If there are few hard items, there will be less measurement precision for individuals with large θ scores
- Precision for individuals with large θ scores tells us nothing about precision for individuals with small θ scores (we would need to know about easy items rather than hard items)
28. (Information formulas)
- General: I(θ) = [P′(θ)]² / (P(θ)[1 - P(θ)])
- P′(θ) is the first derivative of P(θ)
- 2PL model: I(θ) = D²a²P(θ)[1 - P(θ)]
- The P(θ)[1 - P(θ)] part makes a hill around the point where θ = b
- P(θ) approaches 0 as θ << b
- 1 - P(θ) approaches 0 as θ >> b
- D² is a constant
- a² is proportional to the height of the information curve around the point where θ = b; see the sketch below
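The 2PL information formula in code, evaluated over a grid of θ values to show the "hill" peaking at θ = b (item parameters are hypothetical).

```python
import numpy as np

def info_2pl(theta, a, b, D=1.702):
    """2PL item information: I(theta) = D^2 * a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return D**2 * a**2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 7)
print(info_2pl(theta, a=1.5, b=0.0).round(2))  # peaks at theta = b = 0
```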
29. Information - test
- Test information is simply the sum of all of the item information curves
- (assuming local independence)
- SEM(θ) = 1/√I(θ); see the sketch below
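Continuing the previous sketch: summing hypothetical item information curves gives the test information curve, and inverting its square root gives the SEM at each θ.

```python
import numpy as np

def info_2pl(theta, a, b, D=1.702):
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return D**2 * a**2 * p * (1.0 - p)

items = [(1.2, -1.0), (0.9, 0.0), (1.5, 1.0)]   # hypothetical (a, b) pairs
theta = np.linspace(-3, 3, 7)

test_info = sum(info_2pl(theta, a, b) for a, b in items)  # needs local independence
sem = 1.0 / np.sqrt(test_info)                            # SEM = 1 / sqrt(I(theta))
print(sem.round(2))                                       # precision varies with theta
```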
30. Kirshner and Guyatt revisited
31. SEM for common screening tests
[Figure legend: red = MMSE, blue = 3MS, black = CASI, green = CSI-D]
32. Rational test design
- Start with an item bank (items all in the same domain whose parameters have been established)
- Select items in order to fill out a specific information curve; a greedy sketch follows this list
- The shape of the curve may vary based on the purpose of measurement
- Mungas et al. matched information curves for memory, executive functioning, and global cognition
- My workgroup is replicating this specific aspect of the project
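One simple way to "fill out" a target information curve is a greedy search over a calibrated bank: repeatedly add the item that most reduces the remaining shortfall. This is only an illustrative sketch (the bank, grid, and flat target below are all hypothetical), not the procedure Mungas et al. used.

```python
import numpy as np

def info_2pl(theta, a, b, D=1.702):
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return D**2 * a**2 * p * (1.0 - p)

rng = np.random.default_rng(0)
bank = list(zip(rng.uniform(0.5, 2.0, 50), rng.uniform(-2, 2, 50)))  # (a, b) pairs
grid = np.linspace(-3, 3, 61)
target = np.full_like(grid, 5.0)      # flat target: equal precision everywhere

chosen, achieved = [], np.zeros_like(grid)
while bank and np.any(achieved < target):
    # shortfall remaining if each candidate item were added next
    gaps = [np.sum(np.maximum(target - achieved - info_2pl(grid, a, b), 0.0))
            for a, b in bank]
    a, b = bank.pop(int(np.argmin(gaps)))
    chosen.append((a, b))
    achieved += info_2pl(grid, a, b)

print(len(chosen), "items selected from the bank")
```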
33. Parameterization isn't free
- There are assumptions to IRT and to the models we use:
- Unidimensionality
- Local independence
- The model fits the data well
- We'll look at each of these assumptions in turn
34. Unidimensionality
- Intuitively, IRT is a single-factor model fit to the data. If it is a bad idea to use a single-factor model on the data, then IRT is a bad idea too.
- Pure unidimensionality is impossible; emphasis has shifted to "essential" or "sufficient" unidimensionality
- Analyses focus on the residual covariance matrix from the single common factor model
- A strong factor emerging from the residual covariance matrix hints that a single factor may not explain the data well
35. Bi-factor model and unidimensionality
- Instead of an EFA approach on the residual covariance matrix, use a CFA approach
- Suggested by McDonald (1999, 1990); recently have seen more of this (Atlanta, 2/2006)
- Theory precedes analysis: you have to have pre-specified subdomains
36. Bi-factor model of executive functioning
37. Standardized loadings (>0.30)
38. Local independence
- Residual correlations should be no larger than chance
- Violations: several items related to a common passage (a "testlet"); trials on a memory test
- It is not clear how robust the models are to violations of local independence
- (Information curve height may be artificially high if there are locally dependent items)
39. Model fit
- No established criteria for 2PL and polytomous items
- (But see the Bjorner et al. 2005 poster at ISOQOL)
- χ² statistics produced by PARSCALE are sample-size dependent
- There are fit statistics that have been developed for SEM models, and benchmarks have been established
- RMSEA and CFI are available from MPLUS for categorical items (Rich, is that right?)
40. Why bother?
- Theoretical reasons:
- A rational scoring system based on empirical relationships rather than unlikely assumptions of pre-specified weights and thresholds
- Linearity of the IRT scale
- Utility of information (as opposed to alpha)
- Practical reasons: we have found stronger relationships with outcomes of interest (imaging, dementia status) for IRT scores than for standard scores
- There are a lot more things we can do with IRT than with CTT (see next slide)
41. Applications of IRT - cognition
- Co-calibration of scales (FH 2004)
- Generate psychometrically matched scales (same information curves) simultaneously in Spanish and English (Dan's work on the SENAS)
- Determine and account for differential item functioning (DIF) (FH 2005)
- Determine and account for varying measurement precision (FH 2004)
- A single composite score instead of multiple scores for analyses (FH 2006)
42. How to do it?
- Off-the-shelf IRT package, e.g., PARSCALE (FH 2006)
- Off-the-shelf SEM package, e.g., MPLUS (FH 2006)
- Increasingly common: write your own Bayesian code using WINBUGS (FH 2008???)
- SAS PROC NLMIXED (Sheu et al. 2005) (FH 2008?)
- STATA user-written code?
43. Practical considerations
- Data requirements:
- A large data set (500 or so)
- Long enough scales (5 items is an absolute minimum; more is better)
- A sample with heterogeneous levels of θ
- Software fails with sparse response categories
- We will demonstrate a STATA program that runs PARSCALE (runparscale)
- We will also demonstrate a STATA program that runs MPLUS (runmplus)
44. New topic: DIF
- DIF defined: an item has DIF if, when controlling for θ, subjects in different groups have different probabilities of item responses for that item
- Item-level bias
- In math: P(Y = 1 | θ, Group) ≠ P(Y = 1 | θ)
- Group membership interferes with the expected tight relationship between θ and item response
45. DIF detection
- A plethora of techniques:
- IRT techniques (FH 2005)
- SEM techniques (the MIMIC model) (FH 2005)
- (Ordinal) logistic regression techniques
- A hybrid ordinal logistic regression / IRT technique (FH 2005)
- Each technique hinges on identifying items with no DIF to serve as anchors
46. IRT DIF detection - principles
- Compare IRT curves calibrated separately in the groups
- In the absence of DIF, they will be superimposed
- What is theta?
- Iterative algorithms for identifying items that are free of DIF to serve as anchors
- Statistical significance is typically used
- Raju also developed tests based on the area between curves (mathematically equivalent to statistical significance tests)
47. SEM techniques - principles
- A measurement model with covariates
- Initially specify a model with a main effect for the covariate, and direct effects for the covariate on each of the items initially set to 0 (see next slide)
- Look at modification indices: improvements in fit from freeing one constraint at a time
48. MIMIC model picture
[Path diagram: covariate X with a main effect on factor F; direct paths from X to items X1-X6, each initially constrained to 0]
49. OLR - principles
- Fit nested (ordinal) logistic regression models; a worked sketch follows this list:
- logit P(Y = 1 | X, g) = β1X + β2g + β3(X·g) (model 1)
- logit P(Y = 1 | X, g) = β1X + β2g (model 2)
- logit P(Y = 1 | X, g) = β1X (model 3)
- Non-uniform DIF: statistical significance of the interaction term (log-likelihood difference between models 1 and 2)
- Uniform DIF:
- statistical significance of the group term (Hambleton et al., 1991)
- proportional change in β1 between models 2 and 3 (Crane et al. 2004)
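A minimal worked version of the nested-models idea, using simulated data with uniform DIF built in. For simplicity the item is binary, so plain logistic regression (statsmodels' Logit) stands in for the ordinal model; all parameter values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(0, 1, n)                        # total score X (or an IRT theta)
g = rng.integers(0, 2, n)                      # group membership
y = rng.binomial(1, 1 / (1 + np.exp(-(1.2 * x + 0.5 * g))))  # uniform DIF built in

m1 = sm.Logit(y, np.column_stack([np.ones(n), x, g, x * g])).fit(disp=0)  # model 1
m2 = sm.Logit(y, np.column_stack([np.ones(n), x, g])).fit(disp=0)         # model 2
m3 = sm.Logit(y, np.column_stack([np.ones(n), x])).fit(disp=0)            # model 3

print("non-uniform DIF p =", chi2.sf(2 * (m1.llf - m2.llf), df=1))  # models 1 vs 2
print("uniform DIF p     =", chi2.sf(2 * (m2.llf - m3.llf), df=1))  # models 2 vs 3
print("relative change in beta1:", (m3.params[1] - m2.params[1]) / m2.params[1])
```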
50. Hybrid IRT-OLR - principles
- Techniques that rely on the standard sum score (X) are at least theoretically flawed (Millsap and Everson)
- Substitute an IRT-derived θ for X (Crane et al. 2004):
- logit P(Y = 1 | θ, g) = β1θ + β2g + β3(θ·g) (model 1)
- logit P(Y = 1 | θ, g) = β1θ + β2g (model 2)
- logit P(Y = 1 | θ, g) = β1θ (model 3)
51. What to do when we find DIF?
- Ignore it
- Omit items found to have DIF from the scale
- Account for DIF items by generating and using demographic-specific item parameters (Reise et al.)
52. Demographic-specific item parameters: data structure
53. DIF presence vs. DIF impact
- So far we have addressed the question "Is it there?"
- We have not addressed the question "Does it matter?"
- We can determine how much individual scores change, and display this graphically
- Executive functioning items analyzed at FH 2005 (see next slide)
54. DIF impact
55. Relationships to external criteria
- Concurrent head MRI with volumes of white matter hyperintensities and total brain volume
- Amount of variance in scores explained by the MRI measures:
- Total sum score: 0.11; IRT score: 0.13; IRT score accounting for DIF: 0.16
- Removing the nuisance of DIF (Shealy and Stout)
- Publicly available STATA code (FH 2006)