1
Introduction to Modern Measurement SEM and IRT
  • Paul Crane, MD MPH
  • Rich Jones, ScD
  • Friday Harbor 2006

2
Outline
  • Intro and definitions
  • Factor analysis and link to IRT
  • Parametric IRT models
  • Information curves and rational test design
  • Model assumptions
  • Applications in cognition

3
Use of modern measurement tools
  • Used in educational testing since 1968; most educational tests are built using this framework
  • SATs, GREs, high-stakes tests, MCATs, LSATs
  • Increasing in psychology in general
  • In the medical arena, increasing use in patient-reported outcomes (health-related quality of life, depression, etc.)
  • PROMIS
  • Most people who have used these tools on
    cognitive functioning tests are in this room

4
Definitions of measurement
  • "The assignment of numerals to objects or events according to some rule" (Stevens, 1946)
  • "The numerical estimation and expression of the magnitude of one quantity relative to another" (Michell, 1997)
  • These definitions imply different perspectives on
    what we do with test data

5
Purposes of measurement
  • (After Kirshner and Guyatt)
  • Discriminative purpose
  • Evaluative purpose
  • Predictive purpose
  • Statistical properties desired differ according
    to the purpose of measurement
  • Often tests are used for multiple purposes,
    without necessarily documenting appropriateness
    for intended purpose
  • We'll come back to this

6
Latent traits or abilities
  • The underlying thing we are trying to measure
    with a test
  • Can't be directly observed (hence "latent")
  • Cause observable behavior (such as responses on a
    test)
  • A task of psychometrics is to try to determine
    levels of latent traits or abilities based on
    responses to test items

7
Factor analysis
  • Factor analysis history intimately involved with
    history of psychometrics (Spearman, Thurstone)
  • Analyze a covariance (correlation) matrix
  • Exploratory factor analysis / principal
    components analysis
  • Identify a factor that explains most of the
    variance in the matrix
  • Extract the factor and determine what's left (residual matrix)
  • (Rotate)
  • Repeat

8
CFA
  • Theory driven
  • Some relationships between specific factors and
    indicators are specified to be 0
  • Fit always worse than EFA (which can be proven to
    have optimal fit)
  • Single factor CFA very useful

9
Picture of single factor CFA
[Path diagram: a single latent trait with arrows to Items 1-6.]
10
Relationship between CFA and IRT
  • "I have long believed that a very general version of the common factor model supplies a rigorous and unified treatment of the major concepts and techniques of test theory. The implicit unifying principle throughout the following pages is a general nonlinear common factor model, which includes item response models as special cases, and includes also the (linear) common factor model as an approximation."
  • Roderick McDonald, Test Theory: A Unified Treatment (1999), p. x.

11
Single common factor math
  • X1 = λ1F + E1
  • X2 = λ2F + E2
  • X3 = λ3F + E3
  • Item 1's response (X1) is the sum of a common part (the loading λ1 times the amount of the factor F) plus a unique part (E1); see the sketch below
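A minimal simulation sketch of this model (not from the slides; the loadings, sample size, and variable names are illustrative), in Python:

    # Simulate continuous item responses from a single common factor model.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000                              # subjects
    lambdas = np.array([0.8, 0.7, 0.6])   # loadings for items X1-X3 (made up)
    F = rng.normal(0, 1, size=n)          # factor scores
    E = rng.normal(0, 1, size=(n, 3))     # unique parts E1-E3
    X = F[:, None] * lambdas + E          # Xj = lambda_j * F + Ej

    # Items correlate only through F: cov(Xj, Xk) = lambda_j * lambda_k
    print(np.corrcoef(X, rowvar=False).round(2))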

12
Enhanced picture of single factor CFA
F
?1
?6
?2
?3
?5
?4
X 1
X 4
X 3
X 2
X 6
X 5
E3
E4
E5
E6
E1
E2
13
Dichotomous / categorical items
  • Our picture is of continuous predictors X1-X6
  • In practice items are not continuous, they are
    categorical or dichotomous
  • Has implications for the unique parts (E1-E6)
  • A new parameter is needed for the threshold(s)

14
(Tetrachoric and polychoric)
  • Maps a continuous underlying level (Xj*) to a dichotomous indicator (Xj)
  • Requires a threshold parameter τj
  • If Xj* > τj, Xj = 1
  • If Xj* ≤ τj, Xj = 0
  • Observed data for Xj and Xk are used to estimate Xj* and Xk*, and the correlation between Xj* and Xk* is the tetrachoric correlation (see the sketch below)
  • Extension to >2 categories: polychoric correlation
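One way to estimate a tetrachoric correlation is by maximum likelihood over the 2x2 table. A minimal sketch using SciPy (the slides do not prescribe an algorithm; the counts and names here are made up):

    # ML estimate of a tetrachoric correlation from a 2x2 table, assuming
    # Xj*, Xk* are standard bivariate normal with thresholds set by the margins.
    import numpy as np
    from scipy.stats import norm, multivariate_normal
    from scipy.optimize import minimize_scalar

    # 2x2 counts: rows = Xj in {0, 1}, columns = Xk in {0, 1} (made-up data)
    n = np.array([[300, 100],
                  [100, 500]])
    N = n.sum()
    tau_j = norm.ppf(n[0].sum() / N)      # P(Xj = 0) = Phi(tau_j)
    tau_k = norm.ppf(n[:, 0].sum() / N)   # P(Xk = 0) = Phi(tau_k)

    def neg_loglik(rho):
        biv = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
        p00 = biv.cdf([tau_j, tau_k])     # both latent values below threshold
        p01 = norm.cdf(tau_j) - p00       # Xj = 0, Xk = 1
        p10 = norm.cdf(tau_k) - p00       # Xj = 1, Xk = 0
        p11 = 1 - p00 - p01 - p10
        p = np.array([[p00, p01], [p10, p11]])
        return -(n * np.log(p)).sum()

    rho_hat = minimize_scalar(neg_loglik, bounds=(-0.99, 0.99),
                              method='bounded').x
    print(round(rho_hat, 3))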

15
(Tetrachoric and polychoric picture)
[Figure: threshold model mapping a continuous underlying variable to observed categories. Thanks, Rich!]
16
Item response models
  • Introduce a new character: θ (theta)
  • We met θ as F before
  • The (level of the) underlying trait (ability) measured by the test
  • Starts with dichotomous (polytomous) items rather than with a continuous correlation matrix; ends up in the same place

17
(Nonparametric IRT)
  • Monotonic increasing relationship between theta
    and item responses
  • Obtain ordinal relationships among test takers
  • Can use software to determine whether shapes of
    curves look parametric
  • Sijtsma and Molenaar, Introduction to
    Nonparametric Item Response Theory (2002)
  • MSP5 for Windows
  • (I am aware of only one non-parametric paper published in medical settings. Thanks, Rich!)

18
Parametric IRT
  • (Misquoting Box) "All parameterizations are wrong, but some are useful."
  • Cumulative normal vs. logistic (logistic has won because of ease of computation; no practical difference)
  • Number of parameters: up to 4PL models (!)
  • Extensions to polytomous items vary

19
1PL item characteristic curves
20
1PL (aka Rasch model)
  • The single parameter for the item is item difficulty (b)
  • P(Y=1 | θ, b) = exp(θ - b) / [1 + exp(θ - b)]
  • Mathematically the same to write it as
  • [1 + exp(-(θ - b))]^(-1)
  • The distance between a subject's latent trait level θ and the item's difficulty level b determines the probability of endorsing the item (or getting the item right); see the sketch below
  • All of the loadings for all of the items are fixed (λ1 = λ2 = … = λk)
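The 1PL curve in code (a minimal sketch; the theta grid and b value are illustrative):

    # 1PL (Rasch) item characteristic curve.
    import numpy as np

    def p_1pl(theta, b):
        """P(Y = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b))."""
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    theta = np.linspace(-3, 3, 7)
    print(p_1pl(theta, b=0.5).round(2))   # probability rises as theta passes b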

21
2PL item characteristic curves
22
2PL
  • A second parameter is added to account for varying strengths of relationship between items and θ, known as discrimination (a)
  • P(Y=1 | θ, a, b) = exp(a(θ - b)) / [1 + exp(a(θ - b))]
  • The relationship between θ and b still drives item responses
  • The a parameter allows our loadings to vary
  • The constant D (1.702) is often included in formulas; it makes the logistic curves approximate the normal ogive curves
  • P(Y=1 | θ, a, b) = exp(Da(θ - b)) / [1 + exp(Da(θ - b))] (see the sketch below)
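The 2PL curve with the D constant, as a minimal sketch (a and b values are illustrative):

    # 2PL item characteristic curve with the D = 1.702 scaling constant.
    import numpy as np

    D = 1.702  # makes the logistic curve approximate the normal ogive

    def p_2pl(theta, a, b):
        """P(Y = 1 | theta, a, b) = exp(D*a*(theta - b)) / (1 + exp(...))."""
        return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

    theta = np.linspace(-3, 3, 7)
    print(p_2pl(theta, a=1.5, b=0.0).round(2))  # steeper than a = 0.5 would be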

23
1PL vs. 2PL
  • Vociferous debates in the educational testing literature, related to "specific objectivity" from the Rasch literature
  • (Has not been an issue in medicine)
  • The difficulty parameter is MUCH more important for subject scores than the discrimination parameter
  • θ scores estimated from the 1PL and 2PL models are incredibly highly correlated
  • The 2PL model provides additional insight into how good the items may be

24
(3PL and 4PL)
  • The 3PL incorporates a guessing parameter: even subjects with very low ability will have a non-zero probability of picking the correct answer at random in a multiple choice test
  • P(Y=1 | θ, a, b, c) =
  • c + (1 - c)·exp(Da(θ - b)) / [1 + exp(Da(θ - b))] (see the sketch below)
  • The 4PL incorporates an "attractive distractor" parameter: even subjects with very high ability may be distracted by a nearly correct alternative
  • Neither model is relevant for tests that do not have multiple choice response formats
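A sketch of the 3PL curve (parameter values are illustrative):

    # 3PL curve with guessing parameter c.
    import numpy as np

    D = 1.702

    def p_3pl(theta, a, b, c):
        """P(Y = 1 | theta, a, b, c) = c + (1 - c) * logistic(D*a*(theta - b))."""
        return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

    theta = np.linspace(-3, 3, 7)
    # Even at theta = -3 the probability stays near c = 0.25 (random guessing
    # among four options), the lower asymptote.
    print(p_3pl(theta, a=1.2, b=0.0, c=0.25).round(2))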

25
(Polytomous IRT)
  • 2PL extension called the Graded Response Model (GRM), Samejima (1969); parallel to the proportional odds model for ordinal logistic regression (see the sketch below)
  • 1PL extensions: Partial Credit Model (PCM), Generalized Partial Credit Model (GPCM), Rating Scale Model (RSM)
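A minimal GRM sketch (parameter values are illustrative): cumulative curves P(Y ≥ k | θ) are 2PL curves with a shared a and ordered thresholds b_k, and category probabilities are differences of adjacent cumulative curves.

    # Graded Response Model category probabilities.
    import numpy as np

    def grm_probs(theta, a, bs):
        bs = np.asarray(bs, dtype=float)   # ordered thresholds b_1 < ... < b_K
        cum = 1.0 / (1.0 + np.exp(-a * (theta - bs)))   # P(Y >= k), k = 1..K
        cum = np.concatenate(([1.0], cum, [0.0]))       # P(Y >= 0) = 1
        return cum[:-1] - cum[1:]          # P(Y = k), k = 0..K

    print(grm_probs(theta=0.0, a=1.5, bs=[-1.0, 0.0, 1.0]).round(2))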

26
Reliability vs. information
  • Reliability is a key feature in classical test theory
  • McDonald's omega, Cronbach's alpha, KR-20 (Kuder-Richardson, formula 20)
  • Provides a single number for the proportion of observed score variance that is true score variance (see the sketch below)
  • Assumes measurement error is constant for a test
  • In IRT, the focus shifted to information
  • Information is analogous to measurement precision
  • Varies according to item parameters
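For contrast with information curves, a sketch of Cronbach's alpha (made-up data; alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)):

    # Cronbach's alpha from an item-response matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    F = rng.normal(size=200)
    items = (F[:, None] + rng.normal(size=(200, 5)) > 0).astype(float)  # 5 items

    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)
    print(round(alpha, 2))   # one number for the whole test, whatever the theta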

27
Information - intuitive
  • If there are a lot of hard items, there will be relatively more measurement precision for individuals with large θ scores
  • If there are few hard items, there will be less measurement precision for individuals with large θ scores
  • Precision for individuals with large θ scores tells us nothing about precision for individuals with small θ scores (we would need to know about easy items rather than hard items)

28
(Information formulas)
  • General: I(θ) = [P'(θ)]² / {P(θ)[1 - P(θ)]}
  • P'(θ) is the first derivative of P(θ)
  • 2PL model: I(θ) = D²a²·P(θ)·[1 - P(θ)]
  • The P(θ)[1 - P(θ)] part makes a hill around the point where θ = b
  • P(θ) approaches 0 as θ << b
  • 1 - P(θ) approaches 0 as θ >> b
  • D² is a constant
  • a² is proportional to the height of the information curve around the point where θ = b (see the sketch below)
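The 2PL information formula in code (a minimal sketch; a and b values are illustrative):

    # 2PL item information: I(theta) = D^2 * a^2 * P(theta) * (1 - P(theta)).
    import numpy as np

    D = 1.702

    def info_2pl(theta, a, b):
        p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
        return D**2 * a**2 * p * (1.0 - p)

    theta = np.linspace(-3, 3, 7)
    print(info_2pl(theta, a=1.5, b=0.0).round(2))  # peaks where theta = b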

29
Information - test
  • Test information is simply the sum of all of the item information curves
  • (local independence)
  • SEM = 1/√I(θ) (see the sketch below)
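Continuing the sketch above: test information as a sum of item information curves, and the resulting SEM curve (the three item parameter pairs are made up):

    # Test information and SEM(theta) = 1 / sqrt(I(theta)).
    import numpy as np

    D = 1.702

    def info_2pl(theta, a, b):
        p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
        return D**2 * a**2 * p * (1.0 - p)

    items = [(1.5, -1.0), (1.0, 0.0), (2.0, 1.0)]   # (a, b) for three items
    theta = np.linspace(-3, 3, 7)
    test_info = sum(info_2pl(theta, a, b) for a, b in items)
    sem = 1.0 / np.sqrt(test_info)
    print(sem.round(2))   # measurement error varies with theta, unlike alpha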

30
Kirshner and Guyatt revisited
31
SEM for common screening tests
[Figure: SEM curves. Red = MMSE, blue = 3MS, black = CASI, green = CSI.]
32
Rational test design
  • Start with an item bank (items all in the same
    domain whose parameters have been established)
  • Select items in order to fill out a specific
    information curve
  • Shape of curve may vary based on purpose of
    measurement
  • Mungas et al. matched information curves for
    memory, executive functioning, and global
    cognition
  • My workgroup is replicating this specific aspect of the project

33
Parameterization isn't free
  • There are assumptions to IRT and to the models we
    use
  • Unidimensionality
  • Local independence
  • Model fits the data well
  • We'll look at each of these assumptions in turn

34
Unidimensionality
  • Intuitively, IRT is a single factor model fit to the data. If it is a bad idea to use a single factor model on the data, then IRT is a bad idea too.
  • Pure unidimensionality is impossible; emphasis has shifted to "essential" or "sufficient" unidimensionality
  • Analyses focus on the residual covariance matrix from the single common factor model
  • A strong factor emerging from the residual covariance matrix hints that a single factor may not explain the data well

35
Bi-factor model and unidimensionality
  • Instead of an EFA approach on the residual covariance matrix, use a CFA approach
  • Suggested by McDonald (1999, 1990); recently have seen more of this (Atlanta 2/2006)
  • Theory precedes analysis: have to have pre-specified subdomains

36
Bi-factor model of executive functioning
37
Standardized loadings (>0.30)
38
Local independence
  • Residual correlation should be no larger than chance
  • Violations: several items related to a common passage (a "testlet"); trials on a memory test
  • Not clear how robust the models are to violations of local independence
  • (Information curve height may be artificially high if there are items with local dependence)

39
Model fit
  • No established criteria for 2PL and polytomous items
  • (But see the Bjorner et al. 2005 poster at ISOQOL)
  • χ² statistics produced by PARSCALE are sample-size dependent
  • There are fit statistics that have been developed for SEM models, and benchmarks have been established
  • RMSEA, CFI available from MPLUS for categorical items (Rich, is that right?)

40
Why bother?
  • Theoretical reasons
  • Rational scoring system based on empiric relationships rather than unlikely assumptions of pre-specified weights and thresholds
  • Linearity of the IRT scale
  • Utility of information (as opposed to alpha)
  • Practical reasons: we have found stronger relationships with outcomes of interest (imaging, dementia status) for IRT scores than for standard scores
  • There are a lot more things we can do with IRT than with CTT (see next slide)

41
Applications of IRT - cognition
  • Co-calibration of scales (FH 2004)
  • Generate psychometrically matched scales (same
    information curves) simultaneously in Spanish and
    English (Dan's work on the SENAS)
  • Determine and account for differential item
    functioning (DIF) (FH 2005)
  • Determine and account for varying measurement
    precision (FH 2004)
  • A single composite score instead of multiple
    scores for analyses (FH 2006)

42
How to do it?
  • Off-the-shelf IRT package, e.g. PARSCALE (FH
    2006)
  • Off-the-shelf SEM package, e.g. MPLUS (FH 2006)
  • Increasingly common: write your own Bayesian code using WinBUGS (FH 2008???)
  • SAS PROC NLMIXED (Sheu et al. 2005) (FH 2008?)
  • STATA user-written code?

43
Practical considerations
  • Data requirements
  • Large data set (500 or so)
  • Long enough scales (5 items an absolute minimum; more is better)
  • Sample with heterogeneous levels of θ
  • Software fails with sparse response categories
  • We will demonstrate a STATA program that runs PARSCALE (runparscale)
  • We will also demonstrate a STATA program that runs MPLUS (runmplus)

44
New topic DIF
  • DIF defined: an item has DIF if, when controlling for θ, subjects in different groups have different probabilities of item responses for that item
  • Item-level bias
  • In math: P(Y=1 | θ, Group) ≠ P(Y=1 | θ)
  • Group membership interferes with the expected tight relationship between θ and item response

45
DIF detection
  • Plethora of techniques
  • IRT techniques (FH 2005)
  • SEM techniques (MIMIC model) (FH 2005)
  • (Ordinal) logistic regression techniques
  • Hybrid ordinal logistic regression / IRT
    technique (FH 2005)
  • Each technique hinges on identifying items with
    no DIF to serve as anchors

46
IRT DIF detection- principles
  • Compare IRT curves calibrated separately for the group(s)
  • In the absence of DIF they will be superimposed
  • What is theta?
  • Iterative algorithms for identifying items that
    are free of DIF to serve as anchors
  • Statistical significance typically used
  • Also Raju developed tests based on area between
    curves (mathematically equivalent to statistical
    significance tests)

47
SEM techniques - principles
  • Measurement model with covariates
  • Initially specify a model with a main effect for
    the covariate, and indirect effects for the
    covariate on each of the items, initially set to
    0 (see next slide)
  • Look at Modification Indices: improvements in fit from freeing one constraint at a time

48
MIMIC model picture
[Path diagram: MIMIC model. Covariate X has a direct effect on factor F; direct effects of X on the items X1-X6 are initially fixed to 0.]
49
OLR - principles
  • Fit nested (ordinal) logistic regression models (see the sketch below)
  • logit P(Y=1 | X, g) = β1X + β2g + β3(X·g) (model 1)
  • logit P(Y=1 | X, g) = β1X + β2g (model 2)
  • logit P(Y=1 | X, g) = β1X (model 3)
  • NUDIF (non-uniform DIF) is statistical significance of the interaction term (log-likelihood difference between models 1 and 2)
  • UDIF (uniform DIF):
  • statistical significance of the group term (Hambleton et al., 1991)
  • proportional change in β1 between models 2 and 3 (Crane et al. 2004)
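A minimal sketch of the nested-model comparison for one dichotomous item, using statsmodels' binary logistic regression (data and effect sizes are made up; the slide's ordinal case would use an ordinal model instead, and the hybrid approach on the next slide substitutes an IRT-derived θ for the sum score x):

    # DIF detection for one dichotomous item via nested logistic models.
    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    n = 500
    x = rng.normal(size=n)                     # conditioning score (sum or theta)
    g = rng.integers(0, 2, size=n)             # group membership
    y = (rng.random(n) < 1 / (1 + np.exp(-(x + 0.5 * g)))).astype(int)

    m1 = sm.Logit(y, np.column_stack([np.ones(n), x, g, x * g])).fit(disp=0)
    m2 = sm.Logit(y, np.column_stack([np.ones(n), x, g])).fit(disp=0)
    m3 = sm.Logit(y, np.column_stack([np.ones(n), x])).fit(disp=0)

    # Non-uniform DIF: likelihood-ratio test of the interaction (models 1 vs 2)
    lr_nudif = 2 * (m1.llf - m2.llf)
    print('NUDIF p =', round(chi2.sf(lr_nudif, df=1), 3))

    # Uniform DIF: one criterion is the change in beta1 between models 2 and 3
    beta1_m2, beta1_m3 = m2.params[1], m3.params[1]
    print('pct change in beta1 =',
          round(100 * (beta1_m2 - beta1_m3) / beta1_m3, 1))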

50
Hybrid IRT-OLR - principles
  • Techniques that rely on the standard sum score (X) are at least theoretically flawed (Millsap and Everson)
  • Substitute an IRT-derived θ for X (Crane et al. 2004)
  • logit P(Y=1 | θ, g) = β1θ + β2g + β3(θ·g) (model 1)
  • logit P(Y=1 | θ, g) = β1θ + β2g (model 2)
  • logit P(Y=1 | θ, g) = β1θ (model 3)

51
What to do when we find DIF?
  • Ignore
  • Omit items found to have DIF from the scale
  • Account for DIF items by generating and using
    demographic-specific item parameters (Reise et
    al.)

52
Demographic-specific item parameters: data structure
53
DIF presence vs. DIF impact
  • So far we have addressed the question "Is it there?"
  • We have not addressed the question "Does it matter?"
  • We can determine how much individual scores
    change, and display graphically
  • Executive functioning items analyzed at FH 2005
    (see next slide)

54
DIF impact
55
Relationships to external criteria
  • Concurrent head MRI with volumes of white matter hyperintensities and total brain volume
  • Amount of variance in scores explained by MRI measures:
  • Total sum score: 0.11; IRT score: 0.13; IRT score accounting for DIF: 0.16
  • Removing the nuisance of DIF (Shealy and Stout)
  • Publicly available STATA code (FH 2006)