

1
Measurement 102
  • Steven Viger
  • Lead Psychometrician
  • Michigan Department of Education
  • Office of Educational Assessment and
    Accountability

2
Student Performance Measurement
  • The previous session discussed some basic
    mechanics involved in psychometric analysis.
  • Graphical and statistical methods
  • The focus of this session is on the
    interpretation of the data in light of the often
    used terms reliability and validity.
  • Some attention will also be paid to some of the
    higher-level psychometric work that goes on behind
    the scenes.
  • How the scale scores are REALLY made!

3
Making inferences from measurements
  • The inferences one can make based solely on
    educational measurement are limited.
  • The extent of the limitation is largely a
    function of whether or not evidence of the valid
    use of scores is accumulated.
  • At times, the terms validity and reliability are
    confused. Unfortunately, these terms describe
    extremely different concepts.

4
Some basic validity definitions
  • Validity
  • The degree to which the assessment measures the
    intended construct(s)
  • Answers the question, are you measuring what you
    think you are?
  • More contemporary definitions focus on the
    accumulation of evidence for the validity of the
    inferences and interpretations made from the
    scores produced.

5
Some basic reliability definitions
  • Reliability
  • Consistency
  • The degree to which students would be rank
    ordered the same if they were to be administered
    the same assessment numerous times.
  • Actually, the assumption is based on an infinite
    amount of retesting with no memory of the
    previous administrations, an unrealistic scenario.

6
More about reliability
  • Reliability is one of the most fundamental
    requirements for measurement: if the measures are
    not reliable, then it is difficult to support
    claims that the measures can be valid for any
    particular decision.
  • Reliability refers to the degree to which
    instrument scores for a group of participants are
    consistent over repeated applications of a
    measurement procedure and are, therefore,
    dependable and repeatable.

7
  • X = T + E
  • True Score (T): A theoretical score for a person
    on an instrument that is equal to the average
    score for that person over an infinitely large
    number of retakes.
  • Error (E): The degree to which an observed score
    (X) varies from the person's theoretical true
    score (T).
  • In this context, reliability refers to the degree
    to which scores are free of measurement errors
    for a particular group, if we assume the
    relationship of observed and true scores is as
    depicted above (a small simulation follows below).
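A minimal sketch of the X = T + E idea, in Python (not from the presentation and not MDE code; all numbers are made up): under classical test theory, reliability is the ratio of true-score variance to observed-score variance.

```python
# Hedged sketch: simulate X = T + E for a group of examinees and compute
# reliability as var(T) / var(X). All values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
n_students = 10_000

true_scores = rng.normal(loc=50, scale=10, size=n_students)  # T
errors = rng.normal(loc=0, scale=5, size=n_students)         # E, independent of T
observed = true_scores + errors                               # X = T + E

# Under classical test theory, reliability = var(T) / var(X)
reliability = true_scores.var() / observed.var()
print(f"Simulated reliability: {reliability:.2f}")  # close to 100 / (100 + 25) = 0.80
```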

8
Unreliability AKA the standard error of
measurement
  • The standard error of measurement (SEM) is an
    estimate of the amount of error present in a
    student's score.
  • If X = T + E, the SEM serves as a general
    estimate of the E portion of the equation.
  • There is an inverse relationship between the SEM
    and reliability: tests with higher reliability
    have smaller SEMs (see the sketch below).
  • Reliability coefficients are indicators that
    reflect the degree to which scores are free of
    measurement error.
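The inverse relationship can be written as SEM = SD × sqrt(1 − reliability). A minimal sketch with hypothetical numbers:

```python
# Hedged sketch: the standard CTT formula SEM = SD * sqrt(1 - reliability).
# The standard deviation and reliability values below are hypothetical.
import math

sd_scores = 10.0      # standard deviation of observed scores (assumed)
for reliability in (0.80, 0.90, 0.95):
    sem = sd_scores * math.sqrt(1 - reliability)
    print(f"reliability = {reliability:.2f}  ->  SEM = {sem:.2f}")
# Higher reliability gives a smaller SEM, the inverse relationship noted above.
```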

9
More on the Standard Error of Measurement
  • The smaller the SEM for a test (and, therefore,
    the higher the reliability), the greater one can
    depend on the ordering of scores to represent
    stable differences between students.
  • The higher the reliability, the more likely it is
    that the rank ordering of students by score is
    due to differences in true ability rather than
    random error.
  • The higher the reliability, the more confident
    you can be in the observed score, X, being an
    accurate estimate of the student's true score, T
    (illustrated below).
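One common way this confidence is expressed in practice (my illustration, not a slide from the deck) is an approximate band around an observed score:

```python
# Hedged sketch: an approximate 95% band for the true score around an observed
# score, X +/- 1.96 * SEM, assuming normally distributed error. Numbers are made up.
observed_score = 520   # hypothetical scale score
sem = 8.0              # hypothetical standard error of measurement

low = observed_score - 1.96 * sem
high = observed_score + 1.96 * sem
print(f"Approximate 95% band for T: {low:.0f} to {high:.0f}")  # 504 to 536
```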

10
Standards for Reliability
  • There are no mathematical rules to determine
    what constitutes an acceptable reliability
    coefficient.
  • Some advice
  • Individual based decisions should be based on
    scores produced from highly precise instruments.
  • The higher the stakes, the higher you will want
    your reliability to be.
  • Group-based decisions in a research setting
    typically allow lower reliability.
  • If you are making high-stakes decisions about
    individuals, you need reliabilities above .80 and
    preferably in the .90s.

11
Establishing validity
  • Past practice has been to treat validity as if
    there were a criterion related to an amount
    necessary to deem an instrument valid.
  • That practice is outdated and inappropriate.
  • Does not acknowledge that numerous pieces of
    information need to come together to facilitate
    valid inferences.
  • Tends to discount some pieces of evidence and
    overemphasize others.
  • Leads to a narrowing of scope and can encourage
    one to be limited in their approach to gathering
    evidence.

12
Process vs. Product
  • Rather than speak of validity as a thing, we need
    to start approaching it as an on-going process
    that is fed by all aspects of a testing program:
    validation.
  • The current AERA and APA standards for validity
    tend to treat the validation process much like a
    civil court proceeding.
  • A preponderance of the evidence is sought with
    the evidence coming from multiple sources.

13
Validation from item evidence
  • Focus is on elimination of construct-irrelevant
    variance
  • Some ways this is accomplished
  • Well established item development/review
    procedures
  • Demonstrate alignment of individual items to
    standards
  • Show the items/assessments are free of bias, both
    quantitatively and qualitatively
  • Simple item analyses eliminate items with
    questionable stats (e.g., p-values that are too
    high, low point-biserial correlations); a small
    example follows.
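As a rough illustration of those simple item analyses (a sketch with fabricated data and illustrative flagging thresholds, not MDE's operational procedure):

```python
# Hedged sketch: compute item p-values (proportion correct) and point-biserial
# correlations with the total score, then flag questionable items. The response
# matrix and the flagging thresholds are fabricated for illustration.
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(500, 20))   # 500 students x 20 items, scored 0/1

p_values = responses.mean(axis=0)                # proportion correct for each item
totals = responses.sum(axis=1)                   # each student's raw total score

point_biserials = np.array([
    np.corrcoef(responses[:, j], totals)[0, 1]   # item-total correlation
    for j in range(responses.shape[1])
])

# Illustrative flags: items that are too easy or that discriminate poorly
flagged = np.where((p_values > 0.95) | (point_biserials < 0.20))[0]
print("Items flagged for review:", flagged.tolist())
```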

14
Validation from scaled scores
  • Scale score level validity evidence includes but
    is not limited to
  • Input from item-level validity evidence (the
    validity of the score scale depends upon the
    validity of the items that contribute to that
    score scale)
  • Convergent and divergent relationships with
    appropriate external criteria.
  • Reliability evidence
  • Appropriate use of a strong measurement model
    for the production of student scores.

15
Is it valid, reliable, or both?
16
Measurement models
  • The measurement models used by MDE fall under the
    general category of Item Response Theory (IRT)
    models.
  • IRT models depict the statistical relationship
    that occurs as a result of person/item
    interactions.
  • Specifically, statistical information regarding
    the persons and the items is used to predict the
    probability of correctly responding to a
    particular item; if the item is constructed
    response, it is the probability of a person
    receiving a specific score point from the rubric.
  • Like all statistically based models, IRT models
    carry with them some assumptions; some are
    theoretical, whereas others are numerical.

17
IRT assumptions
  • Unidimensionality: there is a single underlying
    construct being measured by the assessment (e.g.,
    mathematics achievement, writing achievement)
  • As a result of the assumption of a single
    construct, the model dictates that we treat all
    sub-components (strand level, domain, subscales
    in general) as contributing to the single
    construct
  • Assumes that there is a high correlation between
    sub-components
  • It would probably be better to measure the
    sub-components separately, but that would require
    significantly more assessment items to attain
    decent reliability

18
IRT assumptions
  • Assumes that a more able person has a higher
    probability of responding correctly to an item
    than a less able person
  • Specifically, when a person's ability is greater
    than the item difficulty, they have a better than
    50% chance of getting the item correct.
  • Local independence: the response to one item is
    independent of, and does not influence, the
    probability of responding correctly to another
    item.
  • The data fit the model!
  • The item and person parameter estimates are
    reasonable representations of reality and the
    data collected meets the IRT model assumptions.

19
The Rasch Model (MEAP and ELPA)
20
The Rasch Model (1 parameter logistic model)
  • An item characteristic curve for a sample MEAP
    item

21
The 3 Parameter Logistic Model (MME and MEAP
Writing)
22
The 3 Parameter Logistic Model
  • An item characteristic curve for a sample MME
    item.

23
  • Before I show you what a string of items looks
    like using IRT, I'd like to first point out some
    differences in the models that will lead to some
    major differences in the way the items look
    graphically.
  • In particular, we need to pay attention to the
    differences in the formulas.
  • Are there features of the 3PL model that do not
    appear in the 1PL model?

24
  • In both models, the quantity driving the solution
    to the equation is the difference between person
    ability and item difficulty, θ - b.
  • However, in one model, that relationship is
    altered and we cannot rely on the difference
    between ability and difficulty alone to determine
    the probability of a correct response to an item.

25
1PL vs. 3PL
  • In the 1 parameter model, the item difficulty
    parameter (assuming the student's ability is a
    known and fixed quantity) and its difference from
    student ability drive the probability of a
    correct response. All other elements in the
    equation are constants.
  • Hence the name, 1 parameter model
  • Therefore, when you see the plots of multiple
    items, they should only differ by a constant in
    terms of their location on the scale (see the
    sketch below).
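A minimal sketch of the 1PL/Rasch item characteristic curve (my own illustrative code, not the MEAP implementation): the probability of a correct response depends only on θ − b, so items differ only by where their curves sit on the scale.

```python
# Hedged sketch of the Rasch (1PL) model: P(correct) = 1 / (1 + exp(-(theta - b))).
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response given ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Two items differing only in difficulty: the curves have the same shape,
# just shifted along the ability scale. Note P = 0.50 when theta equals b.
for b in (-1.0, 0.5):
    probs = [round(rasch_probability(theta, b), 2) for theta in (-2, -1, 0, 1, 2)]
    print(f"b = {b:+.1f}: {probs}")
```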

26
1PL vs. 3PL
  • In the 3 parameter model, there are still
    constants and the difference between ability and
    difficulty is still the critical piece. However,
    a, the discrimination parameter, has a
    multiplicative effect on the difference between
    ability and difficulty. Furthermore, the minimum
    possible result for the equation is influenced by
    the c parameter.
  • If c > 0.00, the probability of a correct
    response must be greater than 0.
  • Item characteristic curves will vary by location
    on the scale as well as by origin (c parameter)
    and slope (a parameter).
  • Knowing how difficult an item is compared to
    another is still relevant but is not the only
    piece of information that leads to item
    differences (see the sketch below).
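A minimal sketch of the 3PL curve in the same style (again illustrative; some parameterizations also include a scaling constant D of about 1.7, omitted here):

```python
# Hedged sketch of the 3PL model:
# P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b))).
import math

def three_pl_probability(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response with discrimination a, difficulty b,
    and lower asymptote (pseudo-guessing) c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# With c > 0 the curve never falls to zero, and a changes how steep it is.
for theta in (-3.0, 0.0, 3.0):
    p = three_pl_probability(theta, a=1.2, b=0.0, c=0.20)
    print(f"theta = {theta:+.1f}: P = {p:.2f}")  # approaches 0.20; equals 0.60 at theta = b
```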

27
MEAP example (10 items scaled using Rasch)
28
MME example (10 items scaled using the 3-PL
model)
29
How do we get there?
  • Although the graphics and equations on the
    previous screens may make conceptual sense, you
    may have noticed that the solution to the
    equations depends on knowledge of the values of
    some of the variables.
  • We are psychometricians, not psychomagicians, so
    the numbers have to come from somewhere.
  • The item and person parameters have to be
    estimated.
  • We need a person by item matrix to begin the
    process.

30
IRT Estimation
  • The person by item matrix is fed into an IRT
    program to produce estimates of item parameters
    and person parameters.
  • An estimation algorithm is used, which is
    essentially a predefined process with "stop" and
    "go" rules. The end products are the best
    estimates of the item parameters and the person
    ability estimates.
  • Item parameters are the guessability,
    discrimination, and difficulty parameters
  • Person parameters are the ability estimates we
    use to create a student's scale score.

31
Parameter Estimation
  • For single parameter (item difficulty) models,
    WINSTEPS is the industry standard.
  • More complex models like the 3 parameter model
    used in the MME require more specialized software
    such as PARSCALE.
  • The estimation process is iterative but happens
    very quickly; most programs converge in less than
    10 seconds.
  • Typically, item parameters are estimated followed
    by person ability parameters.

32
Estimating Ability
  • Once item parameters are known, we can use the
    item responses for the individuals to estimate
    their ability (theta).
  • For the 3PL model, when people share the same
    response string (pattern of correct and incorrect
    responses) they will have the same estimate of
    theta.
  • In the 1PL model, the raw score is used to derive
    the thetas.
  • Essentially, the same raw score will generate
    different estimates of theta, but they are close.
    The program will create a table that relates raw
    scores to thetas to scale scores based on maximum
    likelihood estimation (a simplified sketch of the
    ability estimation step follows).
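To give a feel for the maximum likelihood step (a simplified stand-in for what programs like WINSTEPS and PARSCALE do, not MDE's actual procedure), here is a sketch that estimates theta for one student under the Rasch model, given known item difficulties:

```python
# Hedged sketch: Newton-Raphson maximum likelihood estimation of theta for one
# response string under the Rasch model, with item difficulties treated as known.
# (All-correct or all-incorrect strings have no finite ML estimate and are not
# handled here.)
import math

def rasch_p(theta: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses: list[int], difficulties: list[float]) -> float:
    theta = 0.0
    for _ in range(25):
        probs = [rasch_p(theta, b) for b in difficulties]
        score = sum(x - p for x, p in zip(responses, probs))  # 1st derivative of log-likelihood
        info = sum(p * (1.0 - p) for p in probs)              # test information (minus 2nd derivative)
        step = score / info
        theta += step
        if abs(step) < 1e-6:                                  # "stop" rule: change is negligible
            break
    return theta

# Hypothetical response string and item difficulties
print(round(estimate_theta([1, 1, 0, 1, 0], [-1.0, -0.5, 0.0, 0.5, 1.0]), 3))
```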

33
From theta to scale score
  • Remember the following formula?
  • y = mx + b
  • That is an example of a linear equation.
  • MDE uses linear equations to transform thetas to
    scale scores (see the sketch below).
  • There is a different transformation for each
    grade and content area.
  • Performance levels are determined by the
    student's scale score.
  • Cut scores are produced by standard setting
    panelists.
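A minimal sketch of that transformation step (the slope and intercept here are hypothetical placeholders, not MDE's actual grade- or content-specific values):

```python
# Hedged sketch: transform theta to a scale score with a linear equation
# y = m * x + b. The slope and intercept are hypothetical placeholders.
def to_scale_score(theta: float, slope: float = 25.0, intercept: float = 500.0) -> int:
    return round(slope * theta + intercept)

print(to_scale_score(-1.2), to_scale_score(0.0), to_scale_score(1.5))  # 470 500 538
```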

34
Summary
  • In this session you found out a bit about
    reliability and validity.
  • Two important pieces of information for any
    assessment.
  • Remember, it is the validity of the inferences we
    make that is important.
  • The evidence is accumulated and the process is
    ongoing.
  • There are no types of validity.
  • You were also introduced to item response theory
    models and how they are used to produce MDE scale
    scores.
  • The hope is that you leave with a greater
    understanding of how MDE assessments are scored,
    scaled, and interpreted.
  • In addition, you now have some tools that can
    assist you in your own analyses.

35
Contact Information
  • Steve Viger
  • Michigan Department of Education
  • 608 W. Allegan St., Lansing, MI 48909
  • (517) 241-2334
  • VigerS@Michigan.gov