Measurement 102 - PowerPoint PPT Presentation

About This Presentation
Title:

Measurement 102

Description:

Steven Viger Lead Psychometrician, Office of General Assessment and Accountability, Michigan Dept. of Education Joseph Martineau, Ph.D. Interim Director, Office of ... – PowerPoint PPT presentation

Slides: 35
Provided by: Instruc53
Learn more at: https://www.michigan.gov
Transcript and Presenter's Notes

Title: Measurement 102


1
Measurement 102
  • Steven Viger
  • Lead Psychometrician, Office of General
    Assessment and Accountability, Michigan Dept. of
    Education
  • Joseph Martineau, Ph.D.
  • Interim Director, Office of General Assessment
    and Accountability, Michigan Dept. of Education

2
Student Performance Measurement
  • The difference between validity and reliability
  • Validity
  • The degree to which the assessment measures the
    intended construct(s)
  • Reliability
  • The consistency with which the assessment
    produces scores

3
Student Performance Measurement
4
Student Performance Measurement
  • Validity
  • Documenting validity is a process of gathering
    evidence that the assessment measures what is
    intended

5
Student Performance Measurement
  • Individual item validity evidence includes:
  • Focus is on elimination of construct-irrelevant
    variance
  • Item development/review procedures
  • Alignment of individual items
  • Bias
  • Simple item analyses

6
Student Performance Measurement
  • Scale score validity evidence includes:
  • Input from item-level validity evidence (the
    validity of the score scale depends upon the
    validity of the items that contribute to that
    score scale)
  • Convergent, divergent relationships with
    appropriate external criteria, for example,
  • Strong relationships with other measures of
    achievement
  • Teacher assigned grades
  • Other subject area assessments
  • Success in college
  • Alignment of overall assessment to content
    standards
  • Comparability across forms and administrations
  • Accommodations
  • Year to year equating

7
Student Performance Measurement
  • Item Response Theory is used to create the score
    scale
  • Treats all sub-components as a single construct
  • Assumes that there is a high correlation between
    sub-components
  • Statistically speaking, this indicates that there
    is a strong first principal component of all
    items that contribute to the construct in
    question
  • It would probably be better to measure the
    sub-components separately, but that would require
    significantly more assessment items
  • Assumes that a more able person has a higher
    probability of responding correctly to an item
    than a less able person
  • Specifically, when a person's ability is greater
    than the item difficulty, they have a better than
    50% chance of getting the item correct.

8
The Rasch Model (1 parameter logistic model)
  • The psychometric/statistical model used for MEAP
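
The equation on this slide did not survive in the transcript. As an assumption about what it showed, the standard form of the Rasch model is given below, where \theta_j is the ability of person j and b_i is the difficulty of item i (both symbols are conventional notation, not taken from the slide):

```latex
P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{e^{\theta_j - b_i}}{1 + e^{\theta_j - b_i}}
```

When \theta_j = b_i the exponent is zero and the probability is exactly 0.5, so a person whose ability exceeds the item difficulty has a better than 50% chance of a correct response.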

9
The 3 Parameter Logistic Model
  • The psychometric/statistical model used with the
    MME
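
The equation on this slide did not survive in the transcript. As an assumption about what it showed, the standard form of the 3 parameter logistic model is given below, where a_i is the item discrimination, b_i the item difficulty, and c_i the guessing parameter (conventional notation, not taken from the slide; some presentations also multiply the exponent by a scaling constant of 1.7):

```latex
P(X_{ij} = 1 \mid \theta_j, a_i, b_i, c_i) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}}
```

Setting a_i = 1 and c_i = 0 reduces this expression to the Rasch model.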

10
MEAP example (10 items scaled using Rasch)
11
MME model (10 items scaled using the 3-PL model)
12
How do we get there?
  • Although the graphics on the previous screens may
    make conceptual sense, many wonder what we use to
    produce this curve.
  • We are psychometricians, not psychomagicians, so
    the numbers come from somewhere.
  • We need a person by item matrix to begin the
    process.
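
To make the starting point concrete, here is a minimal sketch of a person by item matrix with two simple summaries. The data and the function names are invented for illustration and are not MDE data or code.

```python
# Hypothetical person-by-item matrix: rows are persons, columns are items,
# 1 = correct, 0 = incorrect. The values are invented for illustration.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]


def raw_scores(matrix):
    """Number of items each person answered correctly (row sums)."""
    return [sum(row) for row in matrix]


def item_p_values(matrix):
    """Proportion of persons answering each item correctly (column means)."""
    n_persons = len(matrix)
    return [sum(col) / n_persons for col in zip(*matrix)]


print(raw_scores(responses))
print(item_p_values(responses))
```

An IRT program takes a matrix like this as input; the row and column summaries are only the simplest statistics it yields.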

13
IRT Estimation
  • The person by item matrix is fed into an IRT
    program to produce estimates of item parameters
    and person parameters.
  • Item parameters are the guessability,
    discrimination and difficulty parameters
  • Person parameters are the ability estimates we
    use to create a student's scale score.

14
Parameter Estimation
  • For single parameter (item difficulty) models,
    WINSTEPS is the industry standard.
  • More complex models like the 3 parameter model
    used in the MME require more specialized software
    such as PARSCALE.
  • Once we know the parameters, we feed them into
    the appropriate model to give us an estimated
    probability of correct response.
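
The last step can be sketched as a small function. This is a minimal illustration assuming the standard logistic form without a scaling constant; the function name, argument names, and item values are hypothetical.

```python
import math


def prob_correct(theta, b, a=1.0, c=0.0):
    """Estimated probability of a correct response under the 3PL model.

    theta: person ability, b: item difficulty,
    a: discrimination, c: guessing parameter.
    With the defaults a=1 and c=0 this is the Rasch (1PL) model.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# Rasch case: ability equal to difficulty gives a 50% chance.
print(prob_correct(theta=0.5, b=0.5))

# Hypothetical 3PL item with discrimination 1.2 and guessing 0.2.
print(round(prob_correct(theta=1.0, b=0.0, a=1.2, c=0.2), 3))
```

For very low abilities the 3PL probability approaches c rather than zero, which is how the model accounts for guessing.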

15
The Rasch Model (MEAP and ELPA)
16
The 3 Parameter Logistic Model (MME)
17
From Theta to Scale Scores
  • Once item parameters are known, we can use the
    item responses for the individuals to estimate
    their ability (theta).
  • In general, when persons share the same response
    string (pattern of correct and incorrect
    responses) they will have the same estimate of
    theta.
  • The estimation program will then produce a table
    that gives us the relationship between raw scores
    and theta.
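
A raw score to theta table can be sketched for the Rasch case, where the raw score is a sufficient statistic for theta. The sketch below assumes the item difficulties are already known and finds the maximum-likelihood theta for each raw score by Newton-Raphson. The difficulties and function names are hypothetical, and operational programs handle zero and perfect scores with additional rules that are not shown here.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def theta_for_raw_score(raw, difficulties, tol=1e-8, max_iter=100):
    """Maximum-likelihood theta for a raw score under the Rasch model.

    Solves sum_i P_i(theta) = raw by Newton-Raphson. Raw scores of 0 and
    the maximum have no finite estimate, so they are rejected.
    """
    n_items = len(difficulties)
    if not 0 < raw < n_items:
        raise ValueError("raw score must be strictly between 0 and n_items")
    theta = 0.0
    for _ in range(max_iter):
        probs = [rasch_prob(theta, b) for b in difficulties]
        expected = sum(probs)                      # expected raw score
        info = sum(p * (1.0 - p) for p in probs)   # test information
        step = (raw - expected) / info
        theta += step
        if abs(step) < tol:
            break
    return theta


difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]  # hypothetical item difficulties
table = {r: theta_for_raw_score(r, difficulties)
         for r in range(1, len(difficulties))}
for raw, theta in table.items():
    print(raw, round(theta, 3))
```

Each theta in the table would then be converted to a reported scale score by whatever transformation the program has adopted.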

18
(No Transcript)
19
DIF
  • Differential item functioning (DIF) is a
    phenomenon which occurs in the context of testing
    multiple groups.
  • When we estimate item and person parameters, we do
    so with a complete data set in which the persons
    are from multiple demographic groups.
  • We do our best to have a representative sample.
  • DIF occurs when we estimate the item calibrations
    separately based on groups of interest (e.g.
    males and females, ethnicity, type of
    instruction, geographic regions, etc.) and we
    find differences in item parameters.
  • Depending on the DIF methodology used, there are
    different levels of DIF that may or may not be
    problematic.
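
The comparison of separately estimated item parameters can be pictured with a deliberately simplified sketch. It assumes Rasch difficulties from two separate group calibrations and flags items whose centered difficulties differ by more than a cutoff. The function name, the data, and the 0.5-logit threshold are all hypothetical; this is not a description of the DIF methodology MDE uses, and operational methods add significance tests and effect-size classifications.

```python
def flag_dif(b_reference, b_focal, threshold=0.5):
    """Flag items whose difficulty differs between two group calibrations.

    Each set of difficulties is centered on its own mean first, because
    separate calibrations fix the origin of the scale arbitrarily.
    The default threshold (in logits) is a hypothetical cutoff.
    """
    mean_ref = sum(b_reference) / len(b_reference)
    mean_foc = sum(b_focal) / len(b_focal)
    flagged = []
    for item, (b_r, b_f) in enumerate(zip(b_reference, b_focal)):
        diff = (b_f - mean_foc) - (b_r - mean_ref)
        if abs(diff) > threshold:
            flagged.append((item, round(diff, 3)))
    return flagged


# Invented difficulty estimates for four items in two groups.
b_group_a = [-1.0, -0.2, 0.3, 0.9]
b_group_b = [-0.9, -0.1, 1.2, 0.8]
print(flag_dif(b_group_a, b_group_b))
```

A flagged item is only a candidate for review; the decision about what to do with it is made by people, not by the statistic.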

20
DIF
  • Once DIF is detected the item(s) is/are brought
    to the attention of the content specialists and
    at times the content and sensitivity review
    committees.
  • Items are either edited, deleted or kept in the
    assessment depending on the findings of the
    reviewers.
  • The bottom line is that the finding of DIF from a
    statistical standpoint does not necessarily mean
    the item will be deleted.
  • Subjective decisions always follow the technical
    information.

21
Item Types
  • MDE assessments contain a variety of items with
    different levels of maturity.
  • Even though an assessment is new for a test
    cycle, there are parts of it which are not new.
  • Generally speaking, we have core items and field
    test items.

22
Equating
  • Core items are more established and have been
    used before. In fact, the core items are the only
    ones which contribute to the score.
  • Field test items are embedded within assessments
    to maintain the health of our item banks.
  • We can treat the item parameters from core items
    as known and use those known parameters to
    drive the estimation of field test item
    parameters.
  • Equating also utilizes what we know about the
    common items to link assessments from year to
    year because common items are used in concurrent
    test years.

23
Equating
  • When we have a test designed to measure the same
    construct from year to year but differs
    (somewhat) in specific content, we need to be
    able to put the scores on the same scale.
  • What is being sought in test equating is a
    conversion from the units of one form of a test
    to the units of another form of the same test.

24
Equating
  • Three restrictions are important for equating:
  • The two tests must measure the same construct
  • The resulting conversion should be independent of
    the individuals from whom the data were drawn to
    develop the conversion
  • The conversions should be applicable in future
    situations.

25
Equipercentile Equating
  • Two scores, one on Form X and the other on Form Y
    (where X and Y measure the same thing with the
    same degree of reliability), may be considered
    equivalent if their corresponding percentile
    ranks in any given group are equal.
  • Plot the percentile rank to raw score curves for
    each form
  • Paired values for the forms are then interpolated
    at common points on the percentile rank
    distribution.
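
The steps above can be sketched as follows. This is a simplified illustration that treats raw scores as discrete points and interpolates linearly between them; the function names and score data are hypothetical, and operational equipercentile equating usually adds smoothing.

```python
from bisect import bisect_left


def percentile_ranks(scores, max_score):
    """Percentile rank of each raw score 0..max_score: the percent of
    examinees below the score plus half of those exactly at it."""
    n = len(scores)
    counts = [0] * (max_score + 1)
    for s in scores:
        counts[s] += 1
    ranks, below = [], 0
    for c in counts:
        ranks.append(100.0 * (below + 0.5 * c) / n)
        below += c
    return ranks


def equipercentile(x, scores_x, scores_y, max_score):
    """Form Y score with the same percentile rank as score x on Form X."""
    pr_x = percentile_ranks(scores_x, max_score)[x]
    pr_y = percentile_ranks(scores_y, max_score)
    if pr_x <= pr_y[0]:
        return 0.0
    if pr_x >= pr_y[-1]:
        return float(max_score)
    hi = bisect_left(pr_y, pr_x)   # first Y score with rank >= pr_x
    lo = hi - 1
    frac = (pr_x - pr_y[lo]) / (pr_y[hi] - pr_y[lo])
    return lo + frac


# Invented raw scores; Form Y is one point harder than Form X.
form_x = [1, 2, 2, 3, 3, 3, 4, 4, 5]
form_y = [0, 1, 1, 2, 2, 2, 3, 3, 4]
print(equipercentile(3, form_x, form_y, max_score=5))
```

In this invented data set every Form Y score is one point lower than its Form X counterpart, so a Form X score of 3 maps to a Form Y score of 2.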

26
Linear Equating
  • Based on the assumption that the two forms of the
    test, designed to be parallel (equivalent), will
    have essentially the same raw score
    distributions.
  • When the assumption is met, it should be possible
    to convert scores on one form of the measure into
    the same metric as the other form by employing a
    linear function.

27
Linear Equating
  • Analogous to multiple regression
  • Generally expressed as Y = a(X - c) + d
  • a refers to the ratio of the standard
    deviation of Form Y over the standard deviation
    of Form X
  • c refers to the mean of form X
  • d refers to the mean of form Y
  • To perform this type of equating, one of three
    basic data collection designs should be used.

28
Basic Data Collection Designs
  • Design 1: a large group of examinees is selected
    that is sufficiently heterogeneous to
    adequately sample all levels of the scores on
    both Form X and Form Y.
  • Divide this group randomly into two groups; each
    group gets a different form.
  • Collect the means and standard deviations of both
    groups and insert them in the aforementioned
    linear function.
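
Design 1 can be sketched with the linear function from the Linear Equating slide, Y = a(X - c) + d, where a is the ratio of the Form Y standard deviation to the Form X standard deviation, c is the Form X mean, and d is the Form Y mean. The function name and the scores for the two random groups are invented for illustration.

```python
from statistics import mean, stdev


def linear_equate(x, scores_x, scores_y):
    """Convert a Form X raw score to the Form Y metric: Y = a(X - c) + d."""
    a = stdev(scores_y) / stdev(scores_x)  # ratio of standard deviations
    c = mean(scores_x)                     # mean of Form X
    d = mean(scores_y)                     # mean of Form Y
    return a * (x - c) + d


# Invented raw scores for two randomly formed groups.
group_x = [12, 15, 18, 20, 25]   # took Form X
group_y = [14, 18, 22, 24, 32]   # took Form Y
print(linear_equate(18, group_x, group_y))
```

A score at the Form X mean converts to the Form Y mean, and a score one standard deviation above the Form X mean converts to one standard deviation above the Form Y mean.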

29
Design 2
  • Preferable when the test administrator has the
    luxury of more time to administer tests.
  • Both form X and form Y are administered to all
    subjects.
  • To control for order effects, half of the
    subjects receive form X first and half receive
    form Y first.
  • Calculations are slightly more involved because
    averages must be used and must be applied
    properly.

30
Design 3
  • Two randomly assigned groups each take a
    different test along with a common equating test.
  • Common test is known as the anchor test (form
    Z)
  • Again, calculations are complicated by the
    addition of the anchor test
  • Benefits: it is the industry standard; intact or
    non-random groups may be used; and the anchor test
    is designed to adjust for any between-group
    differences that may be present

31
IRT Equating
  • When using IRT, if the data fit the model
    reasonably well, the item and ability parameters
    are invariant.
  • For a set of calibrated items, an examinee will
    be expected to obtain the same ability estimate
    from any subset of items.
  • For any sub-sample of examinees, item parameters
    will be the same.

32
IRT Equating (Cont'd)
  • Based on the principles in the previous slide and
    the common anchor item methodology, MDE utilizes
    various forms of this equating methodology.
  • Generally, when we embed core items into
    examinations from year to year, we already know
    the difficulty estimates of those items.
  • When we perform our IRT estimation during the
    current year or with the form under
    investigation, we can fix the parameters of
    those items when we feed our data into an
    estimation program.
  • The new ability estimates (and field test item
    parameter estimates) are anchored to the previous
    form or previous year's administration.
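
Fixing the anchor parameters happens inside the estimation program and cannot be shown in a few lines. As a rough illustration of the same goal, the sketch below uses a different and simpler technique, mean-shift linking of Rasch difficulties: the new form is calibrated freely, and the average difference on the anchor items is used to move the field test items onto the bank scale. The function name and all values are hypothetical, and this is not presented as MDE's procedure.

```python
def mean_shift_link(bank_anchor_b, new_anchor_b, new_field_test_b):
    """Place freely calibrated Rasch difficulties on the bank scale.

    The shift is the mean difference between the anchor items' known
    bank difficulties and their difficulties in the new calibration.
    """
    shift = (sum(bank_anchor_b) - sum(new_anchor_b)) / len(bank_anchor_b)
    return [b + shift for b in new_field_test_b]


# Invented values: the new calibration runs 0.25 logits higher than the bank.
bank_anchor = [-0.5, 0.0, 0.5]       # known core-item difficulties
new_anchor = [-0.25, 0.25, 0.75]     # same items, new free calibration
new_field_test = [1.0, -1.0]         # field test items, new calibration
print(mean_shift_link(bank_anchor, new_anchor, new_field_test))
```

Either approach relies on the anchor items behaving the same way in both administrations, which is why the invariance property matters.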

33
Contact Information
  • Steven Viger
  • Michigan Department of Education
    608 W. Allegan St.
    Lansing, MI 48909
    Office (517) 241-2334
    Fax (517) 335-1186
  • VigerS_at_Michigan.gov

34
Contact Information
  • Joseph Martineau
  • Michigan Department of Education
    608 W. Allegan St.
    Lansing, MI 48909
  • Office (517) 241-4710
    Fax (517) 335-1186
  • MartineauJ_at_Michigan.gov