1. Measurement 102
- Steven Viger
- Lead Psychometrician
- Michigan Department of Education
- Office of Educational Assessment and
Accountability
2. Student Performance Measurement
- The previous session discussed some basic mechanics involved in psychometric analysis.
- Graphical and statistical methods
- The focus of this session is on the interpretation of the data in light of the often-used term reliability.
- Some attention will also be paid to some of the higher-level psychometrics that go on behind the scenes.
- How the scale scores are REALLY made!
3. Making inferences from measurements
- The inferences one can make based solely on educational measurement are limited.
- The extent of the limitation is largely a function of whether or not evidence of the valid use of scores is accumulated.
- At times, the terms validity and reliability are confused; however, they describe very different concepts.
4. Some basic validity definitions
- Validity
- The degree to which the assessment measures the intended construct(s)
- Answers the question: are you measuring what you think you are?
- More contemporary definitions focus on the accumulation of evidence for the validity of the inferences and interpretations made from the scores produced.
5. Some basic reliability definitions
- Reliability
- Consistency
- The degree to which students would be rank-ordered the same if they were administered the same assessment numerous times.
- Actually, the assumption is based on an infinite amount of retesting with no memory of the previous administrations (an unrealistic scenario).
6. More about reliability
- Reliability is one of the most fundamental requirements for measurement; if the measures are not reliable, then it is difficult to support claims that the measures can be valid for any particular decision.
- Reliability refers to the degree to which instrument scores for a group of participants are consistent over repeated applications of a measurement procedure and are, therefore, dependable and repeatable.
7. X = T + E
- True Score (T): A theoretical score for a person on an instrument that is equal to the average score for that person over an infinitely large number of retakes.
- Error (E): The degree to which an observed score (X) varies from the person's theoretical true score (T).
- In this context, reliability refers to the degree to which scores are free of measurement errors for a particular group, if we assume the relationship of observed and true scores is depicted as above.
8. Unreliability, AKA the standard error of measurement
- The standard error of measurement (SEM) is an estimate of the amount of error present in a student's score.
- If X = T + E, the SEM serves as a general estimate of the E portion of the equation.
- There is an inverse relationship between the SEM and reliability: tests with higher reliability have smaller SEMs.
- Reliability coefficients are indicators that reflect the degree to which scores are free of measurement error (a small computational sketch follows).
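A minimal sketch of how both quantities can be computed from a person-by-item score matrix. It assumes Cronbach's alpha as the reliability coefficient (the deck does not say which coefficient is reported) and the standard relationship SEM = SD of observed scores times sqrt(1 - reliability); the response data are hypothetical.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal-consistency reliability from a persons-by-items score matrix."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

def standard_error_of_measurement(scores: np.ndarray, reliability: float) -> float:
    """SEM = SD of observed total scores * sqrt(1 - reliability)."""
    sd_total = scores.sum(axis=1).std(ddof=1)
    return sd_total * np.sqrt(1 - reliability)

# Hypothetical 0/1 scored responses: 5 students by 4 items (toy data)
X = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
])
rel = cronbach_alpha(X)
print(f"alpha = {rel:.2f}, SEM = {standard_error_of_measurement(X, rel):.2f}")
```

The inverse relationship on the slide falls out of the formula: as the reliability estimate approaches 1, sqrt(1 - reliability) shrinks toward 0 and so does the SEM.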
9. More on the Standard Error of Measurement
- The smaller the SEM for a test (and, therefore, the higher the reliability), the more one can depend on the ordering of scores to represent stable differences between students.
- The higher the reliability, the more likely it is that the rank ordering of students by score is due to differences in true ability rather than random error.
- The higher the reliability, the more confident you can be that the observed score, X, is an accurate estimate of the student's true score, T.
10. Standards for Reliability
- There are no mathematical rules to determine what constitutes an acceptable reliability coefficient.
- Some advice:
- Individual-based decisions should be based on scores produced from highly precise instruments.
- The higher the stakes, the higher you will want your reliability to be.
- Group-based decisions in a research setting typically allow lower reliability.
- If you are making high-stakes decisions about individuals, you need reliabilities above .80 and preferably in the .90s.
11. Establishing validity
- Past practice has been to treat validity as if there were a criterion amount of evidence necessary to deem an instrument valid.
- That practice is outdated and inappropriate.
- It does not acknowledge that numerous pieces of information need to come together to facilitate valid inferences.
- It tends to discount some pieces of evidence and overemphasize others.
- It leads to a narrowing of scope and can encourage a limited approach to gathering evidence.
12. Process vs. Product
- Rather than speak of validity as a thing, we need to start approaching it as an ongoing process, validation, that is fed by all aspects of a testing program.
- The current AERA and APA standards for validity tend to treat the validation process like a civil court proceeding.
- A preponderance of the evidence is sought, with the evidence coming from multiple sources.
13. Validation from item evidence
- Focus is on elimination of construct-irrelevant variance.
- Some ways this is accomplished:
- Well-established item development/review procedures
- Demonstrated alignment of individual items to standards
- Showing the items/assessments are free of bias, quantitatively and qualitatively
- Simple item analyses that eliminate items with questionable statistics (e.g., p-values too high, low point-biserial correlations, etc.), as sketched below
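A minimal sketch (not MDE's actual procedure) of those two item statistics, computed from a hypothetical person-by-item matrix of 0/1 scored responses. The classical p-value is simply the proportion of examinees answering the item correctly; the point-biserial here is the "corrected" version, correlating each item with the total score excluding that item.

```python
import numpy as np

def item_analysis(scores: np.ndarray):
    """Classical item statistics from a persons-by-items 0/1 matrix."""
    p_values = scores.mean(axis=0)            # proportion correct per item
    total = scores.sum(axis=1)
    point_biserials = []
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]           # total score excluding the item itself
        r = np.corrcoef(scores[:, j], rest)[0, 1]
        point_biserials.append(r)
    return p_values, np.array(point_biserials)

# Hypothetical scored responses: 6 students by 3 items
X = np.array([
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])
p, pb = item_analysis(X)
print("p-values:", p.round(2))
print("corrected point-biserials:", pb.round(2))
```

Items with p-values near the extremes or with low (or negative) point-biserials are the ones flagged for review in the kind of screening the slide describes.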
14. Validation from scaled scores
- Scale-score-level validity evidence includes, but is not limited to:
- Input from item-level validity evidence (the validity of the score scale depends upon the validity of the items that contribute to that score scale)
- Convergent and divergent relationships with appropriate external criteria
- Reliability evidence
- Appropriate use of a strong measurement model for the production of student scores
15. Is it valid, reliable, or both?
16. Measurement models
- The measurement models used by MDE fall under the general category of Item Response Theory (IRT) models.
- IRT models depict the statistical relationship that occurs as a result of person/item interactions.
- Specifically, statistical information regarding the persons and the items is used to predict the probability of correctly responding to a particular item; if the item is constructed-response, it is the probability of a person receiving a specific score point from the rubric.
- Like all statistically based models, IRT models carry some assumptions; some are theoretical, whereas others are numerical.
17. IRT assumptions
- Unidimensionality: there is a single underlying construct being measured by the assessment (e.g., mathematics achievement, writing achievement).
- As a result of the single-construct assumption, the model dictates that we treat all sub-components (strand level, domain, subscales in general) as contributing to the single construct.
- Assumes that there is a high correlation between sub-components.
- It would probably be better to measure the sub-components separately, but that would require significantly more assessment items to attain decent reliability.
18. IRT assumptions
- Assumes that a more able person has a higher probability of responding correctly to an item than a less able person.
- Specifically, when a person's ability is greater than the item difficulty, they have a better than 50% chance of getting the item correct.
- Local independence: the response to one item is independent of, and does not influence, the probability of responding correctly to another item.
- The data fit the model!
- The item and person parameter estimates are reasonable representations of reality, and the data collected meet the IRT model assumptions.
19. The Rasch Model (MEAP and ELPA)
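In its standard form, the Rasch (1PL) model gives the probability that a person with ability θ answers an item of difficulty b correctly as

$$P(X = 1 \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}$$

so when θ = b the probability is exactly .50, which is where the "better than 50% chance" statement in the assumptions above comes from.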
20. The Rasch Model (1-parameter logistic model)
- An item characteristic curve for a sample MEAP item
21. The 3-Parameter Logistic Model (MME and MEAP Writing)
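In its standard form, the 3-parameter logistic (3PL) model adds a discrimination parameter a and a lower-asymptote ("guessability") parameter c to the ability-difficulty difference:

$$P(X = 1 \mid \theta) = c + (1 - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}$$

Some presentations of the model also include a scaling constant D of about 1.7 multiplying a(θ - b); the simpler form above is the one assumed in the sketches that follow.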
22. The 3-Parameter Logistic Model
- An item characteristic curve for a sample MME item
23.
- Before I show you what a string of items looks like using IRT, I'd like to first point out some differences in the models that will lead to some major differences in the way the items look graphically.
- In particular, we need to pay attention to the differences in the formulas.
- Are there features of the 3PL model that do not appear in the 1PL model?
24.
- In both models, the quantity driving the solution to the equation is the difference between person ability and item difficulty, θ - b.
- However, in one model, that relationship is altered, and we cannot rely on the difference between ability and difficulty alone to determine the probability of a correct response to an item.
25. 1PL vs. 3PL
- In the 1-parameter model, the item difficulty parameter (assuming the student's ability is a known and fixed quantity) and its difference from student ability drive the probability of a correct response. All other elements are constants in the equation.
- Hence the name, 1-parameter model.
- Therefore, when you see the plots of multiple items, they should differ only by a constant in terms of their location on the scale.
26. 1PL vs. 3PL
- In the 3-parameter model, there are still constants, and the difference between ability and difficulty is still the critical piece. However, a, the discrimination parameter, has a multiplicative effect on the difference between ability and difficulty. Furthermore, the minimum possible result for the equation is influenced by the c parameter.
- If c > 0.00, the probability of a correct response must be greater than 0.
- Item characteristic curves will vary by location on the scale as well as by lower asymptote (c parameter) and slope (a parameter).
- Knowing how difficult an item is compared to another is still relevant, but it is not the only piece of information that leads to item differences (see the sketch below).
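A small sketch of the two response functions under hypothetical parameter values (b = 0.0, and for the 3PL item a = 1.5 and c = 0.20), illustrating that the 1PL probability depends only on θ - b, while the 3PL probability is also scaled by a and never falls below c.

```python
import numpy as np

def p_1pl(theta: float, b: float) -> float:
    """Rasch / 1PL: probability depends only on the ability-difficulty difference."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def p_3pl(theta: float, b: float, a: float, c: float) -> float:
    """3PL: discrimination a scales the difference, c sets the lower asymptote."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Compare the two models at low, matched, and high ability for items of equal difficulty
for theta in (-2.0, 0.0, 2.0):
    print(
        f"theta={theta:+.1f}  "
        f"1PL={p_1pl(theta, b=0.0):.2f}  "
        f"3PL={p_3pl(theta, b=0.0, a=1.5, c=0.20):.2f}"
    )
```

At theta equal to the difficulty, the 1PL probability is exactly .50, while the 3PL probability is .60 because the curve is compressed above its lower asymptote of .20.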
27. MEAP example (10 items scaled using Rasch)
28. MME example (10 items scaled using the 3PL model)
29. How do we get there?
- Although the graphics and equations on the previous screens may make conceptual sense, you may have noticed that the solution to the equations depends on knowledge of the values of some of the variables.
- We are psychometricians, not psychomagicians, so the numbers have to come from somewhere.
- The item and person parameters have to be estimated.
- We need a person-by-item matrix to begin the process.
30. IRT Estimation
- The person-by-item matrix is fed into an IRT program to produce estimates of item parameters and person parameters.
- An estimation algorithm is used, which is essentially a predefined process with stop-and-go rules. The end products are best estimates of the item parameters and person ability estimates.
- Item parameters are the guessability, discrimination, and difficulty parameters.
- Person parameters are the ability estimates we use to create a student's scale score.
31. Parameter Estimation
- For single-parameter (item difficulty) models, WINSTEPS is the industry standard.
- More complex models, like the 3-parameter model used in the MME, require more specialized software such as PARSCALE.
- The estimation process is iterative but happens very quickly; most programs converge in less than 10 seconds.
- Typically, item parameters are estimated first, followed by person ability parameters.
32. Estimating Ability
- Once item parameters are known, we can use the item responses for the individuals to estimate their ability (theta).
- For the 3PL model, when people share the same response string (pattern of correct and incorrect responses), they will have the same estimate of theta.
- In the 1PL model, the raw score is used to derive the thetas.
- Essentially, the same raw score will generate different estimates of theta, but they are close. The program will create a table that relates raw scores to thetas to scale scores, based on maximum likelihood estimation (sketched below).
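To illustrate the idea only (this is not what WINSTEPS or PARSCALE actually do internally), here is a simple maximum likelihood sketch that finds the theta value most consistent with one student's response string, given item parameters that are assumed to have been estimated already. A brute-force grid search stands in for the real optimization routine, and all numbers are hypothetical.

```python
import numpy as np

def log_likelihood(theta: float, responses: np.ndarray, a: np.ndarray,
                   b: np.ndarray, c: np.ndarray) -> float:
    """Log-likelihood of a 0/1 response string under the 3PL model."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    return float(np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p)))

def estimate_theta(responses, a, b, c, grid=np.linspace(-4, 4, 801)):
    """Brute-force maximum likelihood estimate of ability over a theta grid."""
    ll = [log_likelihood(t, responses, a, b, c) for t in grid]
    return grid[int(np.argmax(ll))]

# Hypothetical item parameters for a 5-item test and one student's responses
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])    # discrimination
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # difficulty
c = np.array([0.2, 0.2, 0.25, 0.2, 0.2])   # lower asymptote ("guessability")
x = np.array([1, 1, 1, 0, 0])              # scored responses

print(f"estimated theta = {estimate_theta(x, a, b, c):+.2f}")
```

Because the 3PL likelihood depends on which items were answered correctly, two students with the same raw score but different response strings can end up with different thetas, which is the pattern-scoring behavior described above.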
33. From theta to scale score
- Remember the following formula?
- y = mx + b
- That is an example of a linear equation.
- MDE uses linear equations to transform thetas to scale scores (see the sketch after this list).
- There is a different transformation for each grade and content area.
- Performance levels are determined by the student's scale score.
- Cut scores are produced by standard-setting panelists.
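A minimal sketch of that final step. The slope and intercept below are hypothetical placeholders; the actual MDE transformation constants differ by grade and content area and are not given here.

```python
def theta_to_scale_score(theta: float, slope: float, intercept: float) -> int:
    """Linear transformation of theta to a reporting scale (y = m*x + b form)."""
    return round(slope * theta + intercept)

# Hypothetical constants for one grade/content area
print(theta_to_scale_score(theta=0.85, slope=25.0, intercept=500.0))  # -> 521
```

Because the transformation is linear, it changes the units of the theta scale without changing the rank ordering of students, and the cut scores set by panelists can be expressed on the same reporting scale.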
34. Summary
- In this session you found out a bit about reliability and validity.
- These are two important pieces of information for any assessment.
- Remember, it is the validity of the inferences we make that is important.
- The evidence is accumulated, and the process is ongoing.
- There are no separate types of validity.
- You were also introduced to item response theory models and how they are used to produce MDE scale scores.
- The hope is that you leave with a greater understanding of how MDE assessments are scored, scaled, and interpreted.
- In addition, you now have some tools that can assist you in your own analyses.
35. Contact Information
- Steve Viger
- Michigan Department of Education
- 608 W. Allegan St., Lansing, MI 48909
- (517) 241-2334
- VigerS@Michigan.gov