1. Measurement 101
- Steven Viger, Lead Psychometrician, Office of General Assessment and Accountability, Michigan Dept. of Education
- Joseph Martineau, Ph.D., Interim Director, Office of General Assessment and Accountability, Michigan Dept. of Education
2. Agenda
- This session will introduce you to some of the basic psychometric techniques utilized by the Department of Education.
- The focus of this session will be on techniques grounded in Classical Test Theory (CTT).
- Analyses driven by the raw score metric at both the test and item level.
- Specifically, the basic analysis of total test scores and item scores will be presented in the context of instrument quality and functioning.
- Before discussing the specific indicators, we will begin with a brief primer on the statistical concepts necessary to fully appreciate the analytic techniques.
3. Statistics
- While a psychometrician is not necessarily a statistician, a fairly sophisticated understanding of statistics is necessary to fully appreciate the mechanics of psychometric analyses.
- Formulas for determining various psychometric indicators are, in a sense, recipes.
- If we do not provide the proper ingredients, at the proper time, and in a manner consistent with the recipe, success is not likely.
- Therefore, we will begin with a description of the most common ingredients used in item and reliability analyses.
- Additionally, we will discuss the common operators (i.e., rules for adding and mixing the ingredients) encountered in the analyses.
- The goal is not only to show you the formulas but also to help you truly understand the mechanics of the analyses.
4. Back to Basics
- Most psychometric formulas are mixtures of very basic and common statistics.
- In classical analyses we are typically operating on a collection of data. We conceptualize these data as forming a distribution of scores.
- An example of a common distribution is the normal distribution, or bell curve.
- Psychometric analyses always utilize summary measures of either the whole distribution or specific chunks of the distribution:
  - a measure of central tendency: an indicator of where most of the data reside in the distribution
  - a measure of variability: an indicator of how much (or how little) the data are spread out around the measure of central tendency.
5. Frequency Distributions
- A first step in summarizing data is to create a frequency distribution showing the number of times a given data value occurred in your sample.
- You will often encounter frequency distributions that group the data into ranges or intervals, with counts showing the number or percentage of observations within each defined interval.
- True frequency distributions count the occurrence of each individual score (each possible score), as illustrated in the sketch below.
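As a concrete illustration (not from the original slides), a true frequency distribution can be tallied in a few lines of Python, here using made-up pretest scores:

```python
from collections import Counter

# Hypothetical pretest scores (illustrative only; not from the slides)
scores = [8, 7, 5, 10, 8, 9, 7, 8, 6, 10]

# A true frequency distribution counts each individual score value
freq = Counter(scores)
for value in sorted(freq):
    print(f"score {value}: {freq[value]} occurrence(s)")
```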
6. Histograms
- The graphic version of a frequency distribution that uses intervals is a histogram (sometimes incorrectly called a bar chart).
- We examine frequency distributions and histograms for each variable of interest to get an impression of the overall shape of the data and to see whether there are outliers in the data.
7. Notation
For simplicity of expression, we use symbols to represent various concepts and operations in statistics.
- Variables: The codes (often numerical codes) we use to describe the constructs we're interested in. Variables are indicated by upper-case letters (X, Y). Individual values are represented using subscripts (Xi, Yj).
- Summation: We frequently need to add a series of observations for a variable in order to summarize that variable or to perform other operations. The Greek upper-case sigma (Σ) is used to symbolize this.
8.
Say the data are these pretest scores: 8, 7, 5, 10.
- X1 would be the score of the first person in the data set. Here X1 = 8. The first score is not necessarily the largest (or smallest) score, because we don't assume the scores are ordered.
- Xi is the ith score (any case), and although there is no implied order, the data are arranged in a certain way, meaning that we can use their layout to specify positions; here you select the value of i you are interested in. If i = 3, Xi = 5. If i = N, then here N = 4, so XN = 10.
- Saying Xi, for i = 1 to N, means the set of all N scores.
9. Summation Notation Applied
- Frequently, it is clear that we want to sum all values of X, so we can simply write ΣX, which equals (X1 + X2 + X3 + … + XN) and means the same thing as Σ_{i=1}^{N} Xi.
- Other common summations (contrasted in the sketch below):
  - The sum of the squared values of X: ΣXi²
  - The square of the sum of X: (ΣXi)²
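A quick sketch in Python, using the four pretest scores from slide 8, shows why the sum of squares and the square of the sum differ:

```python
# The pretest scores from slide 8
X = [8, 7, 5, 10]

sum_x = sum(X)                          # ΣX    = 8 + 7 + 5 + 10 = 30
sum_of_squares = sum(x**2 for x in X)   # ΣX²   = 64 + 49 + 25 + 100 = 238
square_of_sum = sum(X) ** 2             # (ΣX)² = 30² = 900

print(sum_x, sum_of_squares, square_of_sum)  # 30 238 900
```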
10. Measures of Central Tendency
- There are three primary measures of central tendency:
- Mode (Mo): The most frequently occurring data value.
- Median (Med): When the data are rank ordered, the middle value (or the average of the two middle values when there is an even number of observations). The median, therefore, represents the 50th percentile of the data values.
- Mean (also µ or X-bar): The arithmetic average, obtained by adding all data values and dividing by the number of observations.
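The formula image for the mean did not survive extraction; the standard definition, in the notation above, is:

```latex
\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}
```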
11. Mean, Med, Mo
The mean, median, and mode are equal ONLY when the distribution is symmetrical and unimodal (e.g., normal). When the distribution is skewed and unimodal, the mode will be at the "hump" in the distribution, the mean will be pulled out toward the tail of the skew, and the median will likely be between the other two values.
[Figure: a skewed unimodal distribution, with the mode at the peak, the median in the middle, and the mean pulled toward the tail]
12. Variability
- Another characteristic of a distribution that we may wish to summarize is its dispersion, or spread, on the underlying continuum.
- For example, picture two distributions (one blue, one red) with the same measure of central tendency, where the red one is more widely dispersed (wider and flatter) along the X-axis. Another way to say this is that its data are more spread out about the measure of central tendency.
13. Measures of Variability
- Some common measures of variability:
- Range: The difference between the two most extreme data points (maximum − minimum).
- Variance (s²x, "s-squared sub X"): The average squared deviation of scores from the mean. The numerator of the equation is commonly referred to as the sum of squares (SS). We square the deviations from the mean to eliminate negative numbers and to avoid the strong possibility of their summing to zero.
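The variance formula referred to above (with the sum of squares in the numerator) is, in its population form:

```latex
s^2_X = \frac{\sum_{i=1}^{N} \left(X_i - \bar{X}\right)^2}{N}
```

(Sample estimators divide by N − 1 instead of N.)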
14. Measures of Variability
- Standard Deviation (sx, "s sub X"): Roughly, the typical deviation of scores from the mean; defined as the square root of the variance.
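That is, following the variance definition on the previous slide:

```latex
s_X = \sqrt{s^2_X} = \sqrt{\frac{\sum_{i=1}^{N} \left(X_i - \bar{X}\right)^2}{N}}
```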
15. Reliability
- Reliability refers to the degree to which instrument scores for a group of participants are consistent over repeated applications of a measurement procedure and are, therefore, dependable and repeatable.
- The definition of "repeated applications" is situation dependent.
- Reliability is one of the most fundamental requirements for measurement: if the measures are not reliable, then it is difficult to support claims that the measures can be valid for any particular decision.
16. True Score and Error
- True Score (T): A theoretical score for a person on an instrument, equal to the average score for that person over an infinitely large number of retakes. We estimate this value using the person's score on a single administration of the instrument.
- Error (E): The degree to which an observed score (X) varies from the person's theoretical true score (T).
- X = T + E
- In measurement, reliability refers to the degree to which scores are free of measurement errors for a particular group, if we assume the relationship of observed and true scores is as depicted above.
17. Standard Error of Measurement
- The standard error of measurement (SEM) is an estimate of the amount of error present in a student's score.
- If X = T + E, the SEM serves as a general estimate of the E portion of the equation.
- There is an inverse relationship between the SEM and reliability: tests with higher reliability have smaller SEMs relative to the standard deviation of the test score.
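The slide does not show a formula, but the standard CTT expression for this relationship, where r_XX' denotes the reliability coefficient, is:

```latex
SEM = s_X \sqrt{1 - r_{XX'}}
```

So a perfectly reliable test (r = 1) has SEM = 0, and the SEM grows as reliability falls.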
18. More on the Standard Error of Measurement
- The smaller the SEM for a test (and, therefore, the higher the reliability), the more one can depend on the ordering of scores to represent stable differences between students.
- The more confident you can be that the observed score, X, is an accurate estimate of the student's true score, T.
- The converse also holds. That is, the larger the SEM, the more error is likely present in the test scores for each student, and therefore the less confident one should be about the stability of the ordering of students on the basis of their test scores.
- When the SEM is large, this usually means that the spread of scores among the students is due more to error in measuring their knowledge of the course content than to what the students actually know or have learned.
19. Reliability Coefficients (in general)
- Reliability coefficients are indicators that reflect the degree to which scores are free of measurement error.
- The indicator is at times represented by a correlation coefficient and represents the ratio of the variance of individual differences to observed score variance for a particular examinee population.
- The variance of individual differences is latent and unobserved.
- We rely on estimates of these variance components obtained from the observed data.
20. More on Reliability
- The conditions under which the coefficient is estimated may involve variations in:
  - instrument forms
  - measurement occasions
  - raters
  - items
- Taken together, the above suggest that reliability can be thought of as the degree to which individuals are rank ordered in a consistent way across measurement contexts.
21. Reliability Coefficients (examples)
- The most commonly encountered reliability coefficient is one that measures internal consistency.
- An internal consistency coefficient is an index of the reliability of instrument scores derived from the statistical interrelationships among item responses or among scores on separate parts of an instrument.
22. Internal Consistency
- An important assumption underlying measures of internal consistency is the comparability of items.
- Each item is assumed to be an appropriate measure of the construct believed to be captured by the entire instrument.
- To compute these coefficients you must have item-level data (i.e., the participants' actual responses to all items).
23. Coefficient Alpha
- Coefficient alpha is an internal consistency reliability coefficient based on the number of parts into which the instrument is partitioned, the interrelationships of the parts, and the total instrument score variance. Also known as Cronbach's alpha, or KR-20 (for dichotomous items).
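The formula itself is missing from the extracted slide; the standard form for an instrument partitioned into k parts, with part variances s²i and total score variance s²X, is:

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s^2_i}{s^2_X}\right)
```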
24. Other reliability coefficients
- Questions concerning the stability of scores across time points (test-retest), forms of a test (alternate forms), or raters or judges are also questions of reliability.
- To return to the introduction to reliability, the above point is why the meaning of "repeated applications" in the definition of reliability is situation dependent.
25. Reliability Coefficients Continued
- While internal consistency measures are extremely useful when the focus is on one test (or one form of a test), there are other questions that can be answered when other data are available.
- When our question of reliability is not focused on the internal consistency of a single instrument, the most common indicator of reliability is a correlation between two instruments.
26. Correlations
- Correlations are analyses familiar to many people, even if you've never actually computed one.
- The correlation is simple to compute provided you have the basic pieces of statistical information introduced earlier (the mean and the variance).
- We'll also show you how to do it with raw scores on the instruments.
27. Information about Correlations
- The correlation (r) between any two variables is a measure of association.
- The strength of the relationship and the direction of the relationship (+/−) are two important pieces of information available from a single coefficient.
- In the context of reliability analysis, we expect strong positive relationships.
- Correlations are limited in range: −1 ≤ r ≤ +1.
28. Computing a correlation coefficient when means and variances are available
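The slide's formula did not survive extraction. A standard deviation-score form, consistent with the mean and variance definitions above, is:

```latex
r_{XY} = \frac{\sum_{i=1}^{N} \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{N \, s_X \, s_Y}
```

(with the standard deviations computed using N in the denominator; sample versions use N − 1 throughout).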
29. Don't have summary statistics? Not a problem!
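Again, the formula is missing from the extraction; the standard raw-score (computational) form is:

```latex
r_{XY} = \frac{N\sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)}
{\sqrt{\left[N\sum X_i^2 - \left(\sum X_i\right)^2\right]\left[N\sum Y_i^2 - \left(\sum Y_i\right)^2\right]}}
```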
30. Computational Example
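The original slide's worked numbers did not survive extraction, so the following is an illustrative computation on hypothetical form A / form B scores, using the raw-score formula above:

```python
from math import sqrt

X = [8, 7, 5, 10, 6]   # hypothetical scores on form A
Y = [9, 6, 4, 10, 7]   # hypothetical scores on form B
N = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Raw-score (computational) formula for the Pearson correlation
numerator = N * sum_xy - sum_x * sum_y
denominator = sqrt((N * sum_x2 - sum_x**2) * (N * sum_y2 - sum_y**2))
r = numerator / denominator
print(f"r = {r:.3f}")  # r = 0.915 for these made-up data
```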
31. Reliability Coefficients: To what degree do measures agree across contexts?
[Scatterplots illustrating agreement at r = .00, r = .50, and r = .90]
32. Reliability Coefficients
- The Spearman-Brown formula, derived within true score test theory, projects the reliability of a shortened or lengthened instrument from the reliability of an instrument of a specified length.
- M represents the length of the new form relative to the length of the old form (e.g., if M = 2, the new instrument is twice as long as the old; if M = 1/2, the new instrument is half as long as the old; etc.).
- In other words, it is used for projecting the reliability you could expect if you were to change the instrument length. (See the formula below.)
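The formula, which is missing from the extracted slide, is standard: given the old reliability r_old and the length ratio M,

```latex
r_{\text{new}} = \frac{M \, r_{\text{old}}}{1 + (M - 1)\, r_{\text{old}}}
```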
33. Split-Halves Estimates
- The split-halves reliability coefficient is an internal consistency coefficient obtained by using half of the items on the instrument to yield one score and the other half of the items to yield a second, independent score. The correlation between the two halves, adjusted via the Spearman-Brown formula by setting M = 2, provides an estimate of the alternate-form reliability of the total instrument.
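With M = 2, the Spearman-Brown projection reduces to the familiar split-half correction; for example, a half-test correlation of .60 projects to 2(.60)/(1 + .60) = .75 for the full-length instrument:

```latex
r_{\text{total}} = \frac{2\, r_{\text{half}}}{1 + r_{\text{half}}}
```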
34. Standards for Reliability
- While there is no general rule for interpreting reliability coefficients, it is commonly agreed that the more individualized a decision based on a measurement is, and the higher the stakes involved in that decision, the higher the reliability needs to be.
- Generally, you could get away with group-based decisions in a research setting with reliabilities as low as .50.
- If you are making high-stakes decisions about individuals, you need reliabilities above .80, and preferably in the .90s.
35. Basic Item Analysis
- In the context of true score test theory we are still dealing with observed scores tied to the metric of the test.
- As long as measurement error is minimized, we can be relatively sure that higher scoring students are likely to be more able.
- Now that you have your basic arsenal of statistics, there are a couple of relatively simple item analyses you can perform.
- The two most commonly referenced item statistics are the item point-biserial correlation and the p-value.
- Item analysis is especially important when items are in their infancy (field testing).
36. P-values
- P-values are sometimes referred to as item difficulty estimates.
- The item p-value is the average item score across all examinees.
- Add up the item scores, ΣXi, and divide by the number of cases used for the sum, n, to obtain the average: ΣXi / n (a short computational sketch follows this list).
- When items are scored dichotomously (e.g., 1 is correct, 0 is incorrect), p-values range from 0 to 1.
- Smaller values indicate more difficult items (i.e., fewer people responded correctly).
- Referring to the p-value as a measure of item difficulty is sometimes counterintuitive because higher values indicate easier items.
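A minimal sketch of the computation described above, using a hypothetical 0/1 scored response matrix (rows = examinees, columns = items):

```python
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
]

n = len(responses)
n_items = len(responses[0])

# p-value for item j: sum of item scores divided by number of examinees
p_values = [sum(row[j] for row in responses) / n for j in range(n_items)]
print(p_values)  # [0.8, 0.6, 0.2, 0.8]; the third item (p = 0.2) is hardest
```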
37. P-values
- P-values are sample-dependent statistics, so they may change from sample to sample.
- This is why MDE always has a planned sampling strategy when field testing items.
- The amount of information in the statistics alone is limited.
38. Using the p-value to inform
- However, when knowledge of item content and the intended difficulty are combined, the p-value can help us decide whether the item is behaving as planned.
- Perhaps in a math test we expect pre-algebra items to be more difficult than items testing arithmetic operations. We could calculate the p-values for each of the item types and confirm this.
- Usually items are designed to vary in difficulty within a given content domain.
- Within a content domain, items are written such that a student of low ability can still answer some of the conceptually more difficult items.
- Conversely, we always try to create items across all content domains that will challenge even the most able examinee.
39. Point-biserial Correlations
- Perhaps you are satisfied with the p-values obtained in your analysis; maybe some of the items were surprisingly easy or difficult. Either way, you should at the very least take a look at one more piece of evidence.
- It is always a good idea to examine whether or not the items appear to be measuring the same thing as the total score.
- You should try to confirm that there is a relationship between a person's performance on an item and their performance on the instrument as a whole.
40. Point-biserial Correlations
- The computation of a point-biserial correlation coefficient is carried out in the same manner as the previous example of a correlation coefficient.
- It is somewhat tedious because it must be done one item at a time: each item score needs to be correlated with the total score with the item in question removed (see the sketch after this list).
- Obviously, a negative correlation is not desired.
- A typical rule of thumb is to scrutinize items with point-biserial correlations less than .3.
- Additionally, if an item had an unexpected p-value, you will often see a counterintuitive point-biserial correlation as well.
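The sketch referenced above, reusing the hypothetical response matrix from the p-value example; each 0/1 item score is correlated with the total score computed with that item removed:

```python
from math import sqrt

def pearson(x, y):
    # Plain Pearson correlation; assumes neither vector is constant
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
]

n_items = len(responses[0])
for j in range(n_items):
    item = [row[j] for row in responses]
    rest = [sum(row) - row[j] for row in responses]  # total minus the item
    print(f"item {j + 1}: corrected point-biserial = {pearson(item, rest):.2f}")
```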
41. Synthesis
- If you combine the information obtained in this session, you can get a fairly good indication of preliminary instrument functioning without the use of highly sophisticated statistics.
- When our instrument is reliable, we are confident in the instrument's ability to correctly rank order our examinees.
- After computing your reliability, you can also dig deeper and examine item quality.
42. Synthesis
- The item quality indicators can often help in the diagnosis of lower-than-desired reliability coefficients (or a high SEM).
- Items with low point-biserial correlations will yield low reliability for the instrument as a whole.
- Instruments with different p-value and point-biserial distributions will tend to yield low test-retest or alternate-form reliability coefficients.
43. Contact Information
- Steven Viger
- Michigan Department of Education, 608 W. Allegan St., Lansing, MI 48909
- Office: (517) 241-2334; Fax: (517) 335-1186
- VigerS_at_Michigan.gov
44. Contact Information
- Joseph Martineau
- Michigan Department of Education, 608 W. Allegan St., Lansing, MI 48909
- Office: (517) 241-4710; Fax: (517) 335-1186
- MartineauJ_at_Michigan.gov