Transcript and Presenter's Notes

Title: Measurement 101


1
Measurement 101
  • Steven Viger
  • Lead Psychometrician, Office of General
    Assessment and Accountability, Michigan Dept. of
    Education
  • Joseph Martineau, Ph.D.
  • Interim Director, Office of General Assessment
    and Accountability, Michigan Dept. of Education

2
Agenda
  • This session will introduce you to some of the
    basic psychometric techniques utilized by the
    Department of Education.
  • The focus of this session will be on techniques
    grounded in Classical Test Theory (CTT).
  • Analyses driven by the raw score metric at both
    the test and item level.
  • Specifically, the basic analysis of total test
    score and item scores will be presented in the
    context of instrument quality and functioning.
  • Prior to discussing the specific indicators, we
    will begin with a brief primer on the statistical
    concepts necessary to fully appreciate the
    analytic techniques.

3
Statistics
  • While it is not necessarily the case that a
    Psychometrician is a Statistician, it is
    necessary to have a fairly sophisticated
    understanding of statistics to fully appreciate
    the mechanics of psychometric analyses.
  • Formulas for determining various psychometric
    indicators are in a sense recipes.
  • If we do not provide the proper ingredients, at
    the proper time, and in a manner consistent with
    the recipe, success is not likely.
  • Therefore, we will begin with a description of
    the most common ingredients used in item and
    reliability analyses.
  • Additionally, we will discuss the common
    operators (i.e., the rules for adding and mixing
    the ingredients) encountered in the analyses.
  • The goal is to not only show you the formulas but
    to also help you truly understand the mechanics
    of the analyses.

4
Back to Basics
  • Most psychometric formulas are mixtures of very
    basic and common statistics.
  • In classical analyses we are typically operating
    on a collection of data. We conceptualize this
    data as forming a distribution of scores.
  • An example of a common distribution is the normal
    distribution or bell curve.
  • Psychometric analyses always utilize summary
    measures of either the whole distribution or
    specific chunks of the distribution:
  • a measure of central tendency: an indicator of
    where most of the data reside in the distribution
  • a measure of variability: an indicator of how
    much (or how little) the data are spread out
    around the measure of central tendency.

5
  • One first step in summarizing data is to create a
    frequency distribution showing the number of
    times a given data value occurred in your sample.
  • It is likely that you will encounter frequency
    distributions which group the data into ranges or
    intervals, with counts provided to show the
    number or percentage of observations within the
    defined intervals.
  • True frequency distributions count the occurrence
    of each individual possible score.

6
Histograms
  • The graphic version of a frequency distribution
    that uses intervals is a histogram (sometimes
    incorrectly called a bar chart).
  • We examine frequency distributions and histograms
    for each variable of interest to get an
    impression of the overall shape of the data and
    to see whether there are outliers in the data.

7
For simplicity of expression, we use symbols to
represent various concepts and operations in
statistics.
Variables: The codes (often numerical codes) we use
to describe the constructs we're interested in.
Variables are indicated by upper-case letters
(X, Y). Individual values are represented using
subscripts (Xi, Yj).
Summation: We frequently need to add a series of
observations for a variable in order to summarize
that variable or to perform other operations. The
Greek upper-case sigma (Σ) is used to symbolize
this.
8
Say the data are these pretest scores: 8, 7, 5, 10.
X1 would be the score of the first person in the
data set. Here X1 = 8. The first score is not
necessarily the largest (or smallest) score,
because we don't assume the scores are ordered.
Xi is the ith score (or any case), and although
there is no implied order, the data are arranged in
a certain way, meaning that we can use their layout
to specify positions. Here you select what value of
i you are interested in: if i = 3, then Xi = 5. If
i = N, then here N = 4, so XN = 10. Saying Xi, for
i = 1 to N, means the set of all N scores.
9
Summation Notation Applied
  • Frequently, it is clear that we want to sum all
    values of X, so we can simply write ΣX,
  • which equals (X1 + X2 + X3 + … + XN) and means
    the same thing as the sum of Xi for i = 1 to N.
  • Other common summations (see the sketch below):
  • The sum of the squared values of X: ΣX²
  • The square of the sum of X: (ΣX)²
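
The distinction between the sum of the squared
values and the square of the sum is easy to mix up.
A minimal Python sketch (not from the original
slides) using the pretest scores above:

```python
# Illustrating the summation notation with the pretest scores 8, 7, 5, 10.
X = [8, 7, 5, 10]

sum_x = sum(X)                       # sum of X: 8 + 7 + 5 + 10 = 30
sum_x_sq = sum(x ** 2 for x in X)    # sum of squared values: 64 + 49 + 25 + 100 = 238
sq_of_sum = sum(X) ** 2              # square of the sum: 30 ** 2 = 900

print(sum_x, sum_x_sq, sq_of_sum)    # 30 238 900
```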

10
  • There are three primary measures of central
    tendency (a short sketch in Python follows this
    list):
  • Mode (Mo): The most frequently occurring data
    value.
  • Median (Med): When the data are rank ordered, the
    middle value (or the average of the two middle
    values when there is an even number of
    observations). The median, therefore, represents
    the 50th percentile of the data values.
  • Mean (also written X̄, or "X-bar"): The
    arithmetic average, obtained by adding all data
    values and dividing by the number of
    observations: X̄ = ΣX / N.
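
As a quick illustration, a sketch using Python's
standard library on a small, made-up data set:

```python
# Mode, median, and mean on hypothetical data.
from statistics import mean, median, mode

X = [8, 7, 5, 10, 8]   # 8 occurs twice, so it is the mode

print(mode(X))     # 8    (most frequent value)
print(median(X))   # 8    (middle value of the sorted data: 5, 7, 8, 8, 10)
print(mean(X))     # 7.6  (sum of values divided by N: 38 / 5)
```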

11
Mean, Med, Mo
The mean, median, and mode are equal ONLY when the
distribution is symmetrical and unimodal (e.g.,
normal). When the distribution is skewed and
unimodal, the mode will be the "hump" in the
distribution. The mean will be pulled out toward
the tail of the skew. The median will likely fall
between the other two values.
(Diagram: a skewed distribution with Mo at the
peak, Med between, and the Mean pulled toward the
tail.)
12
  • Another characteristic of a distribution that we
    may wish to summarize is its dispersion or spread
    on the underlying continuum.
  • For example, in the plot below, the blue and red
    distributions have the same measure of central
    tendency, but the red one is more widely
    dispersed (wider and flatter) along the X-axis.
    Another way to say this is that the data are more
    spread out about the measure of central tendency.

13
  • Some common measures of variability:
  • Range: The difference between the two most
    extreme data points (maximum − minimum).
  • Variance (s²X, read "s-squared sub X"): The
    average squared deviation of scores from the
    mean: s²X = Σ(Xi − X̄)² / N. The numerator of the
    equation is commonly referred to as the sum of
    squares (SS). The reason we square the deviations
    from the mean is to eliminate negative numbers
    and to avoid the strong possibility of their
    summing to zero.

14
Measures of Variability
  • Standard Deviation (sX, read "s sub X"): Roughly,
    the typical deviation of scores from the mean;
    equivalently, the square root of the variance:
    sX = √(s²X). A sketch of both computations
    follows.
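
A minimal Python sketch of both computations,
dividing by N as in the definitions above (many
texts divide by N − 1 for sample estimates):

```python
# Variance and standard deviation of the pretest scores.
from math import sqrt

X = [8, 7, 5, 10]
N = len(X)
x_bar = sum(X) / N                      # mean: 30 / 4 = 7.5

SS = sum((x - x_bar) ** 2 for x in X)   # sum of squares: 13.0
variance = SS / N                       # average squared deviation: 3.25
sd = sqrt(variance)                     # standard deviation: about 1.803

print(x_bar, SS, variance, round(sd, 3))
```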

15
Reliability
  • Reliability refers to the degree to which
    instrument scores for a group of participants are
    consistent over repeated applications of a
    measurement procedure and are, therefore,
    dependable and repeatable.
  • The definition of "repeated applications" is
    situation dependent.
  • Reliability is one of the most fundamental
    requirements for measurement: if the measures are
    not reliable, then it is difficult to support
    claims that the measures can be valid for any
    particular decision.

16
True Score: A theoretical score for a person on an
instrument that is equal to the average score for
that person over an infinitely large number of
retakes. We estimate this value using the person's
score on a single administration of the instrument.
Error: The degree to which an observed score (X)
varies from the person's theoretical true score
(T). Error is designated E.
X = T + E
In measurement, reliability refers to the degree to
which scores are free of measurement errors for a
particular group, if we assume the relationship of
observed and true scores is as depicted above.
17
Standard Error of Measurement
  • The standard error of measurement (SEM) is an
    estimate of the amount of error present in a
    student's score.
  • If X = T + E, the SEM serves as a general
    estimate of the E portion of the equation.
  • There is an inverse relationship between the SEM
    and reliability. Tests with higher reliability
    have smaller SEMs relative to the standard
    deviation of the test score (a standard formula
    follows).
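
The slides do not reproduce the formula, but the
standard classical test theory expression ties the
SEM to the statistics already introduced, where
s_X is the standard deviation of observed scores
and r_XX' is the reliability coefficient:

```latex
\mathrm{SEM} = s_X \sqrt{1 - r_{XX'}}
```

As r_XX' approaches 1, the SEM shrinks toward zero,
which is exactly the inverse relationship described
above.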

18
More on the Standard Error of Measurement
  • The smaller the SEM for a test (and, therefore,
    the higher the reliability), the greater one can
    depend on the ordering of scores to represent
    stable differences between students.
  • The more confident you can be in the observed
    score, X, being an accurate estimate of the
    student's true score, T.
  • The converse also holds. That is, the larger the
    SEM, the more error is likely present in the test
    scores for each student.  Therefore, the less
    confident one should be about the stability of
    the ordering of students on the basis of their
    test scores.
  • What this usually means is that the spread of
    scores among the students is due more to error in
    measuring their knowledge about the course
    content than to what the students actually know
    or have learned. 

19
Reliability Coefficients (in general)
  • Reliability coefficients are indicators that
    reflect the degree to which scores are free of
    measurement error.
  • The indicator is at times represented by a
    correlation coefficient, and it reflects the
    ratio of the variance of individual differences
    to observed score variance for a particular
    examinee population.
  • The variance of individual differences is a
    latent, unobserved quantity.
  • We rely on estimates of these variance components
    obtained from the observed data.

20
More on Reliability
  • The conditions under which the coefficient is
    estimated may involve variations in
  • instrument forms
  • measurement occasions
  • raters
  • items
  • Taken together, the above suggest that
    reliability can be thought of as the degree to
    which individuals are rank ordered in a
    consistent way across measurement contexts.

21
Reliability Coefficients (examples)
  • The most commonly encountered reliability
    coefficient is one which measures internal
    consistency.
  • An internal consistency coefficient is an index
    of the reliability of instrument scores derived
    from the statistical interrelationships of
    responses among item responses or scores on
    separate parts of an instrument.

22
Internal Consistency
  • An important assumption underlying measures of
    internal consistency is the comparability of
    items.
  • Each item is assumed to be an appropriate measure
    of the construct believed to be captured by the
    entire instrument.
  • To compute these coefficients you must have
    item-level data (i.e., the participants' actual
    responses to all items).

23
Coefficient Alpha
  • Coefficient alpha is an internal consistency
    reliability coefficient based on the number of
    parts into which the instrument is partitioned,
    the interrelationships of the parts, and the
    total instrument score variance. Also known as
    Cronbach's alpha, or KR-20 (for dichotomous
    items). The standard formula is shown below.
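
The formula image is not preserved in the
transcript; the standard expression, for an
instrument partitioned into k parts with part
variances s_i² and total score variance s_X², is:

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^2}{s_X^2}\right)
```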

24
Other reliability coefficients
  • Questions concerning the stability of scores
    across time points (test-retest), across forms of
    a test (alternate forms), or across raters or
    judges are also questions of reliability.
  • To return to the introduction to reliability, the
    above point is why the meaning of "repeated
    applications" in the definition of reliability is
    situation dependent.

25
Reliability Coefficients Continued
  • While internal consistency measures are extremely
    useful when the focus is on one test (or one form
    of a test), there are other questions that can be
    answered when other data are available.
  • When our question of reliability is not focused
    on the internal consistency of a single
    instrument, the most common indicator of
    reliability is a correlation between two
    instruments.

26
Correlations
  • Correlations are analyses familiar to many
    people, even if you've never actually computed
    one.
  • The correlation is simple to compute provided you
    have the basic pieces of statistical information
    introduced earlier (the mean and the variance).
  • We'll also show you how to do it with raw scores
    on the instruments.

27
Information about Correlations
  • The correlation (r) between any two variables is
    a measure of association.
  • Strength of the relationship and direction of the
    relationship (+/−) are two important pieces of
    information available from a single coefficient.
  • In the context of reliability analysis, we expect
    strong positive relationships.
  • Correlations are limited in range:
  • −1 ≤ r ≤ +1

28
Computing a correlation coefficient when means
and variances are available
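
The formula image is missing from the transcript;
the usual definitional form, built from the
deviations, means, and standard deviations
introduced earlier, is:

```latex
r_{XY} = \frac{\frac{1}{N}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{s_X \, s_Y}
```

The numerator is the covariance of X and Y;
dividing by both standard deviations rescales it to
the −1 to +1 range.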
29
Don't have summary statistics? Not a problem!
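
The slide's formula is likewise missing; the
standard raw-score (computational) form, which
needs only the sums of scores, squared scores, and
cross-products, is:

```latex
r_{XY} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}
```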
30
Computational Example
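
The original worked example did not survive the
transcript, so the following Python sketch
substitutes made-up scores on two forms and applies
the raw-score formula above:

```python
# Hypothetical scores for five examinees on two instrument forms.
from math import sqrt

X = [8, 7, 5, 10, 6]   # form 1 (made up)
Y = [9, 6, 4, 10, 7]   # form 2 (made up)
N = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))

r = (N * sum_xy - sum_x * sum_y) / sqrt(
    (N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2)
)
print(round(r, 3))   # 0.915 for this made-up data: a strong positive relationship
```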
31
Reliability Coefficients: To what degree do
measures agree across contexts? (Scatterplots shown
for r = .00, r = .50, and r = .90.)
32
Reliability Coefficients
  • The Spearman-Brown formula is a formula derived
    within true score test theory that projects the
    reliability of a shortened or lengthened
    instrument from the reliability of an instrument
    of a specified length.
  • M represents the length of the new form relative
    to the length of the old form (e.g., if M = 2,
    the new instrument is twice as long as the old;
    if M = 1/2, the new instrument is half as long as
    the old instrument, etc.).
  • In other words, it is used for projecting the
    reliability you would expect if you were to
    change the instrument's length (see the formula
    below).
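
The formula image is not in the transcript; the
standard Spearman-Brown projection, with M as
defined above and r_old the reliability of the
original-length instrument, is:

```latex
r_{\text{new}} = \frac{M \, r_{\text{old}}}{1 + (M - 1)\, r_{\text{old}}}
```

For example, doubling an instrument (M = 2) with
reliability .60 projects a reliability of
(2 × .60) / (1 + .60) = .75.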

33
Split-Halves Estimates
  • The split-halves reliability coefficient is an
    internal consistency coefficient obtained by
    using half of the items on the instrument to
    yield one score and the other half of the items
    to yield a second, independent score. The
    correlation between the two halves, adjusted via
    the Spearman-Brown formula by replacing M with 2,
    provides an estimate of the alternate-form
    reliability of the total instrument (a sketch of
    the procedure follows).
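
A sketch of the procedure on made-up dichotomous
data; the odd/even split used here is one common
choice, not necessarily the one the presenters used:

```python
# Split-halves reliability: correlate the two half-test scores,
# then step the correlation up with Spearman-Brown (M = 2).
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 0/1 responses: rows = examinees, columns = six items.
responses = [
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
]

odd_half = [sum(row[0::2]) for row in responses]    # items 1, 3, 5
even_half = [sum(row[1::2]) for row in responses]   # items 2, 4, 6

r_half = pearson_r(odd_half, even_half)
reliability = (2 * r_half) / (1 + r_half)           # Spearman-Brown, M = 2
print(round(reliability, 3))                        # about 0.93 here
```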

34
Standards for Reliability
  • While there is no general rule for interpreting
    reliability coefficients, it is commonly agreed
    that the more individualized a decision based on
    a measurement is, and the higher the stakes
    involved in that decision, the higher the
    reliability needs to be.
  • Generally, you could get away with group-based
    decisions in a research setting with
    reliabilities as low as .50.
  • If you are making high-stakes decisions about
    individuals, you need reliabilities above .80 and
    preferably in the .90s.

35
Basic Item Analysis
  • In the context of true score test theory we are
    still dealing with observed scores tied to the
    metric of the test.
  • As long as measurement error is minimized, we can
    be relatively sure that higher-scoring students
    are likely to be more able.
  • Now that you have your basic arsenal of
    statistics, there are a couple of relatively
    simple item analyses you can perform.
  • The two most commonly referenced item statistics
    are the item point-biserial correlation and the
    p-value.
  • Item analysis is especially important when items
    are in their infancy (field testing).

36
P-values
  • P-values are sometimes referred to as item
    difficulty estimates.
  • The item p-value is the average item score across
    all examinees.
  • Add up the item scores, ΣXi, and divide by the
    number of cases you used for the sum, n, to
    obtain the average: p = ΣXi / n (see the sketch
    after this list).
  • When items are scored dichotomously (e.g., 1 is
    correct, 0 is incorrect), p-values range from 0
    to 1.
  • Smaller values indicate more difficult items
    (i.e., fewer people responded correctly).
  • Referring to the p-value as a measure of item
    difficulty is sometimes counterintuitive because
    higher values are indicative of easier items.
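
A minimal sketch of the computation on hypothetical
responses to a single dichotomous item:

```python
# Item p-value: the average item score across examinees.
item_scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # made-up 0/1 responses

p_value = sum(item_scores) / len(item_scores)
print(p_value)   # 0.7 -- a relatively easy item for this group
```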

37
P-values
  • P-values are sample-dependent statistics, so they
    may change from sample to sample.
  • This is why MDE always has a planned sampling
    strategy when field testing items.
  • The amount of information in the statistics alone
    is limited.

38
Using the p-value to inform
  • However, when knowledge of item content and the
    intended difficulty are combined, the p-value can
    help us decide if the item is behaving as
    planned.
  • Perhaps in a math test we expect pre-algebra
    items to be more difficult than items testing
    arithmetic operations. We could calculate the
    p-values for each of the item types and confirm
    this.
  • Usually items are designed to vary in difficulty
    within a given content domain.
  • Within a content domain, items are written such
    that a student of low ability can still answer
    some of the conceptually more difficult items.
  • Conversely, we always try to create items across
    all content domains which will challenge even the
    most able examinee.

39
Point-biserial Correlations
  • Perhaps you are satisfied with the p-values
    obtained in your analysis; maybe some of the
    items were surprisingly easy or difficult. You
    should at the very least take a look at one more
    piece of evidence.
  • It is always a good idea to examine whether or
    not the items appear to be measuring the same
    thing as the total score.
  • You should try to confirm that there is a
    relationship between a person's performance on an
    item and their performance on the instrument as a
    whole.

40
Point-biserial Correlations
  • The computation of a point-biserial correlation
    coefficient is carried out in the same manner as
    the previous example of a correlation
    coefficient.
  • Somewhat tedious because it must be done one item
    at a time: each item score needs to be correlated
    with the total score with the item in question
    removed (a sketch follows this list).
  • Obviously, a negative correlation is not desired.
  • A typical rule of thumb is to scrutinize items
    with point-biserial correlations less than .30.
  • Additionally, if an item had an unexpected
    p-value, you will often see a counterintuitive
    point-biserial correlation as well.
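
A sketch of the corrected item-total computation on
made-up data: each item is correlated with the
total over the remaining items, as described above.

```python
# Corrected item-total (point-biserial) correlations for a short test.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 0/1 responses: rows = examinees, columns = four items.
responses = [
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
]

for item in range(len(responses[0])):
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]  # item removed
    r_pb = pearson_r(item_scores, rest_scores)
    print(f"item {item + 1}: corrected point-biserial = {r_pb:.2f}")
```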

41
Synthesis
  • If you combine the information obtained in this
    session you can get a fairly good indication of
    preliminary instrument functioning without the
    use of highly sophisticated statistics.
  • When our instrument is reliable, we are confident
    in the instrument's ability to correctly rank
    order our examinees.
  • After computing your reliability you can also dig
    deeper and examine item quality.

42
Synthesis
  • The item quality indicators can often help in the
    diagnosis of lower than desired reliability
    coefficients (or a high SEM).
  • Items with low point-biserial correlations will
    yield low reliability for the instrument as a
    whole.
  • Instruments (or alternate forms) with different
    p-value and point-biserial distributions will
    tend to yield low test-retest or alternate-form
    reliability coefficients.

43
Contact Information
  • Steven Viger
  • Michigan Department of Education
    608 W. Allegan St.
    Lansing, MI 48909
  • Office: (517) 241-2334
  • Fax: (517) 335-1186
  • VigerS@Michigan.gov

44
Contact Information
  • Joseph Martineau
  • Michigan Department of Education
    608 W. Allegan St.
    Lansing, MI 48909
  • Office: (517) 241-4710
  • Fax: (517) 335-1186
  • MartineauJ@Michigan.gov