1
Old Faithful: Reliability in psychometrics
2
Reliability
  • Review
  • What is reliability?
  • Types of reliability
  • Test-Retest reliability
  • Parallel forms reliability
  • Internal Consistency measures
  • Inter-rater reliability
  • Standard error of measurement

3
Review: What is variance?
  • s² (variance) = the average squared difference
    from the mean.
  • It is the square of the standard deviation.

4
Review: what is a coefficient?
  • a. A number or quantity placed (usually) before
    and multiplying another quantity, known or
    unknown.
  • Thus in 4x² + 2ax, 4 is the coefficient of x², 2
    of ax, and 2a of x.
  • b. A multiplier that measures some property of a
    particular substance, for which it is constant,
    while differing for different substances.
  • e.g. coefficient of friction, expansion,
    torsion, etc.

5
What is reliability?
  • When we say a car or our best friend is
    "reliable", what do we mean?

6
What is reliability?
  • The reliability of a test is the extent to which
    it can be relied upon to produce true scores: a
    measure of its consistency (or variability) across
    occasions or with different sets of equivalent
    items.
  • Reliability = True variance / Observed variance
  • = S²true / S²observed

7
What is reliability?
  • Reliability = True variance / Observed variance
  • = S²true / S²observed
  • It would be great (and we could all go home now)
    if we knew S²true.
  • Alas, we can only measure S²observed, and we can
    never know S²true.
  • Hence we can never measure reliability.
  • ...so can we go home now?

8
No.
  • We may not know, but that doesn't mean we can't
    estimate and quantify the uncertainty in our
    estimate. (After all, we're psychometricians, not
    logicians! What is certainty to the likes of us?)
  • As chance (literally) would have it, there are
    systematic ways to estimate reliability, which
    enable us to state in fairly precise terms how
    uncertain we are about S²true.

9
Classical reliability theory
  • Test reliability = S²true / S²observed
  • This ratio can never be greater than 1. Why?
  • This ratio will usually be quite a bit lower than
    1, because...

10
Because
  • Test reliability = S²true / S²observed
  • Why isn't it the case that what we see is what
    we get? That is, why isn't S²observed the true
    variance, or (to ask the same question) why
    doesn't S²true = S²observed?
  • It is because we can't measure anything without
    error.
  • Observed variance in scores (S²observed)
    includes an error (unsystematic) component:
  • S²observed = S²true + S²error

11
What is error?
  • S²observed = S²true + S²error
  • Error is the amount of uncertainty you have in a
    measurement.

12
How do we know there is error?
  • We see evidence of error wherever r < 1
    when it might be 1:
  • Scores on tests given more than once don't
    correlate at 1.0.
  • Subsets of questions on tests that claim to
    measure the same thing don't correlate at 1.0.
  • In other words, measuring constructs is not like
    counting apples.

13
Where does error come from?
  • Error comes from multiple independent sources:
    mistakes by the subject or the scorer,
    differences in administration conditions,
    differences of opinion or practice in how to
    score, chance differences in sampling the
    question space, individual differences in
    understanding, random quantum events, hormonal
    fluctuations, hangovers, illness, shifting
    attention, etc.
  • The main point is not that a source of error
    could not be controlled for, but that it was not
    controlled for.
  • No matter what you do control for, there will be
    some things you didn't control for: error is
    inevitable.

14
What can we do about error?
  • This idea of multiple independent sources
    meeting should immediately bring to mind our
    remarkable discovery from earlier: randomness
    breeds order!
  • Precisely because error is guaranteed (by
    definition) to be random, it is guaranteed (by
    the mathematical subtlety of our wonderful world,
    i.e. the central limit theorem) to have a
    comprehensible structure: error is guaranteed to
    be normally distributed.

15
Why is this so thrilling?
  • Because error is normally distributed, we can
    quantify (standardize) it in the same ways we can
    quantify any normally distributed measure.
  • In particular, we can give the average and
    standard deviation of any error measure.
  • As you know, therefore, we can compute the
    probability of any given error, and so we can
    quantify confidence bounds on any measure.
  • e.g. there is a 66% chance that a true score
    falls within 1 SDerror of the obtained score, and
    a 95% chance that it falls within two SDserror.

16
Standard error of measurement
  • Reliability allows us to estimate measurement
    error in standardized units.
  • Serr = S(1 - r)^0.5
  • S = the observed population SD of test scores
  • Serr estimates the SD of the scores a person would
    obtain if he took the test infinitely many times.
  • r = a reliability coefficient (which we are about
    to discuss)
  • Serr is the SD of error for an individual.
17
Standard error of measurement
  • Serr = S(1 - r)^0.5
  • What are the properties of Serr (that is, of
    error size)?
  • Inverse relation between reliability and error:
    the lower the reliability r, the higher the error.
  • Direct relation between error and Sobserved: the
    wider the distribution of scores you have, the
    larger the error on any individual score is going
    to be.
  • We can't usually do much about the latter, so we
    want to maximize the former (reliability) as much
    as possible (we will discuss how at the end of
    this class).

18
Example
  • IQ tests have a mean of 100 and an SD of 15. One
    test has a reliability of 0.89. You score 130 on
    that IQ test. What is the standard error of
    measurement? What is the 95% (1.96 SD) confidence
    interval on your IQ? (A short sketch of the
    calculation follows below.)
  • Serr = S(1 - r)^0.5
  • = 15 × sqrt(1 - 0.89)
  • = 4.97
  • We can be 95% sure that your true IQ is between
    120.3 and 139.7.
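A minimal Python sketch of the calculation above, using only the figures
given on this slide (1.96 is the standard 95% normal multiplier):

    import math

    S = 15.0       # observed SD of IQ scores
    r = 0.89       # reliability of the test
    score = 130.0  # obtained score

    s_err = S * math.sqrt(1 - r)   # standard error of measurement, about 4.97
    lower = score - 1.96 * s_err   # 95% confidence bounds on the true score
    upper = score + 1.96 * s_err
    print(f"Serr = {s_err:.2f}, 95% CI = [{lower:.1f}, {upper:.1f}]")
    # Serr = 4.97; the interval matches the slide's 120.3 - 139.7 up to rounding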

19
Standard Error of the mean
  • Be careful not to confuse the standard error of
    measurement with the standard error of the mean!
  • The standard error of the mean gives us the
    standard deviation around a mean after repeated
    sampling, e.g. it tells us how certain we can be
    about the average.
  • This is (S² / N)^0.5, i.e. S / N^0.5.
  • Many computer programs will report it, and some of
    you will give it to me as if it were the standard
    error of measurement, making me sad and less able
    to carry on my previously stellar teaching work,
    and possibly driving me first to strong drink and
    then to an early death. Don't let this be on your
    hands! (A small contrasting sketch follows below.)
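For contrast, a minimal sketch of the standard error of the mean on a
hypothetical sample (the scores below are invented for illustration):

    import math
    import statistics

    scores = [98, 105, 110, 93, 120, 101, 99, 111]  # hypothetical sample of scores
    sd = statistics.stdev(scores)                    # sample standard deviation
    se_mean = sd / math.sqrt(len(scores))            # uncertainty about the MEAN,
                                                     # not about any one person's score
    print(f"mean = {statistics.mean(scores):.1f}, SE of the mean = {se_mean:.2f}")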

20
Serr = S(1 - r)^0.5: so where does that r come
from?
  • Where can we get our reliability coefficient?
  • Earlier we said: we see evidence of error
    wherever r < 1 when it might be 1.
  • Scores on tests given more than once don't
    correlate at 1.0.
  • Subsets of questions on tests that claim to
    measure the same thing don't correlate at 1.0.
  • Recall that test reliability r = S²true /
    S²observed = S²true / (S²true + S²error).
  • If there is S²error, there is S²true (unknown)
    and S²observed too; we can estimate reliability
    wherever we can find variation relevant to our
    test.

21
Test-retest Reliability
  • Correlate the scores of the same people across two
    different administrations of the test (a sketch
    follows below).
  • The r is called the test-retest coefficient or
    coefficient of stability.
  • There is no variance due to item differences or
    conditions of administration.
  • Shorter inter-test intervals give larger r.
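A minimal sketch of a test-retest coefficient in Python (the scores are
hypothetical; statistics.correlation requires Python 3.10+):

    import statistics

    time1 = [12, 15, 9, 20, 17, 11, 14, 18]   # first administration
    time2 = [13, 14, 10, 19, 18, 10, 15, 17]  # same people, second administration

    r_test_retest = statistics.correlation(time1, time2)  # Pearson r
    print(f"test-retest coefficient = {r_test_retest:.2f}")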

22
Parallel-forms reliability
  • One factor that does impact test-retest
    reliability is individual differences in memory.
  • The solution is to give two or more forms of the
    test, to get a parallel-forms coefficient or
    coefficient of equivalence.

23
Parallel-forms reliability
  • How can we deal with error from two sources:
    error due to different test times and error due
    to different forms?
  • Use Form A with half the sample and Form B with
    the other half at T1; then switch at T2.
  • The correlation between scores on both forms is
    the coefficient of stability and equivalence,
    taking into account error due to both time of
    administration and different test items on the
    two forms.

24
Internal consistency Split-half method
  • We can treat a single form as two forms: split it
    into two arbitrary halves and correlate scores on
    each half (split-half reliability).
  • To get the reliability of the test as a whole
    (assuming equal means and variances), use the
    Spearman-Brown prophecy formula (sketched below):
  • rwhole = 2rhalf / (1 + rhalf)
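A minimal sketch of split-half reliability stepped up with Spearman-Brown
(the half-test totals are hypothetical; Python 3.10+ for statistics.correlation):

    import statistics

    odd_half = [8, 11, 6, 14, 12, 7, 10, 13]   # totals on odd-numbered items
    even_half = [9, 10, 7, 13, 13, 8, 11, 12]  # totals on even-numbered items

    r_half = statistics.correlation(odd_half, even_half)
    r_whole = 2 * r_half / (1 + r_half)        # Spearman-Brown step-up
    print(f"r_half = {r_half:.2f}, whole-test reliability = {r_whole:.2f}")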

25
Ramping up the split-half method
  • The split-half method takes arbitrary halves.
  • However, different arbitrary halves might give
    different r values. How can we know which half to
    take?
  • Wherever we have uncertainty: measure many times
    and average. Maybe we can take several arbitrary
    halves and average their correlations?
  • A better method might be to take all possible
    split halves, and average their values.
  • Luckily, there is a (fairly) easy way to do
    this...

26
Internal consistency Cronbachs alpha
  • Cronbach's (1951) alpha (also sometimes called
    coefficient alpha) is a widely used and widely
    reported measure of the extent to which item
    responses obtained at the same time correlate
    highly with each other.
  • Note: this is not the same as being a measure of
    unidimensionality, though it is sometimes
    reported as being so.
  • You can get a high alpha coefficient with
    distinct, but highly intercorrelated, dimensions
    in a test.
  • Cronbach's alpha is mathematically equivalent to
    taking an average of all split halves, and is the
    best measure of internal consistency there is.

27
Cronbach's alpha
  • alpha = (k / (k - 1)) × (1 - SUM(s²i) / s²total)
  • k = the number of items
  • s²i = the variance of scores for item i
  • SUM(s²i) = the sum of all item variances
  • s²total = the variance of the total (summed)
    scores over all items (a computational sketch
    follows below).
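A minimal sketch of Cronbach's alpha on a small hypothetical item-response
table (rows are persons, columns are items; the data are invented):

    import statistics

    responses = [
        [3, 4, 3, 5],
        [2, 2, 3, 3],
        [4, 5, 4, 5],
        [1, 2, 2, 2],
        [3, 3, 4, 4],
    ]

    k = len(responses[0])                                  # number of items
    item_vars = [statistics.variance(col) for col in zip(*responses)]
    totals = [sum(row) for row in responses]               # each person's total score
    total_var = statistics.variance(totals)

    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(f"Cronbach's alpha = {alpha:.2f}")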

28
Inter-rater reliability
  • On tests requiring evaluative judgments
    (projective tests, personality ratings),
    different scorers may give different scores.
  • Inter-rater reliability is the correlation
    between their scores.
  • Generalized, you get an intraclass coefficient (or
    coefficient of concordance) as the average
    correlation between many raters.

29
How much reliability is enough?
  • As usual, in this uncertain world there is no
    simple answer to this simple question.
  • Alphas for personality tests (0.46 - 0.96) tend
    to be lower than alphas for achievement and
    aptitude tests (0.66 - 0.98).
  • If you are comparing group means, modest alphas
    of 0.6 to 0.7 may be sufficient; 0.8 is better.
  • If you want to make claims about differences
    between single individuals, you need more
    reliable scores: alphas of 0.85 or better.
  • If life-changing decisions are to be made, you
    want reliabilities of 0.9 or 0.95 (but you will
    have trouble finding them in many fields!).

30
How can we increase reliability?
  • Analyze your items.
  • Bad items increase error (that's what it means to
    be a bad item!), therefore they decrease
    reliability.
  • Increase the number of items.
  • Longer tests are generally more reliable. (Why?)
  • This is why error from subtests does not sum to
    the total error: we decrease our error when we
    have more items.
  • Factor analyze.
  • Unidimensional tests are more reliable. (Why?)
  • Factor analyze to find out whether you are looking
    at spurious reliability.

31
How can we increase reliability?
  • Distinguish between real variability in what you
    are measuring and variability due to error (e.g.
    test-retest correlations may be irrelevant if the
    thing being measured is changing).
  • FYI: you can figure out how many items you need
    to get a given reliability using a generalization
    of Spearman's prophecy formula (sketched below).
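A minimal sketch of that generalization (the general Spearman-Brown formula,
which the slides mention but do not spell out; the 20-item example is
hypothetical):

    def lengthening_factor(r_current: float, r_target: float) -> float:
        """How many times longer the test must be to reach r_target."""
        return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

    # e.g. a 20-item test with reliability 0.70, aiming for 0.90:
    n = lengthening_factor(0.70, 0.90)
    print(f"lengthen by a factor of {n:.1f} -> about {round(20 * n)} items")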

32
Validity vs. reliability
Image from http://trochim.human.cornell.edu/kb (I
highly recommend this site).