Title: Old Faithful: Reliability in psychometrics
1. Old Faithful: Reliability in psychometrics
2. Reliability
- Review
- What is reliability?
- Types of reliability
- Test-Retest reliability
- Parallel forms reliability
- Internal Consistency measures
- Inter-rater reliability
- Standard error of measurement
3. Review: What is variance?
- Variance (s²) is the average squared difference from the mean.
- It is the square of the standard deviation.
4. Review: What is a coefficient?
- a. A number or quantity placed (usually) before and multiplying another quantity, known or unknown.
- Thus in 4x² + 2ax, 4 is the coefficient of x², 2 of ax, and 2a of x.
- b. A multiplier that measures some property of a particular substance, for which it is constant, while differing for different substances.
- e.g. coefficients of friction, expansion, torsion, etc.
5. What is reliability?
- When we say a car or our best friend is reliable, what do we mean?
6. What is reliability?
- The reliability of a test is the extent to which it can be relied upon to produce true scores: a measure of its consistency (vs. variability) across occasions or with different sets of equivalent items.
- Reliability = True variance / Observed variance
- = S²_true / S²_observed
7. What is reliability?
- Reliability = True variance / Observed variance = S²_true / S²_observed
- It would be great (and we could all go home now) if we knew S²_true.
- Alas, we can only measure S²_observed, and we can never know S²_true.
- Hence we can never measure reliability...
- ...so can we go home now?
8. No.
- We may not know, but that doesn't mean we can't estimate, and quantify the uncertainty in our estimate. (After all, we're psychometricians, not logicians! What is certainty to the likes of us?)
- As chance (literally) would have it, there are systematic ways to estimate reliability, which enable us to state in fairly precise terms how uncertain we are about S²_true.
9. Classical reliability theory
- Test reliability = S²_true / S²_observed
- This ratio can never be greater than 1. Why?
- This ratio will usually be quite a bit lower than 1 because...
10. Because...
- Test reliability = S²_true / S²_observed
- Why isn't it the case that what we see is what we get? That is, why isn't S²_observed the true variance, or (to ask the same question) why doesn't S²_true = S²_observed?
- It is because we can't measure anything without error.
- Observed variance in scores (S²_observed) includes an error (unsystematic) component:
- S²_observed = S²_true + S²_error
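To see why the variances add, here is a minimal simulation sketch in Python (not from the slides; all numbers are made up) in which independent random error is added to known true scores:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000                     # simulated test-takers
true = rng.normal(100, 15, n)   # hypothetical true scores
error = rng.normal(0, 5, n)     # random, unsystematic error
observed = true + error         # each observed score = true + error

# Because true scores and errors are independent, the variances add:
print(observed.var())                # ~ 15**2 + 5**2 = 250
print(true.var() + error.var())      # essentially the same number
print(true.var() / observed.var())   # reliability ~ 225/250 = 0.9
```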
11. What is error?
- S²_observed = S²_true + S²_error
- Error is the amount of uncertainty you have in a measurement.
12. How do we know there is error?
- We see evidence of error everywhere that r < 1 when it might be 1:
- Scores on tests given more than once don't correlate at 1.0.
- Subsets of questions on tests that claim to measure the same thing don't correlate at 1.0.
- In other words, measuring constructs is not like counting apples.
13. Where does error come from?
- Error comes from multiple independent sources: mistakes by the subject or the scorer, differences in administration conditions, differences of opinion or practice in how to score, chance differences in sampling the question space, individual differences in understanding, random quantum events, hormonal fluctuations, hangovers, illness, shifting attention, etc.
- The main point is not that a source of error could not be controlled for, but that it was not controlled for.
- No matter what you do control for, there will be some things you didn't control for: error is inevitable.
14. What can we do about error?
- This idea of multiple independent sources adding together should immediately bring to mind our remarkable discovery from earlier: randomness breeds order!
- Precisely because error is guaranteed (by definition) to be random, it is guaranteed (by the mathematical subtlety of our wonderful world, namely the central limit theorem) to have a comprehensible structure: error is guaranteed to be normally distributed.
15. Why is this so thrilling?
- Because error is normally distributed, we can quantify (standardize) it in the same ways we can quantify any normally distributed measure.
- In particular, we can give the average and standard deviation of any error measure.
- As you know, we can therefore compute the probability of any given error, and so we can put confidence bounds on any measure.
- e.g. There is a 68% chance that the true score falls within 1 SD_error, and about a 95% chance that it falls within two SDs_error, of the obtained score.
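Those percentages can be checked directly from the standard normal distribution; this small sketch uses only the Python standard library:

```python
from math import erf, sqrt

def prob_within(z: float) -> float:
    """P(|Z| < z) for a standard normal variable Z."""
    return erf(z / sqrt(2))

print(prob_within(1.0))   # ~0.683: ~68% of errors fall within 1 SD
print(prob_within(1.96))  # ~0.950: 95% fall within 1.96 SDs
print(prob_within(2.0))   # ~0.954: "about 95%" within 2 SDs
```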
16. Standard error of measurement
- Reliability allows us to estimate measurement error in standardized units:
- S_err = S(1 - r)^0.5
- S = the observed population SD of test scores
- r = a reliability coefficient (which we are about to discuss)
- S_err estimates the SD of the scores a person would obtain if they took the test infinitely many times; it is the SD of error for an individual.
17. Standard error of measurement
- S_err = S(1 - r)^0.5
- What are the properties of S_err (that is, of error size)?
- Inverse relation between reliability and error: the lower the reliability r, the higher the error.
- Direct relation between error and S_observed: the wider the distribution of scores you have, the larger the error on any individual score is going to be.
- We can't usually do much about the latter, so we want to maximize the former (reliability) as much as possible (we will discuss how at the end of this class).
18. Example
- IQ tests have a mean of 100 and an SD of 15. One test has a reliability of 0.89. You score 130 on that IQ test. What is the standard error of measurement? What is the 95% (± 1.96 SD) confidence interval on your IQ?
- S_err = S(1 - r)^0.5
- = 15 × sqrt(1 - 0.89)
- = 4.97
- We can be 95% sure that your true IQ is between 120.2 and 139.8.
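The same arithmetic as a short Python sketch (all values taken from the example above):

```python
from math import sqrt

S, r = 15.0, 0.89          # test SD and reliability, from the example
score = 130.0              # your obtained score

s_err = S * sqrt(1 - r)    # standard error of measurement
lo, hi = score - 1.96 * s_err, score + 1.96 * s_err

print(round(s_err, 2))                # 4.97
print(round(lo, 1), round(hi, 1))     # 120.2 139.8
```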
19. Standard error of the mean
- Be careful not to confuse the standard error of measurement with the standard error of the mean!
- The standard error of the mean gives us the standard deviation around a mean after repeated sampling, e.g. it tells us how certain we can be about the average.
- This is S / N^0.5.
- Many computer programs will report it, and some of you will give it to me as if it were the standard error of measurement, making me sad and less able to carry on my previously stellar teaching work, and possibly driving me first to strong drink and then to an early death. Don't let this be on your hands!
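To make the contrast concrete, here is a short Python sketch (the sample and the reliability r = 0.89 are assumed for illustration) computing both quantities side by side:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(100, 15, 400)   # one sample of 400 test scores

S = scores.std(ddof=1)
r = 0.89                            # assumed reliability coefficient

se_mean = S / np.sqrt(len(scores))     # uncertainty about the group MEAN
se_measurement = S * np.sqrt(1 - r)    # uncertainty about ONE person's score

print(se_mean)          # shrinks as N grows
print(se_measurement)   # does not depend on N at all
```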
20. S_err = S(1 - r)^0.5 ... so where does that r come from?
- Where can we get our reliability coefficient?
- Earlier we said: we see evidence of error everywhere that r < 1 when it might be 1:
- Scores on tests given more than once don't correlate at 1.0.
- Subsets of questions on tests that claim to measure the same thing don't correlate at 1.0.
- Recall that test reliability r = S²_true / S²_observed, where S²_observed = S²_true + S²_error.
- If there is S²_error, there is S²_true (unknown) and S²_observed too: we can estimate reliability wherever we can find variation relevant to our test.
21. Test-retest reliability
- Correlate scores of the same people across two different administrations.
- The r is called the test-retest coefficient, or coefficient of stability.
- There is no variance due to item differences or conditions of administration.
- Shorter inter-test intervals give larger r.
22. Parallel-forms reliability
- One factor that does affect test-retest reliability is individual differences in memory.
- The solution is to give two or more forms of the test, to get a parallel-forms coefficient, or coefficient of equivalence.
23. Parallel-forms reliability
- How can we deal with error from two sources: error due to different test times and error due to different forms?
- Use Form A with half the sample and Form B with the other half at T1, then switch at T2.
- The correlation between scores on the two forms is the coefficient of stability and equivalence, taking into account error due both to time of administration and to different test items on the two forms.
24. Internal consistency: Split-half method
- We can treat a single form as two forms: split it into two arbitrary halves and correlate scores on each half (split-half reliability).
- To get the reliability of the test as a whole (assuming equal means and variances), use the Spearman-Brown prophecy formula:
- r_whole = 2 r_half / (1 + r_half)
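A short Python sketch of the whole procedure on simulated (hypothetical) item responses: split the items into two arbitrary halves, correlate the half scores, then step up with Spearman-Brown:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical item-response matrix: 200 people x 10 items
ability = rng.normal(0, 1, (200, 1))
items = ability + rng.normal(0, 1, (200, 10))

# Arbitrary split: odd-numbered vs. even-numbered items
half_a = items[:, 0::2].sum(axis=1)
half_b = items[:, 1::2].sum(axis=1)

r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown correction: reliability of the full-length test
r_whole = 2 * r_half / (1 + r_half)
print(round(r_half, 3), round(r_whole, 3))
```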
25. Ramping up the split-half method
- The split-half method takes arbitrary halves.
- However, different arbitrary halves might give different r values. How can we know which half to take?
- Wherever we have uncertainty: measure many times and average. Maybe we can take several arbitrary halves and average their correlations?
- A better method might be to take all possible split halves and average their values.
- Luckily, there is a (fairly) easy way to do this...
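One brute-force way to do it: enumerate every possible half, apply Spearman-Brown to each split, and average. This sketch (hypothetical data, kept to 6 items so the number of splits stays small) closely tracks the alpha coefficient introduced next; the exact identity uses a slightly different split-half formula.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
ability = rng.normal(0, 1, (200, 1))
items = ability + rng.normal(0, 1, (200, 6))   # 6 items -> only 20 splits

k = items.shape[1]
r_values = []
for half in combinations(range(k), k // 2):
    other = [i for i in range(k) if i not in half]
    a = items[:, list(half)].sum(axis=1)
    b = items[:, other].sum(axis=1)
    r_half = np.corrcoef(a, b)[0, 1]
    r_values.append(2 * r_half / (1 + r_half))   # Spearman-Brown per split

# Each split appears twice (as A/B and B/A), which doesn't change the mean
print(round(float(np.mean(r_values)), 3))
```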
26. Internal consistency: Cronbach's alpha
- Cronbach's (1951) alpha (also sometimes called coefficient alpha) is a widely used and widely reported measure of the extent to which item responses obtained at the same time correlate highly with each other.
- Note: this is not the same as being a measure of unidimensionality, though it is sometimes reported as such.
- You can get a high alpha coefficient with distinct, but highly intercorrelated, dimensions in a test.
- Cronbach's alpha is mathematically equivalent to taking an average of all split halves, and is the best measure of internal consistency there is.
27. Cronbach's alpha
- alpha = (k / (k - 1)) × (1 - Σs²_i / s²_total)
- k = the number of items
- s²_i = the variance of scores for item i
- Σs²_i = the sum of all item variances
- s²_total = the variance of total scores (all items summed)
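A direct Python translation of the formula, run on hypothetical simulated responses:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: a people-by-items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # s²_i for each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(4)
ability = rng.normal(0, 1, (200, 1))
items = ability + rng.normal(0, 1, (200, 10))    # hypothetical data
print(round(cronbach_alpha(items), 3))
```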
28. Inter-rater reliability
- On tests requiring evaluative judgments (projective tests, personality ratings), different scorers may give different scores.
- Inter-rater reliability is the correlation between their scores.
- Generalized, you get an intraclass coefficient (or coefficient of concordance) as the average correlation between many raters.
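In the simplest form matching the description above, we can average the pairwise correlations among raters; note that a full intraclass coefficient is usually estimated from an ANOVA decomposition instead. The ratings below are hypothetical:

```python
import numpy as np
from itertools import combinations

# Hypothetical ratings: 3 raters scoring the same 6 essays
ratings = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
    [4, 4, 5],
])

pairs = combinations(range(ratings.shape[1]), 2)
rs = [np.corrcoef(ratings[:, i], ratings[:, j])[0, 1] for i, j in pairs]
print(round(float(np.mean(rs)), 3))   # average inter-rater correlation
```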
29. How much reliability is enough?
- As usual in this uncertain world, there is no simple answer to this simple question.
- Alphas for personality tests (0.46 - 0.96) tend to be lower than alphas for achievement and aptitude tests (0.66 - 0.98).
- If you are comparing group means, modest alphas of 0.6 to 0.7 may be sufficient; 0.8 is better.
- If you want to make claims about differences between single individuals, you need to have more reliable scores: alphas of 0.85 or better.
- If life-changing decisions are to be made, you want reliabilities of 0.9 or 0.95 (but you will have trouble finding them in many fields!).
30. How can we increase reliability?
- Analyze your items.
- Bad items increase error (that's what it means to be a bad item!), and therefore they decrease reliability.
- Increase the number of items.
- Longer tests are generally more reliable (why?).
- This is why error from subtests does not sum to total error: we decrease our error when we have more items.
- Factor analyze.
- Unidimensional tests are more reliable (why?).
- Factor analyze to find out whether you are looking at spurious reliability.
31. How can we increase reliability?
- Distinguish between real variability in what you are measuring and variability due to error (e.g. test-retest correlations may be irrelevant if the thing being measured is changing).
- FYI: you can figure out how many items you need to get a given reliability using a generalization of the Spearman-Brown prophecy formula.
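A sketch of that generalization: solving Spearman-Brown for the lengthening factor n gives n = r_target(1 - r_current) / (r_current(1 - r_target)). The 20-item test and the target values below are assumed for illustration:

```python
from math import ceil

def lengthening_factor(r_current: float, r_target: float) -> float:
    """Generalized Spearman-Brown: how many times longer the test must be."""
    return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

# e.g., a 20-item test with reliability 0.70, aiming for 0.90:
n = lengthening_factor(0.70, 0.90)
print(round(n, 2))      # ~3.86x longer
print(ceil(20 * n))     # ~78 items needed
```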
32. Validity vs. reliability
Image from http://trochim.human.cornell.edu/kb. I highly recommend this site.