1
Old Faithful: Reliability in psychometrics
2
Reliability
  • Review
  • What is reliability?
  • Types of reliability
  • Test-Retest reliability
  • Parallel forms reliability
  • Internal Consistency measures
  • Inter-rater reliability
  • Standard error of measurement

3
Review: What is variance?
  • s² (variance) = the average squared difference
    from the mean.
  • It is the square of the standard deviation.

4
Review: what is a coefficient?
  • a. A number or quantity placed (usually) before
    and multiplying another quantity, known or
    unknown.
  • Thus in 4x² + 2ax, 4 is the coefficient of x², 2
    of ax, and 2a of x.
  • b. A multiplier that measures some property of a
    particular substance, for which it is constant,
    while differing for different substances.
  • e.g. coefficient of friction, expansion,
    torsion, etc.

5
What is reliability?
  • When we say a car or our best friend is
    "reliable", what do we mean?

6
What is reliability?
  • The reliability of a test is the extent to which
    it can be relied upon to produce true scores: a
    measure of its consistency (or variability) across
    occasions or with different sets of equivalent
    items.
  • Reliability = True variance / Observed variance
  • = S²true / S²observed

7
What is reliability?
  • Reliability = True variance / Observed variance
  • = S²true / S²observed
  • It would be great (and we could all go home now)
    if we knew S²true.
  • Alas, we can only measure S²observed, and we can
    never know S²true.
  • Hence we can never measure reliability.
  • ...so can we go home now?

8
No.
  • We may not know, but that doesn't mean we can't
    estimate and quantify the uncertainty in our
    estimate. (After all, we're psychometricians, not
    logicians! What is certainty to the likes of us?)
  • As chance (literally) would have it, there are
    systematic ways to estimate reliability, which
    enable us to state in fairly precise terms how
    uncertain we are about S²true.

9
Classical reliability theory
  • Test reliability = S²true / S²observed
  • This ratio can never be greater than 1. Why?
  • This ratio will usually be quite a bit lower than
    1, because...

10
Because
  • Test reliability = S²true / S²observed
  • Why isn't it the case that what we see is what
    we get? That is, why isn't S²observed the true
    variance, or (to ask the same question) why
    doesn't S²true = S²observed?
  • It is because we can't measure anything without
    error.
  • Observed variance in scores (S²observed)
    includes an error (unsystematic) component:
  • S²observed = S²true + S²error

11
What is error?
  • S²observed = S²true + S²error
  • Error is the amount of uncertainty you have in a
    measurement.

12
How do we know there is error?
  • We see evidence of error wherever r < 1
    when it might be 1:
  • Scores on tests given more than once don't
    correlate at 1.0.
  • Subsets of questions on tests that claim to
    measure the same thing don't correlate at 1.0.
  • In other words, measuring constructs is not like
    counting apples.

13
Where does error come from?
  • Error comes from multiple independent sources:
    mistakes by the subject or the scorer,
    differences in administration conditions,
    differences of opinion or practice in how to
    score, chance differences in sampling the
    question space, individual differences in
    understanding, random quantum events, hormonal
    fluctuations, hangovers, illness, shifting
    attention, etc.
  • The main point is not that a source of error
    could not be controlled for, but that it was not
    controlled for.
  • No matter what you do control for, there will be
    some things you didn't control for: error is
    inevitable.

14
What can we do about error?
  • This idea of multiple independent sources
    meeting should immediately bring to mind our
    remarkable discovery from earlier: randomness
    breeds order!
  • Precisely because error is guaranteed (by
    definition) to be random, it is guaranteed (by
    the mathematical subtlety of our wonderful world,
    i.e. the central limit theorem) to have a
    comprehensible structure: error is guaranteed to
    be normally distributed.

15
Why is this so thrilling?
  • Because error is normally distributed, we can
    quantify (standardize) it in the same ways we can
    quantify any normally distributed measure.
  • In particular, we can give the average and
    standard deviation of any error measure.
  • As you know, therefore, we can compute the
    probability of any given error, and so we can
    quantify confidence bounds on any measure.
  • e.g. there is a 66% chance that a true score
    falls within 1 SDerror of the obtained score, and
    a 95% chance that it falls within two SDserror.

16
Standard error of measurement
  • Reliability allows us to estimate measurement
    error in standardized units.
  • Serr = S(1 - r)^0.5
  • S = the observed population SD of test scores
  • Serr estimates the SD of the scores a person would
    obtain if he took the test infinitely many times.
  • r = a reliability coefficient (which we are about
    to discuss)
  • Serr is the SD of error for an individual.
17
Standard error of measurement
  • Serr = S(1 - r)^0.5
  • What are the properties of Serr (that is, of
    error size)?
  • Inverse relation between reliability and error:
    the lower the reliability r, the higher the error.
  • Direct relation between error and Sobserved: the
    wider the distribution of scores you have, the
    larger the error on any individual score is going
    to be.
  • We can't usually do much about the latter, so we
    want to maximize the former (reliability) as much
    as possible (we will discuss how at the end of
    this class).

18
Example
  • IQ tests have a mean of 100 and an SD of 15. One
    test has a reliability of 0.89. You score 130 on
    that IQ test. What is the standard error of
    measurement? What is the 95% (1.96 SD) confidence
    interval on your IQ? (A short sketch of the
    calculation follows below.)
  • Serr = S(1 - r)^0.5
  • = 15 × sqrt(1 - 0.89)
  • = 4.97
  • We can be 95% sure that your true IQ is between
    120.3 and 139.7.
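A minimal Python sketch of the calculation above, using only the figures
given on this slide (1.96 is the standard 95% normal multiplier):

    import math

    S = 15.0       # observed SD of IQ scores
    r = 0.89       # reliability of the test
    score = 130.0  # obtained score

    s_err = S * math.sqrt(1 - r)   # standard error of measurement, about 4.97
    lower = score - 1.96 * s_err   # 95% confidence bounds on the true score
    upper = score + 1.96 * s_err
    print(f"Serr = {s_err:.2f}, 95% CI = [{lower:.1f}, {upper:.1f}]")
    # Serr = 4.97; the interval matches the slide's 120.3 - 139.7 up to rounding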

19
Standard Error of the mean
  • Be careful not to confuse the standard error of
    measurement with the standard error of the mean!
  • The standard error of the mean gives us the
    standard deviation around a mean after repeated
    sampling, e.g. it tells us how certain we can be
    about the average.
  • This is (S² / N)^0.5, i.e. S / N^0.5.
  • Many computer programs will report it, and some of
    you will give it to me as if it were the standard
    error of measurement, making me sad and less able
    to carry on my previously stellar teaching work,
    and possibly driving me first to strong drink and
    then to an early death. Don't let this be on your
    hands! (A small contrasting sketch follows below.)
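For contrast, a minimal sketch of the standard error of the mean on a
hypothetical sample (the scores below are invented for illustration):

    import math
    import statistics

    scores = [98, 105, 110, 93, 120, 101, 99, 111]  # hypothetical sample of scores
    sd = statistics.stdev(scores)                    # sample standard deviation
    se_mean = sd / math.sqrt(len(scores))            # uncertainty about the MEAN,
                                                     # not about any one person's score
    print(f"mean = {statistics.mean(scores):.1f}, SE of the mean = {se_mean:.2f}")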

20
Serr = S(1 - r)^0.5: so where does that r come
from?
  • Where can we get our reliability coefficient?
  • Earlier we said: we see evidence of error
    wherever r < 1 when it might be 1.
  • Scores on tests given more than once don't
    correlate at 1.0.
  • Subsets of questions on tests that claim to
    measure the same thing don't correlate at 1.0.
  • Recall that test reliability r = S²true /
    S²observed = S²true / (S²true + S²error).
  • If there is S²error, there is S²true (unknown)
    and S²observed too; we can estimate reliability
    wherever we can find variation relevant to our
    test.

21
Test-retest Reliability
  • Correlate the scores of the same people across two
    different administrations of the test (a sketch
    follows below).
  • The r is called the test-retest coefficient or
    coefficient of stability.
  • There is no variance due to item differences or
    conditions of administration.
  • Shorter inter-test intervals give larger r.
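A minimal sketch of a test-retest coefficient in Python (the scores are
hypothetical; statistics.correlation requires Python 3.10+):

    import statistics

    time1 = [12, 15, 9, 20, 17, 11, 14, 18]   # first administration
    time2 = [13, 14, 10, 19, 18, 10, 15, 17]  # same people, second administration

    r_test_retest = statistics.correlation(time1, time2)  # Pearson r
    print(f"test-retest coefficient = {r_test_retest:.2f}")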

22
Parallel-forms reliability
  • One factor that does impact test-retest
    reliability is individual differences in memory.
  • The solution is to give two or more forms of the
    test, to get a parallel-forms coefficient or
    coefficient of equivalence.

23
Parallel-forms reliability
  • How can we deal with error from two sources:
    error due to different test times and error due
    to different forms?
  • Use Form A with half the sample and Form B with
    the other half at T1; then switch at T2.
  • The correlation between scores on both forms is
    the coefficient of stability and equivalence,
    taking into account error due to both time of
    administration and different test items on the
    two forms.

24
Internal consistency Split-half method
  • We can treat a single form as two forms: split it
    into two arbitrary halves and correlate scores on
    each half (split-half reliability).
  • To get the reliability of the test as a whole
    (assuming equal means and variances), use the
    Spearman-Brown prophecy formula (sketched below):
  • rwhole = 2rhalf / (1 + rhalf)
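A minimal sketch of split-half reliability stepped up with Spearman-Brown
(the half-test totals are hypothetical; Python 3.10+ for statistics.correlation):

    import statistics

    odd_half = [8, 11, 6, 14, 12, 7, 10, 13]   # totals on odd-numbered items
    even_half = [9, 10, 7, 13, 13, 8, 11, 12]  # totals on even-numbered items

    r_half = statistics.correlation(odd_half, even_half)
    r_whole = 2 * r_half / (1 + r_half)        # Spearman-Brown step-up
    print(f"r_half = {r_half:.2f}, whole-test reliability = {r_whole:.2f}")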

25
Ramping up the split-half method
  • The split-half method takes arbitrary halves.
  • However, different arbitrary halves might give
    different r values. How can we know which half to
    take?
  • Wherever we have uncertainty: measure many times
    and average. Maybe we can take several arbitrary
    halves and average their correlations?
  • A better method might be to take all possible
    split halves, and average their values.
  • Luckily, there is a (fairly) easy way to do
    this...

26
Internal consistency Cronbachs alpha
  • Cronbach's (1951) alpha (also sometimes called
    coefficient alpha) is a widely used and widely
    reported measure of the extent to which item
    responses obtained at the same time correlate
    highly with each other.
  • Note: this is not the same as being a measure of
    unidimensionality, though it is sometimes
    reported as being so.
  • You can get a high alpha coefficient with
    distinct, but highly intercorrelated, dimensions
    in a test.
  • Cronbach's alpha is mathematically equivalent to
    taking an average of all split halves, and is the
    best measure of internal consistency there is.

27
Cronbach's alpha
  • alpha = (k / (k - 1)) × (1 - SUM(s²i) / s²total)
  • k = the number of items
  • s²i = the variance of scores for item i
  • SUM(s²i) = the sum of all item variances
  • s²total = the variance of the total (summed)
    scores over all items (a computational sketch
    follows below).
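A minimal sketch of Cronbach's alpha on a small hypothetical item-response
table (rows are persons, columns are items; the data are invented):

    import statistics

    responses = [
        [3, 4, 3, 5],
        [2, 2, 3, 3],
        [4, 5, 4, 5],
        [1, 2, 2, 2],
        [3, 3, 4, 4],
    ]

    k = len(responses[0])                                  # number of items
    item_vars = [statistics.variance(col) for col in zip(*responses)]
    totals = [sum(row) for row in responses]               # each person's total score
    total_var = statistics.variance(totals)

    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(f"Cronbach's alpha = {alpha:.2f}")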

28
Inter-rater reliability
  • On tests requiring evaluative judgments
    (projective tests, personality ratings),
    different scorers may give different scores.
  • Inter-rater reliability is the correlation
    between their scores.
  • Generalized, you get an intraclass coefficient (or
    coefficient of concordance) as the average
    correlation between many raters.

29
How much reliability is enough?
  • As usual, in this uncertain world there is no
    simple answer to this simple question.
  • Alphas for personality tests (0.46 - 0.96) tend
    to be lower than alphas for achievement and
    aptitude tests (0.66 - 0.98).
  • If you are comparing group means, modest alphas
    of 0.6 to 0.7 may be sufficient; 0.8 is better.
  • If you want to make claims about differences
    between single individuals, you need more
    reliable scores: alphas of 0.85 or better.
  • If life-changing decisions are to be made, you
    want reliabilities of 0.9 or 0.95 (but you will
    have trouble finding them in many fields!).

30
How can we increase reliability?
  • Analyze your items.
  • Bad items increase error (that's what it means to
    be a bad item!), therefore they decrease
    reliability.
  • Increase the number of items.
  • Longer tests are generally more reliable. (Why?)
  • This is why error from subtests does not sum to
    the total error: we decrease our error when we
    have more items.
  • Factor analyze.
  • Unidimensional tests are more reliable. (Why?)
  • Factor analyze to find out whether you are looking
    at spurious reliability.

31
How can we increase reliability?
  • Distinguish between real variability in what you
    are measuring and variability due to error (e.g.
    test-retest correlations may be irrelevant if the
    thing being measured is changing).
  • FYI: you can figure out how many items you need
    to get a given reliability using a generalization
    of Spearman's prophecy formula (sketched below).
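A minimal sketch of that generalization (the general Spearman-Brown formula,
which the slides mention but do not spell out; the 20-item example is
hypothetical):

    def lengthening_factor(r_current: float, r_target: float) -> float:
        """How many times longer the test must be to reach r_target."""
        return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

    # e.g. a 20-item test with reliability 0.70, aiming for 0.90:
    n = lengthening_factor(0.70, 0.90)
    print(f"lengthen by a factor of {n:.1f} -> about {round(20 * n)} items")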

32
Validity vs. reliability
Image from http://trochim.human.cornell.edu/kb (I
highly recommend this site).