Title: Chapter 4 - Reliability
Chapter 4: Reliability
- Observed Scores and True Scores
- Error
- How We Deal with Sources of Error
- Domain sampling: test items
- Time sampling: test occasions
- Internal consistency: traits
- Reliability in Observational Studies
- Using Reliability Information
- What To Do about Low Reliability
Chapter 4 - Reliability
- Measurement of human ability and knowledge is challenging because
- ability is not directly observable; we infer ability from behavior
- all behaviors are influenced by many variables, only a few of which matter to us
Observed Scores
- O = T + e
- O = observed score
- T = true score
- e = error
Reliability: the basics
- A true score on a test does not change with repeated testing.
- A true score would be obtained if there were no error of measurement.
- We assume that errors are random (equally likely to increase or decrease any test result).
Reliability: the basics
- Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
- The mean of many observed scores for one person will be the person's true score, as the simulation below illustrates.
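A minimal simulation of this idea (hypothetical numbers, assuming numpy is available): each observed score is the true score plus a random error, and the mean of many observed scores converges on the true score.

    import numpy as np

    rng = np.random.default_rng(0)
    true_score = 80                        # the (unobservable) true score
    errors = rng.normal(0, 5, size=1000)   # random errors, mean 0
    observed = true_score + errors         # O = T + e for 1000 testings

    print(observed[:3])      # individual observed scores vary
    print(observed.mean())   # close to 80: errors cancel out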
Reliability: the basics
- Example: to measure Sarah's spelling ability for English words.
- We can't ask her to spell every word in the OED, so
- we ask Sarah to spell a subset of English words
- % correct estimates her true English spelling skill
- But which words should be in our subset?
Estimating Sarah's spelling ability
- Suppose we choose 20 words randomly.
- What if, by chance, we get a lot of very easy words: cat, tree, chair, stand?
- Or, by chance, we get a lot of very difficult words: desiccate, arteriosclerosis, numismatics?
Estimating Sarah's spelling ability
- Sarah's observed score varies as the difficulty of the random sets of words varies
- But presumably her actual spelling ability remains constant.
Reliability: the basics
- Other things can produce error in our measurement
- E.g., on the first day that we test Sarah she's tired
- but on the second day, she's rested
Estimating Sarah's spelling ability
- Conclusion:
- O = T + e
- But e1 ≠ e2 ≠ e3
- The variation in Sarah's scores is produced by measurement error.
- How can we measure such effects? How can we measure reliability?
Reliability: the basics
- In what follows, we consider various sources of
error in measurement.
- Different ways of measuring reliability are
sensitive to different sources of error.
How do we deal with sources of error?
- Error due to test items
- Error due to testing occasions
- Error due to testing multiple traits
- Internal consistency error
Domain Sampling Error
- A knowledge base or skill set containing many items is to be tested.
- E.g., chemical properties of foods.
- We can't test the entire set of items.
- So we sample items.
- That produces sampling error, as in Sarah's spelling test.
Domain Sampling Error
- Smaller sets of items may not test the entire knowledge base.
- A person's score may vary depending on what is included in or excluded from the test.
- Reliability increases with the number of items on a test.
Domain Sampling Error
- Parallel Forms Reliability
- choose 2 different sets of test items.
- Across all people tested, if the correlation between scores on the 2 item sets is low, then we probably have domain sampling error (see the sketch below).
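A sketch of how parallel-forms reliability could be computed (made-up scores, assuming numpy): correlate each person's score on form A with their score on form B.

    import numpy as np

    # Hypothetical scores for 8 people on two parallel forms of a test
    form_a = np.array([12, 15, 9, 18, 14, 11, 16, 10])
    form_b = np.array([13, 14, 10, 17, 15, 10, 17, 9])

    # Pearson correlation between the forms = parallel-forms reliability
    r = np.corrcoef(form_a, form_b)[0, 1]
    print(round(r, 3))   # a high r suggests little domain sampling error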
Time Sampling Error
- Test-retest Reliability
- A person taking a test might be having a very good or very bad day due to fatigue, emotional state, preparedness, etc.
- Give the same test repeatedly; check correlations among scores
- High correlations indicate stability: less influence of bad or good days.
Time Sampling Error
- Advantage: easy to evaluate, using correlation
- Disadvantage: carryover and practice effects
Internal Consistency Error
- Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
- Would you expect much correlation between scores on the two parts?
- No, because the two skills are unrelated.
Internal Consistency Approach
- A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits.
- A good test has high correlations between scores on its two halves.
- But how should we divide the test in two to check that correlation?
Internal Consistency Error
- Split-half method
- Kuder-Richardson formula
- Cronbach's alpha
- All of these assess the extent to which items on a given test measure the same ability or trait.
Split-half Reliability
- After testing, divide test items into halves A and B that are scored separately.
- Check for correlation of results for A with results for B.
- Various ways of dividing the test into two: randomly, first half vs. second half, odd vs. even items
Split-half Reliability: a problem
- Each half-test is smaller than the whole
- Smaller tests have lower reliability (domain sampling error)
- So, we shouldn't use the raw split-half reliability to assess reliability for the whole test
Split-half Reliability: a problem
- We correct the reliability estimate using the Spearman-Brown formula (sketched below):
- r_e = 2 r_c / (1 + r_c)
- r_e = estimated reliability for the whole test
- r_c = computed reliability (correlation between scores on the two halves A and B)
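A sketch of the split-half procedure with this correction (hypothetical half-test scores, assuming numpy; the odd-even split is one of the arbitrary choices mentioned above):

    import numpy as np

    # Hypothetical half-test scores for 8 people (odd items vs. even items)
    odd_half = np.array([4, 7, 5, 9, 6, 8, 3, 7])
    even_half = np.array([5, 6, 5, 8, 7, 9, 4, 6])

    r_c = np.corrcoef(odd_half, even_half)[0, 1]   # correlation between halves
    r_e = 2 * r_c / (1 + r_c)                      # Spearman-Brown correction
    print(round(r_c, 3), round(r_e, 3))            # r_e > r_c for positive r_c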
Kuder-Richardson 20
- Kuder & Richardson (1937): an internal-consistency measure that doesn't require arbitrary splitting of the test into 2 halves.
- KR-20 avoids problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.
Kuder-Richardson 20
- The formula contains two basic terms:
- total variance: a measure of all the variance in the whole set of test results
- item variance: when items measure the same trait, they co-vary (the same people get them right or wrong); more co-variance, less item variance
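A sketch of KR-20 on hypothetical right/wrong data, assuming numpy. The standard formula is KR-20 = (k / (k − 1)) × (1 − Σ p_i q_i / σ²), where k is the number of items, p_i the proportion passing item i, q_i = 1 − p_i, and σ² the variance of total scores.

    import numpy as np

    # Hypothetical data: rows = 6 people, columns = 5 items (1 = right, 0 = wrong)
    x = np.array([
        [1, 1, 1, 0, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 0],
        [1, 1, 1, 0, 1],
    ])

    k = x.shape[1]
    p = x.mean(axis=0)               # proportion correct for each item
    item_var = (p * (1 - p)).sum()   # sum of item variances, sum(p_i * q_i)
    total_var = x.sum(axis=1).var()  # variance of the total scores

    kr20 = (k / (k - 1)) * (1 - item_var / total_var)
    print(round(kr20, 3))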
Internal Consistency: Cronbach's α
- KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
- Cronbach's α (alpha) generalizes KR-20 to tests with multiple response categories.
- α is a more generally useful measure of internal consistency than KR-20.
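A sketch of Cronbach's α on hypothetical multi-category items (e.g., 1-5 ratings), assuming numpy. The standard formula is α = (k / (k − 1)) × (1 − Σ σ_i² / σ_total²); with 1/0 items it reduces to KR-20.

    import numpy as np

    # Hypothetical ratings: rows = 6 people, columns = 4 items on a 1-5 scale
    x = np.array([
        [4, 5, 4, 4],
        [3, 3, 2, 3],
        [5, 5, 4, 5],
        [2, 2, 3, 2],
        [4, 4, 4, 5],
        [3, 2, 2, 2],
    ])

    k = x.shape[1]
    item_vars = x.var(axis=0).sum()   # sum of the individual item variances
    total_var = x.sum(axis=1).var()   # variance of the total scores

    alpha = (k / (k - 1)) * (1 - item_vars / total_var)
    print(round(alpha, 3))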
Review: How do we deal with sources of error?

Approach         Measures                            Issues
Test-Retest      Stability of scores                 Carryover
Parallel Forms   Equivalence, Stability              Effort
Split-half       Equivalence, Internal consistency   Shortened test
KR-20 & α        Equivalence, Internal consistency   Difficult to calculate
Reliability in Observational Studies
- Some psychologists collect data by observing behavior rather than by testing.
- This approach requires time sampling, leading to sampling error
- Further error is due to:
- observer failures
- inter-observer differences
Reliability in Observational Studies
- Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
- Deal with inter-observer differences using:
- Inter-rater reliability
- Kappa statistic
Reliability in Observational Studies
- Inter-rater reliability: agreement between 2 or more observers
- Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
- This means that raw agreement may over-estimate inter-rater reliability.
Reliability in Observational Studies
- Kappa Statistic (Cohen, 1960)
- estimates actual inter-rater agreement as a proportion of potential inter-rater agreement after correction for chance.
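A sketch of the computation (hypothetical labels, plain Python). The standard form is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the raters' marginal proportions.

    from collections import Counter

    # Hypothetical classifications of 10 behaviors by two observers
    rater1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
    rater2 = ["A", "B", "B", "B", "A", "B", "A", "A", "A", "A"]

    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n   # observed agreement

    # Chance agreement: product of the raters' marginal proportions per category
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))

    kappa = (p_o - p_e) / (1 - p_e)
    print(round(kappa, 3))   # 0.8 observed vs. 0.54 chance agreement here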
Using Reliability Information
- Standard error of measurement (SEM)
- estimates the extent to which a test score misrepresents a true score.
- SEM = S √(1 − r), where S = standard deviation of test scores and r = reliability
Standard Error of Measurement
- We use the SEM to compute a confidence interval for a particular test score.
- The interval is centered on the test score
- We have confidence that the true score falls in this interval
- E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score.
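A sketch of the interval computation (hypothetical numbers, plain Python):

    import math

    s = 15        # standard deviation of test scores (hypothetical)
    r = 0.91      # reliability of the test (hypothetical)
    score = 104   # one person's observed score

    sem = s * math.sqrt(1 - r)   # SEM = S * sqrt(1 - r)
    lo, hi = score - 1.96 * sem, score + 1.96 * sem
    print(round(sem, 2), (round(lo, 1), round(hi, 1)))
    # 95% of the time the true score lies between lo and hi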
What to do about low reliability
- Increase the number of items
- To find how many you need, use the Spearman-Brown formula (sketched below)
- Using more items may introduce new sources of error, such as fatigue and boredom
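A sketch of that calculation (hypothetical values, plain Python). The generalized Spearman-Brown formula is r_n = n r / (1 + (n − 1) r), where n is the lengthening factor; solving for n gives n = r_target (1 − r) / (r (1 − r_target)).

    r = 0.70          # current reliability (hypothetical)
    r_target = 0.90   # desired reliability (hypothetical)
    k = 20            # current number of items (hypothetical)

    # Spearman-Brown prophecy: factor by which the test must be lengthened
    n = (r_target * (1 - r)) / (r * (1 - r_target))
    print(round(n, 2), "times longer ->", round(n * k), "items")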
What to do about low reliability
- Discriminability analysis
- Find correlations between each item and the whole test
- Delete items with low correlations
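A sketch of a discriminability analysis on hypothetical 1/0 item data, assuming numpy: correlate each item with the total score and flag weak items (the 0.2 cutoff is an arbitrary illustration).

    import numpy as np

    # Hypothetical data: rows = 8 people, columns = 4 items scored 1 or 0
    x = np.array([
        [1, 1, 0, 1],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 1],
    ])

    total = x.sum(axis=1)
    for i in range(x.shape[1]):
        r = np.corrcoef(x[:, i], total)[0, 1]   # item-total correlation
        flag = "  <- candidate for deletion" if r < 0.2 else ""
        print(f"item {i + 1}: r = {r:.2f}{flag}")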