Title: Psychometrics: An introduction
1Psychometrics: An introduction
2Overview
- A brief history of psychometrics
- The main types of tests
- The 10 most common tests
- Why psychometrics? Clinical versus actuarial
judgment
3A brief history
- Testing for proficiency dates back to 2200 B.C.,
when the Chinese emperor used grueling tests to
assess fitness for office
4Francis Galton
- Modern psychometrics dates to Sir Francis Galton
(1822-1911), Charles Darwin's cousin
- Interested in (in fact, obsessed with)
individual differences and their distribution
- 1884-1890: Tested 17,000 individuals (!) on
height, weight, sizes of accessible body parts,
and behavior: hand strength, visual acuity, RT, etc.
- Demonstrated that objective tests could provide
meaningful scores
- Invented correlation. The first regression line was
the average diameter of seeds against the average
diameter of their parents
5Regression to the mean
- Galton also popularized the idea of regression to
the mean: extreme values, when repeated, tend to be
less extreme
Francis Galton (1886). Regression Towards
Mediocrity in Hereditary Stature. Journal of the
Anthropological Institute, 15, 246-263.
6Regression to the mean
- So: second albums by great bands tend to be worse
than first albums; second novels by successful
first novelists tend to be worse than first
novels; sports teams who excelled in one game or
season tend to do worse in the next game/season;
geniuses have children who are less brilliant than
they are; etc.
- WHY?
7Regression to the mean
- I had the most satisfying Eureka experience of
my career while attempting to teach flight
instructors that praise is more effective than
punishment for promoting skill-learning. When I
had finished my enthusiastic speech, one of the
most seasoned instructors in the audience raised
his hand and made his own short speech, which
began by conceding that positive reinforcement
might be good for the birds, but went on to deny
that it was optimal for flight cadets. He said,
"On many occasions I have praised flight cadets
for clean execution of some aerobatic maneuver,
and in general when they try it again, they do
worse. On the other hand, I have often screamed
at cadets for bad execution, and in general they
do better the next time. So please don't tell us
that reinforcement works and punishment does not,
because the opposite is the case." This was a
joyous moment, in which I understood an important
truth about the world: because we tend to reward
others when they do well and punish them when
they do badly, and because there is regression to
the mean, it is part of the human condition that
we are statistically punished for rewarding
others and rewarded for punishing them. I
immediately arranged a demonstration in which
each participant tossed two coins at a target
behind his back, without any feedback. We
measured the distances from the target and could
see that those who had done best the first time
had mostly deteriorated on their second try, and
vice versa. But I knew that this demonstration
would not undo the effects of lifelong exposure
to a perverse contingency.
- Daniel Kahneman (in his Nobel acceptance speech)
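The coin-toss demonstration Kahneman describes can be imitated in a few lines. This is only a sketch under stated assumptions: every miss distance is drawn as pure chance (no skill at all), and the number of participants, the "top 10%" cut, and the function names are all hypothetical choices made for illustration.

```python
import random

def simulate_two_tries(n_participants=1000, seed=0):
    """Each participant makes two tries; each miss distance is pure chance."""
    rng = random.Random(seed)
    # Assumption: distance from the target is |Gaussian error|, with no skill component.
    first = [abs(rng.gauss(0, 1)) for _ in range(n_participants)]
    second = [abs(rng.gauss(0, 1)) for _ in range(n_participants)]
    return first, second

def fraction_worse_on_second(first, second, top_fraction=0.1):
    """Among the best first-try performers, the share who did worse on try two."""
    k = max(1, int(len(first) * top_fraction))
    # Smallest distance = best performance.
    best = sorted(range(len(first)), key=lambda i: first[i])[:k]
    worse = sum(1 for i in best if second[i] > first[i])
    return worse / k

first, second = simulate_two_tries()
share = fraction_worse_on_second(first, second)
print(f"Share of top first-try performers who got worse: {share:.2f}")
```

Because nothing but chance separates the two tries here, the "deterioration" of the best performers cannot be an effect of praise or punishment; it is regression to the mean.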
8James Cattell
- James Cattell (studied with Wundt & Galton)
first used the term "mental test" in 1890
- His tests were in the "brass instruments"
tradition of Galton: mostly motor and acuity tests
- Founded Psychological Review (1894)
9Clark Wissler
- Clark Wissler (Cattell's student) did the first
basic validational research, examining the
relation between the old mental test scores and
academic achievement
- His results were largely discouraging
- He had only bright college students in his
sample
- Why is this a problem?
- Wissler became an anthropologist with a strong
environmentalist bias.
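One commonly raised concern with a sample of only bright college students is restriction of range: when everyone in the sample scores high, there is little variation left for a test to correlate with. The sketch below illustrates the idea with simulated data. The population model (a shared "ability" factor plus independent noise), the 10% selection cut, and all variable names are hypothetical, not Wissler's data.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)
# Hypothetical population: test score and grades share a common ability factor.
ability = [rng.gauss(0, 1) for _ in range(20000)]
test = [a + rng.gauss(0, 1) for a in ability]
grades = [a + rng.gauss(0, 1) for a in ability]

r_full = pearson_r(test, grades)

# Keep only the top 10% of test scorers, a stand-in for a selective sample.
cutoff = sorted(test)[int(0.9 * len(test))]
selected = [(t, g) for t, g in zip(test, grades) if t >= cutoff]
r_restricted = pearson_r([t for t, _ in selected], [g for _, g in selected])

print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

In this toy population the same test looks much less valid in the selected group than in the full population, even though nothing about the test itself has changed.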
10Alfred Binet
- Goodenough (1949): The Galtonian approach was
like "inferring the nature of genius from the
nature of stupidity or the qualities of water
from those of ... hydrogen and oxygen."
- Alfred Binet (1905) introduced the first modern
intelligence test, which directly tested higher
psychological processes (real abilities &
practical judgments)
- i.e. picture naming, rhyme production, weight
ordering, question answering, word definition.
- Also motivated IQ (Stern, 1914): mental age
divided by chronological age
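The quotient itself is simple arithmetic. A minimal sketch follows; the function name is hypothetical, and the multiplication by 100 is the later scoring convention rather than something stated on the slide.

```python
def ratio_iq(mental_age, chronological_age):
    """Mental age divided by chronological age, scaled by 100.

    The factor of 100 is an assumption added here (a later convention);
    the slide only gives the ratio itself.
    """
    if chronological_age <= 0:
        raise ValueError("chronological_age must be positive")
    return 100 * mental_age / chronological_age

# An 8-year-old who performs like a typical 10-year-old:
print(ratio_iq(10, 8))
# A child performing exactly at age level:
print(ratio_iq(8, 8))
```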
11The rise of psychometrics
- Lewis Terman (1916) produced a major revision of
Binet's scale
- Robert Yerkes (1919) convinced the US government
to test 1.75 million army recruits
- Post-WWI: Factor analysis emerged, making other
aptitude and personality tests possible
12What is a psychometric test?
- A test is a standardized procedure for sampling
behavior and describing it using scores or
categories
- Most tests are predictive of some non-test
behavior of interest (or what would be the
point?)
- Most tests are norm-referenced: they describe
the behavior in terms of norms, test results
gathered from a large group of subjects (the
standardization sample)
- Some tests are criterion-referenced: the
objective is to see if the subject can attain
some pre-specified criterion.
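The difference between the two kinds of scoring can be sketched in code. Everything here is illustrative: the standardization sample is made-up data, the criterion of 25 is arbitrary, and the function names are hypothetical.

```python
from statistics import mean, stdev

def norm_referenced(raw, norm_sample):
    """Describe a raw score relative to a standardization sample."""
    z = (raw - mean(norm_sample)) / stdev(norm_sample)
    # Percentile rank: percentage of the norm group scoring at or below raw.
    percentile = 100 * sum(1 for x in norm_sample if x <= raw) / len(norm_sample)
    return z, percentile

def criterion_referenced(raw, criterion):
    """Report only whether the pre-specified criterion was attained."""
    return raw >= criterion

# Hypothetical standardization sample of raw scores.
norm_sample = [12, 15, 15, 18, 20, 21, 23, 24, 27, 30]

z, percentile = norm_referenced(24, norm_sample)
passed = criterion_referenced(24, criterion=25)
print(f"z = {z:.2f}, percentile = {percentile:.0f}, passed = {passed}")
```

The same raw score of 24 is well above average relative to these norms, yet falls short of the fixed criterion: the two approaches answer different questions.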
13The main types of tests
- Intelligence tests: Assess intelligence
- Aptitude tests: Assess capability
- Achievement tests: Assess degree of
accomplishment
- Creativity tests: Assess capacity for novelty
- Personality tests: Assess traits
- Interest inventories: Assess preferences for
activities
- Behavioral tests: Measure behaviors and their
antecedents/consequences
- Neuropsychological tests: Measure cognitive,
sensory, perceptual, or motor functions
14The 10 most commonly used tests
- 1.) Wechsler Intelligence Scale for Children
(WISC)
- 2.) Bender Visual-Motor Gestalt Test
- 3.) Wechsler Adult Intelligence Scale (WAIS)
- 4.) Minnesota Multiphasic Personality Inventory
(MMPI)
- 5.) Rorschach Ink Blot Test
- 6.) Thematic Apperception Test (TAT)
- 7.) Sentence Completion
- 8.) Goodenough Draw-A-Person Test
- 9.) House-Tree-Person Test
- 10.) Stanford-Binet Intelligence Scale
- From Brown & McGuire, 1976
15Clinical versus actuarial judgment
- Clinical judgment: reaching a decision by
processing information in one's head
- Actuarial judgment: reaching a decision without
employing human judgment, using
empirically-established relations between data
and the event of interest
- Actuarial: ad. L. actuārius, a keeper
of accounts
- Note that some of the data in an actuarial
judgment may be qualitative clinical
observations, allowing a mixture of methods
16Clinical versus actuarial judgment
- Paul Meehl (1954) first addressed the question:
Which is better?
- His ground rules for comparison:
- Both methods should draw from the same data set
(this was relaxed by others, with no changes in
results)
- Cross-validation should be required, to avoid
using variation specific to the data set
- There should be explicit prediction of success,
recidivism, or recovery
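Why cross-validation matters can be shown with a deliberately extreme sketch: every candidate predictor below is pure noise, yet the one that looks best in a small derivation sample appears to predict the outcome. Sample sizes, the number of candidates, and all names are hypothetical choices for illustration.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(42)
n_derivation, n_holdout, n_predictors = 30, 1000, 200
n_total = n_derivation + n_holdout

# Outcome and all candidate predictors are independent noise (no real signal).
outcome = [rng.gauss(0, 1) for _ in range(n_total)]
predictors = [[rng.gauss(0, 1) for _ in range(n_total)]
              for _ in range(n_predictors)]

def r_on(rows, j):
    """Correlation of predictor j with the outcome over the given cases."""
    return pearson_r([predictors[j][i] for i in rows],
                     [outcome[i] for i in rows])

derivation = range(n_derivation)
holdout = range(n_derivation, n_total)

# Pick whichever predictor looks best in the derivation sample...
best = max(range(n_predictors), key=lambda j: abs(r_on(derivation, j)))
r_derivation = r_on(derivation, best)
# ...then check it on cases it has never seen.
r_holdout = r_on(holdout, best)

print(f"derivation r = {r_derivation:.2f}, cross-validated r = {r_holdout:.2f}")
```

The apparent validity in the derivation sample is "variation specific to the data set"; it shrinks toward zero on new cases, which is what the cross-validation rule is meant to catch.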
17Meehl (1954) Results
- He looked at between 16 and 20 studies (depending
on inclusion criteria)
- "... it is clear that the dogmatic, complacent
assertion sometimes heard from clinicians that
naturally clinical prediction, being based on
real understanding is superior, is simply not
justified by the facts to date."
- In all but one case, predictions made by
actuarial means were equal to or better than
clinical methods
- In a later paper, he changed his mind about the
one!
18Thirty years later...
- "Review and reflection indicate that no more than
5% of what was written in the 1954 book entitled
Clinical Versus Statistical Prediction needs to
be retracted 30 years later. If anything, these
retractions would result in the book's being more
actuarial than it was."
- "There is no controversy in social science that
shows such a large body of qualitatively diverse
studies coming out so uniformly as this one."
- Paul Meehl, 1986 (Causes and Effects of My
Disturbing Little Book)
19In 1989
- After eliminating studies that might be biased
against clinicians, by 1989 there were
approximately 100 studies that pitted actuarial
against clinical methods
- In virtually every one of these studies, the
actuarial method has equaled or surpassed the
clinical method, sometimes substantially
- Dawes, Faust, & Meehl, 1989 (in your course pack)
20Example: Goldberg's Rule
- Goldberg's Rule (1965) gives a simple formula for
diagnosing psychosis versus neurosis from MMPI
scale scores (we will see these scales later)
- It was derived by looking at "gold standard"
discharge diagnoses
- It was compared to 29 judges on 861 profiles from
7 settings
- Judges got an average of 62% correct
- The best judge got 67% correct
- Goldberg's Rule got 70% correct, and exceeded
judges in every one of the 7 settings
- Additional training didn't help the judges do
better (and note also that the judges knew and
could have used Goldberg's Rule!)
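The slides do not give the formula itself. As it is commonly reported in the literature, the rule adds the T-scores of three MMPI scales (L, Pa, Sc), subtracts two others (Hy, Pt), and compares the result to a cutoff of 45. The sketch below assumes that form; treat the formula, the cutoff, and the example profiles as assumptions to check against the course readings rather than as the definitive rule.

```python
def goldberg_index(t_scores):
    """Index as commonly reported: L + Pa + Sc - Hy - Pt (MMPI T-scores).

    Assumption: this form is not stated on the slides.
    """
    return (t_scores["L"] + t_scores["Pa"] + t_scores["Sc"]
            - t_scores["Hy"] - t_scores["Pt"])

def classify(t_scores, cutoff=45):
    """Assumed decision rule: index at or above the cutoff suggests psychosis."""
    return "psychotic" if goldberg_index(t_scores) >= cutoff else "neurotic"

# Two hypothetical profiles (made-up T-scores, for illustration only).
profile_a = {"L": 50, "Pa": 70, "Sc": 75, "Hy": 60, "Pt": 65}
profile_b = {"L": 50, "Pa": 55, "Sc": 55, "Hy": 70, "Pt": 70}

print(goldberg_index(profile_a), classify(profile_a))
print(goldberg_index(profile_b), classify(profile_b))
```

The point of the example is how little machinery is involved: five scale scores, addition and subtraction, and one cutoff, applied the same way to every profile.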
21Where are clinicians' strengths? I
- i.) Theory-mediated judgments
- If the predictor knows the relevant causal
influences, can measure them, and has a model
specific enough to take him/her from theory to
fact
- However, are there any reasons to doubt this
potential advantage?
22Where are clinicians' strengths? II
- ii.) Ability to use rare events
- If the predictor knows that the current case is
an exception to the statistical trend, s/he can
use that information to over-ride the trend
- It is also possible to build these into actuarial
methods
- Why is it very difficult in practice?
- Why might we worry about clinicians' ability to
incorporate rare events into prediction?
23Where are clinicians' strengths? III
- iii.) Able to detect complex predictive cues
- - Human beings are still (for now) masters at
recognizing some complex configurations, such as
facial expressions etc.
24Where are clinicians' strengths? IV
- iv.) Able to re-weight utilities in real-time
- - For ethical, legal, humanitarian, or financial
reasons, we might decide to do things differently
than usual in particular cases.
25Where are actuarial strengths? I
- i.) Immunity from fatigue, forgetfulness,
hang-overs, hostility, prejudice, ignorance,
false association, over-confidence, bias,
heart-ache, and random fluctuations in judgment.
26Where are actuarial strengths? II
- ii.) Consistency & proper weighting
- - Variables are weighted the same way every
time, according to their actual demonstrable
contributions to the criterion of interest
- - Perhaps more importantly, irrelevant variables
are properly weighted to zero
27Where are actuarial strengths? III
- iii.) Feedback & base-rates built-in to the
system
- - Clinicians rarely know how they are doing
because they don't get immediate feedback and
because they have imperfect memory
- - Actuarial records constitute perfect memories
of how things came out in similar cases and can
include a larger and wider sample than a single
human or a small group of humans can ever hope to
see
28Where are actuarial strengths? IV
- iv.) Not overly sensitive to optimal weightings
- - Even simplistic actuarial judgments often beat
human judgments
- - Simple linear weightings often do better than
humans
- v.) Optimal (non-linear) weightings are
possible.
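Point iv.) can be illustrated with simulated data: a composite that simply adds the predictors with equal (unit) weights often correlates with the criterion almost as well as one using the correct weights. This is a sketch under assumptions: the "true" weights, the noise level, and the sample size are hypothetical, and the true weights stand in for an optimally fitted model.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(7)
true_weights = (0.5, 0.3, 0.2)   # hypothetical "optimal" weights
n_cases = 5000

# Three independent standardized predictors per case.
predictors = [[rng.gauss(0, 1) for _ in true_weights] for _ in range(n_cases)]
# Criterion = correctly weighted sum of predictors plus unpredictable noise.
criterion = [sum(w * x for w, x in zip(true_weights, row)) + rng.gauss(0, 1)
             for row in predictors]

weighted_composite = [sum(w * x for w, x in zip(true_weights, row))
                      for row in predictors]
unit_composite = [sum(row) for row in predictors]   # every weight set to 1

r_weighted = pearson_r(weighted_composite, criterion)
r_unit = pearson_r(unit_composite, criterion)
print(f"correct weights r = {r_weighted:.2f}, unit weights r = {r_unit:.2f}")
```

In this setup the unit-weighted sum gives up only a small part of the achievable correlation, which is the sense in which actuarial prediction is "not overly sensitive" to getting the weights exactly right.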
29The power of non-linearity
- Linear relations are those that say that X goes
up by the same amount for each equal-sized
increment in Y
- P = aX + bY + c
- Such equations are represented graphically by a
straight line relating X and Y, or its equivalent
in any higher number of dimensions
- Non-linear relations are those that say that X
goes up by different amounts for each equal-sized
increment in Y (there are many, many such
equations)
- Such equations are represented graphically by a
non-straight line relating X and Y, either
because the line breaks or because it curves
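The "same amount for each equal-sized increment" idea can be checked numerically. In this sketch the coefficients a, b, c and the squared term in the non-linear example are arbitrary illustrative choices; many other non-linear forms would do.

```python
def linear(x, y, a=2.0, b=3.0, c=1.0):
    """P = aX + bY + c"""
    return a * x + b * y + c

def nonlinear(x, y, a=2.0, b=3.0, c=1.0):
    """One of many possible non-linear forms: P = aX + bY^2 + c"""
    return a * x + b * y ** 2 + c

ys = [0, 1, 2, 3, 4]   # equal-sized increments in Y, with X held at 0

# Change in P for each one-unit step in Y.
linear_steps = [linear(0, ys[i + 1]) - linear(0, ys[i])
                for i in range(len(ys) - 1)]
nonlinear_steps = [nonlinear(0, ys[i + 1]) - nonlinear(0, ys[i])
                   for i in range(len(ys) - 1)]

print("linear steps:    ", linear_steps)
print("non-linear steps:", nonlinear_steps)
```

The linear function changes by the same amount (b) at every step, while the non-linear one changes by a different amount each time.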
30The power of non-linearity
- Westbury, C., Buchanan, L., Sanderson, M.,
Rhemtulla, M., & Phillips, L. (2003). Using
genetic programming to discover non-linear
variable interactions. Behavior Research Methods,
Instruments, and Computers, 35(2), 202-216.
- We used computational means to discover
non-linear weightings for a test (constructed for
PSYCO 431) which looked at the construct of
"geekiness": the extent to which a person is a
geek.
- This test was validated against a self-rating on
a Likert scale.
- The test consisted of 76 questions.
- The validation set contained 59 subjects.
- The test set contained 30 subjects.
31The power of non-linearity (and the need for
cross-validation)
- The non-linear estimate was about as good at
predicting scores on unseen tests as the (gold
standard) summed validation score around which
the test had been designed
- It blew away the linear regression (0.56 versus
0.20)
- The non-linear combination used responses to
only 12 of the 76 test questions in its
prediction.