Title: Overview
1Predictive Tests
2Overview
- Introduction
- Some theoretical issues
- The failings of human intuitions in prediction
- Issues in formal prediction
- Inference from class membership: the individual versus group problem (and its only solution)
- Some well-known predictive tests
- Prediction in science and psychometrics
3Predictive Tests
- Many tests are used to make predictions: of levels of achievement or success, of likelihood of recidivism, or of diagnostic category
- Two kinds of predictions
- Categorical: predict which category this subject will fall into (diagnosis, occupation)
- Numerical: predict the value of a relevant numerical variable (GPA, economic return to company)
4The failings of human intuition
- We have already seen many ways in which humans succumb to errors in numerical reasoning
- Kahneman & Tversky asked subjects for three judgments about areas of graduate specialization: base rate estimates, estimates (from a description) of similarity to other students in each field, and predictive estimates (also from a description)
5Results
- Results
- Similarity and prediction correlate at 0.97
- Similarity and base rates correlate at -0.65
- What does this result remind you of?
- What do these subjects need to be taught?
66 Errors discussed by Kahneman & Tversky
- Representativeness error: assumes predictions are not different from assessments of similarity
- Insufficient regression error: people fail to take into account that when predictive validity is less than perfect, correlations between predictors and performance should be less than 1
- Central tendency error: subjects making judgments tend to avoid extremes, and compress their judgments into a smaller range than the phenomenon being judged
76 Errors discussed by Kahneman & Tversky
- Discounting of prior probabilities: human predictors will throw out base rate information for almost any reason
- Overweighting of coherence: there is greater confidence in predictions based on consistent input than on inconsistent input with the same average (e.g. two B's are treated as better than an A and a C for predicting a B average)
- Overweighting of extremes: confidence in judgment is over-weighted at extremes, especially positive extremes (J-shaped confidence function)
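The insufficient regression error can be made concrete with a small sketch. Under ordinary least-squares regression with both variables in standard-score (z) units, the best linear prediction is the predictor's z-score multiplied by the correlation r, so predictions shrink toward the mean as r falls. The function name and the example values (z = 2.0, the list of r values) are illustrative assumptions, not taken from Kahneman & Tversky's materials.

```python
def regressed_prediction(z_predictor, r):
    """Least-squares prediction in z-score units: z_hat = r * z_predictor."""
    return r * z_predictor


# Hypothetical applicant who scores 2 SDs above the mean on the predictor.
z = 2.0
for r in (1.0, 0.9, 0.4, 0.0):
    z_hat = regressed_prediction(z, r)
    # The non-regressive (intuitive) prediction would simply repeat z.
    print(f"r = {r:.1f}: intuitive z = {z:.1f}, regressed z = {z_hat:.2f}")
```

Only when r = 1 does the regressed prediction match the intuitive one; with r = 0 the best guess is the group mean regardless of the predictor.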
8What do we need to make good predictions?
- We need three pieces of information
- 1.) Base rates
- 2.) Relevant predictors in the individual case
- 3.) Bounds on accuracy (cutting scores)
- Kahneman & Tversky's experimental evidence (previous slides) shows that subjects usually fail to weight any of these three properly
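A rough sketch of why base rates matter, using Bayes' rule for a categorical prediction. The hit rate, false alarm rate, and base rates below are hypothetical numbers chosen for illustration, and the function name is an assumption; none of them come from the slides.

```python
def posterior_probability(base_rate, hit_rate, false_alarm_rate):
    """P(subject is in the category | the predictor says so), by Bayes' rule."""
    true_positives = base_rate * hit_rate
    false_positives = (1.0 - base_rate) * false_alarm_rate
    return true_positives / (true_positives + false_positives)


# Same predictor quality (hypothetical 80% hits, 20% false alarms),
# applied to categories with different base rates.
for base_rate in (0.50, 0.10, 0.01):
    p = posterior_probability(base_rate, hit_rate=0.80, false_alarm_rate=0.20)
    print(f"base rate = {base_rate:.2f} -> P(category | positive) = {p:.3f}")
```

With identical evidence, the probability that a positive prediction is correct drops sharply as the category becomes rarer, which is exactly the information that is lost when base rates are thrown out.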
9Review: Measuring validation error
- Coefficient of alienation (or coefficient of non-determination): k = 1 - r^2, where r is the correlation of test score with some predicted performance
- k = the proportion of the error inherent in guessing that your estimate has (percent of variance not accounted for)
- If k = 1.0, you have 100% of the error you'd have had if you just guessed (since this means your r was 0)
- If k = 0, you have achieved perfection: your r was 1, and there was no error at all (N.B. This never happens.)
- If k = 0.6, you have 60% of the error you'd have had if you guessed
10Why should we care?
- We care because r and k are useful in interpreting the accuracy of an individual's scores
- r = 0.6 (good), k = 0.64 (not good)
- r = 0.7 (great), k = 0.51 (not so great)
- r = 0.9 (fantastic!), k = 0.19 (so so)
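The r and k pairs above follow directly from k = 1 - r^2, as a short sketch shows. The function name is an illustrative assumption. Note that some texts reserve the name "coefficient of alienation" for the square root of this quantity; the sketch follows the definition used on these slides.

```python
def coefficient_k(r):
    """Proportion of variance not accounted for: k = 1 - r^2 (slide definition)."""
    return 1.0 - r ** 2


for r in (0.0, 0.6, 0.7, 0.9, 1.0):
    print(f"r = {r:.1f} -> k = {coefficient_k(r):.2f}")
```

Even a correlation of 0.9 leaves 19% of the variance unexplained, and a correlation of 0.6 leaves 64%.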
11Why should we care?
- Since even high values of r (0.9) leave a fairly large proportion of variance unaccounted for, the prediction of any individual's criterion score is always accompanied by a wide margin of error
- Recall: S_m = S (1 - r)^0.5, so individual error margins are a function of how good our correlation is
- The moral: predicting individual performance is really hard to do!
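A minimal sketch of the recalled formula, reading it as error margin = S * (1 - r)^0.5, with S taken to be the standard deviation of the scores (an assumption about the symbols). The value S = 100 and the function name are hypothetical, chosen only to make the numbers easy to read.

```python
import math


def error_margin(sd, r):
    """Error margin S * sqrt(1 - r) for score SD `sd` and correlation `r`."""
    return sd * math.sqrt(1.0 - r)


SD = 100.0  # hypothetical score standard deviation
for r in (0.6, 0.7, 0.9, 0.99):
    print(f"r = {r:.2f} -> error margin = {error_margin(SD, r):.1f} points")
```

The margin shrinks slowly: moving from r = 0.6 to r = 0.9 only halves it, and it stays near a third of a standard deviation even at r = 0.9.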
12What can we infer from class membership?
- Some commentators have suggested that inference from class membership is inherently fallacious
- e.g. 25% of first-degree relatives of those diagnosed with malignant melanoma (skin cancer) will also develop melanoma
- I am a first-degree relative of two persons diagnosed with melanoma, so I take my odds of developing the disease to be > 25%
- Critics of the inference say: No, it is either 0% (I don't develop the disease) or 100% (I do), i.e. group probabilities don't apply to individuals
13Do group probabilities apply to individuals?
- Meehl's response: "If nothing is rationally inferable from membership in a class, no empirical prediction is ever possible"
- The argument is a re-statement of the necessity of inference: even in the case of predicting individual behavior from that individual's data, we need to consider the pattern over past data
- Moreover, the claim of 'certainty' is philosophical, not real: in the absence of knowing which group you are in, there is only probability, not certain knowledge
14"One incident that occurred while future Nobel
Laureate Kenneth Arrow was forecasting the
weather illustrates both uncertainty and the
human unwillingness to accept it. Some officers
had been assigned the task of forecasting the
weather a month ahead, but Arrow and his
statisticians found that their long-range
forecasts were no better than numbers pulled out
of a hat. The forecasters agreed and asked their
superiors to be relieved of this duty. The reply
was 'The Commanding General is well aware that
the forecasts are no good. However, he needs them
for planning purposes'."
- Peter Bernstein, Against the Gods: The Remarkable Story of Risk
15Some Predictive Tests: Standardized admission tests
- Scholastic aptitude tests (SAT, GRE) are highly reliable tests developed to painstaking psychometric standards
- The reference norm group changes every year: the reference group for 2003 scores was based on examinees from 1998-2001, and the reference group for 2004 scores was based on examinees from 1999-2002
- For this reason, the same score may have a (slightly) different percentile rank in one year than in another
16The Graduate Record Exam: General
- The GRE is a computerized standardized test taken by individuals applying to graduate school.
- Its purpose is to measure the acquired skills of the test taker, and to predict performance in graduate school.
- The general GRE has four sections
- Verbal Section: 30 questions, 30 minutes
- Quantitative Section: 28 questions, 45 minutes
- Analytical Writing Section: 2 Analytical Writing Tasks
- 45-minute "Present Your Perspective on an Issue" task
- 30-minute "Analyze an Argument" task
- Research sections
- The test is timed, and corrected for guessing
- It is also computer adaptive: questions depend on answers
17The Graduate Record Exam: Writing
- Scored on a 6 point scale (mean = 4.18, SD = 0.97)
- 6: Insightful analyses of complex ideas, logically compelling, well organized, skillful sentence variety, few or no usage errors
- 5: Generally thoughtful analysis of ideas, logically sound reasons, generally well organized, sentence variety conveys meaning, minor usage errors
- 4: Competent analysis of ideas, relevant reasons, adequately organized, satisfactory control of sentence structure, some usage errors
- 3: Some competence, but flawed by at least one of: limited analysis or development, weak organization or control of sentence structure, usage errors that result in vagueness
- 2: Serious weakness in at least one of: lack of analysis, development, or organization; serious problems in sentence structure; usage errors that obscure meaning
- 1: Fundamental deficiencies: content that is confusing or irrelevant, little or no development, pervasive errors that result in incoherence
18Sample Verbal Questions
- Analogies
- ETERNAL : END
- a. precursory : beginning
- b. grammatical : sentence
- c. implausible : credibility
- d. invaluable : worth
- e. frenetic : movement
19Sample Verbal Questions
- Sentence Completions
- Museums, which house many paintings and sculptures, are good places for students of _____.
- a. art
- b. science
- c. religion
- d. dichotomy
- e. democracy
20Sample Verbal Questions
- Antonyms
- MALADROIT
- a. ill-willed
- b. dexterous
- c. cowardly
- d. enduring
- e. sluggish
21Sample Quantitative Questions
- Quantitative Comparison
- Column A: y - 6; Column B: -3
- If y > 2
- a. the quantity in column A is always greater
- b. the quantity in column B is always greater
- c. the quantities are always equal
- d. It cannot be determined from the information
given
22Sample Quantitative Questions
- Problem Solving
- The sum of x distinct integers greater than zero is less than 75. What is the greatest possible value of x?
- a. 8
- b. 9
- c. 10
- d. 11
- e. 12
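The arithmetic behind this problem can be sketched as follows. The smallest possible sum of x distinct positive integers is 1 + 2 + ... + x, so the answer is the largest x for which that sum stays under the limit. The function name is an illustrative assumption.

```python
def max_distinct_count(limit):
    """Largest x such that 1 + 2 + ... + x is strictly less than `limit`."""
    count, total = 0, 0
    # Keep adding the next smallest unused integer while the sum stays below the limit.
    while total + (count + 1) < limit:
        count += 1
        total += count
    return count


print(max_distinct_count(75))  # prints 11, since 1 + ... + 11 = 66 and 1 + ... + 12 = 78
```

This corresponds to option d.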
23Sample Analytical Questions
- A pastry shop will feature 5 desserts (V, W, X, Y, Z) to be served Monday through Friday, one dessert a day, conforming to the following restrictions
- Y must be served before V.
- X and Y must be served on consecutive days.
- Z may not be the second dessert to be served.
24The Graduate Record Exam: Subject
- The subject test has 220 5-choice multiple choice questions
- There are currently subject tests in: Biochemistry, Cell and Molecular Biology; Biology; Chemistry; Computer Science; Literature in English; Mathematics; Physics; Psychology
- In psychology
- 43% Experimental/natural science
- 43% social science
- 14% general
25Reliability
- Within-test reliability: 0.9
- Test re-test reliability is not so good: repeat test takers for both tests show an average score gain of 20-30 points
- This may move a student by a large amount: more than 10 percentiles
- Standard error of measurement of about 35 points
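To see what a standard error of measurement of about 35 points means for one test taker, here is a rough sketch of a 95% band around an observed score. It assumes measurement error is approximately normal (hence the multiplier 1.96); the observed score of 600 and the function name are hypothetical.

```python
SEM_POINTS = 35.0  # standard error of measurement quoted on the slide
Z_95 = 1.96        # normal-distribution multiplier for a 95% band (assumption)


def score_band(observed, sem=SEM_POINTS, z=Z_95):
    """Return (low, high) bounds of the band observed +/- z * sem."""
    half_width = z * sem
    return observed - half_width, observed + half_width


low, high = score_band(600.0)  # hypothetical observed score
print(f"Observed 600: plausible range roughly {low:.0f} to {high:.0f}")
```

A band about 137 points wide is large compared with the 20-30 point gains seen on retesting.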
26Validity
- In one meta-analysis, Sternberg and Williams point out that empirical validities of the GRE vary somewhat by field
- Tests correlate with each other
- Verbal and quantitative: 0.45
- Quantitative and analytical: 0.66
- Correlations between various combinations of GRE scores and grad school performance are only between 0.25 and 0.35, and only marginally better (0.4) if you include undergraduate grades
27Validation Correlations of GRE Scores
28Correlations of GRE Scores
- You can estimate your IQ from GRE/SAT scores at
- http://members.shaw.ca/delajara/GREIQ.html
- GRE V+Q 1240 = IQ 130
- N.B. I have no idea how valid this site's claims are.
29Subject Test Validity
- Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the predictive validity of the Graduate Record Examinations: Implications for graduate student selection and performance. Psychological Bulletin, 127(1), 162-181.
- N = 1,753 studies, together covering 82,659 graduate students
- Subject Tests tended to be better predictors than the Verbal, Quantitative, and Analytical tests.
- GRE correlations with degree attainment and research productivity were consistently positive; however, some lower 90% credibility intervals included 0.
30Construct Validity
- Does the GRE get at anything related to graduate school?
- What about motivation, creativity, devotion, conscientiousness, and other aspects that make a successful graduate student?
- Some complaints
- Graduate assignments require that students develop research skills, but the GRE does not test this
- GRE is timed but real life is rarely timed
- GRE is individualised but real work usually
involves collaboration
31Why is the GRE so popular?
- Because it is in the public eye
- Since average scores for admissions on tests such as the GRE are published, there is pressure on schools to keep the average scores of the students that they accept high, so that they can remain competitive with other institutions in the public eye
- One strength of the GRE is that there is a specific regression equation for each college, i.e. it can predict future performance at a particular college independently
- Because there is relatively little variation in applicants' reference letters and undergraduate GPA, GRE scores are one of the main sources of the variation that is needed to rank applicants
- P.S. A new GRE is scheduled to come out in 2006
32The Scholastic Aptitude Test
- The SAT is a set of tests
- SAT I includes the Verbal and Math tests, whose scores are summed to get the total score
- SAT II has tests in 12 subject fields
- Like the GRE, the SAT test is timed and corrected for guessing
- Range for each subtest (Verbal/Math) is 200-800 (mean = 500, SD = 100)
33The Scholastic Aptitude Test
- First normed in 1941, re-normed in 1995 on a more carefully-chosen group
- There was an 80 point increase in verbal at most score ranges (e.g., a 1941 score of 500 would now be 580)
- Math scores were up by about 40 points at lower ranges only
34Some Predictive Tests: The SAT
- Internal reliability: 0.90
- Standard error of about 30 points
- SAT: r = 0.4 with university GPA
- By comparison, high school grades: r = 0.48
- Together, r = 0.55
35Can you beat the standards?
- Notwithstanding the huge industry waiting to take money from anxious high school students, studying for the SAT doesn't help much
- SAT coaching increases scores by about 15 points, which is 0.15 SDs
- Repeat testing increases it a little less, about 12 points or 0.12 SDs
- How much should we pay for 0.1 SDs?
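To put a gain of 0.1 to 0.15 SDs in perspective, the sketch below converts a point gain into SD units and then into a percentile shift for a test taker who starts at the mean. It assumes scores are normally distributed with SD = 100 (the subtest SD given earlier in these slides); the function names are illustrative.

```python
import math

SAT_SD = 100.0  # subtest standard deviation from the slides


def gain_in_sds(points, sd=SAT_SD):
    """Express a raw point gain in standard-deviation units."""
    return points / sd


def percentile_after_gain(points, sd=SAT_SD, start_z=0.0):
    """Percentile reached after the gain, assuming a normal score distribution."""
    z = start_z + gain_in_sds(points, sd)
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


for label, points in (("coaching", 15.0), ("repeat testing", 12.0)):
    print(f"{label}: {gain_in_sds(points):.2f} SDs, "
          f"50th percentile -> {percentile_after_gain(points):.1f}th")
```

Under these assumptions, an average test taker moves up by only about five or six percentile points.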
36Some Predictive Tests: Professional tests
- Professional school tests (MCAT, LSAT)
- MCAT: r = low .80s
- LSAT: r > 0.9
- There is relatively little evidence of validity
- They predict performance about as well as undergraduate GPA alone: r = 0.25 - 0.3
37Some Predictive Tests: The Strong Interest Inventory
- The Strong (1927) Interest Inventory (Strong-Campbell, 1981): a widely used test of interests as predictors of professional aptitude
- Empirically constructed with concurrent validity, comparing each vocational group to the overall average
- Has 325 items, 162 scales covering 85 occupations
- Reliability is high
- 0.9 test/retest over weeks; 0.6-0.7 over years, unless they were old (25+ years) at first test, then 0.8 even after 20 years
- Does not predict success or satisfaction in a profession
- Does predict likelihood of entering and remaining in a profession: chances of 50% that a person will end up in a profession most strongly predicted (A score), and only 12% that he will end in one least predicted (C score)
38Prediction in scientific psychology
- Prediction & scientific explanation are related
- We admire Newton's laws precisely because they are accurate in predicting real phenomena
- Many cognitive models in psychology are weak because they are purely descriptive: they fail to make an effort to predict how a person will perform on unseen stimuli
- There are many ways to do so, if you have sufficient variation in predictors: multiple regression, neural networks, 'cheap' methods (e.g. best single predictor)
39Some lessons about scientific prediction
- Models can 'cheat' by using variance in the input data set that does not transfer to unseen data: you must test your predictions on unseen data (cross-validation)
- Some models that are very good may be very good precisely because they are very good at using this 'within-set' variation
- Even very simple non-linear models may do as well as or better than much more complex models, especially linear models
- E.g. r = 0.48 (validation set r = 0.58)
- Linear regression: r = 0.22 (validation set r = 0.20)
- They may exclude highly-correlated variables
- Different measures of successful prediction may yield quite different results (e.g. test correlation versus correlation after binning into 0.5 SD intervals)
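The 'cheating' point can be illustrated with a deliberately extreme toy model: a memorizer that predicts each case from the single nearest case it has already seen. This is a sketch on simulated data, not any model from the slides. The data-generating rule (y = x + noise), the sample sizes, the seed, and all names are assumptions. It uses the simplest form of validation, a single held-out set, rather than full k-fold cross-validation.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def make_data(n, rng):
    """Simulated predictor and criterion: y = x + noise (hypothetical)."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_memorizer(train_x, train_y):
    """Return a predictor that copies the outcome of the nearest training case."""
    pairs = list(zip(train_x, train_y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)  # unseen data

model = fit_memorizer(train_x, train_y)
r_train = pearson_r([model(x) for x in train_x], train_y)
r_test = pearson_r([model(x) for x in test_x], test_y)
print(f"within-set r = {r_train:.2f}, validation-set r = {r_test:.2f}")
```

The memorizer reproduces the training outcomes exactly, noise included, so its within-set correlation is perfect while its correlation on the unseen cases is much lower. Only the second number says anything about prediction.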
40Some lessons about scientific prediction
- Linear assumptions may be limiting: you may hide variance just by taking on the assumption
- More predictive power may sometimes (perhaps often) be obtained by dropping the assumption of linear relations between predictors and the quality to be predicted
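A minimal sketch of how a linear assumption can hide variance: the toy criterion below is completely determined by the predictor, but through a squared relation, so the linear correlation is zero. The data and names are hypothetical and chosen so the arithmetic is exact.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


xs = list(range(-5, 6))    # predictor values, symmetric around zero
ys = [x * x for x in xs]   # criterion depends on the predictor, but not linearly

r_linear = pearson_r(xs, ys)                         # assumes a straight-line relation
r_nonlinear = pearson_r([x * x for x in xs], ys)     # drops the linear assumption
print(f"linear r = {r_linear:.2f}, r using x squared = {r_nonlinear:.2f}")
```

A purely linear analysis would conclude that the predictor is useless here, even though it determines the criterion exactly.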