Title: Validity
1. Validity: Outline
- Definition
- Validity: Two Different Views
- Types of Validity
  - Face
  - Content
  - Criterion
    - Predictive vs. Concurrent
    - Validity Coefficients
  - Construct
    - Convergent
    - Discriminant
2. Validity: Definition
- Validity measures the agreement between a test score and the characteristic it is believed to measure
- The basic question is: are you measuring what you think you're measuring?
3. Validity: two very different views
- Traditional
  - Validity is a property of tests
  - Does the test measure what you think it measures?
4. Validity: two very different views
- Traditional
- Recent (e.g., Messick, 1989; Committee on Standards for Educational and Psychological Testing (CSEPT))
  - Validity is a property of test score interpretations
  - Validity exists when actions based on the interpretation are justified given a theoretical basis and social consequences
5. Note the difference
- Does the test measure what you think it measures?
- Validity exists when actions based on the
interpretation are justified given a theoretical
basis and social consequences
6. A problem with the CSEPT view
- Who is to say the social consequences of test use are good or bad?
- According to CSEPT, validity is a subjective judgment
- In my view, this makes the concept useless: if you like the result the test gives you, you will consider it valid. If you don't, you won't.
- That's not how scientists think.
7. Borsboom et al. (2004)
- Borsboom et al. reject CSEPT's view
- "Validity is a very basic concept and was correctly formulated, for instance, by Kelley (1927, p. 14) when he stated that a test is valid if it measures what it purports to measure." (p. 1061)
8. Borsboom et al. (2004)
- "a test is valid for measuring an attribute if and only if (a) the attribute exists and (b) variations in the attribute causally produce variations in the outcomes of the measurement procedure."
- Variations in what you are measuring cause variations in your measurements.
- E.g., variations across people in intelligence cause variations in their IQ scores
- This is not a correlational model of validity
9. Borsboom et al. (2004)
- You don't create a test and then do the analysis necessary to establish its validity
- Rather, you begin by doing the theoretical work necessary to create a valid test in the first place.
- On this view, validity is not a big issue.
10. Borsboom et al. vs. CSEPT
- Who is right?
- Each scientist has to make up his or her own mind on that question
- I find Borsboom et al.'s arguments compelling.
- Other psychologists may disagree
11. The CSEPT view
- CSEPT recognizes 3 types of evidence for test validity
  - Content-related
  - Criterion-related
  - Construct-related
- Boundaries not clearly defined
- Cronbach (1980): Construct is basic, while Content and Criterion are subtypes.
12. Parenthetical Point: Face Validity
- Face validity refers to the appearance that a test measures what it is intended to measure.
- Face validity has P.R. value: test-takers may have better motivation if the test appears to be a sensible way to measure what it measures.
13. CSEPT: Content validity
- Content-related evidence considers coverage of the conceptual domain tested.
- Important in educational settings
- Like face validity, it is determined by logic rather than statistics
- Typically assessed by expert judges
14. CSEPT: Content validity
- Content-related evidence considers coverage of the conceptual domain tested.
- Construct-irrelevant variance
- Construct under-representation
- Is each item relevant to the domain?
- Is the domain adequately covered or are parts of it left out?
- But if you are going to ask these questions, why not do it when creating the test?
15. Borsboom et al.: Content validity
- Borsboom et al. would say that content validity is not something to be established after the test has been created.
- Rather, you build it into your test by having a good theory of what you are testing
- E.g., for a test in this course to have content validity, it should test your understanding of content validity!
16. CSEPT: Criterion validity
- Criterion-related evidence tells us how well a test score corresponds to a particular criterion measure.
- A criterion is a standard against which a test is compared.
- The test score should tell us something about the criterion score.
17. CSEPT: Criterion validity
- A criterion is a standard against which a test is compared.
- E.g., we could compare GPAs to SAT scores to produce evidence of the validity of conclusions drawn on the basis of SAT scores
- Two basic types
  - Predictive
  - Concurrent
18. CSEPT: Criterion validity
- Test scores used to predict future performance: how good is the prediction?
- E.g., SAT is used to predict final undergraduate GPA
- SAT and GPA are moderately correlated
19. CSEPT: Criterion validity
- Predictive validity
- Concurrent validity
  - Correlation between test scores and criterion when the two are measured at the same time.
  - Test illuminates current performance rather than predicting future performance (e.g., why does the patient have a temperature? Why can't the student do math?)
20. Borsboom et al.: Criterion validity
- Criterion validity involves a correlation of test scores with some criterion such as GPA
- That does not establish the test's validity, only its utility.
- E.g., height and weight are correlated, but a test of height is not a test of what bathroom scales measure.
21. Borsboom et al.: Criterion validity
- SAT is valid because it was developed on the
sensible theory that past academic achievement
is a good guide to future academic achievement
- Validity is built into the test, not established
after the test has been created
22. Borsboom et al.: Criterion validity
- Validation research aims at showing how variation in the attribute causes variation in the test score
- This requires a theory of the task: how does the test-taker do the mental operations needed to respond to test items?
23. CSEPT: Criterion validity
- Note: no point in developing a test if you already have a criterion, unless impracticality or expense makes use of the criterion difficult.
  - Criterion measure only available in the future?
  - Criterion too expensive to use?
24. CSEPT: Criterion validity
- Compute correlation (r) between test score and criterion.
- r = .30 or .40 would be considered normal.
- r > .60 is rare
- Note: r varies between -1.0 and 1.0
25. CSEPT: Criterion validity
- r² gives the proportion of variance in the criterion explained by the test score.
- E.g., if r_xy = .30, r² = .09, so 9% of variability in Y can be explained by variation in X
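The arithmetic behind a validity coefficient can be sketched in a few lines of Python. This is only an illustration, not a procedure from the slides: the helper name `pearson_r` and the SAT and GPA numbers are invented for the example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Cross-products and squared deviations from the two means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical test scores (X) and criterion scores (Y), made up for illustration.
sat = [1000, 1100, 1200, 1300, 1400, 1500]
gpa = [3.1, 2.6, 3.4, 2.9, 3.0, 3.5]

r = pearson_r(sat, gpa)
r_squared = r ** 2  # proportion of criterion variance explained by the test

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")

# The slide's own example: r = .30 gives r^2 = .09, i.e. 9% of the variance.
print(f"{0.30 ** 2:.2f}")
```

Squaring makes modest coefficients look even more modest: a "normal" r of .30 or .40 leaves most of the variability in the criterion unexplained by the test.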
26. CSEPT: Criterion validity
- Interpreting Validity Coefficients: watch out for
  - Changes in causal relationships
  - What does the criterion mean? Is it valid, reliable?
  - Is the subject population for the validity study appropriate?
  - Sample size
27. CSEPT: Criterion validity
- Interpreting Validity Coefficients: watch out for
  - Criterion/predictor confusion
  - Range restrictions
  - Do validity study results generalize?
  - Differential predictions
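Of the cautions above, range restriction is the easiest to see with numbers. The sketch below is an assumption-laden toy: the applicant pool, the cutoff of 70, and the repeating "noise" pattern are all invented, and the noise is deliberately deterministic so the example gives the same result every time it is run.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical applicant pool: 100 people with test scores 0..99.
test_scores = list(range(100))

# Criterion = test score plus a fixed repeating pattern standing in for
# everything else that affects the criterion (not real data).
noise = [6 * ((i * 4) % 11 - 5) for i in range(100)]
criterion = [x + e for x, e in zip(test_scores, noise)]

# Validity coefficient if the criterion were observed for everyone.
r_full = pearson_r(test_scores, criterion)

# Range restriction: only people scoring 70 or above are selected,
# so the criterion is observed for them alone.
selected = [(x, y) for x, y in zip(test_scores, criterion) if x >= 70]
r_restricted = pearson_r([x for x, _ in selected], [y for _, y in selected])

print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")
```

The test is exactly as good in both cases. The coefficient drops in the selected group only because the spread of test scores has been cut down, which is why a low r from a pre-selected sample should not be read at face value.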
28. CSEPT: Construct validity
- Problem: for many psychological characteristics of interest there is no agreed-upon universe of content and no clear criterion
- We cannot assess content or criterion validity for such characteristics
- These characteristics involve constructs: something built by mental synthesis.
29. CSEPT: Construct validity
- Examples of constructs
  - Intelligence
  - Love
  - Curiosity
  - Mental health
- CSEPT: We obtain evidence of validity by simultaneously defining the construct and developing instruments to measure it.
- This is bootstrapping.
30. Bootstrapping construct validity
- Assemble evidence about what a test means; in other words, about the characteristic it is testing.
- CSEPT: this process is never finished
- Borsboom: this is part of the process of creating a test in the first place, not something done after the fact
31. Bootstrapping construct validity
- Assemble evidence
- Show relationships between a test and other tests
  - None of the other tests is a criterion
- Borsboom: these relationships do not tell us what a test score means (e.g., age is correlated with annual income, but a measure of age is not a measure of annual income).
32. Bootstrapping construct validity
- Assemble evidence
- Show relationships
  - Each new relationship adds meaning to the test
  - The test's meaning is gradually clarified over time
- Borsboom would say, why all the mystery? The meaning of many tests (e.g., WAIS, academic exams, Piaget's tests) is clear right from the start
33. CSEPT: Construct validity
- Example from text: Rubin's work on Love.
- Rubin collected a set of items for a Love scale
- He read poetry and novels, and asked people for definitions
- He created a scale of Love and one of Liking
34. CSEPT: Construct validity
- Rubin gave the scale to many subjects and factor-analyzed the results
- Love integrates Attachment, Caring, Intimacy
- Liking integrates Adjustment, Maturity, Good Judgment, and Intelligence
- The two are independent: you can love someone you don't like (as song-writers know)
35. Campbell & Fiske (1959)
- Two types of Construct-related Evidence
  - Convergent evidence
    - When a test correlates well with other tests believed to measure the same construct
36. Campbell & Fiske (1959)
- Two types of Construct-related Evidence
  - Convergent evidence
  - Discriminant evidence
    - When a test does not correlate with other tests believed to measure some other construct.
37. Convergent validity
- Scores correlated with age, number of symptoms, chronic medical conditions, physiological measures
- Treatments designed to improve health should increase Health Index scores. They do.
38. Discriminant validity
- Low correlations between the new test and tests believed to tap unrelated constructs.
- Evidence that the new test measures something unique
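A minimal sketch of what convergent and discriminant evidence look like side by side. All three score lists are invented for the example, and the variable names are placeholders rather than real instruments.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for the same eight people on three tests.
new_test = [10, 12, 14, 16, 18, 20, 22, 24]
same_construct_test = [11, 13, 13, 17, 17, 21, 21, 25]  # believed to measure the same construct
unrelated_test = [5, 9, 4, 8, 8, 4, 9, 5]               # believed to measure something else

# Convergent evidence: a high correlation with the same-construct test.
r_convergent = pearson_r(new_test, same_construct_test)

# Discriminant evidence: a low correlation with the unrelated test.
r_discriminant = pearson_r(new_test, unrelated_test)

print(f"convergent:   r = {r_convergent:.2f}")
print(f"discriminant: r = {r_discriminant:.2f}")
```

Campbell and Fiske's point is that both patterns are needed together: a test that correlates highly with everything shows convergence but does not show that it measures anything unique.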
39. CSEPT: Validity & Reliability
- CSEPT: No point in trying to establish validity of an unreliable test.
- It's possible to have a reliable test that has no meaning (is not valid).
- Logically impossible to produce evidence of validity for an unreliable test.
40. Borsboom: Validity & Reliability
- Borsboom et al.: what does it mean to say that a test is reliable but not valid?
- What is it a test of?
- It isn't a test at all, just a collection of items
41. Borsboom: Validity & Reliability
- Borsboom et al.: validity is a necessary condition for reliability
- Reliability of a test of X estimates precision of measurement of X, but how could you estimate the precision of measurement of X for a test that does not measure X?
- Thus, validity is presumed when you assess reliability
42. Blanton & Jaccard: arbitrary metrics
- We observe a behavior in order to learn about the underlying psychological characteristic
- A person's test score represents their standing on that underlying dimension
- Such scores form an arbitrary metric
- That is, we do not know how the observed scores are related to the true scores on the underlying dimension
43. [Figure: Person A and Person B placed on an underlying dimension with a Neutral point, alongside the score scales of Test 1 and Test 2, each with ticks from 0 to 6.]
Adapted from Blanton & Jaccard (2006), Figure 1, p. 29
44. Arbitrary metrics: the IAT
- Implicit Association Test (IAT): claimed to diagnose implicit attitudinal preferences or racist attitudes
- IAT authors say you may have prejudices you don't know you have.
- Are these claims true?
45. Arbitrary metrics: the IAT
- Task: categorize stimuli using two pairs of categories
- Two buttons to press, two assignments of categories to buttons, used in sequence
46. Arbitrary metrics: the IAT
- Assignment pattern A
  - Button 1: press if stimulus refers to the category White or the category Pleasant
  - Button 2: press if stimulus refers to the category Black or the category Unpleasant
- Assignment pattern B
  - Button 1: press if stimulus refers to the category White or the category Unpleasant
  - Button 2: press if stimulus refers to the category Black or the category Pleasant
47. Arbitrary metrics: the IAT
- IAT authors claim that if responses are faster to Pattern A than to Pattern B, that indicates a preference for Whites over Blacks; in other words, a racist attitude
- IAT authors also give test-takers feedback about how strong their preferences are, based on how much faster their responses are to Pattern A than to Pattern B
- This is inappropriate
48. Arbitrary metrics: the IAT
- The IAT does not tell us about racist attitudes
- IAT authors take a dimension which is non-arbitrary when used by physicists (time) and use it in an arbitrary way in psychology
49. Arbitrary metrics: the IAT
- The function relating the response dimension (time) to the underlying dimension (attitudes) is unknown
- Zero on the (Pattern A - Pattern B) difference may not be zero on the underlying attitude preference dimension
- There are alternative models of how that (Pattern A - Pattern B) difference could arise
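The zero-point problem can be made concrete with a toy example. Both link functions below are hypothetical, as are the slope of 100 ms and the 50 ms offset; neither is claimed to describe the IAT. The sketch only shows that the same observed difference implies different attitudes depending on which unknown function is assumed.

```python
# Attitude is scored in arbitrary units, with 0 meaning "no preference".
# The observed quantity is a response-time difference in milliseconds.

def attitude_under_model_1(diff_ms):
    """Hypothetical model 1: difference = 100 * attitude (no offset)."""
    return diff_ms / 100.0

def attitude_under_model_2(diff_ms):
    """Hypothetical model 2: difference = 100 * attitude + 50,
    where the 50 ms offset has nothing to do with attitude."""
    return (diff_ms - 50.0) / 100.0

# Someone whose observed difference is exactly zero:
zero_diff = 0.0
print(attitude_under_model_1(zero_diff))  # neutral under model 1
print(attitude_under_model_2(zero_diff))  # not neutral under model 2

# Someone with a 50 ms difference:
fifty_diff = 50.0
print(attitude_under_model_1(fifty_diff))  # a preference under model 1
print(attitude_under_model_2(fifty_diff))  # neutral under model 2
```

Because the response times alone cannot say which model holds, feedback of the form "your score shows a preference of such-and-such strength" rests on an assumption about the link function that has not been established.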
50. Review
- CSEPT
  - Validity is a characteristic of evidence, not of tests.
  - Valid evidence supports conclusions drawn using test results
  - Validity is determined by social consequences of test use
- Borsboom et al.
  - Validity is not a methodological issue, but a substantive (theoretical) issue
  - A test of an attribute is valid if (a) the attribute exists, and (b) variation in the attribute causes variation in test scores
51. Review
- CSEPT
  - Validity can be established in three ways, though boundaries between them are fuzzy
    - Content-related evidence
    - Criterion-related evidence
    - Construct-related evidence
- Borsboom et al.
  - It's all the same validity: a test is valid if it measures what you think it measures
  - Validity is not mysterious
52. Review
- CSEPT
  - Content-related evidence: do test items represent the whole domain of interest?
  - Criterion-related evidence: do test scores relate to a criterion either now (concurrent) or in the future (predictive)?
- Borsboom et al.
  - These questions are properly part of the process of creating a test
53. Review
- CSEPT
  - Construct-related evidence is obtained when we develop a psychological construct and the way to measure it at the same time.
  - A test can be reliable but not valid. A test cannot be valid if not reliable.
- Borsboom et al.
  - A test must be valid for a reliability estimate to have any meaning
54. Review
- Blanton & Jaccard (2006) warn against over-interpretation of scores which are based on an arbitrary metric
- For an arbitrary metric, we have no idea how the test scores are actually related to the underlying dimension