Title: Validity
1 Validity: Outline
- Definition
- Validity: Two Different Views
- Types of Validity
  - Face
  - Content
  - Criterion
    - Predictive vs. Concurrent
    - Validity Coefficients
  - Construct
    - Convergent
    - Discriminant
2 Validity: Definition
- Validity measures agreement between a test score and the characteristic it is believed to measure
- The basic question is: are you measuring what you think you're measuring?
3 Validity: two very different views
- Traditional
  - Validity is a property of tests
  - Does the test measure what you think it measures?
4 Validity: two very different views
- Traditional
- Recent (e.g., Messick, 1989; Committee on Standards for Educational and Psychological Testing (CSEPT))
  - Validity is a property of test score interpretations
  - Validity exists when actions based on the interpretation are justified given a theoretical basis and social consequences
5 Note the difference
- Traditional: Does the test measure what you think it measures?
- Recent: Validity exists when actions based on the interpretation are justified given a theoretical basis and social consequences
6 A problem with the CSEPT view
- Who is to say the social consequences of test use are good or bad?
- According to CSEPT, validity is a subjective judgment
- In my view, this makes the concept useless: if you like the result the test gives you, you will consider it valid. If you don't, you won't.
- That's not how scientists think.
7 Borsboom et al. (2004)
- Borsboom et al. reject CSEPT's view
- "Validity is a very basic concept and was correctly formulated, for instance, by Kelley (1927, p. 14) when he stated that a test is valid if it measures what it purports to measure." (p.
8 Borsboom et al. (2004)
- A test is valid for measuring an attribute if and only if (a) the attribute exists and (b) variations in the attribute causally produce variations in the outcomes of the measurement
- Variations in what you are measuring cause variations in your measurements.
- E.g., variations across people in intelligence cause variations in their IQ scores
- This is not a correlational model of validity
9 Borsboom et al. (2004)
- You don't create a test and then do the analysis necessary to establish its validity
- Rather, you begin by doing the theoretical work necessary to create a valid test in the first place.
- On this view, validity is not a big issue.
10 Borsboom et al. vs. CSEPT
- Who is right?
- Each scientist has to make up his or her own mind on that question
- I find Borsboom et al.'s arguments compelling.
- Other psychologists may disagree
11 The CSEPT view
- CSEPT recognizes 3 types of evidence for test validity
  - Content-related
  - Criterion-related
  - Construct-related
- Boundaries not clearly defined
- Cronbach (1980): Construct is basic, while Content & Criterion are subtypes.
12 Parenthetical Point: Face Validity
- Face validity refers to the appearance that a test measures what it is intended to measure.
- Face validity has P.R. value: test-takers may have better motivation if the test appears to be a sensible way to measure what it measures.
13 CSEPT: Content validity
- Content-related evidence considers coverage of the conceptual domain tested.
- Important in educational settings
- Like face validity, it is determined by logic rather than statistics
- Typically assessed by expert judges
14 CSEPT: Content validity
- Content-related evidence considers coverage of the conceptual domain tested.
- Construct-irrelevant variance
  - Is each item relevant to the domain?
- Construct under-representation
  - Is the domain adequately covered, or are parts of it left out?
- But if you are going to ask these questions, why not do it when creating the test?
15 Borsboom et al.: Content validity
- Borsboom et al. would say that content validity is not something to be established after the test has been created.
- Rather, you build it into your test by having a good theory of what you are testing
- E.g., for a test in this course to have content validity, it should test your understanding of content validity!
16 CSEPT: Criterion validity
- Criterion-related evidence tells us how well a test score corresponds to a particular criterion
- A criterion is a standard against which a test is compared.
- The test score should tell us something about the criterion score.
17 CSEPT: Criterion validity
- A criterion is a standard against which a test is compared.
- E.g., we could compare GPAs to SAT scores to produce evidence of validity of conclusions drawn on the basis of SAT scores
- Two basic types
  - Predictive
  - Concurrent
18 CSEPT: Criterion validity
- Predictive validity
  - Test scores used to predict future performance: how good is the prediction?
  - E.g., SAT is used to predict final undergraduate GPA
  - SAT & GPA are moderately correlated
19 CSEPT: Criterion validity
- Predictive validity
- Concurrent validity
  - Correlation between test scores and criterion when the two are measured at the same time.
  - Test illuminates current performance rather than predicting future performance (e.g., why does patient have a temperature? Why can't student do
20 Borsboom et al.: Criterion validity
- Criterion validity involves a correlation of test scores with some criterion such as GPA
- That does not establish the test's validity, only its utility.
- E.g., height and weight are correlated, but a test of height is not a test of what bathroom scales measure.
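The height and weight point can be sketched with a small simulation. Everything here is an assumption made for illustration: the `ruler_reading` measurement model, the simulated heights, and the made-up linear link between height and weight. It shows that ruler readings correlate with weight, yet changing weight while holding height fixed leaves the readings untouched, so the readings are caused by height only.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ruler_reading(height_cm, weight_kg):
    # Hypothetical measurement model: the reading depends on height only.
    # weight_kg is accepted but deliberately ignored.
    return round(height_cm)


rng = random.Random(0)
heights = [rng.gauss(170, 10) for _ in range(2000)]
# Assumed (made-up) relationship: taller people tend to weigh more, plus noise.
weights = [0.9 * h - 85 + rng.gauss(0, 8) for h in heights]

readings = [ruler_reading(h, w) for h, w in zip(heights, weights)]
r_reading_weight = pearson_r(readings, weights)

# "Intervene" on weight while holding height fixed: readings do not move.
readings_after = [ruler_reading(h, w + 20) for h, w in zip(heights, weights)]
reading_unchanged = readings_after == readings

print(f"correlation of ruler readings with weight: {r_reading_weight:.2f}")
print(f"readings unchanged after changing weight: {reading_unchanged}")
```

On this sketch the correlation is substantial, which would make the ruler useful for guessing weight, but only variation in height produces variation in the reading.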
21 Borsboom et al.: Criterion validity
- SAT is valid because it was developed on the sensible theory that past academic achievement is a good guide to future academic achievement
- Validity is built into the test, not established after the test has been created
22 Borsboom et al.: Criterion validity
- Validation research aims at showing how variation in the attribute causes variation in the test scores
- This requires a theory of the task: how does the test-taker do the mental operations needed to respond to test items?
23 CSEPT: Criterion validity
- Note: no point in developing a test if you already have a criterion, unless impracticality or expense makes use of the criterion difficult.
  - Criterion measure only available in the future?
  - Criterion too expensive to use?
24 CSEPT: Criterion validity
- Compute correlation (r) between test score and criterion.
- r = .30 or .40 would be considered normal.
- r > .60 is rare
- Note: r varies between -1.0 and 1.0
25 CSEPT: Criterion validity
- r² gives proportion of variance in criterion explained by test score.
- E.g., if r_xy = .30, r² = .09, so 9% of variability in Y can be explained by variation in X
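As a worked illustration of the validity coefficient and r², here is a minimal sketch. The five pairs of scores are made up for the example, and the helper names `pearson_r` and `variance_explained` are hypothetical, not from any particular package.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def variance_explained(r):
    """Proportion of criterion variance explained by the test score (r squared)."""
    return r ** 2


# Hypothetical scores for five test-takers (invented for illustration).
test_scores = [1, 2, 3, 4, 5]
criterion_scores = [2, 1, 4, 3, 5]

r_xy = pearson_r(test_scores, criterion_scores)
print(f"r = {r_xy:.2f}, r^2 = {variance_explained(r_xy):.2f}")  # r = 0.80, r^2 = 0.64

# The slide's example: r = .30 gives r^2 = .09, i.e. 9% of the variance.
print(f"{variance_explained(0.30):.2f}")  # 0.09
```

Note how quickly r² shrinks: a coefficient in the range the slides call normal (.30 to .40) leaves most of the criterion variance unexplained.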
26 CSEPT: Criterion validity
- Interpreting Validity Coefficients: watch out
  - Changes in causal relationships
  - What does criterion mean? Is it valid, reliable?
  - Is subject population for validity study appropriate?
  - Sample size
27 CSEPT: Criterion validity
- Interpreting Validity Coefficients: watch out
  - Criterion/predictor confusion
  - Range restrictions
  - Do validity study results generalize?
  - Differential predictions
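Of the cautions above, range restriction is the easiest to see numerically. The sketch below is a simulation under stated assumptions: standard-normal test scores, a true test-criterion correlation of .50, and a validity study run only on the top 20% of test scorers (for example, only admitted applicants). None of these numbers come from a real study.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)
rho = 0.5   # assumed population correlation between test and criterion
n = 5000

test = [rng.gauss(0, 1) for _ in range(n)]
# Criterion built so that its correlation with the test is rho in the population.
criterion = [rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for x in test]

r_full = pearson_r(test, criterion)

# Restrict the range: keep only people at or above the 80th percentile on the test.
cutoff = sorted(test)[int(0.8 * n)]
selected = [(x, y) for x, y in zip(test, criterion) if x >= cutoff]
r_restricted = pearson_r([x for x, _ in selected], [y for _, y in selected])

print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")
```

The restricted-sample coefficient comes out clearly lower than the full-range one, even though the underlying relationship between test and criterion is the same.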
28 CSEPT: Construct validity
- Problem: for many psychological characteristics of interest there is no agreed-upon universe of content and no clear criterion
- We cannot assess content or criterion validity for such characteristics
- These characteristics involve constructs: something built by mental synthesis.
29 CSEPT: Construct validity
- Examples of constructs
  - Intelligence
  - Love
  - Curiosity
  - Mental health
- CSEPT: We obtain evidence of validity by simultaneously defining the construct and developing instruments to measure it.
- This is bootstrapping.
30 Bootstrapping construct validity
- assemble evidence about what a test means, in other words, about the characteristic it is measuring
- CSEPT: this process is never finished
- Borsboom: this is part of the process of creating a test in the first place, not something done after the fact
31 Bootstrapping construct validity
- assemble evidence
- show relationships between a test and other tests
  - none of the other tests is a criterion
- Borsboom: these relationships do not tell us what a test score means (e.g., age is correlated with annual income, but a measure of age is not a measure of annual income).
32 Bootstrapping construct validity
- assemble evidence
- show relationships
  - each new relationship adds meaning to the test
  - test's meaning is gradually clarified over time
- Borsboom would say, why all the mystery? The meaning of many tests (e.g., WAIS, academic exams, Piaget's tests) is clear right from the
33 CSEPT: Construct validity
- Example from text: Rubin's work on Love.
- Rubin collected a set of items for a Love scale
- He read poetry and novels, and asked people for definitions
- He then created a scale of Love and one of Liking
34 CSEPT: Construct validity
- Rubin gave the scale to many subjects and factor-analyzed the results
- Love integrates Attachment, Caring, and Intimacy
- Liking integrates Adjustment, Maturity, Good Judgment, and Intelligence
- The two are independent: you can love someone you don't like (as song-writers know)
35 Campbell & Fiske (1959)
- Two types of Construct-related Evidence
  - Convergent evidence
    - When a test correlates well with other tests believed to measure the same construct
36 Campbell & Fiske (1959)
- Two types of Construct-related Evidence
  - Convergent evidence
  - Discriminant evidence
    - When a test does not correlate with other tests believed to measure some other construct.
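The two kinds of evidence can be sketched as a pair of correlations. This is a simulation under assumptions: the two latent traits (labelled `anxiety` and `vocabulary` purely for illustration) are independent, and each test score is a latent trait plus random error. It is a toy picture of the pattern, not Campbell and Fiske's full multitrait-multimethod procedure.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2)
n = 2000

# Two hypothetical, independent latent traits.
anxiety = [rng.gauss(0, 1) for _ in range(n)]
vocabulary = [rng.gauss(0, 1) for _ in range(n)]

# Each test score = latent trait + measurement error (assumed error sd = 0.5).
new_anxiety_test = [a + rng.gauss(0, 0.5) for a in anxiety]
established_anxiety_test = [a + rng.gauss(0, 0.5) for a in anxiety]
vocabulary_test = [v + rng.gauss(0, 0.5) for v in vocabulary]

# Convergent evidence: high correlation with a test of the same construct.
convergent_r = pearson_r(new_anxiety_test, established_anxiety_test)
# Discriminant evidence: near-zero correlation with a test of another construct.
discriminant_r = pearson_r(new_anxiety_test, vocabulary_test)

print(f"convergent r   = {convergent_r:.2f}")
print(f"discriminant r = {discriminant_r:.2f}")
```

On the Borsboom view discussed earlier, this pattern is what a valid test would produce, but the pattern alone does not tell us what the scores mean.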
37 Convergent validity
- Health Index scores are correlated with age, number of symptoms, chronic medical conditions, physiological measures
- Treatments designed to improve health should increase Health Index scores. They do.
38 Discriminant validity
- low correlations between new test and tests believed to tap unrelated constructs.
- evidence that the new test measures something
39 CSEPT: Validity & Reliability
- CSEPT: No point in trying to establish validity of an unreliable test.
- It's possible to have a reliable test that has no meaning (is not valid).
- Logically impossible to produce evidence of validity for an unreliable test.
40 Borsboom: Validity & Reliability
- Borsboom et al.: what does it mean to say that a test is reliable but not valid?
- What is it a test of?
- It isn't a test at all, just a collection of
41 Borsboom: Validity & Reliability
- Borsboom et al.: validity is a necessary condition for reliability
- Reliability of a test of X estimates precision of measurement of X, but how could you estimate the precision of measurement of X for a test that does not measure X?
- Thus, validity is presumed when you assess reliability
42 Blanton & Jaccard: arbitrary metrics
- We observe a behavior in order to learn about the underlying psychological characteristic
- A person's test score represents their standing on that underlying dimension
- Such scores form an arbitrary metric
- That is, we do not know how the observed scores are related to the true scores on the underlying dimension
43 [Figure: Person A and Person B located on the underlying dimension, with their corresponding scores on Test 1 and Test 2. Adapted from Blanton & Jaccard (2006), Figure 1, p. 29]
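The idea of an arbitrary metric can be illustrated with two hypothetical tests that both rise with the underlying dimension but map it onto scores differently. The two scoring functions below are assumptions chosen for illustration; they are not taken from Blanton & Jaccard's figure.

```python
def score_on_test_1(theta):
    # Hypothetical Test 1: scores are a linear function of the true standing.
    return 10 * theta


def score_on_test_2(theta):
    # Hypothetical Test 2: scores rise with the true standing, but non-linearly.
    return theta ** 3


# Assumed true standings on the underlying dimension.
person_a, person_b = 2.0, 4.0
person_c, person_d = 4.0, 6.0   # same true gap (2.0) as between A and B

gap_ab_test_1 = score_on_test_1(person_b) - score_on_test_1(person_a)  # 20.0
gap_cd_test_1 = score_on_test_1(person_d) - score_on_test_1(person_c)  # 20.0
gap_ab_test_2 = score_on_test_2(person_b) - score_on_test_2(person_a)  # 56.0
gap_cd_test_2 = score_on_test_2(person_d) - score_on_test_2(person_c)  # 152.0

print(gap_ab_test_1, gap_cd_test_1)
print(gap_ab_test_2, gap_cd_test_2)
```

Both tests order the four people correctly, yet on Test 2 identical true differences show up as very different score differences. Without knowing the mapping, the size of a score difference says little about the size of the underlying difference.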
44 Arbitrary metrics: the IAT
- Implicit Association Test (IAT): claimed to diagnose implicit attitudinal preferences or racist attitudes
- IAT authors say you may have prejudices you don't know you have.
- Are these claims true?
45 Arbitrary metrics: the IAT
- Task: categorize stimuli using two pairs of categories
- Two buttons to press, two assignments of categories to buttons, used in sequence
46 Arbitrary metrics: the IAT
- Assignment pattern A
  - Button 1: press if stimulus refers to the category White or the category Pleasant
  - Button 2: press if stimulus refers to the category Black or the category Unpleasant
- Assignment pattern B
  - Button 1: press if stimulus refers to the category White or the category Unpleasant
  - Button 2: press if stimulus refers to the category Black or the category Pleasant
47 Arbitrary metrics: the IAT
- IAT authors claim that if responses are faster to Pattern A than to Pattern B, that indicates a preference for Whites over Blacks, in other words, a racist attitude
- IAT authors also give test-takers feedback about how strong their preferences are, based on how much faster their responses are to Pattern A than to Pattern B
- This is inappropriate
48 Arbitrary metrics: the IAT
- The IAT does not tell us about racist attitudes
- IAT authors take a dimension which is non-arbitrary when used by physicists (time) and use it in an arbitrary way in psychology
49 Arbitrary metrics: the IAT
- The function relating the response dimension (time) to the underlying dimension (attitudes) is unknown
- Zero on the (Pattern A - Pattern B) difference may not be zero on the underlying attitude preference dimension
- There are alternative models of how that (Pattern A - Pattern B) difference could arise
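The zero-point problem can be made concrete with a toy model. Everything below is hypothetical: the linear form, the parameter values, and the idea of a baseline response-time gap that has nothing to do with attitude. The sketch only shows that one observed response-time gap is compatible with different attitude values under different assumed models.

```python
def implied_attitude(rt_gap_ms, baseline_ms, slope_ms):
    """Attitude implied by a response-time gap under an assumed linear model:
    rt_gap_ms = baseline_ms + slope_ms * attitude
    (gap = how much slower responses are under Pattern B than Pattern A).
    """
    return (rt_gap_ms - baseline_ms) / slope_ms


observed_gap_ms = 80.0  # made-up observed gap for one test-taker

# Model 1 (assumed): the gap reflects attitude alone, so a zero gap means zero preference.
attitude_model_1 = implied_attitude(observed_gap_ms, baseline_ms=0.0, slope_ms=100.0)

# Model 2 (assumed): an 80 ms gap arises for reasons unrelated to attitude,
# so the same observed gap corresponds to zero preference.
attitude_model_2 = implied_attitude(observed_gap_ms, baseline_ms=80.0, slope_ms=100.0)

print(f"model 1 implies attitude = {attitude_model_1:.1f}")  # 0.8
print(f"model 2 implies attitude = {attitude_model_2:.1f}")  # 0.0
```

The response-time data alone cannot distinguish the two models, which is why feedback about the strength of a preference based on the raw gap is hard to justify.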
- CSEPT
  - Validity is a characteristic of evidence, not of tests.
  - Valid evidence supports conclusions drawn using test results
  - Validity is determined by social consequences of test use
- Borsboom et al.
  - Validity is not a methodological issue, but a substantive (theoretical) issue
  - A test of an attribute is valid if (a) the attribute exists, and (b) variation in the attribute causes variation in test scores
- CSEPT
  - Validity can be established in three ways, though boundaries between them are fuzzy
    - Content-related evidence
    - Criterion-related evidence
    - Construct-related evidence
- Borsboom et al.
  - It's all the same validity: a test is valid if it measures what you think it measures
  - Validity is not mysterious
- CSEPT
  - Content-related evidence: do test items represent the whole domain of interest?
  - Criterion-related evidence: do test scores relate to a criterion either now (concurrent) or in the future (predictive)?
- Borsboom et al.
  - These questions are properly part of the process of creating a test
- CSEPT
  - Construct-related evidence is obtained when we develop a psychological construct and the way to measure it at the same time.
  - A test can be reliable but not valid. A test cannot be valid if not reliable.
- Borsboom et al.
  - A test must be valid for a reliability estimate to have any meaning
- Blanton & Jaccard (2006) warn against over-interpretation of scores which are based on an arbitrary metric
- For an arbitrary metric, we have no idea how the test scores are actually related to the underlying dimension