Title: Construct Validity: A Universal Validity System
1. Construct Validity: A Universal Validity System
- Susan Embretson
- Georgia Institute of Technology
- University of Maryland
- Conference on the Concept of Validity
2. Introduction
- Validity is a controversial concept in educational and psychological testing
- Research on educational and psychological tests during the last half of the 20th century was guided by a distinction among types of validity
  - Criterion-related validity, content validity and construct validity
- Construct validity is the most problematic type of validity
  - It involves theory and the relationship of data to theory
3. Introduction
- Yet the most controversial type of validity became the sole type of validity in the revised joint standards for educational and psychological tests (AERA/APA/NCME, 1999)
- In the current standards, "validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests"
- Content validity and criterion-related validity are two of five different kinds of evidence
- Reflects substantial impact from Messick's (1989) thesis of a single type of validity (construct validity) with several different aspects
4. Topics
- Overview of the validity concept
- Current issues on validity
  - Discontent with construct validity for educational tests
  - Need for content validity
- Critique of content validity as basis for educational testing
- Universal system for construct validity
  - Applies to all tests
    - Achievement tests
    - Ability tests
    - Personality/psychopathology
- Summary
5. History of the Construct Validity Concept: Origins
- American Psychological Association (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51(2), 1-38.
  - Prepared by a joint committee of the American Psychological Association, American Educational Research Association, and National Council on Measurements Used in Education
- "Validity information indicates to the test user the degree to which the test is capable of achieving certain aims. Thus, a vocabulary test might be used simply as a measure of present vocabulary, as a predictor of college success, as a means of discriminating schizophrenics from organics, or as a means of making inferences about 'intellectual capacity.'"
- "We can distinguish among the four types of validity by noting that each involves a different emphasis on the criterion." (p. 13)
6. Implications of Original Views
- Same test can be used in different ways
- Relevant type of validity depends on test use
- The types of validity differ in the importance of the behaviors involved in the test
7. More Recent Views on Types of Validity
- Standards for Educational and Psychological Testing (1954, 1966, 1974, 1985, 1999)
- 1985: "Traditionally, the various means of accumulating validity evidence have been grouped into categories called content-related, criterion-related and construct-related evidence of validity. These categories are convenient… but the use of category labels does not imply that there are distinct types of validity."
- "An ideal validation includes several types of evidence, which span all three of the traditional categories."
8. Conceptualizations of Validity: Psychological Testing Textbooks
- "All validity analyses address the same basic question: Does the test measure knowledge and characteristics that are appropriate to its purpose? There are three types of validity analysis, each answering this question in a slightly different way." (Friedenberg, 1995)
- "…the types of validity are potentially independent of one another." (Murphy & Davidshofer, 1988)
- "There are three types of evidence: (1) construct-related, (2) criterion-related, and (3) content-related. …It is important to emphasize that categories for grouping different types of validity are convenient; however, the use of categories does not imply that there are distinct forms of validity." (Kaplan & Saccuzzo, 1993)
9. Most Recent View on Validity
- Standards for Educational and Psychological Testing (1999)
- "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests." (p. 9)
- "These sources of evidence may illuminate different aspects of validity, but they do not represent distinct types of validity. Validity is a unitary concept."
- "The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful." (p. 9)
- "Because a validity argument typically depends on more than one proposition, strong evidence in support of one in no way diminishes the need for evidence to support others." (p. 11)
10. Implications of the 1999 Validity Concept
- No distinct types of validity
- Multiple sources of evidence for single test aim
- Example: mathematical achievement test used to assess readiness for more advanced course
- Propositions for inference:
  1) Certain skills are prerequisite for advanced course
  2) Content domain structure for the test represents skills
  3) Test scores represent domain performance
  4) Test scores are not unduly influenced by irrelevant variables, such as writing ability, spatial ability, anxiety, etc.
  5) Success in advanced course can be assessed
  6) Test scores are related to success in advanced curriculum
11. Current Issues with the Validity Concept: Educational Testing
- Lissitz and Samuelsen (2007)
  - Propose some changes in terminology and emphasis in the validity concept
  - Argue that construct validity as it currently exists has little to offer test construction in educational testing
- In fact, their system leads to a most startling conclusion:
  - Construct validity is irrelevant to defining what is measured by an educational test!
  - Content validity becomes primary in determining what an educational test measures
12. Critique of Content Validity as Basis for Educational Testing
- Content validity is not up to the burden of defining what is measured by a test
- Relying on content validity evidence, as available in practice, to determine the meaning of educational tests could have detrimental impact on test quality
- Giving content validity primacy for educational tests could lead to very different types and standards of evidence for educational and psychological tests
13. Validity in Educational Tests: Response to Lissitz and Samuelsen
- Background
  - Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.
- Construct representation
  - Establishes the meaning of test scores by identifying the theoretical mechanisms that underlie test performance (i.e., the processes, strategies and knowledge)
- Nomothetic span
  - Establishes the significance of test scores by identifying the network of relationships of test scores with other variables
14. Validity in Lissitz and Samuelsen's Framework
- Taxonomy of test evaluation procedures
- 1) Investigative focus
  - Internal sources: analysis of the test and its items
    - Provides evidence about what is measured
  - External sources: relationship of test scores to other measures and criteria
    - Provides evidence about impact, utility and trait theory
- 2) Perspective
  - Theoretical orientation: concern with measuring traits
  - Practical orientation: concern with measuring achievement
15. Figure 2. Taxonomy of Test Evaluation Procedures

            Theoretical           Practical
Internal    Latent Process        Content and Reliability
External    Nomological Network   Utility and Impact
16. Figure 1. The Structure of the Technical Evaluation of Educational Testing
17. Implications for Validity
- System represents best current practices
- Internal meaning (validity) established:
  - For educational tests, content and reliability evidence
    - Evidence based on internal structure (i.e., reliability, etc.)
    - Evidence based on test content
  - For psychological tests, depends on latent processes
    - Evidence based on response processes
    - Evidence based on internal structure (item correlations)
- But notice the limitations:
  - Response process and test content evidence are not relevant to both types of tests
  - External evidence based on relations to other variables has no role in validity
18. Internal Evidence for Educational Tests, Part I
- The reliability concept in the Lissitz and Samuelsen framework is generally multifaceted and traditional:
  - Item interrelationships (see the sketch below)
  - Relationship of test scores over conditions or time
  - Differential item functioning (DIF)
  - Adverse impact
- (Perhaps adverse impact and DIF could be considered as external information)
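A common index of item interrelationships is coefficient (Cronbach's) alpha. The following is a minimal Python sketch; the 0/1 item scores are invented for illustration, not data from any test discussed here.

```python
# Minimal sketch: coefficient alpha as an index of item interrelationships.
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """items: one list of scores per item, aligned across examinees."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-examinee totals
    item_var_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

items = [
    [1, 1, 0, 1, 0, 1],   # item 1 scores for six examinees (made up)
    [1, 0, 0, 1, 0, 1],   # item 2
    [1, 1, 0, 1, 1, 1],   # item 3
    [0, 1, 0, 1, 0, 1],   # item 4
]
print(f"alpha = {cronbach_alpha(items):.2f}")
```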
19. Internal Evidence for Educational Tests, Part II
- Concept of content validity
- Previous test standards (1985): content validity was a type of evidence that "…demonstrates the degree to which a sample of items, tasks or questions on a test are representative of some defined universe or domain of content"
- Two important elements added by Lissitz and Samuelsen:
  - Cognitive complexity level
    - Whether the test covers the relevant instructional or content domain and the coverage is at the right level of cognitive complexity
  - Test development procedures
    - Information about item writer credentials and quality control
20. Test Blueprints as Content Validity Evidence
- Blueprints represent domain structure by specifying percentages of test items that should fall in various categories (see the sketch below)
- Example: test blueprint for NAEP for mathematics
  - Five content strands
  - Three levels of complexity
  - Majority of states employ similar strands
- But blueprints and other forms of test specifications (along with reliability evidence) are not sufficient to establish meaning for an educational test
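To illustrate how a blueprint serves as a checkable content specification, here is a small Python sketch comparing an item pool's strand coverage against blueprint targets. The strand names, target shares, item tags, and 5% tolerance are invented for illustration; they are not actual NAEP blueprint values.

```python
# Hypothetical sketch: checking an item pool against blueprint targets.

from collections import Counter

# Blueprint: target share of items per content strand (assumed values)
targets = {
    "number_properties": 0.25,
    "measurement": 0.20,
    "geometry": 0.20,
    "data_analysis": 0.15,
    "algebra": 0.20,
}

# Each item in the pool is tagged with one strand (made-up data)
item_strands = ["algebra", "geometry", "number_properties", "algebra",
                "measurement", "data_analysis", "geometry", "algebra",
                "number_properties", "measurement"]

counts = Counter(item_strands)
n = len(item_strands)

print(f"{'strand':<20}{'target':>8}{'observed':>10}")
for strand, target in targets.items():
    observed = counts[strand] / n
    flag = "  <-- off target" if abs(observed - target) > 0.05 else ""
    print(f"{strand:<20}{target:>8.0%}{observed:>10.0%}{flag}")
```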
21. (1) Domain Structure is a Theory Which Changes Over Time
- The NAEP framework, particularly for cognitive complexity, has evolved (NAGB, 2006)
- Views on complexity level also may change based on empirical evidence, such as item difficulty modeling, task decomposition and other methods
- Changes in domain structure also could evolve in response to recommendations of panels of experts
  - National Mathematics Advisory Panel
22. (2) Reliability of Classifications is Not Well Documented
- Scant evidence that items can be reliably classified into the blueprint categories (see the sketch below for one way to quantify this)
- Certain factors in an achievement domain may make these categorizations difficult
  - For example, in mathematics a single real-world problem may involve algebra and number sense, as well as measurement content
  - The item could be classified into three of the five strands
- Similarly, classifying items for mathematical complexity also can be difficult
  - Abstract definitions of the various levels in many systems
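One way to document classification reliability would be an inter-rater agreement index such as Cohen's kappa, which corrects observed agreement for chance. A minimal sketch, with invented classifications from two hypothetical raters:

```python
# Illustrative sketch: Cohen's kappa for two raters classifying the same
# items into content strands. The classifications below are made-up data.

from collections import Counter

rater_a = ["algebra", "number", "algebra", "geometry", "number", "algebra"]
rater_b = ["algebra", "algebra", "algebra", "geometry", "number", "number"]

n = len(rater_a)

# Observed agreement: proportion of items both raters classify alike
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: from each rater's marginal category proportions
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
categories = set(rater_a) | set(rater_b)
p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"observed = {p_observed:.2f}, chance = {p_chance:.2f}, "
      f"kappa = {kappa:.2f}")
```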
23. (3) Unrepresentative Samples from the Domain
- Practical limitations on testing conditions may lead to unrepresentative samples of the content domain
- More objective item formats, such as multiple choice and limited constructed response, have long been favored
  - Reliably and inexpensively scored
- But these formats may not elicit the deeper levels of reasoning that experts believe should be assessed for the subject matter
24. (4) Irrelevant Item Solving Processes
- Using content specifications, along with item writer credentials and item quality control, may not be sufficient to assure high-quality tests
- Leighton and Gierl (2007) view content specifications as one of three cognitive models for making inferences about examinees' thinking processes
- The limitation of the test specifications model for such inferences is that no evidence is provided that examinees are in fact using the presumed skills and knowledge to solve items
25. NAEP Validity Study for Mathematics, Grade 4 and Grade 8
- Mathematicians examined items from NAEP and some state accountability tests
- A small percentage of items was deemed flawed (3-7%)
- A larger percentage of items was deemed marginal (23-30%)
- Marginal items had construct-irrelevant difficulties:
  - Problems with pattern specifications
  - Unduly complicated presentation
  - Unclear or misleading language
  - Excessively time-consuming processes
- Marginal items previously had survived both content-related and empirical methods of evaluation
26. Examples of Irrelevant Knowledge, Skills and Abilities
- Source
  - National Mathematics Advisory Panel (2008). Foundations for success: The final report of the National Mathematics Advisory Panel. Washington, DC: Department of Education.
- Method: logical-theoretical analysis by mathematicians and curriculum experts
- Mathematics involves aspects of logical analysis, spatial ability and verbal reasoning, yet their role can be excessive
27. Dependence on Non-Mathematical Knowledge
28. Dependence on Logic, Not Mathematics
29. Excessive Dependence on Spatial Ability
30. Excessive Dependence on Reasoning and Minimal Mathematics
31. Implications for Educational Tests
- Identifying irrelevant sources of item performance requires more than content-related evidence
- Latent process evidence is relevant
  - E.g., methods include cognitive analysis (e.g., item difficulty modeling), verbal reports of examinees and factor analysis
- External sources of evidence may provide needed safeguards
  - Example: implications of the correlation of an algebra test with a test of English (see the sketch below)
  - If this correlation is too high, it may suggest a failure in the system of internal evidence that supports test meaning
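A minimal sketch of this external safeguard: compute the correlation between algebra and English scores and flag it if it exceeds a ceiling. The scores and the 0.6 threshold are assumptions for illustration only.

```python
# Sketch of the external-evidence check described above: flag an algebra
# test whose scores correlate too highly with an English test, suggesting
# construct-irrelevant verbal demands. All numbers are illustrative.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

algebra = [12, 18, 25, 30, 22, 15, 28, 20]   # hypothetical algebra scores
english = [45, 50, 62, 70, 58, 44, 66, 52]   # hypothetical English scores

r = pearson_r(algebra, english)
CEILING = 0.6   # assumed upper bound for an acceptable cross-construct r
if r > CEILING:
    print(f"r = {r:.2f}: verbal demands may be construct-irrelevant")
else:
    print(f"r = {r:.2f}: within expectation")
```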
32. Construct Validity as a Universal System and a Unifying Concept
- Features
  - Consistent with current Test Standards (1999)
  - Consistent with many of Lissitz and Samuelsen's distinctions and elaborations
- Validity concept
  - Universal
    - All sources of evidence are included
    - Appropriate for both educational and psychological tests
  - Interactive
    - Evidence in one category is influenced or informed by adequacy in the other categories
33. Categories of Evidence in the Validity System
- Eleven categories of evidence
- Categories apply to both educational and psychological tests
- Consistent with most validity frameworks and the current Test Standards (1999), it is postulated that tests differ in which categories in the system are most crucial to test meaning, depending on the intended use
- Even so, most categories of evidence are potentially relevant to a test
34. A Universal Validity System
[Figure: diagram of the validity system. Internal categories, establishing meaning: Logic/Theory, Latent Process Studies, Testing Conditions, Item Design Principles, Domain Structure, Test Specifications, Scoring Models, Psychometric Properties. External categories, establishing significance: Utility, Other Measures, Impact. Internal → Meaning; External → Significance.]
35. Internal Categories of Evidence
- Logic/Theoretical Analysis: Theory of the subject-matter content; specification of areas and their interrelationships
- Latent Process Studies: Studies on content interrelationships, prerequisite skills, impact of task features and testing conditions on responses, etc.
- Testing Conditions: Available test administration methods, scoring mechanisms (raters, machine scoring, computer algorithms), testing time, locations, etc.
- Item Design Principles: Scientific evidence and knowledge about how features of items (formats, item context, complexity and specific content) impact the KSAs applied by examinees
36. Internal Categories of Evidence (continued)
- Domain Structure: Specification of content areas and levels, as well as relative importance and interrelationships
- Test Specifications: Blueprints specifying domain structure representation, constraints on item features, and specification of testing conditions
- Psychometric Properties: Item interrelationships, DIF, reliability, and the relationship of item psychometric properties to content and stimulus features
- Scoring Models: Psychometric models and procedures to combine responses within and between items, weighting of items, item selection standards, relationship of scores to proficiency categories, etc. Decisions about dimensionality, guessing, elimination of poorly fitting items, etc. impact scores and their relationships (see the sketch below)
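As one concrete instance of the scoring-model decisions listed above (here, modeling guessing), the following Python sketch computes item response probabilities under the three-parameter logistic (3PL) IRT model. The item parameters are invented for illustration.

```python
# Minimal sketch of one scoring-model choice: the three-parameter
# logistic (3PL) IRT model, where the guessing parameter c sets the
# lower asymptote of the item response curve.

import math

def p_correct(theta, a, b, c):
    """3PL: P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))"""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination (a), average difficulty (b),
# and a 0.2 guessing floor (c), e.g. a five-option multiple-choice item
a, b, c = 1.2, 0.0, 0.2

for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: P = {p_correct(theta, a, b, c):.2f}")
```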
37. External Categories of Evidence
- Utility: Relationship of scores to external variables, criteria and categories
- Other Measures: Relationship of scores to other tests of knowledge, skills and abilities
- Impact: Consequences of test use, adverse impact, proficiency levels, etc.
38. The Universal System of Validity
- Test Specifications is the most essential category; it determines (with Scoring Models):
  - Representation of domain structure
  - Psychometric properties of the test
  - External relationships of test scores
- Preceding Test Specifications are categories of scientific evidence, knowledge and theory:
  - Domain Structure
  - Item Design Principles
- In turn preceded by:
  - Latent Process Studies
  - Logical/Theoretical Analysis
  - Testing Conditions
39. General Features of the Validity System
- Test meaning is determined by internal sources of information
- Test significance is determined by external sources of information
- Content aspects of the test are central to test meaning
- Test specifications, which include test content and test development procedures, have a central role in determining test meaning
- Test specifications also determine the psychometric properties of tests, including reliability information
40. General Features of the Universal Validity System
- A broad system of evidence is relevant to support Test Specifications:
  - Item Design Principles: relevance of examinees' responses to the intended domain
  - Domain Structure: regarded as a theory
- Other preceding evidence:
  - Latent Process Studies
  - Logical/theoretical analyses of the domain
  - Testing Conditions
41. General Features of the Universal Validity System
- Interactions among components:
  - Internal evidence → expectations for external evidence
  - External evidence informs adequacy of evidence from internal sources
- Potential inadequacies arise when:
  - Hypotheses are not confirmed
  - There are unintended consequences of test use
- System of evidence includes both theoretical and practical elements
- Relevant to educational and psychological tests
42. The Universal System of Validity
- Example of feedback:
  - Speeded math test to emphasize automatic numerical processes
  - External evidence: strong adverse impact
  - Internal evidence categories to question:
    - Item Design: relationship of item speededness to automaticity
    - Domain Structure: heavy emphasis on the automaticity of numerical skills
43. Application to Educational and Psychological Tests: Achievement
- Current emphasis
  - Test specifications
    - Central to standards-based testing
  - Domain structures
    - Essential to blueprints
  - Scoring models and psychometric properties
    - State of the art in large-scale testing
- Underemphasized areas
  - Item design principles
    - Research basis is emerging
  - Latent process studies
    - Important in establishing construct relevance of student responses
  - Logical/theoretical analysis
    - Important in defining domain structure
  - Implications of feedback from studies on utility, other measures and impact
44. Application to Educational and Psychological Tests: Achievement
- Example: Item Design and Latent Process Studies
  - Item response format for mathematics items
    - Katz, I. R., Bennett, R. E., & Berger, A. E. (2000). Effects of response format on difficulty of SAT-Mathematics items: It's not the strategy. Journal of Educational Measurement, 37(1), 39-57.
  - Mathematical and non-mathematical item content
    - National Mathematics Advisory Panel
45. Application to Educational and Psychological Tests: Personality
- Current emphasis
  - Logical/theoretical analysis (i.e., personality theories)
  - Utility
    - Prediction of job performance
  - Other measures
    - Factor analytic studies
- Underemphasized areas
  - Test specifications
  - Domain structure
  - Item design principles
  - Latent process studies
46. Application to Educational and Psychological Tests: Personality
- Test Specifications and Domain Structure
  - Ignoring domain structure → lack of convergent validity
  - Multifaceted personality constructs
    - Unbalanced or uncontrolled item set
    - Best-represented facet emphasized in item selection
    - Item selection will not be consistent
  - Example: Conscientiousness construct facets
    - Dependability, Achievement (Moutafi et al., 2006)
    - Opposing relationships to commitment: Duty (-), Achievement Striving (+)
47. Application to Educational and Psychological Tests: Personality
- Test Specifications and Domain Structure
  - Example of structure in personality
  - Facet theory to:
    - Define domain membership
    - Define the domain structure of observations (see the sketch below)
  - Roskam, E., & Broers, N. (1996). Constructing questionnaires: An application of facet design and item response theory to the study of lonesomeness. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice, Vol. 3 (pp. 349-385). Norwood, NJ: Ablex Publishing.
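To illustrate the facet-design idea, here is a hypothetical Python sketch that generates item specifications by crossing facet levels, so the item set covers the domain structure systematically. The facets and levels are invented, not taken from Roskam and Broers (1996).

```python
# Hypothetical sketch of facet design: item specifications are generated
# by crossing the levels of each facet, giving a balanced set of item
# slots for item writers to fill.

from itertools import product

facets = {
    "situation": ["alone at home", "in a group", "with family"],
    "feeling": ["left out", "connected"],
    "frequency": ["rarely", "often"],
}

# Each combination of facet levels defines one item specification:
# here 3 x 2 x 2 = 12 specifications in total
for i, combo in enumerate(product(*facets.values()), start=1):
    spec = dict(zip(facets, combo))
    print(f"item {i:2d}: {spec}")
```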
48. Facet Theory Approach to a Measure of Lonesomeness
49. Application to Educational and Psychological Tests: Personality
- Item Design Principles and Latent Process Studies
  - Most measures are in self-report format
  - The basis of self-report may involve strong construct-irrelevant aspects
    - Tasks require judgments about the relevance of a statement to one's own behavior, and then reliable summarizing
  - California Psychological Inventory items:
    - "When in a group of people I usually do what the others want rather than make suggestions"
    - "There have been a few times when I have been very mean to another person"
    - "I am a good mixer"
    - "I am a better talker than listener"
50. Application to Educational and Psychological Tests: Personality
- The science of self-report is emerging and linked to cognitive psychology
  - Stone, A. A., Turkkan, J. S., Bachrach, C. A., Jobe, J. B., Kurtzman, H. S., & Cain, V. S. (2000). The science of self-report. Mahwah, NJ: Erlbaum.
  - Studies on how item and test design impact self-report accuracy
- Self-reports under optimal conditions are biased
  - Daily diaries of dietary self-reports contain insufficient calories to sustain life
    - Smith, A. F., Jobe, J. B., & Mingay, D. M. (1991b). Retrieval from memory of dietary information. Applied Cognitive Psychology, 5, 269-296.
- Personality inventories are far less optimal for reliable reporting
51. Application to Educational and Psychological Tests: Personality
- Mechanisms in self-report
  - Response styles
    - Social desirability
    - Acquiescence
  - Memory
    - When memory information is insufficient, other methods are applied
  - Context
    - Information earlier in the questionnaire
    - Ambiguity of the issue discussed
    - Moods evoked by earlier questions
52. Self-Report Context Effects
53. Summary
- The history of validity shows changes in the concept
  - The notion of types is still apparent
- Construct validity is a universal system of evidence relevant to diverse tests
- Construct validity is appropriate for educational tests
  - The content aspect alone is not sufficient