Title: Test Development and Use: New Twists on Old Questions
1. Test Development and Use: New Twists on Old Questions
- Wayne F. Cascio
- PTSC, October 21, 2005
2. Source
- Cascio, W. F., & Aguinis, H. (2005, Fall). Test development and use: New twists on old questions. Human Resource Management, 44(3), 219-235.
3. Purpose
- Update research findings in 5 key areas
- Validity generalization
- Statistical significance testing
- Criterion measures
- Cutoff scores
- Cross-validation
- Distill guidelines for practice
- Each area is controversial, but there is general agreement about guidelines
4. Validity Generalization
- Variability across studies in validities might not represent genuine differences
- Of 7 potential sources of variability, sampling error is the most important
- Lesson: the mean of several validity coefficients is a better basis for inferring validity than is any one coefficient (SIOP, 2003); a minimal illustration follows
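A minimal illustration of the lesson above: a sample-size-weighted average of local validity coefficients, with a rough estimate of how much of the observed variability sampling error alone would produce. The function name and the example validities are hypothetical, and the formulas follow the standard Hunter-Schmidt bare-bones approach rather than anything specific to this presentation.

```python
import numpy as np

def bare_bones_vg(rs, ns):
    """Sample-size-weighted mean validity plus the variance expected
    from sampling error alone (bare-bones meta-analysis sketch)."""
    rs, ns = np.asarray(rs, float), np.asarray(ns, float)
    r_bar = np.sum(ns * rs) / np.sum(ns)                   # weighted mean validity
    var_obs = np.sum(ns * (rs - r_bar) ** 2) / np.sum(ns)  # observed variance of r
    var_err = (1 - r_bar ** 2) ** 2 / (np.mean(ns) - 1)    # expected sampling-error variance
    return r_bar, var_obs, var_err

# Hypothetical validities from five local studies
r_bar, var_obs, var_err = bare_bones_vg(rs=[.18, .25, .31, .22, .27],
                                        ns=[85, 120, 60, 200, 150])
print(f"mean r = {r_bar:.2f}; sampling error accounts for roughly "
      f"{min(var_err / var_obs, 1.0):.0%} of the observed variance")
```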
5. Legal Status of VG
- Only 3 cases that relied on VG evidence have reached the Appeals-Court level
- Courts do not always accept VG evidence
- Atlas Paper Box Co. (1989)
  - Atlas did no job analyses
  - Expert never visited the company
  - Expert argued that measures of g are always valid
  - 6th Circuit noted: "As a matter of law, Hunter's VG theory is totally unacceptable under relevant case law and professional standards."
6. VG Guidelines for Practice
- Provide evidence that jobs and contexts are similar to those in a VG study
- Don't rely on VG as the sole basis for defending a test, if challenged
- Does the VG study describe clearly the predictors and criteria being assessed?
- Does the VG study include all publicly available studies in the domain of interest?
- Are the variables selected/coded on a priori theoretical grounds?
7. VG Guidelines for Practice
- Did multiple raters apply the coding scheme? What were the inter-rater reliabilities?
- Does the VG study include all variables analyzed, including moderators?
- Are the characteristics of the included studies reported in as much detail as possible, so readers can assess which generalizations are appropriate?
8. Statistical Significance Testing
- There is heated debate over the use of null hypothesis significance testing
- Two practical issues:
  - How significance tests are used
  - Are significance tests alone sufficient to justify using a test, or is additional information needed?
9. Use of Statistical Significance Testing
- Allows us to infer whether a null hypothesis of no test-criterion relationship is likely to be false
- Used incorrectly when:
  - A result at the .01 level is interpreted as a larger effect than one at the .05 level
  - Failure to reject the null hypothesis in the sample is interpreted as evidence of a lack of relationship in the population
10. Guidelines for Practice
- Report confidence intervals, not just significance levels; an interval conveys the practical significance of the effect (see the sketch below)
- Observed vs. corrected validities:
  - Use observed validities when the objective is to understand validity evidence for a specific predictor-criterion relationship
  - Use corrected validities when the objective is to understand relationships among constructs
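One way to report a confidence interval around an observed validity coefficient is the Fisher r-to-z transformation sketched below; the function name and example values are hypothetical, not taken from the article.

```python
import numpy as np
from scipy.stats import norm

def validity_ci(r, n, conf=0.95):
    """Confidence interval for an observed validity coefficient,
    using the Fisher r-to-z transformation."""
    z = np.arctanh(r)                      # Fisher z
    se = 1.0 / np.sqrt(n - 3)              # standard error of z
    z_crit = norm.ppf(1 - (1 - conf) / 2)
    lo, hi = z - z_crit * se, z + z_crit * se
    return float(np.tanh(lo)), float(np.tanh(hi))  # back to the r metric

# Hypothetical observed validity of .25 with N = 120
print(validity_ci(0.25, 120))              # roughly (.07, .41)
```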
11. Guidelines for Practice (cont.)
- Be very clear about what significance tests provide and what they do not provide (the strength of the effect)
- Can low correlations still be useful? Yes: utility depends on SDy, the selection ratio (SR), and the cost of the test
- Compute statistical power to rule out the possibility that the sample was too small to detect an effect that actually was present (a Type II error); see the sketch after this slide
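The sketch below illustrates the last two bullets under stated assumptions: an approximate power calculation for a correlation (via the Fisher z approximation) and a per-hire utility estimate using the standard Brogden-Cronbach-Gleser model, which the slide does not spell out. All names and numbers are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def power_for_r(r_pop, n, alpha=0.05):
    """Approximate power of a two-tailed test that a population validity
    r_pop differs from zero (Fisher z approximation)."""
    z = np.arctanh(r_pop) * np.sqrt(n - 3)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)

def utility_per_hire(r, sd_y, sr, cost_per_applicant):
    """Brogden-Cronbach-Gleser gain per selectee per year under top-down
    selection: r * SDy * mean standardized predictor score of those
    selected, minus testing cost spread over the hires."""
    z_cut = norm.ppf(1 - sr)                 # cut point implied by the selection ratio
    mean_z_selected = norm.pdf(z_cut) / sr   # E[z | selected]
    return r * sd_y * mean_z_selected - cost_per_applicant / sr

# Hypothetical: r = .20, N = 80; SDy = $12,000, SR = .25, $50 per applicant
print(f"power with N = 80: {power_for_r(0.20, 80):.2f}")          # about .43
print(f"gain per hire: ${utility_per_hire(0.20, 12000, 0.25, 50):,.0f}")
```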
12. Measures of Performance (Criteria)
- Primary standard for choosing a criterion: relevance
- Can't get useful information if criteria are deficient or contaminated
- Other aspects of criteria:
  - Dynamism
  - Typical vs. maximum performance
  - Multi-dimensionality
13. Dynamism of Criteria
- Does the rank order of individuals on a criterion change over time?
- Keil & Cortina (2001): deterioration of validity over time is a ubiquitous phenomenon
- Guidelines for practice:
  - Identify and understand variables that cause performance to change over time
  - Tests of general (vs. narrow) ability tend to predict criteria for longer periods of time
  - But changes in the complexity and consistency of tasks over time reduce validity
14. Dynamism of Criteria - Guidelines
- In some organizations (e.g., services, fast-paced work environments), task performance is likely to change over time, while contextual performance tends to remain stable
- Lesson: predictors of contextual performance, more so than task performance, are likely to remain valid longer
15. Typical Versus Maximum Performance
- Measures of maximum performance (what employees can do) correlate only slightly with measures of typical performance (what employees will do)
- Do the objectives of a validation study focus on typical or maximum performance?
- Guidelines for practice:
  - Selection procedures → maximum performance
  - Criteria → typical performance
16. Typical vs. Maximum Performance
- Lack of alignment in predictor-criterion constructs may prevent development of tests with high predictive validity
- The focus of a validation study should be determined, in part, by whether it includes a measure of typical or maximum performance as a criterion
17. Multi-Dimensionality of Criteria
- Since job performance is multi-dimensional, criterion measures also should be
- Two-dimensional taxonomy:
  - Task performance
  - Contextual performance
- Guidelines for practice:
  - Include both types as criteria
  - Example: work environments where technology requires constant learning of new tools in a cooperative context (work in teams)
18. Cutoff Scores
- Sometimes not needed (e.g., top-down selection)
- State or local governing bodies might require them for licensure, certification, promotion, or graduation
- Setting minimum standards:
  - Multiple-choice format: the Angoff procedure
19. Cutoff Scores (cont.)
- Constructed-response tests: the analytical judgment method (AJM)
  - Panelists are given examinee work samples for each question
  - Ratings for each question: below basic, basic, proficient, advanced
  - Within-category ratings: low, middle, high
  - Discussion to resolve discrepancies
  - Re-rating
20. Cutoff Scores (cont.)
- AJM: to calculate a performance standard for "proficient":
  - Work samples rated as low, middle, and high are used to calculate an average score (the performance standard)
  - The process is repeated for the basic and advanced standards, and across all questions
  - Total assessment standards: sum over the standards (basic, proficient, advanced) for all questions (see the sketch after this slide)
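A minimal sketch of the averaging described on this slide. The data structure (per-question lists of scored work samples tagged with the panel's final category) and the example numbers are hypothetical; the point is simply that per-question standards are averages of the samples placed at a level, and the test-level standard is their sum.

```python
import numpy as np

def ajm_standard(ratings, level="proficient"):
    """For each question, average the scores of work samples the panel
    placed at `level`; the total standard is the sum across questions."""
    per_question = {
        q: float(np.mean([score for score, lvl in samples if lvl == level]))
        for q, samples in ratings.items()
    }
    return per_question, sum(per_question.values())

# Hypothetical panel data for two constructed-response questions,
# after the discussion and re-rating rounds
ratings = {
    "Q1": [(3.0, "basic"), (4.5, "proficient"), (5.0, "proficient"), (6.0, "advanced")],
    "Q2": [(2.5, "basic"), (4.0, "proficient"), (4.5, "proficient"), (5.5, "advanced")],
}
per_q, total = ajm_standard(ratings)
print(per_q, total)   # {'Q1': 4.75, 'Q2': 4.25} 9.0
```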
21. Cutoff Scores (cont.)
- The AJM has been piloted in 3 states
- Easy to use; results in cut scores that panelists see as appropriate
- More validity data are needed to examine the accuracy of the performance standards
22. Cut Scores: Guidelines
- There is no single best method for all situations
- Identify critical levels of proficiency on KSAOs (e.g., using the AJM)
- Use a sample of 10-20 SMEs with a broad cross-section of experience
- Consider errors of measurement and adverse impact
  - E.g., set the cut score 1 SEM below the mean incumbent score (see the sketch below)
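The last guideline can be made concrete as follows; the function name, scores, and reliability value are hypothetical, and SEM is computed in the usual way as SD * sqrt(1 - reliability).

```python
import numpy as np

def cut_score_one_sem_below(incumbent_scores, reliability):
    """Set the cutoff one standard error of measurement (SEM) below the
    mean incumbent score, to allow for errors of measurement."""
    scores = np.asarray(incumbent_scores, float)
    sem = scores.std(ddof=1) * np.sqrt(1 - reliability)   # SEM = SD * sqrt(1 - rxx)
    return scores.mean() - sem

# Hypothetical incumbent scores and a test reliability of .85
print(cut_score_one_sem_below([72, 80, 85, 90, 78, 83], 0.85))   # about 78.9
```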
23. Cut Scores: Guidelines
- Per Standard 4.19 of the APA Test Standards (1999), provide:
  - Description/documentation of the method used
  - Training of judges
  - Assessment of inter-rater agreement
- Set cutoff scores high enough to ensure that minimum standards of performance are met!
24. Cross-Validation
- Predictor-criterion relationships are often operationalized using OLS regression
- Weights are assigned to the predictors to minimize the difference between observed and predicted criterion scores
- Prediction is optimized in the sample
- When weights derived in one sample are applied to a second sample from the same population, the multiple R is likely to shrink
25. Cross-Validation (cont.)
- Key question: the extent to which weights derived from a sample cross-validate (generalize)
- Can weights derived from one sample predict outcomes to the same degree in other samples drawn from the same population?
26. Cross-Validation (cont.)
- Empirical and statistical strategies
- Empirical: fit a regression model in one sample
  - Use the resulting regression weights with a second, independent sample
  - The multiple R in the second sample is used as the best estimate of the cross-validated correlation
  - Single-sample strategy: use one sample and divide it into derivation and cross-validation sub-samples (see the sketch below)
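A sketch of the single-sample (derivation/hold-out) strategy with simulated data; everything here is hypothetical and uses plain OLS via numpy, so the only point is how derivation-sample weights fare when applied to new cases (typically, R shrinks).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 300 cases, 4 predictors, modest true validity
n, k = 300, 4
X = rng.normal(size=(n, k))
y = X @ np.array([0.3, 0.2, 0.1, 0.0]) + rng.normal(size=n)

# Divide into derivation and cross-validation sub-samples
half = n // 2
A_dev, y_dev = np.column_stack([np.ones(half), X[:half]]), y[:half]
A_cv,  y_cv  = np.column_stack([np.ones(n - half), X[half:]]), y[half:]

# Fit OLS weights in the derivation sample, then apply them to the hold-out cases
b, *_ = np.linalg.lstsq(A_dev, y_dev, rcond=None)
R_dev   = np.corrcoef(A_dev @ b, y_dev)[0, 1]
R_cross = np.corrcoef(A_cv @ b, y_cv)[0, 1]
print(f"derivation R = {R_dev:.2f}, cross-validated R = {R_cross:.2f}")
```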
27. Cross-Validation (cont.)
- Statistical cross-validation
  - Adjust the sample-based multiple R for sample size and number of predictors
- Jackknife method: an all-purpose statistical tool
  - Draw thousands of random sub-samples, with replacement, from the full original sample
  - Compute the multiple R for each sub-sample
  - The mean R is the best estimate of the cross-validity (see the sketch below)
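One reading of the resampling procedure above, sketched with simulated data: refit the regression in each of thousands of resamples drawn with replacement, evaluate each set of weights against the full original sample, and average the resulting Rs. All data and design choices here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data, as in the previous sketch
n, k = 300, 4
X = rng.normal(size=(n, k))
y = X @ np.array([0.3, 0.2, 0.1, 0.0]) + rng.normal(size=n)
A = np.column_stack([np.ones(n), X])

R_values = []
for _ in range(2000):                        # thousands of resamples with replacement
    idx = rng.choice(n, size=n, replace=True)
    b, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
    R_values.append(np.corrcoef(A @ b, y)[0, 1])   # weights applied to the full sample

print(f"mean resampled R = {np.mean(R_values):.2f}")   # estimate of the cross-validity
```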
28. Empirical Approaches
- Single-sample designs sacrifice degrees of freedom and lead to unstable regression weights
- Multiple-sample designs require more time and effort
- They yield accurate estimates only if the validation sample represents the population and is large relative to the number of predictors
29. Statistical (Formula-Based) Approaches
- More cost-effective to implement
- Raju et al. (1999) investigated 11 cross-validity estimation procedures
- If N > 40, the most accurate formula is Browne's (1975), shown below
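The slide's equation did not survive conversion; the standard Browne (1975) formula for the squared cross-validity, reconstructed here, is

$$\hat{\rho}_c^{2} \;=\; \frac{(N - k - 3)\,\hat{\rho}^{4} + \hat{\rho}^{2}}{(N - 2k - 2)\,\hat{\rho}^{2} + k}$$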
30. Statistical Cross-Validation (cont.)
- where ρ = the population multiple correlation, N = the sample size, and k = the number of predictors
- The squared multiple correlation in the population is estimated with the Wherry (1931) formula, shown below
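The Wherry formula itself (also reconstructed, since the original equation image is missing) is

$$\hat{\rho}^{2} \;=\; 1 - \frac{N - 1}{N - k - 1}\,(1 - R^{2})$$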
- This equation is what most computer outputs label "adjusted R-squared"
- It's only an intermediate step
31. Statistical Cross-Validation (cont.)
- The Wherry formula adjusts the multiple R for sample size and number of predictors
- It does not address capitalization on chance in the sample used
- The Browne formula must be used in combination with the Wherry formula to address this issue (a combined sketch follows)
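A combined sketch, assuming the reconstructed formulas above: shrink the observed R-squared with the Wherry formula, then feed that estimate into the Browne formula. The function name and example values are hypothetical.

```python
import numpy as np

def browne_cross_validity(R2, N, k):
    """Estimate cross-validity from a single sample: Wherry-adjust R^2,
    then apply the Browne (1975) formula to the adjusted value."""
    rho2 = 1 - (N - 1) / (N - k - 1) * (1 - R2)   # Wherry: "adjusted R-squared"
    rho2 = max(rho2, 0.0)                         # guard against negative estimates
    rho2_cv = ((N - k - 3) * rho2**2 + rho2) / ((N - 2 * k - 2) * rho2 + k)
    return np.sqrt(rho2), np.sqrt(rho2_cv)

# Hypothetical: observed R^2 = .20 with N = 150 and k = 5 predictors
rho, rho_cv = browne_cross_validity(0.20, 150, 5)
print(f"adjusted R = {rho:.2f}, estimated cross-validity = {rho_cv:.2f}")
```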
32. Guidelines for Practice
- Demand cross-validation information before deciding to use specific tests
- Don't confuse adjusted R-squared with the cross-validity coefficient
- Statistical approaches are as accurate as empirical ones in most situations
33. Overall Conclusions
- VG is as fallible as any other data-analysis procedure and should not be the sole method of defending an assessment procedure
- Report significance levels, statistical power, and confidence intervals
- Criteria are dynamic and multi-dimensional
- Are you trying to predict typical or maximum performance?
34. Overall Conclusions
- If a cutoff score is necessary, it should reflect minimum qualifications and be based on a valid assessment procedure
- Every predictive study should include a cross-validation estimate
- Adjusted R-squared is only an intermediate step in the process
35. Science and Practice
- Our guidelines are technically sound, but...
- Technically sound practices are sometimes not adopted (Johns, 1993)
- Technical justification ignores the social and contextual influences that affect adoption
- Crises, politics, regulations, and institutional factors often overshadow technical merit
36. Science and Practice
- Key criterion for practitioners (Muchinsky, 2004): organizational acceptability
- Scientists have much to learn from practitioners about implementation
- If both are willing, even eager, to learn from each other, then we can reduce the gap that separates them!