Test Development and Use: New Twists on Old Questions

1
Test Development and Use: New Twists on Old Questions
  • Wayne F. Cascio
  • PTSC, October 21, 2005

2
Source
  • Cascio, W. F., & Aguinis, H. (2005, Fall). Test
    development and use: New twists on old questions.
    Human Resource Management, 44(3), 219-235.

3
Purpose
  • Update research findings in 5 key areas
  • Validity generalization
  • Statistical significance testing
  • Criterion measures
  • Cutoff scores
  • Cross-validation
  • Distill guidelines for practice
  • Each area is controversial, but there is general
    agreement about the guidelines

4
Validity Generalization
  • Variability across studies in validities might
    not represent genuine differences
  • Of 7 potential sources of variability, sampling
    error is most important
  • Lesson: the mean of several validity coefficients
    is a better basis for inferring validity than is
    any one coefficient (SIOP, 2003)
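A minimal sketch of the point above, using hypothetical validity coefficients and sample sizes: a sample-size-weighted mean validity is computed and compared with the variability expected from sampling error alone (bare-bones, Hunter-Schmidt-style averaging).

    # Hypothetical validity coefficients and sample sizes from 5 studies
    rs = [0.18, 0.31, 0.22, 0.40, 0.25]
    ns = [85, 120, 60, 210, 95]

    n_total = sum(ns)
    # Sample-size-weighted mean validity: a better basis than any single r
    r_bar = sum(n * r for n, r in zip(ns, rs)) / n_total

    # Observed between-study variance vs. variance expected from sampling error
    obs_var = sum(n * (r - r_bar) ** 2 for n, r in zip(ns, rs)) / n_total
    err_var = sum(n * (1 - r_bar ** 2) ** 2 / (n - 1) for n in ns) / n_total

    print(f"Weighted mean validity: {r_bar:.3f}")
    print(f"Observed variance: {obs_var:.4f}; expected from sampling error: {err_var:.4f}")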

5
Legal Status of VG
  • Only 3 cases that relied on VG evidence have
    reached the Appeals-Court level
  • Courts do not always accept VG evidence
  • Atlas Paper Box Co. (1989)
  • Atlas did no job analyses
  • Expert never visited the company
  • Expert argued that measures of G are always valid
  • The 6th Circuit noted: "As a matter of law, Hunter's
    VG theory is totally unacceptable under relevant
    case law and professional standards."

6
VG Guidelines for Practice
  • Provide evidence that jobs and contexts are
    similar to those in a VG study
  • Don't rely on VG as the sole basis for defending
    a test, if challenged
  • Does the VG study describe clearly the predictors
    and criteria being assessed?
  • Does the VG study include all publicly available
    studies in the domain of interest?
  • Are the variables selected/coded based on a
    priori theoretical grounds?

7
VG Guidelines for Practice
  • Did multiple raters apply the coding scheme? Are
    inter-rater reliabilities reported?
  • Does the VG study include all variables analyzed,
    including moderators?
  • Are the characteristics of the included studies
    reported in enough detail for readers to assess
    which generalizations are appropriate?

8
Statistical Significance Testing
  • There is heated debate over the use of null
    hypothesis significance testing
  • Two practical issues
  • How significance tests are used
  • Are significance tests alone sufficient to
    justify using a test, or is additional
    information needed?

9
Use of Statistical Significance Testing
  • Allows us to infer if a null hypothesis of no
    test-criterion relationship is likely to be false
  • Used incorrectly when
  • A result at the .01 level is interpreted as a
    larger effect than one at the .05 level
  • Failure to reject the null hypothesis in the sample
    is interpreted as evidence of no relationship in
    the population

10
Guidelines for Practice
  • Report confidence intervals along with significance
    levels; the interval conveys the practical
    significance of the effect (see the sketch after
    this list)
  • Observed vs. corrected validities
  • Use observed when the objective is to understand
    validity evidence for a specific
    predictor-criterion relationship
  • Use corrected validities when the objective is to
    understand relationships among constructs
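One way to act on the confidence-interval guideline; the validity coefficient and sample size below are hypothetical, and the interval uses the standard Fisher z approximation.

    import math

    # Hypothetical observed validity and sample size
    r, n = 0.28, 120

    # Fisher z transform: z is approximately normal with SE = 1 / sqrt(n - 3)
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    lo, hi = math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se)

    print(f"r = {r:.2f}, approximate 95% CI = [{lo:.2f}, {hi:.2f}]")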

11
Guidelines for Practice (cont.)
  • Be very clear about what significance tests
    provide and what they do not provide (strength of
    effect)
  • Can low correlations still be useful? Yes: utility
    depends on SDy, the selection ratio (SR), and the
    cost of the test (see the note below)
  • Compute statistical power to rule out the possibility
    that the sample was too small to detect an effect
    that was actually present (Type II error)
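For reference on the utility point, the Brogden-Cronbach-Gleser model is the usual framing (notation varies across sources):

    \Delta U = N_s \cdot T \cdot r_{xy} \cdot SD_y \cdot \bar{Z}_s - N_{applicants} \cdot C

where N_s is the number selected, T the average tenure, r_{xy} the validity, \bar{Z}_s the mean standardized predictor score of those selected (a function of the selection ratio), and C the cost of testing one applicant.

For the power bullet, a minimal sketch with hypothetical planning values, using the Fisher z approximation for a test of a zero correlation:

    import math
    from statistics import NormalDist

    # Hypothetical planning values: assumed true validity, sample size, alpha
    rho, n, alpha = 0.20, 80, 0.05

    norm = NormalDist()
    mu = math.atanh(rho) * math.sqrt(n - 3)   # noncentrality on the Fisher z scale
    z_crit = norm.inv_cdf(1 - alpha / 2)

    # Power of the two-sided test of H0: rho = 0
    power = 1 - norm.cdf(z_crit - mu) + norm.cdf(-z_crit - mu)
    print(f"Power to detect rho = {rho} with n = {n}: {power:.2f}")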

12
Measures of Performance (Criteria)
  • Primary standard for choosing a criterion: relevance
  • Can't get useful information if criteria are
    deficient or contaminated
  • Other aspects of criteria
  • Dynamism
  • Typical vs. maximum performance
  • Multi-dimensionality

13
Dynamism of Criteria
  • Does the rank order of individuals on a criterion
    change over time?
  • Keil & Cortina (2001): deterioration of validity
    over time is a ubiquitous phenomenon
  • Guidelines for practice
  • ID and understand variables that cause
    performance to change over time
  • Tests of general vs. narrow ability tend to
    predict criteria for longer periods of time
  • But, changes in complexity and consistency of
    tasks over time reduce validity

14
Dynamism of Criteria - Guidelines
  • In some organizations (e.g., services, fast-paced
    work environments), task performance is likely to
    change over time, while contextual performance
    tends to remain stable
  • Lesson: predictors of contextual performance, more
    so than of task performance, are likely to remain
    valid longer

15
Typical Versus Maximum Performance
  • Measures of maximum performance (what employees can
    do) correlate only slightly with measures of typical
    performance (what employees will do)
  • Do the objectives of a validation study focus on
    typical or maximum performance?
  • Guidelines for practice
  • Selection procedures → maximum performance
  • Criteria → typical performance

16
Typical v. Maximum Performance
  • Lack of alignment in predictor-criterion constructs
    may prevent development of tests with high
    predictive validity
  • The focus of a validation study should be
    determined, in part, by whether it includes a
    measure of typical or maximum performance as a
    criterion

17
Multi-Dimensionality of Criteria
  • Since job performance is multi-dimensional
  • Criterion measures also should be
  • Two-dimensional taxonomy
  • Task performance
  • Contextual performance
  • Guidelines for practice
  • Include both types as criteria
  • Example: work environments where technology
    requires constant learning of new tools in a
    cooperative context (work in teams)

18
Cutoff Scores
  • Sometimes not needed (e.g., top-down selection)
  • State or local governing bodies might require
    them for licensure, certification, promotion, or
    graduation
  • Setting minimum standards
  • Multiple-choice format
  • Angoff procedure

19
Cutoff Scores (cont.)
  • Constructed-response tests
  • Analytical judgment method (AJM)
  • Panelists given examinee work samples for each
    question
  • Ratings for each question: below basic, basic,
    proficient, advanced
  • Within-category ratings: low, middle, high
  • Discussion to resolve discrepancies
  • Re-rating

20
Cutoff Scores (cont.)
  • AJM: to calculate a performance standard for
    "proficient"
  • Work samples rated as low, middle, and high are
    used to calculate an average score (performance
    standard)
  • Process is repeated for basic and advanced
    standards, and across all questions
  • Total assessment standards: sum the per-question
    standards (basic, proficient, advanced) over all
    questions (see the sketch below)
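A schematic, literal reading of the averaging steps above; all scores and panel classifications below are hypothetical.

    # Hypothetical panel data: for each question, work-sample scores grouped
    # by the category the panel assigned them after discussion and re-rating.
    ratings = {
        "Q1": {"basic": [6, 7, 8], "proficient": [11, 12, 13], "advanced": [16, 17, 18]},
        "Q2": {"basic": [5, 6, 7], "proficient": [10, 11, 12], "advanced": [15, 16, 17]},
    }

    def mean(xs):
        return sum(xs) / len(xs)

    # Per-question performance standard for each category = average score
    standards = {q: {cat: mean(scores) for cat, scores in cats.items()}
                 for q, cats in ratings.items()}

    # Total assessment standard: sum the per-question standards per category
    totals = {cat: sum(standards[q][cat] for q in standards)
              for cat in ("basic", "proficient", "advanced")}

    print(standards)
    print(totals)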

21
Cutoff Scores (cont.)
  • AJM has been piloted in 3 states
  • Easy to use; results in cut scores that panelists
    see as appropriate
  • Need more validity data to examine accuracy of
    performance standards

22
Cut Scores Guidelines
  • There is no single best method for all situations
  • Identify critical levels of proficiency on KSAOs
    (e.g., using the AJM)
  • Use a sample of 10-20 SMEs with a broad
    cross-section of experience
  • Consider errors of measurement and adverse impact
    (A.I.)
  • E.g., set the cut score 1 SEM below the mean
    incumbent score (see below)
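For the last bullet, the standard error of measurement is the usual quantity, assuming the test's reliability r_{xx} and observed-score standard deviation are known:

    SEM = SD_x \sqrt{1 - r_{xx}}, \qquad \text{cutoff} = \bar{X}_{incumbents} - 1 \cdot SEM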

23
Cut Scores Guidelines
  • Standard 4.19 of the APA Test Standards (1999):
    provide
  • Description/documentation of method used
  • Training of judges
  • Assessment of inter-rater agreement
  • Set cutoff scores high enough to ensure that
    minimum standards of performance are met!

24
Cross-Validation
  • Predictor-criterion relationships are often
    operationalized using OLS regression
  • Weights are assigned to the predictors to
    minimize the difference between observed and
    predicted criterion scores
  • Prediction is optimized in the sample
  • When weights derived in one sample are applied to a
    second sample from the same population, multiple R
    is likely to shrink

25
Cross-Validation (cont.)
  • Key question: the extent to which weights derived
    from a sample cross-validate (generalize)
  • Can weights derived from one sample predict outcomes
    to the same degree in other samples drawn from the
    same population?

26
Cross-Validation (cont.)
  • Empirical and statistical strategies
  • Empirical: fit a regression model in a sample
  • Use the resulting regression weights with a 2nd,
    independent sample
  • Multiple R in 2nd sample is used as the best
    estimate of the cross-validated correlation
  • Single-sample strategy: use one sample and divide
    it into derivation and cross-validation sub-samples
    (see the sketch below)
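A minimal sketch of the single-sample strategy, using synthetic data and assuming only NumPy: derive OLS weights in one half, apply them to the hold-out half, and correlate predicted with observed criterion scores.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: 3 predictors and a criterion they partly determine
    N, k = 200, 3
    X = rng.normal(size=(N, k))
    y = X @ np.array([0.4, 0.3, 0.2]) + rng.normal(size=N)

    # Split into derivation and cross-validation sub-samples
    half = N // 2
    A_dev = np.column_stack([np.ones(half), X[:half]])
    b, *_ = np.linalg.lstsq(A_dev, y[:half], rcond=None)   # OLS weights

    # Apply derivation-sample weights to the hold-out sub-sample
    y_pred = np.column_stack([np.ones(N - half), X[half:]]) @ b
    cross_validity = np.corrcoef(y_pred, y[half:])[0, 1]

    print(f"Cross-validated R: {cross_validity:.3f}")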

27
Cross-Validation (cont.)
  • Statistical cross-validation
  • Adjust the sample-based multiple R for sample
    size and number of predictors
  • Jackknife method: an all-purpose statistical tool
  • Draw thousands of random sub-samples with replacement
    from the full original sample
  • Compute multiple R for each sub-sample
  • Mean R is best estimate of the cross-validity
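A minimal sketch of the resampling loop just described (repeatedly drawing sub-samples with replacement and averaging the multiple R); the data are synthetic and the resample count is illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic data: 3 predictors and a criterion
    N = 150
    X = rng.normal(size=(N, 3))
    y = X @ np.array([0.4, 0.3, 0.2]) + rng.normal(size=N)

    def multiple_r(Xs, ys):
        A = np.column_stack([np.ones(len(ys)), Xs])
        b, *_ = np.linalg.lstsq(A, ys, rcond=None)
        return np.corrcoef(A @ b, ys)[0, 1]

    # Draw sub-samples with replacement and average the multiple R
    rs = []
    for _ in range(2000):
        idx = rng.integers(0, N, size=N)      # resample cases with replacement
        rs.append(multiple_r(X[idx], y[idx]))

    print(f"Mean resampled multiple R: {np.mean(rs):.3f}")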

28
Empirical Approaches
  • Single-sample designs sacrifice degrees of freedom
    and lead to unstable regression weights
  • Multiple-sample designs require more time and
    effort
  • They yield accurate estimates only if the validation
    sample represents the population and is large
    relative to the number of predictors

29
Statistical (Formula-Based) Approaches
  • More cost-effective to implement
  • Raju et al. (1999) investigated 11 cross-validity
    estimation procedures
  • If N > 40, the most accurate formula is Browne's
    (1975), shown below
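In its commonly cited form, Browne's (1975) estimate of the squared cross-validity is

    \hat{\rho}_{cv}^{2} = \frac{(N - k - 3)\,\hat{\rho}^{4} + \hat{\rho}^{2}}{(N - 2k - 2)\,\hat{\rho}^{2} + k}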

30
Statistical Cross-Validation (cont.)
  • where ρ = the population multiple correlation, N =
    sample size, and k = number of predictors
  • ρ² is estimated as the squared multiple correlation
    in the population (Wherry, 1931; formula shown after
    this list)
  • The Wherry equation is what most computer outputs
    label "adjusted R-squared"
  • It's only an intermediate step
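The Wherry (1931) adjustment referenced above, in the form most software reports as adjusted R-squared:

    \hat{\rho}^{2} = 1 - \frac{N - 1}{N - k - 1}\left(1 - R^{2}\right)

Its value is then inserted as \hat{\rho}^{2} in Browne's formula to obtain the cross-validity estimate.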

31
Statistical Cross-Validation (cont.)
  • Adjusts multiple R for sample size and number of
    predictors
  • Does not address capitalization on chance in the
    sample used
  • Must use the Browne formula in combination with the
    Wherry formula to address this issue

32
Guidelines for Practice
  • Demand cross-validation information before
    deciding to use specific tests
  • Don't confuse adjusted R-squared with the
    cross-validity coefficient
  • Statistical approaches are as accurate as empirical
    ones in most situations

33
Overall Conclusions
  • VG is as fallible as any other data-analysis
    procedure and should not be the sole method of
    defending an assessment procedure
  • Report significance levels, statistical power,
    and confidence intervals
  • Criteria are dynamic and multi-dimensional
  • Are you trying to predict typical or maximal
    performance?

34
Overall Conclusions
  • If a cutoff score is necessary, it should reflect
    minimum qualifications and be based on a valid
    assessment procedure
  • Every predictive study should include a
    cross-validation estimate
  • Adjusted R-squared is only an intermediate step
    in the process

35
Science and Practice
  • Our guidelines are sound technically, but
  • Technically sound practices are sometimes not
    adopted (Johns, 1993)
  • Technical justification ignores social and
    contextual influences that affect adoption
  • Crises, politics, regulations, and institutional
    factors often overshadow technical merit

36
Science and Practice
  • Key criterion for practitioners (Muchinsky,
    2004)
  • Organizational acceptability
  • Scientists have much to learn from practitioners
    about implementation
  • If both are willing, even eager, to learn from
    each other, then
  • We can reduce the gap that separates them!