Title: MOS 384a - Reliability and Validity
1MOS 384a - Reliability and Validity
- Overview
- Intro and some basic terms
- Basics of psychometric theory
- Reliability
- Validity
- Excursus: Validity generalization
- Applicant's perspective
2MOS 384a - Reliability and Validity
- Readings
- Textbook (CWHM) Chapter 2
- Present slides, and your notes
3MOS 384a - Reliability and Validity
- How do you know how well your selection system works?
- Inappropriate ways to evaluate the system
- Trust in test publishers' glossy brochures
- Review of test items/content based on common sense
- Anecdotal evidence
- Doing things the way they have always been done
4MOS 384a - Reliability and Validity
- OK, but how do you know it then?
- Recruitment and Selection as a System (see CWHM, Fig. 2.1, p.25)
- Constructing and evaluating a selection system that works is a scientific process.
- You develop hypotheses on what may work (based on prior experience and theory) and
- ...test them empirically.
- The empirical evaluation is called validation.
5MOS 384a - Reliability and Validity
- Some Basic Terms
- (Psychological) Construct: an unobservable quality that needs to be inferred from observable measures
- KSAO Constructs
- Knowledge: can be declarative (facts) or procedural (How to?)
- Skills: being able to perform a certain task (manually and/or mentally)
- Abilities: more general/abstract constructs that facilitate acquisition of knowledge/skills
- Other characteristics: personality traits, attitudes, etc., that are not directly related to cognitive or physical abilities but are also important for performance on the job
6MOS 384a - Reliability and Validity
- Science- vs. Practice-Based Selection (see CWHM,
Table 2.1, p.28)
7MOS 384a - Reliability and Validity
- The Basics of Psychometrics
- If you observe the same thing in a group of people, you get a distribution
- Distributions are most easily described by a central tendency (e.g., the arithmetic mean), and the variation around it (e.g., the standard deviation)
8MOS 384a - Reliability and Validity
- The Basics of Psychometrics
- If you observe more than one thing each of which
varies, there will be covariation (e.g., a
correlation r) between the variables
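The three descriptive statistics named on these slides (mean, standard deviation, correlation r) can be sketched in a few lines of Python. The scores below are hypothetical and only serve as an illustration; the function names are not from the course material.

```python
import math

def mean(values):
    """Arithmetic mean (central tendency)."""
    return sum(values) / len(values)

def std_dev(values):
    """Sample standard deviation (variation around the mean, n - 1 in the denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(xs, ys):
    """Pearson correlation r (covariation between two variables)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)

# Hypothetical data: five applicants measured on a test and on a rating
test_scores = [10, 12, 14, 16, 18]
ratings = [2, 3, 3, 5, 5]

print("mean:", mean(test_scores))
print("SD:", round(std_dev(test_scores), 3))
print("r:", round(pearson_r(test_scores, ratings), 3))
```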
9MOS 384a - Reliability and Validity
- The Basics of Psychometrics
- KSAOs are often measured by psychological tests.
- Psychological tests are based on psychometric theory.
- (Classic) Psychometric theory involves a number of assumptions or axioms
- An observed test score ($x$) is composed of a true score ($t$) and an error ($e$): $x = t + e$
- There is nothing systematic in the error component, which means that we expect errors to cancel each other out the more often we measure: $\mu(e) = 0$
- The error component is NOT correlated with the true score: $r(t, e) = 0$
- nor with the true score on other variables: $r(t_2, e_1) = 0$
- nor with the error in a repeated measure: $r(e_1, e_2) = 0$
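The assumptions above can be illustrated with a small simulation. This is only a sketch: the normal distributions, the means and standard deviations (100/15 for true scores, 0/5 for errors), and the sample size are arbitrary assumptions chosen for illustration, not values from the course.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 20000

# Assumed distributions (illustrative only)
true_scores = [rng.gauss(100, 15) for _ in range(n)]
errors = [rng.gauss(0, 5) for _ in range(n)]

# Observed score = true score + error
observed = [t + e for t, e in zip(true_scores, errors)]

mean_error = sum(errors) / n              # should be close to 0
r_te = pearson_r(true_scores, errors)     # should be close to 0

print("mean error:", round(mean_error, 3))
print("r(t, e):", round(r_te, 3))
```

Because the errors are drawn independently of the true scores, both printed values fluctuate around zero and shrink further as the number of simulated measurements grows.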
10MOS 384a - Reliability and Validity
- The Basics of Psychometrics
- However, in reality, things are often slightly more complicated. For example:
- Observed test scores ($x$) often reflect other systematic components ($t'$) in addition to the true score ($t$) and random error ($e$): $x = t + t' + e$
- These additional systematic components can have undesirable properties, such as being correlated with the true score, not being cancelled out in repeated measurements, etc.
11MOS 384a - Reliability and Validity
- Reliability: The first concept of psychometric quality
- Definition: Reliability is the degree to which a test is free of random measurement error: $r_{tt} = s_t^2 / (s_t^2 + s_e^2)$
- If we know the reliability, we know the extent to which a test measures something
- We still don't know the degree to which the test measures what it should measure
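As a worked example of the definition, reliability can be computed as the share of true-score variance in total variance. The variance values below are hypothetical; in practice the true-score variance is not directly observable and reliability has to be estimated.

```python
def reliability(true_variance, error_variance):
    """Reliability as true-score variance divided by total (true + error) variance."""
    return true_variance / (true_variance + error_variance)

# Hypothetical variances: true-score SD of 15, error SD of 5
r_tt = reliability(true_variance=15 ** 2, error_variance=5 ** 2)
print("reliability:", r_tt)
```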
12MOS 384a - Reliability and Validity
- Factors Affecting Reliability
- Temporary individual characteristics (e.g., mood, physical or psychological well-being)
- Lack of standardization (e.g., differing conditions under which a test is administered, differences between questions asked in an interview)
- Chance (e.g., guessing on a knowledge or intelligence test, differences in prior experience with a test)
- Lack of comprehensiveness (e.g., too few items in a test, inadequate scale format)
13MOS 384a - Reliability and Validity
- Methods of Estimating Reliability
- Test-Retest Reliability: Administer the same test twice and correlate the two measurements
- Internal Consistency: Take different parts (items, halves) of the test and correlate them with each other
- Alternate Forms: Construct two equivalent versions of the same test and correlate them
- Inter-Rater Reliability: Let two or more persons assess the same ratee on the same variables and correlate the raters' evaluations
- In essence, all forms of reliability estimation yield a test's correlation with itself: $r_{tt}$
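Two of the estimation methods above (test-retest and a split-half variant of internal consistency) can be sketched as simple correlations. The item scores and retest totals are made-up data, and the odd/even split is just one assumed way of forming halves.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)

# Hypothetical data: rows are persons, columns are four item scores (1-5)
item_scores = [
    [4, 5, 4, 5],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
# Hypothetical total scores of the same persons on a second administration
retest_totals = [17, 8, 12, 19, 7]

totals = [sum(row) for row in item_scores]

# Test-retest: correlate total scores from the two administrations
test_retest = pearson_r(totals, retest_totals)

# Split-half: correlate the sum of items 1 and 3 with the sum of items 2 and 4
first_half = [row[0] + row[2] for row in item_scores]
second_half = [row[1] + row[3] for row in item_scores]
split_half = pearson_r(first_half, second_half)

print("test-retest:", round(test_retest, 3))
print("split-half:", round(split_half, 3))
```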
14MOS 384a - Reliability and Validity
- Validity: The core concept of psychometric quality
- Definition: Validity is the degree to which inferences or interpretations based on a test score for a specific purpose are justified.
- Validity is NOT a property of the test but of the inferences based on the test
- Reliability is a necessary but NOT a sufficient precondition of validity.
15MOS 384a - Reliability and Validity
- Approaching Validity from Different Angles
- Validity is a unitary concept. There are no multiple validities.
- However, there are many different ways to approach validity. Not all of them are equally suitable in every instance. It all depends on the inferences you wish to make
16MOS 384a - Reliability and Validity
- The Classic Distinction between 3 Validity Concepts
- Content validation: draw inferences from a test score to a larger domain of similar content
- Emphasis on representativeness for the domain
- Established usually through expert ratings
- Example: work sample test
17MOS 384a - Reliability and Validity
- The Classic Distinction between 3 Validity Concepts
- Construct validation: draw inferences from a test score to a psychological construct
- Emphasis on relations between empirical measurement and theoretical constructs
- Established through a wide range of means. For example, a test should correlate highly with other measures of the same or similar constructs (convergent validity) and low with measures of conceptually distinct constructs (discriminant validity)
- Examples: cognitive ability test, personality test
- NOTE: CWHM are misleading on this issue. Evidence of construct validity is often based on relations to other variables.
18MOS 384a - Reliability and Validity
- The Classic Distinction between 3 Validity Concepts
- Criterion-related validation: draw inferences from a test score to outside variables
- Emphasis on prediction of outside variables, in personnel selection typically job performance
- Established through criterion-related validation studies. The criteria can be measured at the same time as the predictor (concurrent validation) or at a later point in time (predictive validation).
- Examples: important for any kind of selection device. However, some instruments (e.g., some kinds of biodata questionnaires) rely almost exclusively on evidence of criterion-related validity.
19MOS 384a - Reliability and Validity
- Factors Affecting Validity Coefficients
- Measurement error: Reliability places an upper limit on validity ($r_{pt} \le \sqrt{r_{tt}}$)
- Range restriction: If we employ job incumbents to validate a selection procedure, the current employees are likely to be more similar to each other than the members of the original applicant pool.
- Sampling error: Sample sizes in validation studies are often so small that the empirical coefficient becomes an imprecise estimate of the actual population coefficient
- Differences between the situation in the validation study and the actual selection situation
- Flaws in criterion measurement
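Two of these factors can be made concrete with a short sketch: the upper limit that reliability places on validity, and the shrinkage of a validity coefficient when only a selected subgroup is studied. The simulation settings (a population correlation of 0.5, 5000 applicants, the top 30% on the predictor treated as incumbents) are assumptions for illustration only.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)

def max_validity(reliability):
    """Upper limit on validity implied by a test's reliability (square root of r_tt)."""
    return math.sqrt(reliability)

print("max validity at r_tt = 0.81:", max_validity(0.81))

# Range restriction: simulate an applicant pool with an assumed correlation of 0.5
rng = random.Random(7)  # fixed seed so the run is repeatable
rho, n = 0.5, 5000
applicants = []
for _ in range(n):
    predictor = rng.gauss(0, 1)
    criterion = rho * predictor + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    applicants.append((predictor, criterion))

# Treat the top 30% on the predictor as the hired "incumbents"
applicants.sort(key=lambda pair: pair[0], reverse=True)
incumbents = applicants[: n * 3 // 10]

r_pool = pearson_r([p for p, _ in applicants], [c for _, c in applicants])
r_incumbents = pearson_r([p for p, _ in incumbents], [c for _, c in incumbents])

print("r in applicant pool:", round(r_pool, 3))
print("r among incumbents:", round(r_incumbents, 3))
```

The coefficient among incumbents comes out clearly lower than in the full pool, because selecting on the predictor reduces its variance.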
20MOS 384a - Excursus Validity Generalization
- VG Overview
- 1. Why Do VG and Other Metaanalyses?
- 1.1 Narrative Review vs. Metaanalysis
- 1.2 What does it mean to us?
- 2. Conducting a VG Study
- 2.1 Research Question
- 2.2 Literature Search
- 2.3 Coding Individual Studies
- 2.4 Computations
- 3. Some Critical Remarks
21MOS 384a - Excursus Validity Generalization
- Why do metaanalyses?
- From narrative review to metaanalysis
- The idea of replication: Two studies are better than one (and three or more are even better).
- There are often many studies on a topic. If so, how to make sense of them overall?
- 2 possible ways
- Combine them intuitively
- Combine them statistically
22MOS 384a - Excursus Validity Generalization
- Major Differences between Narrative and Metaanalytic Review
- Narrative Review Approach
- Intuitive and implicit weighting of study outcomes or count of significant/non-significant results
- -> subjective summary
- Problem: Real effects are often underestimated because statistical artefacts are not taken into account
- Metaanalytic Approach
- Objective and quantitative weighting of study outcomes
- -> quantitative summary of mean and variation of effect sizes
- Statistical artefacts are systematically investigated
23MOS 384a - Excursus Validity Generalization
- Major Difference between VG and Single Validation Study
- VG based on (much) more comprehensive data
- VG delivers additional useful information
- Estimate of the true (population) value of validity coefficients ($\rho$)
- Estimate of the true size of the variation around $\rho$ after correcting for statistical artefacts (sampling error, and often also measurement error, range restriction, etc.)
- Helps to identify systematic sources of variation across study findings (called moderators or subgroups, often followed by moderator analyses)
24MOS 384a - Excursus Validity Generalization
- How a VG Study is Conducted
- Research Question
- Often not hypothesis testing but rather exploratory (What is the validity of method X in predicting job performance?)
- Requires exactly demarcating and structuring the field of research to end up with generalizable results
25MOS 384a - Excursus Validity Generalization
- Literature Search
- Goal: Find all studies that fit your research question
- Where to search
- Reference sections of prior narrative and metaanalytic reviews
- Keyword search (vary search terms!) in electronic databases (e.g., PsycINFO, PSYNDEX, ABI/INFORM, Sociofile, Dissertation Abstracts; not Google)
- Systematic manual search in relevant journals
- Contacting researchers, institutes, business organizations with requests for unpublished studies
- Which studies to drop
- Missing information (but don't forget to ask the authors!)
- For conceptual reasons (lack of quality or relevance for the research question needs to be substantiated!)
26MOS 384a - Excursus Validity Generalization
- Coding of Individual Studies
- Standard: source, sample size, statistical artefacts (if reported)
- Assignment to subgroups according to planned moderator analyses, for example
- Sample characteristics (e.g., students vs. managers)
- Predictor characteristics (e.g., interview structure)
- Criterion characteristics (e.g., performance ratings vs. objective indicators)
- Design characteristics (e.g., predictive vs. concurrent validation)
- Source characteristics (e.g., published vs. unpublished)
- ...
27MOS 384a - Excursus Validity Generalization
- Computations
- (according to the VG method by Hunter & Schmidt, 1990, 2005)
- There are K correlation coefficients and N persons entered into computations (important: correlations must be from independent samples; otherwise, compute a mean across correlations within a single study)
- It generally applies (all else being equal) that the larger K and N, the more meaningful are the results of a VG study
28MOS 384a - Excursus Validity Generalization
- Computational Steps
- (1) Mean uncorrected correlation: $\bar{r}_o = \sum N_i r_{oi} / \sum N_i$
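A minimal sketch of this first step, the sample-size-weighted mean of the observed correlations. The three studies below are hypothetical, and the function name is an illustrative choice.

```python
def weighted_mean_r(correlations, sample_sizes):
    """Mean observed correlation, each study weighted by its sample size N_i."""
    total_n = sum(sample_sizes)
    return sum(n * r for r, n in zip(correlations, sample_sizes)) / total_n

# Hypothetical studies: observed validity coefficients and sample sizes
observed_rs = [0.20, 0.30, 0.40]
sample_ns = [50, 100, 250]

mean_r_o = weighted_mean_r(observed_rs, sample_ns)
print("mean uncorrected correlation:", round(mean_r_o, 3))
```

The largest study pulls the mean toward its own value, which is why the result lies above the simple average of the three coefficients.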
29MOS 384a - Excursus Validity Generalization
- Computational Steps
- (2) Correcting for artefacts (each coefficient individually)
- Measurement error (attenuation correction): $r_{xy} / (r_{xx} r_{yy})^{1/2}$. Note: correcting for attenuation in the criterion only leads to an estimate of operational (practical) validity; correcting both predictor and criterion for attenuation estimates the true relationship between constructs. If the reliability is not reported in individual studies, it has to be estimated from known information
- Range restriction or enhancement: Divide the study's standard deviation (SD) by the population SD. Note: It's often hard to estimate the population SD
- Compute the product of the individual corrections per study
- Divide each observed correlation by its respective product of corrections (which is usually < 1, so $r_c > r_o$).
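The correction step can be sketched as follows, assuming the reliabilities are known for the study. The `range_factor` parameter is a hypothetical placeholder for an already computed range-restriction factor; it is simply multiplied in here, which follows the simplified description on the slide rather than the full Hunter and Schmidt procedure. All numbers are illustrative.

```python
import math

def correction_product(rxx=1.0, ryy=1.0, range_factor=1.0):
    """Product of the individual correction factors for one study.

    rxx, ryy: predictor and criterion reliability (1.0 means 'not corrected').
    range_factor: assumed, pre-computed range-restriction factor (hypothetical input).
    """
    return math.sqrt(rxx * ryy) * range_factor

def correct_correlation(r_obs, rxx=1.0, ryy=1.0, range_factor=1.0):
    """Divide the observed correlation by the study's product of corrections."""
    return r_obs / correction_product(rxx, ryy, range_factor)

r_obs = 0.30

# Operational validity: correct for unreliability in the criterion only
operational = correct_correlation(r_obs, ryy=0.60)

# True-score correlation: correct for unreliability in predictor and criterion
true_score = correct_correlation(r_obs, rxx=0.80, ryy=0.60)

print("operational validity:", round(operational, 3))
print("true-score correlation:", round(true_score, 3))
```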
30MOS 384a - Excursus Validity Generalization
- Computational Steps
- (3) Computing the mean corrected correlation (true score correlation)
- Each individually corrected correlation is weighted with the product of its sample size times the squared product of corrections (see above; i.e., larger samples and less flawed studies receive larger weights)
- Compute the true score correlation (estimate of the population correlation $\rho$): $\rho = \sum w_i r_{ci} / \sum w_i$
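The weighting described in this step can be sketched like this. The two studies and their correction products are hypothetical, and the function names are illustrative.

```python
def study_weights(sample_sizes, correction_products):
    """Weight per study: sample size times the squared product of corrections."""
    return [n * a ** 2 for n, a in zip(sample_sizes, correction_products)]

def mean_corrected_r(corrected_rs, weights):
    """Weighted mean of the individually corrected correlations."""
    return sum(w * r for w, r in zip(weights, corrected_rs)) / sum(weights)

# Hypothetical studies
corrected_rs = [0.40, 0.50]
sample_ns = [100, 200]
products = [0.8, 0.5]   # smaller product = stronger artefacts = more flawed study

weights = study_weights(sample_ns, products)
rho_hat = mean_corrected_r(corrected_rs, weights)

print("weights:", weights)
print("mean corrected correlation:", round(rho_hat, 4))
```

In this example the second study has twice the sample size but the smaller weight, because its coefficient needed a much larger correction.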
31MOS 384a - Excursus Validity Generalization
- Computational Steps
- (4) Estimate the variance that remains after correcting for artefacts
- Compute the variance of the corrected coefficients: $\mathrm{Var}(r) = \sum w_i (r_{ci} - \rho)^2 / \sum w_i$
- Compute the variance accounted for by artefacts ($\mathrm{Var}(e)$): It becomes larger (a) the smaller individual samples are, (b) the larger the artefacts are, (c) the smaller the observed correlations are
- Compute the variance not accounted for by artefacts: $\mathrm{Var}(r) - \mathrm{Var}(e)$ (Note: This difference can become negative; it will then be assumed to be zero)
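A sketch of this step under one simplifying assumption: the artefact variance Var(e) is passed in as a given number, because the slides do not state its formula. The input values are hypothetical.

```python
def weighted_variance(corrected_rs, weights, rho_hat):
    """Weighted variance of the corrected coefficients around the mean rho_hat."""
    squared_devs = sum(w * (r - rho_hat) ** 2 for r, w in zip(corrected_rs, weights))
    return squared_devs / sum(weights)

def residual_variance(var_r, var_e):
    """Variance not accounted for by artefacts; negative differences are set to zero."""
    return max(var_r - var_e, 0.0)

# Hypothetical inputs: two equally weighted corrected coefficients
var_r = weighted_variance([0.30, 0.50], [1.0, 1.0], rho_hat=0.40)

print("Var(r):", round(var_r, 4))
print("residual, Var(e) = 0.004:", round(residual_variance(var_r, 0.004), 4))
print("residual, Var(e) = 0.020:", residual_variance(var_r, 0.020))
```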
32MOS 384a - Excursus Validity Generalization
- Computational Steps
- (5) Examine the generalizability of the mean validity
- (a) 75% rule: If at least 75% of the variance of the corrected coefficients is accounted for by artefacts ($\mathrm{Var}(e) / \mathrm{Var}(r) \ge 0.75$), the validity is said to be generalizable. That is, there are no substantial differences between the situations in which the single validity coefficients were observed
- (b) Credibility interval (CV): If the 90% CV ($\rho \pm 1.64 \, (\mathrm{Var}(r) - \mathrm{Var}(e))^{1/2}$) does not include zero or if it is relatively small, the validity is said to be generalizable (Note: The CV is not a confidence interval, which could also be computed and tells you about the accuracy of the estimate for $\bar{r}_o$)
- -> If there is still substantial variance after correcting for artefacts, this could be taken as evidence of the existence of moderator variables (subgroups with differing population means)
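Both generalizability checks can be sketched in a few lines. The mean validity and the two variances below are hypothetical, and the code assumes Var(r) is greater than zero.

```python
import math

def passes_75_rule(var_r, var_e):
    """True if artefacts account for at least 75% of the variance (assumes var_r > 0)."""
    return var_e / var_r >= 0.75

def credibility_interval_90(rho_hat, var_r, var_e):
    """90% credibility interval: rho_hat +/- 1.64 times the residual SD."""
    residual_sd = math.sqrt(max(var_r - var_e, 0.0))
    return (rho_hat - 1.64 * residual_sd, rho_hat + 1.64 * residual_sd)

# Hypothetical results of a VG study
rho_hat, var_r, var_e = 0.40, 0.010, 0.008

lower, upper = credibility_interval_90(rho_hat, var_r, var_e)
print("75% rule satisfied:", passes_75_rule(var_r, var_e))
print("90% CV:", (round(lower, 3), round(upper, 3)))
print("CV excludes zero:", lower > 0)
```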
33MOS 384a - Excursus Validity Generalization
- Computational Steps
- (6) If Applicable: Moderator Analyses
- Create subgroups of studies according to previously specified criteria (participant groups, predictor variants, criterion measures, etc.)
- Compute a new VG study for each group individually
- Decisive factor: Does the variance not accounted for by artefacts decrease substantially in subgroup analyses? If so (cf. 75% rule, CV), the mean validity in each group can be interpreted as this group's population value.
- Problem: Second-order sampling error. Every single moderator analysis contains fewer data than the overall VG study. Therefore, values found in moderator analyses are more prone to be affected by atypical studies and less reliable.
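A minimal sketch of the subgroup step: studies are grouped by a previously coded moderator and a weighted mean validity is computed per group. The moderator labels, coefficients, and weights are hypothetical, and a full moderator analysis would repeat all VG steps within each subgroup rather than only the mean.

```python
from collections import defaultdict

def subgroup_means(studies):
    """Weighted mean corrected correlation per moderator subgroup.

    studies: list of (moderator_label, corrected_r, weight) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # label -> [sum of w * r, sum of w]
    for label, r, w in studies:
        sums[label][0] += w * r
        sums[label][1] += w
    return {label: wr / w for label, (wr, w) in sums.items()}

# Hypothetical coded studies, moderator: interview structure
studies = [
    ("structured", 0.50, 100),
    ("structured", 0.60, 100),
    ("unstructured", 0.20, 50),
    ("unstructured", 0.30, 150),
]

means = subgroup_means(studies)
for label, value in means.items():
    print(label, round(value, 3))
```

Each subgroup here rests on only two studies, which illustrates the second-order sampling error problem: a single atypical study would shift a subgroup mean considerably.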
34MOS 384a - Excursus Validity Generalization
- An Example (Hülsheger et al., in press)
35MOS 384a - Excursus Validity Generalization
- Some Critical Remarks on Metaanalysis
- (1) Publication Bias
- Effect sizes overstate true effects, because null findings have lower chances of getting published
- Plausible, but: Empirical comparisons between published and unpublished studies have often shown negligible differences; null findings can be due to poor quality of the research
- (2) Apples-and-Oranges Problem
- Metaanalysts tend to lump together studies that are hardly comparable
- Maybe, but: Metaanalysis provides you with the means to uncover substantial differences between studies and quantify them in moderator analyses
36MOS 384a - Excursus Validity Generalization
- (3) Lawnmower Method
- Metaanalyses obscure the particularities of individual studies by summarizing them all in a single statistical value
- Yes, but: Metaanalysis is an alternative to the narrative review, not to primary empirical studies; if you want to learn about the details of a particular primary study, you have to go back to the original source
- (4) Over-Interpretation
- Metaanalyses are often considered to be the final word on a subject, which can terminate research interest in this issue
- Can be true, but: If so, it can be a blessing or a curse: the former if resources were otherwise wasted on matters that can be closed, the latter if conclusions based on metaanalyses turn out to be wrong or deficient
37MOS 384a - Excursus Validity Generalization
- Conclusion
- VG and other methods of metaanalysis are powerful
tools for making sense of the often confusing
volume of apparently contradictory findings in
heavily researched fields. They are not to be
seen as machines that automatically generate the
truth about empirical questions.
38MOS 384a - Reliability and Validity
- Considering the Applicant's Perspective: Bias, Fairness, and Acceptability
- Bias: Systematic errors in measurement related to identifiable group membership characteristics (e.g., sex, age, and many more)
- Fairness: The principle that every applicant should be assessed in an equitable manner. Fairness is based on judgment and often involves processes of negotiation in a society/group.
- Acceptability: An applicant's individual perception of a selection procedure as being fair, valid, useful, etc. Includes attitudinal and behavioral reactions to being exposed to the procedure (e.g., likelihood of accepting a job offer).
39MOS 384a - Reliability and Validity
- According to organizational justice theory (Gilliland, 1993), applicants accept selection if it satisfies
- Distributive Justice: Selection decisions based on accepted standards (merit, need, ...)
- Procedural Justice: Adherence to rules of structural (e.g., perceived validity, equal administration), informational (communication during and after process), and interpersonal (respect, privacy) justice.
- Discuss implications for practice!