Title: Some Concepts in Evidence Evaluation
1. Some Concepts in Evidence Evaluation
- Robert J. Mislevy
- University of Maryland
- October 10, 2003
2. Messick (1994) quote
- Begin by asking what complex of knowledge, skills, or other attributes should be assessed...
- Next, what behaviors or performances should reveal those constructs, and what tasks or situations should elicit those behaviors?
- Thus, the nature of the construct guides the selection or construction of relevant tasks as well as the rational development of construct-based scoring criteria and rubrics.
3. Evidence-centered design models
- The task model includes specifications for the work product that will be captured.
- The task model includes specifications of the conditions for performance.
4. Evidence-centered design models
- Evaluation rules specify what the observables are and how they are determined from the work product.
- The statistical portion of the evidence model(s) explicates which observables depend on which student model (SM) variables.
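As a hedged illustration of how these specifications might be organized in code, here is a minimal sketch; the class and field names (TaskModel, EvidenceModel, sm_dependencies, etc.) are hypothetical, not part of the ECD literature.

```python
from dataclasses import dataclass

@dataclass
class TaskModel:
    # Specifications for the work product(s) to be captured
    work_product_specs: list[str]            # e.g., ["menu selections", "final plan"]
    # Specifications of the conditions for performance
    performance_conditions: dict[str, str]   # e.g., {"time_limit": "30 minutes"}

@dataclass
class EvidenceModel:
    # Evaluation rules: observable name -> rule for deriving it from the work product
    evaluation_rules: dict[str, str]
    # Statistical part: observable name -> the SM variables it depends on
    sm_dependencies: dict[str, list[str]]
```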
5. Key Concepts (1)
- Conceptual vs. Mechanical distinction
  - Domain modeling vs. CAF
- Product vs. Process
  - E.g., HYDRIVE, multiple choice, math problems
- High Inference vs. Low Inference
  - High: AP Studio Art
  - Low: multiple-choice, carrying out DISC scoring rules
- Automated vs. Human Scoring
6. Key Concepts (2)
- The role of rubrics
  - Rubrics are instructions for humans
- The roles of examples
  - Rubrics are not enough for high-inference evaluation
  - Important not only to raters, but to students and teachers
- Importance of communicating the rules of the game to the examinee
  - (Note the relevance of the sociocultural perspective)
7. Rubrics for two observable variables in the BEAR assessment "Issues, Evidence, and You"
[Figure: rubrics reproduced from Mislevy, Wilson, Ercikan, & Chudowsky (2003), Psychometric principles in student assessment]
8. What is performance assessment?
- The new kinds of tasks are distinguished from MC tasks in a number of ways, some of which are present in some so-called performance tasks but not others (Wiley & Haertel, p. 63):
  - More complex, longer to perform.
  - Attempt to measure multiple, complex, integrated knowledge and capabilities.
  - Tasks nowhere near interchangeable. (Require methods for extracting multiple bits of evidence from single performances and integrating them across tasks into complex aggregates.)
9. Possible loci of interest
- Complex interactions between examinee and assessment?
- Extended, multi-part activities? (NBPTS)
10. Possible loci of interest
- Complex work product captured? (AP Art)
- Info about process as well as production passed on to evidence identification (EI)? (HYDRIVE)
11. Possible loci of interest
- Complex process (more than objective scoring) to evaluate the work product?
  - Human judgment (AP Art), automated process (Clauser et al. re NBME)?
- Importance of the washback effect (Frederiksen & Collins; Wolf et al.)
12. Possible loci of interest
- More than just right/wrong observable variables? (AP Art rating scales)
- Multiple aspects of complex performance captured?
  - (Language testing of speaking: fluency and accuracy, which trade off)
13. Possible loci of interest
- Multivariate student model, with different aspects of skill and knowledge informed by different observables? (Wiley & Haertel's emphasis; our examples include HYDRIVE and DISC)
14. The DISC Student Model
[Figure: the DISC student model. Student model variables are of persisting interest over multiple tasks.]
15. The Statistical Part of a DISC Evidence Model
[Figure, showing three kinds of variables:]
- SM variables involved in scenarios written to this task model
- A variable to account for conditional dependence among observables that are evaluations of aspects of the same complex performance
- Observables that evaluate key aspects of performance in scenarios written from this task model (scored by human or automated means)
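One way to picture this structure is as a directed graph over the variables. The sketch below uses hypothetical variable names, not the actual DISC model; the point is only the shape of the dependencies.

```python
# Parent sets for each node in a (hypothetical) evidence-model graph.
parents = {
    # SM variables: of persisting interest across tasks
    "InfoGathering": [],
    "TreatmentPlanning": [],
    # Context variable: a shared, scenario-specific influence that accounts
    # for conditional dependence among observables from the same performance
    "ScenarioContext": [],
    # Observables: each depends on one or more SM variables plus the context
    "AdequacyOfHistory": ["InfoGathering", "ScenarioContext"],
    "Individualization": ["InfoGathering", "TreatmentPlanning", "ScenarioContext"],
}
# Given only the SM variables, the two observables remain dependent (they
# share ScenarioContext); conditioning on the context variable as well
# renders them independent, which is what the extra variable buys us.
```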
16. What does the DISC simulator presentation process capture as work products?
- Examinees can (with varying degrees of accuracy and completeness):
  - Choose procedures that provide information or produce an observable effect
  - Provide rationales for actions via a large menu-driven faux insurance form
  - Identify important patient characteristics used to guide treatment, again from the large menu-driven faux insurance form
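A captured work product, then, might be represented along these lines; the field names below are hypothetical stand-ins for whatever the actual DISC capture format records.

```python
from dataclasses import dataclass, field

@dataclass
class DiscWorkProduct:
    # Procedures the examinee chose (information-gathering or interventions)
    procedures: list[str] = field(default_factory=list)
    # Rationales selected from the menu-driven faux insurance form
    rationales: list[str] = field(default_factory=list)
    # Patient characteristics the examinee flagged as guiding treatment
    patient_characteristics: list[str] = field(default_factory=list)
```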
17. How is evidence evaluated, given the examinee's performance?
Rules evaluate essential characteristics of examinee behavior.
Example 1: Adequacy of examination procedures
1. IF the Rationale Product contains
      Chief complaint AND Health history review
   THEN Adequacy of history procedures = "performed all essential history procedures"
   ELSE IF the Rationale Product contains one, but not both, essential procedures
   THEN Adequacy of history procedures = "performed some essential history procedures"
   ELSE IF the Rationale Product contains neither essential procedure
   THEN Adequacy of history procedures = "did not perform essential history procedures"
18. How is evidence evaluated, given the examinee's performance?
Rules evaluate essential characteristics of examinee behavior.
Example 2: Individualization of procedures
1. IF the Rationale Product contains
      Follow-up questions (duration of canker sore, when gums bleed, weight loss) AND
      Dentition assessment (visual with mirror) AND
      Periodontal assessment (visual with mirror)
   THEN Individualization of procedures = "performed all essential individualized procedures"
   ELSE IF the Rationale Product contains 50-80% of the individualized procedures
   THEN Individualization of procedures = "performed some essential individualized procedures"
   ELSE IF the Rationale Product contains <50% of the individualized procedures
   THEN Individualization of procedures = "did not perform essential individualized procedures"
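A minimal executable sketch of these two rules follows; the rationale strings and function names are hypothetical stand-ins for the actual DISC menu entries and scoring code.

```python
def adequacy_of_history(rationales: set[str]) -> str:
    """Example 1: two essential history procedures."""
    essential = {"chief complaint", "health history review"}
    found = essential & rationales
    if found == essential:
        return "performed all essential history procedures"
    elif found:  # one, but not both
        return "performed some essential history procedures"
    else:        # neither
        return "did not perform essential history procedures"

def individualization(rationales: set[str]) -> str:
    """Example 2: fraction of essential individualized procedures."""
    essential = {
        "follow-up: duration of canker sore",
        "follow-up: when gums bleed",
        "follow-up: weight loss",
        "dentition assessment: visual with mirror",
        "periodontal assessment: visual with mirror",
    }
    frac = len(essential & rationales) / len(essential)
    if frac == 1.0:    # all essential procedures present
        return "performed all essential individualized procedures"
    elif frac >= 0.5:  # 50-80% of them
        return "performed some essential individualized procedures"
    else:              # <50%
        return "did not perform essential individualized procedures"

# Usage: evaluate a (hypothetical) Rationale Product
print(adequacy_of_history({"chief complaint"}))
# -> performed some essential history procedures
```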
19. Docking an Evidence Model
[Diagram: the Evidence Model joins ("docks") with the Student Model through their shared student model variables]
20. Wiley & Haertel on designing scoring rubrics
- Deciding what skills and abilities are to be measured.
- Deciding what aspects or subtasks of the task bear on those abilities.
- Assuring that the recording of performance adequately reflects those aspects or subtasks (adequacy of work product).
- Designing rubrics for those aspects or subtasks.
- Creating procedures for merging aspect and subtask scores into a final set of scores organized according to the skills or abilities set forth as the intents of measurement. [See the sketch below.]
(Wiley & Haertel, 1996, p. 79)
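The merging step in the last point lends itself to a small sketch. The weighted-average scheme, weights, and names below are illustrative assumptions, not Wiley & Haertel's actual procedure.

```python
def merge_scores(aspect_scores: dict[str, float],
                 skill_map: dict[str, dict[str, float]]) -> dict[str, float]:
    """skill_map: skill -> {aspect: weight}; returns a weighted score per skill."""
    finals = {}
    for skill, weights in skill_map.items():
        total_w = sum(weights.values())
        finals[skill] = sum(aspect_scores[a] * w for a, w in weights.items()) / total_w
    return finals

# Example: two aspects bearing on "problem solving", one on "communication"
print(merge_scores(
    {"setup": 3.0, "solution_path": 2.0, "explanation": 4.0},
    {"problem solving": {"setup": 1.0, "solution_path": 2.0},
     "communication": {"explanation": 1.0}},
))
# -> {'problem solving': 2.33..., 'communication': 4.0}
```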