Title: How to Assess and Measure Competency
1. How to Assess and Measure Competency
- Robert C. Shaw, Jr., PhD
- Program Director
2. Presentation Outline
- Describe a program's responsibilities
- Assess appropriate content
- Measure abilities as precisely as possible
- Reference each cut score to a criterion
3. The validity claim
- Our program is confident we can make valid inferences from an assessment because
  - we carefully selected and structured the content, and
  - observed scores are reasonably precise
- Weakness in either claim diminishes the validity
argument
4. Define appropriate content
5. Information sources for content
- Certification boards' expectations
6. What should we assess?
- A program should seek multiple opinions about program content
  - May mean more than one faculty person in the program
  - Could extend to survey results from several stakeholders
    - Those who hire your graduates
    - Those who graduated
7. Describe potential content
- Define potential content by describing job behaviors or tasks
  - Interpret ABG results
  - Determine the appropriate time to refer a patient for consultation from another service
  - Adjust mechanical ventilation settings to optimize oxygenation for a patient while minimizing the risk of pulmonary injury
8. Define terminal behaviors
- Focus terminal assessments on end-product behavior you expect students to master
  - Insert a pulmonary artery catheter in a patient within a critical care setting using standard technique while minimizing risks of infection and lung involvement
  - Integrate pulmonary function testing results with patient history and other laboratory results to produce a diagnosis
9. Measure task criticality
- Typically expressed by the interaction of an importance/significance/risk measure and a frequency/extent measure
10. Potential survey measurements
- How important is the task to success?
- OR
- How significant is the task to safe and effective practice?
  - 4 = Extremely
  - 3 = Very
  - 2 = Moderately
  - 1 = Minimally
11. Potential survey measurements
- If this task is incorrectly performed, how strong is the risk?
  - 3 = Potentially fatal
  - 2 = Likely to increase morbidity
  - 1 = Unlikely to have an adverse effect
12. Potential survey measurements
- How frequently do you perform the task?
  - 3 = Every week
  - 2 = A few times each year
  - 1 = Less than once a year
- OR
  - 3 = Very often
  - 2 = Occasionally
  - 1 = Infrequently
13. Potential survey measurements
- Have you performed the task in the last year?
14. What can we do with task measurements?
- Norm-referenced approach
  - Rank order tasks from most to least critical (a criticality-scoring sketch follows this list)
  - Start at the top and work down using available time
- Criterion-referenced approach
  - Identify tasks that are sufficiently critical to ensure program coverage and competency assessment
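As a minimal sketch of how survey ratings might be combined, the example below multiplies each task's mean importance rating by its mean frequency rating and ranks tasks from most to least critical. The task names, ratings, and the multiplicative index are illustrative assumptions; a program could just as reasonably sum the two measures or use a different threshold.

    # Hypothetical survey results: importance ratings (1-4) and frequency ratings (1-3)
    # collected from several stakeholders for each task.
    survey = {
        "Interpret ABG results":            {"importance": [4, 4, 3], "frequency": [3, 3, 3]},
        "Adjust ventilator settings":       {"importance": [4, 3, 4], "frequency": [3, 2, 3]},
        "Insert pulmonary artery catheter": {"importance": [3, 3, 2], "frequency": [1, 1, 2]},
    }

    def mean(values):
        return sum(values) / len(values)

    # Criticality index = mean importance x mean frequency (one common choice).
    criticality = {task: mean(r["importance"]) * mean(r["frequency"])
                   for task, r in survey.items()}

    # Norm-referenced use: rank tasks and work down the list as time allows.
    for task, score in sorted(criticality.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{score:5.2f}  {task}")

    # Criterion-referenced use: keep every task at or above a chosen cut (6.0 here is arbitrary).
    must_cover = [task for task, score in criticality.items() if score >= 6.0]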
15. Select item type(s) for each assessment
- Constructed response (e.g., short answer, essay, performance)
  - Short development time
  - Long scoring time
  - Scores have strong subjective characteristics
- Selected response (e.g., true/false, matching, multiple-choice)
  - Long development time
  - Short scoring time
  - Scores have strong objective characteristics
16. High-stakes terminal assessments should be standardized
- Specify how the assessment should look before writing/selecting items
- Test specifications ensure each assessment is similar, fair, and covers critical content
17. Test specifications are typically two-dimensional
18. Entire test blueprint/matrix
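The blueprint matrix itself appears as a figure in the original slides and is not reproduced here. As a hedged illustration of what a two-dimensional specification might contain, the sketch below crosses hypothetical task areas with cognitive process levels and records a target item count per cell; every name and count is invented for illustration.

    # Hypothetical two-dimensional test blueprint: task area x cognitive process level.
    # Cell values are target item counts; the grand total sets the test length.
    blueprint = {
        "Interpret ABG results":      {"recall": 4, "application": 6, "analysis": 5},
        "Adjust ventilator settings": {"recall": 3, "application": 8, "analysis": 6},
        "Integrate PFT results":      {"recall": 3, "application": 5, "analysis": 4},
    }
    total_items = sum(sum(levels.values()) for levels in blueprint.values())  # 44 items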
19. Test specifications and items
- Each item should be linked to a task and a cognitive process level
- It helps to store items in a database
  - A sophisticated database will permit additional layers of classification
    - Acute/chronic care
    - Age groups
20. Item banking software
- FastTest
  - www.assess.com/frmSoftCat.htm
- ExamView
  - www.pearsonncs.com/examview/examview.htm
- LXRTest
  - www.lxrtest.com/
21. Measure abilities precisely
- Are we confident an assessment has yielded a
sufficiently precise ability estimate?
22. Reliability
- Theoretical premise
  - Observed scores are assumed to express true ability plus some measurement error
  - High reliability implies low measurement error
23. Reliability
- Reliability indices are R² values, which express the percentage of observed score variance that can be attributed to true score variance (expressed formally after this list)
- How high is high enough?
  - A test score reliability value of at least .85 is characteristic of large-scale, standardized assessments; many exceed .90
  - Sufficiently reliable test scores from a test built by a program should show values of at least .60
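Stated in classical test theory notation (added here for reference; the slides make the same point in words), an observed score is the sum of a true score and error, and reliability is the share of observed score variance attributable to true scores:

    \[ X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2} \]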
24. Reliability
- Reliability is an attribute of a set of test scores; it is not an attribute of a test
- Therefore, a program should assess reliability for each group
  - KR-20 is appropriate for dichotomously scored (0, 1) items
  - Coefficient alpha works for polytomously (0, 1, …, n) scored items (a computation sketch for both indices follows this list)
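A minimal computation sketch for both indices is given below, using NumPy and assuming a students-by-items score matrix. Variances are computed over the observed group (ddof=0) so that KR-20 and coefficient alpha agree exactly on 0/1 data; the function names are mine, not part of any particular package.

    import numpy as np

    def coefficient_alpha(scores):
        """Coefficient (Cronbach's) alpha for an examinees-by-items matrix;
        handles dichotomous or polytomous item scores."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                       # number of items
        item_var = scores.var(axis=0)             # per-item variance
        total_var = scores.sum(axis=1).var()      # variance of total scores
        return (k / (k - 1)) * (1 - item_var.sum() / total_var)

    def kr20(scores):
        """KR-20 for dichotomously (0/1) scored items; equals alpha when the
        same group variances are used throughout."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]
        p = scores.mean(axis=0)                   # proportion correct per item
        total_var = scores.sum(axis=1).var()
        return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)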
25. Why are selected response items used for so many assessments?
- Assuming the time to assess is constant, more responses can be elicited from students using selected response items
  - more items
  - broader content coverage
  - increased information
  - enhanced measurement precision
  - stronger validity
- Scores are more strongly objective
26. Add items or options?
- A program cannot go wrong by adding more items to an assessment
- A program may only consume space and time by adding more options to multiple-choice items
- There is growing evidence that items with three options are optimal, particularly when doing so permits inclusion of more items on an assessment
  - Dr. Thomas Haladyna, Arizona State University
27. Up to a point, measurement precision and item quantity are directly related
[Figure: reliability plotted against item count, with separate curves for higher quality items and lower quality items]
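One way to quantify the relationship this slide shows is the Spearman-Brown prophecy formula, which projects how score reliability changes as a test is lengthened with items of comparable quality. It is not named on the slide and is added here only as a reference point.

    def spearman_brown(reliability, length_factor):
        """Projected reliability after lengthening (or shortening) a test by
        `length_factor`, assuming added items behave like existing ones."""
        r, n = reliability, length_factor
        return (n * r) / (1 + (n - 1) * r)

    # Doubling a test whose scores currently show reliability .60:
    print(spearman_brown(0.60, 2))   # 0.75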
28. What encourages high item quality?
- Write well
  - Clear, concise, accurate
  - Remove unnecessary information from the stimulus
  - Present nuanced choices that require a sophisticated mastery of material to correctly respond
- Item review is another opportunity to seek multiple opinions
29. What encourages high item quality?
- Avoid formats known to be flawed
- D. All of the above
- D. None of the above
- Negative wording
- All of the following are true EXCEPT
- Which of the following is not true?
30. What encourages high item quality?
- Apply quality improvement principles
  - Analyze item performance
  - Retain items that contribute to test score reliability
  - Change or discard items that fail to contribute or negatively affect reliability
31. Item analysis properties
- Difficulty
  - p = proportion of students who correctly responded
- Discrimination
  - rpb = correlation between item success and students' test scores (a computation sketch follows this list)
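A minimal sketch of both statistics appears below, assuming a students-by-items 0/1 matrix. The discrimination value is computed as a corrected item-total (point-biserial) correlation, i.e., each item is correlated with the total score with that item removed; the function name is illustrative.

    import numpy as np

    def item_statistics(scores):
        """Classical item statistics for an examinees-by-items 0/1 matrix:
        p   = proportion of students answering correctly (difficulty)
        rpb = point-biserial correlation between item success and the
              total score with the item itself removed (discrimination)."""
        scores = np.asarray(scores, dtype=float)
        p = scores.mean(axis=0)
        total = scores.sum(axis=1)
        rpb = np.empty(scores.shape[1])
        for j in range(scores.shape[1]):
            rest = total - scores[:, j]          # criterion without the item
            rpb[j] = np.corrcoef(scores[:, j], rest)[0, 1]
        return p, rpb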
32. Item difficulty
[Figure: contribution to test score reliability (scaled 0.0 to 1.0) plotted against item difficulty p, with markers at p = 0.4 and 0.6]
33. Item discrimination
- Because rpb values are correlations, values reflect one of three possibilities relative to reliability
  - Positive contribution
  - No contribution
  - Negative contribution
34. Using item parameters diagnostically
- Relative to reliability contribution, item
  - p values provide magnitude information
  - rpb values provide magnitude and direction (+ or -) information
35. Using item parameters diagnostically
- Difficulty and discrimination properties equally contribute to reliability
- The best items show .30 < p < .70 AND rpb > .20
- The worst items exist at the difficulty extremes and show zero or negative discrimination
36. After diagnosing an item that shows a weak or negative reliability contribution
- What should we do?
  - Observe option response frequencies and mean scores (a distractor-analysis sketch follows this list)
  - Identify incorrect responses that attracted students with test scores equal to or greater than the average
  - Replace the offending option with a less attractive response
  - Rewrite the stem to clarify ambiguities
  - OR
  - Discard the whole item and use a better one the next time
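The sketch below illustrates one way to inspect option response frequencies and mean scores for a single multiple-choice item; an incorrect option whose choosers average at or above the overall mean is the kind of offending option worth replacing. Argument names and labels are illustrative.

    import numpy as np

    def distractor_report(responses, key, total_scores):
        """For one item: how many students chose each option and the mean
        total test score of those students. responses holds each student's
        chosen option (e.g., 'A'-'D'); key is the correct option."""
        responses = np.asarray(responses)
        total_scores = np.asarray(total_scores, dtype=float)
        print(f"overall mean score: {total_scores.mean():.2f}")
        for option in np.unique(responses):
            chose = responses == option
            tag = " (key)" if option == key else ""
            print(f"option {option}{tag}: n = {chose.sum()}, "
                  f"mean score = {total_scores[chose].mean():.2f}")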
37. Item analysis software
- Iteman
  - www.assess.com/Software/iteman.htm
- examSystem II
  - www.pearsonncs.com/examsystem/index.htm
- LXRTest
  - www.lxrtest.com/
- True Score II
  - www.nine-patch.com/TSCDL.htm
- Excel Templates (free)
  - www.eflclub.com/elvin/publications/2003/itemanalysis.html
38. Internal resources may be available
- There is a good probability a large university
with education, psychology, and/or statistics
departments will have a system available for
scoring items and providing analyses of test
scores and items
39. Reference each cut score to a criterion
- Should we define and assess minimal competence
for our program?
40. Cut points
- Highly reliable test scores reveal differences between students' abilities and can help accurately rank order students, which may be important to employers
- However, the program is likely interested in assessing whether each student is sufficiently competent to safely and effectively practice
- Such assessment concerns typically surface as students are about to graduate
41. Measuring minimal competence
- A program should decide whether it wants to create one large assessment with a single compensatory cut point
- OR
- Should each content domain have its own cut (a conjunctive model)?
42. Why are there so many compensatory-cut competency assessments?
- If a program selects the more rigorous conjunctive model, then each component test will produce its own set of scores, each with its own reliability (a decision-model sketch follows this list)
- Each component must have a sufficient number of items or data points to be confident each student group's test scores will show adequate reliability
- Modules of fewer than 80-100 program-made items are unlikely to produce adequate reliability
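A minimal sketch of the two decision models is shown below; the domain names and cut values are invented for illustration. Under the compensatory model, strength in one domain can offset weakness in another; under the conjunctive model, every domain must clear its own cut.

    def passes_compensatory(domain_scores, total_cut):
        """Single compensatory cut: only the overall total matters."""
        return sum(domain_scores.values()) >= total_cut

    def passes_conjunctive(domain_scores, domain_cuts):
        """Conjunctive model: every content domain must meet its own cut."""
        return all(domain_scores[d] >= c for d, c in domain_cuts.items())

    scores = {"ABG interpretation": 34, "Mechanical ventilation": 22, "PFT": 30}
    print(passes_compensatory(scores, total_cut=80))          # True: 86 >= 80
    print(passes_conjunctive(scores, {"ABG interpretation": 28,
                                      "Mechanical ventilation": 25,
                                      "PFT": 27}))            # False: ventilation below its cut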
43. Seek multiple opinions . . . again
- Program faculty should define the skills competent practitioners possess
- This is a group activity
- Each cut point should be linked to a definition of minimally competent practitioners
44. Performance assessments
- Pick your spots
- Ensure a sufficient quantity of information is collected
- Standardize administration
- Measure agreement between/among evaluators (an agreement sketch follows this list)
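Percent agreement alone overstates evaluator consistency because some agreement occurs by chance; Cohen's kappa is one common chance-corrected index for two evaluators. The sketch below assumes two raters scoring the same set of checklist items and is illustrative only.

    import numpy as np

    def cohens_kappa(rater_a, rater_b):
        """Chance-corrected agreement between two raters on categorical ratings."""
        a, b = np.asarray(rater_a), np.asarray(rater_b)
        categories = np.unique(np.concatenate([a, b]))
        observed = np.mean(a == b)                                   # raw agreement
        expected = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
        return (observed - expected) / (1 - expected)

    # Two evaluators rating ten checklist steps as pass/fail:
    a = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass"]
    b = ["pass", "fail", "fail", "pass", "pass", "fail", "pass", "pass", "pass", "pass"]
    print(cohens_kappa(a, b))   # ~0.47 despite 80% raw agreement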
45. Summary
- Collective opinions are closer to the truth than any one opinion about
  - appropriate assessment content,
  - item quality, and
  - justifiable cut scores
- Unreliable scales have no utility
46. Thank you for the opportunity to share some details about measurement