Title: Quality Control for Ratings in a Performance Test
Quality Control for Ratings in a Performance Test
- ??????????????????
- ???????????????????????
Contents
Introduction
- Performance assessment has resurfaced dramatically since the 1990s because it
  - shows promise for assessing learning outcomes that require demonstration of skills or other performances that cannot be assessed using multiple-choice items,
  - is linked with teaching and curriculum, and
  - relates to real-life skills.
(Brown et al., 1996)
Theoretical Background
[Figure: performance-based assessment, showing the candidate, the interlocutor (including other candidates), and the instrument/task] (McNamara, 1996: 9)
Theoretical Background
[Figure: the test score and its contributing factors: competence, task difficulty, domain difficulty, rater effects, the rating scale, and personal characteristics] (Engelhard, 1991)
Theoretical Background
Theoretical Background
- Raters' likes, dislikes, and expectations about people
- Fatigue or lapses in attention
- Raters' systematic variance unrelated to the ratees' performance
- Rater biases
- Use of the rating scale
- Rater severity
- Personal beliefs that conflict with the values espoused by the scoring rubric
- Deficiencies in some areas of content knowledge
Sources of rater bias and error
- Popham's (1990) framework
- Raters
  - Raters engage in a complex and error-prone cognitive process (Cronbach, 1990).
  - A rating involves an evaluative summary of past or present experiences, in which the "internal computer" of the rater processes the input data in complex and unspecified ways to arrive at the final judgment (Thorndike & Hagen, 1977).
- Rating procedure
  - Too many traits to rate
  - Sessions that run too long, leading to fatigue and boredom
  - Computer-based vs. paper-based administration
- Rating scales
  - Raters are not clear about what they are being asked to rate.
  - The traits are not clearly defined.
  - Rating scale categories are ambiguously worded or insufficiently differentiated (e.g., two or more rating scale categories may overlap).
Theoretical Background
(Myford & Wolfe, 2003, 2004)
Rating evaluation: two frameworks
- Normative framework
  - Raters are examined in the context of the pool of raters from which individual raters are drawn.
  - Describes how much individual raters differ from the average rater in the pool.
  - In an agreement framework, we are concerned with how well the ratings of individual raters agree with the ratings assigned by all of the other raters in the pool.
Rating evaluation: two frameworks
- Criterion-referenced framework
  - Raters are examined in the context of some external point of reference that is assumed to be a valid indicator of the examinees' proficiency.
  - These externally generated scores are most commonly assigned by a benchmark committee, or determined from examinee scores on some other assessment instrument.
  - The focus is on the errors in, and accuracy of, a rater's ratings.
Ways of measuring rater effects
- In general
  - Focusing on modeling the agreement among raters
  - Emphasizing the tasks of evaluating rater precision and estimating the relative rankings of individuals
(Johnson & Albert, 1999)
Ways of measuring rater effects
- Severity/leniency (illustrated in the sketch below)
  - Compare the mean ratings of the traits with the midpoints of the rating scales that were employed.
  - Use ANOVA to determine whether there is a statistically significant rater main effect.
  - Examine the degree of skewness of the frequency distributions of the ratings for the traits.
(Myford & Wolfe, 2003)
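A minimal sketch of how the first and third of these screens might be computed for a single rater, assuming ratings are stored as a ratee-by-trait array on a 1-6 scale; the numbers and the scale are illustrative assumptions, not data from this study:

```python
# Hypothetical severity/leniency screens for one rater (toy data, assumed 1-6 scale).
import numpy as np
from scipy.stats import skew

scale_min, scale_max = 1, 6
midpoint = (scale_min + scale_max) / 2  # 3.5

# rows = ratees, columns = traits
ratings = np.array([
    [2, 3, 2, 3],
    [3, 3, 4, 3],
    [2, 2, 3, 2],
    [4, 3, 3, 4],
])

# 1. Trait means far below the midpoint suggest severity; far above, leniency.
print("trait means:", ratings.mean(axis=0), "midpoint:", midpoint)

# 3. Positive skew (ratings piled at the low end of the scale) is a further severity signal.
print("trait skewness:", skew(ratings, axis=0))
```

The rater main-effect ANOVA in the second bullet would additionally require ratings from several raters on the same ratees.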
Ways of measuring rater effects
- Halo effect (illustrated in the sketch below)
  - Intercorrelations among ratings on a number of traits
  - Factor analyses of the trait intercorrelation matrix
  - Variances or standard deviations associated with each rater's ratings of a given ratee across all the traits
  - ANOVA, focusing on the Rater x Ratee interaction
(Myford & Wolfe, 2003)
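A minimal sketch of two of these halo screens for one rater's ratings, again using assumed toy data on a 1-6 scale rather than the study's data:

```python
# Hypothetical halo-effect screens for a single rater (toy data).
import numpy as np

# rows = ratees, columns = traits
ratings = np.array([
    [2, 2, 3, 2],
    [5, 5, 5, 5],
    [4, 4, 4, 4],
    [3, 3, 3, 3],
    [6, 5, 6, 6],
])

# 1. Inter-trait correlations: correlations near 1 across all trait pairs can signal halo.
print(np.corrcoef(ratings, rowvar=False).round(2))

# 2. Within-ratee spread: SDs near 0 mean the rater barely differentiates the traits
#    for a given ratee, another halo indicator.
print(ratings.std(axis=1, ddof=1).round(2))
```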
Ways of measuring rater effects
- Centrality/restriction-of-range (illustrated in the sketch below)
  - Compare the average rating for a given trait with the midpoint of the rating scale.
  - Examine the SD of the ratings across all ratees for a given trait (the smaller the SD, the greater the restriction-of-range effect).
  - Look at the degree of kurtosis of the frequency distribution of the ratings assigned on a given trait.
  - Conduct a Rater x Ratee x Trait ANOVA.
(Myford & Wolfe, 2003)
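A minimal sketch of the descriptive centrality screens, under the same assumption of a 1-6 scale and invented ratings:

```python
# Hypothetical centrality/restriction-of-range screens for one trait (toy data).
import numpy as np
from scipy.stats import kurtosis

midpoint = 3.5  # midpoint of the assumed 1-6 scale
ratings = np.array([3, 4, 3, 3, 4, 3, 4, 3, 3, 4, 3, 4])  # one trait, all ratees

print("mean vs. midpoint:", ratings.mean(), midpoint)  # mean hovering at the midpoint
print("SD:", ratings.std(ddof=1).round(2))             # small SD -> restricted range
print("excess kurtosis:", kurtosis(ratings).round(2))  # strongly peaked distributions also suggest centrality
```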
Ways of measuring rater effects
- Accuracy/randomness
  - Compare a rater's estimate of a ratee's proficiency with the ratee's actual proficiency.
  - Generalizability theory
  - Many-facet Rasch analysis
(Myford & Wolfe, 2003)
Many-facet Rasch Measurement
- The MFRM model takes this basic form:
  log(Pnjik / Pnjik-1) = Bn - Di - Cj - Fjk
- where
  - Pnjik is the probability of student n receiving a rating of k on item i from rater j,
  - Pnjik-1 is the probability of student n receiving a rating of k-1 on item i from rater j,
  - Bn is the ability of student n,
  - Di is the difficulty of item i,
  - Cj is the severity of rater j, and
  - Fjk is the difficulty of receiving a rating of k relative to a rating of k-1 on item i.
- The MFRM model calibrates the level of the test takers' performance, the severity of the raters, the difficulty of the tasks, and the rating scales onto a single logit scale, creating a single frame of reference for interpreting the results of the analysis.
- The various facets are analyzed simultaneously but independently, which makes it possible to measure rater severity on the same scale as students' performance and trait difficulty.
- We can thus draw useful, diagnostically informative comparisons among the various facets; a small numerical illustration follows.
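A minimal sketch, not the study's own code, of how this adjacent-category formulation turns facet estimates into rating probabilities; the function name and every numeric value are assumptions made for illustration:

```python
# Illustrative MFRM probability calculation (hypothetical parameter values).
import math

def mfrm_category_probs(B_n, D_i, C_j, F):
    """Return P(rating = k) for k = 0..K, given ability B_n, item difficulty D_i,
    rater severity C_j, and thresholds F[1..K] (F[0] is an unused placeholder)."""
    log_num = [0.0]          # log-numerator for category 0
    running = 0.0
    for F_k in F[1:]:
        running += B_n - D_i - C_j - F_k   # each step adds log(P_k / P_{k-1})
        log_num.append(running)
    denom = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / denom for v in log_num]

# Example: an able student, an average item, a somewhat severe rater,
# and four thresholds of a 0-4 rating scale (all values in logits, all assumed).
probs = mfrm_category_probs(B_n=1.0, D_i=0.0, C_j=0.5, F=[0.0, -1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])   # probabilities for ratings 0..4, summing to 1
```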
Ways to prevent rater errors
- Rating supervisors screen raters.
- Rater training
- Raters are provided with information about common rater errors.
- Supervisors read behind raters.
- Supervisors periodically receive statistical information.
Literature review
- The most frequently examined rater effect is rater severity.
- Bachman et al. (1995) developed a performance assessment measure of language speaking ability, the Language Ability Assessment System, and examined its reliability through generalizability theory and many-facet Rasch measurement. Test takers read passages and view recorded lectures in the target language, answering questions in writing and orally, and summarizing the lectures orally. An analysis of the scores, raters, and the test itself found that individual raters and tasks could affect the estimation of a test-taker's language ability, and that a large range of severity existed in the raters' judgments of performance.
- Kondo-Brown (2002) used many-facet Rasch measurement to investigate how the judgments of trained teacher raters are biased towards certain types of candidates and certain criteria in assessing Japanese L2 writing. The study revealed significant differences in overall severity. The raters scored certain candidates and criteria more leniently or harshly, and every rater's bias pattern was different.
- A similar rater severity effect was also found in other studies (e.g., Engelhard, 1994; Lumley & McNamara, 1995; Gyagenda & Engelhard, 1998; Eckes, 2005).
Literature review
- Relationship between raters, domains, and students' language ability
- Gyagenda and Engelhard (1998) examined whether there were statistically significant differences in rater severity and domain difficulty, and explored the rater-by-domain interaction effect. Results indicated significant differences between raters, significant differences between domains, and a significant interaction effect between raters and domains.
- Eckes (2008) advanced the hypothesis that experienced raters fall into types or classes that are clearly distinguishable from one another with respect to the importance they attach to scoring criteria. Many-facet Rasch analysis revealed that raters differed significantly in their views on the importance of the various criteria. Six rater types emerged from the analysis. Each of these types was characterized by a distinct scoring profile, indicating that raters were far from dividing their attention evenly among the set of criteria. Moreover, rater background variables were shown to partially account for the scoring profile differences.
Literature review
- Effect of rater training in reducing rater effects
- Weigle (1998) conducted a study to explore differences in rater severity and consistency between inexperienced and experienced raters, both before and after rater training. Analysis showed that before training the inexperienced raters tended to be both more severe and less consistent in their ratings than the experienced raters. After training, the differences between the two groups of raters were less pronounced; however, significant differences in severity were still found among raters, although consistency had improved for most raters, suggesting that rater training is more successful in improving intra-rater reliability than inter-rater reliability.
- Lumley and McNamara (1995) investigated the effect of rater training in a speaking test. Data are presented from two rater training sessions separated by a 6-month interval and an intervening operational test administration session. Analyses showed that training could by no means eliminate the substantial variation in rater harshness, and suggested that the results of training may not endure for long after a training session.
Literature review
- ???(2006)?????????????????????,??????????,????????
?????????????????????????????????????,??????,?????
?????????,?????,?????????????????????????????,???
?????????????? - ???(2006)?????????????????????????????????????????
???????,???????????,??????????????,???????????????
??????????????????????,???????????????????? - ??????(2008)????Rasch ????????????????????????????
??,??????????????????????????????????,????????????
- ??(2008)?100?????????????????????????????,????????
??????????, ?????????????,??????????????? - ??(2009)?????????????????????????????Rasch????????
????????????????,???????????????,?????????????????
??????
Research questions
- Do raters differ in the levels of severity they exercise in their ratings, and if so, which raters are rating more severely or leniently than the others?
- Do raters effectively and consistently employ the rating scales in their ratings?
- Do raters efficiently differentiate between traits; that is, do raters show any evidence of a halo effect?
- Do raters exhibit bias in their ratings? Specifically, is rater severity invariant across the test takers?
Experimental design
Results and discussion
Facet map
Rater reliability
Rater statistics
Severity/Leniency
Rater statistics
- Separation index = 7.71, showing that the severity of the raters can be divided into about 8 statistically distinct levels; the formula is (4G + 1)/3, where G is the separation ratio (Myford & Wolfe, 2004). See the arithmetic check below.
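As a quick arithmetic check, assuming the 7.71 reported above is the strata index H obtained from the separation ratio G:

$$H = \frac{4G + 1}{3} \approx 7.71 \quad\Rightarrow\quad G = \frac{3H - 1}{4} = \frac{3(7.71) - 1}{4} \approx 5.53,$$

i.e., the rater severity measures span roughly eight statistically distinct strata.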
Differential severity/leniency
Rater x trait statistics
??????????????????????????????,???????????????????
??????????????????????????????????????????????????
z???2,???????????????????????????z???-2,?????????
?????????????????
Accuracy/Randomness
Ratee statistics
- ?Rasch???,???????????????????,???????????,????????
??????????? - ????????,?????????????????????????????????????????
?,??????????????????????? - ?????????????????????????????????????????????,????
??????????????????????????????????
Accuracy/Randomness
Rater statistics
???????????????????????????????????,??????????????
?????,????????
Centrality/Restriction-of-Range
????????????????????,?????????????????????????????
????????????????,?????????????????????????
Centrality/Restriction-of-Range
Ratee statistics
????????,??????????????????
Centrality/Restriction-of-Range
Trait statistics
- ?????????????????????????????????????(???2SD)(McN
amara 1996),????????1??????????????????,??????????
?????????(Myford Wolfe 2004)? - ???????????????(.84,Z-2.2)????1,?????????????????
?????
Centrality/Restriction-of-Range
????????,???48???3??????,????3????????????
Centrality/Restriction-of-Range
- ??????????????4???????
- ????,?????????????
- ???????????????????
- ???????????
- ???????????????????,??????????
- ???,????Rater 4???????????????
Centrality/Restriction-of-Range
Category statistics
- ??????????,?????????(threshold)??????,????????????
??????????????????????????????????1 logit,?????4
(Linacre 1999) .
Centrality/Restriction-of-Range
Halo effect
Category statistics
- ????????????????????????(???????????????),????????
???????????????? - ?????????????????????????????????,????????????????
???????
Halo effect
Rater statistics
- ??????????????,?????????????,??????????????????1,?
????????????????,?????????????????? - ????,???????????????,????????,?????????????????,??
??????????1?
Conclusion
- ???????????????????,??????????????????????????????
???????????(Eckes 2008)? - ???Rasch???????????????????????????????,??????????
????? - ?????,????????????,???????????????????????????????
?,?Rater 4?????????????????????????,???????????
Implications
- ?????Rasch?????????????????????????????????????,??
???????,??????????????????,?????????Rasch??,??????
??????,?????????,?????????????????,??????????????,
????????????? - ?????????????????????(Barrett 2001 Bonk Ockey
2003 Elder et al. 2005),????????????????????,????
??(LeBel et al. Forthcoming Saito
2008),????????? - ????????????(????????????)?????????????(Barrett
2001),???????????,??????????????(Eckes 2008)? - ??????????????????????????????????????????????????
?????,????????????,?????????????????????(LeBel et
al. Forthcoming),???????????????????????????? - ??,?????????????????,????(??)??,??????????,???????
?,????????????????????????????????????????????????
(e.g. Lumley McNamara 1995 Wolfe, Moudler
Myford 2001),????????????????????????(Myford
Wolfe 2004),??????????,???????????????
References
- Bachman, L. F., Lynch, B. K. & Mason, M. (1995). Investigating variability in tasks and rater judgments in a performance test of foreign language speaking. Language Testing, 12, 238-257.
- Barrett, S. (2001). The impact of training on rater variability. International Education Journal, 2, 49-58.
- Bernardin, H. J. & Pence, E. C. (1980). Effects of rater training: Creating new response sets and decreasing accuracy. Journal of Applied Psychology, 65, 60-66.
- Bock, R. D. (1997). A brief history of item response theory. Educational Measurement: Issues and Practice, 16, 21-33.
- Bonk, W. J. & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89-110.
- Brown, W. L., O'Gorman, K. & Du, Y. (1996). The reliability and validity of mathematics performance assessment. Paper presented at the Annual Meeting of the American Educational Research Association, Minnesota.
- Buu, Y.-P. (2003). Statistical analysis of rater effects. Unpublished PhD thesis, University of Florida, Florida.
- Cronbach, L. J. (1990). Essentials of Psychological Testing (5th ed.). New York: Harper and Row.
- Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197-221.
- Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25, 155-185.
- Elder, C., Knoch, U., Barkhuizen, G. & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2, 175-196.
- Engelhard, G., Jr. (1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5(3), 171-191.
- Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112.
References (continued)
- Gyagenda, I. S. & Engelhard, G., Jr. (1998). Applying the Rasch model to explore rater influences on the assessed quality of students' writing ability. Paper presented at the Annual Meeting of the American Educational Research Association, San Diego.
- Hedge, J. W. & Kavanagh, M. J. (1988). Improving the accuracy of performance evaluations: Comparison of three methods of performance appraiser training. Journal of Applied Psychology, 73, 68-73.
- Johnson, V. E. & Albert, J. H. (1999). Ordinal Data Modeling. New York: Springer-Verlag.
- Kenny, D. A. & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis. Psychological Bulletin, 112, 165-172.
- Kondo-Brown, K. (2002). A facets analysis of rater bias in measuring Japanese second language writing performance. Language Testing, 19(1), 3-31.
- Kumar, D. D. (2005). Performance appraisal: The importance of rater training. Journal of the Kuala Lumpur Royal Malaysia Police College, 4, 1-17.
- LeBel, T. J., Kilgus, S. P., Briesch, A. M. & Chafouleas, S. (Forthcoming). The impact of training on the accuracy of teacher-completed direct behavior ratings. Journal of Positive Behavior Interventions.
- Linacre, J. M. (1994). Many-facet Rasch Measurement. Chicago: MESA Press.
- Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103-122.
- Linacre, J. M. (2003). A User's Guide to Facets: Rasch-model Computer Program. Chicago: MESA Press.
References (continued)
- Lumley, T. & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71.
- Lumley, T. & Brown, A. (2005). Research methods in language testing. In E. Hinkel (Ed.), Handbook of Research in Second Language Teaching and Learning. Mahwah, NJ: Lawrence Erlbaum, 833-855.
- Lynch, B. K. & McNamara, T. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15(2), 158-180.
- McNamara, T. (1996). Measuring Second Language Performance. London; New York: Longman.
- Murphy, K. R. & Anhalt, R. L. (1992). Is halo error a property of the rater, ratees, or the specific behavior observed? Journal of Applied Psychology, 77, 494-500.
- Myford, C. M. & Mislevy, R. J. (1995). Monitoring and improving a portfolio assessment system (Report No. MS 94-05). Princeton, NJ: Educational Testing Service.
- Myford, C. M. & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422.
- Myford, C. M. & Wolfe, E. W. (2004). Understanding Rasch measurement: Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.
- Popham, W. J. (1990). Modern Educational Measurement: A Practitioner's Perspective. Englewood Cliffs, NJ: Prentice Hall.
- Saito, H. (2008). EFL classroom peer assessment: Training effects on rating and commenting. Language Testing, 25(4), 553-581.
- Scullen, S. E., Mount, M. K. & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85, 956-970.
- Sykes, R. C., Ito, K. & Wang, Z. (2008). Effects of assigning raters to items. Educational Measurement: Issues and Practice, 27, 47-55.
References (continued)
- Thorndike, R. L. & Hagen, E. P. (1977). Measurement and Evaluation in Psychology and Education. New York: John Wiley and Sons.
- Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287.
- Wolfe, E. W. & Chiu, C. W. T. (1997). Detecting rater effects with a multi-faceted rating scale model. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago.
- Wolfe, E. W. & Chiu, C. W. T. (1999). Measuring pretest-posttest change with a Rasch rating scale model. Journal of Outcome Measurement, 3(2), 134-161.
- Wolfe, E. W., Moulder, B. M. & Myford, C. M. (2001). Detecting differential rater functioning over time (DRIFT) using a Rasch multi-faceted rating scale model. Journal of Applied Measurement, 2, 256-280.
- Zhu, W., Ennis, C. D. & Ang, C. (1998). Many-faceted Rasch modeling expert judgment in test development. Measurement in Physical Education and Exercise Science, 2(1), 21-39.
- ??? ??,2008,???Rasch ?????????????????(CET-SET)???? J????? (4) 388-398?
- ???,2006,??????????????????????? D????????, ????????, ???
- ??,2008,???Rasch????????????? D????????, ????????
- ???,2006,??????????????? D,???????, ????, ???
- ??,2009,?????????????????????? D????????, ????????, ???
Thank You!
Email: ahda_liu_at_yahoo.com.cn
Website: www.clal.org.cn/personal/jackliu