Quality Control for Ratings in a Performance Test

1
Quality Control for Ratings in a Performance Test
  • ??????????????????
  • ???????????????????????

2
Contents
3
Introduction
  • Performance assessment has resurfaced
    dramatically since the 1990s because it
  • shows promise for assessing learning outcomes
    that require demonstration of skills or other
    performances that cannot be assessed using
    multiple-choice items,
  • is linked with teaching and curriculum, and
  • relates to real-life skills.

(Brown et al. 1996)
4
Theoretical Background
Performance-based assessment
(diagram): the instrument/task, the interlocutor (including other candidates), and the candidate
(McNamara, 1996: 9)
5
Theoretical Background
Influences on the test score (diagram): task difficulty, domain difficulty, competence, rater effects, the rating scale, and personal characteristics
(Engelhard, 1991)
6
Theoretical Background
7
Theoretical Background
  • Raters' likes, dislikes, and expectations about people
  • Fatigue or lapses in attention
  • Raters' systematic variance unrelated to the ratees' performance
  • Rater biases
  • Use of the rating scale
  • Rater severity
  • Personal beliefs that conflict with the values espoused by the scoring rubric
  • Deficiencies in some areas of content knowledge
8
Sources of rater bias and error
  • Popham's (1990) framework
  • Raters
  • Raters engage in a complex and error-prone
    cognitive process (Cronbach, 1990).
  • A rating involves an evaluative summary of past
    or present experiences in which the internal
    computer of the rater processes the input data in
    complex and unspecified ways to arrive at the
    final judgment (Thorndike & Hagen, 1977).
  • Rating procedure
  • Too many traits to rate
  • Rating sessions that run too long: fatigue, boredom
  • Computer-based vs. paper-based
  • Rating scales
  • Raters are not clear about what they are being
    asked to rate.
  • The traits are not clearly defined.
  • Rating scale categories are ambiguously worded or
    insufficiently differentiated (e.g., two or more
    rating scale categories may overlap).

9
Theoretical Background
Myford & Wolfe (2003, 2004)
10
Rating evaluation - Two frameworks
  • Normative framework
  • Raters are examined in the context of the pool of
    raters from which individual raters are drawn
  • describes how much individual raters differ from
    the average rater in the pool
  • In an agreement framework, we are concerned with
    how well the ratings of individual raters agree
    with the ratings assigned by all of the other
    raters in the pool.

11
Rating evaluation - Two frameworks
  • Criterion-referenced framework
  • Raters are examined in the context of some
    external point of reference that is assumed to be
    a valid indicator of the examinee's proficiency.
  • These externally generated scores are most
    commonly assigned by a benchmark committee, or
    determined based on examinee scores on some other
    assessment instrument.
  • Focus on errors and the accuracy of a rater's ratings

12
Ways of measuring rater effects
  • In general
  • Focusing on modeling the agreement among raters
  • Emphasizing the tasks of evaluating rater
    precision and estimating the relative rankings of
    individuals

(Johnson & Albert, 1999)
13
Ways of measuring rater effects
  • Severity/leniency
  • Compare the mean ratings of the traits with the
    midpoints of the rating scales that were
    employed.
  • Use ANOVA to determine whether there is a
    statistically significant rater main effect.
  • Examine the degree of skewness of the frequency
    distributions of the ratings for the traits.

(Myford & Wolfe, 2003)
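A minimal numeric sketch of the first and third severity indices listed above (mean rating vs. scale midpoint, and skewness); the ratings matrix and rater labels are hypothetical and not from the presentation, and numpy/scipy are assumed to be available:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical ratings matrix: 3 raters x 6 ratees on a 1-6 scale.
ratings = np.array([
    [3, 4, 2, 5, 3, 4],   # rater A
    [2, 3, 1, 4, 2, 3],   # rater B: a low mean may signal severity
    [4, 5, 3, 6, 4, 5],   # rater C: a high mean may signal leniency
])
scale_midpoint = (1 + 6) / 2  # 3.5 on a 1-6 scale

for name, row in zip("ABC", ratings):
    print(f"Rater {name}: mean = {row.mean():.2f} vs midpoint {scale_midpoint}, "
          f"skewness = {skew(row):.2f}")
```

The ANOVA check (a statistically significant rater main effect) would be run on the same matrix in long format.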
14
Ways of measuring rater effects
  • Halo effect
  • Intercorrelations among ratings on a number of
    traits
  • Factor analyses of the trait intercorrelation
    matrix
  • Variances or standard deviations associated with
    each rater's ratings of a given ratee across all
    the traits
  • ANOVA, focusing on Rater x Ratee interaction

(Myford & Wolfe, 2003)
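A sketch of the first and third halo checks above (trait intercorrelations and each ratee's spread across traits), using hypothetical data for a single rater; pandas is assumed:

```python
import pandas as pd

# Hypothetical ratings by ONE rater: 4 ratees x 3 traits on a 1-6 scale.
one_rater = pd.DataFrame(
    {"content": [4, 3, 5, 2], "organization": [4, 3, 5, 2], "language": [4, 4, 5, 3]},
    index=["ratee1", "ratee2", "ratee3", "ratee4"],
)
print(one_rater.corr())        # near-1 trait intercorrelations suggest halo
print(one_rater.std(axis=1))   # near-zero SD across traits per ratee suggests halo
```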
15
Ways of measuring rater effects
  • Centrality/Restriction-of-Range
  • Compare average rating for a given trait with the
    midpoint of the rating scale.
  • Examine SD of the ratings across all ratees for a
    given trait (the smaller the SD, the greater the
    restriction-of-range effect).
  • Look at the degree of kurtosis of the frequency
    distribution for the ratings assigned on a given
    trait.
  • Conduct a Rater x Ratee x Trait ANOVA

(Myford & Wolfe, 2003)
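A brief sketch, with hypothetical ratings, of the second and third centrality checks above (per-trait SD and kurtosis); scipy is assumed:

```python
import numpy as np
from scipy.stats import kurtosis

# Hypothetical ratings by one rater on one trait (1-6 scale), clustered near the midpoint.
trait_ratings = np.array([3, 3, 3, 4, 4, 4, 3, 4, 3, 4, 1, 6])
print("SD:", trait_ratings.std(ddof=1))             # smaller SD -> restricted range
print("excess kurtosis:", kurtosis(trait_ratings))  # more peaked distribution -> centrality
```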
16
Ways of measuring rater effects
  • Accuracy/Randomness
  • Compare a rater's estimate of a ratee's
    proficiency with the ratee's actual proficiency.
  • Generalizability theory
  • Many-facet Rasch Analysis

(Myford & Wolfe, 2003)
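A simple illustration of the first approach above, comparing a rater's scores with externally assigned benchmark scores; both score vectors are hypothetical, and the G-theory and many-facet Rasch approaches are not shown:

```python
import numpy as np

# Hypothetical scores on a 1-6 scale: benchmark committee vs. one rater.
benchmark = np.array([4, 2, 5, 3, 6, 1])
rater     = np.array([4, 3, 5, 3, 5, 2])
print("mean absolute deviation:", np.mean(np.abs(rater - benchmark)))
print("correlation with benchmark:", np.corrcoef(rater, benchmark)[0, 1])
```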
17
Many-facet Rasch Measurement
  • The MFRM model takes this basic form
  • log (Pnjik / Pnjik-1) = Bn - Di - Cj - Fjk
  • where
  • Pnjik is the probability of student n receiving a
    rating of k on item i from rater j
  • Pnjik-1 is the probability of student n receiving
    a rating of k-1 on item i from rater j
  • Bn is the ability of student n
  • Di is the difficulty of item i
  • Cj is the severity of rater j, and
  • Fjk is the difficulty of receiving a rating of k
    relative to a rating of k-1 on item i.
  • The MFRM model calibrates the level of the test
    takers' performance, severity of the raters,
    difficulty of the tasks, and the rating scales
    onto a logit scale, creating a single frame of
    reference for interpreting the results of the
    analysis.
  • The various facets are analyzed simultaneously
    but independently, which makes it possible to
    measure rater severity on the same scale as
    students' performance and trait difficulty.
  • We can thus draw useful, diagnostically
    informative comparisons among the various facets.
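A minimal Python sketch, not from the presentation, of the rating-category probabilities implied by the model above; the facet values in the example call are hypothetical:

```python
import numpy as np

def mfrm_category_probs(B_n, D_i, C_j, F_jk):
    """Probability of each rating category k = 0..K under
    log(P_njik / P_njik-1) = B_n - D_i - C_j - F_jk (all values in logits)."""
    # Log-odds of receiving category k rather than k-1, for k = 1..K.
    step_logits = [B_n - D_i - C_j - F for F in F_jk]
    # Unnormalised log-probabilities; category 0 is the reference (log 1 = 0).
    log_num = np.concatenate(([0.0], np.cumsum(step_logits)))
    num = np.exp(log_num - log_num.max())   # subtract max for numerical stability
    return num / num.sum()

# Example: an able student, an easy item, a severe rater, a 0-3 rating scale.
print(mfrm_category_probs(B_n=1.5, D_i=-0.5, C_j=1.0, F_jk=[-1.2, 0.0, 1.2]))
```

Because every facet enters the same logit expression, raising rater severity Cj by one logit lowers the expected rating in exactly the same way as lowering student ability Bn by one logit, which is what puts all facets on a single frame of reference.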

18
Ways to prevent rater errors
  • Rating supervisors screen raters.
  • Rater training
  • Raters are provided with info about common rater
    errors.
  • Supervisors read behind raters (i.e., rescore samples of their work).
  • Supervisors periodically receive statistical
    info.

19
Literature review
  • The most frequently examined rater effect is
    rater severity.
  • Bachman et al. (1995) developed a performance
    assessment measure of language speaking ability,
    the Language Ability Assessment System, and
    examined its reliability through generalizability
    theory and many-facet Rasch measurement. Test
    takers read passages and viewed recorded lectures
    in the target language, answered questions in
    writing and orally, and summarized the lectures
    orally. An analysis of the scores, raters, and
    the test itself found that individual raters and
    tasks could affect the estimation of a
    test-taker's language ability, and that a large
    range of severity existed in the raters'
    judgments of performance.
  • Kondo-Brown (2002) used many-facet Rasch analysis
    to investigate how judgments of trained teacher
    raters are biased towards certain types of
    candidates and certain criteria in assessing
    Japanese L2 writing. The study revealed
    significant differences in overall severity. The
    raters scored certain candidates and criteria
    more leniently or harshly, and every rater's bias
    pattern was different.
  • A similar rater severity effect was also found in
    other studies (e.g., Engelhard, 1994; Lumley &
    McNamara, 1995; Gyagenda & Engelhard, 1998; Eckes, 2005).

20
Literature review
  • Relationship between raters, domains, and
    students' language ability.
  • Gyagenda & Engelhard (1998) examined whether
    there were statistically significant differences
    in rater severity and domain difficulty, and
    explored the rater-by-domain interaction effect.
    Results indicated significant differences between
    raters and between domains, and a significant
    interaction effect between raters and domains.
  • Eckes (2008) advanced the hypothesis that
    experienced raters fall into types or classes
    that are clearly distinguishable from one another
    with respect to the importance they attach to
    scoring criteria. Many-facet Rasch analysis
    revealed that raters differed significantly in
    their views on the importance of the various
    criteria. Six rater types emerged from the
    analysis. Each of these types was characterized
    by a distinct scoring profile, indicating that
    raters were far from dividing their attention
    evenly among the set of criteria. Moreover, rater
    background variables were shown to partially
    account for the scoring profile differences.

21
Literature review
  • Effect of rater training in reducing rater
    effects.
  • Weigle (1998) conducted a study to explore
    differences in rater severity and consistency
    among inexperienced and experienced raters both
    before and after rater training. Analysis showed
    that the inexperienced raters tended to be both
    more severe and less consistent in their ratings
    than the experienced raters before training.
    After training, the differences between the two
    groups of raters were less pronounced; however,
    significant differences in severity were still
    found among raters, although consistency had
    improved for most raters, suggesting that rater
    training is more successful in improving
    intra-rater reliability than in getting raters to
    have better inter-rater reliability.
  • Lumley & McNamara (1995) investigated the effect
    of rater training in a speaking test. Data are
    presented from two rater training sessions
    separated by a 6-month interval and an
    intervening operational test administration
    session. Analyses showed that training could by
    no means eliminate the substantial variation in
    rater harshness, and suggested that the results
    of training may not endure for long after a
    training session.

22
Literature review
  • ???(2006)?????????????????????,??????????,????????
    ?????????????????????????????????????,??????,?????
    ?????????,?????,?????????????????????????????,???
    ??????????????
  • ???(2006)?????????????????????????????????????????
    ???????,???????????,??????????????,???????????????
    ??????????????????????,????????????????????
  • ??????(2008)????Rasch ????????????????????????????
    ??,??????????????????????????????????,????????????
  • ??(2008)?100?????????????????????????????,????????
    ??????????, ?????????????,???????????????
  • ??(2009)?????????????????????????????Rasch????????
    ????????????????,???????????????,?????????????????
    ??????

23
Research questions
  • Do raters differ in the levels of severity they
    exercise in their ratings, and if so, which
    raters rate more severely or leniently than
    the others?
  • Do raters effectively and consistently employ the
    rating scales in their ratings?
  • Do raters efficiently differentiate between
    traits, that is, do raters show any evidence of
    a halo effect?
  • Do raters exhibit bias in their ratings?
    Specifically, is rater severity invariant across
    the test takers?

24
Experimental design
25
Results and discussion
Facet map
26
Rater reliability
Rater statistics
27
Severity/Leniency
Rater statistics
  • Separation index = 7.71, showing that the severity
    of the raters can be divided into about 8 statistically
    distinct levels; formula: (4G + 1)/3, where G is the separation ratio.

(Myford & Wolfe, 2004)
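Reading 7.71 as the strata index H obtained from the separation ratio G through the formula above, the arithmetic works out as follows (a worked example, not part of the original slides):

```latex
H = \frac{4G + 1}{3} \quad\Longrightarrow\quad
G = \frac{3H - 1}{4} = \frac{3(7.71) - 1}{4} \approx 5.53,
```

i.e., the rater severity measures span roughly eight statistically distinct levels.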
28
Differential severity/leniency
Rater x trait statistics
??????????????????????????????,???????????????????
??????????????????????????????????????????????????
z???2,???????????????????????????z???-2,?????????
?????????????????
29
Accuracy/Randomness
Ratee statistics
  • ?Rasch???,???????????????????,???????????,????????
    ???????????
  • ????????,?????????????????????????????????????????
    ?,???????????????????????
  • ?????????????????????????????????????????????,????
    ??????????????????????????????????

30
Accuracy/Randomness
Rater statistics
???????????????????????????????????,??????????????
?????,????????
31
Centrality/Restriction-of-Range
????????????????????,?????????????????????????????
????????????????,?????????????????????????
32
Centrality/Restriction-of-Range
Ratee statistics
????????,??????????????????
33
Centrality/Restriction-of-Range
Trait statistics
  • ?????????????????????????????????????(???2SD)(McN
    amara 1996),????????1??????????????????,??????????
    ?????????(Myford Wolfe 2004)?
  • ???????????????(.84,Z-2.2)????1,?????????????????
    ?????

34
Centrality/Restriction-of-Range
????????,???48???3??????,????3????????????
35
Centrality/Restriction-of-Range
  • ??????????????4???????
  • ????,?????????????
  • ???????????????????
  • ???????????
  • ???????????????????,??????????
  • ???,????Rater 4???????????????

36
Centrality/Restriction-of-Range
Category statistics
  • ??????????,?????????(threshold)??????,????????????
    ??????????????????????????????????1 logit,?????4
    (Linacre 1999) .

37
Centrality/Restriction-of-Range
38
Halo effect
Category statistics
  • ????????????????????????(???????????????),????????
    ????????????????
  • ?????????????????????????????????,????????????????
    ???????

39
Halo effect
Rater statistics
  • ??????????????,?????????????,??????????????????1,?
    ????????????????,??????????????????
  • ????,???????????????,????????,?????????????????,??
    ??????????1?

40
Conclusion
  • ???????????????????,??????????????????????????????
    ???????????(Eckes 2008)?
  • ???Rasch???????????????????????????????,??????????
    ?????
  • ?????,????????????,???????????????????????????????
    ?,?Rater 4?????????????????????????,???????????

41
Implications
  • ?????Rasch?????????????????????????????????????,??
    ???????,??????????????????,?????????Rasch??,??????
    ??????,?????????,?????????????????,??????????????,
    ?????????????
  • ?????????????????????(Barrett 2001 Bonk Ockey
    2003 Elder et al. 2005),????????????????????,????
    ??(LeBel et al. Forthcoming Saito
    2008),?????????
  • ????????????(????????????)?????????????(Barrett
    2001),???????????,??????????????(Eckes 2008)?
  • ??????????????????????????????????????????????????
    ?????,????????????,?????????????????????(LeBel et
    al. Forthcoming),????????????????????????????
  • ??,?????????????????,????(??)??,??????????,???????
    ?,????????????????????????????????????????????????
    (e.g. Lumley McNamara 1995 Wolfe, Moudler
    Myford 2001),????????????????????????(Myford
    Wolfe 2004),??????????,???????????????

42
References
  • Bachman, L. F., Lynch, B. K. & Mason, M. 1995. Investigating variability in tasks and rater judgments in a performance test of foreign language speaking [J]. Language Testing 12: 238-257.
  • Barrett, S. 2001. The impact of training on rater variability [J]. International Education Journal 2: 49-58.
  • Bernardin, H. J. & Pence, E. C. 1980. Effects of rater training: Creating new response sets and decreasing accuracy [J]. Journal of Applied Psychology 65: 60-66.
  • Bock, R. D. 1997. A brief history of item response theory [J]. Educational Measurement: Issues and Practice 16: 21-33.
  • Bonk, W. J. & Ockey, G. J. 2003. A many-facet Rasch analysis of the second language group oral discussion task [J]. Language Testing 20(1): 89-110.
  • Brown, W. L., O'Gorman, K. & Du, Y. 1996. The reliability and validity of mathematics performance assessment [P]. Paper presented at the Annual Meeting of the American Educational Research Association, Minnesota.
  • Buu, Y.-P. 2003. Statistical analysis of rater effects [D]. Unpublished PhD thesis, University of Florida, Florida.
  • Cronbach, L. J. 1990. Essentials of Psychological Testing (5th ed.) [M]. New York: Harper & Row.
  • Eckes, T. 2005. Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis [J]. Language Assessment Quarterly 2(3): 197-221.
  • Eckes, T. 2008. Rater types in writing performance assessments: A classification approach to rater variability [J]. Language Testing 25: 155-185.
  • Elder, C., Knoch, U., Barkhuizen, G. & von Randow, J. 2005. Individual feedback to enhance rater training: Does it work? [J]. Language Assessment Quarterly 2: 175-196.
  • Engelhard, G., Jr. 1992. The measurement of writing ability with a many-faceted Rasch model [J]. Applied Measurement in Education 5(3): 171-191.
  • Engelhard, G., Jr. 1994. Examining rater errors in the assessment of written composition with a many-faceted Rasch model [J]. Journal of Educational Measurement 31(2): 93-112.

43
References
  • Gyagenda, I. S. & Engelhard, G., Jr. 1998. Applying the Rasch model to explore rater influences on the assessed quality of students' writing ability [P]. Paper presented at the Annual Meeting of the American Educational Research Association, San Diego.
  • Hedge, J. W. & Kavanagh, M. J. 1988. Improving the accuracy of performance evaluations: Comparison of three methods of performance appraiser training [J]. Journal of Applied Psychology 73: 68-73.
  • Johnson, V. E. & Albert, J. H. 1999. Ordinal Data Modeling [M]. New York: Springer-Verlag.
  • Kenny, D. A. & Kashy, D. A. 1992. Analysis of the multitrait-multimethod matrix by confirmatory factor analysis [J]. Psychological Bulletin 112: 165-172.
  • Kondo-Brown, K. 2002. A facets analysis of rater bias in measuring Japanese second language writing performance [J]. Language Testing 19(1): 3-31.
  • Kumar, D. D. 2005. Performance appraisal: The importance of rater training [J]. Journal of the Kuala Lumpur Royal Malaysia Police College 4: 1-17.
  • LeBel, T. J., Kilgus, S. P., Briesch, A. M. & Chafouleas, S. Forthcoming. The impact of training on the accuracy of teacher-completed direct behavior ratings [J]. Journal of Positive Behavior Interventions.
  • Linacre, J. M. 1994. Many-facet Rasch Measurement [M]. Chicago: MESA Press.
  • Linacre, J. M. 1999. Investigating rating scale category utility [J]. Journal of Outcome Measurement 3(2): 103-122.
  • Linacre, J. M. 2003. A User's Guide to Facets: Rasch-model Computer Program [M]. Chicago: MESA Press.

44
References
  • Lumley, T. & McNamara, T. F. 1995. Rater characteristics and rater bias: Implications for training [J]. Language Testing 12(1): 54-71.
  • Lumley, T. & Brown, A. 2005. Research methods in language testing [A]. In E. Hinkel (Ed.), Handbook of Research in Second Language Teaching and Learning [C]. Mahwah, NJ: Lawrence Erlbaum, 833-855.
  • Lynch, B. K. & McNamara, T. 1998. Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants [J]. Language Testing 15(2): 158-180.
  • McNamara, T. 1996. Measuring Second Language Performance [M]. London & New York: Longman.
  • Murphy, K. R. & Anhalt, R. L. 1992. Is halo error a property of the rater, ratees, or the specific behavior observed? [J]. Journal of Applied Psychology 77: 494-500.
  • Myford, C. M. & Mislevy, R. J. 1995. Monitoring and improving a portfolio assessment system [R] (No. MS 94-05). Princeton, NJ: Educational Testing Service.
  • Myford, C. M. & Wolfe, E. W. 2003. Detecting and measuring rater effects using many-facet Rasch measurement: Part I [J]. Journal of Applied Measurement 4(4): 386-422.
  • Myford, C. M. & Wolfe, E. W. 2004. Understanding Rasch measurement: Detecting and measuring rater effects using many-facet Rasch measurement: Part II [J]. Journal of Applied Measurement 5(2): 189-227.
  • Popham, W. J. 1990. Modern Educational Measurement: A Practitioner's Perspective [M]. Englewood Cliffs, NJ: Prentice Hall.
  • Saito, H. 2008. EFL classroom peer assessment: Training effects on rating and commenting [J]. Language Testing 25(4): 553-581.
  • Scullen, S. E., Mount, M. K. & Goff, M. 2000. Understanding the latent structure of job performance ratings [J]. Journal of Applied Psychology 85: 956-970.
  • Sykes, R. C., Ito, K. & Wang, Z. 2008. Effects of assigning raters to items [J]. Educational Measurement: Issues and Practice 27: 47-55.

45
References
  • Thorndike, R. L. & Hagen, E. P. 1977. Measurement and Evaluation in Psychology and Education [M]. New York: John Wiley and Sons.
  • Weigle, S. C. 1998. Using FACETS to model rater training effects [J]. Language Testing 15(2): 263-287.
  • Wolfe, E. W. & Chiu, C. W. T. 1997. Detecting rater effects with a multi-faceted rating scale model [P]. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago.
  • Wolfe, E. W. & Chiu, C. W. T. 1999. Measuring pretest-posttest change with a Rasch rating scale model [J]. Journal of Outcome Measurement 3(2): 134-161.
  • Wolfe, E. W., Moulder, B. M. & Myford, C. M. 2001. Detecting differential rater functioning over time (DRIFT) using a Rasch multi-faceted rating scale model [J]. Journal of Applied Measurement 2: 256-280.
  • Zhu, W., Ennis, C. D. & Ang, C. 1998. Many-faceted Rasch modeling expert judgment in test development [J]. Measurement in Physical Education and Exercise Science 2(1): 21-39.
  • ??? ??,2008,???Rasch ?????????????????(CET-SET)?
    ??? J????? (4) 388-398?
  • ???,2006,??????????????????????? D????????,
    ????????, ???
  • ??,2008,???Rasch????????????? D????????, ??????
    ???
  • ???,2006,??????????????? D,???????, ????, ???
  • ??,2009,?????????????????????? D????????,
    ????????, ???

46
Thank You !
Email: ahda_liu_at_yahoo.com.cn
Website: www.clal.org.cn/personal/jackliu