Title: Sèvres and Munich benchmarking conferences

1. Sèvres and Munich benchmarking conferences: some personal observations
- Dr Neil Jones
- ALTE Conference, Cardiff, November 2005
2. A report on the Sèvres analysis
- Is available at http://www.coe.int/T/E/Cultural_Co-operation/education/Languages/Language_Policy/Common_Framework_of_Reference/SevresreportNJ.pdf
- (Follow links for CEFR, then click on "Illustrations of the European levels of language proficiency", then on "Report on analysis of rating data")
- The following questions are addressed:
- 1. What is the best estimate of the CEFR level of each extract?
- 2. How well do raters agree in their ratings? (a minimal agreement computation is sketched after this list)
- 3. What is the effect of plenary discussion on the extent of agreement?
- 4. How do raters understand and use the rating criteria?
- 5. Does agreement improve over time?
- 6. Do rater groups perform differently?
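To make question 2 concrete: one simple measure is the proportion of rater pairs who assign an extract the same CEFR level, optionally relaxed to agreement within one level. The sketch below is my own illustration with invented ratings, not the Sèvres or Munich data.

```python
from itertools import combinations

# Hypothetical CEFR ratings: for each extract, the level assigned by
# each of four raters. Invented for illustration only.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
ratings = {
    "extract_1": ["B1", "B1", "B2", "B1"],
    "extract_2": ["A2", "B1", "A2", "A2"],
    "extract_3": ["C1", "B2", "C1", "C2"],
}

def pairwise_agreement(levels, tolerance=0):
    """Share of rater pairs whose ratings differ by at most `tolerance` levels."""
    idx = [LEVELS.index(level) for level in levels]
    pairs = list(combinations(idx, 2))
    return sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)

for name, levels in ratings.items():
    print(f"{name}: exact = {pairwise_agreement(levels):.2f}, "
          f"within one level = {pairwise_agreement(levels, tolerance=1):.2f}")
```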
3-5. A benchmarking conference
[Diagram, built up across three slides. Elements: the learner's true ability; a sample of performance; raters, each with their own constructs; guidance; a shared construct (the CEFR); learners ranked on shared criteria and their CEFR levels agreed.]
6. Issues
- What formats can work well for benchmarking?
- How can/should a benchmarking exemplar differ from a standardisation video for a given exam?
- Which parts of the CEFR reference scales are most useful for rating?
- Is developing a shared understanding of the construct exactly the same thing as standardising raters' agreement on CEFR levels?
7. Using the CEFR scales: evidence from analysis of data
- The speaking assessment criteria are not differentiated.
- A generalizability study from Sèvres
- Munich data not yet analysed
8. Generalizability study (rating criteria are not differentiated)
Rating criteria: Range, Accuracy, Fluency, Interaction, Coherence, Global rating
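As a hedged sketch of the machinery behind such a generalizability study (the persons × criteria design and all numbers below are my own assumptions, not the Sèvres data), variance components for persons, criteria, and residual can be estimated from a crossed ratings table; undifferentiated criteria show up as a small criterion component relative to the residual:

```python
import numpy as np

# Hypothetical persons-by-criteria score matrix (rows: speakers; columns:
# Range, Accuracy, Fluency, Interaction, Coherence, Global). Invented data.
scores = np.array([
    [3, 2, 3, 3, 3, 3],
    [4, 4, 4, 4, 4, 4],
    [2, 2, 3, 2, 2, 2],
    [5, 4, 5, 5, 5, 5],
    [3, 3, 3, 4, 3, 3],
], dtype=float)

n_p, n_c = scores.shape
grand = scores.mean()
person_means = scores.mean(axis=1)
crit_means = scores.mean(axis=0)

# Two-way crossed ANOVA with one observation per cell.
ss_p = n_c * ((person_means - grand) ** 2).sum()
ss_c = n_p * ((crit_means - grand) ** 2).sum()
residual = scores - person_means[:, None] - crit_means[None, :] + grand
ss_res = (residual ** 2).sum()

ms_p = ss_p / (n_p - 1)
ms_c = ss_c / (n_c - 1)
ms_res = ss_res / ((n_p - 1) * (n_c - 1))

# Expected-mean-square estimators of the variance components.
var_res = ms_res
var_p = max((ms_p - ms_res) / n_c, 0.0)   # persons (true-score variance)
var_c = max((ms_c - ms_res) / n_p, 0.0)   # criteria (severity differences)

print(f"persons: {var_p:.3f}, criteria: {var_c:.3f}, residual: {var_res:.3f}")
```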
9. So what are raters really doing?
- First form an overall impression of the level.
- (This was the procedure adopted in Munich.)
- Then look at the criteria to confirm/rationalise the decision.
- The criteria are generally not concrete enough to differentiate between specific performances.
- Yes, raters do judge some criteria more harshly than others.
10. Relative difficulty of rating criteria
11. Relative difficulty of rating criteria
- Raters do judge some criteria more harshly than others, but they do the same for everybody!
- Munich data not analysed, but it seemed very similar in this respect
- Should the accuracy scale be adjusted down and the fluency/interaction scales adjusted up? (a way of formalising this is sketched below)
- Perhaps this would not help: penalizing error and rewarding communication is part of feeling comfortable about our overall decision.
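A standard way to formalise this kind of criterion harshness (my framing, not something the slides specify) is a many-facet Rasch model, in which a severity term for each criterion shifts the whole scale by the same amount for every candidate, matching the observation that harsh criteria are harsh "for everybody":

$$\log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \sigma_j - \tau_k$$

where $\theta_n$ is the ability of speaker $n$, $\delta_i$ the difficulty of extract $i$, $\sigma_j$ the severity of criterion $j$, and $\tau_k$ the threshold between adjacent score categories. In this framing, "adjusting the accuracy scale down" amounts to estimating a larger $\sigma$ for Accuracy and smaller ones for Fluency and Interaction, rather than retraining the raters.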
12. Focus on salience
- When we form an overall impression of the level of a performance, what are we focussing on?
- The salient features of the level: what distinguishes it from a higher or lower level
- (an exercise based on this in Munich)
- Fluency is a scale with one point on it described as "fluent".
- My attempt at a minimal level description
13. NJ's minimal table of salient features
14. NJ's even shorter list
15. Do all rating criteria have the same status?
- Range appears to be linked to the tasks one can do; it is almost a definition of a functionally-defined proficiency scale.
- (Hence the problems with format, if the task is below a subject's level.)
- Interaction and coherence are dependent on the task: you can't demonstrate more than the task demands.
- Fluency and accuracy are inversely related to the demands of a task.
- But there may be a trade-off between fluency and accuracy.
16. Simple model of speaking performance
Subject at same level as task: even profile
17. Simple model of speaking performance
Subject at higher level than task: can't show true range, interaction, coherence, but good accuracy, fluency
18. Simple model of speaking performance
Subject below level of task: shows true range, interaction, coherence; gives poor impression of accuracy, fluency
19. Simple model of speaking performance
But may manage greater accuracy at expense of fluency
20. Simple model of speaking performance
Or vice versa
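The model in slides 16-20 can be rendered as a toy function. The arithmetic below is invented for illustration; only the qualitative behaviour (task-capped criteria, strain on fluency/accuracy, and the trade-off knob) follows the slides.

```python
# Toy rendering of the "simple model of speaking performance".

def observed_profile(subject_level: float, task_level: float,
                     accuracy_bias: float = 0.0) -> dict:
    """Predict per-criterion impressions on a rough 1-6 level scale.

    accuracy_bias > 0: the speaker slows down to be more accurate;
    accuracy_bias < 0: favours fluency at the expense of accuracy.
    """
    # Range, interaction, coherence: cannot exceed what the task demands.
    task_dependent = min(subject_level, task_level)
    # Fluency, accuracy: degraded when the task sits above the subject.
    strain = max(task_level - subject_level, 0.0)
    return {
        "range": task_dependent,
        "interaction": task_dependent,
        "coherence": task_dependent,
        "fluency": subject_level - strain - accuracy_bias,
        "accuracy": subject_level - strain + accuracy_bias,
    }

# Subject above the task (slide 17): range/interaction/coherence capped.
print(observed_profile(subject_level=5, task_level=3))
# Subject below the task (slide 18): fluency/accuracy suffer.
print(observed_profile(subject_level=3, task_level=5))
# Trading fluency for accuracy (slide 19).
print(observed_profile(subject_level=3, task_level=5, accuracy_bias=0.5))
```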
21. My personal opinion
- One should aim for a shared understanding of the construct, which includes a shared awareness of how rating works.
- Spending more time comparing performances (rather than rating them) would help.
- It's vital to grasp the salient features of a level.
- The CEFR has much useful text about this.
- Detailed study of the text of the rating criteria may not be the best way of standardising perceptions.
23. Some other analysis results
- Degree of agreement before and after discussion
- Performance by rater group
24. Agreement before and after discussion
25. Performance by rater group