Title: Test tasks for speaking
1 Test tasks for speaking: balancing between
authenticity and reliability
- Raili Hildén, University of Helsinki, Finland
- Raili.hilden_at_helsinki.fi
- TBLT 2009, Lancaster
- "Tasks: context, purpose and use", 3rd Biennial
International Conference on Task-Based Language
Teaching
- 13-16 September 2009
2 Background: HY-Talk project of speaking
assessment
- The project is funded by the University of
Helsinki
- Aim: to validate the illustrative scales of speaking
included in the national core curricula for
general education and upper secondary level by
trialing a prototype test of speaking.
- The subscales (overall task completion, fluency,
pronunciation, range and accuracy) are empirically
aligned to the relevant scales of the CEFR.
- http://blogs.helsinki.fi/hy-talk/
3 The conceptual framework
- Validity argumentation scheme for interpretation
of the HY-Talk project data (adapted from Kane,
2001; Fulcher & Davidson, 2007, 164-174;
Bachman, 2005)
- The claim to be probed:
- The illustrative scales of descriptors of oral
proficiency included in the national core
curricula for language education enable
sufficiently valid conclusions on students' oral
proficiency in general school education in
Finland.
4 The purpose of the HY-Talk study
- The validity claim is supported and challenged by
warrants and rebuttals regarding:
- relevance
- utility
- (intended consequences)
- sufficiency
5 Warrants
- The tasks used to elicit student performance
correspond to pedagogic tasks and target language
use tasks of students at the age of general
education. (utility)
- Reliability of assessments based on the scale and
the tasks to elicit performances is found to be
high enough. (sufficiency)
6 Backing to support the utility claim
- Rater and test taker feedback confirm the
perceived authenticity of the tasks and the
appropriateness of their administration.
- The level ratings correspond to the target levels
in the curricula.
7 Backing data to support the sufficiency claim
- Statistical reliability evidence confirms a
sufficient level of consistency across raters,
tasks, languages and interlocutors.
8 Counterclaims
- The tasks used to elicit student performance
correspond inadequately to pedagogic tasks or TLU
tasks of students. (utility)
- The link to the scale descriptors may be weak.
(utility)
- The level assignments do not match the target
levels set in the curricula.
- Reliability of assessments is not stable, but
varies too much across tasks, raters or
languages, or is caused by intervening variables
or an inadequate evidence base. (sufficiency)
9 Rebuttal data to challenge the utility claim
- Statistical evidence challenges the intended
utility of the tasks.
- Verbal data from students and teachers question
the utility and/or sufficiency of the tasks for
the purpose.
10 Research questions
- 1. How high is the inter-rater reliability of the
judgements?
- 2. How are the tasks and the corresponding salient
task features related to target level judgements,
assessment criteria and their combination?
(numeric data, analysed with Facets)
- 3. How are the tasks perceived by students and
raters? (verbal data based on feedback sheets and
audio-recorded rating sessions)
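The numeric analysis relies on Facets, a program that estimates many-facet Rasch models. The slides do not state the exact model specification, so what follows is only the generic rating-scale form, written with the kinds of facets this study mentions (students, tasks, raters, criteria). The symbol names are assumptions chosen for illustration.

```latex
\log\frac{P_{ntrck}}{P_{ntrc(k-1)}} = B_n - D_t - S_r - C_c - F_k
```

Here P_{ntrck} is the probability that student n receives level k on task t from rater r on criterion c, B_n is the student's ability, D_t the task difficulty, S_r the rater severity, C_c the criterion difficulty, and F_k the threshold between levels k-1 and k. All terms are expressed in logits, which is why rater differences can be reported on a logit scale.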
11 Speaking tasks
- Tasks were designed to reflect the average target
level specified for good mastery of the syllabus
- English (grade 7: A1.3; grade 1: A2.2)
- German etc. (grade 7: A1.2; grade 1: A2.1)
- They also draw on the thematic content of the
curricula
- Discussed, revised and piloted by the project
group
12 Prototype tasks (with examples)
- 1. Presentation (A2.2): partly controlled
monologue
- 2. Everyday life (A2.1-A2.2): rigidly controlled
dialogues
- At the airport, grade 7
- At home, grade 7
- Accommodation, grade 1
- On the way home, grade 1
- 3. Negotiation: partly controlled dialogue,
"Planning an outing" (A2.1-B1.1)
13 Speaking tasks
- Prompts in L1
- Time on task: 10-15 min
- Conducted in pairs
- Rated by 5-10 language experts
14 Data of this study
- Speech samples in English (56)
- Speech samples in German (66)
15 Facets examined in this study
- Raters (5 English, 7 German)
- Tasks 1-4
- Task dimensions
- Overall task performance
- Fluency
- Pronunciation
- Range
- Accuracy
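One way to picture these facets is as one record per single rating, so that every judgement carries its rater, task and criterion. The sketch below is a hypothetical layout, not the project's actual data format: the field names, the helper `mean_by` and all example values are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rating:
    student: str
    rater: str
    task: str
    criterion: str
    level: int  # level awarded on the rating scale


# Invented example ratings, NOT data from the study
ratings = [
    Rating("S01", "R1", "Presentation", "Fluency", 6),
    Rating("S01", "R2", "Presentation", "Fluency", 7),
    Rating("S01", "R1", "Presentation", "Range", 5),
    Rating("S01", "R2", "Presentation", "Range", 6),
]


def mean_by(records, facet):
    """Average the awarded levels for each element of one facet."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, facet)].append(record.level)
    return {key: mean(levels) for key, levels in groups.items()}


rater_means = mean_by(ratings, "rater")
criterion_means = mean_by(ratings, "criterion")
print(rater_means)
print(criterion_means)
```

Raw means of this kind are only a first look. The fair averages reported in the results come from the Facets analysis, which adjusts each element for the other facets.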
16 Results RQ1, English samples: overall inter-rater
agreement
- The majority of all ratings were placed between
levels 5 and 6 (CEFR A2-B1)
- Across all facets and raters, the distance between
the most severe and the most lenient rater was 1
logit (levels 5/6)
- Average of ratings given by R4: 6.66
- Average of ratings given by R1: 5.87
- For a more detailed record, please contact the
author.
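The gap between the two reported rater averages can be worked out directly. This minimal sketch uses only the two values given on this slide and assumes they belong to the most lenient and the most severe rater. The result is in scale levels, not logits, so it is not the same quantity as the 1-logit distance reported by Facets.

```python
# Rater averages as reported on the slide (English samples)
reported_averages = {"R4": 6.66, "R1": 5.87}

# A higher average means a more lenient rater
most_lenient = max(reported_averages, key=reported_averages.get)
most_severe = min(reported_averages, key=reported_averages.get)

gap = round(reported_averages[most_lenient] - reported_averages[most_severe], 2)
print(f"{most_lenient} - {most_severe} = {gap} scale levels")
# prints "R4 - R1 = 0.79 scale levels"
```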
17 Results RQ1, English samples: overall task
difficulty
- The easiest task
- Presentation was assigned the highest fair
average of 6.29
- The trickiest task
- The Everyday life task "Accommodation" was assigned
the lowest fair average of 6.21
18 Results RQ1, English samples: criteria
- The easiest criterion
- Pronunciation (fair average 6.39)
- The trickiest criterion
- Range (fair average 6.02)
19 Results RQ1, English samples: combined difficulty,
task and criteria
- The easiest combination
- Presentation / Accuracy
- Presentation / Fluency
- The trickiest combination
- Everyday situation (Accommodation) / Range
20 Results RQ1, German samples: overall inter-rater
agreement
- The majority of all ratings were placed between
levels 5-6/10 (CEFR A2-B1)
- Across all facets and raters, the distance
between the most severe and the most lenient
rater was 1 logit (levels 5-6)
- Average of ratings given by R6: 3.96/10
- Average of ratings given by R2: 3.57/10
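To give a sense of what a 1-logit distance between raters means: logits are log-odds, so in a Rasch-type model a difference of d logits multiplies the odds of awarding the higher of two adjacent levels by exp(d). The sketch below is a generic illustration of that arithmetic, not part of the project's analysis. The function names are hypothetical.

```python
import math


def odds_ratio(logit_gap: float) -> float:
    """Odds multiplier implied by a difference in logits."""
    return math.exp(logit_gap)


def probability(logit: float) -> float:
    """Convert log-odds to a probability (logistic function)."""
    return 1.0 / (1.0 + math.exp(-logit))


print(round(odds_ratio(1.0), 2))    # 2.72
print(round(probability(0.0), 2))   # 0.5
print(round(probability(1.0), 2))   # 0.73
```

Read this way, a performance with an even chance of receiving the higher of two adjacent levels from the more severe rater would have roughly a 73% chance with a rater who is 1 logit more lenient, all else being equal.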
21 Results RQ1, German samples: overall task
difficulty
- The easiest task
- The Presentation task was assigned the highest fair
average of 4.21/10
- The trickiest task
- The Everyday life task "On the way home" was
assigned the lowest fair average of 3.57/10
22 Results RQ1, German samples: criteria
- The easiest criterion: Pronunciation (fair
average 4.24/10)
- The trickiest criterion: Range (fair average
3.49/10)
23 Results RQ1, German samples: combined difficulty,
task and criteria
- The easiest combination
- Presentation / Pronunciation (level 6, B1.1)
- The trickiest combination
- Negotiation (Planning an outing) / Range (level 5,
A2.2, lower band)
24 RQ2: English and German
- The tasks were conceived as authentic in regard
to themes and situations
- Authenticity (Bachman & Palmer, 1996) was
questioned by raters during the sessions due to
the high degree of control imposed by the L1
prompts (used to increase reliability)
- Students regarded the tasks as relevant and
highly probable in real life.
- The raters of German discussed the interlocutor
impact of the pair setting as a biasing factor.
- The results suggest that the target level
requirements set in the Finnish curricula are
attained reasonably well.
25 Discussion
- The utility claim was confirmed with regard to the
high level of rater agreement across facets
(reliability)
- Sufficiency and relevance were partly questioned
due to the claimed inauthenticity of the tasks
(rigidity of instructions)
- How should this dilemma be addressed in future
versions of the test?
26 References
- Bachman, L. F. (2005). Building and supporting a
case for test use. Language Assessment Quarterly,
2(1), 1-34.
- Fulcher, G. & Davidson, F. (2007). Language
Testing and Assessment: An advanced resource
book. Abingdon & New York: Routledge.
- Hildén, R. & Takala, S. (2007). Relating
Descriptors of the Finnish School Scale to the
CEF Overall Scales for Communicative Activities.
In Koskensalo, A., Smeds, J., Kaikkonen,
P. & Kohonen, V. (Eds.), Foreign languages and
multicultural perspectives in the European
context / Fremdsprachen und multikulturelle
Perspektiven im europäischen Kontext. Dichtung,
Wahrheit und Sprache (pp. 73-88). LIT-Verlag.
27 Bibliography
- National Core Curriculum for the Comprehensive
School 2004. Helsinki: Finnish National Board of
Education. In Finnish: http://www.oph.fi/info/ops/
- National Core Curriculum for the Upper Secondary
Level 2003. Helsinki: Finnish National Board of
Education. In Finnish:
http://www.oph.fi/pageLast.asp?path1,17627,1830,23059
- Kane, M. T. (2001). Current concerns in validity
theory. Journal of Educational Measurement,
38(4), 319-342.
28 Thank you!
raili.hilden_at_helsinki.fi