Rating quality in rater-mediated language assessment: a systematic literature review
Keywords:
rating quality, rater-mediated assessment, rater variability, rating indicators, language assessment
Abstract
How has academic research on rating quality evolved over the last decade? Previous researchers have actively contributed to the development of knowledge in this area, particularly in ensuring that educational assessment keeps pace with the latest research-based practices. This review therefore provides a bird’s-eye view of research on rating quality over the last ten years, focusing on factors that influence raters’ rating quality within the context of rater-mediated language assessment. The systematic literature review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines across three stages, namely identification, screening and eligibility. The search process yielded 43 articles for thorough review, retrieved from two major databases, Scopus and Web of Science (WoS). Five major factors emerged in response to the objective: rating experience, first language, rater training, familiarity and teaching experience. The analysis indicated that these factors lead to contradictory findings regarding raters’ rating quality, with the exception of rater training. Only rater training was shown to be successful in mitigating rater effects and improving raters’ rating quality in terms of variability, severity and reliability. Findings for the other factors remained inconclusive as to whether they have any impact on raters’ rating quality. Directions for future studies are also discussed, including the suggestion that more qualitative or mixed-methods studies be conducted and that reviews draw on other possible techniques.
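As a rough illustration of the rating-quality indicators named above (variability, severity and reliability), the sketch below computes naive raw-score analogues from a hypothetical rating matrix. The rater labels, scores, and summary statistics are invented for illustration only; the studies synthesised in this review typically estimate these properties with many-facet Rasch measurement (e.g., FACETS) rather than with raw-score summaries.

```python
# A minimal sketch (not the reviewed studies' procedure) of three rating-quality
# indicators: rater severity, rater variability, and inter-rater reliability,
# computed from a hypothetical matrix of scores (rows = raters, columns = examinees).
import statistics

# Hypothetical ratings: 4 raters scoring the same 6 examinees on a 1-6 band scale.
ratings = [
    [4, 5, 3, 6, 2, 4],   # rater A
    [3, 5, 3, 5, 2, 4],   # rater B
    [5, 6, 4, 6, 3, 5],   # rater C (noticeably more lenient)
    [4, 5, 3, 6, 2, 3],   # rater D
]

overall_mean = statistics.mean(score for row in ratings for score in row)

for label, row in zip("ABCD", ratings):
    severity = statistics.mean(row) - overall_mean   # positive = more lenient than average
    variability = statistics.stdev(row)               # spread of this rater's own scores
    print(f"Rater {label}: severity={severity:+.2f}, variability={variability:.2f}")

# Reliability is summarised here as the mean pairwise Pearson correlation between
# raters' score vectors -- a crude stand-in for Rasch-based consistency indices.
def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / ((len(x) - 1) * statistics.stdev(x) * statistics.stdev(y))

pairs = [(i, j) for i in range(len(ratings)) for j in range(i + 1, len(ratings))]
inter_rater = statistics.mean(pearson(ratings[i], ratings[j]) for i, j in pairs)
print(f"Mean pairwise inter-rater correlation: {inter_rater:.2f}")
```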
References
Ahmadi Shirazi, M. (2019). For a greater good: Bias analysis in writing assessment. SAGE Open, 9(1), 1–14. https://doi.org/10.1177/2158244018822377
Ang-Aw, H. T., & Goh, C. C. M. (2011). Understanding discrepancies in rater judgement on national-level oral examination tasks. RELC Journal, 42(1), 31–51. https://doi.org/10.1177/0033688210390226
Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99–115. https://doi.org/10.1177/0265532215582283
Barkaoui, K. (2010a). Do ESL essay raters’ evaluation criteria change with experience? A mixed-methods, cross-sectional study. TESOL Quarterly, 44(1), 31–57. https://doi.org/10.5054/tq.2010.214047
Barkaoui, K. (2010b). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54–74. https://doi.org/10.1080/15434300903464418
Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy and Practice, 18(3), 279–293. https://doi.org/10.1080/0969594X.2010.526585
Bijani, H. (2018). Investigating the validity of oral assessment rater training program: A mixed-methods study of raters’ perceptions and attitudes before and after training. Cogent Education, 5(1), 1–20. https://doi.org/10.1080/2331186X.2018.1460901
Bijani, H. (2019). Evaluating the effectiveness of the training program on direct and semi-direct oral proficiency assessment: A case of multifaceted Rasch analysis. Cogent Education, 6(1). https://doi.org/10.1080/2331186X.2019.1670592
Bijani, H., & Khabiri, M. (2017). Investigating the effect of training on raters’ bias toward test takers in oral proficiency assessment: A FACETS analysis. Journal of Asia TEFL, 14(4), 687–702. https://doi.org/10.18823/asiatefl.2017.14.4.7.687
Carey, M. D., Mannell, R. H., & Dunn, P. K. (2011). Does a rater’s familiarity with a candidate’s pronunciation affect the rating in oral proficiency interviews? Language Testing, 28(2), 201–219. https://doi.org/10.1177/0265532210393704
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117–135. https://doi.org/10.1177/0265532215582282
Duijm, K., Schoonen, R., & Hulstijn, J. H. (2017). Professional and non-professional raters’ responsiveness to fluency and accuracy in L2 speech: An experimental approach. Language Testing, 35(4), 501–527. https://doi.org/10.1177/0265532217712553
Eckstein, G. (2018). Assessment of L2 student writing: Does teacher disciplinary background matter? Journal of Writing Research, 10(1), 1–23.
Engelhard, G., & Wind, S. A. (2018). Invariant Measurement with Raters and Rating Scales: Rasch Models for Rater-Mediated Assessments. New York & London: Routledge.
Fleming, P. S., Koletsi, D., & Pandis, N. (2014). Blinded by PRISMA: Are systematic reviewers focusing on PRISMA and ignoring other guidelines? PLoS ONE, 9(5). https://doi.org/10.1371/journal.pone.0096407
Han, Q. (2016). Rater cognition in L2 speaking assessment: A review of the literature. Teachers College, Columbia University Working Papers in TESOL & Applied Linguistics, 16(1), 1–24. https://doi.org/10.7916/D8MS45DH
He, T.-H., Gou, W. J., Chien, Y.-C., Chen, I.-S. J., & Chang, S.-M. (2013). Multi-faceted Rasch measurement and bias patterns in EFL writing performance assessment. Psychological Reports, 112(2), 469–485. https://doi.org/10.2466/03.11.PR0.112.2.469-485
Hijikata-Someya, Y., Ono, M., & Yamanishi, H. (2015). Evaluation by native and non-native English teacher-raters of Japanese students’ summaries. English Language Teaching, 8(7), 1–12. https://doi.org/10.5539/elt.v8n7p1
Hsieh, C.-N. (2011). Rater effects in ITA testing: ESL teachers’ versus American undergraduates’ judgments of accentedness, comprehensibility, and oral proficiency. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 9, 47–74. Retrieved from https://michiganassessment.org/wp-content/uploads/2014/12/Spaan_V9_FULL.pdf#page=55
Huang, B., Alegre, A., & Eisenberg, A. (2016). A cross-linguistic investigation of the effect of raters’ accent familiarity on speaking assessment. Language Assessment Quarterly, 13(1), 25–41. https://doi.org/10.1080/15434303.2015.1134540
Huang, B. H. (2013). The effects of accent familiarity and language teaching experience on raters’ judgments of non-native speech. System, 41(3), 770–785. https://doi.org/10.1016/j.system.2013.07.009
Huang, B. H., & Jun, S. A. (2014). Age matters and so may raters: Rater differences in the assessment of foreign accents. Studies in Second Language Acquisition, 37(4), 623–650. https://doi.org/10.1017/S0272263114000576
Huang, L., Kubelec, S., Keng, N., & Hsu, L. (2018). Evaluating CEFR rater performance through the analysis of spoken learner corpora. Language Testing in Asia, 8(1), 1–17. https://doi.org/10.1186/s40468-018-0069-0
Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10(2), 135–159. https://doi.org/10.1080/15434303.2013.769545
Kang, H. S., & Veitch, H. (2017). Mainstream teacher candidates’ perspectives on ESL writing: The effects of writer identity and rater background. TESOL Quarterly, 51(2), 249–274. https://doi.org/10.1002/tesq.289
Kang, O. (2012). Impact of rater characteristics and prosodic features of speaker accentedness on ratings of international teaching assistants’ oral performance. Language Assessment Quarterly, 9(3), 249–269. https://doi.org/10.1080/15434303.2011.642631
Kang, O., Rubin, D., & Kermad, A. (2019). The effect of training and rater differences on oral proficiency assessment. Language Testing, 36(4), 481–504. https://doi.org/10.1177/0265532219849522
Kim, H. J. (2015). A qualitative analysis of rater behavior on an L2 speaking assessment. Language Assessment Quarterly, 12(3), 239–261. https://doi.org/10.1080/15434303.2015.1049353
Leckie, G., & Baird, J. A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48(4), 399–418. https://doi.org/10.1111/j.1745-3984.2011.00152.x
Lee, K. R. (2016). Diversity among NEST raters: How do new and experienced NESTs evaluate Korean English learners’ essays? Asia-Pacific Education Researcher, 25(4), 549–558. https://doi.org/10.1007/s40299-016-0281-6
Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543–560. https://doi.org/10.1177/0265532211406422
Marefat, F., & Heydari, M. (2016). Native and Iranian teachers’ perceptions and evaluation of Iranian students’ English essays. Assessing Writing, 27, 24–36. https://doi.org/10.1016/j.asw.2015.10.001
McInnes, M. D. F., Moher, D., Thombs, B. D., McGrath, T. A., Bossuyt, P. M., Clifford, T., … Willis, B. H. (2018). Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: The PRISMA-DTA statement. JAMA - Journal of the American Medical Association, 319(4), 388–396. https://doi.org/10.1001/jama.2017.19163
Petticrew, M., & Roberts, H. (2006). Systematic Reviews in the Social Sciences: A Practical Guide. Malden: Blackwell Publishing. https://doi.org/10.1002/9780470754887
Şahan, Ö., & Razı, S. (2020). Do experience and text quality matter for raters’ decision-making behaviors? Language Testing. https://doi.org/10.1177/0265532219900228
Saito, K., & Shintani, N. (2016). Foreign accentedness revisited: Canadian and Singaporean raters’ perception of Japanese-accented English. Language Awareness, 25(4), 305–317. https://doi.org/10.1080/09658416.2016.1229784
Sandlund, E., & Sundqvist, P. (2016). Equity in L2 English oral assessment: Criterion-based facts or works of fiction? NJES Nordic Journal of English Studies, 15(2), 113–131. https://doi.org/10.35360/njes.365
Schmid, M. S., & Hopp, H. (2014). Comparing foreign accent in L1 attrition and L2 acquisition: Range and rater effects. Language Testing, 31(3), 367–388. https://doi.org/10.1177/0265532214526175
Seker, M. (2018). Intervention in teachers’ differential scoring judgments in assessing L2 writing through communities of assessment practice. Studies in Educational Evaluation, 59(December 2017), 209–217. https://doi.org/10.1016/j.stueduc.2018.08.003
Stassenko, I., Skopinskaja, L., & Liiv, S. (2014). Investigating cultural variability in rater judgements of oral proficiency interviews. Eesti Rakenduslingvistika Ühingu Aastaraamat, (10), 269–281. https://doi.org/10.5128/ERYa10.17
Tajeddin, Z., & Alemi, M. (2014). Criteria and bias in native English teachers’ assessment of L2 pragmatic appropriacy: Content and FACETS analyses. Asia-Pacific Education Researcher, 23(3), 425–434. https://doi.org/10.1007/s40299-013-0118-5
Tanriverdi-Koksal, F., & Ortactepe, D. (2017). Raters’ knowledge of students’ proficiency levels as a source of measurement error in oral assessments. Hacettepe University Journal of Education, 32(3), 1–19. https://doi.org/10.16986/HUJE.2017027583
Wei, J., & Llosa, L. (2015). Investigating differences between American and Indian raters in assessing TOEFL iBT speaking tasks. Language Assessment Quarterly, 12(3), 283–304. https://doi.org/10.1080/15434303.2015.1037446
Wester, M., & Mayo, C. (2014). Accent rating by native and non-native listeners. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7749–7753).
Wikse Barrow, C., Nilsson Björkenstam, K., & Strömbergsson, S. (2019). Subjective ratings of age-of-acquisition: Exploring issues of validity and rater reliability. Journal of Child Language, 46(2), 199–213. https://doi.org/10.1017/S0305000918000363
Wind, S. A., & Peterson, M. E. (2017). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35(2), 161–192. https://doi.org/10.1177/0265532216686999
Winke, P., Gass, S., & Myford, C. (2013). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231–252. https://doi.org/10.1177/0265532212456968
Xi, X., & Mollaun, P. (2009). How do raters from India perform in scoring the TOEFL iBT speaking section and what kind of training helps? ETS Research Report Series, 2009(2), i–37. https://doi.org/10.1002/j.2333-8504.2009.tb02188.x
Xi, X., & Mollaun, P. (2011). Using raters from India to score a large-scale speaking test. Language Learning, 61(4), 1222–1255. https://doi.org/10.1111/j.1467-9922.2011.00667.x
Zhang, Y., & Elder, C. (2014). Investigating native and non-native English-speaking teacher raters’ judgements of oral proficiency in the College English Test-Spoken English Test (CET-SET). Assessment in Education: Principles, Policy & Practice, 21(3), 306–325. https://doi.org/10.1080/0969594X.2013.845547