Rating quality in rater mediated language assessment: a systematic literature review


  • Muhamad Firdaus MohdNoh
  • Mohd Ewan Effendi MohdMatore
  • Niusila Faamanatu-Eteuati
  • Norhidayu Rosman


rating quality, rater-mediated assessment, rater variability, rating indicators, language assessment


How has academic research on rating quality evolved over the last decade? Researchers have actively contributed to this body of knowledge, particularly to keep educational assessment aligned with the latest research-based practices. This review therefore provides a bird’s-eye view of research on rating quality over the last ten years, focusing on the factors that influence raters’ rating quality in rater-mediated language assessment. The systematic literature review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines through three stages: identification, screening and eligibility. The search yielded 43 articles for thorough review, retrieved from two databases, Scopus and Web of Science (WoS). Five major factors emerged in response to the objective: rating experience, first language, rater training, familiarity and teaching experience. The analysis indicated that, except for rater training, these factors produced contradictory findings on raters’ rating quality. Only rater training was found to be successful in mitigating rater effects and improving rating quality in terms of rater variability, severity and reliability. Findings for the other factors were inconclusive as to whether they affect raters’ rating quality. Directions for future studies are also discussed, suggesting that more qualitative or mixed-method studies be included and reviewed using other possible techniques.








How to Cite

MohdNoh, M. F., MohdMatore, M. E. E., Faamanatu-Eteuati, N., & Rosman, N. (2021). Rating quality in rater mediated language assessment: a systematic literature review. The Journal of Contemporary Issues in Business and Government, 27(2), 6096–6116. Retrieved from https://cibgp.com/au/index.php/1323-6903/article/view/1500