Towards Reliable LLM Grading Through Self-Consistency and Selective Human Review: Higher Accuracy, Less Work

Luke Korthals, Emma Akrong, Gali Geller, Hannes Rosenbusch, Raoul Grasman, and Ingmar Visser

Machine Learning and Knowledge Extraction, 8(3), 74 (2026). https://doi.org/10.3390/make8030074

Abstract

Large language models (LLMs) show promise for grading open-ended assessments but still exhibit inconsistent accuracy, systematic biases, and limited reliability across assignments. To address these concerns, we introduce SURE (Selective Uncertainty-based Re-Evaluation), a human-in-the-loop pipeline that combines repeated LLM prompting, uncertainty-based flagging, and selective human regrading.

Three LLMs graded the answers of 46 students to 130 open questions and coding exercises across five assignments. Each student answer was scored 20 times to derive majority-voted predictions and self-consistency-based certainty estimates. We then simulated human regrading of low-certainty cases and evaluated the resulting grading quality and efficiency.
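
To make the certainty step concrete, the following minimal Python sketch shows one plausible way to derive a majority-voted prediction and a self-consistency certainty estimate from repeated scores, and to flag low-certainty answers for human regrading. The function name, example scores, and the 0.8 threshold are illustrative assumptions, not the paper's actual pipeline or tuned parameters.

```python
from collections import Counter

def majority_vote_with_certainty(scores):
    """Return the most frequent score and the fraction of votes it received.

    The vote fraction is a simple self-consistency certainty estimate:
    1.0 means all repeated promptings agreed; lower values signal disagreement.
    """
    counts = Counter(scores)
    score, votes = counts.most_common(1)[0]
    return score, votes / len(scores)

# Example: one student answer scored 20 times by the same LLM (made-up data).
repeated_scores = [2, 2, 2, 1, 2, 2, 2, 2, 3, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2]
predicted, certainty = majority_vote_with_certainty(repeated_scores)

# Flag low-certainty predictions for selective human regrading.
CERTAINTY_THRESHOLD = 0.8  # assumed cutoff, for illustration only
needs_human_review = certainty < CERTAINTY_THRESHOLD
print(predicted, round(certainty, 2), needs_human_review)  # 2 0.85 False
```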

Across assignments, single-prompt grading showed substantial underscoring, and majority voting reduced but did not eliminate this bias. Selective human regrading of low-certainty cases improved grading accuracy while reducing manual grading time by 40% to 90%. Ensemble-based grading most consistently approached human-level accuracy, with 70% to 90% of assignment grades falling inside human-grader ranges.