Abstract
Teachers in higher education spend considerable time grading assignments rather than tutoring students. Large language models (LLMs) could address this by generating human-like grades and feedback on assignments, but reports of their practical application remain scarce.
We wrote Python code to integrate the Canvas learning management system with an API for GDPR-compliant LLM access. We used this AI system to grade and provide feedback on weekly assignments completed by graduate students enrolled in an introductory programming course.
LLM grading was fast, cost-efficient, and relatively accurate: GPT-4o and human graders assigned identical grades to about 80% of 6,345 evaluated student answers, and human and LLM grades were positively correlated across assignments. Students generally rated the feedback positively, but their responses also showed that lower preliminary LLM grades could provoke frustration and anxiety.