Mar 12, 2026, arXiv:2603.11957

CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Pranav Raikote, Korbinian Randl, Ioanna Miliou, Athanasios Lakes, Panagiotis Papapetrou

AI Summary

CHiL(L)Grader is a human-in-the-loop framework for automated short-answer grading that incorporates calibrated confidence estimation, using post-hoc temperature scaling and selective prediction. The system automates grading for high-confidence predictions and routes uncertain cases to human graders, adapting to evolving rubrics through continual learning. Experiments on three datasets show that CHiL(L)Grader achieves expert-level quality (QWK >= 0.80) on 35-65% of responses, with a QWK gap of 0.347 between accepted and rejected predictions, which supports the effectiveness of confidence-based routing.
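The quality threshold above is stated in terms of quadratic weighted kappa (QWK), an agreement measure for ordinal grades that penalizes disagreements by the squared distance between the two grades. The following is a minimal sketch of the standard QWK formula, not code from the paper; the function name and the assumption that grades are integers in `range(num_classes)` are illustrative.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK between two equal-length sequences of integer grades.

    Assumes grades are integers in range(num_classes) and num_classes >= 2.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")

    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))  # joint counts of (true, predicted)
    true_hist = Counter(y_true)              # marginal counts of true grades
    pred_hist = Counter(y_pred)              # marginal counts of predicted grades
    max_dist = (num_classes - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two graders were independent.
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n

    if weighted_expected == 0.0:
        # Both sides use a single identical grade; treated here as full
        # agreement (conventions differ for this degenerate case).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1.0 means perfect agreement, 0.0 means agreement no better than chance, and negative values mean systematic disagreement.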

Key Contribution

LLMs can automate a surprisingly large fraction of short-answer grading tasks (35-65%) to expert-level quality, provided you route the uncertain cases to human graders and continually learn from feedback.
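The routing idea can be sketched as a simple threshold on the model's confidence. This is an illustration under stated assumptions, not the paper's implementation: the function names, the `(response_id, grade, confidence)` tuple layout, and the default threshold of 0.9 are all hypothetical.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split predictions into auto-accepted grades and responses for human review.

    `predictions` is an iterable of (response_id, grade, confidence) triples.
    The threshold value is a placeholder; in practice it would be tuned on
    held-out data to reach a target quality level.
    """
    accepted, deferred = [], []
    for response_id, grade, confidence in predictions:
        if confidence >= threshold:
            accepted.append((response_id, grade))  # graded automatically
        else:
            deferred.append(response_id)           # sent to a human grader
    return accepted, deferred


def coverage(accepted, deferred):
    """Fraction of responses that were graded automatically."""
    total = len(accepted) + len(deferred)
    return len(accepted) / total if total else 0.0
```

For example, with predictions `[("r1", 2, 0.95), ("r2", 1, 0.60), ("r3", 0, 0.90)]`, responses `r1` and `r3` are accepted and `r2` is deferred. Raising the threshold lowers coverage and would be expected to raise the quality of the accepted set, provided the confidences are well calibrated.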

Abstract

Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
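Post-hoc temperature scaling divides a trained model's logits by a single scalar T, chosen to minimize negative log-likelihood on held-out data, so that the softmax confidences better match observed accuracy. The sketch below illustrates the general technique only. The abstract does not describe how the authors fit T, so the grid search here is a simplification swapped in for the gradient-based optimizers commonly used, and all function names and the grid range are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        scaled = [z / temperature for z in logits]
        peak = max(scaled)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total -= scaled[label] - log_norm  # minus log-softmax of the true class
    return total / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature with the lowest held-out NLL from a grid.

    Grid search is used for simplicity; the grid of 0.5 to 5.0 is an
    arbitrary illustrative choice.
    """
    if grid is None:
        grid = [0.5 + 0.05 * k for k in range(91)]
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))
```

As a usage example, a model that outputs logits `[4.0, 0.0]` on four validation items but is correct on only three of them is overconfident (about 98% confidence against 75% accuracy). `fit_temperature` then returns a value above 1, which softens the confidence toward 0.75. Scaling by a positive T does not change which class has the highest score, so accuracy is unaffected and only the confidence used for routing changes.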

Topics: Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
