Sunway College Kathmandu | Apr 29, 2026 | arXiv:2604.26607

Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

Jatin Bhusal, Nancy Mahatha, Aayush Acharya, Raunak Regmi

AI Summary

This paper introduces a human-in-the-loop benchmarking framework for evaluating LLMs on automated competency assessment in secondary-level mathematics, using a multi-dimensional rubric based on Nepal's Grade 10 curriculum. The authors benchmark open-weight models (Eagle, Orion) and proprietary models (Nova, Lyra) against a ground truth defined by mathematics faculty. The key finding is that architectural compatibility with instruction constraints outweighs raw parameter scale in rubric-constrained tasks: the Gemini-based MoE models reached "Fair Agreement", while the larger Orion model showed "No Agreement".

Key Contribution

Bigger isn't always better: in rubric-constrained math assessments, architectural compliance with instruction constraints matters more than parameter scale, as demonstrated by a 70B model showing "No Agreement" where the Gemini-based MoE models reached "Fair Agreement".

Abstract

As Competency-Based Education (CBE) gains traction around the world, the shift from marks-based assessment to qualitative competency mapping remains a largely manual burden for educators. This paper addresses that bottleneck by proposing a "Human-in-the-Loop" benchmarking framework to assess how effectively multiple LLMs can automate secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we created a multi-dimensional rubric covering four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble, consisting of the open-weight models Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) and the proprietary frontier models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro), was benchmarked against a ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652). The findings show a marked "Architecture-compatibility gap". While the Gemini-based Mixture-of-Experts (Sparse MoE) models achieved "Fair Agreement" (kappa_w ~ 0.38), the larger Orion (70B) model exhibited "No Agreement" (kappa_w = -0.0261), suggesting that architectural compliance with instruction constraints outweighs raw parameter scale in rubric-constrained tasks. We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a "Human-in-the-Loop" framework.
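The agreement figures quoted above are weighted Cohen's kappa values. As a minimal sketch of how such a statistic can be computed for two raters scoring the same items on an ordinal rubric, the Python below may help. Several things here are assumptions rather than details from the abstract: the weighting scheme (quadratic by default, linear as an option), the function names `weighted_kappa` and `agreement_label`, and the interpretation bands, which follow the commonly used Landis and Koch cut-offs with negative values named "No Agreement" to match the wording used here.

```python
def weighted_kappa(rater_a, rater_b, categories, weighting="quadratic"):
    """Weighted Cohen's kappa for two raters on an ordinal scale.

    `categories` lists the rubric levels in ascending order.
    The weighting scheme is an assumption; the abstract does not state it.
    """
    k = len(categories)
    n = len(rater_a)
    if k < 2 or n == 0 or n != len(rater_b):
        raise ValueError("need >= 2 categories and equal-length, non-empty ratings")
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (rater_a level, rater_b level).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 2 if weighting == "quadratic" else 1
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the distance between levels.
            w = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += w * observed[i][j]
            expected_disagreement += w * row[i] * col[j]

    if expected_disagreement == 0.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed_disagreement / expected_disagreement


def agreement_label(kappa):
    """Map kappa to a verbal band (assumed Landis-and-Koch-style cut-offs)."""
    if kappa < 0.0:
        return "No Agreement"
    bands = [(0.20, "Slight Agreement"), (0.40, "Fair Agreement"),
             (0.60, "Moderate Agreement"), (0.80, "Substantial Agreement")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost Perfect Agreement"


# Hypothetical rubric scores (levels 1-4) for eight student responses.
faculty = [1, 2, 3, 4, 2, 3, 4, 1]
model = [1, 2, 3, 3, 2, 4, 4, 2]
kappa = weighted_kappa(faculty, model, categories=[1, 2, 3, 4])
print(f"kappa_w = {kappa:.4f} ({agreement_label(kappa)})")
```

Under these assumed bands, a value near 0.38 falls in "Fair Agreement" and a slightly negative value such as -0.0261 falls in "No Agreement", which is consistent with the labels reported above. The scores in the usage lines are made up for illustration and are not data from the paper.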

Eval Frameworks & Benchmarks, Natural Language Processing, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
