This meta-analysis examines the performance of LLMs in automated short-answer scoring across 890 results from existing studies, using mixed-effects metaregression to model Quadratic Weighted Kappa (QWK). The study reveals that LLMs struggle with scoring tasks considered easy for humans and that decoder-only architectures underperform encoders by a significant margin (0.37 QWK). Furthermore, the research identifies sensitivities to wording, tokenization, and biases, including racial discrimination, in high-stakes educational contexts.
LLMs stumble on short-answer scoring tasks that are easy for humans, and exhibit racial bias in high-stakes educational contexts.
Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed-effects metaregression. We quantitatively illustrate that the difficulty human experts have in scoring children's written work has no observed statistical effect on LLM performance. In particular, we show that some scoring tasks measured as the easiest for human scorers were the hardest for LLMs. Whether because of poor implementation by thoughtful researchers or patterns traceable to autoregressive training, decoder-only architectures underperform encoders by 0.37 QWK on average, a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology to successful scoring, such as tokenizer vocabulary size, which exhibits diminishing returns, potentially due to undertrained tokens. The findings argue for systems design that better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
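
For readers unfamiliar with the effect size modeled here, the following is a minimal sketch of the standard textbook definition of Quadratic Weighted Kappa between human and model scores. It is not taken from the study's released code. The function name `quadratic_weighted_kappa`, its parameters, and the assumption that scores are integers in `0..num_labels-1` are illustrative choices.

```python
def quadratic_weighted_kappa(human, model, num_labels):
    """QWK between two lists of integer scores in 0..num_labels-1.

    Illustrative sketch of the textbook formula, not the study's code:
    kappa = 1 - sum(w * O) / sum(w * E), with w_ij = (i - j)^2 / (k - 1)^2.
    """
    if num_labels < 2:
        raise ValueError("num_labels must be at least 2")
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and equal in length")

    n = len(human)
    k = num_labels

    # Observed agreement matrix: O[i][j] counts answers scored i by the
    # human rater and j by the model.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h][m] += 1

    # Marginal score histograms for each rater.
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * hist_human[i] * hist_model[j] / n

    if weighted_expected == 0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement with the human scorer, 0 indicates chance-level agreement, and negative values indicate systematic disagreement. On this scale, the reported 0.37 gap between decoder-only and encoder architectures is large.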