Tsinghua AI | Department of Computer Science | MBZUAI | UMD | Virginia Tech | Jun 17, 2026 | arXiv:2606.18709

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou

AI Summary

This study evaluates whether 42 large language models (LLMs) can estimate item discrimination in reading comprehension assessments, a key psychometric property describing how well an item separates students of differing proficiency. Two approaches are compared: direct discrimination prediction and response-based Classical Test Theory (CTT) calibration. Both align only weakly with human-calibrated discrimination scores: the best model reaches a Spearman correlation of 0.152 under direct prediction, and the all-persona synthetic respondent pool reaches 0.241 under CTT calibration. The results indicate that LLMs carry some relevant signal but are not yet reliable for capturing item discrimination in educational assessments.

Key Contribution

LLMs carry non-random signal about item discrimination, but their estimates align only weakly with human-calibrated values (Spearman correlation of at most 0.241).

Abstract

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
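As a rough illustration of the response-based approach, the sketch below computes a CTT discrimination index from a binary response matrix and then compares it with reference values using a Spearman correlation. The abstract does not say which CTT statistic the authors use, so this assumes the common corrected item-total (item-rest) point-biserial correlation. The function names `item_rest_discrimination`, `average_ranks`, and `spearman` are hypothetical and not taken from the paper.

```python
import numpy as np


def item_rest_discrimination(responses):
    """Corrected item-total correlation for each item.

    responses: array of shape (n_respondents, n_items) with 1 = correct, 0 = incorrect.
    Assumption: discrimination is the Pearson (point-biserial) correlation between
    an item's scores and the total score on all *other* items.
    Items with no variance get NaN because the correlation is undefined.
    """
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    n_items = responses.shape[1]
    disc = np.full(n_items, np.nan)
    for j in range(n_items):
        item = responses[:, j]
        rest = total - item  # exclude the item itself from the criterion score
        if item.std() == 0 or rest.std() == 0:
            continue
        disc[j] = np.corrcoef(item, rest)[0, 1]
    return disc


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]


if __name__ == "__main__":
    # Toy synthetic responses (5 respondents x 4 items); illustrative only.
    toy = np.array([
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 1],
        [1, 0, 0, 0],
    ])
    llm_disc = item_rest_discrimination(toy)
    print("synthetic discrimination:", llm_disc)

    # Hypothetical human-calibrated values for the same items (made up for the demo).
    human_disc = np.array([0.10, 0.35, 0.40, 0.05])
    valid = ~np.isnan(llm_disc)  # skip items whose discrimination is undefined
    print("Spearman:", spearman(llm_disc[valid], human_disc[valid]))
```

In this setup each row would be one synthetic respondent (for example, one LLM persona) and each column one reading comprehension item. Using the rest score rather than the full total avoids inflating the correlation by including the item in its own criterion.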

Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
