This study evaluates whether 42 large language models (LLMs) can estimate item discrimination in reading comprehension assessments, a key psychometric property that captures how well an item differentiates students of varying proficiency. Using both direct discrimination prediction and response-based Classical Test Theory (CTT) calibration, the research finds that LLMs align only weakly with human-calibrated discrimination scores: the best model reaches a Spearman correlation of only 0.152 under direct prediction, and response-based CTT calibration reaches 0.241. The results indicate that while LLMs contain some relevant signal, they are not yet reliable for capturing how assessment items distinguish students.
LLMs pick up some discrimination-relevant signal in assessment items, but their estimates align only weakly with human-calibrated discrimination scores.
Item discrimination is a fundamental psychometric property in educational assessment: it measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While prior work has explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches a Spearman correlation of only 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
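The response-based approach can be illustrated with a minimal sketch. The abstract does not say which CTT discrimination index the authors use, so this example assumes the corrected item-total (point-biserial) correlation, a common choice. The function names (`item_discrimination`, `spearman`) and the toy response matrix are hypothetical and made up for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_discrimination(responses):
    """Corrected item-total correlation for each item (assumed index, see lead-in).

    responses[r][i] is 1 if respondent r answered item i correctly, else 0.
    Each item is correlated with the total score on the remaining items.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    scores = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [t - x for t, x in zip(totals, item)]  # exclude the item itself
        scores.append(pearson(item, rest))
    return scores


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = avg
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


# Toy data (made up): rows are synthetic respondents, columns are items.
synthetic_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
llm_discrimination = item_discrimination(synthetic_responses)
# Hypothetical human-calibrated scores for the same three items.
human_discrimination = [0.30, 0.45, 0.20]
alignment = spearman(llm_discrimination, human_discrimination)
```

In this framing, each row would be one LLM (or one persona-conditioned LLM) acting as a synthetic student, and `alignment` plays the role of the Spearman correlations reported above. Excluding the item from its own total avoids inflating the correlation, which matters most when the item pool is small.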