This paper critiques the over-reliance on inter-rater reliability (IRR) metrics such as Cohen's kappa as the sole indicator of "ground truth" in AI in Education (AIED) datasets, particularly given the complexities of educational data and the rise of LLM annotators. It proposes four shifts: treating IRR as a diagnostic tool, reporting annotation processes transparently, mitigating LLM annotation biases, and complementing agreement statistics with validity and effectiveness evidence. The authors argue that these shifts are crucial for improving the reliability and validity of the labeled data used to train and evaluate GenAI systems in education.
Stop treating inter-rater reliability as a simple green light for "ground truth" in AIED; instead, use it to diagnose disagreements and validate real-world impact.
Generative Artificial Intelligence (GenAI) is now widespread in education, yet the efficacy of GenAI systems remains constrained by the quality and interpretation of the labeled data used to train and evaluate them. Studies commonly report inter-rater reliability (IRR), often summarized by a single coefficient such as Cohen's kappa (κ), as a gatekeeper to "ground truth." We argue that many educational assessment and practice support settings pose challenges, such as high-inference constructs, skewed label distributions, and temporally segmented multimodal data, that make threshold-based IRR heuristics easy to misapply or misinterpret. The growing use of large language models as annotators and judges introduces further risks, such as automation bias and circular validation. We propose four practical shifts for establishing ground truth: (1) treat IRR as a diagnostic signal to localize disagreement and refine constructs rather than as a mechanical acceptance threshold (e.g., κ > 0.8); (2) require transparent reporting of rater expertise, codebook development, reconciliation procedures, and segmentation rules; (3) mitigate risks in LLM annotation through bias audits and verification workflows; and (4) complement agreement statistics with validity and effectiveness evidence for the intended use, including uncertainty-aware labeling (e.g., assigning different labels to the same item to capture nuance), criterion-related checks (e.g., predictive tests to check if labels forecast the intended outcome), and close-the-loop evaluations of whether systems trained on these labels improve learning beyond a reasonable control. We illustrate these shifts through case studies of multimodal tutoring data and provide actionable recommendations toward strengthening the evidence base of labeled AIED datasets.
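To make the concern about skewed label distributions and fixed thresholds concrete, the following is a minimal sketch, not taken from the paper, that computes Cohen's kappa from its standard definition and tallies which label pairs two raters disagree on. The label names (`on`, `off`) and the two synthetic datasets are hypothetical and chosen only for illustration; both datasets have the same 90% raw agreement.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (observed - expected) / (1.0 - expected)


def disagreement_pairs(rater_a, rater_b):
    """Count (label_a, label_b) pairs where the raters differ (diagnostic use)."""
    return Counter((a, b) for a, b in zip(rater_a, rater_b) if a != b)


# Hypothetical balanced dataset: 50/50 label split, 90 of 100 items agree.
balanced_a = ["on"] * 50 + ["off"] * 50
balanced_b = ["on"] * 45 + ["off"] * 5 + ["off"] * 45 + ["on"] * 5

# Hypothetical skewed dataset: 95/5 label split, 90 of 100 items agree.
skewed_a = ["off"] * 95 + ["on"] * 5
skewed_b = ["off"] * 90 + ["on"] * 5 + ["off"] * 5

kappa_balanced = cohen_kappa(balanced_a, balanced_b)
kappa_skewed = cohen_kappa(skewed_a, skewed_b)
print(f"balanced: kappa = {kappa_balanced:.3f}")
print(f"skewed:   kappa = {kappa_skewed:.3f}")
print("skewed disagreements:", dict(disagreement_pairs(skewed_a, skewed_b)))
```

Under these assumptions the balanced dataset lands at a kappa of about 0.80, while the skewed dataset falls slightly below zero despite identical raw agreement. This is why a single coefficient compared against a cutoff can mislead, and why inspecting where the raters disagree is more informative than the coefficient alone.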