This paper critiques the over-reliance on inter-rater reliability (IRR) metrics like Cohen's kappa as the sole indicator of "ground truth" in AI in Education (AIED) datasets, particularly with the rise of GenAI. It argues that high-inference constructs, skewed label distributions, and multimodal data in educational settings can lead to misinterpretations of IRR. The authors propose four shifts: treating IRR as a diagnostic tool, transparent reporting of annotation processes, mitigating biases in LLM annotation, and complementing agreement statistics with validity and effectiveness evidence.
Stop treating inter-rater reliability as a simple green light for "ground truth" in AIED – your data's probably messier than you think, especially with LLMs in the mix.
Generative Artificial Intelligence (GenAI) is now widespread in education, yet the efficacy of GenAI systems remains constrained by the quality and interpretation of the labeled data used to train and evaluate them. Studies commonly report inter-rater reliability (IRR), often summarized by a single coefficient such as Cohen's kappa (κ), as a gatekeeper to "ground truth." We argue that many educational assessment and practice support settings involve challenges, such as high-inference constructs, skewed label distributions, and temporally segmented multimodal data, that invite misapplication or misinterpretation of threshold-based heuristics for IRR. The growing use of large language models as annotators and judges introduces further risks such as automation bias and circular validation. We propose four practical shifts for establishing ground truth: (1) treat IRR as a diagnostic signal to localize disagreement and refine constructs rather than as a mechanical acceptance threshold (e.g., κ > 0.8); (2) require transparent reporting of rater expertise, codebook development, reconciliation procedures, and segmentation rules; (3) mitigate risks in LLM annotation through bias audits and verification workflows; and (4) complement agreement statistics with validity and effectiveness evidence for the intended use, including uncertainty-aware labeling (e.g., assigning different labels to the same item to capture nuance), criterion-related checks (e.g., predictive tests of whether labels forecast the intended outcome), and close-the-loop evaluations of whether systems trained on these labels improve learning beyond a reasonable control. We illustrate these shifts through case studies of multimodal tutoring data and provide actionable recommendations toward strengthening the evidence base of labeled AIED datasets.
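To make the skewed-distribution concern concrete, here is a minimal sketch (not from the paper, using hypothetical annotations) of Cohen's kappa computed by hand. Two raters agree on 92% of items, yet because one label dominates both raters' marginals, chance agreement is already high and kappa comes out near 0.29, illustrating why a single coefficient can mislead when the construct of interest is rare.

```python
# Minimal sketch (hypothetical data): Cohen's kappa under a skewed label
# distribution. High raw agreement can still yield a modest kappa because
# chance agreement p_e is large when one label dominates.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is chance agreement implied by each rater's label marginals."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for 100 tutoring-dialogue segments: the target
# behavior is rare, so both raters mark most segments 0.
rater_a = [1] * 6 + [0] * 94
rater_b = [1] * 2 + [0] * 4 + [1] * 4 + [0] * 90  # agrees on only 2 positives

p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"raw agreement = {p_o:.2f}, kappa = {cohens_kappa(rater_a, rater_b):.2f}")
# raw agreement = 0.92, kappa = 0.29
```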