This paper introduces a model-agnostic method for detecting annotation errors in video datasets by analyzing the Cumulative Sample Loss (CSL) of individual frames across training epochs. The CSL, representing the average loss a frame incurs across model checkpoints, serves as a dynamic fingerprint of frame-level learnability, where erroneous frames exhibit persistently high or irregular loss patterns. Experiments on EgoPER and Cholec80 datasets demonstrate the method's effectiveness in identifying mislabeling and frame disordering without requiring ground truth on annotation errors.
Uncover annotation errors in your video datasets simply by tracking how frame-level loss changes during training.
High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as *mislabeling*, where segments are assigned incorrect class labels, and *disordering*, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL), defined as the average loss a frame incurs when passing through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a powerful tool for dataset auditing and improving training reliability in video-based machine learning.
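The CSL computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the per-frame losses have already been collected into an array of shape (checkpoints, frames) by evaluating each saved checkpoint on the video. The function names `cumulative_sample_loss` and `flag_suspect_frames`, the z-score flagging rule, and the toy loss values are all hypothetical; the abstract says only that frames with persistently high CSL are flagged and does not state a threshold.

```python
import numpy as np


def cumulative_sample_loss(per_checkpoint_losses):
    """Average each frame's loss over all saved checkpoints.

    per_checkpoint_losses: array-like of shape (num_checkpoints, num_frames),
    where entry [e, i] is the loss of frame i under the checkpoint from epoch e.
    Returns an array of shape (num_frames,).
    """
    losses = np.asarray(per_checkpoint_losses, dtype=float)
    if losses.ndim != 2:
        raise ValueError("expected shape (num_checkpoints, num_frames)")
    return losses.mean(axis=0)


def flag_suspect_frames(csl, z_threshold=1.5):
    """Assumed flagging rule (not from the paper): mark frames whose CSL
    lies more than z_threshold standard deviations above the mean CSL."""
    csl = np.asarray(csl, dtype=float)
    std = csl.std()
    if std == 0:
        # All frames have identical CSL, so nothing stands out.
        return np.zeros(csl.shape, dtype=bool)
    return (csl - csl.mean()) / std > z_threshold


# Toy data (made up): 3 checkpoints, 6 frames. Frame 3 never converges.
losses = np.array([
    [2.0, 2.1, 1.9, 2.2, 2.0, 2.1],  # epoch 1
    [0.9, 1.0, 0.8, 2.1, 0.9, 1.0],  # epoch 2
    [0.2, 0.3, 0.2, 2.0, 0.3, 0.2],  # epoch 3
])

csl = cumulative_sample_loss(losses)
suspects = np.flatnonzero(flag_suspect_frames(csl))
print("CSL per frame:", np.round(csl, 3))
print("Suspect frame indices:", suspects.tolist())
```

In this toy example only frame 3 is flagged, because its loss stays near 2.0 while the other frames converge. The sketch covers only the "persistently high" case. Detecting the irregular loss patterns that the abstract also mentions would need a statistic over the trajectory itself, such as its variance across checkpoints, which is likewise an assumption here.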