This paper addresses the challenge of limited labeled data in Automatic Chord Recognition (ACR) by proposing a two-stage training pipeline that leverages pre-trained models and unlabeled audio. In the first stage, a pre-trained teacher model generates pseudo-labels for a large unlabeled audio dataset, and a student model is trained on those pseudo-labels alone. In the second stage, the student is continually trained on ground-truth labels, with selective knowledge distillation from the teacher. Experiments show that students trained with this pipeline outperform traditional supervised learning baselines, that the BTC student also surpasses the original pre-trained teacher while the 2E1D student nearly matches it, and that the gains are largest on rare chord qualities.
Improve Automatic Chord Recognition by distilling knowledge from readily available pre-trained models through pseudo-labels on unlabeled audio, yielding students that beat supervised baselines and match or surpass the original teacher model.
Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, because well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a training pipeline that leverages pre-trained models together with unlabeled audio and decouples training into two stages. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available, with selective knowledge distillation (KD) from the teacher applied as a regularizer to prevent catastrophic forgetting of the representations learned in the first stage. In our experiments, two models (BTC and 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher's performance, while the 2E1D student achieves about 96%, across seven standard mir_eval metrics. After a single stage 2 training run for each student, the resulting BTC student surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher by 1.55% on average across all metrics. The resulting 2E1D student improves on the traditional supervised learning baseline by 3.79% on average and achieves almost the same performance as the teacher. In both cases, the gains are largest on rare chord qualities.
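The two stages can be illustrated with a minimal per-frame sketch in plain Python. It is not the authors' implementation: the abstract does not say how the distillation is made "selective", so the gate used here (distill only on frames where the teacher's prediction agrees with the ground-truth label) is an assumption, as are the hard argmax pseudo-labels, the KL-divergence form of the KD term, and the hypothetical parameters `kd_weight` and `temperature`.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one frame's chord-class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_logits):
    """Stage 1 (assumed form): hard pseudo-label = the teacher's argmax class."""
    return max(range(len(teacher_logits)), key=teacher_logits.__getitem__)


def cross_entropy(student_logits, label):
    """Negative log-likelihood of `label` under the student's prediction."""
    return -math.log(softmax(student_logits)[label])


def kd_term(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0) for extreme logits
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def stage1_frame_loss(student_logits, teacher_logits):
    """Stage 1: train the student on the teacher's pseudo-label only."""
    return cross_entropy(student_logits, pseudo_label(teacher_logits))


def stage2_frame_loss(student_logits, teacher_logits, label,
                      kd_weight=0.5, temperature=2.0):
    """Stage 2: ground-truth loss plus a gated KD regularizer.

    ASSUMPTION: "selective" is modelled as distilling only where the
    teacher's prediction matches the ground-truth label; the paper's
    actual selection rule is not given in the abstract.
    """
    loss = cross_entropy(student_logits, label)
    if pseudo_label(teacher_logits) == label:
        loss += kd_weight * kd_term(student_logits, teacher_logits, temperature)
    return loss
```

Under this assumed gate, the teacher regularizes the student only where it is known to be right, so frames on which the teacher is wrong are learned from the ground truth alone. In a real training loop the same quantities would be computed over batches of frames with an autodiff framework rather than per frame in pure Python.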