The paper addresses the scarcity of paired audio-MIDI data for automatic drum transcription (ADT) by introducing a semi-supervised method to create a large corpus of one-shot drum samples from unlabeled audio. This corpus is then used to synthesize a high-quality dataset from MIDI files, enabling training of a sequence-to-sequence ADT model without paired data. The resulting model achieves state-of-the-art performance on ENST and MDB datasets, surpassing both supervised and previous synthetic data approaches.
Forget painstakingly labeling audio-MIDI pairs: this method synthesizes high-quality drum transcription training data directly from unlabeled audio and MIDI, achieving state-of-the-art results.
Deep learning models define the state-of-the-art in Automatic Drum Transcription (ADT), yet their performance is contingent upon large-scale, paired audio-MIDI datasets, which are scarce. Existing workarounds that use synthetic data often introduce a significant domain gap, as they typically rely on low-fidelity SoundFont libraries that lack acoustic diversity. While high-quality one-shot samples offer a better alternative, they are not available in a standardized, large-scale format suitable for training. This paper introduces a new paradigm for ADT that circumvents the need for paired audio-MIDI training data. Our primary contribution is a semi-supervised method to automatically curate a large and diverse corpus of one-shot drum samples from unlabeled audio sources. We then use this corpus to synthesize a high-quality dataset from MIDI files alone, on which we train a sequence-to-sequence transcription model. We evaluate our model on the ENST and MDB test sets, where it achieves new state-of-the-art results, significantly outperforming both fully supervised methods and previous synthetic-data approaches. The code for reproducing our experiments is publicly available at https://github.com/pier-maker92/ADT_STR.
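The core of the synthesis step described above is conceptually simple: place a one-shot drum sample into an audio buffer at each MIDI onset time, scaled by velocity. The sketch below illustrates that rendering idea only; the function name `render_events`, the sample rate, the toy one-shots, and the event format are illustrative assumptions, not the paper's actual pipeline or data.

```python
import numpy as np

SR = 22050  # sample rate (assumption; the paper's rendering settings are not specified)

def render_events(events, one_shots, duration_s, sr=SR):
    """Mix one-shot samples into a buffer at each event's onset time.

    events: list of (onset_seconds, drum_class, velocity in [0, 1])
    one_shots: dict mapping drum_class -> mono 1-D float array (a one-shot sample)
    """
    out = np.zeros(int(duration_s * sr), dtype=np.float32)
    for onset, cls, vel in events:
        sample = one_shots[cls]
        start = int(onset * sr)
        end = min(start + len(sample), len(out))  # clip samples that run past the end
        out[start:end] += vel * sample[: end - start]
    peak = np.abs(out).max()  # normalize so overlapping hits cannot clip
    return out / peak if peak > 0 else out

# Toy usage with synthetic one-shots (a decaying sine "kick", decaying noise "snare").
rng = np.random.default_rng(0)
n_kick, n_snare = int(0.2 * SR), int(0.15 * SR)
kick = (np.sin(2 * np.pi * 60 * np.arange(n_kick) / SR)
        * np.exp(-np.arange(n_kick) / (0.05 * SR))).astype(np.float32)
snare = (rng.standard_normal(n_snare)
         * np.exp(-np.arange(n_snare) / (0.03 * SR))).astype(np.float32)

audio = render_events(
    [(0.0, "kick", 1.0), (0.5, "snare", 0.8), (1.0, "kick", 0.9)],
    {"kick": kick, "snare": snare},
    duration_s=1.5,
)
```

Because the MIDI events fully determine the onset times and labels, every rendered clip comes with exact ground-truth annotations for free, which is what lets the model train without any manually paired audio-MIDI data.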