Whisper-AuT, a domain-adapted audio encoder, was created by fine-tuning Whisper-large-v3 on a mixed dataset of speech, environmental sound, and music. This adaptation addresses Whisper's weakness in representing non-speech audio, thereby reducing the downstream training burden for audio-LLMs. Empirical results demonstrate that Whisper-AuT significantly improves performance on environmental sound and music classification tasks while maintaining competitive speech recognition accuracy.
Whisper's speech-centric training leaves audio-LLMs tone-deaf to music and environmental sounds, but a simple fine-tune can fix that.
Audio-native large language models (audio-LLMs) commonly use Whisper as their audio encoder. However, Whisper was trained exclusively on speech data, producing weak representations for music and environmental sound. This forces downstream audio-LLMs to compensate through extensive training on large-scale non-speech data. We present Whisper-AuT, a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on a curated mixture of speech (80%), environmental sound (10%), and music (10%) totaling approximately 20M samples. The full encoder-decoder is trained end-to-end with a seq2seq captioning objective; the decoder is then discarded and only the encoder is retained. Linear probe evaluations show that Whisper-AuT achieves +23.0% on ESC-50 (environmental sound), +5.0% on GTZAN (music genre), and +0.7% on Speech Commands (keyword spotting) compared to the original Whisper-large-v3 encoder. Whisper-AuT is designed as a drop-in replacement for Whisper in audio-LLM architectures, with the goal of reducing downstream training cost by providing stronger initial audio representations for non-speech domains.
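To make the stated data mixture concrete, the following is a minimal sketch of what the 80/10/10 split implies for a total of roughly 20M samples, together with one possible way to draw a domain label per training example. The names `MIXTURE`, `domain_counts`, and `sample_domains` are hypothetical, and per-example weighted sampling is an assumption made for illustration; the abstract does not specify how the mixture is realized during training.

```python
import random

# Mixture proportions as stated in the abstract.
MIXTURE = {"speech": 0.80, "environmental_sound": 0.10, "music": 0.10}


def domain_counts(total, mixture=MIXTURE):
    """Approximate number of samples per domain for a given dataset size."""
    return {name: round(total * weight) for name, weight in mixture.items()}


def sample_domains(n, mixture=MIXTURE, seed=0):
    """Draw n domain labels with probability proportional to the mixture.

    Assumption: each training example is drawn independently by domain.
    The actual pipeline may instead pre-build a fixed, shuffled dataset.
    """
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    # Implied per-domain sizes for the ~20M-sample mixture.
    for name, count in domain_counts(20_000_000).items():
        print(f"{name}: ~{count:,} samples")
```

Under these proportions, the non-speech portion amounts to about 4M of the roughly 20M samples, split evenly between environmental sound and music.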