The paper introduces UniWhisper, a continual multi-task training framework that learns universal audio representations by casting diverse audio tasks into a unified instruction-answer format suitable for next-token prediction. It targets a known weakness of existing audio encoders, which often specialize in one domain (e.g., speech) at the expense of others (e.g., environmental sounds or music). Trained on 38k hours of public audio, UniWhisper outperforms Whisper across 20 diverse audio tasks under both MLP-probe and kNN evaluation, while retaining strong speech performance.
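To make the unified instruction-answer idea concrete, here is a minimal sketch of how labeled examples from different tasks might be mapped to one text format. The task names, prompt wording, and the `Example` and `loss_mask` helpers are all hypothetical illustrations; the summary does not give the templates used, and restricting the loss to answer tokens is an assumption, not a confirmed detail of UniWhisper.

```python
from dataclasses import dataclass

# Hypothetical prompt templates, one per task. The actual wording used by
# UniWhisper is not stated in this summary.
TEMPLATES = {
    "asr": "Transcribe the speech.",
    "sound_event": "What sound event is in the audio?",
    "music_genre": "What is the genre of this music?",
}


@dataclass
class Example:
    audio_path: str
    instruction: str
    answer: str


def to_instruction_answer(task: str, audio_path: str, label) -> Example:
    """Map a (task, label) pair to a single text-to-text training example."""
    if task not in TEMPLATES:
        raise KeyError(f"no template for task {task!r}")
    return Example(audio_path, TEMPLATES[task], str(label))


def loss_mask(instruction_tokens: list, answer_tokens: list) -> list:
    """Assumed masking scheme: 0 on instruction tokens, 1 on answer tokens,
    so the next-token loss is computed only on the answer."""
    return [0] * len(instruction_tokens) + [1] * len(answer_tokens)


if __name__ == "__main__":
    ex = to_instruction_answer("music_genre", "clip_001.wav", "jazz")
    # Whitespace splitting stands in for a real tokenizer here.
    mask = loss_mask(ex.instruction.split(), ex.answer.split())
    print(ex.instruction, "->", ex.answer, mask)
```

Because every task reduces to the same (instruction, answer) shape, one decoder with one loss can serve all of them, which is what removes the need for task-specific heads.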
UniWhisper unifies diverse audio tasks into a single next-token prediction framework and outperforms Whisper on normalized weighted averages over 20 tasks: 0.81 vs. 0.64 with MLP probes and 0.61 vs. 0.46 with kNN.
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
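The kNN evaluation mentioned above can be sketched as follows: embeddings from the frozen encoder are classified by majority vote among their nearest training neighbors, with no trained parameters. This is a generic illustration rather than the paper's protocol; the cosine similarity metric, the value of `k`, and the function names are assumptions, and the toy 2-D vectors stand in for real encoder outputs.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(train_emb, train_labels, query, k=3):
    """Label a query embedding by majority vote over its k most similar
    training embeddings."""
    ranked = sorted(
        range(len(train_emb)),
        key=lambda i: cosine(train_emb[i], query),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_emb, train_labels, test_emb, test_labels, k=3):
    """Fraction of test embeddings whose kNN prediction matches the label."""
    correct = sum(
        knn_predict(train_emb, train_labels, q, k) == y
        for q, y in zip(test_emb, test_labels)
    )
    return correct / len(test_labels)


if __name__ == "__main__":
    # Toy embeddings standing in for frozen-encoder outputs.
    train_emb = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                 [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
    train_labels = ["speech"] * 3 + ["music"] * 3
    test_emb = [[1.0, 0.05], [0.05, 1.0]]
    test_labels = ["speech", "music"]
    print(knn_accuracy(train_emb, train_labels, test_emb, test_labels))
```

Since kNN fits nothing, its score depends only on the geometry of the embedding space, which makes it a stricter test of representation quality than a trained MLP probe.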