Feb 25, 2026 | arXiv:2602.21772

UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

Yuxuan Chen, Peize He, Haoyuan Xu, Junzi Zhang

AI Summary

The paper introduces UniWhisper, a continual multi-task training framework that learns robust universal audio representations by casting diverse audio tasks into a unified instruction-answer format suitable for next-token prediction. It targets a common weakness of existing audio encoders, which often specialize in one domain (e.g., speech) at the expense of others (e.g., environmental sounds or music). Trained on 38k hours of public audio, UniWhisper improves on Whisper across 20 diverse audio tasks under both MLP-probe and kNN evaluation (normalized weighted averages of 0.81 vs. 0.64 and 0.61 vs. 0.46, respectively), while retaining strong speech performance.
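To make the unified instruction-answer idea concrete, here is a minimal sketch of how heterogeneous labels might be turned into next-token targets. Everything in it is an assumption for illustration: the prompt templates, the task names, the whitespace tokenizer, and the choice to compute the loss only on answer tokens are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical instruction templates; the paper's actual prompts are not shown on this page.
TEMPLATES = {
    "asr": "Transcribe the speech.",
    "sound_event": "What sound is present in the audio?",
    "music_genre": "What is the genre of this music?",
}


@dataclass
class Example:
    instruction: str
    answer: str


def to_instruction_answer(task: str, label: str) -> Example:
    """Map a (task, label) pair onto one shared text-to-text format."""
    return Example(instruction=TEMPLATES[task], answer=label)


def build_targets(example: Example) -> tuple[list[str], list[int]]:
    """Toy whitespace tokenization plus a mask marking answer tokens.

    Assumption: only answer tokens (mask value 1) contribute to the
    next-token loss; instruction tokens (mask value 0) act as context.
    """
    prompt = example.instruction.split()
    answer = example.answer.split() + ["<eos>"]
    tokens = prompt + answer
    loss_mask = [0] * len(prompt) + [1] * len(answer)
    return tokens, loss_mask


# Three different tasks end up in the same format, so a single decoder
# and a single loss can serve all of them.
examples = [
    to_instruction_answer("asr", "hello world"),
    to_instruction_answer("sound_event", "dog bark"),
    to_instruction_answer("music_genre", "jazz"),
]
batch = [build_targets(e) for e in examples]
```

Because every task reduces to the same (tokens, mask) pair, adding a new task in this sketch only means adding a template, not a new output head.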

Key Contribution

Unifies diverse audio tasks into a single next-token prediction framework for universal audio representation, raising the normalized weighted average over 20 tasks from 0.64 to 0.81 with MLP probes and from 0.46 to 0.61 with kNN, relative to Whisper.

Abstract

A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
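The kNN evaluation mentioned in the abstract can be sketched as follows: embeddings from a frozen encoder are classified by majority vote among their nearest training neighbors, and per-task scores are then combined into one weighted average. This is an illustrative sketch only. The cosine similarity metric, the value of k, the toy two-dimensional embeddings, and the weighting scheme are assumptions; the paper's exact normalization and task weights are not given on this page.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, train_x, train_y, k=3):
    """Majority vote over the k most similar training embeddings."""
    ranked = sorted(
        range(len(train_x)),
        key=lambda i: cosine(query, train_x[i]),
        reverse=True,
    )
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test embeddings whose kNN prediction matches the label."""
    correct = sum(
        knn_predict(x, train_x, train_y, k) == y
        for x, y in zip(test_x, test_y)
    )
    return correct / len(test_y)


def weighted_average(scores, weights):
    """Combine per-task scores; the weights here are hypothetical."""
    total = sum(weights[task] for task in scores)
    return sum(scores[task] * weights[task] for task in scores) / total


# Toy stand-ins for frozen encoder embeddings (no training happens here).
train_x = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
train_y = ["speech"] * 3 + ["music"] * 3
test_x = [[1.0, 0.0], [0.1, 0.9]]
test_y = ["speech", "music"]

acc = knn_accuracy(train_x, train_y, test_x, test_y, k=3)
```

Since kNN has no trainable parameters, a score obtained this way reflects the geometry of the encoder's embedding space directly, which is why it complements the MLP-probe results.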

Architecture Design (Transformers, SSMs, MoE) | Speech & Audio | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
