This paper introduces USAD 2.0, a universal audio encoder that integrates knowledge from both self-supervised learning (SSL) and supervised foundation models to improve audio understanding across multiple domains. Domain-aware distillation addresses teacher mismatch, coverage is extended to the music domain, and a second-stage supervised distillation prepares the encoder for downstream use. The model is scaled to one billion parameters via depth scaling. Experimental results show that USAD 2.0 achieves strong or state-of-the-art performance in both probing tasks and evaluations with large language models (LLMs).
USAD 2.0 achieves strong or state-of-the-art audio understanding by integrating knowledge from self-supervised and supervised models and scaling to one billion parameters.
Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders, such as speech or music experts, multi-domain approaches such as USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest that supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.
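The abstract names domain-aware distillation but does not give its loss or routing rule. The sketch below is one plausible reading, not the authors' implementation. It assumes each training clip carries a domain label and is matched only against the teacher for that domain, so a speech clip is never pulled toward a music teacher's features. The function names, the cosine-distance objective, and the toy teachers are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def domain_aware_loss(batch, teachers):
    """Average distillation loss with per-sample teacher routing.

    batch:    list of (domain, clip, student_vec) tuples.
    teachers: dict mapping a domain name to a callable clip -> feature vector.
    Both structures are assumptions made for illustration only.
    """
    if not batch:
        raise ValueError("batch must not be empty")
    total = 0.0
    for domain, clip, student_vec in batch:
        # Route the clip to the teacher of its own domain only.
        teacher_vec = teachers[domain](clip)
        total += cosine_distance(student_vec, teacher_vec)
    return total / len(batch)


# Toy teachers that return fixed feature vectors regardless of the clip.
toy_teachers = {
    "speech": lambda clip: [1.0, 0.0],
    "music": lambda clip: [0.0, 1.0],
}

aligned_batch = [
    ("speech", None, [2.0, 0.0]),
    ("music", None, [0.0, 3.0]),
]
aligned_loss = domain_aware_loss(aligned_batch, toy_teachers)
```

In this toy setup the student vectors point in the same direction as their own domain's teacher, so the loss is zero. Routing the same vectors to the other domain's teacher would give a loss of one, which is the kind of mismatch that per-domain routing is meant to avoid.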