MIT CSAIL | Jun 4, 2026 | arXiv:2606.06444

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, Mrudula Athi, Anton Ratnarajah, Amit Chhetri, James Glass

AI Summary

This paper introduces USAD 2.0, a universal audio encoder that integrates knowledge from both self-supervised learning (SSL) and supervised models to enhance audio understanding across multiple domains. By employing domain-aware distillation and a second-stage supervised distillation, the model addresses teacher mismatch and extends its capabilities to the music domain, ultimately scaling to one billion parameters. Experimental results demonstrate that USAD 2.0 achieves strong or state-of-the-art performance in both probing tasks and evaluations with large language models (LLMs), highlighting its effectiveness in diverse audio applications.

Key Contribution

USAD 2.0 is a universal audio encoder that distills knowledge from both self-supervised and supervised teacher models, scales to one billion parameters through depth scaling, and achieves strong or state-of-the-art performance on probing and LLM-based evaluations.

Abstract

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.
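The abstract does not spell out how domain-aware distillation is computed. As a minimal sketch of one plausible reading, the Python below matches each training sample only against the teacher for its own domain, so a music clip is never regressed onto a speech teacher's features. The function names, the mean-squared-error objective, the `"speech"`/`"music"` labels, and the toy feature values are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature dimensions must match")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def domain_aware_distill_loss(
    student_feats: List[Vector],
    domains: List[str],
    teacher_feats: Dict[str, List[Vector]],
) -> float:
    """Average per-sample loss, using only the teacher of each sample's domain.

    Hypothetical formulation: teacher_feats[d][i] holds teacher d's features
    for sample i; features from teachers of other domains are ignored.
    """
    total = 0.0
    for i, (feat, domain) in enumerate(zip(student_feats, domains)):
        target = teacher_feats[domain][i]  # select the in-domain teacher
        total += mse(feat, target)
    return total / len(student_feats)


# Toy batch: sample 0 is speech, sample 1 is music (illustrative values only).
student = [[1.0, 2.0], [0.0, 0.0]]
domains = ["speech", "music"]
teachers = {
    "speech": [[1.0, 2.0], [5.0, 5.0]],
    "music": [[9.0, 9.0], [0.0, 2.0]],
}
loss = domain_aware_distill_loss(student, domains, teachers)
print(loss)
```

In this toy batch the out-of-domain teacher features (the speech teacher's output for the music clip, and vice versa) never contribute to the loss, which is the kind of teacher mismatch the abstract says the method is meant to address.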

Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
