Institut Polytechnique de Paris · Apr 27, 2026 · arXiv:2604.24933

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

Mohammed Ali El Adlouni, Aurian Quelennec, P. Chouteau, Geoffroy Peeters, S. Essid

AI Summary

This paper introduces S-SONDO, a self-supervised knowledge distillation framework that compresses general audio foundation models by distilling only their output embeddings. Because it needs neither class logits nor layer-level alignment, S-SONDO is architecture-agnostic and applies to embedding-based teachers that prior supervised distillation methods exclude. Experiments distilling two audio foundation models into three smaller students show up to a 61x size reduction while retaining up to 96% of teacher performance, alongside practical insights on loss choice and clustering-based balanced data sampling.

Key Contribution

A self-supervised distillation approach that works directly on output embeddings shrinks large audio foundation models by up to 61x while retaining up to 96% of teacher performance.
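To make "works directly on embeddings" concrete, here is a minimal sketch of an embedding-level distillation objective. It assumes a cosine-distance loss and assumes the student embeddings have already been mapped to the teacher's dimensionality. Neither choice is stated on this page, and the function name `cosine_distillation_loss` is hypothetical, not taken from the S-SONDO code.

```python
import math

def cosine_distillation_loss(student_batch, teacher_batch):
    """Mean (1 - cosine similarity) over paired student/teacher embeddings.

    Assumption: both batches are lists of equal-length float vectors.
    This loss is an illustrative choice, not necessarily the paper's.
    """
    if len(student_batch) != len(teacher_batch) or not student_batch:
        raise ValueError("batches must be non-empty and the same size")
    total = 0.0
    for s, t in zip(student_batch, teacher_batch):
        dot = sum(a * b for a, b in zip(s, t))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_t = math.sqrt(sum(b * b for b in t))
        # Small epsilon guards against division by zero for null vectors.
        total += 1.0 - dot / (norm_s * norm_t + 1e-12)
    return total / len(student_batch)
```

The loss only compares final embeddings, so it needs no logits and no access to intermediate layers. That is the property that makes this kind of objective independent of the teacher's architecture.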

Abstract

General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.
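The abstract mentions clustering-based balanced data sampling without giving details here. One plausible reading, sketched below under stated assumptions, is to assign each training clip to a cluster (for example by clustering teacher embeddings, which is an assumption) and then sample clips so that every cluster carries equal total probability. The helper names `balanced_sampling_weights` and `sample_balanced` are hypothetical.

```python
import random
from collections import Counter

def balanced_sampling_weights(cluster_ids):
    """Per-item sampling probabilities giving every cluster equal total mass.

    Assumption: cluster_ids[i] is the cluster label of item i, computed
    beforehand by some clustering step that is not shown here.
    """
    counts = Counter(cluster_ids)
    n_clusters = len(counts)
    return [1.0 / (n_clusters * counts[c]) for c in cluster_ids]

def sample_balanced(cluster_ids, k, seed=0):
    """Draw k item indices with replacement using the balanced weights."""
    rng = random.Random(seed)
    weights = balanced_sampling_weights(cluster_ids)
    return rng.choices(range(len(cluster_ids)), weights=weights, k=k)
```

With labels `[0, 0, 0, 1]`, the single item in cluster 1 receives probability 1/2 and each item in cluster 0 receives 1/6, so rare clusters are no longer drowned out by frequent ones.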

Inference & Quantization · Speech & Audio · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 29

Year: 2026

Venue: IEEE International Conference on Acoustics, Speech, and Signal Processing
