Google Research | IISc | Dec 6, 2025 | arXiv:2604.06702

ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals

P. E. Ameenudeen, Charumathi Narayanan, Sriram Ganapathy

AI Summary

ULTRAS is a unified self-supervised learning framework that bridges the gap between time-domain speech processing and time-frequency audio representation learning. It uses a transformer architecture to encode spectral patches of log-mel spectrograms and predicts masked segments with a combined spectral and temporal loss function. Results across diverse speech and audio tasks show that ULTRAS outperforms existing baselines, which suggests it is effective at learning joint time-frequency representations.

Key Contribution

ULTRAS learns unified representations for audio and speech, removing the need to train separate models, and reports improved performance over established baselines in both domains.

Abstract

Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), where masking and predictive modeling are performed over long patches of the data. The model, based on the transformer architecture, encodes spectral patches of log-mel spectrogram features. The predictive modeling of masked segments is performed on spectral and temporal targets using a combined loss function, forcing the representations to encode both time and frequency traits. Experiments on a variety of speech and audio tasks show that the ULTRAS framework improves performance over established baselines.
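To make the masked-patch objective concrete, the following is a minimal sketch of one possible reading, not the authors' implementation. The abstract does not give the patch length, the masking ratio, the exact spectral and temporal targets, or the loss weighting, so every one of those is an assumption here: patches are runs of consecutive frames, the spectral target is the per-bin mean over time, the temporal target is the per-frame mean over mel bins, and `alpha` is a hypothetical mixing weight. The transformer encoder is omitted; any predictor that outputs patches of the same shape could be plugged in.

```python
import random

def patchify(spec, patch_frames):
    """Split a spectrogram (list of frames, each a list of mel-bin values)
    into non-overlapping patches of `patch_frames` consecutive frames.
    Trailing frames that do not fill a whole patch are dropped."""
    n = len(spec) // patch_frames
    return [spec[i * patch_frames:(i + 1) * patch_frames] for i in range(n)]

def choose_masked(num_patches, mask_ratio, rng):
    """Pick which patch indices to mask (assumed uniform random choice)."""
    k = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), k))

def spectral_profile(patch):
    """Assumed spectral target: mean over time for each mel bin."""
    return [sum(col) / len(patch) for col in zip(*patch)]

def temporal_profile(patch):
    """Assumed temporal target: mean over mel bins for each frame."""
    return [sum(frame) / len(frame) for frame in patch]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(pred_patches, true_patches, masked, alpha=0.5):
    """Weighted sum of spectral and temporal errors, averaged over masked
    patches only. `alpha` is a hypothetical weight, not from the paper."""
    total = 0.0
    for i in masked:
        spec_term = mse(spectral_profile(pred_patches[i]),
                        spectral_profile(true_patches[i]))
        temp_term = mse(temporal_profile(pred_patches[i]),
                        temporal_profile(true_patches[i]))
        total += alpha * spec_term + (1 - alpha) * temp_term
    return total / len(masked)

# Toy usage: 40 frames x 8 mel bins of random "log-mel" values.
rng = random.Random(0)
spec = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(40)]
patches = patchify(spec, patch_frames=10)
masked = choose_masked(len(patches), mask_ratio=0.5, rng=rng)

# Stand-in predictor that outputs all zeros for every patch.
zero_pred = [[[0.0] * 8 for _ in range(10)] for _ in patches]
loss = combined_loss(zero_pred, patches, masked)
```

Scoring only the masked patches follows the usual masked-prediction setup; using two separate profiles is simply one way to let the time axis and the frequency axis each contribute their own error term.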

Architecture Design (Transformers, SSMs, MoE) | Speech & Audio | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
