Feb 17, 2026

arXiv:2602.15749

A Generative-First Neural Audio Autoencoder

Jonah Casebeer, Ge Zhu, Zhepei Wang, Nicholas J. Bryan

AI Summary

The paper introduces a generative-first neural audio autoencoder architecture designed for efficient and scalable generative modeling. By increasing temporal downsampling and supporting diverse audio representations and formats within a single model, the approach addresses limitations of reconstruction-first methods. The proposed autoencoder achieves a 10x speedup in encoding and a 1.6x reduction in latent rates, and it eliminates the need for channel-format-specific models while maintaining competitive reconstruction quality.

Key Contribution

By compressing a 60-second mono signal into just 788 tokens, this autoencoder makes generative audio modeling more tractable while cutting encoding time and latent rates.

Abstract

Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding and 1.6x lower rates, and it eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
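
The figures in the abstract can be cross-checked with a little arithmetic. The sketch below is not from the paper: it assumes a 44.1 kHz sample rate (the abstract does not state one), one token per latent frame, and rounding up to cover a final partial frame. The function name latent_frames is hypothetical. Under those assumptions, the 3360x downsampling factor reproduces the quoted 788 tokens for 60 seconds of mono audio, and the ratio 3360 / 2048 gives the quoted rate reduction of roughly 1.6x.

```python
import math

# Values quoted in the abstract.
DOWNSAMPLING = 3360           # proposed temporal downsampling factor
BASELINE_DOWNSAMPLING = 2048  # downsampling factor it is compared against

# Assumption: the abstract does not state a sample rate; 44.1 kHz is a guess
# that happens to match the quoted token count.
SAMPLE_RATE_HZ = 44_100


def latent_frames(duration_s, sample_rate_hz=SAMPLE_RATE_HZ, factor=DOWNSAMPLING):
    """Number of latent frames for a mono clip, rounding up the last partial frame."""
    return math.ceil(duration_s * sample_rate_hz / factor)


frames = latent_frames(60)                              # proposed factor
baseline_frames = latent_frames(60, factor=BASELINE_DOWNSAMPLING)
rate_reduction = DOWNSAMPLING / BASELINE_DOWNSAMPLING   # ratio of the two factors
frame_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLING           # latent frames per second

print(f"latent frames at 3360x: {frames}")
print(f"latent frames at 2048x: {baseline_frames}")
print(f"rate reduction: {rate_reduction:.2f}x")
print(f"latent frame rate: {frame_rate_hz:.3f} Hz")
```

If the model were run at a different sample rate, or used several tokens per frame in its discrete mode, the counts would change accordingly; only the 3360 / 2048 ratio is independent of those assumptions.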

Architecture Design (Transformers, SSMs, MoE); Speech & Audio; Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
