Lumos Robotics, NJU, Soochow, WeNet Open Source Community

Jun 4, 2026

arXiv:2606.06357

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Dinghao Zhou, Xingchen Song, Di Wu, Pengyu Cheng, Shengfan Shen, Sixiang Lv

AI Summary

This paper introduces the F3-Tokenizer, which adapts continuous audio autoencoder latents for both understanding and generation by combining a noise-regularized bottleneck with a latent-side representation encoder. The bottleneck produces scale-controlled continuous latents that serve as targets for reconstruction and autoregressive generation, while the representation encoder, trained on the frozen latents, supplies high-dimensional representations for understanding. The result is a single tokenizer that covers both semantic understanding and generation in audio processing.

Key Contribution

The F3-Tokenizer provides high-dimensional audio representations for understanding while preserving normalized continuous latents as generation targets, so a single tokenizer can serve both tasks.

Abstract

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates the design of a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.
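
The bottleneck described in the abstract (channel normalization plus stochastic perturbation, in place of a KL term) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names `channel_normalize` and `noisy_bottleneck`, the per-frame zero-mean/unit-variance normalization, the additive Gaussian noise, and the `noise_std` value are all hypothetical and are not taken from the paper.

```python
import math
import random


def channel_normalize(frame, eps=1e-6):
    """Normalize one latent frame (a list of channel values) to zero mean, unit variance.

    Assumption: normalization is applied across channels within each frame.
    """
    mean = sum(frame) / len(frame)
    var = sum((x - mean) ** 2 for x in frame) / len(frame)
    scale = math.sqrt(var + eps)
    return [(x - mean) / scale for x in frame]


def noisy_bottleneck(latents, noise_std=0.1, training=True, rng=None):
    """Hypothetical noise-regularized bottleneck.

    Each frame is channel-normalized so its scale is controlled; during
    training, Gaussian noise is added as the stochastic perturbation.
    No KL term is involved.
    """
    rng = rng or random.Random()
    out = []
    for frame in latents:
        z = channel_normalize(frame)
        if training:
            z = [x + rng.gauss(0.0, noise_std) for x in z]
        out.append(z)
    return out


# Two toy latent frames with very different raw scales (4 channels each).
latents = [[0.5, 2.0, -1.5, 4.0], [10.0, 12.0, 8.0, 14.0]]

# Inference: normalization only, so every frame has a controlled scale.
clean = noisy_bottleneck(latents, training=False)

# Training: the same normalized latents, perturbed with seeded noise.
noisy = noisy_bottleneck(latents, noise_std=0.1, training=True, rng=random.Random(0))
```

In this sketch the normalization fixes the latent scale regardless of the encoder's raw output range, which is one plausible reading of "scale-controlled continuous latents"; the noise level then has a consistent meaning relative to that scale.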

Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
