SJTUMay 28, 2026arXiv:2605.29948

HoliTok:A Coutinuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Bohan Li, Shiyue Lian, Hankun Wang, Yiwei Guo, Yu Xi, Zhihan Li, Da Zheng, Colin Zhang, Kai Yu

AI Summary

The paper introduces HoliTok, a continuous speech tokenizer that encodes 48kHz speech into a compact 25Hz sequence of 128-dimensional latents. HoliTok is trained with a progressive strategy to jointly preserve signal fidelity, incorporate semantic information, and maintain latent learnability. Experiments using a unified AR+DiT model demonstrate that HoliTok achieves competitive reconstruction fidelity and robust performance in both speech synthesis and recognition tasks without additional optimization tricks, suggesting its effectiveness as a foundational representation for unified spoken language modeling.

Key Contribution

Finally, a speech tokenizer that doesn't require extra optimization tricks to work robustly for both generation and understanding tasks in a unified architecture.

Abstract

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48~kHz speech into a compact 25~Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.

Architecture Design (Transformers, SSMs, MoE)Speech & Audio

Citation Metrics

Citations0

Influential citations0

References51

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

HoliTok:A Coutinuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Related Papers