Jun 8, 2026 · arXiv:2606.09019

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

Yejin Lee, Junwon Moon, Hyoeun Kim, Hyunjin Choi, Heeseung Kim, Kyuhong Shim

AI Summary

This paper introduces TLDR, a patch-based autoregressive framework that makes codec-based text-to-speech (TTS) models more efficient by shifting causal modeling from token-level to patch-level sequences. It groups consecutive audio tokens into compact latent patches and models the shorter patch sequence with a frozen pretrained AR-TTS backbone adapted with LoRA. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. The results suggest a practical way to ease the structural efficiency bottleneck in autoregressive TTS without replacing the existing modules.

Key Contribution

Shifting from token-level to patch-level causal modeling in AR-TTS yields a 1.8x inference speedup and cuts global KV-cache memory by up to 75% at a patch size of 4.

Abstract

Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
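The sequence-length arithmetic behind the KV-cache figure can be illustrated with a minimal sketch. It is not the paper's implementation: TLDR uses a learned compressor to produce latent patches, whereas the hypothetical `group_into_patches` helper below only groups token ids to show how the patch sequence shrinks. The function names, the `pad_id` value, and the example token ids are all assumptions made for illustration.

```python
from typing import List


def group_into_patches(tokens: List[int], patch_size: int, pad_id: int = 0) -> List[List[int]]:
    """Split a flat codec-token sequence into consecutive fixed-size patches.

    The last patch is right-padded with pad_id (an assumed padding id).
    """
    if patch_size < 1:
        raise ValueError("patch_size must be >= 1")
    patches = []
    for start in range(0, len(tokens), patch_size):
        patch = tokens[start:start + patch_size]
        patch = patch + [pad_id] * (patch_size - len(patch))
        patches.append(patch)
    return patches


def kv_cache_reduction(num_tokens: int, patch_size: int) -> float:
    """Fraction of cache entries saved if the backbone stores one entry
    per patch instead of one entry per speech token."""
    num_patches = -(-num_tokens // patch_size)  # ceiling division
    return 1.0 - num_patches / num_tokens


# Ten hypothetical codec token ids, grouped with a patch size of 4.
tokens = list(range(1, 11))
patches = group_into_patches(tokens, patch_size=4)
print(patches)       # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
print(len(patches))  # 3 backbone positions instead of 10

# 1000 speech tokens become 250 patches, so 75% of the entries are saved.
print(kv_cache_reduction(num_tokens=1000, patch_size=4))  # 0.75
```

Under this simplified view, a patch size of 4 leaves one cache entry for every four speech tokens, which matches the reported upper bound of 75%. The "up to" qualifier presumably reflects positions that are not compressed, such as the text prompt; that reading is an assumption, not a statement from the abstract.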

Inference & Quantization · Speech & Audio · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
