This paper introduces TLDR, a patch-based autoregressive framework that makes codec-based text-to-speech (TTS) models more efficient by shifting causal modeling from token-level to patch-level sequences. TLDR groups consecutive audio tokens into compact latent patches, models the shorter patch sequence with a frozen pretrained AR-TTS backbone adapted with LoRA, and reconstructs the fine-grained tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, it achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. The results suggest a practical way to ease the structural efficiency bottleneck of autoregressive TTS systems without replacing their existing modules.
Shifting from token-level to patch-level modeling in codec-based AR-TTS yields a 1.8x inference speedup and cuts global KV-cache memory by up to 75% at a patch size of 4.
Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
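The link between patch size and KV-cache savings can be illustrated with a small sketch. This is not the paper's implementation: the function names, the zero padding of the final patch, and the assumption that the global cache holds exactly one entry per patch are all hypothetical simplifications. It only shows how grouping tokens into patches of size 4 shortens the sequence the backbone sees, and why the saving approaches 1 - 1/4 = 75%.

```python
from math import ceil


def group_into_patches(tokens, patch_size, pad_id=0):
    """Split a flat list of codec tokens into consecutive fixed-size patches.

    The last patch is padded with pad_id (a hypothetical choice; the paper
    does not specify how incomplete patches are handled).
    """
    if patch_size < 1:
        raise ValueError("patch_size must be at least 1")
    n_patches = ceil(len(tokens) / patch_size)
    padded = list(tokens) + [pad_id] * (n_patches * patch_size - len(tokens))
    return [padded[i * patch_size:(i + 1) * patch_size] for i in range(n_patches)]


def kv_cache_reduction(num_tokens, patch_size):
    """Fraction of global KV-cache positions saved, assuming one cache
    entry per patch instead of one per token (speech positions only)."""
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    n_patches = ceil(num_tokens / patch_size)
    return 1.0 - n_patches / num_tokens


patches = group_into_patches(list(range(10)), patch_size=4)
print(patches)                        # 3 patches, the last one padded
print(kv_cache_reduction(1000, 4))    # 0.75
```

Under these assumptions the saving reaches 75% only when the speech length is a multiple of the patch size and text-prompt positions are ignored, which is one plausible reading of the "up to 75%" wording. The measured 1.8x speedup is lower than the 4x reduction in sequence length, presumably because the compressor and extractor add their own cost, though the abstract does not break this down.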