SJTU | Apr 9, 2026 | arXiv:2604.08384

TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

Jing Peng, Chenghao Wang, Yi Yang, Lirong Qian, Junjie Li, Yu Xi, Shuai Wang, Kai Yu

AI Summary

This paper introduces TASU2, a controllable framework for simulating Connectionist Temporal Classification (CTC) posteriors from text, giving explicit control over the word error rate (WER) of the simulated supervision used in speech LLM post-training. Because researchers can specify a WER range for the simulated CTC posteriors, TASU2 supports post-training curricula that vary supervision difficulty when adapting speech LLMs to new domains or low-resource scenarios. Across multiple source-to-target adaptation settings, TASU2 improves recognition over the earlier text-only alignment method TASU and outperforms text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.

Key Contribution

TASU2 replaces costly audio-text data collection with text-only supervision: it simulates CTC posteriors within a WER range you specify, so the noise level used to train a speech LLM can be set directly.

Abstract

Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.
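To make the idea of WER-controlled CTC simulation concrete, the following is a minimal illustrative sketch and not the authors' method, which the abstract does not specify. It assumes a toy integer vocabulary with index 0 as the CTC blank, draws a per-token error rate from the requested WER range, applies random substitutions, deletions, and insertions, expands the result into a frame-level path with repeats and blanks, and emits peaked posteriors. All function names and parameters (`corrupt`, `to_frames`, `simulate_posteriors`, `peak`, `max_repeat`) are hypothetical, and the realized WER only approximates the sampled rate.

```python
import random

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def corrupt(tokens, vocab_size, wer, rng):
    """Corrupt each token with probability `wer` (substitute, delete, or insert)."""
    out = []
    for tok in tokens:
        if rng.random() < wer:
            op = rng.choice(("sub", "del", "ins"))
            if op == "sub":
                # pick a non-blank token different from the original
                new = rng.randrange(1, vocab_size - 1)
                out.append(new + 1 if new >= tok else new)
            elif op == "ins":
                out.extend([tok, rng.randrange(1, vocab_size)])
            # "del": the token is simply dropped
        else:
            out.append(tok)
    return out


def to_frames(tokens, rng, max_repeat=3):
    """Expand tokens into a CTC-style frame path with repeats and blanks."""
    path = [BLANK]
    for tok in tokens:
        if path[-1] == tok:
            path.append(BLANK)  # a blank keeps identical neighbours from merging
        path.extend([tok] * rng.randint(1, max_repeat))
        if rng.random() < 0.5:
            path.append(BLANK)
    return path


def simulate_posteriors(tokens, vocab_size, wer_range, peak=0.9, seed=0):
    """Return one probability row per frame, peaked on a corrupted CTC path."""
    rng = random.Random(seed)
    wer = rng.uniform(*wer_range)  # per-utterance error rate within the range
    path = to_frames(corrupt(tokens, vocab_size, wer, rng), rng)
    rest = (1.0 - peak) / (vocab_size - 1)
    return [[peak if v == label else rest for v in range(vocab_size)]
            for label in path]


def ctc_collapse(path):
    """Greedy CTC decoding rule: merge repeats, then drop blanks."""
    out, prev = [], BLANK
    for p in path:
        if p != BLANK and p != prev:
            out.append(p)
        prev = p
    return out
```

With `wer_range=(0.0, 0.0)` the greedy decode of the simulated posteriors recovers the transcript exactly; widening the range toward higher values yields noisier supervision, which is the knob a difficulty curriculum would turn.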

Data Curation & Synthetic Data, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
