The paper introduces GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards instead of style labels. It trains one lightweight LoRA adapter per acoustic attribute with Group Relative Policy Optimization (GRPO) while the TTS backbone stays frozen, so speaking rate and pitch can be adjusted independently while preserving speaker identity and intelligibility. Experiments show targeted style shifts, smooth interpolation between styles, and multi-axis composition of independently trained adapters, all without retraining the backbone.
GLASS adds independent control of speaking rate and pitch to zero-shot TTS without compromising speaker identity or intelligibility.
We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
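The linear LoRA arithmetic mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`matmul`, `lora_delta`, `compose`), the 2x2 toy weight, the rank-1 factors, and the per-adapter strength values are all assumptions made for illustration. The sketch only shows the general idea that each adapter contributes a low-rank update B·A to a frozen weight W0, so adapters can be switched off (strength 0), interpolated (strength between 0 and 1), or combined (several non-zero strengths) by scaling and summing their updates.

```python
# Illustrative sketch only: tiny pure-Python matrices stand in for real model weights.
# All names and numbers here are hypothetical, not taken from the paper.

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(B, A, scale=1.0):
    """Low-rank update scale * (B @ A); B is d_out x r, A is r x d_in."""
    return [[scale * v for v in row] for row in matmul(B, A)]

def compose(W0, adapters, strengths):
    """Return W0 + sum_k strengths[k] * (B_k @ A_k), leaving W0 itself untouched."""
    W = [row[:] for row in W0]  # copy so the frozen weight is never modified
    for name, (B, A) in adapters.items():
        s = strengths.get(name, 0.0)  # adapters without a strength are switched off
        for i, row in enumerate(lora_delta(B, A, s)):
            for j, v in enumerate(row):
                W[i][j] += v
    return W

# Frozen toy weight and two independently defined rank-1 adapters (assumed values).
W0 = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "rate": ([[1.0], [0.0]], [[0.5, 0.0]]),
    "pitch": ([[0.0], [1.0]], [[0.0, 0.25]]),
}

both = compose(W0, adapters, {"rate": 1.0, "pitch": 1.0})  # multi-axis composition
half_rate = compose(W0, adapters, {"rate": 0.5})           # interpolation on one axis
print(both)
print(half_rate)
```

In an actual model the same sum would be applied to every adapted layer, and practical LoRA implementations usually include an additional scaling factor tied to the adapter rank. The abstract does not state how GLASS sets these details.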