CosyEdit2 is a speech editing model trained with a two-stage post-training framework: supervised fine-tuning for initialization, followed by Group Relative Policy Optimization (GRPO) on target-speech-free data. The GRPO stage optimizes editing quality directly, sidestepping the imperfect paired editing data and coarse optimization signals that limit supervised fine-tuning. Experiments show that CosyEdit2 substantially improves speech editing performance and also improves zero-shot TTS, pointing to a mutually beneficial relationship between the two tasks.
Training for speech editing with reinforcement learning not only enhances editing quality but also unexpectedly boosts zero-shot TTS performance.
Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
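To make the GRPO stage more concrete, the sketch below shows the group-relative advantage step as it appears in the commonly described GRPO formulation: several outputs are sampled for the same prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group. The abstract does not specify the reward used by CosyEdit2, so the reward values, the function name `group_relative_advantages`, and the `eps` parameter are illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against its own group.

    rewards: scores for several outputs sampled from the same prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four edited-speech candidates from one prompt.
# In practice these would come from an editing-quality reward; the values
# below are made up for illustration.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to other samples from the same prompt, this step needs only a scalar score per sample and no reference target recording, which is consistent with training on target-speech-free data.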