HKUST | Nankai University | Queen's | May 25, 2026 | arXiv:2605.25930

CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

Yuhang Jia, Hui Wang, Jiaming Zhou, Yongchang Gan, Yong Qin

AI Summary

CosyEdit2 is a speech editing model trained with a two-stage post-training framework: supervised fine-tuning (SFT) for initialization, followed by Group Relative Policy Optimization (GRPO) on target-speech-free data. The GRPO stage is meant to address two limitations of SFT, imperfect paired editing data and coarse-grained optimization signals, by optimizing for editing quality directly. Experiments show that CosyEdit2 substantially improves speech editing performance and also unlocks better zero-shot TTS, pointing to a close mutual relationship between the two tasks.

Key Contribution

Training for speech editing with reinforcement learning not only enhances editing quality but also unexpectedly boosts zero-shot TTS performance.

Abstract

Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
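As a rough illustration of the "group relative" part of GRPO mentioned above, the sketch below normalizes the rewards of several candidate outputs sampled for the same prompt against their own group mean and standard deviation. It is a generic sketch, not the CosyEdit2 training code: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration, and the abstract does not say how editing rewards are computed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-sample rewards within one sampled group.

    rewards: scalar rewards for candidates generated from the same prompt
             (hypothetical values; the reward design is not given in the abstract).
    eps:     small constant that avoids division by zero when all rewards match.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # Candidates above the group mean get positive advantages, below get negative.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four edited candidates of one utterance.
example_rewards = [0.2, 0.5, 0.9, 0.4]
example_advantages = group_relative_advantages(example_rewards)
```

Because the baseline comes from the group itself, no separate value model is needed; candidates are only ranked relative to their siblings, which is why a reward that can be computed without a reference recording fits this setup.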

Natural Language Processing | RLHF & Preference Learning | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
