ZJU · Apr 16, 2026 · arXiv:2604.14932

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang, Yangzhuo Li, Ziqing Wang, Wen Wang, Jingyu Lu, Haoxiao Wang, Xue Pu, Xueyi Pu, Fan Zhuo, Zhou Zhao

AI Summary

The paper addresses the challenge of applying preference-based reinforcement learning (RL) to end-to-end spoken dialogue models, where sparse preference supervision interacts poorly with dense speech generation under shared-parameter updates. The authors propose WavAlign, a modality-aware adaptive post-training recipe that constrains preference updates to the semantic channel, anchors acoustic behavior explicitly, and dynamically regulates the mixture of the two based on rollout statistics. Experiments across multiple spoken dialogue benchmarks and representative architectures show consistent improvements in both semantic quality and speech expressiveness.

Key Contribution

A post-training recipe that makes reinforcement learning practical for spoken dialogue models by restricting preference updates to the semantic channel and anchoring acoustic behavior separately.

Abstract

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
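
The abstract does not give the objective itself, so the following is only an illustrative sketch of the general idea it describes: a preference term applied to semantic tokens only, an anchoring term applied to acoustic tokens, and a mixture weight derived from rollout reward statistics. Every name here (`preference_weight`, `hybrid_loss`, `scale`, the token mask) and the specific gating formula are assumptions made for illustration, not the paper's formulation.

```python
from statistics import pstdev
from typing import Sequence


def preference_weight(rewards: Sequence[float], scale: float = 0.5) -> float:
    """Hypothetical gate in [0, 1) computed from rollout reward spread.

    Assumption: when all rollouts in a group receive (nearly) the same reward,
    the preference signal carries little ranking information, so its weight
    should shrink toward 0.
    """
    if len(rewards) < 2:
        return 0.0
    spread = pstdev(rewards)
    return spread / (spread + scale)


def hybrid_loss(
    pref_token_losses: Sequence[float],
    anchor_token_losses: Sequence[float],
    is_semantic: Sequence[bool],
    rewards: Sequence[float],
    scale: float = 0.5,
) -> float:
    """Mix a semantic-only preference loss with an acoustic-only anchor loss."""
    # Preference loss is averaged over semantic tokens only.
    sem = [l for l, s in zip(pref_token_losses, is_semantic) if s]
    # Anchoring loss (e.g. a supervised likelihood term) covers acoustic tokens only.
    aco = [l for l, s in zip(anchor_token_losses, is_semantic) if not s]
    pref = sum(sem) / len(sem) if sem else 0.0
    anchor = sum(aco) / len(aco) if aco else 0.0
    w = preference_weight(rewards, scale)
    return w * pref + (1.0 - w) * anchor
```

In this sketch, a rollout group with identical rewards yields a weight of 0, so only the anchoring term contributes; this mirrors, in a simplified way, the stated goal of avoiding unreliable preference gradients.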

Natural Language Processing · RLHF & Preference Learning · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
