Tsinghua AI; Amphion Technology Co; CUHK; Shenzhen Loop Area Institute; Tencent AI
Jun 9, 2026
arXiv:2606.10581

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

Yuxiang Wang, Qinke Ni, Shengbo Cai, Wan Lin, Liqiang Zhang, Zhizheng Wu

AI Summary

This paper introduces ParaBridge, an on-policy self-distillation method that helps Speech Language Models (SLMs) respond appropriately to paralinguistic cues in open-ended dialogue. A paralinguistic instruction scaffold serves only as a temporary privileged view during training, so the trained model narrows the perception-behavior gap without the scaffold at inference, and without curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from 14.6% to 40.3% and the EchoMind average rating from 3.27 to 3.92, while MMAU-Pro, VoiceBench, and GPQA stay within 0.4 points of the original model.

Key Contribution

Paralinguistic cues that an SLM already perceives can be distilled into stable dialogue behavior: scaffold-free VoxSafeBench SAR rises from 14.6% to 40.3% (a relative gain of about 176%), while general benchmarks stay within 0.4 points of the original model.

Abstract

Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose ParaBridge, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from 14.6% to 40.3% and improves EchoMind average rating from 3.27 to 3.92. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within 0.4 points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.
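
The training signal described in the abstract can be illustrated with a toy sketch. It assumes the per-token objective is a KL divergence from the scaffolded view (teacher) to the scaffold-free view (student), averaged over the positions of a response sampled by the student. The abstract does not state the exact loss, so the divergence direction, the averaging, and every name below (`log_softmax`, `token_kl`, `self_distillation_loss`) are hypothetical, and the logits are made-up numbers rather than model outputs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities over the full vocabulary."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one position (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def self_distillation_loss(teacher_logits_seq, student_logits_seq):
    """Mean per-token KL along one student-sampled trajectory.

    teacher_logits_seq: logits from the scaffolded view, evaluated on the
        tokens the scaffold-free model generated (treated as fixed targets).
    student_logits_seq: logits from the scaffold-free view on the same tokens.
    """
    if len(teacher_logits_seq) != len(student_logits_seq):
        raise ValueError("teacher and student must cover the same positions")
    if not teacher_logits_seq:
        raise ValueError("trajectory must contain at least one token")
    kls = [token_kl(t, s) for t, s in zip(teacher_logits_seq, student_logits_seq)]
    return sum(kls) / len(kls)


# Toy trajectory of two positions over a three-token vocabulary.
teacher_seq = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # scaffolded view
student_seq = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # scaffold-free view

loss = self_distillation_loss(teacher_seq, student_seq)
zero_loss = self_distillation_loss(student_seq, student_seq)
```

In this toy case the two views disagree only at the first position, so `loss` is positive, while `zero_loss` is zero because the views agree everywhere. In an actual training loop both sets of logits would come from the same model, with and without the scaffold in the prompt, and the teacher targets would presumably be held constant during the gradient step; that detail is likewise an assumption.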

Multimodal Models Speech & Audio

Year: 2026

Venue: N/A
