The paper introduces Reaction Aware Policy Optimization (RAPO), a reinforcement learning framework for emotional support dialogue systems that leverages user reactions as a primary learning signal. RAPO uses simulated user responses to generate dense natural-language feedback, combining Hindsight Dialogue Selection to identify key turns, Generative Hindsight Feedback to create contrastive signals, and Scalar-Verbal Hybrid Policy Optimization to refine the dialogue policy. Experiments on ESC and Sotopia datasets demonstrate that RAPO outperforms existing RL baselines in achieving positive interaction outcomes.
Emotional support chatbots get a boost by learning directly from simulated user reactions, which are turned into natural-language critiques that drive better conversations.
While current emotional support dialogue systems typically rely on expert-defined scalar rewards for alignment, these signals suffer from severe information sparsity. They cannot explain why a response failed or how to adapt to dynamic user states, often diverging from the actual goal of facilitating positive emotional shifts. In practice, the most direct and reliable learning signal emerges from the user's continuous reactions during ongoing interaction. We therefore propose Reaction Aware Policy Optimization (RAPO), a framework that optimizes over interaction consequences rather than rubric scores. RAPO treats dialogue as a reaction-driven process and utilizes simulated user responses to generate dense natural-language feedback through three core components: Hindsight Dialogue Selection, which isolates pivotal turns that meaningfully alter user emotional trajectories; Generative Hindsight Feedback, which transforms user reactions into contrastive ranking signals and natural-language critiques; and Scalar-Verbal Hybrid Policy Optimization, which couples scalar reward optimization for global alignment with verbal feedback distillation for fine-grained semantic refinement. Extensive experiments on ESC and Sotopia demonstrate that RAPO significantly outperforms strong reinforcement learning baselines in driving positive interaction outcomes.
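As a rough illustration of the first and third components described above, the sketch below shows one way turn selection and loss mixing could look. It is not the paper's implementation: the `Turn` fields, the simulated emotion scores, the `min_shift` threshold, and the `verbal_weight` coefficient are all assumptions introduced here, since the abstract does not say how pivotal turns are scored or how the two objectives are weighted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One system response plus the simulated user's state around it (hypothetical schema)."""
    index: int
    response: str
    emotion_before: float  # assumed simulator score before the response, e.g. in [0, 1]
    emotion_after: float   # assumed simulator score after the user reacts


def select_pivotal_turns(turns: List[Turn], min_shift: float = 0.3) -> List[Turn]:
    """Keep turns whose response moved the user's emotion score by at least
    `min_shift` in either direction (an assumed criterion for "pivotal")."""
    return [t for t in turns if abs(t.emotion_after - t.emotion_before) >= min_shift]


def hybrid_loss(scalar_loss: float, verbal_loss: float, verbal_weight: float = 0.5) -> float:
    """Combine a scalar-reward loss with a verbal-feedback distillation loss.
    A plain weighted sum is assumed here; the paper may couple them differently."""
    return scalar_loss + verbal_weight * verbal_loss


if __name__ == "__main__":
    dialogue = [
        Turn(0, "That sounds hard.", emotion_before=0.4, emotion_after=0.5),
        Turn(1, "You should just move on.", emotion_before=0.5, emotion_after=0.0),
        Turn(2, "I'm sorry, tell me more.", emotion_before=0.0, emotion_after=0.4),
    ]
    pivotal = select_pivotal_turns(dialogue)
    print([t.index for t in pivotal])
    print(hybrid_loss(scalar_loss=1.0, verbal_loss=2.0))
```

In this toy setup both harmful and helpful turns are kept, because a sharp negative reaction carries as much contrastive signal as a positive one.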