ShanghaiTech | May 28, 2026 | arXiv:2605.30056

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi

AI Summary

This paper introduces Critic-Guided Policy Optimization (CGPO), a diffusion-based reinforcement learning algorithm that balances exploration and exploitation by integrating a training-free critic guidance technique into the diffusion policy's denoising process. CGPO steers action generation toward high-value regions defined by the critic network and uses these guided actions as regression targets, which the authors report leads to faster convergence and better final performance. Experiments on 5 MuJoCo locomotion tasks show state-of-the-art performance compared with existing diffusion-based RL methods, and experiments on Franka robot arm grasping tasks support the authors' claim that CGPO is the first successful application of a diffusion policy in real-world RL.

Key Contribution

Training-free critic guidance in the denoising process gives diffusion-based RL state-of-the-art results on MuJoCo locomotion and, according to the authors, the first successful use of a diffusion policy in real-world RL (Franka robot arm grasping).

Abstract

Recent advances in reinforcement learning (RL) have leveraged the multimodality and exploration capability of diffusion policies with considerable success. One representative branch of this work focuses on sampling-based policy optimization. This design gives the diffusion model better exploration, particularly at the beginning of training, but it exploits Q-value information poorly, which slows policy convergence. Another branch focuses on gradient-based policy optimization, which fully exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this trade-off, we propose CGPO (Critic-Guided diffusion Policy Optimization), which balances exploration and exploitation by integrating a training-free guidance technique into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression targets. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance through a better exploration-exploitation balance. We validate CGPO on 5 MuJoCo locomotion tasks, where it achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first method to successfully incorporate a diffusion policy into real-world RL, showing superior performance on Franka robot arm grasping tasks. Our official page is available at https://dingsht.tech/cgpo-webpage.
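The abstract describes the mechanism only in words, so the following is a toy sketch of the general idea and not the authors' implementation. It assumes that guidance takes the form of a critic-gradient nudge after each denoising step, which is one common form of training-free guidance; the abstract does not state CGPO's actual update rule. Every name here (`critic_q`, `critic_grad`, `denoise_step`, `sample_action`, `regress_policy`), the quadratic critic, the scalar action, and all constants are hypothetical and chosen for illustration.

```python
import random


def critic_q(action, target=0.8):
    """Toy critic (assumption): Q peaks when the action equals `target`."""
    return -(action - target) ** 2


def critic_grad(action, target=0.8):
    """Analytic gradient dQ/da of the toy critic above."""
    return -2.0 * (action - target)


def denoise_step(action, noise_level, rng):
    """Stand-in for one reverse-diffusion step of a learned policy.

    A real diffusion policy would call a denoising network here; this toy
    version just contracts the action and adds noise scaled by `noise_level`.
    """
    return 0.9 * action + 0.1 * noise_level * rng.gauss(0.0, 1.0)


def sample_action(num_steps=10, guidance_scale=0.0, seed=0):
    """Run the toy denoising chain, optionally nudged by the critic gradient."""
    rng = random.Random(seed)
    action = rng.gauss(0.0, 1.0)  # start from pure noise
    for step in range(num_steps, 0, -1):
        action = denoise_step(action, step - 1, rng)
        # Guidance needs no extra training: it only queries the critic gradient.
        action += guidance_scale * critic_grad(action)
    return max(-1.0, min(1.0, action))  # clip to an assumed action range


def regress_policy(policy_mean, targets, lr=0.25):
    """One MSE gradient step pulling a scalar policy output toward the targets."""
    grad = sum(2.0 * (policy_mean - t) for t in targets) / len(targets)
    return policy_mean - lr * grad
```

In this toy setting, sampling with `guidance_scale=0.2` concentrates actions near the critic's high-value region, and passing those guided samples to `regress_policy` moves the policy output toward them. That mirrors the two stages the abstract names: guided generation, then regression onto the guided actions.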

Robotics & Embodied AI, Training Efficiency & Optimization, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
