May 26, 2026 | arXiv:2605.27079

Trust Region Q Adjoint Matching

Yong Dong, Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin

AI Summary

This paper introduces Trust Region Q-Adjoint Matching (TRQAM), an off-policy reinforcement learning algorithm for fine-tuning pretrained flow policies. TRQAM stabilizes training by adaptively controlling the path-space KL divergence from the pretrained policy through projected dual descent on the trust-region parameter $\lambda$. This addresses the fragility of earlier critic-guided approaches, in which small critic errors can be amplified and lead to model collapse. Across 50 OGBench tasks, TRQAM outperforms prior methods in both offline RL and offline-to-online RL, reaching an overall success rate of 68% in offline RL versus 46% for the strongest baseline.

Key Contribution

TRQAM stabilizes off-policy fine-tuning of pretrained flow policies by precisely controlling the deviation from the pretrained policy, reaching a 68% success rate in offline RL, 22 percentage points above the strongest baseline (46%).

Abstract

Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter $\lambda$ in the SOC dynamics, and theoretically show that the path-space KL can be represented by a closed-form function of $\lambda$. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the strongest baseline at 46%.
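The idea of tuning $\lambda$ by projected dual descent against a KL budget can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes a one-dimensional Gaussian base policy N(0, 1) and a linear reward r(x) = a·x, for which the KL-regularized optimum is N(a/λ, 1) and the KL to the base policy is (a/λ)²/2. That closed form belongs to this toy only; the paper's path-space KL expression is not reproduced here. The names `kl_of_lambda`, `projected_dual_descent`, `eps`, and `lam_min` are hypothetical.

```python
def kl_of_lambda(lam: float, a: float) -> float:
    """KL(pi_lam || pi_0) for the toy problem (an assumption, not from the paper).

    pi_0 = N(0, 1), reward r(x) = a * x, so pi_lam = N(a / lam, 1)
    and the KL between two unit-variance Gaussians is (a / lam)^2 / 2.
    """
    return 0.5 * (a / lam) ** 2


def projected_dual_descent(a: float, eps: float, lam_init: float = 1.0,
                           lr: float = 0.5, lam_min: float = 1e-3,
                           steps: int = 2000) -> float:
    """Find lam such that the toy KL matches the budget eps.

    The dual of "maximize reward subject to KL <= eps" for this toy is
    g(lam) = lam * eps + a^2 / (2 * lam), whose derivative is eps - KL(lam).
    Each step takes a gradient step on g and projects lam back onto
    [lam_min, inf) so it stays strictly positive.
    """
    lam = lam_init
    for _ in range(steps):
        grad = eps - kl_of_lambda(lam, a)   # negative when the KL exceeds the budget
        lam = max(lam_min, lam - lr * grad)  # gradient step, then projection
    return lam


if __name__ == "__main__":
    lam_star = projected_dual_descent(a=2.0, eps=0.5)
    print(lam_star, kl_of_lambda(lam_star, 2.0))
```

When the current KL exceeds the budget, the gradient is negative and $\lambda$ grows, pulling the policy back toward the base policy; when the KL is below the budget, $\lambda$ shrinks and allows a larger deviation. In this toy the fixed point is λ = a / sqrt(2·eps).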

Robotics & Embodied AI | Training Efficiency & Optimization | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 56

Year: 2026

Venue: N/A
