This paper introduces Trust Region Q-Adjoint Matching (TRQAM), an off-policy reinforcement learning algorithm that stabilizes fine-tuning of pretrained flow policies by adaptively controlling the path-space KL divergence through projected dual descent. The method addresses the fragility of earlier critic-guided approaches by optimizing the trust-region parameter, which bounds the deviation from the pretrained policy and reduces the risk of model collapse caused by critic errors. Across 50 OGBench tasks, TRQAM consistently outperforms existing methods, achieving a 68% success rate in offline RL compared with 46% for the strongest baseline.
TRQAM stabilizes off-policy reinforcement learning by precisely controlling the deviation from pretrained policies, reaching a 68% success rate in offline RL, 22 percentage points above the strongest baseline (46%).
Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter $\lambda$ in the SOC dynamics and show theoretically that the path-space KL can be expressed as a closed-form function of $\lambda$. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improving on the strongest baseline at 46%.
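To make the idea of projected dual descent on $\lambda$ concrete, here is a minimal sketch on a toy problem. It is not the paper's algorithm or its path-space KL formula, which the abstract does not state. The toy assumes a one-dimensional pretrained policy $\mathcal{N}(0, 1)$ and a linear critic $Q(a) = s \cdot a$, for which the KL-regularized optimum is $\mathcal{N}(s/\lambda, 1)$ and the KL to the pretrained policy is $\tfrac{1}{2}(s/\lambda)^2$. The function names, the KL budget `kl_budget`, and the step size are all hypothetical choices made for illustration.

```python
def kl_to_pretrained(lam: float, slope: float) -> float:
    """Toy closed-form KL (an assumption, not the paper's formula).

    With pretrained policy N(0, 1) and critic Q(a) = slope * a, the
    KL-regularized optimum is N(slope / lam, 1), whose KL divergence
    from N(0, 1) is 0.5 * (slope / lam) ** 2.
    """
    return 0.5 * (slope / lam) ** 2


def projected_dual_descent(
    slope: float,
    kl_budget: float,
    lam_init: float = 1.0,
    step_size: float = 0.5,
    lam_min: float = 1e-3,
    num_steps: int = 2000,
) -> float:
    """Tune lam so the toy KL matches kl_budget.

    The dual function here is g(lam) = slope**2 / (2 * lam) + lam * kl_budget,
    so its derivative is kl_budget - KL(lam). Each step descends on g and
    then projects lam back onto [lam_min, inf).
    """
    lam = lam_init
    for _ in range(num_steps):
        dual_grad = kl_budget - kl_to_pretrained(lam, slope)
        lam = max(lam_min, lam - step_size * dual_grad)  # descent + projection
    return lam


if __name__ == "__main__":
    lam = projected_dual_descent(slope=2.0, kl_budget=0.5)
    print(f"lambda = {lam:.4f}, KL = {kl_to_pretrained(lam, 2.0):.4f}")
```

When the KL exceeds the budget, the update raises $\lambda$ and pulls the policy back toward the pretrained one. When the KL is below the budget, $\lambda$ shrinks and the critic is allowed more influence. In this toy the fixed point is $\lambda^\star = |s| / \sqrt{2\,\varepsilon}$, where $\varepsilon$ is the KL budget.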