Likelihood-Free Policy Optimization (LFPO) addresses the central obstacle to applying reinforcement learning with verifiable rewards (RLVR) to diffusion large language models (dLLMs): the intractability of exact likelihood computation. LFPO maps vector-field flow matching to the discrete token space, formulating alignment as geometric velocity rectification and optimizing denoising logits via contrastive updates, thereby bypassing likelihood-approximation errors. Experiments show that LFPO outperforms state-of-the-art baselines on code and reasoning benchmarks and accelerates inference by roughly 20% through reduced diffusion steps.
Ditch the likelihood approximations: LFPO directly optimizes denoising logits in diffusion LMs via contrastive updates, leading to faster inference and better code/reasoning performance.
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector-field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
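The abstract does not spell out the objective, but the core idea of a contrastive update over denoising logits can be pictured with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the paper's actual loss: the function name `contrastive_denoising_loss`, the DPO-style logistic margin, and the `beta` hyperparameter are hypothetical. The sketch only shows how reward-verified and rejected completions can drive the denoiser's per-step logits directly, without ever evaluating an (intractable) sequence likelihood.

```python
import torch
import torch.nn.functional as F

def contrastive_denoising_loss(pos_logits, pos_tokens, neg_logits, neg_tokens, beta=0.1):
    """Hypothetical contrastive update over denoising logits.

    pos_logits, neg_logits: [B, T, V] denoising logits at a sampled diffusion step,
        for a reward-verified (correct) and a rejected (incorrect) completion.
    pos_tokens, neg_tokens: [B, T] token ids of those completions.
    """
    # Per-token log-probabilities of the sampled tokens under the denoiser's logits.
    pos_lp = F.log_softmax(pos_logits, dim=-1).gather(-1, pos_tokens.unsqueeze(-1)).squeeze(-1)
    neg_lp = F.log_softmax(neg_logits, dim=-1).gather(-1, neg_tokens.unsqueeze(-1)).squeeze(-1)
    # Logistic margin between the two completions, averaged over positions.
    # The gradient flows straight into the logits; no sequence likelihood is needed.
    margin = beta * (pos_lp.mean(dim=-1) - neg_lp.mean(dim=-1))  # [B]
    return -F.logsigmoid(margin).mean()

# Toy usage: random tensors stand in for a dLLM denoiser's output.
B, T, V = 2, 8, 100
loss = contrastive_denoising_loss(
    torch.randn(B, T, V, requires_grad=True), torch.randint(V, (B, T)),
    torch.randn(B, T, V), torch.randint(V, (B, T)))
loss.backward()
```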
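The consistency mechanism, predicting the final solution directly from an intermediate denoising step so that the probability flow is "straightened," admits a similarly simple sketch. Again, the names and the plain cross-entropy formulation are assumptions for illustration; the paper may use a different target or weighting.

```python
import torch
import torch.nn.functional as F

def consistency_loss(denoiser, x_t, t, x0_tokens):
    """Train the denoiser to jump from an intermediate noisy state x_t
    straight to the final token sequence x0 (hypothetical formulation)."""
    logits = denoiser(x_t, t)  # [B, T, V]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), x0_tokens.reshape(-1))

# Toy usage: a linear head over token embeddings stands in for a dLLM denoiser.
B, T, V, D = 2, 8, 100, 16
embed = torch.nn.Embedding(V, D)
head = torch.nn.Linear(D, V)
denoiser = lambda x_t, t: head(embed(x_t))  # the timestep t is ignored in this toy
x_t = torch.randint(V, (B, T))              # noisy intermediate tokens
x0 = torch.randint(V, (B, T))               # final solution tokens
loss = consistency_loss(denoiser, x_t, torch.tensor(0.5), x0)
loss.backward()
```

A flow straightened this way lets the sampler take larger jumps at inference time, which is consistent with the roughly 20% speedup from reduced diffusion steps reported above.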