Mar 18, 2026, arXiv:2603.17685

Flow Matching Policy with Entropy Regularization

Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn

AI Summary

The paper introduces Flow Matching Policy with Entropy Regularization (FMER), an ODE-based online RL framework that parameterizes the policy via flow matching and samples actions along a straight probability path. FMER constructs an advantage-weighted target velocity field from a candidate set, which steers policy updates toward high-value regions, and derives a tractable entropy objective for principled maximum-entropy optimization. Experiments show that FMER outperforms state-of-the-art methods on sparse multi-goal FrankaKitchen benchmarks, remains competitive on standard MuJoCo benchmarks, and reduces training time by 7x compared to the heavy diffusion baseline QVPO and by 10-15% relative to efficient variants.

Key Contribution

FMER replaces slow diffusion policies with a flow matching policy and a tractable entropy regularization term, training 7x faster than the diffusion baseline QVPO while outperforming state-of-the-art methods on sparse-reward FrankaKitchen tasks.

Abstract

Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control due to the intractability of the exact entropy, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model's generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and by 10-15% relative to efficient variants.
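
The abstract does not spell out the update equations, so the following is only a minimal NumPy sketch of the generic ingredients it names: a straight probability path, an advantage-weighted target velocity built from a candidate set, and action sampling by integrating an ODE. The softmax weighting, the `temperature` parameter, the Euler integrator, and all function names are assumptions made for illustration; they are not taken from the paper and may differ from FMER's actual construction.

```python
import numpy as np


def straight_path_point(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1


def advantage_weights(advantages, temperature=1.0):
    """Softmax weights over candidates (an assumed weighting scheme)."""
    z = (advantages - advantages.max()) / temperature  # shift for stability
    w = np.exp(z)
    return w / w.sum()


def weighted_target_velocity(x0, candidates, advantages, temperature=1.0):
    """Advantage-weighted average of the straight-path velocities x1_k - x0.

    x0:         noise sample, shape (action_dim,)
    candidates: candidate actions, shape (n_candidates, action_dim)
    advantages: advantage estimate per candidate, shape (n_candidates,)
    """
    w = advantage_weights(advantages, temperature)
    return (w[:, None] * (candidates - x0[None, :])).sum(axis=0)


def sample_action(velocity_fn, x0, n_steps=10):
    """Euler integration of dx/dt = v(x, t) from t = 0 to t = 1."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


# Usage: two candidate actions with equal advantage give a target velocity
# that points from the noise sample toward their mean.
x0 = np.zeros(2)
candidates = np.array([[1.0, 0.0], [0.0, 1.0]])
advantages = np.array([0.0, 0.0])
v_target = weighted_target_velocity(x0, candidates, advantages)
action = sample_action(lambda x, t: v_target, x0, n_steps=4)
```

In a learned policy, `velocity_fn` would be a neural network conditioned on the state and regressed onto such targets; here a constant field stands in for it so the example stays self-contained. Because the field is constant along a straight path, a few Euler steps already reach the endpoint, which is the intuition behind the cheaper sampling that the abstract contrasts with iterative denoising.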

Robotics & Embodied AI, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
