Safe Flow Q-Learning (SafeFQL) is an offline safe reinforcement learning method that combines a Hamilton-Jacobi reachability-inspired safety value function with a one-step flow policy for efficient, safe action selection. It learns the safety value function through a self-consistency Bellman recursion, trains a flow policy via behavioral cloning, and distills that policy into a one-step actor. A conformal prediction calibration step compensates for finite-data approximation error in the learned safety boundary and provides finite-sample probabilistic safety coverage.
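SafeFQL matches or exceeds prior offline safe RL methods on boat navigation and Safety Gymnasium MuJoCo tasks while substantially reducing constraint violations. Its inference latency is substantially lower than that of diffusion-style baselines, which suits real-time safety-critical applications.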
Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends Flow Q-Learning (FQL) to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
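
To make the calibration idea concrete, the following is a minimal sketch of generic split conformal calibration applied to a safety threshold. It is not the paper's procedure, whose details are not given here. It assumes a convention in which a higher safety score means a safer state, a held-out calibration set of scores for states that were later observed to be unsafe, and exchangeability between calibration and test states. The names `conformal_safety_threshold`, `is_certified_safe`, `unsafe_scores`, and `alpha` are hypothetical.

```python
import math


def conformal_safety_threshold(unsafe_scores, alpha):
    """Return a threshold tau from held-out scores of truly unsafe states.

    Assumed convention (not from the paper): higher score = safer, and a
    state is declared safe only when its score is strictly above tau.
    Under exchangeability, a new unsafe state scores above tau with
    probability at most alpha.
    """
    scores = sorted(float(s) for s in unsafe_scores)
    n = len(scores)
    # Standard split-conformal rank: the ceil((n + 1) * (1 - alpha))-th
    # smallest calibration score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: certify nothing.
        return math.inf
    return scores[k - 1]


def is_certified_safe(score, threshold):
    """Declare a state safe only if its score exceeds the calibrated threshold."""
    return score > threshold


# Hypothetical calibration scores for states that later violated a constraint.
calibration = [0.3, -1.0, -0.2, 0.1, -0.6, -0.4, 0.0]
tau = conformal_safety_threshold(calibration, alpha=0.25)
```

The finite-sample correction `(n + 1)` is what distinguishes the conformal quantile from a plain empirical quantile: it makes the threshold more conservative when the calibration set is small, and the sketch refuses to certify any state when the set is too small for the requested `alpha`.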