Mar 5, 2026

arXiv:2603.05296

Latent Policy Steering through One-Step Flow Policies

Hokyun Im, Andrey Kolobov, Jianlong Fu, Youngwoon Lee

AI Summary

Latent Policy Steering (LPS) is introduced to improve offline reinforcement learning by directly backpropagating action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor, avoiding the information loss associated with latent-space critics. This approach decouples return maximization from behavioral constraints by using the MeanFlow policy as a behavior-constrained generative prior. Experiments on OGBench and real-world robotic tasks demonstrate that LPS achieves state-of-the-art performance with minimal hyperparameter tuning, outperforming behavioral cloning and other latent steering methods.

Key Contribution

Replacing latent-space critics in offline RL with action-space Q-gradients backpropagated directly through a differentiable one-step flow policy enables robust latent policy steering with minimal tuning and state-of-the-art performance.

Abstract

Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
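The core mechanism described above, passing an action-space Q-gradient back through a differentiable one-step policy to update a latent action, can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the paper's implementation: a fixed `tanh` affine map (`decode`) plays the role of the frozen one-step MeanFlow policy, a quadratic `q_value` plays the role of the action-space critic, and the "actor" is a single latent vector with no state conditioning or latent regularization.

```python
import math

# ASSUMPTION: a frozen, differentiable one-step decoder standing in for the
# MeanFlow policy. It maps a latent action z to an action a in one pass.
W = [[0.8, -0.3], [0.2, 0.9]]
B = [0.1, -0.2]

def decode(z):
    return [math.tanh(sum(W[i][j] * z[j] for j in range(2)) + B[i])
            for i in range(2)]

# ASSUMPTION: a toy action-space critic that peaks at TARGET.
TARGET = [0.5, -0.4]

def q_value(a):
    return -sum((a[i] - TARGET[i]) ** 2 for i in range(2))

def q_grad_wrt_latent(z):
    """Chain rule: dQ/dz = (da/dz)^T dQ/da, with no latent-space critic."""
    a = decode(z)
    dq_da = [-2.0 * (a[i] - TARGET[i]) for i in range(2)]
    # Derivative of tanh is 1 - tanh^2, and a already holds tanh(pre-activation).
    dq_dpre = [dq_da[i] * (1.0 - a[i] ** 2) for i in range(2)]
    return [sum(W[i][j] * dq_dpre[i] for i in range(2)) for j in range(2)]

def steer(z, lr=0.3, steps=500):
    """Gradient ascent on the latent; the decoder and critic stay fixed."""
    for _ in range(steps):
        g = q_grad_wrt_latent(z)
        z = [z[j] + lr * g[j] for j in range(2)]
    return z

z0 = [0.0, 0.0]
z1 = steer(z0)
q_before = q_value(decode(z0))
q_after = q_value(decode(z1))
```

In this sketch only the latent changes, so every action the optimizer can reach is an output of the decoder. This loosely mirrors the role the abstract assigns to the one-step policy as a behavior-constrained generative prior.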

Robotics & Embodied AI, Training Efficiency & Optimization, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 31

Year: 2026

Venue: N/A
