Latent Policy Steering (LPS) improves offline reinforcement learning by backpropagating action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor, which avoids the information loss associated with latent-space critics. Because the MeanFlow policy serves as a behavior-constrained generative prior, return maximization is decoupled from behavioral constraints. Experiments on OGBench and real-world robotic tasks show that LPS achieves state-of-the-art performance with minimal hyperparameter tuning, outperforming behavioral cloning and other latent steering methods.
Replacing latent critics in offline RL with action-space Q-gradients, backpropagated directly through a differentiable flow-based policy, yields state-of-the-art performance and robust latent policy steering with minimal tuning.
Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
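The core mechanism in the abstract, passing an action-space Q-gradient back through a frozen one-step generative policy to update a latent actor, can be illustrated with a toy scalar example. This is a minimal sketch under loud assumptions, not the authors' implementation: a bounded `tanh` decoder stands in for the MeanFlow policy, the critic is a fixed hand-written quadratic rather than a learned Q-function, the latent actor is linear, and analytic derivatives replace automatic differentiation. All function names and constants here are hypothetical.

```python
import math

SIGMA = 0.5  # assumed half-width of the dataset's action support (hypothetical)

def behavior_mean(s):
    # Hypothetical center of the dataset actions for state s.
    return 0.2 * s

def decode(s, z):
    # Frozen one-step generative prior (toy stand-in for the MeanFlow policy).
    # Any latent z maps to an action within behavior_mean(s) +/- SIGMA.
    return behavior_mean(s) + SIGMA * math.tanh(z)

def decode_dz(s, z):
    # Analytic derivative da/dz of the decoder.
    return SIGMA * (1.0 - math.tanh(z) ** 2)

def q_value(s, a):
    # Toy action-space critic (assumption): peaked at a = s.
    return -(a - s) ** 2

def q_da(s, a):
    # Analytic derivative dQ/da of the critic.
    return -2.0 * (a - s)

def latent_actor(params, s):
    # Linear latent actor: z = w * s + b.
    w, b = params
    return w * s + b

def steer(states, params=(0.0, 0.0), lr=0.5, steps=200):
    # Gradient ascent on the mean Q-value with respect to the actor parameters.
    # The decoder and the critic are never updated here.
    w, b = params
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for s in states:
            z = latent_actor((w, b), s)
            a = decode(s, z)
            # Chain rule: dQ/dz = dQ/da * da/dz (action-space gradient
            # carried back through the decoder into latent space).
            dq_dz = q_da(s, a) * decode_dz(s, z)
            grad_w += dq_dz * s  # dz/dw = s
            grad_b += dq_dz      # dz/db = 1
        w += lr * grad_w / len(states)
        b += lr * grad_b / len(states)
    return w, b

def mean_q(states, params):
    # Average critic value of the decoded actions over the given states.
    total = sum(q_value(s, decode(s, latent_actor(params, s))) for s in states)
    return total / len(states)

if __name__ == "__main__":
    states = [-1.0, -0.5, 0.5, 1.0]
    trained = steer(states)
    print("mean Q before:", mean_q(states, (0.0, 0.0)))
    print("mean Q after: ", mean_q(states, trained))
```

In this toy setup the critic's preferred action for `s = 1.0` lies outside the decoder's range, so the actor can only move the action toward the edge of the assumed support. That illustrates, in a simplified form, how a frozen generative prior can bound the policy without an explicit penalty weight. With an autodiff framework, the two hand-written derivatives would be replaced by a single backward pass through the critic and the decoder.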