Jianlong Fu

D toy task with reward concentrated in the top-right corner reveal a pattern: large $\alpha$ yields overly conservative policies, while small $\alpha$ encourages out-of-support actions.

Behavior regularization is sensitive to $\alpha$. Behavior-regularized offline RL methods balance value maximization against behavioral adherence through a weighting hyperparameter $\alpha$ (Eq. 2), and this balance can be fragile even in simple settings. Figure 2 provides an example: large $\alpha$ yields overly conservative policies, while small $\alpha$ encourages out-of-support actions. In general, the appropriate $\alpha$ can vary substantially with reward scale and task characteristics. As a result, methods that rely on it often require task-specific hyperparameter sweeps, which is impractical in real-world deployments.

Figure 3: Comparing the action-space Q-value and the distilled latent-space Q-value. Left to right: (1) dataset distribution with reward intensity; (2) action-space Q-value $Q_{\phi}(s,a)$ projected into the latent space; (3) learned latent Q-value $Q_{\phi}(s,z)$; (4) cosine similarity between the gradients in (2) and (3).

Distilled latent critics can provide poor gradients. Latent steering methods (e.g., DSRL) optimize latents by relying on a value function defined in the latent space. In the offline setting, this is typically obtained by distilling the action-space critic through the frozen decoder, i.e., $\min_{\phi}\mathbb{E}\left[\lvert Q_{\phi}(s,z)-Q_{\theta}(s,\pi_{\beta}(s,z))\rvert^{2}\right]$. However, matching values does not guarantee that the latent gradients used for improvement are accurate. As illustrated in Figure 3, even when $Q_{\phi}$ approximates values reasonably, its gradient direction $\nabla_{z}Q_{\phi}(s,z)$ can deviate substantially from the gradient of the action-space critic $\nabla_{z}Q_{\theta}(s,\pi_{\beta}(s,z))$, particularly near sharp boundaries of the data manifold.
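The gap between matching values and matching gradients can be seen in a small numerical sketch. It is not the experiment behind Figure 3: `q_reference` is a toy stand-in for the action-space critic composed with the decoder, and the ripple added in `q_distilled` (with hypothetical constants `EPS` and `FREQ`) is just one assumed shape a distillation error could take.

```python
import math

EPS = 0.01    # hypothetical amplitude of the distillation error
FREQ = 500.0  # hypothetical frequency of the distillation error


def q_reference(z):
    """Toy stand-in for the action-space critic composed with the decoder."""
    return -(z[0] ** 2 + z[1] ** 2)


def grad_q_reference(z):
    """Analytic gradient of q_reference with respect to z."""
    return (-2.0 * z[0], -2.0 * z[1])


def q_distilled(z):
    """Toy 'distilled' critic: stays within EPS of q_reference everywhere."""
    return q_reference(z) + EPS * math.sin(FREQ * z[0])


def grad_q_distilled(z):
    """Analytic gradient of q_distilled; the ripple adds EPS * FREQ * cos(...)."""
    gx, gy = grad_q_reference(z)
    return (gx + EPS * FREQ * math.cos(FREQ * z[0]), gy)


def cosine_similarity(u, v):
    """Cosine of the angle between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


z = (0.5, 0.1)
value_gap = abs(q_distilled(z) - q_reference(z))
grad_cos = cosine_similarity(grad_q_distilled(z), grad_q_reference(z))
print(f"value gap = {value_gap:.4f}, gradient cosine similarity = {grad_cos:.3f}")
```

The value error is bounded by `EPS`, yet the ripple can contribute up to `EPS * FREQ = 5` to the gradient, which is enough to flip the update direction at this point.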
Such gradient mismatch can lead to suboptimal latent updates and degrade purely offline performance.

IV Latent Policy Steering (LPS)

We propose Latent Policy Steering (LPS), which addresses both of the above limitations. First, LPS avoids an explicit behavior-regularization trade-off by separating reward maximization from distributional constraints: a fixed generative behavior policy defines the support, while a latent actor performs value-driven steering (resolving the $\alpha$-sensitivity). Second, LPS eliminates proxy latent critics by directly backpropagating action-space critic gradients through a differentiable generative base policy to update the latent actor (avoiding the inaccurate latent critic). We instantiate LPS using three key components: a differentiable one-step base policy (Section IV-A), a spherical latent geometry (Section IV-B), and a direct latent optimization objective (Section IV-C).

IV-A Differentiable Base Policy via MeanFlow

The first component is the base policy $\pi_{\beta}:\mathcal{Z}\times\mathcal{S}\to\mathcal{A}$, which defines the “safe manifold”, i.e., the support of the dataset. While DSRL treats the base policy as a black box, LPS treats it as a differentiable mapping. This allows us to backpropagate gradients from the action-space critic to the latent actor directly through $\pi_{\beta}$. A practical obstacle, however, is that standard diffusion or flow-matching policies typically require iterative sampling, making end-to-end backpropagation expensive and unstable. We therefore employ MeanFlow for the base policy, which enables efficient one-step deterministic generation.

Noise-to-action reformulation. In the original MeanFlow formulation, samples are produced by applying a learned displacement to latent noise. Early in training, errors in the displacement field can amplify output variance, which in turn destabilizes the critic gradients used for steering.
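As a rough illustration of backpropagating an action-space critic through a differentiable one-step decoder, the sketch below applies the chain rule by hand on a toy problem. Everything in it is an assumption made for illustration: the decoder `decode` (a `tanh` of a fixed linear map with hypothetical weights `W`), the quadratic `critic`, and the `TARGET` action. It updates a single latent directly rather than a latent actor network, and it omits the state argument.

```python
import math

# Hypothetical frozen weights standing in for a one-step base policy.
W = ((0.8, -0.3), (0.2, 0.5))
TARGET = (0.4, -0.2)  # hypothetical high-value action


def decode(z):
    """Toy one-step decoder: a = tanh(W z)."""
    return tuple(math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in W)


def critic(a):
    """Toy action-space critic: Q(a) = -||a - TARGET||^2."""
    return -sum((ai - ti) ** 2 for ai, ti in zip(a, TARGET))


def latent_grad(z):
    """Chain rule: dQ/dz_j = sum_i dQ/da_i * (1 - a_i^2) * W[i][j]."""
    a = decode(z)
    dq_da = [-2.0 * (ai - ti) for ai, ti in zip(a, TARGET)]
    return tuple(
        sum(dq_da[i] * (1.0 - a[i] ** 2) * W[i][j] for i in range(len(a)))
        for j in range(len(z))
    )


def finite_diff_grad(z, h=1e-6):
    """Central finite differences, used only to check latent_grad."""
    grads = []
    for j in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        grads.append((critic(decode(z_plus)) - critic(decode(z_minus))) / (2 * h))
    return tuple(grads)


z = (1.0, -1.0)
g = latent_grad(z)
z_new = tuple(zi + 0.1 * gi for zi, gi in zip(z, g))  # one gradient-ascent step
print("Q before:", critic(decode(z)), "Q after:", critic(decode(z_new)))
```

Because the decoder is a single differentiable call, the latent gradient costs one Jacobian-vector product; an iterative sampler would require differentiating through every sampling step.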
Following recent practice [15, 25], we use a noise-to-action reformulation in which $\pi_{\beta}$ directly predicts the denoised action (or action chunk) rather than the displacement. Concretely, we write the implied mean velocity $u_{\beta}$ and its time derivative as residual quantities:

$$u_{\beta}(z_{t},r,t)=z_{t}-\pi_{\beta}(z_{t},r,t),\quad\frac{\mathrm{d}u_{\beta}}{\mathrm{d}t}=v-\frac{\mathrm{d}\pi_{\beta}}{\mathrm{d}t}. \tag{7}$$

Substituting Eq. 7 into the MeanFlow training objective Eq. 5 yields a numerically more stable training procedure by grounding the training in the action space.

IV-B Spherical Latent Geometry

Given the base policy (mapping) $\pi_{\beta}$, we next define the latent space $\mathcal{Z}_{\mathrm{sphere}}$ where the latent actor operates. A known failure mode with unconstrained Gaussian latents is the “norm explosion” problem. Because the latent actor is optimized to increase value without explicit bounds, it may increase $\lvert z\rvert$ to query latents that are atypical under the base policy prior, leading to out-of-distribution decoding and unstable learning. To address this, we leverage the concentration of measure property of high-dimensional Gaussians: for $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$, most probability mass concentrates on a thin shell of radius $\sqrt{d}$ [23]. This suggests treating the “typical set” of the base policy as naturally spherical. Therefore, we synchronize the support of the base policy and latent actor’s output $l_{\phi}(s)$ by constraining both to the hypersphere Sd−
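The thin-shell concentration of high-dimensional Gaussians mentioned in Section IV-B can be checked numerically with a short sketch. The helper names below are hypothetical, and `project_to_sphere` assumes the constraint is a plain rescaling to radius $\sqrt{d}$; the source text is cut off before it states the exact normalization, so treat that choice as an assumption.

```python
import math
import random


def l2_norm(z):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in z))


def gaussian_norms(dim, n_samples, seed=0):
    """Norms of n_samples draws from a standard Gaussian in `dim` dimensions."""
    rng = random.Random(seed)
    return [
        l2_norm([rng.gauss(0.0, 1.0) for _ in range(dim)])
        for _ in range(n_samples)
    ]


def project_to_sphere(z):
    """Rescale z onto the sphere of radius sqrt(d) (assumed normalization)."""
    norm = l2_norm(z)
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    scale = math.sqrt(len(z)) / norm
    return [x * scale for x in z]


for dim in (2, 32, 512):
    norms = gaussian_norms(dim, 200)
    lo = min(norms) / math.sqrt(dim)
    hi = max(norms) / math.sqrt(dim)
    print(f"d={dim:4d}: norm / sqrt(d) ranges over [{lo:.2f}, {hi:.2f}]")

# A latent whose norm has "exploded" is pulled back to the typical radius.
exploded = [100.0] * 8
print(l2_norm(project_to_sphere(exploded)), math.sqrt(8))
```

As the dimension grows, the ratio of the sampled norm to $\sqrt{d}$ stays in a narrower band around 1, which is the motivation for treating the typical set as a sphere.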


Research focus

Robotics & Embodied AI (1), Training Efficiency & Optimization (1), World Models & Planning (1)

Frequent co-authors

Hokyun Im (1), A. Kolobov (1), Andrey Kolobov (1), Youngwoon Lee (1)

Papers (1)

Microsoft Research, Mar 5, 2026

Latent Policy Steering through One-Step Flow Policies

Ditching latent critics in offline RL unlocks state-of-the-art performance by directly backpropagating action-space gradients through a differentiable flow-based policy, enabling robust latent policy steering with minimal tuning.

Hokyun Im, A. Kolobov, Andrey Kolobov +2

Robotics & Embodied AI, Training Efficiency & Optimization, World Models & Planning
