Hokyun Im

M gradient steps with batch size 256. We use a 4-layer MLP with hidden size 512 for the base policies and 256 for the critics. For the latent actors in LPS and DSRL, we use a 2-layer MLP with hidden size 256. We carefully tune α for QC-FQL and QC-MFQL. For CFGRL, we use the best-reported CFG strength w from [6]. Following common practice, we normalize the critic loss to have unit norm.

V-B Experimental Results

Figure 5: Performance on OGBench. We evaluate the success rates across tasks. Bars report the mean success rate over 3 seeds, and error bars indicate the 95% confidence interval estimated using bootstrap resampling with

Research focus

Robotics & Embodied AI (1), Training Efficiency & Optimization (1), World Models & Planning (1)

Frequent co-authors

A. Kolobov (1), Andrey Kolobov (1), Jianlong Fu (1), Youngwoon Lee (1)

Papers (1)

Microsoft Research, Mar 5, 2026

Latent Policy Steering through One-Step Flow Policies

Ditching latent critics in offline RL unlocks state-of-the-art performance by directly backpropagating action-space gradients through a differentiable flow-based policy, enabling robust latent policy steering with minimal tuning.

Hokyun Im, A. Kolobov, Andrey Kolobov +2

Robotics & Embodied AI, Training Efficiency & Optimization, World Models & Planning
