We learn a VQ-VAE codebook $\mathbf{C}=\{\mathbf{e}_{k}\in\mathbb{R}^{D}\}_{k=1}^{K}$ with codebook size $K$. We choose $K$ to be small so that the codebook captures coarse primary action prototypes and the primary policy remains easy to learn. Given an action chunk $\mathbf{a}$, the encoder produces $\mathbf{z}_{e}=E_{\phi}(\mathbf{a})$, which we quantize to the nearest codebook vector:

$$k^{*}=\arg\min_{k}\|\mathbf{z}_{e}-\mathbf{e}_{k}\|_{2},\qquad\tilde{\mathbf{z}}=\mathbf{e}_{k^{*}},\qquad m:=k^{*}.$$ (2)

We define $m$ as the primary mode, and the reconstruction is $\hat{\mathbf{a}}^{(m)}=D_{\psi}(\tilde{\mathbf{z}})$. We train the VQ-VAE with the standard reconstruction and commitment terms:

$$\mathcal{L}_{\text{VQ}}(\mathbf{a})=\|\mathbf{a}-D_{\psi}(\tilde{\mathbf{z}})\|_{2}^{2}+\|\mathrm{sg}[E_{\phi}(\mathbf{a})]-\tilde{\mathbf{z}}\|_{2}^{2}+\beta\|E_{\phi}(\mathbf{a})-\mathrm{sg}[\tilde{\mathbf{z}}]\|_{2}^{2},$$ (3)

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ is the commitment weight. The primary policy is a classifier $\pi_{1}(m\mid\mathbf{o})$ trained to predict the VQ code $m$ from the observation $\mathbf{o}$; at test time, we select the discrete mode $m$ for the current chunk as the one with the highest predicted probability under $\pi_{1}$. Both the encoder $E_{\phi}$ and the decoder $D_{\psi}$ are implemented as compact MLPs.

Primary Mode Policy. The primary policy $\pi_{1}(m\mid\mathbf{o})$ maps the shared observation embedding to a categorical distribution over the $K$ VQ bins. We implement $\pi_{1}$ as a lightweight MLP classifier, optimized during training with a standard cross-entropy objective against the encoder-assigned VQ indices; at test time we use greedy mode selection for reliability. Isolating primary-mode selection in an explicit classifier drastically reduces coarse mode bouncing.
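To make these steps concrete, the following is a minimal PyTorch sketch of the quantization in Eq. (2), the codebook and commitment terms of Eq. (3), and greedy test-time mode selection. All names (`vq_quantize`, `select_mode`, `pi1`, `obs_emb`) are illustrative assumptions rather than our released implementation.

```python
import torch
import torch.nn.functional as F

def vq_quantize(z_e, codebook, beta=0.25):
    """Nearest-neighbor quantization (Eq. 2) and the VQ loss terms of Eq. 3.

    z_e:      (B, D) encoder outputs E_phi(a)
    codebook: (K, D) embedding vectors e_k
    """
    dists = torch.cdist(z_e, codebook)        # (B, K) pairwise L2 distances
    m = dists.argmin(dim=1)                   # primary mode m := k*  (Eq. 2)
    z_q = codebook[m]                         # quantized latent z~ = e_{k*}

    # Codebook and commitment terms of Eq. 3; the reconstruction term
    # ||a - D_psi(z~)||^2 is added outside, after decoding.
    codebook_loss = F.mse_loss(z_q, z_e.detach())   # ||sg[E(a)] - z~||^2
    commit_loss = F.mse_loss(z_e, z_q.detach())     # ||E(a) - sg[z~]||^2
    vq_loss = codebook_loss + beta * commit_loss

    # Straight-through estimator so gradients reach the encoder.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, m, vq_loss

def select_mode(pi1, obs_emb):
    """Greedy test-time mode selection from the classifier pi_1(m | o)."""
    return pi1(obs_emb).argmax(dim=-1)        # highest predicted probability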
3.4 Mode-Conditioned MeanFlow Policy

After selecting a primary mode $m$, we recover a high-quality continuous action chunk that respects the selected mode. To balance generation quality with real-time responsiveness, we use one-step generative modeling inspired by MeanFlow (Geng et al., 2025): instead of multi-step denoising iterations, a learned average-velocity field predicts the displacement from noise to the desired action in a single function evaluation. Let $m$ be the selected discrete mode and $\hat{\mathbf{a}}^{(m)}:=D_{\psi}(\mathbf{e}_{m})$ the VQ-decoder reconstruction of that mode. The one-step generator produces a residual $\Delta\mathbf{a}$, conditioned on the observation $\mathbf{o}$ and the mode $m$, such that the final action chunk is $\hat{\mathbf{a}}=\hat{\mathbf{a}}^{(m)}+\Delta\mathbf{a}$.

Mode- and Observation-Conditioned Average Velocity Field. Following MeanFlow (Geng et al., 2025), we implement the residual as an average velocity field $\bar{\mathbf{v}}_{\theta}(\mathbf{z}_{r},\tau,r;\mathbf{o},m)$, where $\mathbf{z}_{r}$ denotes a state on the interpolation path between the noise sample and the target action, $\tau\in[0,1]$ is the interpolation start time, and $r\in(0,1]$ is the end time. The MeanFlow field is trained to match the ground-truth average velocity over arbitrary intervals $[\tau,r]$:

$$\bar{\mathbf{v}}^{*}(\mathbf{z}_{r},\tau,r)=\mathrm{sg}\!\left(\frac{d\mathbf{z}_{r}}{dr}-(r-\tau)\left(\frac{d\mathbf{z}_{r}}{dr}\frac{\partial\bar{\mathbf{v}}_{\theta}}{\partial\mathbf{z}}+\frac{\partial\bar{\mathbf{v}}_{\theta}}{\partial r}\right)\right).$$ (4)

Here $\frac{d\mathbf{z}_{r}}{dr}$ is the instantaneous velocity of $\mathbf{z}_{r}$ at time $r$, $\frac{\partial\bar{\mathbf{v}}_{\theta}}{\partial\mathbf{z}}$ describes how the average velocity responds to perturbations of the residual draft, and $\frac{\partial\bar{\mathbf{v}}_{\theta}}{\partial r}$ captures how it evolves as the interpolation approaches the target residual. We train $\bar{\mathbf{v}}_{\theta}$ with a squared-error objective that supervises the predicted average velocity. More detailed derivations are provided in the Appendix.

Implementation Details. We use a DiT-style transformer backbone (Peebles and Xie, 2023), representing each action chunk as a sequence of tokens. The time-related scalars $\tau$ and $r$ are expanded via sinusoidal embeddings (Vaswani et al., 2017) and added to the observation embedding, together with a learnable embedding of the discrete mode $m$. During training, $(\tau,r)$ is sampled from a uniform distribution and $\mathbf{z}_{0}$ is drawn from a standard normal distribution.
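As a concrete illustration of the training target in Eq. (4), here is a hedged PyTorch sketch of one loss computation, using `torch.func.jvp` to obtain the bracketed total-derivative term in a single pass. The linear interpolation convention (noise at $r{=}1$, target residual at $r{=}0$) and all names (`vbar`, `meanflow_loss`) are assumptions for illustration, not the exact implementation.

```python
import torch
from torch.func import jvp

def meanflow_loss(vbar, obs_emb, mode_emb, residual):
    """One squared-error training step against the Eq. 4 target.

    vbar(z, tau, r, obs, mode) -> predicted average velocity (assumed API).
    residual: ground-truth residuals Delta a = a - a_hat^{(m)}, shape (B, T, D).
    """
    B = residual.shape[0]
    shape = (B,) + (1,) * (residual.dim() - 1)

    # Sample an interval 0 <= tau <= r <= 1 and a noise endpoint.
    t = torch.rand(B, 2, device=residual.device)
    tau, r = t.min(dim=1).values, t.max(dim=1).values
    eps = torch.randn_like(residual)

    # Linear path (noise at r=1, target at r=0) and its velocity dz_r/dr.
    r_b, tau_b = r.view(shape), tau.view(shape)
    z_r = (1 - r_b) * residual + r_b * eps
    v = eps - residual

    # jvp with tangent (v, 0, 1) yields dz/dr * dv/dz + dv/dr in one pass.
    u, du = jvp(
        lambda z, t0, t1: vbar(z, t0, t1, obs_emb, mode_emb),
        (z_r, tau, r),
        (v, torch.zeros_like(tau), torch.ones_like(r)),
    )
    target = (v - (r_b - tau_b) * du).detach()   # stop-gradient, Eq. 4
    return ((u - target) ** 2).mean()
```

Computing the total derivative with a single `jvp` call keeps each training step at roughly the cost of one extra forward pass, which is part of what makes the one-step formulation practical.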
3.5 Theoretical Analysis

With the two-stage architecture defined, we now provide a concise theoretical analysis of why this coarse-to-fine decomposition strictly reduces the minimum achievable MSE compared with single-stage generative predictors. Single-stage generative methods produce actions by sampling a latent code $\mathbf{z}\sim\mathcal{N}(0,I)$ and decoding $\hat{\mathbf{a}}_{g}=\pi(\mathbf{o},\mathbf{z})$. Under the squared-error criterion, the best point estimate is the conditional expectation $\hat{\mathbf{a}}_{g}^{*}(\mathbf{o})=\mathbb{E}_{\mathbf{z}}[\pi(\mathbf{o},\mathbf{z})]$. The resulting expected MSE decomposes into an irreducible data-variance term and a model-bias term:

$$\mathbb{E}_{\mathbf{o},\mathbf{a}}\big[\|\mathbf{a}-\hat{\mathbf{a}}_{g}^{*}(\mathbf{o})\|^{2}\big]=\mathbb{E}_{\mathbf{o}}\big[\mathrm{Var}(\mathbf{a}\mid\mathbf{o})\big]+\mathbb{E}_{\mathbf{o}}\big[\|\mathbb{E}[\mathbf{a}\mid\mathbf{o}]-\hat{\mathbf{a}}_{g}^{*}(\mathbf{o})\|^{2}\big].$$ (5)

When the model is unbiased, the second term vanishes and the minimum achievable error equals $\mathbb{E}_{\mathbf{o}}[\mathrm{Var}(\mathbf{a}\mid\mathbf{o})]$. In our two-stage scheme, the primary stage selects a discrete mode $\hat{m}(\mathbf{o})$ and the second stage outputs $\hat{\mathbf{a}}(\mathbf{o},m,\mathbf{z})=\pi_{2}(\mathbf{o},m,\mathbf{z})$. For any fixed $(\mathbf{o},m)$, the optimal MSE predictor collapses the stochasticity in $\mathbf{z}$ to the conditional expectation $\hat{\mathbf{a}}^{*}(\mathbf{o},m)=\mathbb{E}_{\mathbf{z}}[\pi_{2}(\mathbf{o},m,\mathbf{z})]$, yielding the irreducible residual $\mathbb{E}_{\mathbf{o},m}[\mathrm{Var}(\mathbf{a}\mid\mathbf{o},m)]$ when the model is unbiased. By the law of total variance,

$$\mathbb{E}_{\mathbf{o},m}\big[\mathrm{Var}(\mathbf{a}\mid\mathbf{o},m)\big]=\mathbb{E}_{\mathbf{o}}\big[\mathrm{Var}(\mathbf{a}\mid\mathbf{o})\big]-\mathbb{E}_{\mathbf{o}}\big[\mathrm{Var}_{m\mid\mathbf{o}}\big(\mathbb{E}[\mathbf{a}\mid\mathbf{o},m]\big)\big],$$ (6)

which is no greater than $\mathbb{E}_{\mathbf{o}}[\mathrm{Var}(\mathbf{a}\mid\mathbf{o})]$, and is strictly smaller whenever $\mathrm{Var}_{m\mid\mathbf{o}}\big(\mathbb{E}[\mathbf{a}\mid\mathbf{o},m]\big)>0$. Intuitively, discretizing into primary modes removes the inter-mode variance from the residual error, lowering the MSE bound relative to single-stage latent samplers.
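Equation (6) is easy to verify numerically. The NumPy snippet below builds a toy bimodal action distribution for a single fixed observation (two modes at $\pm 1$ with small within-mode noise; all numbers are illustrative only) and confirms that conditioning on the mode removes the inter-mode variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bimodal action distribution for one fixed observation o:
# mode m=0 -> N(-1, 0.1^2), mode m=1 -> N(+1, 0.1^2), equally likely.
p = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([0.1, 0.1])

m = rng.choice(2, size=1_000_000, p=p)
a = rng.normal(mu[m], sigma[m])

total_var = a.var()                                       # Var(a | o)
within = sum(p[k] * a[m == k].var() for k in range(2))    # E_m[Var(a | o, m)]
between = sum(p[k] * (a[m == k].mean() - a.mean()) ** 2   # Var_m(E[a | o, m])
              for k in range(2))

print(f"total {total_var:.3f} = within {within:.3f} + between {between:.3f}")
# total 1.010 = within 0.010 + between 1.000: selecting the mode removes the
# inter-mode variance, leaving only the small within-mode residual (Eq. 6).
```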
4 Experiments

4.1 Simulation Evaluation

| Method | Hammer | Door | Pen | Laptop | Faucet | Toilet | Bucket | Medium (6) | Hard (5) | Success |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IBC | 0.00±0.00 | 0.00±0.00 | 0.10±0.01 | 0.01±0.01 | 0.07±0.02 | 0.15±0.01 | 0.00±0.00 | 0.11±0.02 | 0.09±0.03 | 0.08 |
| BC-H | 0.10±0.09 | 0.07±0.05 | 0.16±0.03 | 0.09±0.02 | 0.13±0.04 | 0.21±0.02 | 0.10±0.01 | 0.15±0.03 | 0.18±0.05 | 0.15 |
| DP | 0.48±0.17 | 0.50±0.05 | 0.25±0.04 | 0.69±0.04 | 0.23±0.08 | 0.58±0.02 | 0.46±0.01 | 0.20±0.05 | 0.19±0.03 | 0.30 |
| DP3 | **1.00±0.00** | 0.62±0.04 | 0.43±0.06 | 0.83±0.01 | 0.63±0.02 | **0.82±0.04** | 0.46±0.02 | 0.45±0.05 | 0.35±0.02 | 0.51 |
| FlowPolicy | **1.00±0.00** | 0.58±0.05 | 0.53±0.12 | 0.85±0.02 | 0.42±0.10 | 0.80±0.05 | 0.39±0.06 | 0.47±0.07 | 0.37±0.07 | 0.51 |
| PF-DAG (Ours) | **1.00±0.00** | **0.65±0.03** | **0.65±0.01** | **0.90±0.02** | **0.72±0.05** | **0.82±0.02** | **0.47±0.02** | **0.68±0.04** | **0.72±0.03** | **0.72** |

Table 1: Quantitative comparison of PF-DAG against state-of-the-art baselines on 18 tasks from three simulation benchmarks (Adroit: Hammer, Door, Pen; DexArt: Laptop, Faucet, Toilet, Bucket; MetaWorld: Medium and Hard columns average over 6 and 5 tasks, respectively). The last column is the average success rate; bold marks the best result per column.

Benchmarks and Datasets. We evaluate our method on manipulation benchmarks that cover a broad range of control domains: Adroit (Rajeswaran et al., 2017), DexArt (Bao et al., 2023), and MetaWorld (Yu et al., 2020). These benchmarks are implemented on physics engines such as MuJoCo (Todorov et al., 2012) and IsaacGym (Makoviychuk et al., 2021). For fair comparison we adopt the same task splits and data-collection pipelines as prior work (Ze et al., 2024): Adroit tasks with the high-dimensional Shadow Hand and MetaWorld tasks with a low-dimensional gripper are trained with 10 expert demonstrations per task, while DexArt with the Allegro Hand uses 90 expert demonstrations. Demonstrations are collected with scripted policies for MetaWorld and with RL-trained expert agents (Wang et al., 2022; Schulman et al., 2017) for Adroit and DexArt. Each experiment is run with three random seeds. For each seed we evaluate the policy for 20 episodes every 200 training epochs and compute the average of the top-5 success rates (Ze et al., 2024); the final metric is the mean and standard deviation across the three seeds.

Experiment Setup. All networks are optimized with AdamW (Loshchilov and Hutter, 2017), using a short linear warmup followed by cosine learning-rate decay. Training proceeds in stages: we first pretrain the VQ-VAE to learn compact primary prototypes; we then freeze the codebook and jointly train the Primary Mode Policy $\pi_{1}$ (cross-entropy against the VQ indices) and the mode-conditioned MeanFlow generator $\bar{\mathbf{v}}_{\theta}$ (squared-error supervision on sampled $(\tau,r)$ intervals). At inference we set $(\tau,r)=(0,1)$ for one-step continuous action-chunk generation.

Baselines. We compare against the following representative baselines. Implicit Behavioral Cloning (IBC) (Florence et al., 2022) serves as a representative implicit BC method. BC-H (Foster et al., 2024) represents non-generative approaches to mitigating mode instability. Diffusion Policy (DP) (Chi et al., 2023) introduced the original formulation of image-conditioned diffusion-based policies. While