Figure 1: An example illustrating multi-modal expert demonstrations and trajectories predicted by different imitation policies. Behavioral cloning predictions collapse into a single mean. Discrete Policy succeeds but introduces temporal discontinuities. Generative Policy bounces between modes 1 and 2. Our method predicts a consistent and fine-grained trajectory.

Motivated by the above, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage imitation framework that explicitly separates primary mode selection from continuous action generation. Concretely, PF-DAG first learns a discrete vocabulary of primary modes and a lightweight policy that greedily selects a mode coherently over time. Then, we introduce a mode-conditioned MeanFlow policy, a one-step continuous decoder that generates high-fidelity actions conditioned on the selected mode and the current observation. This explicit two-stage decomposition preserves intra-mode variation while reducing mode bouncing by enforcing stable primary choices. We validate PF-DAG with theoretical and empirical evidence. Among existing methods, single-stage generative policies (Chi et al., 2023; Zhao et al., 2023) are the most direct and competitive end-to-end approach to modeling continuous, multi-modal action distributions, so we focus our theoretical comparison on this family. Under realistic mode-variance assumptions, we show that the two-stage design attains a no-higher optimal MSE lower bound than single-stage generative baselines, with a strict improvement whenever the inter-mode variance term is positive. Empirically, we test PF-DAG across 56 simulated manipulation tasks (including high-DOF dexterous hands and low-DOF grippers) as well as on real-world tactile dexterous manipulation.
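To make the two-stage protocol concrete, the sketch below implements a toy version of it in NumPy: a greedy discrete mode selector followed by a mode-conditioned one-step decoder. The linear "networks", dimensions, and mode embeddings are hypothetical placeholders, not PF-DAG's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MODES, OBS_DIM, ACT_DIM = 4, 8, 2

# Stage 1: a toy "primary mode policy" that scores discrete modes
# from the observation and picks one greedily (a stable, coherent choice).
W_mode = rng.normal(size=(OBS_DIM, N_MODES))

def select_primary_mode(obs):
    logits = obs @ W_mode
    return int(np.argmax(logits))

# Stage 2: a toy mode-conditioned one-step decoder that maps the
# observation plus the chosen mode's embedding to a continuous action.
mode_embed = rng.normal(size=(N_MODES, OBS_DIM))
W_dec = rng.normal(size=(2 * OBS_DIM, ACT_DIM))

def decode_action(obs, mode):
    cond = np.concatenate([obs, mode_embed[mode]])
    return cond @ W_dec

obs = rng.normal(size=OBS_DIM)
mode = select_primary_mode(obs)
action = decode_action(obs, mode)
print(mode, action.shape)
```

Because the discrete choice is made once and held fixed while the decoder runs, small perturbations of the observation cannot flip the action between modes mid-generation, which is the intuition behind the reduced mode bouncing.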
Results show consistent improvements in accuracy, stability, and sample efficiency over diffusion- and flow-based baselines, and ablations quantify the contribution of each key component. Together, these results suggest that explicitly decoupling coarse discrete decisions from fine-grained continuous generation yields practical and statistical advantages for closed-loop robotic imitation.

2 Related Work

2.1 Behavior Cloning

Behavior cloning (BC) casts policy learning as supervised regression on demonstration data (Wang et al., 2017; Torabi et al., 2018; Mandlekar et al., 2021; Hu et al., 2024). In BC, a policy is trained to predict the expert's action for each observed state, yielding a deterministic mapping from states to actions. This approach is highly sample-efficient in practice (e.g., for pick-and-place tasks), but it suffers from well-known limitations. In particular, BC policies tend to underfit multi-modal behavior (Mandlekar et al., 2021; Shafiullah et al., 2022; Florence et al., 2022; Chi et al., 2023) and incur compounding errors at test time (Ross et al., 2011; Ke et al., 2021; Tu et al., 2022; Zhao et al., 2023). To mitigate these issues, recent work has explored more expressive BC models. Implicit BC and energy-based models learn an action-energy landscape per state and solve for actions by optimization (Florence et al., 2022), while mixture-density networks and latent-variable BC represent multi-modal distributions explicitly (Jang et al., 2022).

2.2 Discrete Policy

Discretizing continuous robot actions can be viewed as tokenization: converting a high-frequency, high-dimensional control signal into a sequence of discrete symbols so that standard sequence-modeling methods can be applied. Framing actions as tokens has two immediate benefits for manipulation imitation. First, next-token prediction over a discrete vocabulary represents multi-modal conditional action distributions without collapsing modes into a single mean.
Second, sequence models bring powerful context modeling and scalable pretraining recipes from language and vision to control, enabling cross-task and cross-embodiment generalization when token vocabularies are shared or aligned. Recent Vision-Language-Action (VLA) efforts articulate this reframing and its practical advantages for large, generalist robot policies (Zitkovich et al., 2023; O'Neill et al., 2024; Kim et al., 2024; Zawalski et al., 2024; Wen et al., 2025; Black et al., 2024; Zheng et al., 2024; Zhen et al., 2024; Cheang et al., 2024; Duan et al., 2024; Zhao et al., 2025). Existing action tokenizers fall into a few broad families. The simplest and most common approach maps each continuous action dimension at each step to one of a fixed set of bins (Brohan et al., 2022; Zitkovich et al., 2023; Kim et al., 2024). Frequency-space methods such as FAST (Pertsch et al., 2025) depart from this and instead compress action chunks using a time-series transform and lightweight quantization. Others use Vector Quantization (VQ) as latent tokenizers: VQ-based tokenizers learn a shared codebook of action atoms and quantize continuous latent representations to their nearest codebook entries (Lee et al., 2024; Wang et al., 2025). While effective at capturing multi-modal action distributions, these approaches inherently trade reconstruction fidelity for discrete simplicity. Our work differs by leveraging tokenization solely for high-level primary mode selection.

2.3 Generative Policy

A large class of imitation methods treats action generation as a stochastic generative problem by introducing latent variables. In this view, a policy is written as $a = \pi(o, z)$ with $z$ sampled from a learned prior. This formulation naturally represents multi-modal conditional action distributions because sampling different values of $z$ yields different valid actions for the same observation.
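The latent-variable view $a = \pi(o, z)$ can be illustrated with a deliberately simple 1-D example: for a fixed observation, different draws of $z$ select different valid modes rather than their invalid average. The two-mode "policy" below is a hypothetical stand-in, not a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(obs, z):
    # z < 0.5 selects mode A (around -1), otherwise mode B (around +1),
    # with a small intra-mode variation driven by z itself.
    center = -1.0 if z < 0.5 else 1.0
    return center + 0.05 * (z - 0.5)

obs = 0.0  # the observation is fixed; only the latent z varies
actions = [pi(obs, rng.random()) for _ in range(1000)]

near_a = sum(a < 0 for a in actions)   # samples landing in mode A
near_b = len(actions) - near_a         # samples landing in mode B
print(near_a, near_b)
```

Both counts come out nonzero: sampling $z$ covers both modes, whereas an MSE-optimal deterministic policy would output the mean near 0, which matches neither demonstration.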
Action Chunking with Transformers (ACT) (Zhao et al., 2023) is a sequence generator with a Conditional Variational Autoencoder (CVAE) backend. Diffusion Policy (DP) (Chi et al., 2023) treats action generation as conditional denoising: starting from noise, the action is iteratively refined by a learned score or denoiser conditioned on the observation. More recent normalizing-flow policies (Black et al., 2024; Hu et al., 2024; Zhang et al., 2025) provide tractable density estimation and efficient sampling while representing complex, multi-modal action distributions. Although generative policies represent multi-modal distributions, they often face mode bouncing (Chen et al., 2025), high inference cost (Li et al., 2024), and chunk-length trade-offs (Zhao et al., 2023). Other hierarchical approaches, such as Hierarchical Diffusion Policy (HDP) (Ma et al., 2024), also use a high-level policy to guide a low-level generator. However, HDP relies on explicit, task-specific heuristics such as contact-point waypoints to define its hierarchy. In contrast, our PF-DAG learns its primary modes end-to-end directly from action-chunk clusters, offering a more general abstraction not tied to predefined heuristics. We therefore propose to combine the strengths of action tokenization with expressive generative decoders that handle the residual continuous variation. Our PF-DAG decouples primary discrete mode selection from fine-grained action generation, reducing mode bouncing while preserving continuous variation.

2.4 Hierarchical and Residual Policies

Our work is also situated within the broader context of hierarchical and residual policies for robot learning (Rana et al., 2023; Cui et al., 2025; Kujanpää et al., 2023; Liang et al., 2024).
These approaches commonly decompose the complex control problem into a high-level policy that selects a skill, sub-goal, or context, and a low-level policy that executes control conditioned on the high-level selection (Mete et al., 2024; Feng et al., 2024). For instance, some methods learn residual policies that adapt a base controller (Rana et al., 2023), while others focus on discovering discrete skills from demonstration data or language guidance (Chen et al., 2023; Wan et al., 2024; Tanneberg et al., 2021). While PF-DAG shares this general hierarchical structure, its primary motivation and technical design are distinct. Many hierarchical methods focus on long-horizon planning or unsupervised skill discovery. In contrast, PF-DAG is specifically designed to address the mode bouncing inherent in single-stage generative policies when modeling multi-modal action distributions at a fine temporal scale.

3 PF-DAG Formulation and Design

Figure 2: Overview of our PF-DAG framework. The input observation features are extracted via Observation Feature Extraction and fed to the Primary Mode Policy $\pi_1$. The ground-truth action chunks are compressed into discrete primary modes using a VQ-VAE, which supervises $\pi_1$ and is used only during training. The Mode-Conditioned MeanFlow Policy $\pi_2$ takes the selected primary mode $m$ and the observation features as input, generating high-fidelity continuous actions.

This section first defines the task as a closed-loop action-sequence prediction problem, and then presents the three main components of our approach: i) Observation Feature Extraction; ii) a compact discrete representation learned with a Vector-Quantized VAE (VQ-VAE) (Van Den Oord et al., 2017), together with a lightweight Primary Mode Policy that predicts those discrete modes; and iii) a mode-conditioned one-step continuous decoder based on MeanFlow (Geng et al., 2025).
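As an illustration of the vector-quantization step that produces discrete primary modes, the sketch below snaps encoder latents to their nearest codebook entries. The codebook size, latent dimension, and random "encoder outputs" are illustrative, and no training losses (reconstruction, codebook, commitment) are shown.

```python
import numpy as np

rng = np.random.default_rng(2)

K, D = 8, 4                          # codebook size, latent dimension
codebook = rng.normal(size=(K, D))   # learned "action atoms" (random here)

def quantize(z):
    """Snap each latent in z (shape [B, D]) to its nearest codebook entry.

    Returns the discrete mode indices and the quantized latents.
    """
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)          # one discrete primary mode per chunk
    return idx, codebook[idx]

z = rng.normal(size=(5, D))          # pretend encoder outputs for 5 chunks
modes, z_q = quantize(z)
print(modes.shape, z_q.shape)
```

The resulting indices are exactly the kind of discrete labels that can supervise a lightweight mode-selection policy, while the continuous residual between `z` and `z_q` is what the fine-grained decoder must recover.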
Finally, we give a theoretical analysis that quantifies why a two-stage, coarse-to-fine decomposition reduces the optimal MSE lower bound compared to single-stage generative models.

3.1 Closed-loop Action Sequence Prediction

Similar to previous work (Chi et al., 2023; Black et al., 2024), we formulate the manipulation task as closed-loop action-sequence prediction. Concretely, at time $t$, the observation is $\mathbf{o}_t = (\mathbf{p}_t, \mathbf{s}_t, \mathbf{f}_t)$, where $\mathbf{p}_t$ denotes a fixed-size point cloud, $\mathbf{s}_t \in \mathbb{R}^{d_s}$ denotes robot proprioception, and $\mathbf{f}_t \in \mathbb{R}^{d_f}$ denotes tactile feedback.
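The closed-loop formulation can be sketched as a receding-horizon loop: predict a chunk of future actions from the current observation, execute only a prefix, then re-observe and re-plan. The chunk length, execution horizon, dummy policy, and dummy environment below are all hypothetical placeholders.

```python
import numpy as np

H, K, ACT_DIM, STEPS = 16, 8, 7, 4   # chunk length, executed prefix, dims

def policy(obs):
    # Hypothetical stand-in for a chunk predictor: H future actions.
    return np.tile(obs.mean(), (H, ACT_DIM))

obs = np.zeros(10)
executed = []
for _ in range(STEPS):
    chunk = policy(obs)              # predict H actions from the observation
    executed.extend(chunk[:K])       # execute only the first K of them
    obs = obs + 0.1                  # environment advances; re-observe

print(len(executed))
```

Executing only the first `K < H` actions before re-planning keeps the loop reactive to new observations, while the overlapping tail of each chunk is discarded; this is the chunk trade-off that generative chunked policies must balance.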