a codebook $\mathbf{C}=\{\mathbf{e}_{k}\in\mathbb{R}^{D}\}_{k=1}^{K}$ with codebook size $K$. We choose $K$ to be small so that the codebook captures coarse primary action prototypes and the primary policy is easy to learn. Given an action chunk $\mathbf{a}$, the encoder produces $\mathbf{z}_{e}=E_{\phi}(\mathbf{a})$, which we quantize to the nearest codebook vector:

$$k^{*}=\arg\min_{k}\|\mathbf{z}_{e}-\mathbf{e}_{k}\|_{2},\qquad \tilde{\mathbf{z}}=\mathbf{e}_{k^{*}},\qquad m:=k^{*}. \tag{2}$$

We define $m$ as the primary mode, and the reconstruction is $\hat{\mathbf{a}}^{(m)}=D_{\psi}(\tilde{\mathbf{z}})$. We train the VQ-VAE with the standard reconstruction, codebook, and commitment terms:

$$\mathcal{L}_{\text{VQ}}(\mathbf{a})=\|\mathbf{a}-D_{\psi}(\tilde{\mathbf{z}})\|_{2}^{2}+\|\mathrm{sg}[E_{\phi}(\mathbf{a})]-\tilde{\mathbf{z}}\|_{2}^{2}+\beta\|E_{\phi}(\mathbf{a})-\mathrm{sg}[\tilde{\mathbf{z}}]\|_{2}^{2}, \tag{3}$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ is the commitment weight. The primary policy is a classifier $\pi_{1}(m\mid\mathbf{o})$ trained to predict the VQ code $m$ from the observation $\mathbf{o}$. At test time we select the discrete mode $m$ for the current chunk by taking the highest predicted probability under $\pi_{1}$. Both the encoder $E_{\phi}$ and the decoder $D_{\psi}$ are implemented as compact MLPs.

Primary Mode Policy. The primary policy $\pi_{1}(m\mid\mathbf{o})$ maps the shared observation embedding to a categorical distribution over the $K$ VQ codes. We implement $\pi_{1}$ as a lightweight MLP classifier. During training, $\pi_{1}$ is optimized with a standard cross-entropy objective against the encoder-assigned VQ indices; at test time we use greedy mode selection for reliability. Separating primary-mode selection into an explicit classifier drastically reduces coarse mode bouncing.
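To make the discretization concrete, the following is a minimal PyTorch sketch of the quantization step in Eq. (2) and the VQ-VAE objective in Eq. (3). The module names, layer sizes, codebook size, and `beta` value are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVQVAE(nn.Module):
    """Illustrative VQ-VAE over flattened action chunks (Eqs. 2-3).
    Hidden sizes, codebook size K, and beta are assumed values."""
    def __init__(self, chunk_dim, latent_dim=64, K=16, beta=0.25):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(chunk_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, chunk_dim))
        self.codebook = nn.Embedding(K, latent_dim)   # C = {e_k}
        self.beta = beta

    def quantize(self, z_e):
        # Eq. (2): nearest codebook entry under the L2 distance.
        dists = torch.cdist(z_e, self.codebook.weight)         # (B, K)
        m = dists.argmin(dim=-1)                                # primary mode index k*
        z_q = self.codebook(m)                                  # z~ = e_{k*}
        return z_q, m

    def forward(self, a):
        # a: (B, chunk_dim) flattened ground-truth action chunk.
        z_e = self.encoder(a)
        z_q, m = self.quantize(z_e)
        # Straight-through estimator so reconstruction gradients reach the encoder.
        z_st = z_e + (z_q - z_e).detach()
        a_hat = self.decoder(z_st)
        # Eq. (3): reconstruction + codebook + commitment terms.
        loss = (F.mse_loss(a_hat, a)
                + F.mse_loss(z_q, z_e.detach())                 # ||sg[E(a)] - z~||^2
                + self.beta * F.mse_loss(z_e, z_q.detach()))    # beta*||E(a) - sg[z~]||^2
        return loss, m, a_hat
```

Under this sketch, the primary policy $\pi_{1}$ would simply be trained with `F.cross_entropy(logits, m)` against the indices `m` assigned by the encoder.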
3.4 Mode-Conditioned MeanFlow Policy

After selecting a primary mode $m$, we recover a high-quality continuous action chunk that respects the selected mode. To balance generation quality and real-time responsiveness, we use one-step generative modeling inspired by MeanFlow (Geng et al., 2025): instead of multi-step denoising iterations, a learned average velocity field predicts the displacement from noise to the desired action in a single function evaluation. Let $m$ be the selected discrete mode and $\hat{\mathbf{a}}^{(m)}:=D_{\psi}(\mathbf{e}_{m})$ be the VQ-decoder reconstruction of that mode. The one-step generator produces a residual $\Delta\mathbf{a}$ conditioned on the observation $\mathbf{o}$ and the mode $m$, such that the final action chunk is $\hat{\mathbf{a}}=\hat{\mathbf{a}}^{(m)}+\Delta\mathbf{a}$.

Mode and Observation Conditioned Average Velocity Field. Following MeanFlow (Geng et al., 2025), we implement the residual generator as an average velocity field $\bar{\mathbf{v}}_{\theta}(\mathbf{z}_{r},\tau,r;\mathbf{o},m)$, where $\mathbf{z}_{r}$ denotes a state on the interpolation path between a noise sample and the target residual, $\tau\in[0,1]$ is the interpolation start time, and $r\in(0,1]$ is the end time. The field is trained to match the ground-truth average velocity over arbitrary intervals $[\tau,r]$:

$$\bar{\mathbf{v}}^{*}(\mathbf{z}_{r},\tau,r)=\mathrm{sg}\!\left(\frac{d\mathbf{z}_{r}}{dr}-(r-\tau)\left(\frac{d\mathbf{z}_{r}}{dr}\frac{\partial\bar{\mathbf{v}}_{\theta}}{\partial\mathbf{z}}+\frac{\partial\bar{\mathbf{v}}_{\theta}}{\partial r}\right)\right). \tag{4}$$

Here $\frac{d\mathbf{z}_{r}}{dr}$ is the instantaneous velocity of $\mathbf{z}_{r}$ at time $r$; $\frac{\partial\bar{\mathbf{v}}_{\theta}}{\partial\mathbf{z}}$ describes how the average velocity responds to perturbations of the residual draft, and $\frac{\partial\bar{\mathbf{v}}_{\theta}}{\partial r}$ captures how it evolves as the interpolation approaches the target residual. We train $\bar{\mathbf{v}}_{\theta}$ with a squared-error objective that supervises the predicted average velocity against this target. More detailed derivations are provided in the Appendix.

Implementation Details. We use a DiT-style transformer backbone (Peebles and Xie, 2023), with each action chunk represented as a sequence of tokens. The time scalars $\tau$ and $r$ are expanded via sinusoidal embeddings (Vaswani et al., 2017) and added to the observation embedding together with a learnable embedding of the discrete mode $m$. During training, $(\tau,r)$ is sampled from a uniform distribution and $\mathbf{z}_{0}$ is drawn from a standard normal distribution.
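As a sketch of how the training target in Eq. (4) can be computed, the snippet below uses a Jacobian-vector product to evaluate the compound derivative term in a single forward pass. The network interface (`velocity_net`), the straight-line interpolation from noise at $r=0$ to the residual at $r=1$, and the interval sampling are assumptions for illustration, not the paper's exact implementation.

```python
import torch
from torch.func import jvp

def meanflow_loss(velocity_net, delta_a, obs_emb, mode_emb):
    """One training step for the mode-conditioned average velocity field (Eq. 4).
    delta_a: (B, D) ground-truth residual a - a_hat^(m). Shapes and sampling
    choices here are illustrative assumptions."""
    B, D = delta_a.shape
    z0 = torch.randn_like(delta_a)                      # noise endpoint
    # Sample an interval 0 <= tau <= r <= 1 per example.
    t = torch.rand(B, 2, device=delta_a.device)
    tau, r = t.min(dim=1).values, t.max(dim=1).values
    r_col, tau_col = r.unsqueeze(-1), tau.unsqueeze(-1)
    # Straight-line path from noise (r=0) to residual (r=1), and dz_r/dr.
    z_r = (1.0 - r_col) * z0 + r_col * delta_a
    v = delta_a - z0                                    # instantaneous velocity dz_r/dr

    def u(z, tau_in, r_in):
        return velocity_net(z, tau_in, r_in, obs_emb, mode_emb)

    # jvp along the tangent (dz_r/dr, 0, 1) gives v*du/dz + du/dr in one call.
    u_pred, total_deriv = jvp(u, (z_r, tau, r),
                              (v, torch.zeros_like(tau), torch.ones_like(r)))
    # Eq. (4): stop-gradient target for the average velocity.
    u_tgt = (v - (r_col - tau_col) * total_deriv).detach()
    return ((u_pred - u_tgt) ** 2).mean()
```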
3.5 Theoretical Analysis

With the two-stage architecture defined, we now give a concise analysis of why this coarse-to-fine decomposition lowers, and under mild conditions strictly lowers, the minimum achievable MSE compared to single-stage generative predictors. Single-stage generative methods produce actions by sampling a latent code $\mathbf{z}\sim\mathcal{N}(0,I)$ and decoding $\hat{\mathbf{a}}_{g}=\pi(\mathbf{o},\mathbf{z})$. Under the squared-error criterion, the best point estimate is the conditional expectation $\hat{\mathbf{a}}_{g}^{*}(\mathbf{o})=\mathbb{E}_{\mathbf{z}}[\pi(\mathbf{o},\mathbf{z})]$, and the expected MSE decomposes into an irreducible data-variance term and a model-bias term:

$$\mathbb{E}_{\mathbf{o},\mathbf{a}}\big[\|\mathbf{a}-\hat{\mathbf{a}}_{g}^{*}(\mathbf{o})\|^{2}\big]=\mathbb{E}_{\mathbf{o}}\big[\mathrm{Var}(\mathbf{a}\mid\mathbf{o})\big]+\mathbb{E}_{\mathbf{o}}\big[\|\mathbb{E}[\mathbf{a}\mid\mathbf{o}]-\hat{\mathbf{a}}_{g}^{*}(\mathbf{o})\|^{2}\big]. \tag{5}$$

When the model is unbiased the second term vanishes, so the minimum achievable error equals $\mathbb{E}_{\mathbf{o}}[\mathrm{Var}(\mathbf{a}\mid\mathbf{o})]$. In our two-stage scheme the primary stage selects a discrete mode $\hat{m}(\mathbf{o})$ and the second stage outputs $\hat{\mathbf{a}}(\mathbf{o},m,\mathbf{z})=\pi_{2}(\mathbf{o},m,\mathbf{z})$. For any fixed $(\mathbf{o},m)$, the optimal MSE predictor collapses the stochasticity in $\mathbf{z}$ to the conditional expectation $\hat{\mathbf{a}}^{*}(\mathbf{o},m)=\mathbb{E}_{\mathbf{z}}[\pi_{2}(\mathbf{o},m,\mathbf{z})]$, yielding the irreducible residual $\mathbb{E}_{\mathbf{o},m}[\mathrm{Var}(\mathbf{a}\mid\mathbf{o},m)]$ when the model is unbiased. By the law of total variance,

$$\mathbb{E}_{\mathbf{o},m}\big[\mathrm{Var}(\mathbf{a}\mid\mathbf{o},m)\big]=\mathbb{E}_{\mathbf{o}}\big[\mathrm{Var}(\mathbf{a}\mid\mathbf{o})\big]-\mathbb{E}_{\mathbf{o}}\big[\mathrm{Var}_{m\mid\mathbf{o}}\big(\mathbb{E}[\mathbf{a}\mid\mathbf{o},m]\big)\big], \tag{6}$$

which is no greater than $\mathbb{E}_{\mathbf{o}}[\mathrm{Var}(\mathbf{a}\mid\mathbf{o})]$ and is strictly smaller whenever $\mathrm{Var}_{m\mid\mathbf{o}}(\mathbb{E}[\mathbf{a}\mid\mathbf{o},m])>0$. Intuitively, discretizing into primary modes removes the inter-mode variance from the residual error, lowering the MSE bound relative to single-stage latent samplers.
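A quick way to see Eq. (6) numerically is the following NumPy toy check on a synthetic bimodal action distribution for a single observation; the mixture parameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
# Two primary modes: m=0 centered at -1.0, m=1 centered at +1.0, each with std 0.1.
m = rng.integers(0, 2, size=N)
a = np.where(m == 0, -1.0, 1.0) + 0.1 * rng.standard_normal(N)

var_total = a.var()                                       # Var(a | o)
var_within = np.mean([a[m == k].var() for k in (0, 1)])   # E_m[Var(a | o, m)]
var_between = np.var([a[m == k].mean() for k in (0, 1)])  # Var_m(E[a | o, m])

print(var_total, var_within + var_between)   # ~1.01 vs ~1.01 (law of total variance)
print(var_within)                            # ~0.01: residual error after mode selection
```

Conditioning on the mode removes the large inter-mode variance (about 1.0 here), leaving only the small within-mode variance as the irreducible residual.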
4 Experiments

4.1 Simulation Evaluation

| Method | Hammer | Door | Pen | Laptop | Faucet | Toilet | Bucket | Medium (6) | Hard (5) | Success |
|---|---|---|---|---|---|---|---|---|---|---|
| IBC | 0.00±0.00 | 0.00±0.00 | 0.10±0.01 | 0.01±0.01 | 0.07±0.02 | 0.15±0.01 | 0.00±0.00 | 0.11±0.02 | 0.09±0.03 | 0.08 |
| BC-H | 0.10±0.09 | 0.07±0.05 | 0.16±0.03 | 0.09±0.02 | 0.13±0.04 | 0.21±0.02 | 0.10±0.01 | 0.15±0.03 | 0.18±0.05 | 0.15 |
| DP | 0.48±0.17 | 0.50±0.05 | 0.25±0.04 | 0.69±0.04 | 0.23±0.08 | 0.58±0.02 | 0.46±0.01 | 0.20±0.05 | 0.19±0.03 | 0.30 |
| DP3 | **1.00±0.00** | 0.62±0.04 | 0.43±0.06 | 0.83±0.01 | 0.63±0.02 | **0.82±0.04** | 0.46±0.02 | 0.45±0.05 | 0.35±0.02 | 0.51 |
| FlowPolicy | **1.00±0.00** | 0.58±0.05 | 0.53±0.12 | 0.85±0.02 | 0.42±0.10 | 0.80±0.05 | 0.39±0.06 | 0.47±0.07 | 0.37±0.07 | 0.51 |
| PF-DAG (Ours) | **1.00±0.00** | **0.65±0.03** | **0.65±0.01** | **0.90±0.02** | **0.72±0.05** | **0.82±0.02** | **0.47±0.02** | **0.68±0.04** | **0.72±0.03** | **0.72** |

Table 1: Quantitative comparison of PF-DAG against state-of-the-art baselines on 18 tasks from three simulation benchmarks. Task columns group into Adroit (Hammer, Door, Pen), DexArt (Laptop, Faucet, Toilet, Bucket), and MetaWorld (Medium: 6 tasks, Hard: 5 tasks); the last column is the average success rate over all tasks. Best results in bold.

Benchmarks and Datasets. We evaluate our method on manipulation benchmarks that cover a broad range of control domains: Adroit (Rajeswaran et al., 2017), DexArt (Bao et al., 2023), and MetaWorld (Yu et al., 2020). These benchmarks are built on physics engines such as MuJoCo (Todorov et al., 2012) and IsaacGym (Makoviychuk et al., 2021). For fair comparison we adopt the same task splits and data-collection pipelines as prior work (Ze et al., 2024): Adroit tasks with the high-dimensional Shadow hand and MetaWorld tasks with a low-dimensional gripper are trained with 10 expert demonstrations per task, while DexArt tasks with the Allegro hand use 90 expert demonstrations. Demonstrations are collected with scripted policies for MetaWorld and with RL-trained expert agents (Wang et al., 2022; Schulman et al., 2017) for Adroit and DexArt. Each experiment is run with three random seeds; for each seed we evaluate the policy for 20 episodes every 200 training epochs and report the average of the top-5 success rates (Ze et al., 2024). The final metric is the mean and standard deviation across the three seeds.

Experiment Setup. All networks are optimized with AdamW (Loshchilov and Hutter, 2017), using a short linear warmup followed by cosine decay of the learning rate. Training proceeds in stages: we first pretrain the VQ-VAE to learn compact primary prototypes; we then freeze the codebook and jointly train the Primary Mode Policy $\pi_{1}$ (cross-entropy to the VQ indices) and the mode-conditioned MeanFlow generator $\bar{\mathbf{v}}_{\theta}$ (squared-error supervision on sampled $(\tau,r)$ intervals). At inference we set $(\tau,r)=(0,1)$ for one-step continuous action-chunk generation.
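For concreteness, a minimal sketch of the one-step inference path described above might look as follows. The module names (`policy_pi1`, `vq_decoder`, `velocity_net`) and the reuse of the code vector as the mode embedding are assumptions based on Sections 3.3-3.4 rather than released code.

```python
import torch

@torch.no_grad()
def predict_action_chunk(obs_emb, policy_pi1, codebook, vq_decoder, velocity_net):
    """One-step, mode-conditioned action chunk prediction at (tau, r) = (0, 1)."""
    # Stage 1: greedy primary mode selection (Section 3.3).
    logits = policy_pi1(obs_emb)                       # (B, K)
    m = logits.argmax(dim=-1)                          # selected primary mode
    a_proto = vq_decoder(codebook(m))                  # a_hat^(m) = D_psi(e_m)

    # Stage 2: one-step residual via the average velocity field (Section 3.4).
    z0 = torch.randn_like(a_proto)                     # noise sample
    tau = torch.zeros(obs_emb.shape[0], device=obs_emb.device)
    r = torch.ones_like(tau)
    mode_emb = codebook(m)                             # mode embedding (assumed: reuse code vector)
    u = velocity_net(z0, tau, r, obs_emb, mode_emb)    # average velocity over [0, 1]
    delta_a = z0 + u                                   # displacement from noise to residual
    return a_proto + delta_a                           # final action chunk a_hat
```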
Baselines. We compare against the following representative baselines. Implicit Behavioral Cloning (IBC) (Florence et al., 2022) serves as a representative implicit BC method. BC-H (Foster et al., 2024) represents non-generative approaches to mitigating mode instability. Diffusion Policy (DP) (Chi et al., 2023) pioneered the formulation of image-conditioned diffusion-based policies.

Figure 1: Example illustrating multi-modal expert demonstrations and the trajectories predicted by different imitation policies. Behavioral cloning collapses its predictions into a single mean; a discrete policy succeeds but introduces temporal discontinuities; a generative policy bounces between modes 1 and 2; our method predicts a consistent, fine-grained trajectory.

Motivated by the above, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage imitation framework that explicitly separates primary mode selection from continuous action generation. Concretely, PF-DAG first learns a discrete vocabulary of primary modes and a lightweight policy that greedily selects a coherent mode. We then introduce a mode-conditioned MeanFlow policy, a one-step continuous decoder that generates high-fidelity actions conditioned on the selected mode and the current observation. This explicit two-stage decomposition preserves intra-mode variation while reducing mode bouncing by enforcing stable primary choices. We validate PF-DAG with theoretical and empirical evidence. Among existing methods, single-stage generative policies (Chi et al., 2023; Zhao et al., 2023) are the most direct and competitive end-to-end approach for modeling continuous, multi-modal action distributions, so we focus our theoretical comparison on this family. Under realistic mode-variance assumptions, we show that the two-stage design attains an optimal MSE lower bound no higher than that of single-stage generative baselines, with a strict improvement whenever the inter-mode variance term is positive. Empirically, we test PF-DAG across 56 simulation manipulation tasks (including high-DOF dexterous hands and low-DOF grippers) as well as on real-world tactile dexterous manipulation. Results show consistent improvements in accuracy, stability, and sample efficiency over diffusion- and flow-based baselines, and ablations quantify the contribution of each key component. Together, these results suggest that explicitly decoupling coarse discrete decisions from fine-grained continuous generation yields practical and statistical advantages for closed-loop robotic imitation.

2 Related Work

2.1 Behavior Cloning

Behavior cloning (BC) casts policy learning as supervised regression on demonstration data (Wang et al., 2017; Torabi et al., 2018; Mandlekar et al., 2021; Hu et al., 2024). A BC policy is trained to predict the expert's action for each observed state, yielding a deterministic mapping from states to actions. This approach is highly sample-efficient in practice (e.g., for pick-and-place tasks), but it suffers from well-known limitations: BC policies tend to underfit multi-modal behavior (Mandlekar et al., 2021; Shafiullah et al., 2022; Florence et al., 2022; Chi et al., 2023) and incur compounding errors at test time (Ross et al., 2011; Ke et al., 2021; Tu et al., 2022; Zhao et al., 2023). To mitigate these issues, recent work has explored more expressive BC models: implicit BC and energy-based models learn an action-energy landscape per state and solve for actions by optimization (Florence et al., 2022), while mixture-density networks and latent-variable BC represent multi-modal distributions explicitly (Jang et al., 2022).

2.2 Discrete Policy

Discretizing continuous robot actions can be viewed as tokenization: converting a high-frequency, high-dimensional control signal into a sequence of discrete symbols so that standard sequence-modeling methods can be applied. Framing actions as tokens has two immediate benefits for manipulation imitation. First, next-token prediction over a discrete vocabulary represents multi-modal conditional action distributions without collapsing modes into a single mean. Second, sequence models bring powerful context modeling and scalable pretraining recipes from language and vision to control, enabling cross-task and cross-embodiment generalization when token vocabularies are shared or aligned. Recent Vision-Language-Action (VLA) efforts articulate this reframing and its practical advantages for large, generalist robot policies (Zitkovich et al., 2023; O'Neill et al., 2024; Kim et al., 2024; Zawalski et al., 2024; Wen et al., 2025; Black et al., 2024; Zheng et al., 2024; Zhen et al., 2024; Cheang et al., 2024; Duan et al., 2024; Zhao et al., 2025). Existing action tokenizers fall into a few broad families. The simplest and most common approach maps each continuous action dimension at each step to one of a fixed set of bins (Brohan et al., 2022; Zitkovich et al., 2023; Kim et al., 2024). Frequency-space methods such as FAST (Pertsch et al., 2025) instead compress action chunks with a time-series transform and lightweight quantization. Others use Vector Quantization (VQ) as latent tokenizers, learning a shared codebook of action atoms and quantizing continuous latent representations to the nearest codebook entries (Lee et al., 2024; Wang et al., 2025). While effective at capturing multi-modal action distributions, these approaches inherently trade off reconstruction fidelity for discrete simplicity. Our work differs by leveraging tokenization solely for high-level primary mode selection.
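As an illustration of the simplest tokenizer family discussed above, the following is a minimal sketch of per-dimension uniform binning of normalized actions into discrete tokens and back; the bin count and normalization range are illustrative assumptions, not the settings of any particular VLA system.

```python
import numpy as np

def tokenize_actions(actions, num_bins=256, low=-1.0, high=1.0):
    """Map each action dimension at each step to a bin index in [0, num_bins - 1]."""
    a = np.clip(actions, low, high)
    tokens = np.floor((a - low) / (high - low) * num_bins).astype(np.int64)
    return np.clip(tokens, 0, num_bins - 1)

def detokenize_actions(tokens, num_bins=256, low=-1.0, high=1.0):
    """Invert binning by mapping each token to its bin center."""
    return low + (tokens + 0.5) / num_bins * (high - low)

# Example: an 8-step chunk of a 7-DoF action, round-tripped through the tokenizer.
chunk = np.random.uniform(-1, 1, size=(8, 7))
recon = detokenize_actions(tokenize_actions(chunk))
print(np.abs(chunk - recon).max())   # bounded by half the bin width (~0.0039)
```

The round-trip error illustrates the fidelity-versus-simplicity trade-off noted above: finer bins reduce quantization error but enlarge the token vocabulary.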
2.3 Generative Policy

A large class of imitation methods treats action generation as a stochastic generative problem by introducing latent variables: a policy is written as $a=\pi(o,z)$ with $z$ sampled from a learned prior. This formulation naturally represents multi-modal conditional action distributions because sampling different $z$ values yields different valid actions for the same observation. Action Chunking with Transformers (ACT) (Zhao et al., 2023) is a sequence generator built on a Conditional Variational Autoencoder (CVAE). Diffusion Policy (DP) (Chi et al., 2023) treats action generation as conditional denoising: starting from noise, the action is iteratively refined by a learned score or denoiser conditioned on the observation. More recent flow-based policies (Black et al., 2024; Hu et al., 2024; Zhang et al., 2025) provide tractable density estimation and efficient sampling while representing complex, multi-modal action distributions. Although generative policies can represent multi-modal distributions, they often suffer from mode bouncing (Chen et al., 2025), high inference cost (Li et al., 2024), and chunking trade-offs (Zhao et al., 2023). Other hierarchical approaches, such as Hierarchical Diffusion Policy (HDP) (Ma et al., 2024), also use a high-level policy to guide a low-level generator; however, HDP relies on explicit, task-specific heuristics such as contact-point waypoints to define its hierarchy. In contrast, PF-DAG learns its primary modes end-to-end directly from the action-chunk clusters themselves, offering a more general abstraction that is not tied to predefined heuristics. We therefore combine the strengths of action tokenization with an expressive generative decoder that handles the residual continuous variations: PF-DAG decouples primary discrete mode selection from fine-grained action generation, reducing mode bouncing while preserving continuous variations.

2.4 Hierarchical and Residual Policies

Our work is also situated within the broader context of hierarchical and residual policies for robot learning (Rana et al., 2023; Cui et al., 2025; Kujanpää et al., 2023; Liang et al., 2024). These approaches commonly decompose the control problem into a high-level policy that selects a skill, sub-goal, or context, and a low-level policy that executes control conditioned on that selection (Mete et al., 2024; Feng et al., 2024). For instance, some methods learn residual policies that adapt a base controller (Rana et al., 2023), while others focus on discovering discrete skills from demonstration data or language guidance (Chen et al., 2023; Wan et al., 2024; Tanneberg et al., 2021). While PF-DAG shares this general hierarchical structure, its motivation and technical design are distinct: many hierarchical methods target long-horizon planning or unsupervised skill discovery, whereas PF-DAG specifically addresses the mode bouncing inherent in single-stage generative policies when modeling multi-modal action distributions at a fine temporal scale.
3 PF-DAG Formulation and Design

Figure 2: Overview of the PF-DAG framework. Input observation features are extracted by the Observation Feature Extraction module and fed to the Primary Mode Policy $\pi_{1}$. Ground-truth action chunks are compressed into discrete primary modes with a VQ-VAE, which supervises $\pi_{1}$ and is used only during training. The Mode-Conditioned MeanFlow Policy $\pi_{2}$ takes the selected primary mode $m$ and the observation features as input and generates high-fidelity continuous actions.

This section first defines the task as a closed-loop action-sequence prediction problem and then presents the three main components of our approach: i) Observation Feature Extraction; ii) a compact discrete representation learned with a Vector-Quantized VAE (VQ-VAE) (Van Den Oord et al., 2017), together with a lightweight Primary Mode Policy that predicts those discrete modes; and iii) a mode-conditioned one-step continuous decoder based on MeanFlow (Geng et al., 2025). Finally, we give a theoretical analysis that quantifies why a two-stage, coarse-to-fine decomposition reduces the MSE lower bound compared to single-stage generative models.

3.1 Closed-loop Action Sequence Prediction

Similar to previous work (Chi et al., 2023; Black et al., 2024), we formulate the manipulation task as closed-loop action sequence prediction. Concretely, at time $t$ the observation is $\mathbf{o}_{t}=(\mathbf{p}_{t},\mathbf{s}_{t},\mathbf{f}_{t})$, where $\mathbf{p}_{t}$ denotes a fixed-size point cloud, $\mathbf{s}_{t}\in\mathbb{R}^{d_{s}}$ denotes robot proprioception, $\mathbf{f}_{t}\in\mathbb{R}$