The paper introduces the Structural Action Transformer (SAT), a policy for 3D dexterous manipulation that targets cross-embodiment skill transfer in high-DoF robotic hands. SAT reframes each action chunk as an unordered sequence of joint-wise trajectories, so a Transformer can handle heterogeneous embodiments by treating the joint count as a variable sequence length. After pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks, SAT outperforms the baselines in sample efficiency and cross-embodiment skill transfer.
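As a rough illustration of the structural view described above, the sketch below turns a time-major action chunk into one token per joint and pads token sets from different embodiments to a common length with a validity mask. It is not the authors' implementation: the function names `to_joint_tokens` and `pad_batch`, the zero-padding scheme, and the toy embodiments are all assumptions made for the example.

```python
from typing import List, Tuple

Chunk = List[List[float]]   # time-major: chunk[t][j] is joint j's command at step t
Tokens = List[List[float]]  # joint-major: tokens[j] is joint j's length-T trajectory


def to_joint_tokens(chunk: Chunk) -> Tokens:
    """Transpose a (T, J) chunk into J tokens, each a length-T joint trajectory."""
    return [list(trajectory) for trajectory in zip(*chunk)]


def pad_batch(token_sets: List[Tokens]) -> Tuple[List[Tokens], List[List[bool]]]:
    """Pad every token set to the largest joint count; the mask marks real joints."""
    max_joints = max(len(tokens) for tokens in token_sets)
    horizon = len(token_sets[0][0])  # assumes all chunks share the same horizon T
    padded, mask = [], []
    for tokens in token_sets:
        n_pad = max_joints - len(tokens)
        padded.append(tokens + [[0.0] * horizon for _ in range(n_pad)])
        mask.append([True] * len(tokens) + [False] * n_pad)
    return padded, mask


# Hypothetical embodiments: a 2-joint gripper and a 3-joint finger, horizon T = 4.
gripper_chunk = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
finger_chunk = [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5], [1.6, 1.7, 1.8], [1.9, 2.0, 2.1]]

batch, mask = pad_batch([to_joint_tokens(gripper_chunk), to_joint_tokens(finger_chunk)])
```

With this layout the sequence length seen by an attention layer is the joint count rather than the number of time steps, which is what lets one model accept hands with different numbers of joints. The mask would typically be used to keep padded tokens out of attention and out of the loss.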
Robots can now learn dexterous manipulation skills across different hand designs, thanks to a new Transformer architecture that treats actions as a flexible arrangement of joint movements, rather than a fixed sequence.
Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.
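The abstract names a continuous-time flow matching objective but does not spell out its form. The sketch below shows one common variant, with a straight-line path from Gaussian noise to the target action, applied to joint-wise tokens. The linear path, the helper names `interpolate` and `flow_matching_loss`, and the `zero_model` stand-in for the policy network are assumptions for illustration; conditioning on the 3D point cloud and on the Embodied Joint Codebook is omitted.

```python
import random
from typing import Callable, List

Tokens = List[List[float]]  # J joint tokens, each a length-T trajectory


def interpolate(noise: Tokens, target: Tokens, t: float) -> Tokens:
    """Point on the straight path from noise (t = 0) to the target action (t = 1)."""
    return [[(1.0 - t) * n + t * a for n, a in zip(n_row, a_row)]
            for n_row, a_row in zip(noise, target)]


def flow_matching_loss(predict_velocity: Callable[[Tokens, float], Tokens],
                       target: Tokens, rng: random.Random) -> float:
    """Single-sample MSE between the predicted and the straight-path velocity."""
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in target]
    t = rng.random()
    x_t = interpolate(noise, target, t)
    predicted = predict_velocity(x_t, t)
    # For the linear path the regression target is constant in t: target - noise.
    errors = [(p - (a - n)) ** 2
              for p_row, a_row, n_row in zip(predicted, target, noise)
              for p, a, n in zip(p_row, a_row, n_row)]
    return sum(errors) / len(errors)


def zero_model(x_t: Tokens, t: float) -> Tokens:
    """Hypothetical stand-in for the policy network: always predicts zero velocity."""
    return [[0.0 for _ in row] for row in x_t]


joint_tokens = [[0.0, 0.2, 0.4, 0.6], [0.1, 0.3, 0.5, 0.7]]
loss = flow_matching_loss(zero_model, joint_tokens, random.Random(0))
```

In a real training loop the stand-in would be replaced by a network that also receives the observation, and the loss would be averaged over a batch and over valid (unpadded) joint tokens only. At inference, actions would be obtained by integrating the learned velocity field from noise at t = 0 to t = 1.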