This paper introduces an invertible neural network adapter that generates high-dimensional actions for robotic manipulation in a single denoising step, conditioned on multimodal inputs (visual, linguistic, and proprioceptive). Because action generation is constrained to an invertible latent space, the method substantially reduces inference complexity compared with conventional iterative flow-matching policies while maintaining action prediction accuracy and stability. Experiments across simulation benchmarks and real-world robotic platforms show superior or near state-of-the-art task performance, and in real-world tests the adapter cuts the average inference latency of vision-language-action (VLA) models from 110 ms to 61 ms, a reduction of about 45%.
Achieving high-dimensional action synthesis in robotics with a single inference step could redefine efficiency benchmarks in real-time manipulation tasks.
This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.
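The contrast between iterative and one-step flow-matching inference can be illustrated with a minimal sketch. It is not the paper's adapter: the invertible network and the multimodal conditioning are not modeled, and `toy_velocity`, `sample_iterative`, `sample_one_step`, and the action dimension of 14 are hypothetical names and values chosen for illustration. The toy velocity field follows a straight path from noise to a fixed target action, which is the idealized case in which a single Euler step lands exactly on the target.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity field (not the paper's model).

    On the straight path x_t = (1 - t) * x_0 + t * target, the velocity that
    carries x_t to the target by t = 1 is (target - x_t) / (1 - t).
    """
    return (target - x) / (1.0 - t)

def sample_iterative(x0, target, num_steps):
    """Euler integration from t = 0 to t = 1: one velocity evaluation per step."""
    x = x0.copy()
    dt = 1.0 / num_steps
    calls = 0
    for k in range(num_steps):
        t = k / num_steps  # largest t is (num_steps - 1) / num_steps, so 1 - t > 0
        x = x + dt * toy_velocity(x, t, target)
        calls += 1
    return x, calls

def sample_one_step(x0, target):
    """A single Euler step with dt = 1: one velocity evaluation in total."""
    return x0 + toy_velocity(x0, 0.0, target), 1

rng = np.random.default_rng(0)
action_dim = 14  # assumed action dimension, for illustration only
x0 = rng.standard_normal(action_dim)      # initial noise sample
target = rng.standard_normal(action_dim)  # stand-in for the desired action

a_iter, calls_iter = sample_iterative(x0, target, num_steps=10)
a_one, calls_one = sample_one_step(x0, target)

print("velocity evaluations:", calls_iter, "vs", calls_one)
print("max abs difference:", float(np.max(np.abs(a_iter - a_one))))
```

In this idealized setting both samplers reach the same action, but the iterative one evaluates the velocity field ten times. A learned velocity field generally traces curved paths, so a naive single step would be inaccurate; the abstract attributes the adapter's one-step quality to constraining the generation trajectory within an invertible latent space.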