Jun 17, 2026 · arXiv:2606.19194

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

Yu Zhang, Kangyi Ji, Yongxiang Zou, Rongtao Xu, Feng Zheng, Long Cheng

AI Summary

This paper introduces an invertible neural network adapter that generates high-dimensional robot manipulation actions from multimodal inputs (visual, linguistic, and proprioceptive) in a single denoising step. Built on a flow-matching formulation, the adapter constrains the action generation trajectory to an invertible latent space, which reduces inference complexity relative to conventional iterative flow-matching policies while maintaining action prediction accuracy and stability. Across simulation benchmarks, the adapter shows superior or near state-of-the-art performance. In real-world experiments with vision-language-action (VLA) models, it reduces average inference latency from 110 ms to 61 ms, a reduction of about 45%.

Key Contribution

The adapter synthesizes high-dimensional robot actions in a single inference step instead of the multiple steps used by conventional iterative flow-matching policies, reducing average VLA inference latency from 110 ms to 61 ms in real-world experiments.

Abstract

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.
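The contrast between iterative and one-step sampling can be illustrated with a toy sketch. It is not the paper's adapter, whose architecture the abstract does not specify. It assumes a fixed-step Euler sampler, a hypothetical RealNVP-style affine coupling layer standing in for "an invertible latent space", and hand-written velocity fields in place of a learned policy. When the velocity is constant along a trajectory, one Euler step already lands on the target; when the trajectory curves, one step is inaccurate and many steps are needed.

```python
import numpy as np


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x


class AffineCoupling:
    """Toy invertible layer (assumption: a stand-in, not the paper's adapter)."""

    def __init__(self, dim, rng):
        self.half = dim // 2
        rest = dim - self.half
        self.w_scale = 0.1 * rng.standard_normal((self.half, rest))
        self.w_shift = 0.1 * rng.standard_normal((self.half, rest))

    def forward(self, x):
        a, b = x[..., :self.half], x[..., self.half:]
        log_s = np.tanh(a @ self.w_scale)  # bounded log-scale keeps the map stable
        return np.concatenate([a, b * np.exp(log_s) + a @ self.w_shift], axis=-1)

    def inverse(self, y):
        a, b = y[..., :self.half], y[..., self.half:]
        log_s = np.tanh(a @ self.w_scale)
        return np.concatenate([a, (b - a @ self.w_shift) * np.exp(-log_s)], axis=-1)


rng = np.random.default_rng(0)
layer = AffineCoupling(dim=4, rng=rng)

action = rng.standard_normal(4)    # hypothetical target action
noise = rng.standard_normal(4)     # sample from the base distribution
z_target = layer.forward(action)   # target expressed in the latent space


# Straight-line field in latent space: the velocity is constant,
# so a single Euler step reaches the same point as many small steps.
def straight(z, t):
    return z_target - noise


z_one = euler_sample(straight, noise, num_steps=1)
z_ten = euler_sample(straight, noise, num_steps=10)
action_one = layer.inverse(z_one)  # map the latent sample back to action space

# Curved field (rotation by 90 degrees): one Euler step is inaccurate.
theta = np.pi / 2
rot = np.array([[0.0, -theta], [theta, 0.0]])


def curved(x, t):
    return rot @ x


start = np.array([1.0, 0.0])
exact = np.array([0.0, 1.0])       # (1, 0) rotated by 90 degrees
err_one = np.linalg.norm(euler_sample(curved, start, 1) - exact)
err_hundred = np.linalg.norm(euler_sample(curved, start, 100) - exact)
```

In this toy setting `action_one` recovers `action` after a single step, while `err_one` is much larger than `err_hundred`. The sketch shows only why straight trajectories allow one-step sampling; how the proposed adapter constrains the trajectory is described in the paper itself.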

Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
