The paper introduces ResVLA, a generative Vision-Language-Action (VLA) policy architecture for embodied intelligence that decomposes control into a deterministic low-frequency anchor representing global intent and a stochastic high-frequency residual for local dynamics. ResVLA uses spectral analysis to achieve this decoupling and employs a residual diffusion bridge to refine local dynamics conditioned on the predicted intent. Experiments demonstrate that ResVLA achieves competitive performance, robustness, and faster convergence compared to standard generative baselines, validated in both simulation and real-world robot experiments.
By spectrally decoupling robot control into intent and dynamics, ResVLA offers a more efficient and robust approach to generative VLA policies.
Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot-embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
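To make the core idea concrete: the abstract describes decoupling an action trajectory into a deterministic low-frequency anchor (global intent) and a stochastic high-frequency residual (local dynamics) via spectral analysis. The abstract does not specify the implementation, so the following is only a minimal sketch of one plausible form of such a split, using an FFT low-pass filter; the function name and the `cutoff_ratio` parameter are hypothetical, not from the paper.

```python
import numpy as np

def spectral_decompose(actions, cutoff_ratio=0.1):
    """Split a (T, D) action trajectory into a low-frequency anchor and a
    high-frequency residual via an FFT low-pass filter.

    Hypothetical illustration: `cutoff_ratio` (fraction of frequency bins
    kept for the anchor) is an assumed parameter, not taken from ResVLA.
    """
    T, _ = actions.shape
    freqs = np.fft.rfft(actions, axis=0)          # frequency-domain view of the trajectory
    cutoff = max(1, int(cutoff_ratio * freqs.shape[0]))
    low = np.zeros_like(freqs)
    low[:cutoff] = freqs[:cutoff]                 # keep only the lowest-frequency bins
    anchor = np.fft.irfft(low, n=T, axis=0)       # smooth, low-frequency "intent" component
    residual = actions - anchor                   # high-frequency "local dynamics" component
    return anchor, residual
```

By construction, `anchor + residual` reconstructs the original trajectory exactly, so a generative model conditioned on the anchor only needs to model the residual, which is the refinement-from-intent idea the abstract describes.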