NVIDIA · CMU RI · KIT · Jun 10, 2026 · arXiv:2606.12105

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov

AI Summary

This paper introduces DAM-VLA, a decoupled asynchronous multimodal vision-language-action (VLA) model that addresses a limitation of existing VLA models: they process every input at one shared synchronous rate. DAM-VLA instead lets each modality update at its own sensor rate, which the authors hypothesize yields stronger representations and more robust control when some signals change at hundreds of hertz. Across seven contact-rich real-world manipulation tasks, the model reaches an average success rate of 95.2%, more than double the 40.95% of the strongest synchronous baseline.

Key Contribution

Decoupling temporal processing per modality in a VLA model raises the average success rate across seven contact-rich real-world manipulation tasks to 95.2%, compared with 40.95% for the strongest synchronous baseline.

Abstract

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: https://intuitive-robots.github.io/DAM-VLA/
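The mechanism in the abstract (per-modality latent buffers refreshed at sensor rates, read by the action head, with a high-frequency modality fused through gated cross-attention) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `LatentBuffer`, `toy_encode`, `gated_cross_attention`, and `run_episode` are hypothetical. The 10 Hz vision and 500 Hz force rates are illustrative; only the 100 Hz control rate comes from the abstract. The tanh gate with a zero initial value is one common way to leave a pretrained pathway unchanged, and is assumed here rather than confirmed by the source. The attention uses the keys as values and omits learned projections.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


@dataclass
class LatentBuffer:
    """Latest latent for one modality, refreshed on that modality's own clock."""
    name: str
    period_ms: Optional[int]  # None = written once per episode (e.g. language)
    latent: Optional[Vector] = None
    updates: int = 0

    def maybe_refresh(self, t_ms: int, encode: Callable[[int], Vector]) -> None:
        if self.period_ms is None:
            due = self.latent is None          # fill once, then keep
        else:
            due = t_ms % self.period_ms == 0   # refresh at the sensor rate
        if due:
            self.latent = encode(t_ms)
            self.updates += 1


def toy_encode(name: str, t_ms: int) -> Vector:
    """Stand-in encoder returning a deterministic 2-D vector (not a real model)."""
    offset = {"language": 0.0, "vision": 1.0, "force": 2.0}[name]
    return [offset + t_ms / 1000.0, offset - t_ms / 1000.0]


def gated_cross_attention(hidden: Vector, keys: List[Vector], gate: float) -> Vector:
    """Residual cross-attention scaled by tanh(gate); gate=0 returns hidden unchanged."""
    scale = math.sqrt(len(hidden))
    scores = [sum(h * k for h, k in zip(hidden, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(weights)
    attended = [
        sum(w / total * key[i] for w, key in zip(weights, keys))
        for i in range(len(hidden))
    ]
    return [h + math.tanh(gate) * a for h, a in zip(hidden, attended)]


def run_episode(duration_ms: int = 1000, gate: float = 0.0
                ) -> Tuple[Dict[str, LatentBuffer], List[Vector]]:
    buffers = {
        "language": LatentBuffer("language", None),
        "vision": LatentBuffer("vision", 100),  # assumed 10 Hz camera
        "force": LatentBuffer("force", 2),      # assumed 500 Hz sensor
    }
    control_period_ms = 10  # 100 Hz control, the rate reported in the abstract
    actions: List[Vector] = []
    for t in range(duration_ms):
        # Each modality writes to its buffer only when its own sensor ticks.
        for buf in buffers.values():
            buf.maybe_refresh(t, lambda ts, n=buf.name: toy_encode(n, ts))
        # The action head reads whatever is currently in the buffers.
        if t % control_period_ms == 0:
            backbone = [a + b for a, b in
                        zip(buffers["language"].latent, buffers["vision"].latent)]
            actions.append(
                gated_cross_attention(backbone, [buffers["force"].latent], gate))
    return buffers, actions


if __name__ == "__main__":
    buffers, actions = run_episode()
    for buf in buffers.values():
        print(buf.name, buf.updates)
    print("actions", len(actions))
```

Over one simulated second, the language buffer is written once, the vision buffer 10 times, and the force buffer 500 times, while the action head emits 100 actions. No modality is forced onto another's clock. With `gate=0.0` the fused output equals the backbone features exactly, which is the sense in which a zero-initialized gate leaves the pretrained pathway untouched in this sketch.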

Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
