Mar 1, 2026 | arXiv:2603.00926

DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation

Xiongfeng Peng, Jiaqian Yu, Dingzhe Li, Yixiang Jin, Lu Xu, Yamin Mao, Chao Zhang, Weiming Li, Sujin Jang, Dongwook Lee, Daehyun Ji

AI Summary

The paper introduces DAM-VLA, a novel Vision-Language-Action (VLA) framework designed to improve robot manipulation in dynamic environments by integrating VLM reasoning with diffusion-based action models specialized for arm and gripper control. DAM-VLA employs an action routing mechanism, a dynamic action model fusing high-level VLM cognition with low-level visual features, and a dual-scale action weighting mechanism for coordinating arm and gripper actions. Experimental results demonstrate that DAM-VLA achieves higher success rates than existing VLA methods on simulated (SIMPLER, FurnitureBench) and real-world tasks, particularly in long-horizon and contact-rich scenarios.

Key Contribution

DAM-VLA improves success rates on complex manipulation tasks by combining VLM reasoning with diffusion-based action models specialized for arm and gripper control, using task-specific visual and linguistic cues to route between those models.

Abstract

In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulations to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.
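The routing and weighting ideas in (i) and (iii) can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify how routing scores are computed or how the two models' outputs are combined, so the softmax router, the linear blend, and every name below (`routing_weights`, `blend_actions`, the example scores and action vectors) are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def routing_weights(arm_score: float, gripper_score: float) -> Tuple[float, float]:
    """Hypothetical router: turn two scores into weights that sum to 1.

    In a real system the scores would come from visual and linguistic
    features; here they are plain numbers supplied by the caller.
    """
    w_arm, w_gripper = softmax([arm_score, gripper_score])
    return w_arm, w_gripper


def blend_actions(
    arm_action: Sequence[float],
    gripper_action: Sequence[float],
    w_arm: float,
    w_gripper: float,
) -> List[float]:
    """Assumed coordination rule: a per-dimension weighted sum of the
    actions proposed by the arm-movement and gripper-manipulation models."""
    if len(arm_action) != len(gripper_action):
        raise ValueError("both models must predict actions of the same size")
    return [w_arm * a + w_gripper * g for a, g in zip(arm_action, gripper_action)]


if __name__ == "__main__":
    # Made-up scores favouring the arm model, and made-up 2-D actions.
    w_arm, w_gripper = routing_weights(arm_score=2.0, gripper_score=0.0)
    action = blend_actions([1.0, 0.0], [0.0, 1.0], w_arm, w_gripper)
    print(w_arm, w_gripper, action)
```

A soft blend is used here only because it keeps the example short; a hard switch that picks the single highest-scoring model would be an equally plausible reading of "routing".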

Multimodal Models | Robotics & Embodied AI | Tool Use & Agents
