The paper introduces DAM-VLA, a novel Vision-Language-Action (VLA) framework designed to improve robot manipulation in dynamic environments by integrating VLM reasoning with diffusion-based action models specialized for arm and gripper control. DAM-VLA employs an action routing mechanism, a dynamic action model fusing high-level VLM cognition with low-level visual features, and a dual-scale action weighting mechanism for coordinating arm and gripper actions. Experimental results demonstrate that DAM-VLA achieves higher success rates than existing VLA methods on simulated (SIMPLER, FurnitureBench) and real-world tasks, particularly in long-horizon and contact-rich scenarios.
DAM-VLA raises success rates on complex manipulation tasks by pairing VLM reasoning with an action router that selects between diffusion-based action models specialized for arm and gripper control.
In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulations to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.
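The abstract gives no implementation details, so the following is only a minimal sketch of how routing and weighting between two action models could fit together. It assumes a router that emits one scalar score per model and a softmax that turns those scores into blending weights. The names `Action`, `route`, and `blend`, the 3-element arm delta, and the scalar gripper command are all hypothetical, and the diffusion-based models are replaced by fixed example predictions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Action:
    """Hypothetical action format; not taken from the paper."""
    arm_delta: List[float]  # assumed end-effector displacement
    gripper: float          # assumed gripper command in [0, 1]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(arm_score: float, gripper_score: float) -> List[float]:
    """Turn per-model router scores into weights [w_arm, w_gripper].

    In a real system the scores would come from visual and linguistic
    cues; here they are plain inputs (an assumption of this sketch).
    """
    return softmax([arm_score, gripper_score])


def blend(arm_pred: Action, gripper_pred: Action,
          weights: Sequence[float]) -> Action:
    """Combine the two models' predictions with the routing weights."""
    w_arm, w_grip = weights
    arm_delta = [
        w_arm * a + w_grip * g
        for a, g in zip(arm_pred.arm_delta, gripper_pred.arm_delta)
    ]
    gripper = w_arm * arm_pred.gripper + w_grip * gripper_pred.gripper
    return Action(arm_delta=arm_delta, gripper=gripper)


# Stand-ins for the outputs of the arm-movement and
# gripper-manipulation models at one time step.
arm_prediction = Action(arm_delta=[0.2, 0.0, -0.1], gripper=0.0)
gripper_prediction = Action(arm_delta=[0.0, 0.0, 0.0], gripper=1.0)

# A router that favors the arm model during gross motion.
weights = route(arm_score=2.0, gripper_score=0.0)
action = blend(arm_prediction, gripper_prediction, weights)
```

With soft weights, the blended action shifts gradually from arm-dominated to gripper-dominated as the router scores change. A hard router that picks exactly one model per step would be the limiting case of very large score differences.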