Apr 15, 2026 · arXiv:2604.14125

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo

AI Summary

HiVLA addresses the trade-off between reasoning and control in Vision-Language-Action models by decoupling high-level semantic planning (using a VLM) from low-level motor control (using a diffusion transformer). The VLM planner performs task decomposition and visual grounding to generate structured plans with subtask instructions and target bounding boxes. A flow-matching Diffusion Transformer (DiT) action expert then translates these plans into actions using a cascaded cross-attention mechanism to fuse global context, object-centric crops, and skill semantics, resulting in improved performance, especially in long-horizon tasks and fine-grained manipulation.
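One way to picture the structured plan described above is as a small record holding a subtask instruction and a target bounding box, from which an object-centric crop can be cut. The sketch below is illustrative only and is not taken from the paper: the names `SubtaskPlan` and `crop_target`, the `(x_min, y_min, x_max, y_max)` pixel convention with exclusive maxima, and the nested-list image are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SubtaskPlan:
    """Hypothetical container for one planner output (not the paper's schema)."""
    instruction: str
    # Assumed convention: (x_min, y_min, x_max, y_max) in pixels, max exclusive.
    bbox: Tuple[int, int, int, int]


def crop_target(image: List[List[int]], plan: SubtaskPlan) -> List[List[int]]:
    """Cut the object-centric crop named by the plan out of a row-major image."""
    x0, y0, x1, y1 = plan.bbox
    height, width = len(image), len(image[0])
    # Clamp the box to the image so a slightly oversized box still yields a crop.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("bounding box does not overlap the image")
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 4x6 "image" whose pixel value encodes its position as row * 10 + column.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
plan = SubtaskPlan(instruction="pick up the red block", bbox=(1, 1, 3, 3))
crop = crop_target(image, plan)
```

In a real system the crop would be taken from a high-resolution camera frame and passed to the action expert together with the instruction; here the toy image only makes the indexing easy to check.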

Key Contribution

Decoupling high-level VLM planning from low-level diffusion-based control lets robots reason like foundation models *and* execute precisely, outperforming end-to-end approaches in complex manipulation tasks.

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
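The sequential fusion described in the abstract can be sketched as three cross-attention steps applied one after another, with the action tokens attending first to global context, then to crop features, then to skill semantics. This is a simplified illustration under loud assumptions, not the authors' implementation: it uses single-head scaled dot-product attention with residual connections and no learned projections, and the names `cross_attend` and `cascaded_fusion` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Single-head cross-attention with a residual connection.

    Assumption: context tokens serve as both keys and values, and there are
    no learned projection matrices (a real DiT block would have them).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[i] for w, k in zip(weights, context)) for i in range(dim)]
        out.append([qi + ai for qi, ai in zip(q, attended)])  # residual add
    return out


def cascaded_fusion(action_tokens: List[Vector], global_ctx: List[Vector],
                    crop_ctx: List[Vector], skill_ctx: List[Vector]) -> List[Vector]:
    """Fuse the three conditioning streams in the order the abstract lists them."""
    x = action_tokens
    for ctx in (global_ctx, crop_ctx, skill_ctx):
        x = cross_attend(x, ctx)
    return x


# Toy 2-D features; each context has a single token, so attention weights are 1.
tokens = [[1.0, 0.0], [0.0, 1.0]]
fused = cascaded_fusion(tokens, [[1.0, 1.0]], [[2.0, 0.0]], [[0.0, 3.0]])
```

Because every toy context holds exactly one token, each step simply adds that token to the running action features, which makes the cascade easy to verify by hand. With multi-token contexts the softmax weights would decide how much each global, crop, or skill token contributes.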

Multimodal Models · Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
