Mar 8, 2026 | arXiv:2603.07484

HSC-VLA: Hierarchical Scene-Clearing for Robust Bimanual Manipulation in Dense Clutter

Zhen Liu, Xinyu Ning, Zhe Hu, XinXin Xie, Yitong Liu, Zhongzhu Pu

AI Summary

The paper introduces HSC-VLA, a hierarchical vision-language-action framework designed to improve bimanual manipulation in cluttered environments. HSC-VLA decouples high-level reasoning from low-level execution: a "Brain" module decomposes the task and generates task-specific scene masks that filter out irrelevant visual information, and the mask-filtered observations are then passed to a "Cerebellum" module, a diffusion-based policy that executes the motion. In experiments on densely cluttered supermarket shelves, HSC-VLA reaches 86.7% aggregate success compared to 34.3% for the best monolithic baseline, and it achieves 72% on long-horizon clutter sorting and 66% on restocking.

Key Contribution

By explicitly filtering out visual clutter, HSC-VLA achieves a 52.4 percentage-point gain in aggregate success (86.7% versus 34.3%) over the best monolithic baseline on bimanual manipulation under high-density clutter.

Abstract

Modern Vision-Language-Action models often suffer from critical instruction-following failures in high-density manipulation environments, where task-irrelevant visual clutter dilutes attention, corrupts grounding, and substantially degrades performance in complex long-horizon scenarios. To overcome the representation bottleneck of monolithic end-to-end architectures, we propose HSC-VLA, a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution through an explicit scene-clearing abstraction. HSC-VLA employs a high-level Brain to decompose long-horizon tasks and to generate task-specific scene masks that preserve task-relevant geometry while suppressing distractors. The filtered observations are then passed to a low-level Cerebellum, a diffusion-based policy that performs bimanual manipulation using only mask-filtered vision and proprioception. Extensive experiments on densely cluttered supermarket shelves demonstrate that HSC-VLA achieves 86.7% aggregate success under high-density clutter, surpassing the best monolithic baseline (π₀-Full FT at 34.3%) by 52.4 percentage points. HSC-VLA also performs well on long-horizon tasks, reaching 72% on clutter sorting and 66% on restocking, which indicates robustness and effective failure recovery in complex cluttered manipulation.
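
The scene-clearing step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how masks are represented or applied, so everything here is an assumption for illustration: a binary per-pixel mask, a constant fill value for suppressed pixels, and the hypothetical helper name `apply_scene_mask`. It is not the authors' implementation.

```python
import numpy as np


def apply_scene_mask(rgb: np.ndarray, mask: np.ndarray, fill: int = 0) -> np.ndarray:
    """Keep pixels where `mask` is True and replace all others with `fill`.

    Hypothetical helper: assumes an (H, W, C) image and an (H, W) boolean mask.
    """
    if rgb.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")
    # Broadcast the (H, W) mask across the colour channels.
    return np.where(mask[..., None], rgb, np.asarray(fill, dtype=rgb.dtype))


# Toy 4x4 RGB frame in which only the central 2x2 region is task-relevant.
rgb = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

# In the framework's terms, this filtered frame (plus proprioception) is
# what a low-level policy would consume instead of the raw image.
filtered = apply_scene_mask(rgb, mask)
```

Under these assumptions, the low-level policy never sees the suppressed pixels, which is one simple way to realize "mask-filtered vision"; other choices (blurring, cropping, or masking in feature space) would fit the abstract's description equally well.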

Multimodal Models, Robotics & Embodied AI, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
