The paper introduces HSC-VLA, a hierarchical vision-language-action framework designed to improve bimanual manipulation in cluttered environments. HSC-VLA decouples high-level reasoning from low-level execution: a "Brain" module decomposes the task and generates task-specific scene masks that filter out irrelevant visual information, and the mask-filtered observations are then fed to a "Cerebellum" module, a diffusion-based policy that executes the motion. In experiments on densely cluttered supermarket shelves, HSC-VLA reaches 86.7% aggregate success, compared with 34.3% for the best monolithic baseline, and achieves 72% on long-horizon clutter sorting and 66% on restocking.
By explicitly filtering out visual clutter, HSC-VLA raises aggregate success in cluttered bimanual manipulation by 52.4 percentage points over the best monolithic baseline (86.7% versus 34.3%).
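The 52.4 figure is the absolute gap between the two reported success rates, not a relative improvement. A minimal sketch of the arithmetic follows; the variable names are illustrative, and only the two success rates come from the abstract.

```python
# Success rates reported in the abstract (percent).
hsc_vla_success = 86.7
best_baseline_success = 34.3  # pi_0-Full FT

# Absolute gain, in percentage points.
absolute_gain = round(hsc_vla_success - best_baseline_success, 1)

# Relative gain over the baseline, in percent (derived here, not reported in the paper).
relative_gain = round(100 * (hsc_vla_success - best_baseline_success) / best_baseline_success, 1)

print(f"absolute gain: {absolute_gain} percentage points")
print(f"relative gain: {relative_gain}% over the baseline")
```

Read as a relative change, the same gap is considerably larger than 52.4%, which is why the unit matters when quoting the result.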
Modern Vision-Language-Action models often suffer from critical instruction-following failures in high-density manipulation environments, where task-irrelevant visual clutter dilutes attention, corrupts grounding, and substantially degrades performance in complex long-horizon scenarios. To overcome the representation bottleneck of monolithic end-to-end architectures, we propose HSC-VLA, a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution through an explicit scene-clearing abstraction. HSC-VLA employs a high-level Brain to decompose long-horizon tasks and to generate task-specific scene masks that preserve task-relevant geometry while suppressing distractors. The filtered observations are then passed to a low-level Cerebellum, a diffusion-based policy that performs bimanual manipulation using only mask-filtered vision and proprioception. Extensive experiments on densely cluttered supermarket shelves show that HSC-VLA achieves 86.7% aggregate success under high-density clutter, surpassing the best monolithic baseline (π₀-Full FT at 34.3%) by 52.4 percentage points. HSC-VLA also performs well on long-horizon tasks, reaching 72% on clutter sorting and 66% on restocking, which indicates robustness and effective failure recovery in complex cluttered manipulation.
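To make the scene-clearing idea concrete, the sketch below applies a binary mask to an RGB observation so that only task-relevant pixels reach the downstream policy. It is an illustration under stated assumptions, not the paper's implementation: the function name `apply_scene_mask`, the zero fill value, and the toy image are all hypothetical, and the abstract does not specify how the Brain produces masks or how suppressed regions are filled.

```python
import numpy as np

def apply_scene_mask(image, mask, fill_value=0):
    """Keep pixels where mask is True; replace all others with fill_value.

    image: (H, W, C) array, e.g. an RGB camera frame.
    mask:  (H, W) boolean array marking task-relevant pixels (assumed given).
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")
    # Broadcast the (H, W) mask across the channel axis of the image.
    return np.where(mask[..., None], image, fill_value).astype(image.dtype)

# Toy 4x4 RGB frame in which only the central 2x2 region is task-relevant.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

filtered = apply_scene_mask(image, mask)
print(filtered[..., 0])  # first channel: original values inside the mask, fill value outside
```

In the framework described above, the low-level policy would consume an observation like `filtered` together with proprioception rather than the raw cluttered frame.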