Google Research · Georgia Tech · UNC · Feb 25, 2026 · arXiv:2602.21531

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Yue Yang, Shuo Cheng, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius, Daniel Szafir

AI Summary

The paper introduces LiLo-VLA, a modular framework for long-horizon manipulation tasks that decomposes the problem into a Reaching Module for global motion and an object-centric Interaction Module using a Vision-Language-Action (VLA) model. This decoupling enhances robustness against irrelevant visual features and spatial configuration changes, while also enabling dynamic replanning for failure recovery. LiLo-VLA achieves a 69% average success rate across a 21-task simulation benchmark (LIBERO-Long++ and Ultra-Long) and an 85% success rate in real-world evaluations, significantly outperforming existing VLA-based approaches.

Key Contribution

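By decomposing long-horizon manipulation into transport and object-centric interaction, LiLo-VLA generalizes zero-shot to novel long-horizon tasks and stays robust to irrelevant visual features and spatial configuration changes. On the 21-task simulation benchmark it reaches a 69% average success rate, outperforming the end-to-end VLA baselines Pi0.5 and OpenVLA-OFT by 41% and 67%, respectively.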

Abstract

General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: https://yy-gx.github.io/LiLo-VLA/.
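The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Skill` type, the `reach` and `interact` callables, and the retry-based recovery are hypothetical stand-ins, and the abstract does not specify how dynamic replanning is actually performed. The sketch only shows how a plan of atomic skills might be executed by alternating a global reaching step with a local object-centric interaction step, retrying a failed skill instead of aborting the whole task.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Skill:
    """One atomic step of a plan (hypothetical type, not from the paper)."""
    name: str    # e.g. "open_drawer"
    target: str  # object of interest the skill acts on


def run_plan(
    plan: List[Skill],
    reach: Callable[[str], None],
    interact: Callable[[Skill], bool],
    max_retries: int = 2,
) -> Tuple[bool, List[Tuple[str, int, bool]]]:
    """Run skills in order; after a failure, reach again and retry the same skill.

    `reach` stands in for a Reaching Module (global motion toward the target).
    `interact` stands in for an object-centric Interaction Module and returns
    True on success. Returns (overall success, log of (skill, attempt, ok)).
    """
    log: List[Tuple[str, int, bool]] = []
    for skill in plan:
        for attempt in range(1, max_retries + 2):
            reach(skill.target)   # transport phase
            ok = interact(skill)  # interaction phase
            log.append((skill.name, attempt, ok))
            if ok:
                break             # move on to the next skill
        else:
            return False, log     # retries exhausted for this skill
    return True, log
```

In this sketch a failed skill costs only its own retries, since earlier skills are not re-executed. That is one simple way to limit cascading errors, under the assumption that each skill can be re-attempted from the state in which it failed.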

Multimodal Models · Robotics & Embodied AI · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
