This paper introduces VoLoAgent, an agent in which a vision-language model (VLM) orchestrates heterogeneous robot capabilities for open-vocabulary long-horizon manipulation. VoLoAgent runs a closed agent loop that plans, monitors execution, and recovers from failures while the physical scene keeps changing, and it outperforms single VLA/VLM and tool-based baselines. The authors also present RoboVoLo, a benchmark covering common sense, memory/state tracking, complex references, and world knowledge, which reports both task-level success and failure-mode diagnostics. The results are validated in real-robot experiments.
VoLoAgent outperforms single VLA/VLM and tool-based manipulation systems by integrating planning, monitoring, and failure recovery in one closed loop that keeps pace with a changing physical environment.
Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike in virtual AI agents, the timing of decisions, actions, and tool calls is important in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration, and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation across common sense, memory/state tracking, complex references, and world knowledge, with both task-level success and failure-mode diagnostics. Experiments show VoLoAgent substantially outperforms single VLA/VLM or tool-based systems, with validation in real-robot experiments. Project page: https://chicychen.github.io/VoLo/
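To make the idea of a closed loop over interruptible tools concrete, here is a minimal sketch in Python. It is not the authors' implementation. The names `InterruptibleTool`, `Orchestrator`, `monitor`, and `max_retries` are hypothetical. A plain callback stands in for the VLM's monitoring, and a generator stands in for a VLA rollout that can be stopped between steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class InterruptibleTool:
    """Hypothetical stand-in for a long-running skill such as a policy rollout."""
    name: str
    steps: int

    def rollout(self) -> Iterator[str]:
        # Yield after every low-level step so the caller can stop mid-rollout.
        for i in range(self.steps):
            yield f"{self.name}:step{i}"


@dataclass
class Orchestrator:
    """Toy plan/monitor/recover loop; a real system would query a VLM here."""
    monitor: Callable[[str], bool]  # returns True if the observation looks fine
    log: List[str] = field(default_factory=list)

    def _execute(self, tool: InterruptibleTool) -> bool:
        for obs in tool.rollout():
            self.log.append(obs)
            if not self.monitor(obs):
                # Abandon the rollout as soon as monitoring flags a problem.
                self.log.append(f"interrupt:{tool.name}")
                return False
        return True

    def run(self, plan: List[InterruptibleTool], max_retries: int = 1) -> bool:
        for tool in plan:
            for attempt in range(max_retries + 1):
                if self._execute(tool):
                    break
                if attempt < max_retries:
                    # Recovery here is a simple retry (an assumption of this sketch).
                    self.log.append(f"recover:{tool.name}")
            else:
                return False  # retries exhausted for this tool
        return True


# Simulated failure: the second grasp step fails once, then succeeds on retry.
pending_failures = {"grasp:step1"}


def flaky_monitor(obs: str) -> bool:
    if obs in pending_failures:
        pending_failures.discard(obs)
        return False
    return True


orchestrator = Orchestrator(monitor=flaky_monitor)
succeeded = orchestrator.run([InterruptibleTool("grasp", 3), InterruptibleTool("place", 2)])
print(succeeded, orchestrator.log)
```

In this toy run the grasp rollout is interrupted at its second step, retried from the start, and then followed by the place rollout. Monitoring happens between steps of a rollout rather than only after it ends, which is the property the abstract describes as steering a tool mid-rollout.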