This paper introduces a capability externalization approach for embodied agents that decouples perception, reasoning, planning, and control into independently optimized tools invoked through a standardized Embodied Tool Protocol (ETP). The authors curate a tool base of 100+ validated tools and construct EmbodiedToolBench to evaluate tool augmentation across embodied tasks. Results show average performance gains of 31% on EB-ALFRED and 36% on EB-Navigation, with substantial gains for cognition and perception capabilities but limited improvement for execution-type capabilities. The analysis also finds that deciding when, which, and how to invoke tools remains a challenge for all evaluated models.
Embodied agents get a 30%+ performance boost by using external tools for perception and reasoning, but still struggle to pick the right tool for the job.
Most existing embodied intelligence methods formulate perception, reasoning, planning, and control within a unified parameterized policy. Yet these capabilities are inherently hierarchical and heterogeneous, making them difficult to reliably learn and modularize within a single model. We propose a capability externalization approach that decouples heterogeneous capabilities into independently optimized tools, dynamically invoked at inference time. To this end, we introduce Embodied Tool Protocol (ETP), a standardized protocol for embodied tool registration, discovery, invocation, and execution, and curate 100+ validated tools spanning perception, cognition, reasoning, and execution as the tool base. Building on this, we construct EmbodiedToolBench to evaluate both whether tool augmentation improves embodied performance and how well current models use tools across tool-necessity recognition, tool selection, tool execution, and tool-chain composition. Experiments across simulation and real-world platforms confirm that capability externalization consistently improves embodied performance (avg. gain 31% on EB-ALFRED and 36% on EB-Navigation), yet reveal a clear boundary: gains are substantial for cognition and perception but are limited for execution-type capabilities. Moreover, our analysis reveals that knowing when, which, and how to invoke tools remains a persistent challenge across all models, thereby highlighting embodied tool competence as a critical direction for future research.
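The abstract names four protocol stages (registration, discovery, invocation, execution) without giving an interface. As a rough illustration only, the Python sketch below shows what a minimal in-process tool registry along those lines might look like. Every name here (`ToolSpec`, `ToolRegistry`, `register`, `discover`, `invoke`, `count_objects`) is hypothetical and is not taken from the paper or from ETP itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ToolSpec:
    """Hypothetical tool description; not the actual ETP schema."""
    name: str
    category: str  # e.g. "perception", "cognition", "reasoning", "execution"
    description: str
    fn: Callable[..., Any]


class ToolRegistry:
    """Toy registry covering registration, discovery, and invocation."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        # Registration: reject duplicate names so lookups stay unambiguous.
        if spec.name in self._tools:
            raise ValueError(f"tool already registered: {spec.name}")
        self._tools[spec.name] = spec

    def discover(self, category: str) -> List[str]:
        # Discovery: list tool names in a category, sorted for stable output.
        return sorted(
            name for name, spec in self._tools.items()
            if spec.category == category
        )

    def invoke(self, name: str, **kwargs: Any) -> Any:
        # Invocation and execution: look the tool up and call it directly.
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name].fn(**kwargs)


def count_objects(labels: List[str], target: str) -> int:
    """Stand-in perception tool: count detections carrying a given label."""
    return sum(1 for label in labels if label == target)


registry = ToolRegistry()
registry.register(
    ToolSpec(
        name="count_objects",
        category="perception",
        description="Count detections with a given label.",
        fn=count_objects,
    )
)
```

An agent loop built on this sketch would call `registry.discover("perception")` to see which tools are available and then `registry.invoke("count_objects", labels=[...], target="cup")` to run one. Choosing whether to call a tool at all, and which one, is the part the abstract reports as still difficult for current models, and this sketch does not address it.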