PolyU, SMU | Jun 9, 2026 | arXiv:2606.10803

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

AI Summary

This study introduces PhysTool-Bench, a benchmark that assesses how well Multimodal Large Language Models (MLLMs) use physical tools in real-world scenarios. Although MLLMs play a growing role in embodied AI, the evaluation shows that even the top-performing model, Gemini-3.1-Pro, recognizes only 58.7% of the tools in a scene and completes only 21.0% of queries end-to-end, revealing significant deficiencies in both tool perception and planning. These findings point to a critical gap in functional commonsense reasoning that hampers the practical application of MLLMs to real-world tasks involving physical tools.

Key Contribution

Current MLLMs struggle to recognize and use physical tools: the strongest model evaluated, Gemini-3.1-Pro, completes only 21.0% of queries end-to-end in real-world scenarios.

Abstract

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite its importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.
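To make the two evaluation dimensions concrete, the following is a minimal sketch of how a recognition score and an end-to-end completion score could be computed. The abstract does not define the exact metrics, so everything here is an assumption for illustration: the `QueryResult` record layout, micro-averaged recall for recognition, and exact sequence match for plan completion are hypothetical choices, not the authors' scoring protocol.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    """One benchmark query (hypothetical layout, not the paper's schema)."""
    gold_tools: set[str]        # tools annotated as present in the scene
    predicted_tools: set[str]   # tools the model reported seeing
    gold_plan: list[str]        # reference tool-use sequence
    predicted_plan: list[str]   # tool-use sequence the model proposed


def tool_recall(results):
    """Micro-averaged share of annotated tools that the model recognized."""
    total = sum(len(r.gold_tools) for r in results)
    found = sum(len(r.gold_tools & r.predicted_tools) for r in results)
    return found / total if total else 0.0


def completion_rate(results):
    """Share of queries whose predicted plan exactly matches the reference
    (an assumed, strict criterion)."""
    if not results:
        return 0.0
    solved = sum(r.predicted_plan == r.gold_plan for r in results)
    return solved / len(results)


# Toy data, invented purely to exercise the functions.
results = [
    QueryResult({"wrench", "pliers"}, {"wrench"}, ["wrench"], ["wrench"]),
    QueryResult(
        {"multimeter", "screwdriver"},
        {"multimeter", "screwdriver", "hammer"},
        ["screwdriver", "multimeter"],
        ["multimeter"],
    ),
]

print(tool_recall(results))      # 3 of 4 annotated tools found
print(completion_rate(results))  # 1 of 2 plans matched
```

Under these assumed definitions, a model can score well on recognition yet poorly on completion, which is the pattern the abstract describes as the larger drop at the planning stage.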

Multimodal Models | Robotics & Embodied AI | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
