UMich | University of Shanghai for Science and Technology | Mar 16, 2026 | arXiv:2603.15030

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi, Zhipeng Wu, Yinlun Peng, Yifan Zhang, Yihang Lou, Yuanxing Zhang, Ziwei Liu, Yan Bai, Yuan Zhou

AI Summary

VTC-Bench, a new benchmark, is introduced to evaluate the tool-use proficiency of Multimodal Large Language Models (MLLMs) by assessing their ability to compose and execute diverse visual tools. The benchmark features 32 OpenCV-based visual operations, enabling complex, multi-tool compositions and long-horizon planning across 680 curated problems organized by cognitive hierarchy. Experiments on 19 leading MLLMs, including Gemini-3.0-Pro (achieving 51%), reveal limitations in adapting to diverse tool-sets, generalizing to unseen operations, and formulating efficient execution plans for complex tasks.

Key Contribution

MLLMs still fumble at visual tool use, struggling to compose even basic OpenCV operations into effective plans, as revealed by a new benchmark where the best model only scores 51%.

Abstract

Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model, Gemini-3.0-Pro, achieving only 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.

Eval Frameworks & Benchmarks | Multimodal Models | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
