This study evaluates whether tool-augmented multimodal agents, specifically Thyme and DeepEyesV2, actually benefit from tool use by comparing them with tool-free counterparts and pure-text reasoners across real-world understanding, OCR, chart understanding, and mathematical reasoning. Although tool access is often associated with benchmark gains, it yields little consistent aggregate improvement: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. The analysis further indicates that agents may learn to call tools without actually leveraging their capabilities, which motivates evaluations that separate tool availability from what tools actually contribute.
Tool-augmented multimodal agents may appear to excel, but they often rely on learned tool-calling patterns rather than enhanced problem-solving abilities.
Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative "thinking with images" agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.
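To make the "tool-only solved set" statistic concrete, the following is a minimal sketch of how such an overlap could be computed from per-setting sets of solved problem IDs. The function name `tool_only_breakdown`, the setting labels, and the toy problem IDs are all hypothetical and chosen for illustration; this is not the authors' evaluation code, and the paper's exact protocol may differ.

```python
def tool_only_breakdown(tool_solved, non_tool_runs):
    """Return (overlap_fraction, tool_only_ids).

    tool_solved:   IDs of problems solved with tool access.
    non_tool_runs: hypothetical mapping from a non-tool setting name
                   (e.g. "tool_free", "pure_text") to the IDs it solved.
    """
    tool_solved = set(tool_solved)
    # A problem counts as solvable without tools if ANY non-tool setting solves it.
    solved_without_tools = set().union(*non_tool_runs.values())
    tool_only = tool_solved - solved_without_tools
    also_solved = tool_solved & solved_without_tools
    overlap = len(also_solved) / len(tool_solved) if tool_solved else 0.0
    return overlap, tool_only


# Toy data, invented for illustration only.
overlap, tool_only = tool_only_breakdown(
    tool_solved={"q1", "q2", "q3", "q4"},
    non_tool_runs={
        "tool_free": {"q1", "q2", "q5"},
        "pure_text": {"q2", "q3"},
    },
)
print(f"overlap = {overlap:.0%}, tool-only = {sorted(tool_only)}")
```

Under this reading, the 93% and 96% figures reported above would correspond to `overlap`, and the small remainder to `tool_only`: the problems for which tool access is plausibly doing answer-critical work.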