The paper introduces ToolsRL, a reinforcement learning framework designed to improve tool-use capabilities of multimodal large language models on visual reasoning tasks. ToolsRL employs a two-stage curriculum: the first stage optimizes tool-specific rewards for a set of simple visual tools (zoom, rotate, flip, draw), and the second stage trains with accuracy-targeted rewards while permitting tool use. Experiments demonstrate that this tool-supervised curriculum enhances tool-use capabilities on complex visual reasoning tasks.
Forget end-to-end training: teaching models to master individual visual tools *before* tackling complex reasoning unlocks surprisingly strong performance.
In this paper, we investigate how Multimodal Large Language Models can effectively master tool use to solve complex visual reasoning tasks. To this end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework that provides direct tool supervision for more effective tool-use learning. We focus on a set of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, for which tool supervision is easy to collect. We develop a reinforcement learning curriculum in which the first stage is optimized solely by a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while tool calling is allowed. In this way, tool-calling capability is mastered before tools are used to complete visual reasoning tasks, avoiding potential optimization conflicts among these heterogeneous objectives. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
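The two-stage curriculum described above can be sketched as a stage-dependent reward function. This is a minimal illustrative sketch, not the paper's implementation: the `Episode` record, the tool names, and the matching scheme for tool supervision are all assumptions made for the example.

```python
# Hypothetical sketch of the two-stage curriculum reward.
# Stage 1 scores only tool use against collected tool supervision;
# stage 2 scores only answer accuracy while tool calls remain allowed.
from dataclasses import dataclass, field

# Illustrative names for the simple, native visual tools in the paper.
VALID_TOOLS = {"zoom_in", "rotate", "flip", "draw_point", "draw_line"}

@dataclass
class Episode:
    tool_calls: list = field(default_factory=list)    # tools the policy invoked
    tool_targets: list = field(default_factory=list)  # supervised expected calls
    answer: str = ""
    gold_answer: str = ""

def curriculum_reward(ep: Episode, stage: int) -> float:
    if stage == 1:
        # Stage 1: tool-specific reward only -- fraction of supervised
        # tool calls the policy reproduced (valid tools only).
        if not ep.tool_targets:
            return 0.0
        hits = sum(1 for call, target in zip(ep.tool_calls, ep.tool_targets)
                   if call == target and call in VALID_TOOLS)
        return hits / len(ep.tool_targets)
    # Stage 2: accuracy-targeted reward; tool use is permitted but unscored.
    return 1.0 if ep.answer == ep.gold_answer else 0.0
```

Separating the stages this way means the policy is never asked to trade off tool fidelity against answer accuracy within a single objective, which is the optimization conflict the curriculum is designed to avoid.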