Lancaster University | PKU | Jun 1, 2026 | arXiv:2606.02518

ToolFG: Towards Well-Grounded Fine-Grained Image Classification

Yu Xue, Haoxuan Qu, Zhuoling Li, Yihang Lou, Yan Bai, Hossein Rahmani, Jun Liu

AI Summary

This paper introduces ToolFG, a framework that integrates multi-modal large language models (MLLMs) with external tools for fine-grained image classification (FGIC). Using an MCTS-guided tool-use knowledge distillation mechanism, ToolFG trains models to autonomously interact with images and gather verifiable visual cues, so that highly similar categories are distinguished in a more reliable and well-grounded way. A model-tool co-evolution mechanism jointly refines the toolset and the model's tool-use policy. The authors report that extensive experiments demonstrate the framework's effectiveness.

Key Contribution

ToolFG is presented as the first tool-integrated MLLM-based framework tailored to fine-grained image classification, enabling MLLMs to autonomously use external tools to collect verifiable visual cues.

Abstract

Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing ToolFG, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more reliable and well-grounded manner. To equip the model with such tool-use ability, we design a novel MCTS-guided tool-use knowledge distillation mechanism, which effectively mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a model-tool co-evolution mechanism that jointly refines the toolset and the model's tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.

Computer Vision | Multimodal Models | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
