Apr 21, 2026 · arXiv:2604.19857

Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

Carter Adams, Rafael Oliveira, Gabriel Almeida, Sofia Torres

AI Summary

This paper introduces the Tool-Augmented Markov Decision Process (TA-MDP) framework to formally analyze reinforcement fine-tuning (RFT) of large vision-language models (LVLMs) on agentic tasks. It proves convergence of Group Relative Policy Optimization (GRPO) under composite verifiable rewards, derives a Reward Decomposition Theorem characterizing when optimizing reward components separately is beneficial, and establishes a PAC-Bayes generalization bound that explains out-of-distribution transfer. Together, the results provide theoretical justification for the empirical success of methods such as Visual-ARFT.

Key Contribution

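The paper characterizes when decomposing composite verifiable rewards in LVLM reinforcement fine-tuning is provably beneficial relative to joint optimization, and pairs this with a convergence rate for GRPO and a generalization bound for tool-augmented policies, offering a principled alternative to monolithic reward optimization.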

Abstract

Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i) how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii) why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the Tool-Augmented Markov Decision Process (TA-MDP), a formal framework that models multimodal agentic decision-making with bounded-depth tool calls. Within this framework, we establish three main results. First, we prove that GRPO under composite verifiable rewards converges to a first-order stationary point at rate $O(1/\sqrt{T})$ with explicit dependence on the number of reward components and group size (Theorem 1). Second, we derive a Reward Decomposition Theorem that bounds the sub-optimality gap between decomposed per-component optimization and joint optimization, providing a precise characterization of when reward decomposition is beneficial (Theorem 2). Third, we establish a PAC-Bayes generalization bound for tool-augmented policies that explains the strong out-of-distribution transfer observed in Visual-ARFT (Theorem 3).
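To make the setting concrete, the following is a minimal sketch of how a composite verifiable reward might feed GRPO's group-relative advantage. It is an illustration under stated assumptions, not the paper's formulation: the weighted-sum form of the composite reward, the weights in `WEIGHTS`, and the function names are hypothetical, while the advantage itself uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
import math
from typing import Dict, List

# Hypothetical weights for the three reward components named in the abstract.
# The paper's actual weighting scheme is not given here; this is an assumption.
WEIGHTS: Dict[str, float] = {"format": 0.2, "accuracy": 0.6, "tool": 0.2}


def composite_reward(components: Dict[str, float],
                     weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-component verifiable rewards into one scalar (assumed weighted sum)."""
    return sum(weights[name] * components[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style advantage: normalize each reward within its sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of four sampled responses to the same prompt, each scored per component.
group = [
    {"format": 1.0, "accuracy": 1.0, "tool": 1.0},  # well-formed, correct, tool call ran
    {"format": 1.0, "accuracy": 0.0, "tool": 1.0},  # well-formed, wrong answer
    {"format": 0.0, "accuracy": 0.0, "tool": 0.0},  # malformed output
    {"format": 1.0, "accuracy": 1.0, "tool": 0.0},  # correct, but tool call failed
]

rewards = [composite_reward(c) for c in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

In this sketch the group size is the length of `group` and the number of reward components is the number of keys in `WEIGHTS`, the two quantities on which the abstract says the convergence rate depends explicitly.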

Multimodal Models · RLHF & Preference Learning · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
