Beihang, BIT, HUST, Zhongguancun Academy, Zhongguancun Laboratory

May 26, 2026

arXiv:2605.26656

DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding

Jianfei Zhao, Chong Feng, Bing Wang, Zhixing Tan

AI Summary

The paper introduces Direct Vision Supervised Fine-Tuning (DV-SFT), a method to explicitly supervise visual tokens in multimodal LLMs by leveraging direct vision-text correspondence in OCR-related scenarios. DV-SFT automatically labels visual tokens with corresponding words from image patches and trains them using the standard next-token prediction objective, without architectural modifications or extra forward passes. Experiments show DV-SFT consistently outperforms standard SFT on in-domain and out-of-domain benchmarks, enhancing fine-grained visual understanding and multimodal alignment.

Key Contribution

Directly supervising visual tokens with their corresponding text improves MLLMs over standard SFT, without architectural changes or additional forward passes.

Abstract

Multimodal large language models are typically trained end-to-end to predict ground-truth answers, yet supervision signals are applied exclusively to text tokens. Visual tokens, the core carriers of visual information, are optimized only implicitly as part of the context, leading to coarse-grained visual understanding. Prior works attempt to supervise visual inputs but inevitably rely on auxiliary components such as additional decoders or forward passes, because visual tokens lack readily interpretable labels. This limits their practical applicability. In this work, we propose Direct Vision Supervised Fine-Tuning (DV-SFT), which constructs explicit, token-level supervision for visual tokens and trains them through the same next-token prediction objective used for text. Specifically, we exploit the direct vision-text correspondence in OCR-related scenarios and automatically label each visual token with the word in its corresponding image patch. DV-SFT treats the MLLM as a black box, requiring no architectural modifications or additional forward passes. Extensive experiments demonstrate the superiority of direct vision supervision. DV-SFT consistently outperforms standard SFT across three in-domain and four out-of-domain benchmarks. Further analyses show that vision supervision effectively enhances fine-grained visual understanding and achieves higher multimodal alignment efficiency.
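The labeling idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the `WordBox` input format, the largest-overlap assignment rule, the toy `vocab` that maps each word to a single id (a real tokenizer may split a word into several tokens), and the use of -100 as the ignored label, which follows the default `ignore_index` of PyTorch's cross-entropy loss. How the labels are shifted for next-token prediction is left out, since the abstract does not specify it.

```python
from dataclasses import dataclass

# Positions carrying this label contribute no loss (PyTorch's default ignore_index).
IGNORE_INDEX = -100


@dataclass
class WordBox:
    """One OCR word with its pixel bounding box (hypothetical input format)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def label_patches(words, image_w, image_h, patch, vocab):
    """Return one label per image patch, in row-major order.

    Assumed rule: a patch takes the id of the in-vocabulary word whose box
    overlaps it most; patches that touch no word get IGNORE_INDEX.
    """
    cols = image_w // patch
    rows = image_h // patch
    labels = []
    for r in range(rows):
        for c in range(cols):
            px0, py0 = c * patch, r * patch
            px1, py1 = px0 + patch, py0 + patch
            best_id, best_area = IGNORE_INDEX, 0.0
            for w in words:
                overlap_w = min(px1, w.x1) - max(px0, w.x0)
                overlap_h = min(py1, w.y1) - max(py0, w.y0)
                if overlap_w > 0 and overlap_h > 0 and w.text in vocab:
                    area = overlap_w * overlap_h
                    if area > best_area:
                        best_id, best_area = vocab[w.text], area
            labels.append(best_id)
    return labels


def build_labels(patch_labels, answer_ids):
    """Label sequence for [visual tokens] + [answer tokens].

    Standard SFT would use [IGNORE_INDEX] * len(patch_labels) for the visual
    part; here the visual positions carry word labels instead.
    """
    return list(patch_labels) + list(answer_ids)


if __name__ == "__main__":
    words = [WordBox("Hello", 0, 0, 30, 16), WordBox("World", 34, 16, 64, 32)]
    vocab = {"Hello": 7, "World": 9}  # toy vocabulary, one id per word
    patch_labels = label_patches(words, image_w=64, image_h=32, patch=16, vocab=vocab)
    print(patch_labels)
    print(build_labels(patch_labels, answer_ids=[1, 2]))
```

Because the labels are ordinary token ids, the same cross-entropy loss used for text can be applied to the visual positions, which is consistent with the abstract's claim that no extra decoder or forward pass is needed.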

Computer Vision, Multimodal Models, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
