The paper investigates the limitations of current Vision-Language Models (VLMs) and finds that visual perception, rather than reasoning, is the primary bottleneck on visual tasks. The authors propose a staged post-training approach that separates visual perception, visual reasoning, and textual reasoning into distinct stages, each with specialized data, and they find that perception is learned more effectively with reinforcement learning than with caption-based supervised fine-tuning. Compared with merged training, staged training improves both visual perception and reasoning performance, yielding higher accuracy with shorter reasoning traces.
VLMs struggle more with *seeing* than *thinking*: training visual perception as its own post-training stage, before visual reasoning, improves downstream reasoning accuracy while shortening reasoning traces.
Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception rather than by reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages, each with specialized training data: visual perception, visual reasoning, and textual reasoning. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, with gains over the base counterpart on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).
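The difference between merged training, capability-based staging, and staging combined with a difficulty-based curriculum can be illustrated with a small data-scheduling sketch. This is not the authors' implementation: the `Example` type, the `difficulty` score, the function names, and the position of textual reasoning in `STAGE_ORDER` are all assumptions made for illustration. The abstract only states that perception should be solidified before visual reasoning.

```python
import random
from dataclasses import dataclass

# Assumed stage order. The source only says perception comes before
# visual reasoning; placing textual reasoning last is a guess.
STAGE_ORDER = ["visual_perception", "visual_reasoning", "textual_reasoning"]


@dataclass(frozen=True)
class Example:
    capability: str    # one of the names in STAGE_ORDER
    difficulty: float  # hypothetical difficulty score in [0, 1]
    uid: int


def merged_schedule(examples, seed=0):
    """Merged training: all capabilities shuffled into one stream."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled


def staged_schedule(examples, stage_order=STAGE_ORDER, by_difficulty=False, seed=0):
    """Staged training: one contiguous block per capability, in stage order.

    With by_difficulty=True, each block is additionally ordered easy to hard,
    which layers a difficulty curriculum on top of the capability staging.
    """
    rng = random.Random(seed)
    schedule = []
    for stage in stage_order:
        block = [ex for ex in examples if ex.capability == stage]
        if by_difficulty:
            block.sort(key=lambda ex: ex.difficulty)
        else:
            rng.shuffle(block)
        schedule.extend(block)
    return schedule
```

Because staging decides which capability is trained when, and difficulty ordering only rearranges examples inside a stage, the two choices can be toggled independently. That is one way to read the claim that capability-based staging is orthogonal to difficulty-based curricula.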