Mar 19, 2026 · arXiv:2603.18656

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz

AI Summary

The paper introduces SCALe (Scheduled Curriculum Adaptive Loss), a novel loss function for vision-language model supervised fine-tuning that addresses token imbalance by dynamically weighting reasoning and answer segments. SCALe uses a cosine scheduling policy to shift focus from reasoning to answer segments during training, promoting concise and accurate reasoning. Experiments demonstrate that SCALe-SFT achieves comparable performance to a full SFT + GRPO pipeline with significantly reduced training time, and further improves performance when combined with GRPO.

Key Contribution

A simple loss re-weighting scheme lets supervised fine-tuning alone match the vision-language reasoning performance of the full SFT + GRPO pipeline in roughly one-seventh of the training time.

Abstract

Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long reasoning traces overshadow short but task-critical answer segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the reasoning segment, SCALe-SFT gradually shifts the focus from reasoning to answer throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
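
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function names, the convex combination of per-segment mean losses, and the weight endpoints `w_start` and `w_end` are hypothetical and are not taken from the paper. The sketch only shows how averaging each segment separately removes the dependence on segment length, and how a cosine schedule can move weight from the reasoning segment to the answer segment over training.

```python
import math


def cosine_answer_weight(step, total_steps, w_start=0.5, w_end=0.9):
    """Answer-segment weight that moves from w_start to w_end along a half cosine.

    w_start and w_end are illustrative values, not the paper's settings.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def vanilla_sft_loss(token_nll):
    """Standard SFT: every token contributes equally, so long segments dominate."""
    return sum(token_nll) / len(token_nll)


def segment_weighted_loss(token_nll, is_answer, step, total_steps):
    """Hypothetical length-independent loss: average each segment on its own,
    then mix the two means with the scheduled answer weight."""
    reasoning = [nll for nll, ans in zip(token_nll, is_answer) if not ans]
    answer = [nll for nll, ans in zip(token_nll, is_answer) if ans]
    mean_reasoning = sum(reasoning) / len(reasoning) if reasoning else 0.0
    mean_answer = sum(answer) / len(answer) if answer else 0.0
    w = cosine_answer_weight(step, total_steps)
    return (1.0 - w) * mean_reasoning + w * mean_answer


# Toy sequence: 8 reasoning tokens (NLL 1.0 each) and 2 answer tokens (NLL 3.0 each).
token_nll = [1.0] * 8 + [3.0] * 2
is_answer = [False] * 8 + [True] * 2

vanilla = vanilla_sft_loss(token_nll)
early = segment_weighted_loss(token_nll, is_answer, step=0, total_steps=100)
late = segment_weighted_loss(token_nll, is_answer, step=100, total_steps=100)
print(vanilla, early, late)
```

In this toy case the vanilla loss is pulled toward the reasoning tokens simply because there are more of them, while the segment-weighted variant gives the two answer tokens a share of the loss that depends only on the schedule, and that share grows as training progresses.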

Multimodal Models · Reasoning & Chain-of-Thought · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2026

Venue: N/A
