Apr 28, 2026 · arXiv:2604.25907

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

AI Summary

The paper introduces a loss family, $J_Q$, based on the Tsallis $q$-logarithm, that interpolates between RLVR and the log-marginal-likelihood to improve the training of reasoning models with output-level supervision. The family targets the "cold-start" problem, in which initial success probabilities are low, by reweighting each instance with a scalar amplification factor derived from the Tsallis $q$-logarithm. Because that factor is intractable, the authors derive two Monte Carlo estimators, GARL and PAFT, to approximate it. Empirically, GARL substantially mitigates cold-start stalling and PAFT provides stable gradients, yielding performance gains on the FinQA, HotPotQA, and MuSiQue datasets.
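As a worked illustration of where the amplification factor comes from, the standard definition of the Tsallis $q$-logarithm can be differentiated directly. This is a sketch under assumptions: $P_\theta$ denotes the per-example success probability (the notation used in the abstract), and the objective is read as $\ln_q P_\theta$ per example, up to sign and averaging conventions that this page does not spell out.

$$
\ln_q(x) = \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1), \qquad \ln_1(x) = \ln x
$$

$$
\nabla_\theta \ln_q P_\theta = P_\theta^{-q} \, \nabla_\theta P_\theta
$$

At $q{=}0$ this reduces to $\nabla_\theta P_\theta$, the gradient of the expected verifiable reward, and at $q{=}1$ it becomes $\nabla_\theta \ln P_\theta$, the log-marginal-likelihood gradient. The direction $\nabla_\theta P_\theta$ is shared across $q$; only the scalar $P_\theta^{-q}$ changes, and it grows as $P_\theta$ shrinks.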

Key Contribution

Stuck training your reasoning model with RLVR due to a low initial success rate? This paper shows how a Tsallis q-logarithm loss can jumpstart learning by adaptively amplifying gradients, achieving a +14.4 point boost over GRPO on HotPotQA.

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $\Omega(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $\Theta\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_\theta$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_{\theta}^{q+1}}\big)$; GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).
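The escape-time scaling quoted in the abstract can be illustrated on a toy problem. The sketch below is not the paper's setup: it assumes a hypothetical one-parameter model with success probability $p = \sigma(\theta)$, uses the fact that the gradient of the $q$-logarithm of $p$ is $p^{-q}$ times the gradient of $p$, and Euler-integrates gradient flow until $p$ reaches a target. The function name `escape_time` and all of its parameters are illustrative assumptions.

```python
import math


def escape_time(p0, q, target=0.5, dt=1e-2, max_steps=10_000_000):
    """Time for a toy model p = sigmoid(theta) to reach `target` under gradient flow.

    Assumption: the per-example objective is the Tsallis q-logarithm of p, so the
    ascent direction is p^{-q} * dp/dtheta with dp/dtheta = p * (1 - p).
    """
    theta = math.log(p0 / (1.0 - p0))  # logit of the initial success probability
    t = 0.0
    for _ in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return t
        amplification = p ** (-q)  # scalar reweighting; equals 1 at q = 0
        theta += dt * amplification * p * (1.0 - p)  # explicit Euler step
        t += dt
    return math.inf  # did not escape within the step budget


if __name__ == "__main__":
    for q in (0.0, 0.75, 1.0):
        print(f"q={q}: escape time ~ {escape_time(1e-3, q):.1f}")
```

In this toy model, small $p$ evolves roughly as $dp/dt \approx p^{2-q}$, so the escape time grows like $1/p_0$ at $q{=}0$ and like $\log(1/p_0)$ at $q{=}1$, matching the two poles described in the abstract. The toy has no label noise, so it does not show the noise-memorization side of the trade-off.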

Reasoning & Chain-of-Thought · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 38

Year: 2026

Venue: N/A
