The paper introduces Offline eXploration-Aware (OXA) fine-tuning, a novel approach that improves the mathematical reasoning capabilities of large language models by optimizing supervised fine-tuning (SFT) before reinforcement learning from verifiable rewards (RLVR). OXA promotes low-confidence verified teacher-distillation data and suppresses high-confidence incorrect self-distillation data to improve the initial policy. Experiments on six benchmarks show that OXA consistently improves mathematical reasoning performance, with average gains of +6 Pass@1 and +5 Pass@k points over conventional SFT on the Qwen2.5-1.5B-Math model, and that these gains persist through RLVR training.
Supervised fine-tuning can be dramatically improved by explicitly encouraging exploration of low-confidence data and suppressing high-confidence errors, leading to sustained gains in mathematical reasoning even after extensive RLVR training.
By encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, supervised fine-tuning (SFT), with its capacity to memorize new chain-of-thought trajectories, provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass from incorrect patterns toward potentially correct candidates. Experimental results across six benchmarks show that OXA consistently improves mathematical reasoning performance, notably achieving average gains of $+6$ Pass@1 and $+5$ Pass@$k$ points over conventional SFT on the Qwen2.5-1.5B-Math model. Crucially, OXA raises the initial policy entropy, and its performance gains persist throughout extensive RLVR training, demonstrating the long-term value of OXA.
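Read together, the two objectives can be sketched as a single offline loss. The notation below (policy $\pi_\theta$, confidence measure $c_\theta$, thresholds $\tau_{\text{low}}$ and $\tau_{\text{high}}$, weight $\lambda$, and the unlikelihood form of the suppression term) is an illustrative assumption rather than the paper's exact formulation:

$$
\mathcal{L}_{\text{OXA}}(\theta)
= -\sum_{\substack{(x,y)\in\mathcal{D}_{\text{teacher}}\\ y\ \text{verified},\ c_\theta(y\mid x)<\tau_{\text{low}}}} \log \pi_\theta(y\mid x)
\;-\; \lambda \sum_{\substack{(x,y)\in\mathcal{D}_{\text{self}}\\ y\ \text{incorrect},\ c_\theta(y\mid x)>\tau_{\text{high}}}} \log\bigl(1-\pi_\theta(y\mid x)\bigr)
$$

Under this reading, minimizing the first term promotes verified teacher-distilled trajectories to which the current policy assigns low confidence, while minimizing the second, unlikelihood-style term suppresses incorrect self-distilled trajectories held with high confidence, redistributing their probability mass toward other candidate solutions.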