UC Davis | Apr 8, 2026 | arXiv:2604.06628

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Qihan Ren, Peng Wang, Rui Cai, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu

AI Summary

This paper investigates the generalization capabilities of reasoning SFT, challenging the notion that it only memorizes. The authors find that cross-domain generalization in reasoning SFT is conditional, influenced by optimization dynamics (observing a "dip-and-recovery" pattern with extended training), training data quality/structure, and base model capability. They also highlight an asymmetry where reasoning improves while safety degrades, emphasizing the need to consider the conditions and costs of generalization in reasoning SFT.

Key Contribution

Reasoning SFT doesn't just memorize; it generalizes, but only if you train it long enough, feed it good data, and use a capable model. Even then, reasoning gains come at the cost of safety.

Abstract

A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.

Data Curation & Synthetic Data | Reasoning & Chain-of-Thought | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
