Tsinghua AI, Beihang, Beijing Advanced Innovation Center for Future, Hangzhou International Innovation, NTU
May 26, 2026, arXiv:2605.26924

Learning to Adapt SFT Data for Better Reasoning Generalization

Lisong Sun, Li Wang, Chen Zhang, Jinyang Wu, Kui Zhang, Tianhao Peng, Wenjun Wu

AI Summary

The paper introduces Data Adaptation for Reasoning Tuning (DART), a method to improve reasoning generalization in LLMs by adapting SFT data to better match the target model's distribution. DART uses reinforcement learning to train a mapper model that transforms the original SFT data into model-adapted supervision. Experiments demonstrate that DART enhances generalization, improves training efficiency compared to direct RL, and outperforms standard SFT across various models and datasets.

Key Contribution

DART trains a mapper model with RL to rewrite distributionally mismatched SFT data into supervision that better matches the target model's distribution, improving generalization over standard SFT and achieving higher training efficiency than direct RL.

Abstract

Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at https://anonymous.4open.science/r/DART525E50D.
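
The two-stage idea in the abstract (learn a transformation of the demonstrations with RL, then fine-tune on the transformed data) can be illustrated with a deliberately tiny sketch. Everything concrete below is an assumption made for illustration and is not taken from the paper: the "mapper" is reduced to a softmax policy over three hypothetical string transformations, the reward is a placeholder that favors removing filler text (standing in for whatever signal measures fit to the target model), and the update is plain REINFORCE without a baseline. The final SFT stage is omitted; `adapt_dataset` only produces the data that stage would consume.

```python
import math
import random

# Hypothetical demonstration transformations (illustrative only, not from the paper).
def identity(demo):
    return demo

def lowercase(demo):
    return demo.lower()

def strip_filler(demo):
    return demo.replace("Obviously, ", "")

TRANSFORMS = [identity, lowercase, strip_filler]

def toy_reward(original, transformed):
    # Placeholder reward (assumption): fraction of characters removed.
    # A real system would score fit to the target model instead.
    return (len(original) - len(transformed)) / max(len(original), 1)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_mapper(dataset, steps=200, lr=0.5, seed=0):
    """REINFORCE over a categorical policy that picks one transformation."""
    rng = random.Random(seed)
    logits = [0.0] * len(TRANSFORMS)
    for _ in range(steps):
        demo = rng.choice(dataset)
        probs = softmax(logits)
        action = rng.choices(range(len(TRANSFORMS)), weights=probs)[0]
        reward = toy_reward(demo, TRANSFORMS[action](demo))
        # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return logits

def adapt_dataset(dataset, logits):
    """Apply the highest-scoring transformation to every demonstration."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [TRANSFORMS[best](demo) for demo in dataset]

if __name__ == "__main__":
    data = ["Obviously, 2 + 2 = 4.", "Obviously, x = 3 so x^2 = 9."]
    mapper_logits = train_mapper(data)
    print(adapt_dataset(data, mapper_logits))
```

In this toy setting only the filler-stripping transformation ever receives a positive reward, so the policy drifts toward it. The point is the shape of the pipeline (policy over transformations, reward-driven update, transformed dataset handed to SFT), not the specific reward or mapper architecture, which the abstract does not specify.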

Data Curation & Synthetic Data, Reasoning & Chain-of-Thought, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
