Apr 15, 2026 · arXiv:2604.14258

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Wangjie Gan, Miao Pan, Linbo Xi, Wenqi Zhang, Wenqiang Zhang, Jintao Chen, Jianwei Yin, Xuhong Zhang

AI Summary

The paper introduces Group Fine-Tuning (GFT), a post-training framework for LLMs that targets two limitations of SFT: an extremely sparse implicit reward and unstable inverse-probability weighting. GFT uses Group Advantage Learning to construct diverse response groups and derive normalized contrastive supervision from them, and Dynamic Coefficient Rectification to adaptively bound inverse-probability weights and stabilize optimization. Experiments show that GFT outperforms SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
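The two mechanisms named above can be illustrated with a minimal Python sketch. This page does not give the paper's formulas, so both functions are assumptions rather than the authors' method: `group_advantages` standardizes rewards within one response group (a common GRPO-style normalization), and `bounded_inverse_weight` caps a 1/p weight at a fixed ceiling, whereas the paper describes an adaptive bound. The function names and the `eps` and `max_weight` parameters are hypothetical.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one response group.

    Assumption: zero-mean, unit-scale normalization; the paper's exact
    "normalized contrastive supervision" may differ.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


def bounded_inverse_weight(prob, max_weight=10.0):
    """Inverse-probability weight 1/prob, capped at max_weight.

    Assumption: a fixed cap stands in for the paper's adaptive bound.
    """
    return min(1.0 / max(prob, 1e-12), max_weight)


if __name__ == "__main__":
    # Four sampled responses to one prompt, with binary rewards.
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
    # An unlikely token: the raw weight 1/p would be very large without the cap.
    print(bounded_inverse_weight(1e-6))
```

Without a cap, a response or token with probability near zero receives an arbitrarily large weight, which is the instability the summary attributes to inverse-probability weighting.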

Key Contribution

Group Fine-Tuning (GFT) addresses SFT's sparse implicit reward and unstable inverse-probability weighting, yielding policies that surpass SFT-based methods and integrate more smoothly with subsequent RL training.

Abstract

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
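The claim that SFT is a special case of policy gradient can be made concrete with a short identity. The notation here is an assumption, not taken from the paper: $\pi_\theta$ is the model policy, $\mathcal{D}$ the SFT dataset with one reference response $y^*$ per prompt $x$, and $\pi_\theta(y^* \mid x) > 0$.

```latex
\begin{aligned}
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  &= -\,\mathbb{E}_{(x,\,y^*) \sim \mathcal{D}}
     \big[ \nabla_\theta \log \pi_\theta(y^* \mid x) \big] \\
  &= -\,\mathbb{E}_{x \sim \mathcal{D}}\;
     \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
     \left[ \frac{\mathbb{1}[y = y^*]}{\pi_\theta(y \mid x)}\,
            \nabla_\theta \log \pi_\theta(y \mid x) \right]
\end{aligned}
```

The second line follows because the inner expectation places mass $\pi_\theta(y^* \mid x)$ on the only term with a nonzero indicator, which cancels the denominator. Read as a policy gradient, the indicator $\mathbb{1}[y = y^*]$ plays the role of the sparse implicit reward and $1 / \pi_\theta(y \mid x)$ is the inverse-probability weight, which grows without bound as the reference response becomes unlikely under the current policy.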

Natural Language Processing · RLHF & Preference Learning · Training Efficiency & Optimization

Citation Metrics

Citations: 0
Influential citations: 0
References: 41
Year: 2026
Venue: N/A
