EPFL

Apr 15, 2026

arXiv:2604.13627

(How) Learning Rates Regulate Catastrophic Overtraining

Mark Rofin, Aditya Varre, Nicolas Flammarion

AI Summary

This paper investigates catastrophic overtraining in LLMs during supervised fine-tuning (SFT) by examining the role of the learning rate. The authors show that fine-tuning with different learning rates yields qualitatively different models even when those models are trained to the same SFT loss. The key finding is that learning rate decay increases the sharpness of the pretrained model, which exacerbates catastrophic forgetting during SFT and leads to overtraining.

Key Contribution

Learning rate decay, a common optimization technique, increases the sharpness of the pretrained model, which in turn worsens catastrophic forgetting in LLMs during fine-tuning.

Abstract

Supervised fine-tuning (SFT) is a common first stage of LLM post-training, teaching the model to follow instructions and shaping its behavior as a helpful assistant. At the same time, SFT may harm the fundamental capabilities of an LLM, particularly after long pretraining: a phenomenon known as catastrophic overtraining (Springer et al., 2025). To understand overtraining, we first investigate catastrophic forgetting in finetuning through the lens of implicit regularization of the learning rate. For models trained to the same SFT loss, we identify how the learning rate mediates optimization: finetuning with large and small steps converges to qualitatively different models. Next, we link forgetting to overtraining: learning rate decay increases the sharpness of the pretrained model, which in turn exacerbates catastrophic forgetting during SFT, leading to overtraining. Our findings paint a picture of the overtraining mechanism in LLMs and broadly contribute to the understanding of the interplay between optimization dynamics during pretraining and finetuning.
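As a toy illustration of why step size and sharpness interact (this is a textbook fact about gradient descent on quadratics, not the authors' experimental setup), the sketch below runs plain gradient descent on a two-dimensional quadratic loss. Sharpness is taken here to mean the largest Hessian eigenvalue, which is an assumption about terminology; the curvature values, learning rates, and function names are all hypothetical. On such a loss, gradient descent is stable only when the learning rate is below 2 / sharpness, so a large step cannot settle in a sharp direction while a small step can.

```python
# Toy illustration only: L(w) = 0.5 * sum(h_i * w_i**2), whose Hessian is diag(h).
# "Sharpness" is assumed to mean the largest Hessian eigenvalue, i.e. max(h).

def loss(w, curvatures):
    """Quadratic loss with one curvature value per coordinate."""
    return 0.5 * sum(h * wi * wi for h, wi in zip(curvatures, w))

def gradient(w, curvatures):
    """Gradient of the quadratic loss: h_i * w_i for each coordinate."""
    return [h * wi for h, wi in zip(curvatures, w)]

def run_gd(w0, curvatures, lr, steps):
    """Plain gradient descent with a constant learning rate."""
    w = list(w0)
    for _ in range(steps):
        g = gradient(w, curvatures)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical landscape: one sharp direction (10.0) and one flat direction (1.0).
curvatures = [10.0, 1.0]
sharpness = max(curvatures)
stability_limit = 2.0 / sharpness  # GD on a quadratic diverges above this step size

# A step below the limit converges; a step above it blows up along the sharp axis.
small_step = run_gd([1.0, 1.0], curvatures, lr=0.05, steps=200)
large_step = run_gd([1.0, 1.0], curvatures, lr=0.25, steps=200)

print("stability limit:", stability_limit)
print("loss with lr=0.05:", loss(small_step, curvatures))
print("loss with lr=0.25:", loss(large_step, curvatures))
```

In this sketch the only thing that changes between the two runs is the learning rate, which is enough to decide whether the sharp direction is tolerated at all.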

Scaling Laws & Emergent Abilities

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
