The paper investigates replaying generic pre-training data while fine-tuning language models for specific target domains. Generic data is conventionally mixed into fine-tuning only to prevent catastrophic forgetting, but the authors find that replaying it can also improve performance on the target task, and that replay helps more when less target data was present during pre-training. In controlled experiments with 150M-parameter models, replay improves target data efficiency by up to 1.87x for fine-tuning and 2.06x for mid-training; when fine-tuning 8B-parameter models, it improves agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
Replaying generic pre-training data during fine-tuning improves target data efficiency by up to about 2x, challenging the common practice of mixing it in only to prevent catastrophic forgetting.
To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.
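To make the idea of generic replay concrete, here is a minimal sketch of a fine-tuning data stream that interleaves target examples with replayed generic examples. It is an illustration only, not the authors' pipeline: the function name `replay_mixture`, the `replay_fraction` parameter, and the per-step random draw are all assumptions, since the abstract does not specify the mixing ratio or the granularity at which the data is mixed.

```python
import random
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def replay_mixture(
    target: Sequence[T],
    generic: Sequence[T],
    replay_fraction: float,
    num_steps: int,
    seed: int = 0,
) -> Iterator[Tuple[str, T]]:
    """Yield (source, example) pairs for num_steps fine-tuning steps.

    Hypothetical scheme: each step draws from the generic pool with
    probability replay_fraction, otherwise from the target pool. Both
    pools are cycled in order, so a small target set is repeated.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    if not target:
        raise ValueError("target pool must not be empty")
    if replay_fraction > 0.0 and not generic:
        raise ValueError("generic pool must not be empty when replaying")

    rng = random.Random(seed)  # seeded so the schedule is reproducible
    t_idx = 0
    g_idx = 0
    for _ in range(num_steps):
        if rng.random() < replay_fraction:
            yield "generic", generic[g_idx % len(generic)]
            g_idx += 1
        else:
            yield "target", target[t_idx % len(target)]
            t_idx += 1
```

Setting `replay_fraction=0.0` recovers fine-tuning on target data alone, which gives a baseline against which a replay schedule could be compared at the same number of target examples.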