This paper investigates synthetic data augmentation techniques for pre-training language models, focusing on improving data efficiency. The authors demonstrate that pre-training on a mixture of web data and synthetic rephrases improves validation loss and benchmark accuracy. The key finding is that constructing "megadocs" by stitching together rephrases of the same source document, or by inserting rationales into it, leads to superior loss scaling and long-context performance compared to simple rephrasing, achieving up to 1.8x data efficiency.
Forget rephrasing: stitching synthetic text into "megadocs" unlocks surprisingly better pre-training, especially for long-context tasks, and keeps improving as you scale.
Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.
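The two megadoc constructions can be sketched in a few lines. This is a minimal illustration of the idea only, not the paper's implementation: the function names, the separator string, and the interleaving order for rationales are all assumptions for the sake of the example.

```python
# Illustrative sketch (not the paper's code) of the two megadoc
# constructions: "stitching" concatenates several synthetic rephrases of
# the same source document into one long document, while "stretching"
# interleaves synthetic rationales between the passages of the original
# document. SEP is an assumed document-internal separator.

SEP = "\n\n"

def stitch_megadoc(rephrases: list[str]) -> str:
    """Join synthetic rephrases of one source document into a single megadoc."""
    return SEP.join(rephrases)

def stretch_megadoc(passages: list[str], rationales: list[str]) -> str:
    """Interleave each passage with a synthetic rationale about it."""
    parts = []
    for passage, rationale in zip(passages, rationales):
        parts.append(passage)
        parts.append(rationale)
    return SEP.join(parts)

# Example: 32 rephrases of one web document become one long training
# document instead of 32 short ones.
rephrases = [f"Rephrase {i} of the source document." for i in range(32)]
megadoc = stitch_megadoc(rephrases)
```

Either way, tokens derived from the same source document end up adjacent in one long sequence, which is what exposes the model to long-range dependencies that many short rephrased documents would not provide.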