Feb 25, 2026 | arXiv:2602.22014

A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT

Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary

AI Summary

This paper investigates the impact of diversity-driven data sampling on the pre-training of French ModernBERT, aiming to reduce dataset size while maintaining performance. The authors compared several diversity-driven sampling algorithms against random sampling. They found that diversity-driven sampling can achieve up to a 10-point performance gain on some tasks compared to randomly sampled data of similar size, and that a model trained on a smaller, diversity-driven dataset can match the performance of a model trained on a much larger, randomly sampled dataset.
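To make the idea of diversity-driven sampling concrete, here is a minimal sketch. It is not one of the algorithms compared in the paper, which this page does not describe. It assumes lexical type coverage (how many unseen word types a document adds per token) as a stand-in for diversity, whitespace tokenization, and a hypothetical function name `greedy_diverse_sample`.

```python
def greedy_diverse_sample(docs, token_budget):
    """Greedily pick documents that add the most unseen word types per token.

    Assumption: "diversity" is approximated by lexical type coverage, and
    tokens are whitespace-separated words. Stops when the budget is used up
    or no remaining document that fits contributes a new word type.
    """
    tokenized = [doc.split() for doc in docs]
    remaining = list(range(len(docs)))
    selected, seen, used = [], set(), 0

    while remaining:
        best, best_gain = None, 0.0
        for i in remaining:
            tokens = tokenized[i]
            # Skip empty documents and documents that would exceed the budget.
            if not tokens or used + len(tokens) > token_budget:
                continue
            # Fraction of this document's tokens that are new word types.
            gain = len(set(tokens) - seen) / len(tokens)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        selected.append(best)
        seen.update(tokenized[best])
        used += len(tokenized[best])
        remaining.remove(best)

    return [docs[i] for i in selected]


corpus = [
    "le chat dort",
    "le chat dort",
    "un chien court vite",
    "le chat mange",
]
sample = greedy_diverse_sample(corpus, token_budget=7)
print(sample)
```

With a 7-token budget, the duplicate sentence is never chosen because it adds no new word types, which is the behavior random sampling cannot guarantee. The quadratic loop is fine for illustration but would need an index or batching at corpus scale.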

Key Contribution

You can slash ModernBERT's pre-training data by over 90% and still match performance, simply by prioritizing dataset diversity.

Abstract

Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This calls for an investigation of the impact of diversity on ModernBERT pre-training. We do so in this study, with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare diversity-driven sampling algorithms in order to pick the best one. We find that, on some tasks, diversity-driven sampling yields a gain of 10 points relative to randomly sampled pre-training data of commensurate size. We also see that a model pre-trained for 483h on a diversity-driven dataset of 150M tokens can achieve performance commensurate with that of a model pre-trained for 1,775h on a randomly sampled dataset of 2.4B tokens.
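The savings implied by the abstract's figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the helper name `reduction` is illustrative, and comparing the two training times assumes both runs used comparable hardware, which the abstract does not state.

```python
# Figures quoted in the abstract.
DIVERSE_TOKENS = 150e6   # diversity-driven dataset: 150M tokens
RANDOM_TOKENS = 2.4e9    # randomly sampled dataset: 2.4B tokens
DIVERSE_HOURS = 483      # pre-training time on the diverse dataset
RANDOM_HOURS = 1775      # pre-training time on the random dataset


def reduction(small, large):
    """Fractional reduction when going from `large` down to `small`."""
    return 1 - small / large


token_reduction = reduction(DIVERSE_TOKENS, RANDOM_TOKENS)
time_reduction = reduction(DIVERSE_HOURS, RANDOM_HOURS)

print(f"Token reduction: {token_reduction:.2%}")
print(f"Training-time reduction: {time_reduction:.2%}")
```

The diversity-driven dataset is 1/16 the size of the random one, a reduction of about 94% in tokens, while training time drops by roughly 73%. Training time does not shrink in proportion to dataset size, presumably because the smaller dataset is seen for more passes.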

Data Curation & Synthetic Data, Natural Language Processing, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 76

Year: 2026

Venue: N/A
