The paper introduces Data Darwinism, a ten-level taxonomy for data-model co-evolution, and validates it by constructing Darwin-Science, a 900B-token corpus of scientific literature processed up to level L5 using LLMs for generative refinement and cognitive completion. The authors pre-train daVinci-origin-3B/7B models from scratch without scientific content, then continue pre-training on Darwin-Science, demonstrating significant performance gains over these contamination-free baselines, particularly on domain-aligned tasks. The results confirm that higher-level data processing, as defined by the Data Darwinism taxonomy, unlocks latent value in scientific text for pre-training.
Frontier LLMs can unlock substantial pre-training gains in scientific domains by refining and completing raw scientific text, yielding up to a +8.40 point improvement on domain-aligned tasks for the 7B model.
Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this framework on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion), using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-train daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training on Darwin-Science, the models outperform these baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression through the levels to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.
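To make the L4/L5 stages described above concrete, the sketch below shows one plausible refine-then-complete pass over a raw passage: a first LLM call for Generative Refinement (L4) and a second for Cognitive Completion (L5). The prompts and the ask_llm callable are illustrative assumptions, not the paper's actual pipeline or prompts.

    # Minimal sketch of an L4 -> L5 pass: one LLM call to refine a raw scientific
    # passage (L4, Generative Refinement) and a second call to make implicit
    # reasoning and terminology explicit (L5, Cognitive Completion).
    # The prompts and the `ask_llm` callable are illustrative assumptions.
    from typing import Callable

    L4_PROMPT = (
        "Rewrite the following scientific passage for clarity, fixing grammar and "
        "notation while preserving every technical claim:\n\n{passage}"
    )
    L5_PROMPT = (
        "Expand the following passage by making implicit reasoning steps explicit "
        "and briefly defining specialized terminology on first use:\n\n{passage}"
    )

    def process_to_l5(passage: str, ask_llm: Callable[[str], str]) -> str:
        """Apply L4 (Generative Refinement), then L5 (Cognitive Completion)."""
        refined = ask_llm(L4_PROMPT.format(passage=passage))    # L4: clean up the raw text
        completed = ask_llm(L5_PROMPT.format(passage=refined))  # L5: surface latent reasoning
        return completed

In practice such a pipeline would run over every document in the corpus with a frontier LLM supplied as ask_llm; the abstract does not specify the prompts or the orchestration details.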