The authors introduce EarlySciRev, a dataset of 578k validated scientific text revision pairs extracted from arXiv LaTeX source files by aligning commented-out text with nearby final text. The dataset captures early-stage revisions, which existing resources rarely expose, and so supports study of how scientific writing evolves during drafting. The authors also provide a human-annotated benchmark for revision detection, supporting further research on revision modeling and LLM-assisted editing.
EarlySciRev shows how scientists revise their work in practice, drawing on early-stage revisions that were previously hidden in LaTeX comments.
Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
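To make the extraction idea concrete, the following is a minimal sketch of how commented-out paragraphs might be paired with nearby final text. It is not the authors' pipeline: the function names (`split_blocks`, `candidate_pairs`), the block window, and the similarity thresholds are all assumptions chosen for illustration. Word-level similarity from Python's `difflib` stands in for whatever alignment the authors use, and the LLM-based filtering step is not reproduced.

```python
import re
from difflib import SequenceMatcher

# A line counts as a comment only if '%' is its first non-blank character.
# Assumption: inline trailing comments and escaped '\%' are ignored here.
COMMENT_RE = re.compile(r"^\s*%+\s?(.*)$")


def split_blocks(latex_source):
    """Group consecutive lines into ('comment' | 'text', paragraph) blocks.

    Blank lines and switches between commented and uncommented lines
    both end the current block.
    """
    blocks = []
    kind, buf = None, []
    for line in latex_source.splitlines():
        m = COMMENT_RE.match(line)
        if m:
            line_kind, content = "comment", m.group(1).strip()
        elif line.strip():
            line_kind, content = "text", line.strip()
        else:
            line_kind, content = None, ""
        if line_kind != kind:
            if kind is not None and buf:
                blocks.append((kind, " ".join(buf)))
            kind, buf = line_kind, []
        if line_kind is not None and content:
            buf.append(content)
    if kind is not None and buf:
        blocks.append((kind, " ".join(buf)))
    return blocks


def candidate_pairs(latex_source, window=2, min_sim=0.3, max_sim=0.95):
    """Pair each commented block with its most similar nearby text block.

    `window` is the number of blocks searched on each side (hypothetical
    default). Pairs below `min_sim` are treated as unrelated (for example
    TODO notes); pairs above `max_sim` are treated as near-duplicates.
    """
    blocks = split_blocks(latex_source)
    pairs = []
    for i, (kind, old) in enumerate(blocks):
        if kind != "comment":
            continue
        best = None
        lo, hi = max(0, i - window), min(len(blocks), i + window + 1)
        for j in range(lo, hi):
            other_kind, new = blocks[j]
            if other_kind != "text":
                continue
            # Word-level similarity ratio in [0, 1].
            sim = SequenceMatcher(None, old.split(), new.split()).ratio()
            if min_sim <= sim <= max_sim and (best is None or sim > best[2]):
                best = (old, new, sim)
        if best:
            pairs.append(best)
    return pairs


if __name__ == "__main__":
    sample = r"""
% We propose a new method for extract revisions from papers.
We introduce a method for extracting revisions from arXiv papers.

\section{Related Work}
% TODO: cite more
Prior work focuses on late-stage revisions.
"""
    for old, new, sim in candidate_pairs(sample):
        print(f"{sim:.2f}\n  old: {old}\n  new: {new}")
```

In this sketch the `TODO` comment yields no pair because it shares no words with the surrounding text, which illustrates why some filtering of candidates is needed. Per the abstract, the authors handle that step with an LLM rather than a fixed threshold.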