Mar 30, 2026 · arXiv:2603.28515

EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

Léane Jourdan, Julien Aubert-Béduchaud, Yannis Chupin, Marah Baccari, Florian Boudin

AI Summary

The authors introduce EarlySciRev, a dataset of 578k validated scientific text revision pairs extracted from arXiv LaTeX source files by identifying and aligning commented-out text with nearby final text. This dataset captures early-stage revisions, a previously under-explored area, offering a unique resource for studying the evolution of scientific writing. They also provide a human-annotated benchmark for revision detection, enabling further research on revision modeling and LLM-assisted editing.

Key Contribution

EarlySciRev exposes how scientists revise their work during early drafting by recovering revisions that were previously hidden in LaTeX comments, a stage that publicly available resources rarely capture.

Abstract

Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
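The alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `extract_candidate_pairs`, the parameters `min_words` and `min_similarity`, and the word-level similarity threshold are all assumptions made for illustration. In particular, the paper retains genuine revisions with LLM-based filtering, whereas this sketch uses a cheap lexical similarity check as a stand-in.

```python
from difflib import SequenceMatcher


def extract_candidate_pairs(latex_source, min_words=5, min_similarity=0.3):
    """Pair each commented-out block with the nearest following paragraph.

    Hypothetical sketch: only whole-line comments (lines starting with %)
    are considered; trailing inline comments and escaped \\% are ignored.
    Returns a list of (commented_text, final_text, similarity) tuples.
    """
    pairs = []
    comment_buf = []  # lines of the commented-out block being read
    pending = None    # commented block waiting for a nearby paragraph
    para_buf = []     # lines of the uncommented paragraph being read

    def flush_paragraph():
        nonlocal pending
        if not para_buf:
            return
        final = " ".join(para_buf)
        if pending is not None:
            # Word-level similarity; a stand-in for the paper's LLM filter.
            sim = SequenceMatcher(None, pending.split(), final.split()).ratio()
            if min_similarity <= sim < 1.0:
                pairs.append((pending, final, sim))
            pending = None
        para_buf.clear()

    # The extra empty line forces a final flush at end of input.
    for line in latex_source.splitlines() + [""]:
        stripped = line.strip()
        if stripped.startswith("%"):
            comment_buf.append(stripped.lstrip("%").strip())
            continue
        if comment_buf:
            # A non-comment line closes the commented block.
            text = " ".join(part for part in comment_buf if part)
            if len(text.split()) >= min_words:
                pending = text
            comment_buf.clear()
        if stripped:
            para_buf.append(stripped)
        else:
            flush_paragraph()
    return pairs


sample = r"""
% We propose a new method for parsing sentences that is fast.
We introduce a fast new method for parsing sentences.

% TODO fix
Unrelated paragraph about something else entirely.
"""

pairs = extract_candidate_pairs(sample)
for old, new, sim in pairs:
    print(f"{sim:.2f} | {old} -> {new}")
```

Short comments such as `% TODO fix` fall below `min_words` and are dropped, which mirrors the need to separate genuine discarded formulations from notes and debris. A real pipeline would also need to handle inline comments, `\begin{comment}` environments, and LaTeX markup inside the text.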

Data Curation & Synthetic Data, Natural Language Processing, Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
