This paper presents a case study of training a 1.36B-parameter scientific language model from raw arXiv LaTeX sources, detailing an end-to-end pipeline encompassing data preprocessing, domain-aware tokenization, and transformer training. The study analyzes training stability, scaling behavior, and data yield losses under constrained compute (2xA100 GPUs), revealing the significant impact of preprocessing and tokenization choices on token volume and symbolic stability. The authors demonstrate stable training with 52B tokens, providing practical insights for researchers training domain-specific models with limited resources.
Training a scientific language model isn't just about compute – preprocessing choices and I/O bottlenecks can make or break your domain-specific LLM.
While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.
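To make the "data-rich regime" concrete, the two headline figures in the abstract (1.36B parameters, 52B pretraining tokens) can be turned into a tokens-per-parameter ratio of roughly 38. The sketch below does that arithmetic. The 20 tokens-per-parameter reference point and the 6·N·D FLOPs approximation are outside rules of thumb assumed here for illustration; neither is taken from the paper, and the variable names are hypothetical.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
N_PARAMS = 1.36e9  # model parameters (from the abstract)
N_TOKENS = 52e9    # pretraining tokens (from the abstract)

# Assumption, not from the paper: a commonly cited compute-optimal
# heuristic of about 20 training tokens per parameter.
REFERENCE_TOKENS_PER_PARAM = 20.0

# How many training tokens the model sees per parameter.
tokens_per_param = N_TOKENS / N_PARAMS

# How far past the assumed reference point the run goes.
ratio_to_reference = tokens_per_param / REFERENCE_TOKENS_PER_PARAM

# Assumption, not from the paper: the standard ~6 * N * D estimate of
# training FLOPs for a dense transformer.
approx_train_flops = 6 * N_PARAMS * N_TOKENS

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"multiple of the assumed 20 tokens/param heuristic: {ratio_to_reference:.2f}x")
print(f"approximate training compute: {approx_train_flops:.2e} FLOPs")
```

Under these assumptions the run sits at nearly twice the reference ratio, which is one way to read "data-rich"; the paper itself may define the regime differently.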