Machine Learning Optimization Lab | Apr 14, 2026 | arXiv:2604.12973

An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience

Jonathan Coles, Stefano Schuppli, Lukas Drescher, Fawzi Roberto Mohamed, Elia Palme, Henrique Mendonça, Miguel Gila, Mark Klein, Maxime Martinasso, Joost VandeVondele, Torsten Hoefler, Thomas Schulthess, Josh Romero, Igor Gorodetsky, Ryan Hankins, Isa Wazirzada, Martin Jaggi, Antoine Bosselut, Imanol Schlag, Antoni-Joan Solergibert i Llaquet, Alejandro Hernández Cano, Theofilos Ioannis Manitaras, Nicholas John Browning

AI Summary

This paper details the engineering efforts to train Apertus, a 70B parameter multilingual foundation model, on the Alps supercomputer, marking a first for academia at this scale. The authors addressed challenges in adapting HPC infrastructure for LLM training, including storage bottlenecks and interconnect stability. Their work provides a blueprint for public institutions aiming to develop sovereign AI capabilities by transforming supercomputers into robust ML platforms capable of sustained, iterative model development.

Key Contribution

Training a 70B parameter open-source LLM on a supercomputer reveals the hidden engineering hurdles and infrastructure adaptations needed to democratize large-scale AI development beyond the private sector.

Abstract

Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the associated capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. In a first-of-its-kind achievement for academia at the 70B-parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient, software-defined Machine Learning Platform. Finally, we discuss the post-training requirements and evolution of our Machine Learning Platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine-tuning foundation models, that extends well beyond a single model training run.

Distributed Systems & Hardware | Open-Source Models & Weights | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
