This paper details the engineering efforts to train Apertus, a 70B parameter multilingual foundation model, on the Alps supercomputer, marking a first for academia at this scale. The authors addressed challenges in adapting HPC infrastructure for LLM training, including storage bottlenecks and interconnect stability. Their work provides a blueprint for public institutions aiming to develop sovereign AI capabilities by transforming supercomputers into robust ML platforms capable of sustained, iterative model development.
Training a 70B parameter open-source LLM on a supercomputer reveals the hidden engineering hurdles and infrastructure adaptations needed to democratize large-scale AI development beyond the private sector.
Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. In a first-of-its-kind achievement for academia at the 70B parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient software-defined Machine Learning Platform. Finally, we discuss the post-training requirements and the evolution of our Machine Learning Platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine-tuning foundation models, that extends well beyond a single model training run.