ETH | Apr 14, 2026 | arXiv:2604.12599

Beyond Pre-Training: The Full Lifecycle of Foundation Models on HPC Systems

Stefano Schuppli, Joost VandeVondele, Maxime Martinasso

AI Summary

This paper examines the challenges of supporting the full lifecycle of foundation models (pre-training, fine-tuning, and inference) on HPC systems, which traditionally focus on pre-training. It presents a hybrid cloud-native platform developed at the Swiss National Supercomputing Centre (CSCS) that combines GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure, orchestrated by Kubernetes, to bridge the gap between HPC batch processing and service-oriented workflows. The paper reports on initial investigations into fine-tuning pipelines and inference services, providing a blueprint for integrating "AI Factories" services into supercomputers.

Key Contribution

Supercomputers can evolve beyond just pre-training to become comprehensive "AI Factories" by adopting hybrid cloud-native architectures that support the entire lifecycle of foundation models.

Abstract

Large-scale pre-training of Foundational Models (FM) constitutes a computationally intensive first phase for enabling AI across diverse scientific and societal applications. This first phase has positioned High-Performance Computing (HPC) facilities as indispensable backbones of "Sovereign AI" initiatives. While the massive throughput requirements of FM pre-training align with the traditional capability-oriented mission of HPC, subsequent phases of the AI lifecycle, typically referred to as fine-tuning and inference, introduce operational paradigms that can conflict with established batch-processing environments. Moreover, these phases are not computationally trivial: they often require substantial high-end compute resources while exhibiting hardware utilization patterns that differ significantly from those of pre-training. This paper addresses the architectural and strategic challenges of operationalizing a complete AI lifecycle within a national supercomputing facility. We present a hybrid cloud-native platform being developed and deployed at the Swiss National Supercomputing Centre (CSCS) that combines diskless GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure. Orchestrated by Kubernetes, this novel service architecture bridges the gap between HPC batch processing and service-oriented workflows. We report our initial investigations into fine-tuning pipelines and highly available inference services, analyzing the associated trade-offs while improving user productivity. Our findings offer a blueprint for enabling supercomputers to integrate "AI Factories" services and workflows, supporting AI innovations into end-to-end scientific and industrial use cases.

Distributed Systems & Hardware | Scientific Discovery & Drug Design | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
