MegaTrain is a memory-centric system for training >100B-parameter LLMs in full precision on a single GPU: parameters are stored in CPU memory and streamed to the GPU for computation. To overcome the CPU-GPU bandwidth bottleneck, MegaTrain uses a pipelined double-buffered execution engine and replaces persistent autograd graphs with stateless layer templates. Experiments show MegaTrain achieves 1.84x the throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models and enables 7B model training with a 512k-token context on a single GH200.
MegaTrain makes training massive LLMs on a single GPU feasible, potentially democratizing access to large-scale model development.
We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host (CPU) memory and treats GPUs as transient compute engines: for each layer, parameters are streamed in and gradients are streamed out, minimizing persistent device state. To overcome the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. (1) We introduce a pipelined, double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. (2) We replace persistent autograd graphs with stateless layer templates that bind weights dynamically as they stream in, eliminating persistent graph metadata while providing scheduling flexibility. On a single H200 GPU with 1.5TB of host memory, MegaTrain reliably trains models up to 120B parameters, and it achieves 1.84$\times$ the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with a 512k-token context on a single GH200.
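As a rough illustration of these two ideas, the following is a minimal PyTorch sketch, not the authors' implementation: a stateless layer template that binds weights at call time, and a double-buffered loop that prefetches the next layer's parameters on a dedicated copy stream while the current layer runs on a compute stream. The toy Linear layers and names such as `layer_template`, `prefetch`, and `host_layers` are assumptions made for the example.

```python
# Minimal sketch of streaming training with double buffering and stateless
# layer templates; not MegaTrain's actual API.
import torch
import torch.nn.functional as F

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()     # host->device parameter transfers
compute_stream = torch.cuda.Stream()  # layer computation

def layer_template(x, weight, bias):
    # Stateless layer template: the computation is fixed, and the weights are
    # bound at call time as they arrive from host memory, so no persistent
    # module state lives on the GPU between layers.
    return F.gelu(F.linear(x, weight, bias))

# Parameters stay in pinned host memory so async H2D copies can overlap compute.
host_layers = [
    {"weight": torch.randn(4096, 4096, pin_memory=True),
     "bias": torch.zeros(4096, pin_memory=True)}
    for _ in range(8)
]

def prefetch(layer):
    # Issue asynchronous H2D copies for one layer on the copy stream and return
    # the device buffers plus an event marking when they are ready.
    with torch.cuda.stream(copy_stream):
        buf = {k: v.to(device, non_blocking=True) for k, v in layer.items()}
        ready = torch.cuda.Event()
        ready.record(copy_stream)
    return buf, ready

x = torch.randn(16, 4096, device=device)

# Double buffering: while layer i runs on the compute stream, layer i+1's
# parameters are already in flight on the copy stream.
next_buf, next_ready = prefetch(host_layers[0])
for i in range(len(host_layers)):
    cur_buf, cur_ready = next_buf, next_ready
    if i + 1 < len(host_layers):
        next_buf, next_ready = prefetch(host_layers[i + 1])
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(cur_ready)  # wait only for this layer's copy
        x = layer_template(x, cur_buf["weight"], cur_buf["bias"])

# A fuller implementation would also offload gradients on the copy stream during
# the backward pass and guard cross-stream memory reuse (e.g. Tensor.record_stream).
torch.cuda.synchronize()
```

The point of the sketch is that no per-layer parameter buffer or autograd-graph metadata persists on the device: host memory holds the model, and the GPU ideally holds only the in-flight buffers for the current and next layers.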