SlideFormer is a system for fine-tuning large language models (LLMs) on a single GPU. It uses a lightweight asynchronous engine that overlaps GPU computation with CPU updates and multi-tier I/O, a heterogeneous memory management scheme that reduces peak memory usage, and optimized Triton kernels that address performance bottlenecks. The system achieves up to 6.27x higher throughput and roughly halves memory usage compared to baselines, enabling fine-tuning of 123B+ models on a single RTX 4090.
Fine-tune 123B+ parameter models on a single RTX 4090 with SlideFormer, a system that supports up to 6x larger models and 8x larger batch sizes.
Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory demands exceed the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) A lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O. (2) A highly efficient heterogeneous memory management scheme that significantly reduces peak memory usage. (3) Optimized Triton kernels that address key bottlenecks, together with integrated advanced I/O. This co-design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8x larger batch sizes and 6x larger models. In evaluations, SlideFormer achieves 1.40x to 6.27x higher throughput while roughly halving CPU/GPU memory usage compared to baselines, sustaining >95% of peak performance on both NVIDIA and AMD GPUs.
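The sliding-window idea in point (1) can be illustrated with a toy sketch. This is not SlideFormer's implementation: it uses only Python's standard library, and every name in it (`load_layer`, `compute`, `cpu_update`, `sliding_window_step`) is hypothetical. Plain lists stand in for weights, a thread pool stands in for asynchronous copies and CPU-side optimizer work, and the layer output doubles as a fake gradient. It only shows the scheduling pattern: prefetch the next layer while the current one is being computed, and hand each update to the CPU without stalling the loop.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(store, i):
    # Stand-in for copying layer i's weights from host memory to the GPU.
    return list(store[i])


def compute(weights, x):
    # Stand-in for the GPU work on one layer.
    return [w * x for w in weights]


def cpu_update(store, i, grads, lr):
    # Stand-in for an optimizer step executed on the CPU.
    store[i] = [w - lr * g for w, g in zip(store[i], grads)]


def sliding_window_step(store, x, lr=0.1):
    """Visit layers in order with at most two resident: current and prefetched."""
    outputs = []
    pending = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        next_load = pool.submit(load_layer, store, 0)
        for i in range(len(store)):
            weights = next_load.result()  # wait for this layer's weights
            if i + 1 < len(store):
                # Prefetch the next layer while this one is being computed.
                next_load = pool.submit(load_layer, store, i + 1)
            out = compute(weights, x)
            outputs.append(out)
            # Toy assumption: reuse the output as the gradient for this layer.
            pending.append(pool.submit(cpu_update, store, i, out, lr))
        for task in pending:
            task.result()  # make sure every CPU update has finished
    return outputs


if __name__ == "__main__":
    host_store = [[1.0, 2.0], [3.0]]
    print(sliding_window_step(host_store, x=2.0, lr=0.5))
    print(host_store)
```

In a real system the thread pool would presumably be replaced by device streams and pinned-memory transfers, and the loop would cover both forward and backward passes; the sketch leaves those details out.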