This paper introduces InfiniLoRA, a disaggregated LoRA serving system that targets the scalability and tail-latency problems of serving LoRA adapters for large language models, particularly those with MoE architectures. InfiniLoRA decouples LoRA execution from base-model inference, using a shared LoRA Server with parallelism-aware execution and SLO-driven provisioning. Compared with coupled serving designs, the reported results are an average 3.05x increase in serviceable request rate under strict latency SLOs and a 54.0% improvement in the percentage of LoRA adapters that satisfy the SLO requirement.
Serving LoRA adapters at scale doesn't have to crush your latency SLOs: InfiniLoRA disaggregates LoRA execution to sustain about 3x the serviceable request rate under strict latency SLOs, with far more adapters meeting their SLO.
LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as MoE significantly increase LoRA memory cost, making existing coupled LoRA serving designs poorly scalable and prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA can achieve an average $3.05\times$ increase in serviceable request rate under strict latency SLOs, and improve the percentage of LoRA adapters satisfying the SLO requirement by 54.0%.
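The decoupling rests on a standard property of LoRA: the adapted output is the frozen base projection plus a low-rank update, y = Wx + B(Ax), and both terms depend only on the input x. The sketch below illustrates just that additive split in plain Python. It is not InfiniLoRA's implementation; the function names (`matvec`, `base_forward`, `lora_delta`), the toy matrices, and the `scale` parameter are assumptions made for illustration, and nothing here models the LoRA Server, its provisioning, or its communication path.

```python
# Illustrative sketch only: shows why the LoRA term can be computed apart
# from the base model. All names and values below are hypothetical.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def base_forward(W, x):
    """Base-model side: the frozen projection W @ x, shared by every adapter."""
    return matvec(W, x)

def lora_delta(A, B, x, scale=1.0):
    """LoRA side: the low-rank update scale * B @ (A @ x) for one adapter."""
    return [scale * v for v in matvec(B, matvec(A, x))]

# Toy sizes: hidden dimension 3, LoRA rank 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 2.0, 3.0]]        # rank x hidden
B = [[0.5], [0.0], [-0.5]]   # hidden x rank
x = [1.0, 1.0, 1.0]

# Both terms need only x, so they can be produced independently
# (conceptually, by different workers) and summed afterwards.
base_out = base_forward(W, x)
delta = lora_delta(A, B, x)
y = [b + d for b, d in zip(base_out, delta)]

# Coupled reference: merge the adapter into the weights, then project.
BA = [[b_row[0] * a for a in A[0]] for b_row in B]
merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_coupled = matvec(merged, x)
```

In this toy case `y` and `y_coupled` agree, which is the property that lets a separate service own the adapter weights while the base model stays adapter-agnostic.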