NUS, ByteDance, HKUST, SJTU, SMU | Apr 8, 2026 | arXiv:2604.07173

InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

Hongyu Chen, Letian Ruan, Zilin Xu, Yuchen Li, Xinyu Chen, Jingwen Leng, Minyi Guo, Shixuan Sun

AI Summary

This paper introduces InfiniLoRA, a disaggregated LoRA serving system that targets the poor scalability and tail-latency inflation of coupled LoRA serving, problems that worsen on MoE architectures because they increase LoRA memory cost. InfiniLoRA decouples LoRA execution from base-model inference, using a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations. Under strict latency SLOs, the authors report an average 3.05x increase in serviceable request rate and a 54.0% improvement in the percentage of LoRA adapters that satisfy the SLO, compared to coupled serving designs.

Key Contribution

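Serving LoRA adapters at scale doesn't have to crush your latency SLOs: InfiniLoRA disaggregates LoRA execution from base-model inference and sustains an average 3.05x higher serviceable request rate under strict latency SLOs, while improving the percentage of LoRA adapters that meet the SLO by 54.0%.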

Abstract

LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as MoE significantly increase LoRA memory cost, making existing coupled LoRA serving designs poorly scalable and prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA can achieve an average 3.05× increase in serviceable request rate under strict latency SLOs, and improve the percentage of LoRA adapters satisfying the SLO requirement by 54.0%.
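The decoupling described in the abstract rests on the standard LoRA identity: the adapted output is the frozen base output plus a low-rank correction, y = Wx + (alpha/r)·B(Ax), so the two terms can be computed in different places and summed. The following is a minimal sketch of that arithmetic only, not InfiniLoRA's implementation. The names `LoRAServer`, `BaseWorker`, `coupled_forward`, and `lora_param_count` are hypothetical, the "remote" call is a plain method call standing in for GPU or network communication, and the MoE parameter count assumes one adapter pair per expert, which the abstract does not specify.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class LoRAServer:
    """Hypothetical shared server: stores adapters only, no base weights."""

    def __init__(self):
        self.adapters = {}  # adapter_id -> (A, B, scale)

    def register(self, adapter_id, A, B, alpha):
        rank = len(A)  # A has shape (rank, d_in), B has shape (d_out, rank)
        self.adapters[adapter_id] = (A, B, alpha / rank)

    def delta(self, adapter_id, x):
        """Return scale * B @ (A @ x), the low-rank correction for input x."""
        A, B, scale = self.adapters[adapter_id]
        return [scale * v for v in matvec(B, matvec(A, x))]


class BaseWorker:
    """Hypothetical base-model worker: holds the frozen weight W, no adapters."""

    def __init__(self, W, lora_server):
        self.W = W
        self.lora_server = lora_server

    def forward(self, adapter_id, x):
        base = matvec(self.W, x)
        # Stand-in for a remote call; a real system would transfer x and delta.
        delta = self.lora_server.delta(adapter_id, x)
        return [b + d for b, d in zip(base, delta)]


def coupled_forward(W, A, B, alpha, x):
    """Reference path: apply the merged weight W + (alpha / r) * B @ A."""
    rank = len(A)
    scale = alpha / rank
    merged = [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]
    return matvec(merged, x)


def lora_param_count(d_in, d_out, rank, num_experts=1):
    """Parameters in A and B, assuming (hypothetically) one pair per expert."""
    return num_experts * rank * (d_in + d_out)


rng = random.Random(0)
d_in, d_out, rank, alpha = 8, 6, 2, 4.0
W = rand_matrix(d_out, d_in, rng)

server = LoRAServer()
for name in ("tenant-a", "tenant-b"):
    server.register(name, rand_matrix(rank, d_in, rng),
                    rand_matrix(d_out, rank, rng), alpha)

worker = BaseWorker(W, server)
x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]

y_split = worker.forward("tenant-a", x)
A, B, _ = server.adapters["tenant-a"]
y_merged = coupled_forward(W, A, B, alpha, x)

# The split and merged paths agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(y_split, y_merged))
```

Because the base worker never stores `A` or `B`, adapter memory in this sketch lives only on the shared server. Under the per-expert assumption, `lora_param_count` grows linearly with the number of experts, which is one way to read the abstract's remark that MoE increases LoRA memory cost.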

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
