This paper introduces FMplex, a model-serving system that virtualizes foundation models (FMs) so that multiple customized tasks share a single backbone while keeping their task-specific extensions and task-level isolation. A batch-aware fair-queueing scheduler combines weighted task-level sharing with batching across colocated tasks. Across the evaluated FM backbones and downstream tasks, FMplex reduces latency by up to 80% compared with spatial partitioning and 33.3% compared with best-effort co-location, while hosting up to 6x more tasks at cluster scale.
FMplex virtualizes foundation models so that tasks share a backbone, cutting latency by up to 80% and hosting up to 6x more tasks at cluster scale.
Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applications. Yet existing model-serving systems deploy each customized task as an independent model instance, thereby replicating heavyweight backbones, wasting accelerator memory, and losing opportunities to amortize batching and loading costs. This paper presents FMplex, a serving system that treats FM backbones as a virtualization substrate for deployment sharing. FMplex presents each task with a virtual foundation model (vFM), a logically private FM instance backed by a shared physical FM. This abstraction lets independently customized tasks share a backbone while preserving task-specific extensions, independent lifecycles, and task-level isolation. In addition, we propose a batch-aware fair-queueing scheduler that combines weighted task-level sharing with inter- and intra-task batching across colocated tasks. We implement an FMplex-based serving stack spanning task construction, sharing-aware deployment, and runtime execution. Across 7 FM backbones (16 variants) and 92 downstream tasks, FMplex reduces latency by up to 80% over spatial partitioning and 33.3% over best-effort co-location, while hosting up to 6x more tasks at cluster scale.
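To make the scheduling idea concrete, the following is a minimal sketch of weighted fair queueing that fills one batch from several colocated tasks. It is not FMplex's implementation: the abstract does not give the algorithm, so the class names (`VirtualFM`, `SharedBackboneScheduler`), the per-request virtual-time accounting, and the `backbone` and `heads` callables are all assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VirtualFM:
    """A task's logically private view of a shared backbone (hypothetical)."""
    name: str
    weight: float                                # relative share of the backbone
    queue: deque = field(default_factory=deque)  # pending requests for this task
    virtual_time: float = 0.0                    # service received, divided by weight


class SharedBackboneScheduler:
    """Weighted fair queueing that fills one batch across colocated tasks."""

    def __init__(self, max_batch):
        self.max_batch = max_batch
        self.tasks = {}

    def register(self, name, weight):
        self.tasks[name] = VirtualFM(name, weight)

    def submit(self, name, request):
        task = self.tasks[name]
        if not task.queue:
            # A task that was idle must not bank credit: move its virtual
            # time up to the least-served backlogged task before enqueueing.
            active = [t.virtual_time for t in self.tasks.values() if t.queue]
            if active:
                task.virtual_time = max(task.virtual_time, min(active))
        task.queue.append(request)

    def next_batch(self):
        """Pick requests one at a time from the least-served backlogged task."""
        batch = []
        while len(batch) < self.max_batch:
            backlogged = [t for t in self.tasks.values() if t.queue]
            if not backlogged:
                break
            task = min(backlogged, key=lambda t: (t.virtual_time, t.name))
            batch.append((task.name, task.queue.popleft()))
            task.virtual_time += 1.0 / task.weight  # charge one unit of service
        return batch


def run_batch(batch, backbone, heads):
    """One shared backbone pass, then each task's own extension (assumed callables)."""
    features = backbone([request for _, request in batch])
    return [heads[name](feat) for (name, _), feat in zip(batch, features)]


scheduler = SharedBackboneScheduler(max_batch=6)
scheduler.register("a", weight=2.0)
scheduler.register("b", weight=1.0)
for i in range(6):
    scheduler.submit("a", i)
for i in range(6):
    scheduler.submit("b", i)
batch = scheduler.next_batch()
```

With both tasks backlogged, the batch mixes requests from `a` and `b` in roughly the 2:1 ratio of their weights, and a single backbone pass serves all of them. A production scheduler would also have to account for request cost (for example, sequence length) and latency targets, which this sketch ignores.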