TU Berlin | Apr 29, 2026 | arXiv:2604.26881

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi, David Bermbach

AI Summary

The paper introduces FaaSMoE, a serverless architecture for multi-tenant Mixture-of-Experts (MoE) serving that uses Function-as-a-Service (FaaS) to decouple the control and execution planes. By deploying experts as stateless FaaS functions, FaaSMoE enables on-demand expert invocation and scale-to-zero, improving resource utilization in multi-tenant scenarios. An evaluation with Qwen1.5-moe-2.7B shows that FaaSMoE uses less than one third of the resources of a full-model baseline.

Key Contribution

Cut MoE serving resource usage to less than one third of a full-model baseline with FaaSMoE, a serverless architecture that scales experts on demand.

Abstract

Mixture-of-Experts (MoE) models offer high capacity at efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resources used by activated experts and the resources provisioned. This underutilization is even more pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading off per-expert elasticity for reduced invocation overhead. We implement a prototype with an open-source edge-oriented FaaS platform and evaluate it using Qwen1.5-moe-2.7B under multi-tenant workloads. Compared to a full-model baseline, FaaSMoE uses less than one third of the resources, demonstrating a practical and resource-efficient path towards scalable MoE serving in a multi-tenant environment.
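The granularity trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `function_id` and `plan_invocations`, the contiguous grouping of experts into functions, and the toy routing table are all assumptions made for illustration. The sketch only shows how packing more experts into one function reduces the number of distinct function invocations a control plane would issue for a batch, at the cost of coarser scaling units.

```python
from collections import defaultdict


def function_id(expert_id: int, experts_per_function: int) -> int:
    # Assumption: experts are packed into functions in contiguous blocks.
    return expert_id // experts_per_function


def plan_invocations(routing, experts_per_function):
    """Group routed tokens by the (hypothetical) function hosting each expert.

    routing: list where routing[t] is the list of expert ids chosen for token t.
    Returns {function_id: {expert_id: [token indices]}}.
    """
    plan = defaultdict(lambda: defaultdict(list))
    for token_idx, expert_ids in enumerate(routing):
        for expert in expert_ids:
            fid = function_id(expert, experts_per_function)
            plan[fid][expert].append(token_idx)
    return {fid: dict(experts) for fid, experts in plan.items()}


# Toy top-2 routing for four tokens over eight experts (made-up values).
routing = [[0, 5], [1, 5], [0, 7], [2, 6]]

fine = plan_invocations(routing, experts_per_function=1)    # one expert per function
coarse = plan_invocations(routing, experts_per_function=4)  # four experts per function

print(len(fine), "function invocations at fine granularity")
print(len(coarse), "function invocations at coarse granularity")
```

With one expert per function, every activated expert is a separate invocation and each can scale to zero independently; with four experts per function, the same batch touches fewer functions, but an idle expert stays resident whenever any expert sharing its function is active.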

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
