Apr 16, 2026 · arXiv:2604.15009

Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

AI Summary

The paper introduces Mixture-of-Experts Flow Matching (MoE-FM) to address a limitation of flow matching in language modeling: its difficulty representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. MoE-FM decomposes the global transport geometry in latent space into locally specialized vector fields. The resulting non-autoregressive language model, YAN, is instantiated with both Transformer and Mamba architectures and matches the generation quality of autoregressive (AR) and diffusion-based models while requiring as few as three sampling steps. The paper reports a 40x speedup over AR baselines and up to a 1000x speedup over diffusion language models.

Key Contribution

Flow matching with a mixture-of-experts vector field generates text in as few as three sampling steps, matching autoregressive and diffusion language models in quality while running 40x faster than autoregressive baselines and up to 1000x faster than diffusion language models.

Abstract

Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring as few as three sampling steps. This yields a $40\times$ speedup over AR baselines and up to a $10^3\times$ speedup over diffusion language models, demonstrating substantial efficiency advantages for language modeling.
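To make the idea of "locally specialized vector fields" concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the gating mechanism, the expert architecture, the training objective, or the sampler, so everything here is an assumption chosen for illustration: a softmax gate over linear experts, the standard linear-interpolation flow matching target, and a fixed-step Euler integrator. The names `MoEVectorField`, `flow_matching_loss`, and `euler_sample` are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MoEVectorField:
    """Toy gated mixture of K linear expert vector fields (illustrative only)."""

    def __init__(self, dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Gate and experts both see the latent x concatenated with the time t.
        self.W_gate = rng.normal(scale=0.1, size=(dim + 1, num_experts))
        self.W_exp = rng.normal(scale=0.1, size=(num_experts, dim + 1, dim))

    def _with_time(self, x, t):
        return np.concatenate([x, np.full((x.shape[0], 1), t)], axis=1)

    def gate(self, x, t):
        # Per-sample mixture weights over experts, shape (batch, num_experts).
        return softmax(self._with_time(x, t) @ self.W_gate)

    def __call__(self, x, t):
        xt = self._with_time(x, t)
        weights = softmax(xt @ self.W_gate)                  # (B, K)
        expert_v = np.einsum("bi,kid->bkd", xt, self.W_exp)  # (B, K, dim)
        # Velocity is the gate-weighted sum of the expert velocities.
        return np.einsum("bk,bkd->bd", weights, expert_v)


def flow_matching_loss(field, x0, x1, t):
    # Assumed objective: linear path x_t = (1 - t) x0 + t x1, target x1 - x0.
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return float(np.mean((field(xt, t) - target) ** 2))


def euler_sample(field, x0, num_steps=3):
    # Integrate dx/dt = v(x, t) from t = 0 to t = 1 with a fixed step size.
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * field(x, i * dt)
    return x
```

With `num_steps=3` the sampler calls the vector field three times in total, which is the sense in which a few-step flow model can be cheaper than token-by-token decoding or a long diffusion chain. How YAN maps latents back to tokens is not described in the abstract and is left out of this sketch.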

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Natural Language Processing

Citation Metrics

Citations: 0
Influential citations: 0
References: 83
Year: 2026
Venue: N/A
