Tsinghua AI

Beijing Haitian Ruisheng Science

Jun 9, 2026

arXiv:2606.10439

Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

Guodong Lin, Ziqi Chen, Yuxiang Fu, Ke Li, Wei-Qiang Zhang

AI Summary

This paper introduces a projector-based framework for integrating large language models (LLMs) into automatic speech recognition (ASR), targeting two challenges: multilingual generalization and modality alignment. It combines a Mixture of Experts (MoE) architecture, intended to improve cross-lingual adaptability, with a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. The authors report that the combination yields substantial improvements over strong baseline models, and they present the method as a step toward more accurate, robust, and generalizable LLM-based ASR systems.

Key Contribution

A projector-based LLM-ASR framework that combines a Mixture of Experts architecture with CIF-based dynamic downsampling, reported to surpass strong baseline models on multilingual ASR.

Abstract

The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
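The dynamic downsampling idea behind CIF can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cif_downsample`, the firing threshold of 1.0, and the choice to discard any leftover weight at the end of the sequence are all assumptions made for illustration. In a real system the per-frame weights would be predicted by a small learned network; here they are simply passed in.

```python
from typing import List


def cif_downsample(
    frames: List[List[float]],
    alphas: List[float],
    threshold: float = 1.0,
) -> List[List[float]]:
    """Illustrative Continuous Integrate-and-Fire pass (hypothetical helper).

    Accumulates encoder frames weighted by `alphas` and emits ("fires") one
    integrated vector each time the accumulated weight reaches `threshold`.
    A frame that crosses the boundary is split between two outputs.
    """
    if len(frames) != len(alphas):
        raise ValueError("frames and alphas must have the same length")
    if threshold <= 0:
        raise ValueError("threshold must be positive")

    dim = len(frames[0]) if frames else 0
    eps = 1e-9  # tolerance for floating-point accumulation
    outputs: List[List[float]] = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim

    for frame, alpha in zip(frames, alphas):
        remaining = alpha
        # Fire as long as this frame's weight completes the current output.
        while acc_weight + remaining >= threshold - eps:
            used = min(threshold - acc_weight, remaining)
            acc_vec = [a + used * x for a, x in zip(acc_vec, frame)]
            outputs.append(acc_vec)
            remaining -= used
            acc_weight = 0.0
            acc_vec = [0.0] * dim
        # Whatever weight is left starts (or continues) the next output.
        acc_weight += remaining
        acc_vec = [a + remaining * x for a, x in zip(acc_vec, frame)]

    # Assumption: leftover weight below the threshold is discarded.
    return outputs


if __name__ == "__main__":
    # Four 1-dimensional frames, each with weight 0.5: two frames per output.
    print(cif_downsample([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.5, 0.5, 0.5]))
    # [[1.5], [3.5]]
```

Because the number of outputs depends on the sum of the weights rather than on a fixed stride, the output length adapts to the content of each utterance, which is what makes the downsampling dynamic.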

Architecture Design (Transformers, SSMs, MoE)

Multimodal Models

Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
