This paper introduces a projector-based framework for integrating large language models (LLMs) into automatic speech recognition (ASR), targeting two challenges: multilingual generalization and modality alignment. It combines a Mixture of Experts (MoE) architecture, used to improve cross-lingual adaptability, with a Continuous Integrate-and-Fire (CIF) mechanism that performs dynamic downsampling and modality alignment. The authors report that the combination yields substantial improvements over strong baseline models, and they present the method as a step toward more accurate, robust, and generalizable LLM-based ASR systems.
By combining a Mixture of Experts with CIF-based dynamic downsampling, this framework improves multilingual ASR performance over strong baseline models.
The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
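As an illustration of the dynamic downsampling idea, the following is a minimal sketch of a CIF-style integrate-and-fire step in plain Python. It is not the authors' implementation: the function name `cif_downsample`, the per-frame weights `alphas` (assumed here to be already predicted and to lie in [0, threshold]), the threshold of 1.0, and the choice to drop the unfired tail are all assumptions made for this example.

```python
from typing import List


def cif_downsample(
    frames: List[List[float]],
    alphas: List[float],
    threshold: float = 1.0,
) -> List[List[float]]:
    """Hypothetical CIF-style downsampler (illustrative sketch only).

    Accumulates weighted frame vectors and emits ("fires") one output
    vector each time the accumulated weight reaches `threshold`.
    """
    if len(frames) != len(alphas):
        raise ValueError("frames and alphas must have the same length")

    dim = len(frames[0]) if frames else 0
    outputs: List[List[float]] = []
    acc_weight = 0.0          # weight integrated since the last firing
    acc_vec = [0.0] * dim     # weighted sum of frames since the last firing

    for frame, alpha in zip(frames, alphas):
        # Assumption: each weight is in [0, threshold], so a single frame
        # can trigger at most one firing.
        if not 0.0 <= alpha <= threshold:
            raise ValueError("each alpha must lie in [0, threshold]")

        if acc_weight + alpha < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += alpha
            acc_vec = [a + alpha * x for a, x in zip(acc_vec, frame)]
        else:
            # Fire: use only the part of alpha needed to reach the threshold.
            used = threshold - acc_weight
            outputs.append([a + used * x for a, x in zip(acc_vec, frame)])
            # The remaining weight of this frame starts the next output.
            remainder = alpha - used
            acc_weight = remainder
            acc_vec = [remainder * x for x in frame]

    # Simplification: any leftover weight below the threshold is dropped.
    return outputs


# Five 1-dimensional frames are reduced to two output vectors.
out = cif_downsample(
    [[1.0], [3.0], [2.0], [4.0], [8.0]],
    [0.5, 0.5, 0.25, 0.75, 0.5],
)
print(out)  # [[2.0], [3.5]]
```

Because the number of firings depends on the weights rather than on a fixed stride, the output length adapts to the input, which is what makes the downsampling dynamic. How the weights are predicted and how the resulting sequence is passed to the projector and LLM are not specified in the abstract.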