ECNU | UC Santa Cruz | Apr 20, 2026 | arXiv:2604.18105

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

Yuan Xie, Jiaqi Song, Guanghui Qiu, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Sheng Liu, Shengqing Liu, Yi Zhang, Ming Lei, Jie Gao, Jie Wu

AI Summary

NIM4-ASR, a production-oriented LLM-based ASR framework, is introduced to address scalability and hallucination issues in existing LLM-ASR systems. The framework employs a redesigned multi-stage training paradigm, including pre-training with a reformulated architecture, iterative asynchronous SFT, and ASR-specialized reinforcement learning. NIM4-ASR achieves state-of-the-art performance on public benchmarks with only 2.3B parameters and demonstrates superior performance on internal benchmarks, particularly in entity-intensive scenarios, while also supporting million-scale hotword customization via RAG.

Key Contribution

LLM-based ASR can be shrunk to 2.3B parameters and still beat larger models in real-world scenarios by carefully delineating encoder and LLM roles and using a multi-stage training approach.

Abstract

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.

Inference & Quantization, Natural Language Processing, Scaling Laws & Emergent Abilities, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 54

Year: 2026

Venue: N/A
