Search papers, labs, and topics across Lattice.
NIM4-ASR, a production-oriented LLM-based ASR framework, is introduced to address scalability and hallucination issues in existing LLM-ASR systems. The framework employs a redesigned multi-stage training paradigm, including pre-training with a reformulated architecture, iterative asynchronous SFT, and ASR-specialized reinforcement learning. NIM4-ASR achieves state-of-the-art performance on public benchmarks with only 2.3B parameters and demonstrates superior performance on internal benchmarks, particularly in entity-intensive scenarios, while also supporting million-scale hotword customization via RAG.
LLM-based ASR can be shrunk to 2.3B parameters and still beat larger models in real-world scenarios by carefully delineating encoder and LLM roles and using a multi-stage training approach.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.