This paper introduces chain-of-thought ASR (CoT-ASR), a novel approach that leverages LLMs to first analyze speech input and generate contextual analysis before performing speech recognition. To bridge the modality gap, the authors propose a CTC-guided Modality Adapter that aligns speech encoder outputs with the LLM's textual latent space. Experiments demonstrate that CoT-ASR achieves significant improvements over standard LLM-based ASR, with an 8.7% relative reduction in word error rate (WER) and a 16.9% relative reduction in entity error rate (EER).
Unleashing LLMs' reasoning powers on speech unlocks a new ASR paradigm, slashing error rates by up to 17% simply by having the model "think" before transcribing.
Although LLMs have been extended to speech inputs, effectively leveraging their rich knowledge and contextual understanding in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition, completing both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
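The core idea of the CTC-guided Modality Adapter — using CTC non-blank probabilities to re-weight speech frames before projecting them into the LLM's embedding space — can be sketched as follows. This is a minimal NumPy illustration of the weighting mechanism only; the function name, dimensions, and the use of a single linear projection are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ctc_guided_adapter(enc_out, ctc_logits, proj, blank_id=0):
    """Hypothetical sketch: weight each speech frame by its CTC
    non-blank probability, then project into the LLM's latent space.

    enc_out:    (frames, enc_dim)  speech encoder outputs
    ctc_logits: (frames, vocab)    logits from a CTC head over the frames
    proj:       (enc_dim, llm_dim) projection into the LLM embedding space
    """
    probs = softmax(ctc_logits)               # per-frame token distribution
    non_blank = 1.0 - probs[:, blank_id]      # P(frame carries content), shape (frames,)
    weighted = enc_out * non_blank[:, None]   # suppress blank (silence/padding) frames
    return weighted @ proj                    # (frames, llm_dim), fed to the LLM

# Toy shapes: 50 frames, 256-dim encoder, 100-token CTC vocab, 512-dim LLM.
rng = np.random.default_rng(0)
enc = rng.standard_normal((50, 256))
logits = rng.standard_normal((50, 100))
W = rng.standard_normal((256, 512)) * 0.01
out = ctc_guided_adapter(enc, logits, W)
print(out.shape)  # (50, 512)
```

The intuition is that frames the CTC head deems blank (mostly silence or transitions) are down-weighted, so the LLM receives a representation dominated by content-bearing frames, which narrows the speech-text modality gap.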