This paper introduces chain-of-thought ASR (CoT-ASR), a novel approach that leverages LLMs to first analyze speech input and generate contextual analysis before performing speech recognition. To bridge the modality gap, the authors propose a CTC-guided Modality Adapter that aligns speech encoder outputs with the LLM's textual latent space. Experiments demonstrate that CoT-ASR achieves significant improvements over standard LLM-based ASR, with an 8.7% relative reduction in word error rate (WER) and a 16.9% relative reduction in entity error rate (EER).
Unleashing LLMs' reasoning powers on speech unlocks a new ASR paradigm, slashing error rates by up to 17% simply by having the model "think" before transcribing.
Although LLMs have been extended to speech inputs, effectively leveraging their rich knowledge and contextual understanding in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition, completing both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
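The core idea of the CTC-guided Modality Adapter — using CTC non-blank probabilities to re-weight speech frames before projecting them into the LLM's embedding space — can be sketched as follows. This is a minimal NumPy illustration of the weighting mechanism only; the function name, dimensions, and the use of a single linear projection are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ctc_guided_adapter(enc_out, ctc_logits, proj, blank_id=0):
    """Hypothetical sketch: weight each speech frame by its CTC
    non-blank probability, then project into the LLM's latent space.

    enc_out:    (frames, enc_dim)  speech encoder outputs
    ctc_logits: (frames, vocab)    logits from a CTC head over the frames
    proj:       (enc_dim, llm_dim) projection into the LLM embedding space
    """
    probs = softmax(ctc_logits)               # per-frame token distribution
    non_blank = 1.0 - probs[:, blank_id]      # P(frame carries content), shape (frames,)
    weighted = enc_out * non_blank[:, None]   # suppress blank (silence/padding) frames
    return weighted @ proj                    # (frames, llm_dim), fed to the LLM

# Toy shapes: 50 frames, 256-dim encoder, 100-token CTC vocab, 512-dim LLM.
rng = np.random.default_rng(0)
enc = rng.standard_normal((50, 256))
logits = rng.standard_normal((50, 100))
W = rng.standard_normal((256, 512)) * 0.01
out = ctc_guided_adapter(enc, logits, W)
print(out.shape)  # (50, 512)
```

The intuition is that frames the CTC head deems blank (mostly silence or transitions) are down-weighted, so the LLM receives a representation dominated by content-bearing frames, which narrows the speech-text modality gap.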