IBM Research | Apr 14, 2026 | arXiv:2604.12398

Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

Sashi Novitasari, Takashi Fukuda, Kurata Gakuto, George Saon

AI Summary

This paper explores contextual biasing for speech-aware LLMs (SLLMs) to improve the recognition of rare or unseen "bias words" without relying on grapheme-to-phoneme (G2P) systems or phonetic knowledge. The authors use acoustic cues from common words whose pronunciations partially resemble those of the bias words, and they add bias word position prediction trained in a multi-output learning fashion. Experiments show a 16.3% reduction in bias word recognition errors compared to baseline systems, including on out-of-domain data.

Key Contribution

This work shows that ASR accuracy on rare words can be improved without phoneme sequences or G2P systems, by using acoustic cues from common words that sound partially similar to the target words.
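The pairing of bias words with similar-sounding common words can be illustrated with a minimal text-only sketch. Everything here is an assumption for illustration: the function name `build_bias_prompt`, the prompt wording, and the example words are hypothetical, and the page does not describe how the paper actually injects its acoustic cues into the model.

```python
def build_bias_prompt(bias_cues):
    """Format bias words and their common-word cues as one text prompt.

    bias_cues maps each bias word to a list of common words that sound
    partly like it. The cues are plain words supplied by the user, so no
    phoneme strings or G2P tool are involved.
    NOTE: the prompt template below is hypothetical, not the paper's.
    """
    entries = []
    for word, cues in bias_cues.items():
        if cues:
            entries.append(f"{word} (sounds like: {', '.join(cues)})")
        else:
            # A bias word with no cue is listed on its own.
            entries.append(word)
    return "Transcribe the audio. Bias words: " + "; ".join(entries)


# Hypothetical bias list: one word with cues, one without.
prompt = build_bias_prompt({"Siobhan": ["shiver", "lawn"], "Kurata": []})
print(prompt)
```

The design choice in this sketch is that the user only ever writes ordinary words, which matches the stated goal of not requiring phonetic expertise at inference time.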

Abstract

Speech-aware LLMs (SLLMs) have recently achieved state-of-the-art ASR performance; however, they still fail to accurately transcribe bias words that appear rarely or never in the training data. Contextual biasing mechanisms are commonly implemented by introducing a predefined bias word list into the model via a text prompt or an additional module. For further improvement, predefined bias words can be paired with their phoneme representations as pronunciation cues. Typically, phoneme sequences are generated through a G2P system that covers the target languages and domains of the bias words. Therefore, when a compatible G2P system is unavailable, phoneme-assisted contextual biasing becomes difficult to perform. Moreover, manually adding accurate phoneme sequences requires advanced phonetic knowledge. In this paper, we explore contextual biasing in SLLMs based on acoustic cues associated with a set of common words whose pronunciations are partially similar to those of the target bias words. We assume ASR applications in which end users need neither specialized knowledge of phonetics nor G2P tools at inference time. For enhanced robustness, we also introduce bias word position prediction implemented in a multi-output learning fashion. Our method reduces bias word recognition errors by 16.3% compared to baseline systems, including on out-of-domain data.
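As a rough illustration of bias word position prediction in a multi-output setup, the sketch below derives per-word position targets from a reference transcript and combines two loss values into one objective. The label scheme (1 for a bias word, 0 otherwise), the function names, and the weight `position_weight` are assumptions; the abstract does not specify how positions are encoded or how the outputs are weighted.

```python
def bias_position_labels(transcript_words, bias_words):
    """Return one 0/1 label per transcript word (1 marks a bias word).

    Matching is case-insensitive and single-word only; multi-word bias
    phrases are not handled in this simplified sketch.
    """
    bias_set = {w.lower() for w in bias_words}
    return [1 if w.lower() in bias_set else 0 for w in transcript_words]


def multi_output_loss(asr_loss, position_loss, position_weight=0.5):
    """Combine the transcription loss and the position loss.

    A weighted sum is a common multi-output choice; the weight here is a
    hypothetical hyperparameter, not a value taken from the paper.
    """
    return asr_loss + position_weight * position_loss


# Hypothetical reference transcript and bias list.
words = "please call Siobhan tomorrow".split()
labels = bias_position_labels(words, ["siobhan"])
total = multi_output_loss(asr_loss=2.0, position_loss=1.0)
```

In a real system the two loss values would come from the model's transcription and position outputs; here they are plain numbers so the combination step can be read in isolation.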

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
