This paper examines whether multi-turn conversational context helps LLM-based Automatic Speech Recognition (ASR), and finds that after supervised multi-turn training the benefit lies mainly in recognizing contextual entities. Because the audio token sequence from prior turns grows quickly with conversation length, the authors introduce "Abstract Compression," which replaces prior-turn audio with a fixed number of learned latent tokens while keeping the corresponding transcripts. Experiments on in-domain and out-of-domain test sets show that the compressed model recovers part of the accuracy gain of raw-context conditioning with a smaller prior-turn audio footprint.
LLM-based ASR can keep part of the context boost at a lower cost: compress prior-turn audio into a fixed number of learned latent tokens and retain the transcripts, recovering part of the accuracy gain with a smaller audio footprint.
Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
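The input layout that the abstract describes can be illustrated with a minimal sketch. It only models how the sequence is assembled and how many prior-turn audio positions it occupies; it does not model how the latent tokens are learned or produced, which the abstract does not specify. The names `Turn`, `build_context`, `prior_audio_footprint`, and `num_latents` are hypothetical, and the choice to allot `num_latents` slots to each prior turn (rather than one shared budget for the whole history) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    audio_tokens: List[str]       # placeholder audio-encoder outputs for one turn
    transcript_tokens: List[str]  # text tokens of that turn's transcript


def build_context(prior_turns: List[Turn], current_audio: List[str],
                  num_latents: int, compress: bool) -> List[str]:
    """Lay out the LLM input for the current turn (illustrative only)."""
    sequence: List[str] = []
    for i, turn in enumerate(prior_turns):
        if compress:
            # Assumption: each prior turn is replaced by num_latents latent slots.
            sequence += [f"<latent:{i}:{k}>" for k in range(num_latents)]
        else:
            # Raw-context conditioning: keep every prior-turn audio token.
            sequence += turn.audio_tokens
        # Transcripts of prior turns are retained explicitly in both modes.
        sequence += turn.transcript_tokens
    # The current turn's audio is passed through uncompressed.
    sequence += current_audio
    return sequence


def prior_audio_footprint(prior_turns: List[Turn], num_latents: int,
                          compress: bool) -> int:
    """Count the positions spent on prior-turn audio."""
    if compress:
        return num_latents * len(prior_turns)
    return sum(len(t.audio_tokens) for t in prior_turns)
```

With two prior turns of 50 and 80 audio tokens and `num_latents=4`, the raw footprint is 130 positions and the compressed footprint is 8. Under this assumption the compressed cost depends on the number of turns but not on how long each turn's audio is, while the transcripts stay in the sequence unchanged.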