EPFL · Idiap · UZH · Mar 27, 2026 · arXiv:2603.26246

Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

Shashi Kumar, Esaú Villatoro-Tello, Sergio Gastón Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, S. Madikeri, Petr Motlícek, A. Stolcke

AI Summary

This paper examines whether multi-turn conversational context helps LLM-based Automatic Speech Recognition (ASR) and finds that, after supervised multi-turn training, it mainly improves recognition of contextual entities. Because the prior-turn audio token sequence grows rapidly with conversation length, the authors introduce "Abstract Compression," which replaces prior-turn audio with a fixed number of learned latent tokens while keeping the corresponding transcripts as explicit text. On in-domain and out-of-domain test sets, the compressed model recovers part of the accuracy gains of raw-context conditioning with a smaller prior-turn audio footprint.

Key Contribution

LLM-based ASR can keep part of the benefit of conversational context at lower cost: compressing prior-turn audio into a fixed number of learned latent tokens, while retaining the transcripts, recovers part of the raw-context accuracy gains with a smaller audio footprint.

Abstract

Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
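To make the footprint argument concrete, the following is a minimal sketch of how the prior-turn token count might compare under raw-context conditioning and under compression. It is not the authors' implementation. The encoder token rate, the number of latent tokens, the turn durations, and the reading of "a fixed number of learned latent tokens" as a per-turn budget are all assumptions made for illustration; the abstract gives no numbers.

```python
import math
from dataclasses import dataclass

# Hypothetical values for illustration only; the source does not state them.
AUDIO_TOKENS_PER_SECOND = 10  # assumed audio-encoder output rate
LATENTS_PER_TURN = 16         # assumed fixed latent budget, applied per prior turn


@dataclass
class Turn:
    duration_s: float       # length of the turn's audio, in seconds
    transcript_tokens: int  # number of text tokens in the turn's transcript


def raw_audio_tokens(turn: Turn) -> int:
    """Audio tokens for one turn when its audio is kept uncompressed."""
    return math.ceil(turn.duration_s * AUDIO_TOKENS_PER_SECOND)


def prior_context_tokens(prior_turns: list[Turn], compressed: bool) -> tuple[int, int]:
    """Return (audio_tokens, text_tokens) contributed by the prior turns.

    With compressed=True, each prior turn's audio is counted as a fixed number
    of latent tokens, so the audio cost no longer depends on audio duration.
    Transcripts are kept explicitly in both cases.
    """
    audio = 0
    text = 0
    for turn in prior_turns:
        audio += LATENTS_PER_TURN if compressed else raw_audio_tokens(turn)
        text += turn.transcript_tokens
    return audio, text


# A made-up three-turn history.
history = [Turn(4.0, 12), Turn(7.5, 20), Turn(3.0, 9)]

raw_audio, raw_text = prior_context_tokens(history, compressed=False)
cmp_audio, cmp_text = prior_context_tokens(history, compressed=True)
print(f"raw context:        {raw_audio} audio + {raw_text} text tokens")
print(f"compressed context: {cmp_audio} audio + {cmp_text} text tokens")
```

Under these assumptions the transcript cost is identical in both settings, and only the audio term changes: it scales with total audio duration in the raw case and with the number of prior turns in the compressed case.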

Inference & Quantization · Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
