Auckland, KAIST, Melbourne, UNSW

May 26, 2026

arXiv:2605.27039

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

Yang Xiao, Siyi Wang, Han Yin, Hong Jia, Vidhyasaharan Sethu, Eun-Jung Holden, Ting Dang

AI Summary

This paper introduces EnvMem, a multi-turn benchmark for evaluating the retention of non-speech acoustic information in large audio language models (LALMs). Using EnvMem, the authors identify representational trajectory drift as the primary bottleneck preventing LALMs from remembering acoustic cues across multiple turns. They further show that attention allocation plays a less significant role in this memory degradation, suggesting that improving representational stability is key to enhancing LALM acoustic memory.

Key Contribution

LALMs struggle to remember non-speech sounds across multi-turn conversations not because of faulty attention, but because their internal representations of those sounds drift over time.

Abstract

Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation level (i.e., latent embeddings) and the retrieval level (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.
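To make "representational trajectory drift" concrete, the following is a minimal sketch of one way such drift could be quantified: the cosine distance between the embedding of a sound at the first turn and its embedding at each later turn. This is an illustrative assumption, not the metric used in the paper; the function names `cosine_similarity` and `trajectory_drift` and the toy two-dimensional embeddings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def trajectory_drift(turn_embeddings):
    """Per-turn drift, defined here (as an assumption) as
    1 - cosine similarity to the first-turn embedding."""
    reference = turn_embeddings[0]
    return [1.0 - cosine_similarity(reference, e) for e in turn_embeddings]


# Hypothetical embeddings of the same acoustic cue at three successive turns.
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
drift = trajectory_drift(toy_embeddings)
print(drift)  # drift rises from 0 at the first turn to 1 at the last
```

In practice the embeddings would come from a model's hidden states at the audio token positions, and a drift curve that rises with turn index would be consistent with the failure mode the abstract describes.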

Eval Frameworks & Benchmarks, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
