This paper introduces EnvMem, a multi-turn benchmark for evaluating the retention of non-speech acoustic information in large audio language models (LALMs). Using EnvMem, the authors identify representational trajectory drift as the primary bottleneck preventing LALMs from remembering acoustic cues across multiple turns. They further show that attention allocation plays a less significant role in this memory degradation, suggesting that improving representational stability is key to enhancing LALM acoustic memory.
LALMs struggle to remember non-speech sounds across multi-turn conversations not because of faulty attention, but because their internal representations of those sounds drift over time.
Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation level (i.e., latent embeddings) and the retrieval level (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.
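To make "representational trajectory drift" concrete, the sketch below shows one simple way such drift could be quantified: the cosine distance between the latent embedding of an acoustic cue at the first turn and its embedding at each later turn. This is an illustrative assumption, not the metric used in the paper; the function names `cosine_distance` and `trajectory_drift` and the toy two-dimensional embeddings are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def trajectory_drift(turn_embeddings):
    """Distance of each turn's embedding from the first-turn embedding.

    turn_embeddings: list of vectors, one per conversation turn, all
    assumed to represent the same acoustic cue (hypothetical setup).
    """
    if not turn_embeddings:
        return []
    reference = turn_embeddings[0]
    return [cosine_distance(reference, emb) for emb in turn_embeddings]


# Toy example: the representation rotates away from its turn-0 direction.
embeddings = [
    [1.0, 0.0],  # turn 0 (reference)
    [1.0, 1.0],  # turn 1
    [0.0, 1.0],  # turn 2
]
drift = trajectory_drift(embeddings)
```

A drift curve that rises steadily with the turn index would be consistent with the failure mode described above, whereas a flat curve would point toward retrieval (attention) as the more likely culprit. In practice the embeddings would come from a model's hidden states rather than hand-written vectors.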