NUS
Jun 9, 2026
arXiv:2606.10572

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee

AI Summary

This paper introduces Latent Memory, a memory paradigm for resource-efficient question answering (QA) that stores each text or image evidence item as a single high-dimensional latent token instead of raw text or images. A small compressor model produces these latent tokens, which reduces generator token consumption and storage pressure while keeping QA performance competitive across multiple benchmarks. In the reported evaluations, Latent Memory consumes 3x to 10x fewer generator tokens than advanced retrieval-augmented generation (RAG) baselines, and it can also deliver the strongest image-grounded QA performance on WebQA.

Key Contribution

Latent Memory cuts generator token usage by 3x to 10x while maintaining competitive performance in text-only and multimodal question answering.

Abstract

External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms represent each memory item as raw text or images, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs. This results in high token consumption and storage pressure, which makes such systems unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are passed directly as prompt inputs to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It can also deliver the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
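The retrieval step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the three-dimensional vectors, the item names, the cosine-similarity scoring, and the helper functions `cosine`, `retrieve`, and `generator_tokens` are all assumptions chosen for illustration, and the token counts are made-up numbers rather than results from the paper. The sketch only shows the general idea of storing one vector per evidence item, ranking items against a query embedded in the same space, and comparing the generator-side token budget of raw evidence against one latent token per item.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, memory, k=2):
    """Return the ids of the k memory items most similar to the query vector."""
    ranked = sorted(
        memory.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:k]]


def generator_tokens(num_items, tokens_per_item, question_tokens):
    """Tokens the generator reads: retrieved evidence plus the question."""
    return num_items * tokens_per_item + question_tokens


# Hypothetical memory: one latent vector per evidence item (text or image).
# Real latent tokens would be high-dimensional outputs of a compressor model.
memory = {
    "passage_a": [0.9, 0.1, 0.0],
    "image_b": [0.1, 0.9, 0.1],
    "passage_c": [0.0, 0.2, 0.9],
}

# Hypothetical query embedding in the same space.
query = [0.8, 0.3, 0.0]
top = retrieve(query, memory, k=2)

# Illustrative budget only: assume 120 tokens per raw item and a 20-token question.
raw_budget = generator_tokens(len(top), tokens_per_item=120, question_tokens=20)
latent_budget = generator_tokens(len(top), tokens_per_item=1, question_tokens=20)

print(top)
print(raw_budget, latent_budget)
```

In this toy setting the saving comes entirely from the per-item cost dropping to one token; the actual ratio in any real system would depend on evidence length, prompt overhead, and how many items are retrieved.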

Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
