Tsinghua AI, Digital China Group, Joy Future Academy, Northeastern, Qifu Technology

Jun 4, 2026

arXiv:2606.05917

MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering

Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu, Yukun Yan, Yu Gu, Ge Yu, Gang Li, Maosong Sun

AI Summary

This paper introduces MemoryCard, a framework for long-video question answering that organizes a video into semantically coherent units called Memory Cards. It segments the video by topic or event, generates an event-level gist for each segment, selects representative visual moments, and uses the resulting cards for retrieval and question answering. Experiments report up to a 21.8% relative improvement in accuracy under comparable visual-token budgets, which suggests the approach helps when evidence is sparse and dispersed across a long video.

Key Contribution

MemoryCard transforms long videos into coherent, topic-focused Memory Cards, improving long-video QA accuracy by up to 21.8% (relative) under comparable visual-token budgets.

Abstract

Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.
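The pipeline described in the abstract (segment into topic units, build one card per unit, retrieve cards for a question) can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses a VLM-driven self-reading process, whereas the code below swaps in simple lexical overlap for segmentation and retrieval, uses the first utterance as a placeholder gist, and picks evenly spaced timestamps in place of learned representative moments. All names (`Utterance`, `MemoryCard`, `segment`, `build_card`, `retrieve`) and the `threshold` value are hypothetical.

```python
from __future__ import annotations

import string
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float  # seconds
    end: float
    text: str


@dataclass
class MemoryCard:
    start: float
    end: float
    gist: str                 # event-level summary of the unit
    key_moments: list[float]  # timestamps of representative frames
    transcript: str


def tokens(text: str) -> set[str]:
    """Lowercased words longer than three characters, punctuation stripped."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a and b else 0.0


def segment(utterances: list[Utterance], threshold: float = 0.1) -> list[list[Utterance]]:
    """Assumed stand-in for topic segmentation: open a new unit when an
    utterance shares too few words with the unit built so far."""
    units, current, vocab = [], [], set()
    for u in utterances:
        t = tokens(u.text)
        if current and jaccard(vocab, t) < threshold:
            units.append(current)
            current, vocab = [], set()
        current.append(u)
        vocab |= t
    if current:
        units.append(current)
    return units


def build_card(unit: list[Utterance], num_moments: int = 2) -> MemoryCard:
    start, end = unit[0].start, unit[-1].end
    step = (end - start) / (num_moments + 1)
    return MemoryCard(
        start=start,
        end=end,
        gist=unit[0].text,  # placeholder; a real system would ask a VLM
        key_moments=[start + step * (i + 1) for i in range(num_moments)],
        transcript=" ".join(u.text for u in unit),
    )


def retrieve(cards: list[MemoryCard], question: str, top_k: int = 1) -> list[MemoryCard]:
    """Rank cards by word overlap between the question and the card text."""
    q = tokens(question)
    return sorted(cards, key=lambda c: jaccard(q, tokens(c.transcript)), reverse=True)[:top_k]


utterances = [
    Utterance(0, 5, "Today we bake sourdough bread with flour and water."),
    Utterance(5, 10, "Knead the bread dough until the flour is absorbed."),
    Utterance(10, 15, "Next, the bicycle repair: remove the wheel."),
    Utterance(15, 20, "Patch the bicycle tube and refit the wheel."),
]
cards = [build_card(unit) for unit in segment(utterances)]
best = retrieve(cards, "How is the bicycle wheel fixed?")[0]
```

In this toy run the four utterances fall into two units (baking, bicycle repair), and the question retrieves the bicycle card. The point of the card abstraction is that the answering model receives one self-contained event (gist, key frames, transcript) rather than isolated frames.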

Computer Vision, Multimodal Models

Year: 2026

Venue: N/A
