This paper introduces MemoryCard, a framework for long-video question answering that organizes a video into semantically coherent units called Memory Cards. It segments the video by topic or event, generates an event-level gist for each segment, and selects representative visual moments, so that retrieval operates on self-contained event units rather than isolated frames. Experiments show up to a 21.8% relative improvement in accuracy under comparable visual-token budgets, which suggests the approach helps when answer-relevant evidence is sparse and dispersed across a long video.
MemoryCard turns long videos into coherent, topic-focused Memory Cards, improving long-video QA accuracy by up to 21.8% (relative) under comparable visual-token budgets.
Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.
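The pipeline described in the abstract (segment, summarize, select representative moments, retrieve) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Utterance`, `MemoryCard`, `segment`, `build_card`, `retrieve`) is hypothetical, and the components are deliberately simple stand-ins. Lexical overlap between utterances replaces the model-driven self-reading step, concatenated transcript text replaces the generated gist, evenly spaced timestamps replace selected visual moments, and cards are plain records rather than rendered images.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "at", "we", "then", "is"}


@dataclass
class Utterance:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class MemoryCard:
    start: float
    end: float
    gist: str
    keyframe_times: list[float] = field(default_factory=list)


def tokens(text: str) -> set[str]:
    """Lowercased content words of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def segment(utterances: list[Utterance], threshold: float = 0.1) -> list[list[Utterance]]:
    """Open a new segment when an utterance shares too few words with the current one.

    Stand-in for the self-reading segmentation step, which the abstract
    does not specify in detail.
    """
    segments: list[list[Utterance]] = []
    current: list[Utterance] = []
    vocab: set[str] = set()
    for u in utterances:
        t = tokens(u.text)
        if current and jaccard(vocab, t) < threshold:
            segments.append(current)
            current, vocab = [], set()
        current.append(u)
        vocab |= t
    if current:
        segments.append(current)
    return segments


def build_card(seg: list[Utterance], n_keyframes: int = 2) -> MemoryCard:
    """Build one card per segment: placeholder gist plus evenly spaced keyframe times."""
    start, end = seg[0].start, seg[-1].end
    gist = " ".join(u.text for u in seg)  # placeholder for a model-written gist
    step = (end - start) / (n_keyframes + 1)
    keyframes = [start + step * (i + 1) for i in range(n_keyframes)]
    return MemoryCard(start, end, gist, keyframes)


def retrieve(cards: list[MemoryCard], question: str, k: int = 1) -> list[MemoryCard]:
    """Return the k cards whose gist overlaps most with the question."""
    q = tokens(question)
    return sorted(cards, key=lambda c: jaccard(tokens(c.gist), q), reverse=True)[:k]


if __name__ == "__main__":
    transcript = [
        Utterance(0, 5, "we preheat the oven and mix the flour"),
        Utterance(5, 10, "then add sugar to the flour and mix"),
        Utterance(60, 70, "the football match starts at noon"),
        Utterance(70, 80, "the football team scores a goal"),
    ]
    cards = [build_card(s) for s in segment(transcript)]
    best = retrieve(cards, "who scores the goal in the football match")[0]
    print(len(cards), best.start, best.end)
```

The point of the sketch is the unit of evidence: the question is matched against whole event-level cards, each carrying its own summary and a few representative timestamps, instead of against individual frames. A real system would presumably replace each stand-in with a model call, and the threshold of 0.1 is an arbitrary illustrative value.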