MiLM Plus, Xiaomi Inc. This work was performed when Wenhui Tan was visiting Xiaomi as a research intern.

Abstract

Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption: only a small subset of key-frames is truly informative for answering a given question about a video. Specifically, MSJoE first reasons out several queries that describe diverse visual perspectives relevant to the question. These queries then interact with a frozen CLIP model to produce a query–frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and the sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query reasoning, frame sampling, and key-frame understanding. A new long-video QA dataset containing 2.8k videos with 7k question–answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.

1 Introduction

Figure 1: A direct comparison among static key-frame sampling algorithms, a trainable key-frame sampler, and our proposed MLLM-Sampler Joint Evolution framework (MSJoE).

Recent progress in multimodal large language models (MLLMs) has enabled strong performance in video understanding tasks such as captioning, reasoning, and question answering [7, 25, 1, 3, 2]. However, as videos grow longer, efficiency and accuracy degrade rapidly: the visual context length scales linearly with duration, while attention computation grows quadratically, making traditional dense uniform sampling inefficient. Furthermore, a question may involve multiple events in a video, and a dense uniform sampling strategy is liable to overlook key events.

The core challenge lies in efficiently selecting informative frames from long-form videos, where most frames are visually similar or irrelevant to the question. A fixed frame budget under uniform sampling thus forces the model either to miss key events or to spend computation on uninformative regions. This motivates the key assumption of this work: only a small subset of frames (key-frames) is truly needed to answer a question about a long-form video. Based on this assumption, we identify the fundamental question: how can the key-frames be obtained? To address it, many existing approaches leverage CLIP-based similarity between the question and frames to locate relevant segments [24, 35, 23, 10]. However, this raises two further challenges:

Q1: Is the question itself sufficient to retrieve all relevant frames? (Insufficiency)
Q2: How can frames be effectively sampled based on similarity scores? (Sampling)

Often, the question lacks explicit visual cues, making CLIP retrieval unreliable; thus, the question needs to be decomposed. Furthermore, frame-wise similarity scores are not equivalent to key-frame sampling weights: a naive top-k strategy over similarity scores tends to select redundant frames. Hence, some works propose heuristic algorithms [10, 24, 35] to transform similarity scores into key-frame sampling weights.
However, these methods often require careful algorithm design or even dataset-specific tuning. TSPO [23] proposes a trainable sampler that learns to select frames from CLIP similarities, yet it overlooks the fact that most MLLMs are pre-trained on uniformly sampled videos rather than key-frames [1, 3, 36, 9]. Hence, we raise the last question:

Q3: Can the MLLM and sampler truly collaborate without joint evolution? (Collaboration)

Effective collaboration requires two capabilities: (i) the MLLM must learn to generate reasoning queries that guide key-frame selection, and (ii) the MLLM must adapt to reason over the sparse key-frames that the sampler provides. Current methods freeze the MLLM during sampler training, preventing this bidirectional adaptation.

To address these issues, we propose the MLLM-Sampler Joint Evolution framework (MSJoE) for efficient long-form video understanding. To address the insufficiency of the question for frame retrieval (Q1), MSJoE first reasons out potentially helpful perspectives, generating multiple queries that describe visual events or clues relevant to answering the question. These queries are paired with densely sampled frames to form a query–frame similarity matrix via a frozen CLIP model. A lightweight sampler then predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames for the MLLM (Q2).
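To make the retrieval step concrete, the sketch below shows one way the query–frame similarity matrix could be computed with a frozen CLIP backbone. This is a minimal sketch, not the paper's implementation: the checkpoint name, the helper function, and its argument names are our own illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Hypothetical choice of frozen CLIP backbone; the excerpt does not name one.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def query_frame_similarity(queries, frames):
    """Return a (num_queries x num_frames) cosine-similarity matrix.

    queries: list[str] -- MLLM-generated descriptions of relevant visual clues
    frames:  list[PIL.Image.Image] -- densely sampled video frames
    """
    text_in = processor(text=queries, return_tensors="pt",
                        padding=True, truncation=True)
    img_in = processor(images=frames, return_tensors="pt")
    q = model.get_text_features(**text_in)   # (Q, d) query embeddings
    f = model.get_image_features(**img_in)   # (T, d) frame embeddings
    q = q / q.norm(dim=-1, keepdim=True)     # unit-normalize for cosine similarity
    f = f / f.norm(dim=-1, keepdim=True)
    return q @ f.T                           # similarity matrix S, shape (Q, T)
```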
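Given such a matrix, the lightweight sampler predicts per-frame sampling weights. The module below is a minimal sketch under our own architectural assumptions (pooling over queries followed by a temporal 1D convolution); the excerpt describes the sampler only as "lightweight", so none of these design choices should be read as the paper's.

```python
import torch
import torch.nn as nn

class KeyFrameSampler(nn.Module):
    """Illustrative sampler: maps a (Q x T) query-frame similarity matrix
    to sampling weights over the T frames. Architecture is assumed."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            # 2 input channels: max- and mean-pooled similarity over queries.
            nn.Conv1d(2, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=5, padding=2),
        )

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        # sim: (Q, T). Pooling over queries keeps T variable-length.
        pooled = torch.stack([sim.max(dim=0).values, sim.mean(dim=0)])  # (2, T)
        logits = self.net(pooled.unsqueeze(0)).squeeze()                # (T,)
        return logits.softmax(dim=-1)  # key-frame sampling weights

def select_key_frames(weights: torch.Tensor, budget: int) -> torch.Tensor:
    # During RL training the weights can parameterize a stochastic policy;
    # at inference, top-k over the learned weights gives the frame indices.
    return torch.topk(weights, k=min(budget, weights.numel())).indices.sort().values
```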
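Finally, joint evolution ties both components to a single answer-level reward. The snippet below sketches a plain REINFORCE-style estimator as one plausible reading of "jointly optimized through reinforcement learning"; the actual objective, reward shaping, and baseline used by MSJoE are not specified in this excerpt.

```python
import torch

def joint_rl_loss(answer_logprob, frame_logprobs, picked, reward, baseline=0.0):
    """REINFORCE-style sketch of a joint MLLM-sampler objective (assumed).

    answer_logprob: scalar log-prob of the MLLM's generated answer tokens
    frame_logprobs: (T,) log sampling weights produced by the sampler
    picked:         (k,) indices of the sampled key-frames
    reward:         scalar, e.g. 1.0 if the answer matches the ground truth
    """
    advantage = reward - baseline
    sampler_term = frame_logprobs[picked].sum()  # credit the chosen frames
    # One shared reward co-adapts query reasoning and answering (MLLM)
    # with frame selection (sampler), as in the paper's joint evolution idea.
    return -(advantage * (answer_logprob + sampler_term))
```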