D similarity matrix. Applying top-$k$ over an $N_{\text{query}}\times N_{\text{frames}}$ matrix requires pooling, but both average- and weighted-pooling yield suboptimal results. Only when paired with a trained sampler (iv) does the multi-query strategy surpass vanilla top-$k$. Thus, Q2: How to effectively sample frames from similarity scores? Answer: A well-trained sampler is necessary.

• A frozen MLLM cannot produce strong clip-oriented queries without training. The comparisons (ii)→(iv) and Ours→(iii) show that a trained MLLM consistently outperforms a frozen one: it is trained to reason out better queries and, even with identical frame inputs, understands them more effectively. Thus, Q3: Can the MLLM and sampler truly collaborate without joint evolution? Answer: No. Co-evolution is required to improve both query reasoning and key-frame understanding.

5.3.2 Further Analyses

We further analyze MSJoE to validate both its effectiveness and efficiency. First, as shown in Table 2, comparing our method with setting (i) indicates that a pre-trained sampler provides a stronger initialization for RL optimization, leading to consistently improved performance. Second, Figure 4 shows that our approach consistently outperforms the uniform-sampling baseline across input budgets of {8, 16, 32, 64} frames. Moreover, MSJoE matches or exceeds the baseline's performance with far fewer frames, demonstrating superior efficiency. In contrast, the top-$k$ sampling strategy can even degrade performance on LongVideoBench and VideoMME-Long, suggesting that naive similarity-based selection is less stable.

Figure 5: Three frame sets from a publicly available video, generated by different sampling strategies. Question: What motivated her to change dietary habits? (A) Family and friends (B) Diabetes (C) Tooth decay (D) Anemia.

5.4 Case Study

We present a qualitative case study to illustrate how different sampling strategies influence the inferred answer. The question asks: "What motivated her to change dietary habits?" with four choices: (A) family, (B) diabetes, (C) tooth decay, and (D) anemia. Three frame sets sampled by different methods are shown in Figure 5.

Frame set (I) is a scattered collection of frames showing buildings, family scenes, and unrelated shots, with no visual evidence related to changing dietary habits. With such weak cues, a plausible guess is option (A), which is the result obtained by uniform sampling. This demonstrates that uniform sampling easily misses events crucial for long-form reasoning.

Frame set (II) mostly shows people eating, with several duplicates. Although these frames correlate with the word "eating", they lack narrative cues about the dietary change. Given the prevalence of high-calorie food, the model leans toward (B), the answer predicted by top-$k$ sampling. This behavior reflects a limitation of top-$k$: it tends to over-focus on surface-level lexical matches to the question instead of exploring alternative reasoning paths. As illustrated in Figure 6 (left), the question–frame similarity distribution is noisier and less concentrated.

Frame set (III) shows children eating snacks, followed by an interview scene, and then a dentist examining a child with an open mouth. This sequence reveals a clear narrative: the family enjoys snacks, the child develops tooth problems, and they visit the dentist, motivating a change in dietary habits. From this evidence, (C) can be inferred: the answer predicted by MSJoE, and also the correct answer.
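To make the selection strategies compared above concrete, the following is a minimal sketch, not the paper's implementation, of question-based top-$k$ and pooled multi-query top-$k$. The variable names and pooling choices are illustrative assumptions, and the frame and query embeddings are assumed to be L2-normalized CLIP features.

```python
import torch

def topk_by_question(question_emb, frame_embs, k=8):
    # Naive top-k: rank frames by similarity to the question alone.
    sims = frame_embs @ question_emb            # [N_frames]
    return sims.topk(k).indices.sort().values   # restore temporal order

def topk_by_pooled_queries(query_embs, frame_embs, k=8, pooling="mean"):
    # Multi-query variant: the [N_query, N_frames] similarity matrix must be
    # pooled into one score per frame before top-k can be applied.
    sims = query_embs @ frame_embs.T            # [N_query, N_frames]
    if pooling == "mean":
        # Average-pooling over queries.
        scores = sims.mean(dim=0)
    else:
        # One possible weighted-pooling variant: weight each query by a
        # softmax over its best frame match (an assumption, not the paper's).
        weights = torch.softmax(sims.max(dim=1).values, dim=0)
        scores = (weights[:, None] * sims).sum(dim=0)
    return scores.topk(k).indices.sort().values
```

As the ablation above indicates, neither pooled variant by itself matches a trained sampler; this pooling step is where the learned sampler replaces a fixed heuristic.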
The overall processing pipeline of MSJoE for this question is as follows. First, given a sparse preview and the question as input, MSJoE reasons out four query concepts (e.g., toothbrush, tooth, doctor, blood test). Next, using these queries, the CLIP model identifies high-relevance temporal regions of the video. As shown in Figure 6 (right), the query–frame similarities highlight multiple meaningful events rather than a single narrow peak. Then, the U-Sampler selects key frames based on the full similarity distribution. Unlike top-$k$, it does not simply pick the highest-scoring frames; the U-Net architecture allows it to prioritize high-scoring regions while maintaining temporal diversity. This leads to more coherent narratives and better question alignment. A minimal code sketch of this inference loop is given after the conclusion below.

Figure 6: Distribution of question–frame similarity for top-$k$ (left) and query–frame similarity for MSJoE (right). Blue markers denote the frames selected by each method.

6 Conclusion

We introduced MLLM-Sampler Joint Evolution (MSJoE), a reinforcement learning framework that learns to select key frames for long-form video understanding. The method combines query-driven retrieval, a U-Net-based sampler, and joint evolution of the sampler and the MLLM to identify informative frames under tight sampling budgets. To support training and evaluation, we built a long-video QA dataset with automatically generated annotations and calibrated difficulty. Extensive experiments demonstrate that MSJoE achieves state-of-the-art (SOTA) results on multiple benchmarks while using far fewer frames than dense or heuristic approaches, surpassing the base MLLM by +8 points and the strongest baseline by +1.1 points on four long-video benchmarks. Our analysis highlights three key findings: queries provide richer semantic grounding than the question alone; learned sampling consistently outperforms heuristic strategies; and jointly optimizing the MLLM and sampler is crucial for robust query reasoning and key-frame understanding. These results demonstrate the effectiveness of learning to sample for long-video understanding and offer a scalable direction for future multimodal systems.
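Before turning to the supplementary implementation details, the inference loop walked through in the case study can be summarized by the following minimal sketch. The component interfaces (generate_queries, select, answer) and the preview and key-frame budgets are hypothetical stand-ins for illustration, not the released API.

```python
def answer_question(video_frames, question, mllm, clip, u_sampler,
                    n_preview=8, n_keyframes=32):
    # 1) Sparse preview: uniformly sample a few frames for query reasoning.
    stride = max(len(video_frames) // n_preview, 1)
    preview = video_frames[::stride][:n_preview]

    # 2) The MLLM reasons out clip-oriented query concepts from preview + question.
    queries = mllm.generate_queries(preview, question)   # e.g. ["tooth", "doctor", ...]

    # 3) CLIP scores every frame against every query.
    frame_embs = clip.encode_images(video_frames)         # [N_frames, D]
    query_embs = clip.encode_texts(queries)               # [N_query, D]
    sims = query_embs @ frame_embs.T                      # [N_query, N_frames]

    # 4) The sampler turns the full similarity distribution into a frame
    #    selection, rather than simply taking the highest scores.
    keyframe_idx = u_sampler.select(sims, k=n_keyframes)  # sorted frame indices

    # 5) The MLLM answers the question from the selected key frames.
    keyframes = [video_frames[i] for i in keyframe_idx.tolist()]
    return mllm.answer(keyframes, question)
```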
Supplementary Material

7 More Implementation Details

7.1 Sampler Architecture

We implement the key-frame sampler with a U-Net.
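Since only the high-level choice of a U-Net is stated here, the following is an illustrative sketch of a small 1D U-Net sampler operating on a pooled per-frame similarity sequence. The channel sizes, depth, and selection head are assumptions for illustration, not the actual configuration.

```python
import torch
import torch.nn as nn

class TinyUSampler(nn.Module):
    """Illustrative 1D U-Net over a per-frame similarity sequence (assumes an
    even number of frames so the down/up path shapes match)."""
    def __init__(self, in_ch=1, hidden=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv1d(in_ch, hidden, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool1d(2)
        self.enc2 = nn.Sequential(nn.Conv1d(hidden, hidden * 2, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec = nn.Sequential(nn.Conv1d(hidden * 3, hidden, 3, padding=1), nn.ReLU())
        self.head = nn.Conv1d(hidden, 1, 1)   # per-frame selection logit

    def forward(self, sims):                   # sims: [B, 1, N_frames]
        e1 = self.enc1(sims)                   # [B, H, N]
        e2 = self.enc2(self.down(e1))          # [B, 2H, N/2]
        u = self.up(e2)                        # [B, 2H, N]
        d = self.dec(torch.cat([u, e1], 1))    # skip connection -> [B, H, N]
        return self.head(d).squeeze(1)         # [B, N] frame logits

    @torch.no_grad()
    def select(self, sims, k):
        # Select k frames from the learned logits and keep temporal order.
        logits = self.forward(sims)                         # [B, N]
        return logits.topk(k, dim=-1).indices.sort(dim=-1).values
```

Because the selection is made from learned logits rather than raw similarities, the downsampling path can aggregate context over neighboring frames, which is what lets the sampler favor high-scoring regions while keeping temporal diversity.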