Apr 9, 2026arXiv:2604.08120

Small Vision-Language Models are Smart Compressors for Long Video Understanding

Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qiangbo Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu

AI Summary

The paper introduces Tempo, a framework that uses a Small Vision-Language Model (SVLM) as a query-aware compressor for long videos, addressing the context limit bottleneck in Multimodal Large Language Models (MLLMs). Tempo employs Adaptive Token Allocation (ATA) to dynamically allocate tokens based on relevance, compressing redundant information and prioritizing query-critical segments. Experiments demonstrate that Tempo, a 6B architecture, achieves state-of-the-art performance on long video understanding tasks, even surpassing GPT-4o and Gemini 1.5 Pro under strict token budgets.

Key Contribution

Forget brute-force context windows: a small vision-language model can compress hour-long videos below theoretical limits by intelligently prioritizing relevant content.

Abstract

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.

Computer Vision Inference & Quantization Multimodal Models

Citation Metrics

Citations0

Influential citations0

References42

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Small Vision-Language Models are Smart Compressors for Long Video Understanding

Related Papers