Mar 30, 2026 | arXiv:2603.28696

AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys

AI Summary

AdaptToken is a training-free framework for long video understanding with MLLMs that uses the model's self-uncertainty, measured by response entropy, to guide token selection across video clips. It allocates a token budget based on the estimated prompt relevance of each group of frames and supports early stopping when the model reaches sufficient certainty. Experiments on four long-video benchmarks show AdaptToken improves accuracy and benefits from long inputs, while AdaptToken-Lite reduces inference time by about half.

Key Contribution

AdaptToken lets MLLMs handle very long videos (up to 10K frames in the reported experiments) without any additional training, by adaptively selecting tokens according to the model's own uncertainty about the content.

Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
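As a rough illustration of the control flow described in the abstract, the sketch below computes a response entropy per group, turns the entropies into a global token budget, and optionally stops early once a group yields a confident answer. It is not the authors' implementation: the function names, the softmax-over-negative-entropy weighting, the `temperature` and `stop_entropy` parameters, and the toy answer distributions are all assumptions made for this example, and the cross-modal attention ranking within each group is omitted.

```python
import math

def response_entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer options."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(entropies, total_budget, temperature=0.5):
    """Split total_budget across groups; lower entropy gets a larger share.

    Assumption: a softmax over negative entropy is used as the weighting.
    The abstract does not specify the actual allocation rule.
    """
    weights = [math.exp(-h / temperature) for h in entropies]
    z = sum(weights)
    shares = [int(total_budget * w / z) for w in weights]
    # Give the rounding remainder to the most certain (lowest-entropy) group.
    shares[entropies.index(min(entropies))] += total_budget - sum(shares)
    return shares

def process_groups(group_probs, total_budget, stop_entropy=None):
    """Score groups in order; optionally stop once entropy is low enough.

    group_probs holds one answer distribution per group, standing in for
    the MLLM's response when it sees only that group (hypothetical input).
    """
    entropies = []
    for probs in group_probs:
        h = response_entropy(probs)
        entropies.append(h)
        if stop_entropy is not None and h <= stop_entropy:
            break  # early stop: the remaining groups are skipped
    return entropies, allocate_budget(entropies, total_budget)

# Toy distributions over four answer options for three groups.
group_probs = [
    [0.25, 0.25, 0.25, 0.25],  # uninformative group, maximal entropy
    [0.97, 0.01, 0.01, 0.01],  # highly relevant group, low entropy
    [0.40, 0.30, 0.20, 0.10],  # somewhat relevant group
]

entropies, budget = process_groups(group_probs, total_budget=1000)
lite_entropies, lite_budget = process_groups(
    group_probs, total_budget=1000, stop_entropy=0.3
)
```

In this toy run the second group receives the largest share of the budget because its answer distribution is the most peaked, and the early-stopping variant never scores the third group.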

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
