TAU | Trento | Mar 2, 2026 | arXiv:2603.01400

Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

Jinlong Li, Liyu Jiang, Haonan Zhang, Niculae Sebe

AI Summary

This paper introduces Anchor Optimal Transport (AOT), a training-free token reduction method for Video Large Language Models (VLLMs) that addresses spatiotemporal redundancy. AOT establishes local and global token anchors within and between video frames, using optimal transport to aggregate informative contexts from pruned tokens. Experiments demonstrate that AOT achieves competitive performance on short- and long-video benchmarks while significantly improving computational efficiency.

Key Contribution

VLLMs can be made substantially faster while keeping competitive accuracy by merging redundant tokens across space and time with optimal transport, without any additional training.

Abstract

Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM at the cost of shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. They also often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that establishes token Anchors within and across frames to comprehensively aggregate informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative contexts from the pruned tokens into these anchors, constructing intra-frame token anchors. Then, building on temporal frame clips, the tokens of the first frame in each clip serve as keyframe anchors that gather similar information from consecutive frames through optimal transport, while distinct tokens are kept to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, delivering substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT
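The intra-frame step described in the abstract (select anchor tokens under attention guidance, then use optimal transport to fold context from the pruned tokens into them) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the solver, cost, or marginals, so the Sinkhorn iteration, cosine cost, uniform marginals, top-k anchor selection, and the names `sinkhorn_plan`, `aggregate_into_anchors`, and `mix` are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan with row marginal a and column marginal b.

    Assumption: plain Sinkhorn iterations; the paper's solver may differ.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel, shape (P, A)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def _unit_rows(x):
    """L2-normalize each row so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)


def aggregate_into_anchors(tokens, scores, num_anchors, eps=0.1, mix=0.5):
    """Keep the top-scoring tokens as anchors and merge pruned context into them.

    tokens: (N, D) visual tokens of one frame.
    scores: (N,) attention-derived importance (hypothetical input).
    mix:    hypothetical blend weight between an anchor and its transported context.
    Returns the merged anchors (num_anchors, D) and their original indices.
    """
    order = np.argsort(-scores)
    anchor_idx = np.sort(order[:num_anchors])   # keep original token order
    pruned_idx = np.sort(order[num_anchors:])
    anchors = tokens[anchor_idx]
    if len(pruned_idx) == 0:                    # nothing to merge
        return anchors.copy(), anchor_idx
    pruned = tokens[pruned_idx]

    # Cosine cost between every pruned token and every anchor, in [0, 2].
    cost = 1.0 - _unit_rows(pruned) @ _unit_rows(anchors).T
    a = np.full(len(pruned_idx), 1.0 / len(pruned_idx))  # uniform mass (assumption)
    b = np.full(num_anchors, 1.0 / num_anchors)
    plan = sinkhorn_plan(cost, a, b, eps=eps)

    # Barycentric projection: each anchor receives a weighted mean of pruned tokens.
    weights = plan / plan.sum(axis=0, keepdims=True)     # each column sums to 1
    transported = weights.T @ pruned                     # (num_anchors, D)
    merged = (1.0 - mix) * anchors + mix * transported
    return merged, anchor_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_tokens = rng.normal(size=(16, 8))   # 16 toy tokens of dimension 8
    attn_scores = rng.random(16)
    merged, kept = aggregate_into_anchors(frame_tokens, attn_scores, num_anchors=4)
    print(merged.shape, kept)
```

Under the same assumptions, the inter-frame step could reuse `aggregate_into_anchors` by treating the first frame's tokens in a clip as the anchors and the following frames' tokens as the candidates to merge. How the method decides which tokens are distinct enough to keep is not stated in the abstract, so it is left out of this sketch.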

Inference & Quantization, Multimodal Models, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 76

Year: 2026

Venue: N/A
