BAAINTUPKUUESTCUniversity of Electronic Science and TechnologyJun 25, 2026arXiv:2606.27161

TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference

Tinghao Wang, Yichen Guo, Rui Huang, Zheng Lu, Qizhe Zhang, Chenxi Li, Yuan Zhang, Jiajun Cao, Zhirong Shen, Yaosong Du, Guangyan Gan, Wenya Wang, Lin William Cong, Shanghang Zhang

AI Summary

This paper introduces TOPS, a novel method for visual token pruning in multimodal large language models (MLLMs) that constructs Token Optimal Preservation Sets based on three principles: Task Relevance, Information Coverage, and Semantic Diversity. By applying a top-down information-theoretic analysis, TOPS effectively reduces the number of visual tokens while maintaining performance, outperforming existing methods across various benchmarks. Notably, it achieves a 77.8% reduction in visual tokens on LLaVA-NeXT without sacrificing performance, highlighting its potential to enhance efficiency in MLLM inference.

Key Contribution

Pruning 77.8% of visual tokens without losing performance could revolutionize the efficiency of multimodal large language models.

Abstract

Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principled formulation of the intrinsic objective of token pruning. In this paper, we revisit visual token pruning from a first-principles perspective and formulate it as constructing Token Optimal Preservation Sets. Through a top-down information-theoretic analysis, we identify three fundamental principles for effective token selection: Task Relevance, Information Coverage, and Semantic Diversity. Based on these principles, we propose TOPS, a training-free and model-agnostic pruning module that can be applied to various MLLMs. Extensive experiments on 7 MLLM backbones and 14 benchmarks demonstrate that TOPS outperforms prior methods under diverse pruning settings. Notably, on LLaVA-NeXT, TOPS removes 77.8% of visual tokens while preserving 100.0% and 100.6% performance on its 7B and 13B models, respectively, suggesting that pruning redundant visual tokens can sometimes mitigate hallucination and inspire future lightweight MLLM design.

Inference & Quantization Multimodal Models

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference

Related Papers