This paper introduces TOPS, a training-free, model-agnostic method for visual token pruning in multimodal large language models (MLLMs). TOPS frames pruning as constructing Token Optimal Preservation Sets, guided by three principles derived from a top-down information-theoretic analysis: Task Relevance, Information Coverage, and Semantic Diversity. Across 7 MLLM backbones and 14 benchmarks, it outperforms prior pruning methods under diverse pruning settings. On LLaVA-NeXT, it removes 77.8% of visual tokens while preserving 100.0% (7B) and 100.6% (13B) of performance, which points to more efficient MLLM inference.
Pruning 77.8% of visual tokens on LLaVA-NeXT without losing performance suggests that most visual tokens are redundant and that MLLM inference can be made substantially cheaper.
Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principled formulation of the intrinsic objective of token pruning. In this paper, we revisit visual token pruning from a first-principles perspective and formulate it as constructing Token Optimal Preservation Sets. Through a top-down information-theoretic analysis, we identify three fundamental principles for effective token selection: Task Relevance, Information Coverage, and Semantic Diversity. Based on these principles, we propose TOPS, a training-free and model-agnostic pruning module that can be applied to various MLLMs. Extensive experiments on 7 MLLM backbones and 14 benchmarks demonstrate that TOPS outperforms prior methods under diverse pruning settings. Notably, on LLaVA-NeXT, TOPS removes 77.8% of visual tokens while preserving 100.0% and 100.6% performance on its 7B and 13B models, respectively, suggesting that pruning redundant visual tokens can sometimes mitigate hallucination and inspire future lightweight MLLM design.
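The abstract does not spell out how the three principles are combined, so the following is only an illustrative sketch of how a selection rule could balance them. It is not the TOPS algorithm. The function name `select_tokens`, the weights `w_rel`, `w_cov`, `w_div`, the use of cosine similarity, and the greedy loop are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_tokens(tokens, query, keep, w_rel=1.0, w_cov=1.0, w_div=1.0):
    """Greedily pick `keep` token indices (hypothetical scoring, not TOPS).

    Each candidate is scored by:
      - relevance: similarity to the instruction embedding `query`
      - coverage gain: how much better the full token set is represented
        once the candidate joins the kept set
      - redundancy: highest similarity to any token already kept (penalized)
    """
    n = len(tokens)
    keep = min(keep, n)
    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    rel = [cosine(t, query) for t in tokens]

    covered = [0.0] * n  # best similarity of each token to the kept set so far
    selected = []
    remaining = set(range(n))

    while len(selected) < keep:
        best, best_score = None, -math.inf
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            cov_gain = sum(max(sim[i][j] - covered[j], 0.0) for j in range(n)) / n
            redundancy = max((sim[i][s] for s in selected), default=0.0)
            score = w_rel * rel[i] + w_cov * cov_gain - w_div * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
        covered = [max(covered[j], sim[best][j]) for j in range(n)]

    return sorted(selected)


# Toy example: tokens 0 and 1 are duplicates that match the query,
# token 2 is unrelated to the query but carries different content.
toy_tokens = [[1, 0], [1, 0], [0, 1]]
toy_query = [1, 0]
kept = select_tokens(toy_tokens, toy_query, keep=2)
```

In this toy case a relevance-only criterion would keep the two duplicate tokens, which is the redundancy problem the abstract attributes to attention-based criteria. The coverage and redundancy terms steer the second pick toward the token that adds new content instead.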