Apr 7, 2026 · arXiv:2604.05601

ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

AI Summary

The paper introduces ID-Selection, a visual token selection strategy for efficient LVLM inference that balances token importance and diversity. It iteratively selects high-scoring tokens while suppressing the scores of similar tokens, which reduces redundancy among the retained tokens. Experiments across 5 LVLM backbones and 16 benchmarks show that ID-Selection outperforms existing pruning methods, especially under high pruning ratios: on LLaVA-1.5-7B it prunes 97.2% of visual tokens while preserving 91.8% of the original performance.

Key Contribution

Reduces LVLM inference FLOPs by over 97% on LLaVA-1.5-7B while preserving 91.8% of the original performance, by pruning redundant visual tokens without any retraining.

Abstract

Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
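The selection loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not specify how importance is computed or how scores are suppressed, so the sketch assumes importance scores are given as input, measures similarity with cosine similarity, and scales each remaining score by (1 - similarity) after every pick. The function names `cosine_similarity` and `select_tokens` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tokens(features, importance, k):
    """Greedy importance-diversity selection (illustrative sketch only).

    features:   list of token feature vectors
    importance: list of non-negative importance scores, one per token
    k:          number of tokens to keep
    Returns the indices of the kept tokens in selection order.
    """
    scores = list(importance)  # working copy; the caller's list is not modified
    remaining = set(range(len(features)))
    selected = []
    while remaining and len(selected) < k:
        # Pick the remaining token with the highest current score
        # (ties go to the lower index so the result is deterministic).
        best = max(remaining, key=lambda i: (scores[i], -i))
        selected.append(best)
        remaining.remove(best)
        # Assumed suppression rule: scale each remaining score by
        # (1 - similarity), so near-duplicates of the pick fall toward zero.
        for i in remaining:
            sim = cosine_similarity(features[i], features[best])
            sim = min(1.0, max(0.0, sim))  # clamp to [0, 1]
            scores[i] *= 1.0 - sim
    return selected


# Toy example: token 1 is nearly identical to token 0.
features = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]]
importance = [0.9, 0.85, 0.4, 0.3]
kept = select_tokens(features, importance, k=2)
```

In the toy example, ranking by importance alone would keep tokens 0 and 1, which carry almost the same feature. With suppression, the score of token 1 collapses after token 0 is picked, so the second slot goes to the dissimilar token 2. As a check on the reported ratio, assuming the standard 576 visual tokens per image in LLaVA-1.5, keeping 16 tokens prunes 1 - 16/576 ≈ 97.2% of them, which matches the figure in the abstract.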

Computer Vision · Inference & Quantization · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
