This paper introduces ATV-Pruning, an asymmetric text-visual pruning method for large vision-language models (LVLMs) that addresses the divergent behaviors of textual and visual tokens. The method calibrates the more sensitive textual pathway using text tokens, and exploits the high redundancy of the visual pathway by allowing up to 50% sparsity there. ATV-Pruning adaptively constructs a calibration pool from all textual tokens plus a subset of visual tokens, using a layer-adaptive selection strategy to identify the important visual tokens, and demonstrates superior performance on standard multimodal benchmarks.
Textual pathways in LVLMs are more sensitive to pruning than visual pathways, implying that the visual pathway can be pruned aggressively without significantly degrading performance.
Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), where the importance metric typically incorporates both weights and activations. However, existing efforts process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of ATV-Pruning over state-of-the-art methods.
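To make the asymmetric idea concrete, the sketch below illustrates one plausible reading of the approach under an activation-aware importance metric of the form |W| scaled by per-feature activation norms (in the style of Wanda-like pruning). All function names, the norm-based visual-token selection stand-in, and the specific sparsity/keep ratios are illustrative assumptions, not the paper's actual implementation: the text pathway is calibrated with text tokens only and pruned conservatively, while the visual pathway is calibrated with a pool of all text tokens plus a selected subset of visual tokens and pruned at up to 50% sparsity.

```python
import numpy as np

def prune_by_importance(W, X, sparsity):
    """Zero the lowest-importance weights, where importance is
    |W| scaled by the L2 norm of each input feature over the
    calibration tokens X (shape: tokens x d_in)."""
    feat_norm = np.linalg.norm(X, axis=0)        # (d_in,)
    importance = np.abs(W) * feat_norm           # (d_out, d_in)
    k = int(W.size * sparsity)                   # number of weights to drop
    thresh = np.partition(importance.ravel(), k)[k]
    return W * (importance >= thresh)

def select_visual_tokens(X_vis, keep_ratio):
    """Stand-in for the paper's layer-adaptive selection: keep the
    visual tokens with the largest activation norms (an assumption)."""
    norms = np.linalg.norm(X_vis, axis=1)
    k = max(1, int(len(X_vis) * keep_ratio))
    return X_vis[np.argsort(norms)[-k:]]

def atv_prune_sketch(W_text, W_vis, X_text, X_vis,
                     s_text=0.3, s_vis=0.5, keep_ratio=0.25):
    # Text pathway: calibrate with text tokens only, prune conservatively
    # (the paper finds this pathway is the more pruning-sensitive one).
    W_text_p = prune_by_importance(W_text, X_text, s_text)
    # Visual pathway: calibration pool = all text tokens plus a subset of
    # visual tokens; high redundancy permits ~50% sparsity.
    pool = np.vstack([X_text, select_visual_tokens(X_vis, keep_ratio)])
    W_vis_p = prune_by_importance(W_vis, pool, s_vis)
    return W_text_p, W_vis_p
```

The asymmetry lives entirely in the calibration data and sparsity targets handed to a shared importance metric, which is why the abstract can describe the method as "simple yet effective".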