\widetilde{\mathbf{X}} = \bigcup_{k=1}^{K} \text{Top}_{n_k}(\widetilde{G}_k).  (6)

3.4 Information-Aware Dynamic Compression Ratio Across Images

Conventional token compression methods employ a fixed compression ratio r = N/T for all images. This uniform approach leads to suboptimal compression: for complex scenes, the predetermined N proves insufficient, causing excessive information loss, whereas for simple scenes, the same N becomes unnecessarily large, resulting in substantial redundancy. To address this limitation, we propose an information-aware dynamic compression ratio strategy that automatically adjusts the retained token budget N according to each image's information content. Building upon the global redundancy measure \rho from Eq. 4, we compute an image information score:

\phi = 1 - \rho,  (7)

where higher \phi indicates greater semantic diversity and less redundancy. We then allocate the retained token count N' for each image in proportion to its information score: N' \propto \phi. This ensures that more informative images are allocated more tokens, while simpler images are compressed more aggressively, thereby improving compression adaptiveness across diverse scenes.

3.5 Theoretical Overview of PSCA–NMS Pruning

Our pruning objective is to retain a token subset S' of fixed size N that maximizes the effective information available to the VLM. Formally, for a token set S, its informativeness is defined as:

\text{Inform}(S) = \bigcup_{s_i \in S} I(s_i),  (8)

where I(s_i) denotes the semantic information of token s_i. Using the Inclusion–Exclusion Principle, the informativeness of S' admits the lower bound:

\text{Inform}(S') \geq \sum_{s_i \in S'} I(s_i) - \sum_{s_i, s_j \in S'} R(s_i, s_j),  (9)

where R(s_i, s_j) measures the semantic redundancy between tokens s_i and s_j.
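To make the two ideas above concrete, the following is a minimal NumPy sketch of (a) the information-aware budget allocation of Eq. 7 (N' \propto \phi) and (b) a greedy selection that keeps high-importance tokens while rejecting near-duplicates, in the spirit of the redundancy penalty in Eq. 9. The function names (`allocate_budgets`, `greedy_diverse_select`), the use of cosine similarity for R, and the proportional rounding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def allocate_budgets(redundancy, total_budget):
    """Information-aware budget allocation (hypothetical helper).

    Each image's information score is phi = 1 - rho (Eq. 7); the retained
    token counts are split across images in proportion to phi (N' ∝ phi).
    Rounding may make the budgets sum to slightly off total_budget.
    """
    phi = 1.0 - np.asarray(redundancy, dtype=float)  # Eq. 7
    weights = phi / phi.sum()
    return np.maximum(1, np.round(weights * total_budget)).astype(int)

def greedy_diverse_select(tokens, importance, budget, eps=0.8):
    """Greedy selection under a pairwise-similarity constraint.

    Visits tokens in decreasing importance and keeps a candidate only if
    its cosine similarity to every already-kept token stays below eps
    (an NMS-style stand-in for the constraint R(s_i, s_j) <= epsilon).
    """
    X = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    order = np.argsort(-np.asarray(importance))  # most informative first
    kept = []
    for i in order:
        if len(kept) == budget:
            break
        if all(float(X[i] @ X[j]) <= eps for j in kept):
            kept.append(int(i))
    return kept
```

A usage sketch: `allocate_budgets([0.2, 0.7], total_budget=64)` gives the low-redundancy image roughly 0.8/1.1 of the 64 retained tokens, and `greedy_diverse_select` then fills each image's budget from its own token pool.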
PSCA optimizes the first term by selecting tokens with the largest projections onto the principal semantic directions, effectively maximizing \sum_{s_i \in S'} I(s_i). NMS complements this by enforcing a similarity constraint R(s_i, s_j) \leq \epsilon, which suppresses the redundancy in the second term and ensures diverse semantic coverage. Together, PSCA and NMS jointly approximate the maximization objective:

\max_{|S'| = N} \sum_{s_i \in S'} I(s_i) \quad \text{s.t.} \quad R(s_i, s_j) \leq \epsilon,  (10)

providing a theoretically grounded and effective pruning strategy. The full derivation is presented in Appendix A.1.1.

4 Experiments

Following the experimental protocol of Yang et al. (2024), we assess the effectiveness of our approach on LLaVA-1.5 (Liu et al., 2023). To evaluate generalization, we extend our study to high-resolution vision-language models, including LLaVA-NeXT (Liu et al., 2024b) and Mini-Gemini (Li et al., 2024). We also conduct experiments on Qwen-VL (Wang et al., 2024) in the supplementary material. Evaluations are conducted using LMMs-Eval (Zhang et al., 2024a) on a comprehensive suite of widely used visual understanding benchmarks, including GQA (Hudson and Manning, 2019), MMBench (Liu et al., 2024d), MME (Fu et al., 2023), POPE (Li et al., 2023b), ScienceQA (Lu et al., 2022), VQA-v2 (Goyal et al., 2017), TextVQA (Singh et al., 2019), MMMU (Yue et al., 2024), SEED-Bench (Li et al., 2023a), VizWiz (Gurari et al., 2018), and LLaVA-Bench (Liu et al., 2024a). We further evaluate the applicability of our method to video understanding tasks using Video-LLaVA (Lin et al., 2023).

4.1 Main Results on Image Understanding Tasks

Results on LLaVA-1.5. LLaVA-1.5 uniformly resizes input images to a resolution of