MedPruner, a training-free hierarchical token pruning framework, is introduced to address computational inefficiencies in 3D medical VLMs caused by anatomical redundancy and fixed pruning ratios. It employs an Inter-slice Anchor-based Filtering module to remove slice-level redundancy, followed by Dynamic Information Nucleus Selection based on cumulative attention weights for adaptive token-level compression. Experiments across three 3D medical benchmarks and three medical VLMs show MedPruner can maintain or improve performance with as few as 5% of the original visual tokens, significantly reducing computational costs.
Achieve up to a 20x speedup in 3D medical VLMs, with no loss of accuracy, by pruning away 95% of visual tokens *without* retraining.
While specialized Medical Vision-Language Models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment for 3D volumetric data remains constrained by significant computational inefficiencies. Current architectures typically suffer from massive anatomical redundancy due to the direct concatenation of consecutive 2D slices and lack the flexibility to handle heterogeneous information densities across different slices using fixed pruning ratios. To address these challenges, we propose MedPruner, a training-free and model-agnostic hierarchical token pruning framework specifically designed for efficient 3D medical image understanding. MedPruner introduces a two-stage mechanism: an Inter-slice Anchor-based Filtering module to eliminate slice-level temporal redundancy, followed by a Dynamic Information Nucleus Selection strategy that achieves adaptive token-level compression by quantifying cumulative attention weights. Extensive experiments on three 3D medical benchmarks and across three diverse medical VLMs reveal massive token redundancy in existing architectures. Notably, MedPruner enables models such as MedGemma to maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, thereby drastically reducing computational overhead and validating the necessity of dynamic token selection for practical clinical deployment. Our code will be released.
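The second stage, Dynamic Information Nucleus Selection, adapts the number of retained tokens per slice by quantifying cumulative attention weights. The abstract does not give the exact criterion, but the idea can be sketched as a top-p-style selection: keep the smallest set of tokens whose normalized attention mass reaches a threshold, so slices with peaked attention keep few tokens while slices with diffuse attention keep many. The function name, the threshold `p`, and the use of raw per-token attention weights are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dynamic_nucleus_select(attn_weights, p=0.9):
    """Hedged sketch of nucleus-style token selection: return indices of
    the smallest token set whose cumulative attention mass reaches p.
    (Illustrative only; MedPruner's exact criterion may differ.)"""
    w = np.asarray(attn_weights, dtype=float)
    w = w / w.sum()                         # normalize to a distribution
    order = np.argsort(w)[::-1]             # most-attended tokens first
    csum = np.cumsum(w[order])
    k = int(np.searchsorted(csum, p)) + 1   # smallest k with mass >= p
    return np.sort(order[:k])               # retained token indices

# Peaked attention (one dominant token) keeps a single token;
# flat attention keeps all of them -- the ratio adapts per slice.
peaked = dynamic_nucleus_select([100, 1, 1, 1, 1], p=0.9)
flat = dynamic_nucleus_select([1, 1, 1, 1, 1], p=0.9)
```

Under this sketch, the compression ratio is not fixed in advance: it falls directly out of how concentrated the attention distribution is, which is the "heterogeneous information densities" problem the fixed-ratio baselines cannot handle.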