Beihang | Apr 20, 2026 | arXiv:2604.18260

Geometry-Guided 3D Visual Token Pruning for Video-Language Models

Han Li, Zehao Huang, Jiahui Fu, Naiyan Wang, Si Liu

AI Summary

This paper introduces Geo3DPruner, a framework for pruning visual tokens in 3D scene understanding tasks. It targets a weakness of existing pruning methods, which overlook view consistency across frames and the spatial diversity of the remaining tokens. Geo3DPruner models cross-frame relevance with geometry-aware global attention, then prunes in two stages: an intra-voxel stage that selects representative multi-view features within each voxel, and an inter-voxel stage that preserves spatial diversity by keeping a globally distributed subset of voxels. On multiple 3D scene understanding benchmarks, it prunes 90% of visual tokens while retaining over 90% of the original model performance.

Key Contribution

Geo3DPruner prunes 90% of visual tokens while retaining over 90% of the original performance, targeting the token-count bottleneck in 3D scene understanding with video-language models.

Abstract

Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.
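The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the selection rules, so the sketch assumes that each visual token already has a 3D position (`points`, obtained elsewhere from depth and camera pose) and a scalar relevance score (`scores`, standing in for whatever the geometry-aware global attention produces). The intra-voxel stage is approximated by keeping the highest-scoring token per voxel, and the inter-voxel stage by farthest point sampling over the surviving tokens. All function names, `voxel_size`, and `keep_ratio` are hypothetical.

```python
import numpy as np

def voxel_ids(points, voxel_size):
    """Map each 3D point to an integer voxel id (consecutive, starting at 0)."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(coords, axis=0, return_inverse=True)
    return inverse.reshape(-1)

def intra_voxel_select(ids, scores):
    """Assumed intra-voxel rule: keep the highest-scoring token in each voxel."""
    # Sort by voxel id first, then by descending score within each voxel.
    order = np.lexsort((-scores, ids))
    sorted_ids = ids[order]
    first = np.ones(len(order), dtype=bool)
    first[1:] = sorted_ids[1:] != sorted_ids[:-1]
    return order[first]

def farthest_point_sampling(points, k, start=0):
    """Assumed inter-voxel rule: greedily pick k mutually distant points."""
    k = min(k, len(points))
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def prune_tokens(points, scores, voxel_size=0.2, keep_ratio=0.1):
    """Return sorted indices of the tokens kept after both stages."""
    ids = voxel_ids(points, voxel_size)
    reps = intra_voxel_select(ids, scores)            # one token per voxel
    budget = max(1, int(round(keep_ratio * len(points))))
    start = int(np.argmax(scores[reps]))              # seed with the top-scoring token
    picked = farthest_point_sampling(points[reps], budget, start=start)
    return np.sort(reps[picked])

# Toy usage: 1000 tokens scattered in a 4 m cube, keep about 10% of them.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 4.0, size=(1000, 3))
scores = rng.uniform(0.0, 1.0, size=1000)
kept = prune_tokens(points, scores, voxel_size=0.2, keep_ratio=0.1)
```

Under these assumptions, tokens from different frames that observe the same surface fall into the same voxel and collapse to one representative, which is where the inter-frame redundancy is removed. Farthest point sampling then spreads the remaining budget across the scene. If fewer voxels are occupied than the budget allows, the sketch simply keeps one token per voxel.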

Computer Vision | Inference & Quantization | Multimodal Models

