Google Research | Max Planck | May 28, 2026 | arXiv:2605.30126

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

S. Kuzucu, Alessio Tonioni, Vasile Lup, B. Schiele, Federico Tombari, Muhammad Ferjad Naeem

AI Summary

The paper introduces PARCEL, a novel visual tokenization architecture for efficient vision-language understanding that dynamically partitions feature extraction. PARCEL uses spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors via Pool-Conditioned Query Resampling, encouraging query tokens to focus on complementary visual features. Experiments across 27 benchmarks demonstrate that PARCEL improves the performance-efficiency Pareto frontier compared to existing matryoshka baselines across various visual-token budgets.

Key Contribution

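PARCEL improves the performance-efficiency Pareto frontier of vision-language models by dynamically partitioning feature extraction between spatial pool tokens and pool-conditioned elastic query tokens, outperforming existing matryoshka baselines across 27 benchmarks.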

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
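The abstract gives no implementation details, so the following is only a rough sketch of how pool anchors and conditioned elastic queries could fit together. Everything in it is an assumption made for illustration: the function names (`average_pool_tokens`, `cross_attend`, `pool_anchored_resample`), the use of plain average pooling, projection-free single-head attention, and the nested-prefix selection of queries are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys_values):
    # Single-head scaled dot-product cross-attention.
    # Assumption: no learned projections, to keep the sketch short.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def average_pool_tokens(features, grid, pooled_grid):
    # Average-pool a row-major (grid*grid, d) token grid
    # down to (pooled_grid*pooled_grid, d) grid-aligned tokens.
    assert grid % pooled_grid == 0, "grid must be divisible by pooled_grid"
    d = features.shape[-1]
    k = grid // pooled_grid
    blocks = features.reshape(pooled_grid, k, pooled_grid, k, d)
    return blocks.mean(axis=(1, 3)).reshape(pooled_grid * pooled_grid, d)


def pool_anchored_resample(features, grid, pooled_grid, query_bank, num_queries):
    # 1) Spatial pool tokens act as low-frequency layout anchors.
    pool = average_pool_tokens(features, grid, pooled_grid)
    # 2) Elastic budget: take a nested prefix of a shared query bank (assumed).
    queries = query_bank[:num_queries]
    # 3) Condition the queries on the pool anchors (assumed residual form).
    queries = queries + cross_attend(queries, pool)
    # 4) Resample from the full-resolution features with the conditioned queries.
    queries = queries + cross_attend(queries, features)
    # The visual-token sequence is the anchors followed by the query summaries.
    return np.concatenate([pool, queries], axis=0)


rng = np.random.default_rng(0)
features = rng.normal(size=(16 * 16, 32))   # stand-in for vision-encoder output
query_bank = rng.normal(size=(64, 32))      # stand-in for learned queries

for budget in (4, 16, 64):
    tokens = pool_anchored_resample(features, 16, 4, query_bank, budget)
    print(budget, tokens.shape)
```

In this sketch the same query bank serves every budget, which is one simple way to mimic a single model that runs at several visual-token budgets. Whether PARCEL selects queries this way, or conditions them through a different mechanism, is not stated in the abstract.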

Architecture Design (Transformers, SSMs, MoE) | Inference & Quantization | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 105

Year: 2026

Venue: N/A
