Mar 19, 2026

arXiv:2603.18891

PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

Tianci Luo, Jinpeng Wang, Shi-Yu Qin, Niu Lian, Yan Feng, Bin Chen, Chun Yuan, Shuhui Xia, Shu-Tao Xia

AI Summary

This paper introduces PromptHub, a novel framework for visual in-context learning (VICL) that enhances multi-prompt fusion by incorporating locality-aware mechanisms. It addresses limitations of prior patch-wise fusion methods by exploiting spatial priors for richer contextual information and employing concentration, alignment, and prediction objectives for mutually guided training. Experiments across three vision tasks demonstrate PromptHub's superior performance, universality, transferability, and robustness compared to existing approaches.

Key Contribution

Exploiting spatial locality when fusing multiple prompts improves visual in-context learning across the three vision tasks evaluated.

Abstract

Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion, which combines the advantages of multiple demonstrations and offers a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.

Architecture Design (Transformers, SSMs, MoE); Computer Vision; Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 53

Year: 2026

Venue: N/A
