Corresponding author · UT Arlington · May 27, 2026 · arXiv:2605.27893

SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

Lingyu Xiong, Jinjin Shi, Xuran Xu, Congce Luo, Runyu Shi, Ying Huang

AI Summary

This paper introduces SIGMA, a parameter-efficient fine-tuning (PEFT) method for adapting Vision Foundation Models (VFMs) to dense prediction tasks. SIGMA bridges structural gaps with a scale-adaptive fusion module that extracts multi-granularity visual information, and bridges distributional gaps with a semantic modulation module that performs global feature alignment. Experiments across various dense tasks and VFM backbones show that SIGMA outperforms existing PEFT methods while requiring only 1.72% trainable parameters relative to the VFM backbone.

Key Contribution

SIGMA bridges the structural and distributional gaps in Vision Foundation Model adaptation, outperforming state-of-the-art PEFT methods on dense prediction tasks while requiring only 1.72% trainable parameters relative to the VFM backbone.

Abstract

Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to match the performance of full fine-tuning at minimal training cost. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to structural and distributional gaps. To bridge these gaps, we propose the Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method that consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module bridges structural gaps by enhancing the extraction of multi-granularity visual information. SIGMA then applies semantic modulation to the fused features to perform global feature alignment, further narrowing the distributional gap. This design enables unified spatial and distributional adaptation while requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA consistently outperforms state-of-the-art PEFT methods.
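To make the 1.72% figure concrete, the following is a minimal sketch of the arithmetic behind a trainable-parameter ratio. The function names and the 86M backbone size are assumptions chosen for illustration; the abstract does not state which backbone sizes were used, so the resulting budget is not a number from the paper.

```python
def trainable_ratio(trainable_params: int, backbone_params: int) -> float:
    """Return trainable parameters as a percentage of the frozen backbone size."""
    if backbone_params <= 0:
        raise ValueError("backbone_params must be positive")
    return 100.0 * trainable_params / backbone_params


def adapter_budget(backbone_params: int, ratio_percent: float = 1.72) -> int:
    """Return how many trainable parameters a given percentage allows."""
    return round(backbone_params * ratio_percent / 100.0)


# Hypothetical backbone size, for illustration only (not taken from the paper).
HYPOTHETICAL_BACKBONE_PARAMS = 86_000_000

budget = adapter_budget(HYPOTHETICAL_BACKBONE_PARAMS)
print(f"adapter budget: {budget:,} parameters")
print(f"ratio: {trainable_ratio(budget, HYPOTHETICAL_BACKBONE_PARAMS):.2f}%")
```

Under this assumed backbone size, the adapter would hold roughly 1.5 million trainable parameters while the backbone weights stay frozen, which is what keeps per-task storage small compared with full fine-tuning.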

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
