KAUST | Mar 4, 2026 | arXiv:2603.03831

Universal Pansharpening Foundation Model

Hebaixu Wang, Jing Zhang, Haonan Guo, Jiayi Ma, Liangpei Zhang

AI Summary

The paper introduces FoundPS, a universal pansharpening foundation model designed to overcome the limitations of existing satellite-specific and scene-dependent methods. FoundPS employs a modality-interleaved transformer to learn reversible spectral affine bases, mapping MS images into a unified latent space, and uses a latent diffusion bridge model with bridge posterior sampling for stable and controllable fusion. The model also incorporates infinite-dimensional pixel-to-latent interaction mechanisms to capture cross-domain dependencies, and is trained and evaluated on a new comprehensive benchmark, PSBench.

Key Contribution

FoundPS replaces satellite-specific pipelines with a single model that, according to the authors' experiments, outperforms state-of-the-art pansharpening methods across diverse sensors and scenes.

Abstract

Pansharpening generates the high-resolution multi-spectral (MS) image by integrating spatial details from a texture-rich panchromatic (PAN) image and spectral attributes from a low-resolution MS image. Existing methods are predominantly satellite-specific and scene-dependent, which severely limits their generalization across heterogeneous sensors and varied scenes, thereby reducing their real-world practicality. To address these challenges, we present FoundPS, a universal pansharpening foundation model for satellite-agnostic and scene-robust fusion. Specifically, we introduce a modality-interleaved transformer that learns band-wise modal specializations to form reversible spectral affine bases, mapping arbitrary-band MS into a unified latent space via tensor multiplication. Building upon this, we construct a latent diffusion bridge model to progressively evolve latent representations, and incorporate bridge posterior sampling to couple latent diffusion with pixel-space observations, enabling stable and controllable fusion. Furthermore, we devise infinite-dimensional pixel-to-latent interaction mechanisms to comprehensively capture the cross-domain dependencies between PAN observations and MS representations, thereby facilitating complementary information fusion. In addition, to support large-scale training and evaluation, we construct a comprehensive pansharpening benchmark, termed PSBench, consisting of worldwide MS and PAN image pairs from multiple satellites across diverse scenes. Extensive experiments demonstrate that FoundPS consistently outperforms state-of-the-art methods, exhibiting superior generalization and robustness across a wide range of pansharpening tasks.
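The idea of mapping arbitrary-band MS images into a unified latent space through a reversible affine basis can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: in the paper the bases are learned by the modality-interleaved transformer, whereas here a random matrix with orthonormal columns stands in for them. The names `make_affine_basis`, `encode`, and `decode`, the latent width of 16, and the image sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def make_affine_basis(num_bands, latent_dim, seed=0):
    """Hypothetical stand-in for a learned spectral affine basis.

    Returns a (latent_dim, num_bands) matrix with orthonormal columns
    and a bias vector of length latent_dim.
    """
    if latent_dim < num_bands:
        raise ValueError("latent_dim must be >= num_bands for an invertible map")
    rng = np.random.default_rng(seed)
    # Reduced QR of a tall random matrix yields orthonormal columns,
    # so the transpose of the basis acts as an exact left inverse.
    basis, _ = np.linalg.qr(rng.standard_normal((latent_dim, num_bands)))
    bias = rng.standard_normal(latent_dim)
    return basis, bias


def encode(ms, basis, bias):
    """Map an (H, W, C) MS image to an (H, W, K) latent via tensor multiplication."""
    return np.einsum("hwc,kc->hwk", ms, basis) + bias


def decode(latent, basis, bias):
    """Invert encode: remove the bias, then project back onto the C bands."""
    return np.einsum("hwk,kc->hwc", latent - bias, basis)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two sensors with different band counts share one latent width.
    for num_bands in (4, 8):
        ms = rng.random((8, 8, num_bands))
        basis, bias = make_affine_basis(num_bands, latent_dim=16)
        latent = encode(ms, basis, bias)
        restored = decode(latent, basis, bias)
        print(num_bands, latent.shape, np.allclose(restored, ms))
```

In this sketch, a 4-band and an 8-band image both land in a latent tensor of the same width, and each can be recovered from its latent because the basis has a left inverse. A downstream model that operates only on the latent therefore does not need to know the band count of the sensor, which is the property the abstract attributes to the unified latent space.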

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
