Foshan · Apr 2, 2026 · arXiv:2604.01900

FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation

Xilai Li, Chusheng Fang, Xiaosong Li

AI Summary

FTPFusion addresses the challenge of maintaining temporal stability while preserving spatial detail in infrared and visible video fusion by using frequency decomposition and sparse cross-modal interaction. A high-frequency branch captures motion-related context, while a low-frequency branch uses temporal perturbation for robustness against video variations. The method also incorporates an offset-aware temporal consistency constraint to stabilize cross-frame representations, achieving state-of-the-art performance on public benchmarks.

Key Contribution

Achieves state-of-the-art infrared and visible video fusion by decoupling high-frequency detail preservation from low-frequency temporal stability.

Abstract

Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
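The two ideas named in the abstract, frequency decomposition and temporal perturbation of the low-frequency component, can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the abstract does not say how FTPFusion splits frequencies or what form the perturbation takes, and the method operates on learned feature representations rather than raw pixels. The sketch uses a box blur as the low-pass filter, takes the residual as the high-frequency part, and simulates flicker with a random gain and jitter or misalignment with a small random shift. The function names (`box_blur`, `split_frequencies`, `shift_frame`, `perturb_low_frequency`) and parameters are hypothetical.

```python
import random


def box_blur(frame, radius=1):
    """Low-pass filter: mean over a (2r+1) x (2r+1) window, clipped at the borders."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [frame[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out


def split_frequencies(frame, radius=1):
    """Return (low, high) with low = blur(frame) and high = frame - low."""
    low = box_blur(frame, radius)
    high = [[frame[y][x] - low[y][x] for x in range(len(frame[0]))]
            for y in range(len(frame))]
    return low, high


def shift_frame(frame, dy, dx):
    """Translate a frame by (dy, dx), replicating edge values."""
    h, w = len(frame), len(frame[0])
    return [[frame[min(max(y - dy, 0), h - 1)][min(max(x - dx, 0), w - 1)]
             for x in range(w)] for y in range(h)]


def perturb_low_frequency(low_frames, max_shift=1, gain_jitter=0.1, seed=0):
    """Apply a random shift (jitter/misalignment) and gain (flicker) per frame.

    Assumed perturbation model, not the one used in the paper.
    Returns the perturbed frames and the (dy, dx) offset applied to each.
    """
    rng = random.Random(seed)
    perturbed, offsets = [], []
    for frame in low_frames:
        dy = rng.randint(-max_shift, max_shift)
        dx = rng.randint(-max_shift, max_shift)
        gain = 1.0 + rng.uniform(-gain_jitter, gain_jitter)
        shifted = shift_frame(frame, dy, dx)
        perturbed.append([[gain * v for v in row] for row in shifted])
        offsets.append((dy, dx))
    return perturbed, offsets


# Toy 3-frame, 4x4 "video" with a pattern that drifts over time.
video = [[[float((x + y + t) % 4) for x in range(4)] for y in range(4)]
         for t in range(3)]
lows, highs = zip(*(split_frequencies(f) for f in video))
perturbed, offsets = perturb_low_frequency(list(lows))
```

Because the high-frequency part is defined as the residual, adding `low` and `high` back together recovers the original frame. The per-frame offsets are returned so that a consistency term could compensate for the known shift before comparing frames; how the paper's offset-aware constraint actually does this is not specified in the abstract.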

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
