S4Fusion: Saliency-Aware Selective State Space Model for Infrared and Visible Image Fusion

IEEE Trans Image Process. 2025:34:4161-4175. doi: 10.1109/TIP.2025.3583132.

Abstract

Preserving and enhancing complementary features across modalities is crucial for multi-modal image fusion and downstream vision tasks. However, existing methods are either restricted to local receptive fields (CNNs) or fail to fully exploit spatial information from both modalities during cross-modal interaction (Transformers), so they cannot effectively weigh the two modalities against each other and retain the useful information from each. Consequently, the fused images may be biased toward one modality and fail to adaptively preserve salient targets from all sources. To address this, we propose S4Fusion, a novel fusion framework based on a saliency-aware selective state space model. S4Fusion introduces the Cross-Modal Spatial Awareness Module (CMSA), which simultaneously captures global spatial information from all input modalities and promotes effective cross-modal interaction, enabling a more comprehensive representation of complementary features. Furthermore, to guide the model to adaptively preserve salient objects, we propose a novel perception-enhanced loss function. This loss encourages the retention of salient features by minimizing the ambiguity, or uncertainty, measured at a pre-trained model's decision layer on the fused images. The code is available at https://github.com/zipper112/S4Fusion.
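The abstract's perception-enhanced loss penalizes the ambiguity a pre-trained model exhibits when shown the fused image. The exact formulation is defined in the paper; the sketch below only illustrates the general idea, using Shannon entropy over an assumed classifier's decision-layer logits as the uncertainty measure (the function names and the three-class logits are illustrative, not from the paper).

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the decision-layer logits.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decision_entropy(logits):
    """Mean Shannon entropy of the decision-layer distribution.

    Lower entropy corresponds to less ambiguity about what the
    pre-trained model perceives in the fused image; a loss built on
    this quantity would be minimized during fusion training.
    """
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

# A confident (low-ambiguity) prediction vs. a maximally uncertain one.
confident = np.array([[8.0, 0.0, 0.0]])
uncertain = np.array([[1.0, 1.0, 1.0]])
```

Under this reading, fused images that let the pre-trained model make confident decisions about salient targets incur a smaller penalty than ones that leave it uncertain.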