Deformation-Resilient Multigranularity Learning for Unaligned RGB-T Semantic Segmentation

IEEE Trans Neural Netw Learn Syst. 2025 Jul 8:PP. doi: 10.1109/TNNLS.2025.3585105. Online ahead of print.

Abstract

RGB-Thermal (RGB-T) semantic segmentation (SS) aims to combine visible-light and thermal images to determine the semantic category of each pixel and create an object mask. While existing methods typically rely on well-aligned RGB-T image pairs, real-world RGB-T pairs are often unaligned, and pixel-by-pixel alignment is both challenging and time-consuming. To address this critical issue, we introduce a new unaligned RGB-T SS benchmark and propose the deformation-resilient multigranularity learning (DML) method. DML explores the spatial consistency and modal complementarity of RGB-T and mitigates the interference of warped modalities by aligning multimodal features with a coarse-to-fine multigranularity strategy. Specifically, DML constructs a deformation-aware complementary feature enhancer (DCFE), which consists of deformation-aware feature alignment (DFA) and complementary feature aggregation (CFA) modules. DFA enhances the spatial alignment of RGB-T by estimating the deformation field of warped features. Then, CFA aggregates complementary contexts of modal differences across multiple scales to produce deformation-resilient and robust RGB-T feature representations. Finally, we design the multigranularity mask refinement engine (MMFE), which combines class-agnostic saliency prediction (CSP) and class-aware edge generation (CEG) auxiliary tasks to provide useful boundary and positional cues for SS decoders. The MMFE enhances semantic alignment and interclass separability, yielding object masks with sharp boundaries. Quantitative and qualitative experiments on aligned and unaligned datasets validate the effectiveness of the proposed DML, which consistently outperforms existing methods designed for aligned RGB-T data. The new unaligned RGB-T SS benchmark and code are available at https://github.com/VisionVerse/Unaligned-RGBT-Semantic-Segmentation.
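
To make the idea of aligning features with a deformation field concrete, the following is a minimal NumPy sketch of the generic warping step only. It is not the authors' DFA module: in DML the deformation field is estimated from the features, whereas here it is simply passed in. The function name `warp_features`, the `(2, H, W)` pixel-offset layout of `flow`, and the border clamping are all assumptions made for illustration.

```python
import numpy as np


def warp_features(feat, flow):
    """Bilinearly resample feat (C, H, W) at positions shifted by flow (2, H, W).

    Hypothetical helper, not taken from the paper's code.
    flow[0] holds x offsets and flow[1] holds y offsets, both in pixels.
    Sample positions that fall outside the map are clamped to the border.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Absolute sampling coordinates, clamped to the valid range.
    x = np.clip(xs + flow[0], 0, w - 1)
    y = np.clip(ys + flow[1], 0, h - 1)

    # Integer corners surrounding each sampling position.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional parts act as bilinear interpolation weights.
    wx = x - x0
    wy = y - y0

    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


# Toy misalignment: the "thermal" map is the "RGB" map shifted 2 pixels right.
rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((3, 6, 8))
thermal_feat = np.roll(rgb_feat, 2, axis=2)

# A constant field of +2 pixels in x undoes the shift (given here, not estimated).
flow = np.zeros((2, 6, 8))
flow[0] = 2.0
aligned = warp_features(thermal_feat, flow)
```

In this toy case the warped thermal map matches the RGB map everywhere except the two rightmost columns, where the sampling positions are clamped to the border. A learned, spatially varying field would take the place of the constant `flow`.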