CTFS: A consolidated transformer framework for instance and semantic segmentation tasks

Neural Netw. 2025 Jun 24:191:107745. doi: 10.1016/j.neunet.2025.107745. Online ahead of print.

Abstract

Instance segmentation and semantic segmentation are fundamental tasks that support many computer vision applications. Recently, researchers have investigated the feasibility of constructing a unified transformer framework and leveraging multi-task learning techniques to optimize instance and semantic segmentation simultaneously. However, these methods learn the proportion and distribution of task-shared parameters concurrently during training, which makes the network difficult to optimize sufficiently. In addition, conventional gradient rectification algorithms address gradient conflicts from an overall, whole-vector perspective, and therefore fail to resolve conflicts among the individual elements of the gradient vectors. In this study, we develop a consolidated Transformer framework, CTFS, to address these issues. For the first issue, we introduce an affinity-guided sharing strategy (AGSS) that learns the proportion and distribution of task-shared parameters in two separate stages: the proportion learned first serves as prior knowledge to guide the subsequent learning of the distribution, reducing the difficulty of network optimization. For the second issue, we propose a fine-grained gradient rectification strategy (FGRS) that mitigates gradient conflicts element by element during backpropagation. Built upon the standard Swin Transformer without complicating its architecture, CTFS achieves impressive performance on both the COCO dataset for instance segmentation and the ADE20K dataset for semantic segmentation.
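To make the contrast between whole-vector and element-wise rectification concrete, the sketch below shows one possible element-wise rule. The abstract does not specify the exact FGRS update, so the rule here (sum agreeing elements, keep only the larger-magnitude element where signs conflict) is an illustrative assumption, not the paper's method; the function name `elementwise_rectify` is likewise hypothetical.

```python
import numpy as np

def elementwise_rectify(g1, g2):
    """Illustrative element-wise gradient rectification for two task
    gradients g1 and g2 (hypothetical rule; CTFS's actual FGRS update
    is not given in the abstract).

    Where the two gradients agree in sign, their elements are summed;
    where they conflict, only the larger-magnitude element is kept,
    so no single element is cancelled outright by the other task.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    agree = g1 * g2 >= 0                                   # per-element sign agreement
    dominant = np.where(np.abs(g1) >= np.abs(g2), g1, g2)  # larger-magnitude element
    return np.where(agree, g1 + g2, dominant)

# Example: elements 2 and 3 conflict in sign and are resolved
# individually, unlike a whole-vector projection step.
print(elementwise_rectify([1.0, -2.0, 3.0], [2.0, 1.0, -1.0]))  # [ 3. -2.  3.]
```

A whole-vector method such as gradient projection would instead test a single inner product between g1 and g2 and rectify the entire vector at once, which is the coarse behavior the abstract argues against.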

Keywords: Gradient conflicts; Instance segmentation; Multi-task learning; Semantic segmentation.