Person re-identification (Re-ID) aims to match individuals across different camera views under varying conditions such as illumination, pose, and background, a task critical for surveillance and security applications. Traditional Re-ID systems operate solely in the visible spectrum, which limits their effectiveness under poor lighting and at night. Leveraging the visible-infrared (VIS-IR) domain overcomes this limitation, as infrared imaging provides reliable information in low-light and night-time environments. However, integrating the visible (VIS) and infrared (IR) modalities introduces significant cross-modality discrepancies, posing a major challenge for feature alignment and fusion. To address this, we propose NiCTRAM, a Nyströmformer-based Cross-Modality Transformer for robust VIS-IR person re-identification. Our framework first extracts hierarchical features from RGB and IR images through a shared convolutional neural network (CNN) backbone, preserving modality-specific characteristics. These features are then processed by parallel Nyströmformer encoders, which capture long-range dependencies in linear time using lightweight self-attention. To bridge the modality gap, a cross-attention fusion block lets RGB and IR features interact and integrates second-order covariance statistics to model higher-order correlations. The fused features are refined through projection layers and optimized for re-identification with a classification head. Extensive experiments on benchmark VIS-IR person Re-ID datasets show that NiCTRAM achieves state-of-the-art accuracy and robustness by effectively addressing the cross-modality challenges inherent in VIS-IR Re-ID. On the SYSU-MM01 dataset, it surpasses the previous state of the art by 4.21% in Rank-1 accuracy and 2.79% in mAP in the all-search single-shot mode, with similar gains in the multi-shot setting. NiCTRAM also outperforms existing methods on RegDB and LLCM, achieving up to 5.90% higher Rank-1 accuracy and 5.83% higher mAP in the Thermal-to-Visible mode. We will make the code and the model available at https://github.com/Ranjitkm2007/NiCTRAM.
Keywords: Cross-Modality Fusion; Higher Order Attention; Nyströmformer; Person Re-identification.
© 2025. The Author(s).
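
As a concrete illustration of the pipeline summarized in the abstract, the following PyTorch sketch wires together a shared CNN backbone, parallel Nyström-style encoders, cross-attention fusion with second-order covariance pooling, and an identity classification head. It is a minimal sketch under our own assumptions (ResNet-50 backbone, 256-dimensional tokens, 16 landmarks, a Moore-Penrose pseudo-inverse in place of the iterative approximation used in the original Nyströmformer, and a dataset-dependent number of identity classes); it is not the released implementation.

```python
# Minimal sketch of a NiCTRAM-style pipeline. All hyper-parameters and layer
# choices below are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


def nystrom_attention(q, k, v, num_landmarks=16):
    """Nystrom approximation of softmax attention, linear in sequence length."""
    b, n, d = q.shape
    scale = d ** -0.5
    # Landmarks as segment means of queries/keys (n must be divisible by num_landmarks).
    q_l = q.reshape(b, num_landmarks, n // num_landmarks, d).mean(dim=2)
    k_l = k.reshape(b, num_landmarks, n // num_landmarks, d).mean(dim=2)
    kernel_1 = F.softmax(q @ k_l.transpose(-1, -2) * scale, dim=-1)    # (b, n, m)
    kernel_2 = F.softmax(q_l @ k_l.transpose(-1, -2) * scale, dim=-1)  # (b, m, m)
    kernel_3 = F.softmax(q_l @ k.transpose(-1, -2) * scale, dim=-1)    # (b, m, n)
    # Pseudo-inverse; the original Nystromformer uses an iterative approximation.
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ v)


class NystromEncoder(nn.Module):
    """One Nystrom self-attention block followed by a feed-forward layer."""
    def __init__(self, dim=256, num_landmarks=16):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.num_landmarks = num_landmarks

    def forward(self, x):
        q, k, v = self.to_qkv(self.norm1(x)).chunk(3, dim=-1)
        x = x + nystrom_attention(q, k, v, self.num_landmarks)
        return x + self.ffn(self.norm2(x))


class CrossModalFusion(nn.Module):
    """Cross-attention between RGB and IR tokens, then second-order (covariance) pooling."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim * dim, dim)

    def forward(self, rgb, ir):
        fused, _ = self.attn(query=rgb, key=ir, value=ir)        # RGB tokens attend to IR tokens
        fused = fused - fused.mean(dim=1, keepdim=True)
        cov = fused.transpose(1, 2) @ fused / fused.shape[1]     # (b, d, d) covariance statistics
        return self.proj(cov.flatten(1))                          # (b, d) fused embedding


class NiCTRAMSketch(nn.Module):
    def __init__(self, num_ids=395, dim=256):                     # num_ids is dataset-dependent
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # shared CNN, (b, 2048, h, w)
        self.embed = nn.Conv2d(2048, dim, kernel_size=1)
        self.enc_rgb = NystromEncoder(dim)
        self.enc_ir = NystromEncoder(dim)
        self.fusion = CrossModalFusion(dim)
        self.classifier = nn.Linear(dim, num_ids)                 # identity classification head

    def tokens(self, x):
        f = self.embed(self.backbone(x))                          # (b, dim, h, w)
        return f.flatten(2).transpose(1, 2)                       # (b, h*w, dim)

    def forward(self, rgb, ir):
        feat = self.fusion(self.enc_rgb(self.tokens(rgb)), self.enc_ir(self.tokens(ir)))
        return feat, self.classifier(feat)


if __name__ == "__main__":
    model = NiCTRAMSketch()
    rgb, ir = torch.randn(2, 3, 256, 128), torch.randn(2, 3, 256, 128)
    feat, logits = model(rgb, ir)
    print(feat.shape, logits.shape)   # torch.Size([2, 256]) torch.Size([2, 395])
```

In this sketch the covariance pooling inside CrossModalFusion is what injects second-order statistics into the fused representation; the remaining components follow standard encoder and classifier conventions.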