Remote Sensing Scene Classification (RSSC) is an important and challenging research topic. Transformer-based methods have shown encouraging performance in capturing global dependencies. However, recent studies have revealed that Transformers perform poorly at capturing the high-frequency components that mainly convey local information. To address this problem, we propose a novel method based on a High-Frequency Enhanced Vision Transformer and Multi-Layer Context Learning (HETMCL), which effectively learns comprehensive high-frequency and low-frequency features from visual data. First, Convolutional Neural Networks (CNNs) extract low-level spatial structures, and the Adjacent Layer Feature Fusion Module (AFFM) reduces the semantic gaps between layers to enhance spatial context. Second, the High-Frequency Information Enhancement Vision Transformer (HFIE) includes a High-to-Low-Frequency Token Mixer (HLFTM) that captures high-frequency details. Finally, the Multi-Layer Context Alignment Attention (MCAA) integrates multi-layer features and their contextual relationships. On the UCM, AID, and NWPU datasets, HETMCL achieves state-of-the-art overall accuracy (OA) of 99.76%, 97.32%, and 95.02%, respectively, outperforming existing methods by up to 0.38%.
Keywords: convolutional neural network (CNN); remote sensing scene classification (RSSC); transformer.
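To make the described pipeline concrete, the following is a minimal PyTorch sketch of an HETMCL-style forward pass. The module names (AFFM, HLFTM, HFIE, MCAA) follow the abstract, but all internals below (the two-stage backbone, channel widths, the depthwise-conv/attention frequency split, and the pooled context tokens) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AFFM(nn.Module):
    """Adjacent Layer Feature Fusion Module (assumed form): upsample the
    deeper map, project both maps to a common width, and fuse by addition
    to narrow the semantic gap between adjacent CNN stages."""
    def __init__(self, c_shallow, c_deep, c_out):
        super().__init__()
        self.proj_shallow = nn.Conv2d(c_shallow, c_out, 1)
        self.proj_deep = nn.Conv2d(c_deep, c_out, 1)

    def forward(self, shallow, deep):
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.proj_shallow(shallow) + self.proj_deep(deep)


class HLFTM(nn.Module):
    """High-to-Low-Frequency Token Mixer (assumed form): a depthwise-conv
    branch models high-frequency local detail, while a self-attention
    branch models low-frequency global context."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.high = nn.Conv2d(dim // 2, dim // 2, 3, padding=1, groups=dim // 2)
        self.low = nn.MultiheadAttention(dim // 2, num_heads, batch_first=True)

    def forward(self, x):                    # x: (B, C, H, W)
        hi, lo = x.chunk(2, dim=1)
        hi = self.high(hi)                   # local (high-frequency) mixing
        b, c, h, w = lo.shape
        t = lo.flatten(2).transpose(1, 2)    # (B, H*W, C/2) token sequence
        t, _ = self.low(t, t, t)             # global (low-frequency) mixing
        lo = t.transpose(1, 2).reshape(b, c, h, w)
        return torch.cat([hi, lo], dim=1)


class MCAA(nn.Module):
    """Multi-Layer Context Alignment Attention (assumed form): pool each
    layer's map into one token, then align contexts across layers with
    attention over the resulting token sequence."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                # feats: list of (B, C, H, W)
        tokens = torch.stack([f.mean(dim=(2, 3)) for f in feats], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.mean(dim=1)               # fused (B, C) scene descriptor


class HETMCL(nn.Module):
    """Toy end-to-end model: CNN stages -> AFFM -> HFIE (HLFTM) -> MCAA."""
    def __init__(self, num_classes=45, dim=64):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, dim, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(dim, dim * 2, 3, 2, 1), nn.ReLU())
        self.affm = AFFM(dim, dim * 2, dim)
        self.hfie = HLFTM(dim)               # stands in for the HFIE block
        self.mcaa = MCAA(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        f1 = self.stage1(x)                  # low-level spatial structure
        f2 = self.stage2(f1)                 # deeper semantics
        fused = self.affm(f1, f2)            # adjacent-layer fusion
        enhanced = self.hfie(fused)          # high-frequency enhancement
        return self.head(self.mcaa([fused, enhanced]))


logits = HETMCL()(torch.randn(2, 3, 64, 64))
print(logits.shape)                          # torch.Size([2, 45])
```

The 45-way head mirrors the 45 scene classes of NWPU-RESISC45; `num_classes` would be set to 21 for UCM or 30 for AID.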