DMSA-Net: a deformable multiscale adaptive classroom behavior recognition network

PeerJ Comput Sci. 2025 Apr 30:11:e2876. doi: 10.7717/peerj-cs.2876. eCollection 2025.

Abstract

In the intelligent transformation of education, accurate recognition of students' classroom behavior has become one of the key technologies for enhancing the quality of instruction and the efficacy of learning. However, in the recognition of target behavior in real classroom scenarios, due to the use of wide-angle or panoramic images for image acquisition, students in the back row are far away from monitoring devices, and their subtle body movements such as the small opening and closing of the mouth (to determine whether they are speaking), fine finger operations (to distinguish between reading books or operating mobile phones) are difficult to recognize. Moreover, there are occlusions and scale differences in the front and back rankings, which can easily cause confusion and interference with target features in the detection process, greatly limiting the accurate recognition ability of existing visual algorithms for classroom behavior. This article proposes a deformable multiscale adaptive classroom behavior recognition network. To improve the network's capacity to model minute behavioral phenomena, the backbone section introduces a deformable self-attention dattention module, dynamically modifying the receptive field's geometry to enhance the model's concentration on the region of interest. To improve the network's capacity for feature extraction and integration of behavior occlusion and classroom behavior at different scales, a proposal has been put forward the Multiscale Attention Feature Pyramid Structure (MSAFPS), to achieve multi-level feature aggregation after multiscale feature fusion, reducing the impact of mutual occlusion and scale differences in classroom behavior between front and back rows. In the detect section, we adopt the Wise Intersection Over Union (Wise-IoU) loss as our loss criterion, augmenting the evaluation framework with richer contextual cues to broaden its scope and elevate the network's detection prowess. Extensive experimentation reveals that our proposed method outperforms rival algorithms on two widely adopted benchmark datasets: SCB-Dataset3-S (the Student Classroom Behavior Dataset-https://github.com/Whiffe/SCB-dataset) and we created object detection dataset DataMountainSCB (https://github.com/Chunyu-Dong/DataFountainSCB1) containing six types of behaviors.

Keywords: Attention mechanism; Classroom behavior recognition; Feature fusion; IoU; Object detection.