Data augmentation techniques are widely used in speech emotion recognition to increase data diversity and improve model performance. However, existing research has not thoroughly examined how these techniques affect emotional data: an ill-suited augmentation algorithm can distort emotional labels and thereby degrade model performance. To address this issue, in this paper we systematically evaluate the influence of common data augmentation algorithms on emotion recognition along three dimensions: (1) we design subjective listening experiments to directly assess how augmentation algorithms alter the emotional expression of speech; (2) we extract multi-dimensional spectral features from spectrograms using the Librosa library and analyze, through heatmap visualization, how augmentation algorithms change the spectral characteristics of speech signals; and (3) we objectively evaluate the model's recognition performance with metrics such as cross-entropy loss and apply statistical significance tests to verify the effectiveness of the augmentation algorithms. The experimental results show that "time stretching" can distort speech features, shift the attribution of emotional labels, and significantly reduce model accuracy. In contrast, "reverberation" (room impulse response, RIR) and "resampling" within a limited range have the least impact on emotional data while enriching sample diversity, and their combination improves accuracy by up to 7.1%, providing a basis for optimizing data augmentation strategies.
Keywords: cross-entropy loss; data augmentation; spectrogram; speech emotion recognition.
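As a concrete illustration of the augmentation operations named above, the following is a minimal Librosa/SciPy sketch, not the paper's exact pipeline: the stretch rate, intermediate sampling rate, file paths, and RIR recording are illustrative assumptions.

```python
import numpy as np
import librosa
import scipy.signal

# Load an utterance (placeholder path, not from the paper).
y, sr = librosa.load("utterance.wav", sr=16000)

# Time stretching: a rate != 1.0 changes speaking tempo and, per the
# results above, can distort emotional cues in the signal.
y_stretched = librosa.effects.time_stretch(y, rate=1.2)

# Resampling within a limited range (e.g., 16 kHz -> 12 kHz and back).
y_down = librosa.resample(y, orig_sr=sr, target_sr=12000)
y_resampled = librosa.resample(y_down, orig_sr=12000, target_sr=sr)

# Reverberation: convolve with a room impulse response (RIR) recording.
rir, _ = librosa.load("rir.wav", sr=sr)  # placeholder RIR file
y_reverb = scipy.signal.fftconvolve(
    y, rir / np.abs(rir).max(), mode="full"
)[: len(y)]

# Spectral features for heatmap-style inspection (dimension 2 above).
mel = librosa.feature.melspectrogram(y=y_reverb, sr=sr)
mel_db = librosa.power_to_db(mel, ref=np.max)
```

Comparing `mel_db` heatmaps of the original and augmented signals is one way to visualize how each operation perturbs the spectral features, in the spirit of the analysis described in the abstract.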