Bridging language gaps: The role of NLP and speech recognition in oral English instruction

Parul Dubey; Pushkar Dubey; Rohit Raja; Sapna Singh Kshatri

doi:10.1016/j.mex.2025.103359

MethodsX. 2025 May 7:14:103359. doi: 10.1016/j.mex.2025.103359. eCollection 2025 Jun.

Authors

Parul Dubey¹, Pushkar Dubey², Rohit Raja³, Sapna Singh Kshatri⁴

Affiliations

¹ Symbiosis Institute of Technology, Nagpur Campus, Symbiosis International (Deemed University), Pune, India.
² Department of Management, Pandit Sundarlal Sharma (Open) University Chhattisgarh, India.
³ Department of Information Technology, School of Studies (Engineering and Technology), Guru Ghasidas Vishwavidyalaya (A Central University), India.
⁴ Department of Artificial Intelligence, Shri Shankaracharya Institute of Professional Management and Technology, Raipur, India.

Abstract

Natural Language Processing (NLP) and speech recognition have transformed language learning by providing interactive, real-time feedback that enhances oral English proficiency. These technologies facilitate personalized and adaptive learning, making pronunciation and fluency improvement more efficient. Traditional methods lack real-time speech assessment and individualized feedback, limiting learners' progress. Existing speech recognition models struggle with diverse accents, variations in speaking styles, and computational efficiency, reducing their effectiveness in real-world applications. This study uses three datasets (a custom dataset of 882 English teachers, the CMU ARCTIC corpus, and LibriSpeech Clean) to ensure generalizability and robustness. The methodology integrates Hidden Markov Models (HMMs) for speech recognition, NLP-based text analysis, and computer vision-based lip movement detection to create an adaptive multimodal learning system. The novelty of this study lies in its real-time Bayesian feedback mechanism and its multimodal integration of audio, visual, and textual data, enabling dynamic and personalized oral instruction. Performance is evaluated using recognition accuracy, processing speed, and statistical significance testing. The continuous HMM model achieves up to 97.5% accuracy and significantly outperforms existing models such as MLP-LSTM and GPT-3.5-turbo (p < 0.05) across all datasets.

• Developed a multimodal system that combines speech, text, and visual data to enhance real-time oral English learning.
• Collected and annotated a diverse dataset of English speech recordings from teachers across various accents and speaking styles.
• Designed an adaptive feedback framework to provide learners with immediate, personalized insights into their pronunciation and fluency.

Keywords: Deep learning; Face multimodal natural language processing (NLP); Hidden Markov models (HMMs); Multimodal NLP and speech recognition; Oral English instruction; Speech recognition.
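
Illustrative sketch (not the authors' implementation): the abstract names a continuous HMM as the recognizer but gives no details, so the Python below only shows the general idea of scoring an observation sequence against competing continuous-density HMMs with the forward algorithm and picking the best match. Everything specific here is an assumption made for illustration: the one-dimensional Gaussian emissions (real systems typically use multi-dimensional acoustic features such as MFCCs), the two-state left-to-right topology, the model names `model_a` and `model_b`, and all parameter values.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    # log(0) is treated as -inf so forbidden transitions are handled cleanly.
    return math.log(p) if p > 0 else NEG_INF

def log_gauss(x, mean, var):
    # Log density of a 1-D Gaussian emission.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) over a list of log values.
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_loglik(obs, start, trans, means, variances):
    # Forward algorithm in the log domain.
    # alpha[j] = log P(o_1..o_t, state_t = j)
    n = len(start)
    alpha = [safe_log(start[j]) + log_gauss(obs[0], means[j], variances[j])
             for j in range(n)]
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + safe_log(trans[i][j]) for i in range(n)])
            + log_gauss(x, means[j], variances[j])
            for j in range(n)
        ]
    return logsumexp(alpha)

# Hypothetical toy models: two-state, left-to-right, unit variance.
MODELS = {
    "model_a": dict(start=[1.0, 0.0], trans=[[0.7, 0.3], [0.0, 1.0]],
                    means=[0.0, 2.0], variances=[1.0, 1.0]),
    "model_b": dict(start=[1.0, 0.0], trans=[[0.7, 0.3], [0.0, 1.0]],
                    means=[5.0, 7.0], variances=[1.0, 1.0]),
}

def classify(obs):
    # Return the best-scoring model name and all log-likelihoods.
    scores = {name: forward_loglik(obs, **params)
              for name, params in MODELS.items()}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    label, scores = classify([0.1, 0.3, 1.9, 2.2])
    print(label, scores)
```

Working in the log domain avoids numerical underflow on long sequences, which matters for frame-level speech features. A practical recognizer would also train the model parameters from data (for example with Baum-Welch) rather than fixing them by hand as done here.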