Deep learning-based in-ambulance speech recognition and generation of prehospital emergency diagnostic summaries using LLMs

Chen Chen; Yingying Hu; Wenwei Cai; Huibin Pan; Meihong Shen; Yujie Zhai; Shanhui Wu; Qunyi Zhou; Yi Guo

doi:10.1016/j.ijmedinf.2025.106029

Int J Med Inform. 2025 Jul 7:203:106029. doi: 10.1016/j.ijmedinf.2025.106029. Online ahead of print.

Authors

Chen Chen¹, Yingying Hu², Wenwei Cai³, Huibin Pan⁴, Meihong Shen⁵, Yujie Zhai⁵, Shanhui Wu⁶, Qunyi Zhou¹, Yi Guo⁷

Affiliations

¹ School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, China.
² Zhejiang Health Information Center, Hangzhou, China.
³ Department of Emergency Medicine, Zhejiang Provincial People's Hospital, Hangzhou, China; Zhejiang Emergency Command Center, Hangzhou, China.
⁴ Department of Emergency Medicine, The First People's Hospital of Huzhou, Huzhou, China.
⁵ Huzhou Emergency Medical Center, Huzhou, China.
⁶ Yuhang Branch of Hangzhou Emergency Medical Center, Hangzhou, China.
⁷ Zhejiang Health Information Center, Hangzhou, China. Electronic address: guoy@zjwjw.gov.cn.

PMID: 40639122

Abstract

Objective: The timely and accurate submission of prehospital electronic medical records is crucial for the efficiency of medical rescue operations. However, the completion rate of these records is often affected by personnel's professional experience, training cycles, and environmental conditions. This study proposes integrating noise-robust speech recognition technology with large language models (LLMs) to generate emergency diagnosis summaries. This approach aims to help medical personnel quickly document key patient information, streamlining the emergency response process.

Methods: A joint training model combining speech enhancement and recognition was proposed, with LLMs incorporated to generate emergency diagnosis summaries. The model was trained in two rounds using actual ambulance noise data, environmental noise data, and open-source speech datasets. It optimized Connectionist Temporal Classification (CTC) and attention losses through deep feature extraction and a selective attention mechanism. The study also analyzed the impact of different prompt designs on the quality of LLM-generated summaries. Tukey HSD and Holm correction methods were used for multiple comparisons of three subjective evaluation metrics across three prompts and three models, assessing the statistical significance of each factor's influence on the generated summaries.
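As an illustration of the Holm correction named in the Methods, the following is a minimal sketch of the standard Holm step-down adjustment in plain Python. It is not the authors' analysis code; the function name `holm_adjust` and the example p-values are hypothetical and chosen only for demonstration.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values for three pairwise comparisons (not study data).
    raw = [0.01, 0.04, 0.03]
    for p, q in zip(raw, holm_adjust(raw)):
        print(f"raw={p:.3f} holm={q:.3f}")
```

A comparison would be declared significant at level alpha when its adjusted p-value is below alpha. Holm's procedure controls the family-wise error rate and is never less powerful than the plain Bonferroni correction.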

Results: The proposed speech recognition model reduced the character error rate in real-world ambulance noise recordings to 52.92%, outperforming several comparative speech recognition models. Under the Stylized Prompt condition, the Qwen2.5-7B-Instruct model outperformed the other models on the subjective accuracy and relevance metrics and reduced the average completion time for prehospital electronic medical records from 20 min to 14 min.
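For readers unfamiliar with the metric, character error rate is conventionally computed as the character-level edit distance between the reference transcript and the recognizer output, divided by the reference length. The sketch below shows that conventional definition in plain Python; the abstract does not specify the authors' exact scoring or text normalization, and the function names and example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcript pair (not taken from the study).
    reference = "chest pain"
    hypothesis = "chest pan"
    print(f"CER = {character_error_rate(reference, hypothesis):.2%}")
```

Because insertions are counted against the reference length, CER can exceed 100% when the hypothesis is much longer than the reference.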

Conclusion: Using noise-robust speech recognition combined with LLMs to generate emergency diagnosis summaries improves documentation efficiency and medical record completion. This approach shows broad application potential in emergency care and could be extended to quality evaluation, disease prediction, and risk assessment.

Keywords: Deep learning; Large language models; Speech enhancement; Speech recognition.