The use of large language models in detecting Chinese ultrasound report errors

Yuqi Yan; Kai Wang; Bojian Feng; Jincao Yao; Tian Jiang; Zhiyan Jin; Yin Zheng; Yahan Zhou; Chen Chen; Lin Sui; Xiayi Chen; Yanhong Du; Jie Yang; Qianmeng Pan; Lingyan Zhou; Vicky Yang Wang; Ping Liang; Dong Xu

doi:10.1038/s41746-025-01468-7

NPJ Digit Med. 2025 Jan 28;8(1):66. doi: 10.1038/s41746-025-01468-7.

Authors

Yuqi Yan^#¹,²,³,⁴,⁵,⁶, Kai Wang^#⁷, Bojian Feng^#¹,²,³,⁵, Jincao Yao¹,⁵, Tian Jiang¹, Zhiyan Jin¹,²,³,⁴,⁶, Yin Zheng¹, Yahan Zhou²,³, Chen Chen¹, Lin Sui¹,²,³,⁴,⁶, Xiayi Chen¹,²,³,⁵, Yanhong Du⁷, Jie Yang⁷, Qianmeng Pan⁸, Lingyan Zhou⁹, Vicky Yang Wang¹⁰,¹¹,¹²,¹³, Ping Liang¹⁴, Dong Xu¹⁵,¹⁶,¹⁷,¹⁸,¹⁹,²⁰

Affiliations

¹ Department of Diagnostic Ultrasound Imaging & Interventional Therapy, Zhejiang Cancer Hospital, Hangzhou, Zhejiang, China.
² Center of Intelligent Diagnosis and Therapy (Taizhou), Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Taizhou, Zhejiang, China.
³ Wenling Institute of Big Data and Artificial Intelligence Institute in Medicine, Taizhou, Zhejiang, China.
⁴ Taizhou Key Laboratory of Minimally Invasive Interventional Therapy & Artificial Intelligence, Taizhou Branch of Zhejiang Cancer Hospital (Taizhou Cancer Hospital), Taizhou, Zhejiang, China.
⁵ Interventional Medicine and Engineering Research Center, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, Zhejiang, China.
⁶ Postgraduate training base Alliance of Wenzhou Medical University, Hangzhou, Zhejiang, China.
⁷ Department of Ultrasound, The Affiliated Dongyang Hospital of Wenzhou Medical University, Dongyang, Zhejiang, China.
⁸ Department of Ultrasound, Taizhou Campus of Zhejiang Cancer Hospital (Taizhou Cancer Hospital), Taizhou, Zhejiang, China.
⁹ Department of Diagnostic Ultrasound Imaging & Interventional Therapy, Zhejiang Cancer Hospital, Hangzhou, Zhejiang, China. zhouly@zjcc.org.cn.
¹⁰ Center of Intelligent Diagnosis and Therapy (Taizhou), Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Taizhou, Zhejiang, China. wangyang@waiim.org.cn.
¹¹ Wenling Institute of Big Data and Artificial Intelligence Institute in Medicine, Taizhou, Zhejiang, China. wangyang@waiim.org.cn.
¹² Taizhou Key Laboratory of Minimally Invasive Interventional Therapy & Artificial Intelligence, Taizhou Branch of Zhejiang Cancer Hospital (Taizhou Cancer Hospital), Taizhou, Zhejiang, China. wangyang@waiim.org.cn.
¹³ Interventional Medicine and Engineering Research Center, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, Zhejiang, China. wangyang@waiim.org.cn.
¹⁴ Department of Ultrasound, Chinese PLA General Hospital, Chinese PLA Medical School, Beijing, China. liangping301@126.com.
¹⁵ Department of Diagnostic Ultrasound Imaging & Interventional Therapy, Zhejiang Cancer Hospital, Hangzhou, Zhejiang, China. xudong@zjcc.org.cn.
¹⁶ Center of Intelligent Diagnosis and Therapy (Taizhou), Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Taizhou, Zhejiang, China. xudong@zjcc.org.cn.
¹⁷ Wenling Institute of Big Data and Artificial Intelligence Institute in Medicine, Taizhou, Zhejiang, China. xudong@zjcc.org.cn.
¹⁸ Taizhou Key Laboratory of Minimally Invasive Interventional Therapy & Artificial Intelligence, Taizhou Branch of Zhejiang Cancer Hospital (Taizhou Cancer Hospital), Taizhou, Zhejiang, China. xudong@zjcc.org.cn.
¹⁹ Interventional Medicine and Engineering Research Center, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, Zhejiang, China. xudong@zjcc.org.cn.
²⁰ Department of Ultrasound, Taizhou Campus of Zhejiang Cancer Hospital (Taizhou Cancer Hospital), Taizhou, Zhejiang, China. xudong@zjcc.org.cn.

^# Contributed equally.

Abstract

This retrospective study evaluated the efficacy of large language models (LLMs) in improving the accuracy of Chinese ultrasound reports. Data from three hospitals (January-April 2024), comprising 400 reports with 243 errors across six categories, were analyzed. Three GPT versions and Claude 3.5 Sonnet were tested in zero-shot settings, with the top two models further assessed in few-shot scenarios. Six radiologists of varying experience levels performed error detection on a randomly selected test set. In the zero-shot setting, Claude 3.5 Sonnet and GPT-4o achieved the highest error detection rates (52.3% and 41.2%, respectively). In the few-shot setting, Claude 3.5 Sonnet outperformed senior and resident radiologists, while GPT-4o excelled in spelling error detection. LLMs processed reports faster than the quickest radiologist (Claude 3.5 Sonnet: 13.2 s, GPT-4o: 15.0 s, radiologist: 42.0 s per report). This study demonstrates the potential of LLMs to enhance ultrasound report accuracy, outperforming human experts in certain aspects.
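
The per-report timings quoted in the abstract can be turned into relative speed ratios with simple arithmetic. The sketch below is illustrative only: the dictionary and the helper name `speedup_over` are assumptions introduced here, not part of the study, and the only inputs are the three timings stated above.

```python
# Per-report processing times in seconds, as quoted in the abstract.
SECONDS_PER_REPORT = {
    "Claude 3.5 Sonnet": 13.2,
    "GPT-4o": 15.0,
    "fastest radiologist": 42.0,
}


def speedup_over(baseline: str, times: dict) -> dict:
    """Return baseline_time / entry_time for every entry except the baseline.

    Hypothetical helper for illustration; a value of 3.0 means the entry
    needs one third of the baseline's time per report.
    """
    base = times[baseline]
    return {
        name: round(base / seconds, 2)
        for name, seconds in times.items()
        if name != baseline
    }


ratios = speedup_over("fastest radiologist", SECONDS_PER_REPORT)
for name, ratio in ratios.items():
    print(f"{name}: about {ratio}x faster per report than the fastest radiologist")
```

On these figures, both models take roughly a third of the fastest radiologist's time per report. The ratios describe reading speed only and say nothing about detection accuracy.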