DeepSeek vs ChatGPT: a comparison study of their performance in answering prostate cancer radiotherapy questions in multiple languages

Am J Clin Exp Urol. 2025 Apr 25;13(2):176-185. doi: 10.62347/UIAP7979. eCollection 2025.

Abstract

Introduction: The medical information generated by large language models (LLMs) is crucial for improving patient education and clinical decision-making. This study evaluates the performance of two LLMs, DeepSeek and ChatGPT, in answering questions about prostate cancer radiotherapy in both Chinese and English. Through comparative analysis, we sought to determine which model provides higher-quality answers in each language environment.

Methods: A structured evaluation framework was developed using a set of clinically relevant questions covering three key domains: foundational knowledge, patient education, and treatment and follow-up care. Responses from DeepSeek and ChatGPT were generated in both English and Chinese and independently assessed by a panel of five oncology specialists using a five-point Likert scale. Statistical analyses, including the Wilcoxon signed-rank test, were performed to compare the models' performance across different linguistic contexts.
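As a minimal sketch of the paired comparison described above (assuming the analysis is run in Python with SciPy; the abstract does not name the software, and the scores and variable names below are hypothetical illustrations, not study data), the panel's Likert ratings for the two models, paired by question, can be compared with a Wilcoxon signed-rank test:

```python
# Hedged sketch: Wilcoxon signed-rank test on paired Likert ratings.
# The abstract does not specify the software used; SciPy is assumed here,
# and the example scores are hypothetical, not study data.
from scipy.stats import wilcoxon

# Hypothetical panel ratings (1-5) for the same questions, paired by
# question: one rating per model per question.
deepseek_scores = [5, 5, 4, 5, 3, 5, 4, 5, 5, 4]
chatgpt_scores  = [4, 5, 3, 4, 3, 5, 4, 3, 4, 4]

# Signed-rank test on the paired differences; zero_method="pratt" keeps
# tied ratings (zero differences) in the ranking instead of dropping them.
stat, p_value = wilcoxon(deepseek_scores, chatgpt_scores,
                         zero_method="pratt", alternative="two-sided")
print(f"W = {stat:.1f}, P = {p_value:.3f}")
```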

Results: A total of 33 questions were included for scoring. In Chinese, DeepSeek outperformed ChatGPT, achieving top ratings (score = 5) in 75.76% vs. 36.36% of responses (P < 0.001), and excelled in foundational knowledge (76.92% vs. 38.46%, P = 0.047) and treatment/follow-up (81.82% vs. 36.36%, P = 0.031). In English, ChatGPT performed comparably (66.67% vs. 54.55% top-rated responses, P = 0.236), with a marginal advantage in treatment/follow-up (63.64% vs. 54.55%, P = 0.563). DeepSeek retained strengths in English foundational knowledge (69.23% vs. 30.77%, P = 0.047) and patient education (88.89% vs. 55.56%, P = 0.125). These findings underscore DeepSeek's superior Chinese-language proficiency and the impact of language-specific optimization.

Conclusions: This study shows that DeepSeek performs strongly in providing Chinese medical information, while the two models perform similarly in English. These findings highlight the importance of selecting language-appropriate artificial intelligence (AI) models to improve the accuracy and reliability of medical AI applications. Although both models show promise for supporting patient education and clinical decision-making, human expert review remains necessary to ensure response accuracy and minimize potential misinformation.

Keywords: Artificial intelligence; ChatGPT; DeepSeek; prostate cancer; radiotherapy.