Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

Shih-Yi Lin; Chang-Cheng Jiang; Kin-Man Law; Pei-Chun Yeh; Min-Kuang Tsai; Chu-Lin Chou; I-Kuan Wang; I-Wen Ting; Yu-Wei Chen; Che-Yi Chou; Ming-Han Hsieh; Heng-Chih Pan; Sung-Lin Hsieh; Chien-Hua Chiu; Pei-Wen Lee; Yu-Cyuan Hong; Ying-Yu Hsu; Huey-Liang Kuo; Shu-Woei Ju; Chia-Hung Kao

doi:10.1177/20552076251342067

Digit Health. 2025 Jun 2:11:20552076251342067. doi: 10.1177/20552076251342067. eCollection 2025 Jan-Dec.

Authors

Shih-Yi Lin¹,², Chang-Cheng Jiang¹,², Kin-Man Law³, Pei-Chun Yeh⁴, Min-Kuang Tsai⁵, Chu-Lin Chou⁶, I-Kuan Wang¹,², I-Wen Ting¹,², Yu-Wei Chen⁵, Che-Yi Chou⁷, Ming-Han Hsieh¹,², Heng-Chih Pan⁸, Sung-Lin Hsieh¹,², Chien-Hua Chiu⁹, Pei-Wen Lee¹⁰, Yu-Cyuan Hong¹,², Ying-Yu Hsu¹¹, Huey-Liang Kuo¹,², Shu-Woei Ju¹,², Chia-Hung Kao¹,⁴,¹²,¹³

Affiliations

¹ Graduate Institute of Biomedical Sciences, College of Medicine, China Medical University, Taichung, Taiwan.
² Division of Nephrology and Kidney Institute, China Medical University Hospital, Taichung, Taiwan.
³ Department of Computer Science and Engineering, National Chung Hsing University, Taichung, Taiwan.
⁴ Artificial Intelligence Center, China Medical University Hospital, Taichung, Taiwan.
⁵ Division of Nephrology, Department of Internal Medicine, Shuang Ho Hospital, Taipei Medical University, Taipei, Taiwan.
⁶ Division of Nephrology, Department of Internal Medicine, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan.
⁷ Department of Nephrology, Asia University Hospital, Taichung, Taiwan.
⁸ Department of Nephrology, Keelung Chang Gung Memorial Hospital, Keelung, Taiwan.
⁹ Division of Nephrology, Department of Internal Medicine, Kaohsiung Chang Gung Memorial Hospital and Chang Gung University College of Medicine, Kaohsiung, Taiwan.
¹⁰ Department of Nephrology, Hong Deh Clinic, Taipei, Taiwan.
¹¹ National Changhua Senior High School, Changhua, Taiwan.
¹² Department of Nuclear Medicine and PET Center, China Medical University Hospital, Taichung, Taiwan.
¹³ Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan.

Abstract

Importance: This study compares the responses of four AI models to common nephrology-related questions encountered in clinical settings.

Objective: To evaluate generative AI models in enhancing nephrology patient communication and education.

Design: Generative AI in Nephrology.

Setting: In a study conducted from December 8-12, 2023, and October 21-23, 2024, IT engineers evaluated GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 for nephrology patient communication and education, querying each with 21 nephrology questions and three renal biopsy reports, repeated for consistency.

Interventions for clinical trials or exposures for observational studies: None.

Main outcomes and measures: Fifteen nephrologists and one nephrology researcher rated responses for Appropriateness, Helpfulness, Consistency, and human-like empathy on a rating scale of 1-4. Analyses used Shapiro-Wilk and Mann-Whitney U tests with Holm correction, along with TF-IDF, BERTScore, and ROUGE. The study compared the performance of GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 across 24 nephrology-related questions.
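As a rough illustration of the Holm correction named above, the following minimal Python sketch adjusts a set of p-values from several pairwise comparisons. It is not the authors' analysis code: the function name `holm_adjust` and the example p-values are hypothetical, and the p-values would in practice come from the Mann-Whitney U tests.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise model comparisons.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
```

A comparison would then be called significant at level 0.05 when its adjusted p-value is below 0.05.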

Results: GPT-4o consistently achieved high scores in Appropriateness (3.39 ± 0.7) and Helpfulness (3.24 ± 0.73), while PaLM 2 demonstrated the highest consistency score (3.0 ± 0.86). In empathy, GPT-4 achieved the highest overall score (80.73%), excelling in patient-centric scenarios, followed by GPT-4o (76.56%). PaLM 2 showed competitive empathy in specific cases, despite scoring lower in consistency and Appropriateness. For Kidney-Related Queries, GPT-4o excelled in relevance metrics, achieving the highest BERTScore (0.57) and ROUGE for one-word metrics (0.54). Gemini 1.0 Ultra led in generating coherent responses for Renal Biopsy Reports, with the highest TF-IDF (0.56) and ROUGE for longest similar sentences (0.47). All 101 references provided by GPT-4 were accurate.
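The two ROUGE figures above appear to correspond to what is commonly called ROUGE-1 (single-word overlap) and ROUGE-L (longest common subsequence); that mapping is an assumption, as is everything in the sketch below. It uses plain whitespace tokenization and an F1 score, which may differ from the tooling and settings used in the study, and the function names are illustrative only.

```python
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def f1(overlap, cand_len, ref_len):
    # Harmonic mean of precision (overlap/cand_len) and recall (overlap/ref_len).
    if overlap == 0 or cand_len == 0 or ref_len == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    # Clipped unigram overlap between candidate and reference.
    c, r = tokenize(candidate), tokenize(reference)
    overlap = sum((Counter(c) & Counter(r)).values())
    return f1(overlap, len(c), len(r))


def lcs_length(a, b):
    # Dynamic-programming length of the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    # F1 over the longest common subsequence of tokens.
    c, r = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(c, r), len(c), len(r))
```

Scores range from 0 (no shared tokens) to 1 (identical token sequences), so values such as 0.54 or 0.47 indicate partial lexical overlap with the reference text.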

Conclusions and relevance: GPT-4o emerged as the most accurate and consistent model across most evaluation categories, while GPT-4 demonstrated superior empathy and balanced performance. PaLM 2 and Gemini 1.0 Ultra showed strengths in specific areas, highlighting the potential for tailored applications of generative AI in nephrology clinical practice.

Keywords: GPT-4; GPT-4o; Gemini 1.0 Ultra; Generative AI; PaLM 2; nephrology.