Objectives: Use a large language model (LLM) to examine the content and quality of narrative feedback provided to residents through: (1) an app collecting workplace-based assessments of surgical performance (SIMPL-OR), (2) Objective Structured Assessment of Technical Skills (OSATS), and (3) end-of-rotation (EOR) evaluations.
Methods: Narrative feedback provided to residents at a single institution from 2017 to 2021 was examined. Sixty entries (20 of each format) were evaluated by two faculty members on whether they were encouraging, corrective, or specific, and whether they addressed the Core Competencies outlined by the Accreditation Council for Graduate Medical Education. ChatGPT-4o was validated against these 60 faculty-rated entries before being applied to the remaining 776 entries.
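The abstract does not describe the prompt, model settings, or software used for the LLM classification or for the agreement analysis. Purely as a hedged illustration of the general workflow (model labeling of a feedback entry, then agreement with faculty raters via Cohen's kappa), a minimal sketch is shown below; the prompt wording, label set, and the classify_entry helper are hypothetical and are not taken from the study.

```python
# Illustrative sketch only: the study's actual prompt, model configuration,
# and pipeline are not reported in the abstract. Assumes the openai and
# scikit-learn packages are installed and OPENAI_API_KEY is set.
import json
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()

def classify_entry(entry_text: str) -> dict:
    """Ask GPT-4o whether a feedback entry is encouraging, corrective, and/or specific.

    Hypothetical prompt; the study's actual instructions to the model are not given.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You rate narrative feedback given to surgical residents. "
                    "Return a JSON object with boolean fields: "
                    "encouraging, corrective, specific."
                ),
            },
            {"role": "user", "content": entry_text},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

def agreement(faculty_labels: list[int], model_labels: list[int]) -> float:
    """Cohen's kappa between faculty and model ratings for one attribute (0/1 labels)."""
    return cohen_kappa_score(faculty_labels, model_labels)
```

In a setup like this, the 60 faculty-rated entries would serve as the validation set: each entry is classified once by the model, and agreement is summarized per attribute before the model is run on the remaining entries.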
Results: ChatGPT's ratings showed 90% concordance with faculty ratings (κ = 0.94). Among the 776 feedback entries evaluated by ChatGPT, competencies addressed included: patient care (n = 491, 97% vs. 77% vs. 36% for SIMPL-OR, OSATS, EOR respectively, p < 0.001), practice-based learning (n = 175, 32% vs. 23% vs. 16%, p < 0.001), professionalism (n = 168, 1% vs. 6% vs. 40%, p < 0.001), medical knowledge (n = 95, 7% vs. 8% vs. 17%, p < 0.001), interpersonal and communication skills (n = 59, 3% vs. 3% vs. 12%, p < 0.001), and systems-based practice (n = 31, 4% vs. 2% vs. 5%, p = 0.387). Feedback was "encouraging" in 93% of both SIMPL-OR and OSATS entries, compared with 84% of EOR entries (p < 0.001). Feedback was "corrective" in 71% of SIMPL-OR versus 44% of OSATS versus 24% of EOR entries (p < 0.001), and "specific" in 97%, 53%, and 15%, respectively (p < 0.001).
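The abstract reports p-values for differences in label frequency across the three instruments but does not name the statistical test; a chi-square test of independence on the label counts is one plausible approach. The sketch below illustrates that kind of comparison only; the counts shown are placeholders, not the study's data.

```python
# Hypothetical sketch: compares how often a label (e.g., "specific") appears
# across the three instruments using a chi-square test of independence.
# The counts below are placeholders, NOT the study's data.
from scipy.stats import chi2_contingency

# rows: instrument (SIMPL-OR, OSATS, EOR); columns: label present, label absent
observed = [
    [194, 6],     # placeholder counts
    [106, 94],
    [30, 170],
]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
```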
Conclusion: The different instruments provided feedback of differing content and quality, underscoring the importance of a multimodal feedback approach.
Level of evidence: N/A.
Keywords: medical education; natural language processing; resident education.