Estimating depression severity in narrative clinical notes using large language models

J Affect Disord. 2025 Jul 15:381:270-274. doi: 10.1016/j.jad.2025.04.014. Epub 2025 Apr 3.

Abstract

Background: Depression treatment guidelines emphasize measurement-based care using patient-reported outcome measures, yet their impact on narrative documentation quality remains underexplored.

Methods: We sampled 15,000 narrative clinical outpatient notes from the electronic health record of a large academic medical center, reflecting visits between January 2, 2019 and January 30, 2024, for which a 9-item Patient Health Questionnaire (PHQ-9) was completed at the same time. After censoring PHQ-9 scores from notes, we estimated severity of depressive symptoms with a foundational large language model (gpt4o-08-06) in a HIPAA-compliant enclave. We estimated correlation between true PHQ-9 and model-estimated score and examined the predictive performance of the model for moderate or greater depressive symptoms.
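
The two evaluation quantities named in the Methods can be illustrated with a minimal sketch. It is not the authors' code. It assumes the conventional PHQ-9 cutoff of 10 for moderate or greater symptoms (the abstract does not state the threshold used), and the function names and toy scores are hypothetical.

```python
# Assumption: PHQ-9 >= 10 marks "moderate or greater" symptoms (conventional cutoff).
MODERATE_CUTOFF = 10


def r_squared(actual, predicted):
    """Squared Pearson correlation between two equal-length score lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_a * var_p)


def ppv(actual, predicted, cutoff=MODERATE_CUTOFF):
    """Share of model-flagged notes whose measured score also meets the cutoff."""
    flagged = [a for a, p in zip(actual, predicted) if p >= cutoff]
    if not flagged:
        return None  # PPV is undefined when the model flags nothing
    return sum(a >= cutoff for a in flagged) / len(flagged)


# Toy scores for illustration only; these are not study data.
actual_scores = [0, 2, 12, 15, 4, 1]
model_scores = [1, 11, 13, 9, 10, 0]

print("r^2:", r_squared(actual_scores, model_scores))
print("PPV:", ppv(actual_scores, model_scores))
```

With a low prevalence of moderate or greater symptoms, as in this sample, PPV is the natural summary of how often a model flag corresponds to a truly elevated score.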

Results: Mean age was 46.3 years (SD 14.9); 9083 (60.6%) identified as female. By race, 925 (6.2%) identified as Asian, 638 (4.3%) as Black, 853 (5.7%) as another race, and 12,187 (81.2%) as White. A total of 1044 (7.0%) identified as Hispanic ethnicity, while 12,699 (84.7%) were non-Hispanic. Mean measured PHQ-9 score was 1.23 (SD 3.45); 721 (4.8%) met criteria for moderate or greater depressive symptoms. LLM-predicted PHQ-9 scores were modestly correlated with actual scores (r² = 0.264; 95% CI 0.252-0.276); positive predictive value (PPV) for moderate or greater depression was 0.309 (95% CI 0.302-0.317). Performance was consistent across demographic subgroups, with modest differences identified by race, ethnicity, and sex.
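
The abstract does not say how the 95% confidence intervals were obtained. One common option is a percentile bootstrap over notes, sketched below as an assumption rather than a description of the study's method. The cutoff of 10, the helper names, and the toy scores are all hypothetical.

```python
import random


def ppv(actual, predicted, cutoff=10):
    """PPV at an assumed PHQ-9 cutoff; None when no note is flagged."""
    flagged = [a for a, p in zip(actual, predicted) if p >= cutoff]
    if not flagged:
        return None
    return sum(a >= cutoff for a in flagged) / len(flagged)


def bootstrap_ci(actual, predicted, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat, resampling notes with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(actual)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = stat([actual[i] for i in idx], [predicted[i] for i in idx])
        if value is not None:  # skip resamples where the statistic is undefined
            values.append(value)
    if not values:
        raise ValueError("statistic undefined in every resample")
    values.sort()
    lower = values[int((alpha / 2) * len(values))]
    upper = values[min(len(values) - 1, int((1 - alpha / 2) * len(values)))]
    return lower, upper


# Toy scores for illustration only; these are not study data.
actual_scores = [12, 3, 15, 1, 11, 0, 2, 14, 4, 10]
model_scores = [13, 11, 9, 0, 12, 10, 1, 15, 12, 8]

point = ppv(actual_scores, model_scores)
low, high = bootstrap_ci(actual_scores, model_scores, ppv)
print(f"PPV {point:.3f} (95% CI {low:.3f}-{high:.3f})")
```

The same helper accepts any statistic of the paired scores, so an interval for the squared correlation could be obtained by passing a different function as `stat`.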

Conclusion: A foundational LLM performed poorly but consistently across subgroups in imputing PHQ-9 scores from notes when actual PHQ-9 reporting was ablated. This result suggests the extent to which inclusion of patient-reported outcome measures (PROMs) may impoverish narrative documentation of psychiatric symptoms.

Keywords: Artificial intelligence; Depression severity estimation; Large language models (LLMs); Machine learning; Predictive modeling.

MeSH terms

  • Adult
  • Aged
  • Depression* / diagnosis
  • Electronic Health Records* / statistics & numerical data
  • Female
  • Humans
  • Large Language Models
  • Male
  • Middle Aged
  • Narration*
  • Patient Health Questionnaire
  • Severity of Illness Index