Electronic Health Records-based identification of newly diagnosed Crohn's Disease cases

Artif Intell Med. 2025 Jan:159:103032. doi: 10.1016/j.artmed.2024.103032. Epub 2024 Nov 21.

Abstract

Background: Early diagnosis and treatment of Crohn's Disease are associated with decreased risk of surgery and complications. However, diagnostic delay is frequently seen in clinical practice. To better understand Crohn's Disease risk factors and disease indicators, we identified, described, and predicted incident Crohn's Disease patients based on the Electronic Health Record data of the Mount Sinai Health System.

Methods: We developed two phenotyping algorithms based on structured Electronic Health Record data (i.e., coded diagnoses, medication prescriptions, and healthcare utilization) and on a simpler and a more advanced approach to information extraction from clinical notes, using data from 2011 to 2023. We conducted an ablation study for the classification task, varying the models, prediction time points, data inputs, text encoding methods, and case-control matching variables.
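To make the role of the case-control matching variables concrete, the following is a minimal sketch of exact matching without replacement. It is not the study's procedure: the field names (`sex`, `age_band`, `race`), the placeholder category labels, the 2:1 ratio in the toy data, and the function `match_controls` are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def match_controls(cases, pool, keys, ratio=5, seed=0):
    """Draw up to `ratio` unused controls per case that agree on every field in `keys`."""
    rng = random.Random(seed)
    # Group candidate controls into strata defined by the matching variables.
    strata = defaultdict(list)
    for patient in pool:
        strata[tuple(patient[k] for k in keys)].append(patient)
    for members in strata.values():
        rng.shuffle(members)
    matched = {}
    for case in cases:
        candidates = strata[tuple(case[k] for k in keys)]
        # pop() removes the control from its stratum, so no control is reused.
        matched[case["id"]] = [candidates.pop() for _ in range(min(ratio, len(candidates)))]
    return matched


# Toy data (hypothetical): one case and three candidate controls.
cases = [{"id": "c1", "sex": "F", "age_band": "30-39", "race": "A"}]
pool = [
    {"id": "p1", "sex": "F", "age_band": "30-39", "race": "A"},
    {"id": "p2", "sex": "F", "age_band": "30-39", "race": "B"},
    {"id": "p3", "sex": "M", "age_band": "30-39", "race": "A"},
]
with_race = match_controls(cases, pool, keys=("sex", "age_band", "race"), ratio=2)
without_race = match_controls(cases, pool, keys=("sex", "age_band"), ratio=2)
```

Dropping one key from `keys` widens each stratum, so the same case can draw a different set of controls. This is the kind of change the ablation over matching variables examines.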

Results: We identified 247 incident Crohn's Disease cases and 1221 matched controls and validated our cohorts through manual chart review. A second control cohort (n = 1235) was created without matching on race. Gastrointestinal symptoms were significantly overrepresented in cases at least 180 days before the first coded Crohn's Disease diagnosis. Adding text-based features to the clinical prediction models improved their overall performance. However, adding race as a matching variable had a greater effect on model performance than the choice of modeling algorithm or input data, with a difference of 0.09 in area under the receiver operating characteristic curve between the best-performing models.
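As a reminder of how an area under the receiver operating characteristic curve and a difference between two models can be computed, here is a minimal sketch using the pairwise (Mann-Whitney) formulation. The labels and scores are toy values, not study data, and the helper `auroc` is a hypothetical name rather than anything taken from the paper.

```python
def auroc(labels, scores):
    """Probability that a random case outscores a random control (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = case, 0 = control; two hypothetical models score the same patients.
labels = [1, 1, 0, 0, 0]
scores_a = [0.9, 0.4, 0.5, 0.3, 0.2]
scores_b = [0.9, 0.6, 0.5, 0.3, 0.2]
difference = auroc(labels, scores_b) - auroc(labels, scores_a)
```

The pairwise form is quadratic in cohort size, which is acceptable for a sketch; a rank-based implementation would be preferable for large cohorts.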

Conclusion: We demonstrate the feasibility of identifying newly diagnosed Crohn's Disease patients within a United States health system using Electronic Health Records. For the predictive modeling task, cases and controls were distinguished with only modest performance, even though various state-of-the-art methods were applied to features from structured and unstructured data. Our findings suggest a benefit from adding information from clinical notes, in a supervised or unsupervised manner, for cohort creation and predictive modeling.

Keywords: Clinical predictive modeling; Crohn’s disease; Diagnostic delay; Electronic health records.

MeSH terms

  • Adult
  • Algorithms*
  • Case-Control Studies
  • Crohn Disease* / diagnosis
  • Electronic Health Records*
  • Female
  • Humans
  • Male
  • Middle Aged
  • Risk Factors
  • Young Adult