Impacts of sample weighting on transferability of risk prediction models across EHR-Linked biobanks with different recruitment strategies

J Biomed Inform. 2025 Jul:167:104853. doi: 10.1016/j.jbi.2025.104853. Epub 2025 May 19.

Abstract

Objective: To evaluate whether using poststratification (PS) weights when training risk prediction models enhances transferability when the external test cohort has a different sampling strategy, a scenario commonly encountered when analyzing electronic health record (EHR)-linked biobanks.

Methods: PS weights were calculated to align a health system-based biobank, the Michigan Genomics Initiative (MGI; n = 76,757), with a nationally recruited biobank, All of Us (AOU; n = 226,764), which oversamples underrepresented groups. Basic PS weights (PSBASIC) captured age, sex, and race/ethnicity; full PS weights (PSFULL) additionally included smoking, alcohol consumption, BMI, depression, hypertension, and the Charlson Comorbidity Index. Models for esophageal, liver, and pancreatic cancers were developed using EHR data from MGI at 0, 1, 2, and 5 years prior to diagnosis. Phenotype risk scores (PheRS) were constructed using six methods (e.g., regularized regression, random forest) and evaluated alongside covariates, risk factors, and symptoms. Evaluation metrics included the odds ratio (OR) for the top decile vs. the middle 40th-60th percentiles of the risk score distribution and the area under the receiver operating characteristic curve (AUC), both evaluated in the AOU test cohort for models trained with and without weighting.
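
As an illustration of the weighting step described above, the following is a minimal sketch of textbook cell-based poststratification: each record in the training cohort is weighted by the ratio of its cell's share in the target cohort to its share in the training cohort. This is an assumption about the general technique, not the authors' exact procedure; the function name `poststratification_weights`, the cell labels, and the toy cohorts are all hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_cells, target_cells):
    """Weight each sample record by target cell share / sample cell share.

    Both arguments are lists of hashable cell labels, one per person
    (for example a tuple of age band, sex, and race/ethnicity).
    """
    sample_counts = Counter(sample_cells)
    target_counts = Counter(target_cells)
    n_sample = len(sample_cells)
    n_target = len(target_cells)

    weights = []
    for cell in sample_cells:
        sample_share = sample_counts[cell] / n_sample
        # Cells absent from the target get weight 0 in this sketch.
        target_share = target_counts.get(cell, 0) / n_target
        weights.append(target_share / sample_share)
    return weights


# Toy, made-up cohorts (not MGI or AOU data).
training = [("40-59", "F")] * 6 + [("60+", "M")] * 4
target = [("40-59", "F")] * 3 + [("60+", "M")] * 7
weights = poststratification_weights(training, target)
```

In this toy example the over-represented cell is down-weighted and the under-represented cell is up-weighted, while the weights still sum to the training sample size. Weights of this kind would typically be supplied as per-observation weights when fitting the risk model.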

Results: Elastic net and random forest methods generally performed well in risk stratification, but no single PheRS construction method consistently outperformed others. Applying PS weights did not consistently improve risk stratification performance. For example, in liver cancer risk stratification at t = 1, unweighted random forest PheRS yielded an OR of 13.73 (95% CI: 8.97, 21.01), compared to 14.55 (95% CI: 9.45, 22.42) with PSBASIC and 13.62 (95% CI: 8.90, 20.85) with PSFULL.
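
To make the reported metric concrete, the sketch below shows one simple way an odds ratio comparing the top decile of a risk score with the middle 40th-60th percentiles, plus a Wald 95% confidence interval, could be computed from a 2x2 table. It assumes an unadjusted comparison, which may differ from the model-based estimates in the paper; the function name `top_decile_odds_ratio`, the 0.5 continuity correction for empty cells, and the toy data are hypothetical choices made for illustration.

```python
import math


def top_decile_odds_ratio(scores, cases, z=1.96):
    """Unadjusted OR (top 10% vs. 40th-60th percentile band) with Wald CI.

    scores: risk scores, one per person; cases: 0/1 outcome indicators.
    Returns (odds_ratio, ci_lower, ci_upper).
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    middle = order[(4 * n) // 10:(6 * n) // 10]  # 40th-60th percentile band
    top = order[(9 * n) // 10:]                  # top decile

    a = sum(cases[i] for i in top)      # cases in top decile
    b = len(top) - a                    # non-cases in top decile
    c = sum(cases[i] for i in middle)   # cases in middle band
    d = len(middle) - c                 # non-cases in middle band

    # Assumption: add 0.5 to every cell if any cell is empty,
    # so the ratio and its standard error stay finite.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))

    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Toy, made-up data: 100 people with scores 0..99.
toy_scores = list(range(100))
toy_cases = [0] * 100
for i in (40, 41, 90, 91, 92, 93, 94):
    toy_cases[i] = 1
odds_ratio, lower, upper = top_decile_odds_ratio(toy_scores, toy_cases)
```

Here the top decile holds 5 cases among 10 people and the middle band holds 2 cases among 20, so the odds ratio is (5 × 18) / (5 × 2) = 9.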

Conclusion: PS weights do not significantly enhance risk model transferability between biobanks. EHR-based PheRS are crucial for risk stratification and should be integrated with other multimodal data for improved risk prediction. Identifying high-risk populations for diseases like liver cancer early through health history mining shows promise.

Keywords: Biobank; Cancer; Electronic health records; Phenome; Risk prediction; Sampling weights; Selection bias; Variable selection.

MeSH terms

  • Adult
  • Aged
  • Biological Specimen Banks*
  • Electronic Health Records*
  • Female
  • Humans
  • Male
  • Middle Aged
  • Patient Selection*
  • Risk Assessment / methods
  • Risk Factors