Sparse Bernoulli mixture modeling with negative-unlabeled data: an approach to identify and characterize long COVID

Biometrics. 2025 Jan 7;81(1):ujaf021. doi: 10.1093/biomtc/ujaf021.

Abstract

SARS-CoV-2-infected individuals have reported a diverse collection of persistent and often debilitating symptoms commonly referred to as long COVID or post-acute sequelae of SARS-CoV-2 (PASC). Identifying PASC and its subphenotypes is challenging because available data are "negative-unlabeled": uninfected individuals must be PASC negative, but those with prior infection have unknown PASC status. Moreover, feature selection among many potentially informative characteristics can facilitate reaching a concise and easily interpretable PASC definition. Therefore, to characterize PASC and the spectrum of PASC subphenotypes while identifying a minimal set of features, we propose a Bernoulli mixture model with a novel parameterization to accommodate negative-unlabeled data and Bayesian priors to induce sparsity. We present an efficient expectation-maximization algorithm for estimation, and a grid search procedure to select the number of clusters and level of sparsity. We evaluate the proposed method with a simulation study and an analysis of data on self-reported symptoms from the ongoing Researching COVID to Enhance Recovery-Adult Cohort study.
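
To make the negative-unlabeled setup concrete, the following is a minimal sketch of an expectation-maximization fit for a plain Bernoulli mixture in which uninfected individuals are pinned to a single "negative" component and infected individuals are left unlabeled. It is not the parameterization proposed in the paper, which the abstract does not spell out: the sparsity-inducing priors and the grid search over the number of clusters and the sparsity level are omitted, and the function name `fit_nu_bernoulli_mixture`, its arguments, and the choice of component 0 as the negative class are all assumptions made for illustration.

```python
import numpy as np


def fit_nu_bernoulli_mixture(X, infected, n_clusters=3, n_iter=200, seed=0, eps=1e-6):
    """EM for a Bernoulli mixture with negative-unlabeled data (illustrative only).

    X        : (n, p) binary symptom matrix.
    infected : (n,) boolean; False rows are known negatives, True rows are unlabeled.
    Component 0 plays the role of the negative class (an assumption of this sketch).
    Assumes at least one infected row.
    """
    X = np.asarray(X, dtype=float)
    infected = np.asarray(infected, dtype=bool)
    n, p = X.shape
    rng = np.random.default_rng(seed)

    # Random start for symptom probabilities; seed component 0 from known negatives.
    theta = rng.uniform(0.25, 0.75, size=(n_clusters, p))
    if (~infected).any():
        theta[0] = X[~infected].mean(axis=0)
    theta = np.clip(theta, eps, 1 - eps)
    pi = np.full(n_clusters, 1.0 / n_clusters)  # mixing weights among the infected

    for _ in range(n_iter):
        # E-step: posterior component membership for every row.
        log_lik = X @ np.log(theta).T + (1 - X) @ np.log1p(-theta).T  # (n, K)
        log_post = log_lik + np.log(pi)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # Known negatives are assigned to component 0 with certainty.
        resp[~infected] = 0.0
        resp[~infected, 0] = 1.0

        # M-step: mixing weights use only unlabeled (infected) rows;
        # symptom probabilities use all rows, weighted by membership.
        pi = np.clip(resp[infected].mean(axis=0), eps, None)
        pi /= pi.sum()
        theta = (resp.T @ X + eps) / (resp.sum(axis=0)[:, None] + 2 * eps)
        theta = np.clip(theta, eps, 1 - eps)

    return pi, theta, resp
```

In this sketch, `pi[0]` is the estimated share of infected individuals who resemble the uninfected group, and the remaining rows of `theta` give per-symptom probabilities for candidate subphenotypes. A sparse variant would add a penalty or prior in the M-step that pulls a component's symptom probabilities toward those of the negative component, which is the kind of feature selection the abstract describes.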

Keywords: Bernoulli mixture model; SARS-CoV-2; feature selection; long COVID; negative-unlabeled data; sparse prior.

MeSH terms

  • Algorithms
  • Bayes Theorem
  • COVID-19* / complications
  • COVID-19* / epidemiology
  • Computer Simulation
  • Humans
  • Models, Statistical*
  • Post-Acute COVID-19 Syndrome
  • SARS-CoV-2

Grants and funding