Individuals infected with SARS-CoV-2 have reported a diverse collection of persistent and often debilitating symptoms, commonly referred to as long COVID or post-acute sequelae of SARS-CoV-2 (PASC). Identifying PASC and its subphenotypes is challenging because the available data are "negative-unlabeled": uninfected individuals must be PASC negative, but those with prior infection have unknown PASC status. Moreover, selecting features from among many potentially informative characteristics can help yield a concise and easily interpretable PASC definition. Therefore, to characterize PASC and the spectrum of PASC subphenotypes while identifying a minimal set of features, we propose a Bernoulli mixture model with a novel parameterization to accommodate negative-unlabeled data and Bayesian priors to induce sparsity. We present an efficient expectation-maximization algorithm for estimation and a grid search procedure to select the number of clusters and the level of sparsity. We evaluate the proposed method with a simulation study and an analysis of self-reported symptom data from the ongoing Researching COVID to Enhance Recovery-Adult Cohort study.
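To make the negative-unlabeled setting concrete, the following is a minimal illustrative sketch of an expectation-maximization fit for a Bernoulli mixture in which known-negative rows are clamped to one cluster. It is not the parameterization proposed in the paper: it uses the standard device of fixing responsibilities for labeled rows, and it omits the sparsity-inducing priors and the grid search. The function name `fit_nu_bernoulli_mixture`, its arguments, and the convention that cluster 0 is the PASC-negative cluster are all assumptions made for illustration.

```python
import numpy as np


def fit_nu_bernoulli_mixture(X, known_negative, n_clusters=3, n_iter=200, seed=0):
    """Illustrative EM for a Bernoulli mixture with negative-unlabeled rows.

    X              : (n, p) binary symptom matrix.
    known_negative : (n,) boolean mask; True rows (e.g. uninfected) are
                     forced into cluster 0, the assumed "negative" cluster.
    Assumes at least one row is unlabeled (known_negative is not all True).
    Returns (pi, theta, resp): mixing weights among unlabeled rows,
    per-cluster symptom probabilities (K, p), and responsibilities (n, K).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    known_negative = np.asarray(known_negative, dtype=bool)
    n, p = X.shape
    eps = 1e-6  # keeps probabilities away from 0 and 1 so logs stay finite

    theta = rng.uniform(0.25, 0.75, size=(n_clusters, p))
    pi = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # E-step: posterior cluster probabilities for every row.
        log_lik = X @ np.log(theta).T + (1.0 - X) @ np.log1p(-theta).T
        log_post = log_lik + np.log(pi)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # Known negatives are not latent: clamp them to cluster 0.
        resp[known_negative] = 0.0
        resp[known_negative, 0] = 1.0

        # M-step: mixing weights use unlabeled rows only; symptom
        # probabilities use all rows, weighted by responsibility.
        pi = np.clip(resp[~known_negative].mean(axis=0), eps, None)
        pi /= pi.sum()
        weights = resp.sum(axis=0)
        theta = (resp.T @ X) / np.maximum(weights[:, None], eps)
        theta = np.clip(theta, eps, 1.0 - eps)

    return pi, theta, resp


if __name__ == "__main__":
    # Synthetic example (fabricated for illustration, not study data).
    rng = np.random.default_rng(1)
    low = rng.random((60, 5)) < 0.05   # rows with rare symptoms
    high = rng.random((40, 5)) < 0.80  # rows with frequent symptoms
    X = np.vstack([low, high]).astype(int)
    known_negative = np.zeros(100, dtype=bool)
    known_negative[:30] = True         # first 30 rows play the uninfected
    pi, theta, resp = fit_nu_bernoulli_mixture(X, known_negative, n_clusters=2)
    print(np.round(pi, 3))
    print(np.round(theta, 3))
```

In this sketch the known-negative rows anchor cluster 0, which gives that cluster a fixed interpretation and avoids the label switching that affects unconstrained mixture models. Unlabeled rows can still be assigned to cluster 0, reflecting infected individuals without PASC.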
Keywords: Bernoulli mixture model; SARS-CoV-2; feature selection; long COVID; negative-unlabeled data; sparse prior.
© The Author(s) 2025. Published by Oxford University Press on behalf of The International Biometric Society.