In cross-cohort studies, integrating diverse datasets, such as electronic health records (EHRs), is both essential and challenging due to cohort-specific variations, distributed data storage, and data privacy concerns. Traditional methods often require data pooling or complex data harmonization, which can reduce efficiency and limit the scope of cross-cohort learning. We introduce mixWAS, a one-shot, lossless algorithm that efficiently integrates distributed EHR datasets via summary statistics. Unlike existing approaches, mixWAS preserves cohort-specific covariate associations and supports simultaneous mixed-outcome analyses. Simulations demonstrate that mixWAS outperforms conventional methods in accuracy and efficiency across various scenarios. Applied to EHR data from seven cohorts in the US, mixWAS identified 4,530 significant cross-cohort genetic associations among traits such as blood lipids, BMI, and circulatory diseases. Validation with an independent UK EHR dataset confirmed 97.7% of these associations, underscoring the algorithm's robustness. By enabling lossless cross-cohort integration, mixWAS improves the precision of multi-outcome analyses and expands the potential for actionable insights in healthcare research.
The bigger picture: Cross-cohort integration of electronic health record (EHR) datasets is critical for advancing genomic discovery but remains hindered by privacy concerns, cohort heterogeneity, and computational limitations. Traditional meta-analysis and federated methods either lose power or cannot fully model multiple mixed-outcome traits across distributed datasets. To address this, we developed mixWAS, a one-shot, lossless algorithm for integrating summary statistics across cohorts without sharing individual-level data. mixWAS simultaneously models binary and continuous outcomes, accounts for site-specific covariate heterogeneity, and requires only a single communication step between sites. Through extensive simulations and real data analyses, mixWAS consistently outperformed traditional phenome-wide association studies (PheWAS) and other multi-trait approaches in detecting multi-phenotype associations (MPAs). Beyond genetic applications, mixWAS offers a general framework for distributed analysis of mixed-outcome data, with broad potential across biomedicine, public health, and other fields requiring privacy-preserving data integration.
Highlights:
- mixWAS enables lossless, one-shot cross-cohort integration of summary statistics
- Simultaneously models binary and continuous outcomes across distributed datasets
- Outperforms PheWAS in detecting multi-phenotype associations (MPAs)
- Offers a general framework for distributed analysis of mixed-outcome data
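To make the idea of one-shot integration via summary statistics concrete, the following is a minimal sketch, not the mixWAS algorithm itself. It assumes a single variant and a single continuous outcome, uses a generic score-test aggregation, and does not cover the mixed binary/continuous case described above. The function names `site_summary` and `combine`, the simulated data, and the effect sizes are all hypothetical. Each site adjusts for its own covariates locally and sends only two numbers to the coordinator, which is the single communication step.

```python
import numpy as np

def site_summary(g, y, X):
    """Computed locally at one site: score U and information V for one variant.

    Covariate effects are estimated within the site, so they may differ
    across sites (an assumption of this sketch).
    """
    X1 = np.column_stack([np.ones(len(y)), X])        # add intercept
    beta_y, *_ = np.linalg.lstsq(X1, y, rcond=None)   # outcome ~ covariates
    beta_g, *_ = np.linalg.lstsq(X1, g, rcond=None)   # genotype ~ covariates
    ry = y - X1 @ beta_y                              # outcome residuals
    rg = g - X1 @ beta_g                              # genotype residuals
    sigma2 = ry @ ry / (len(y) - X1.shape[1])         # site-specific residual variance
    return rg @ ry / sigma2, rg @ rg / sigma2         # (U, V)

def combine(summaries):
    """Computed at the coordinator: sum scores and informations across sites.

    Returns a statistic that is approximately chi-square with 1 degree of
    freedom under the null of no association.
    """
    U = sum(u for u, _ in summaries)
    V = sum(v for _, v in summaries)
    return U**2 / V

# Hypothetical simulated data: three sites with different covariate effects.
rng = np.random.default_rng(0)
summaries = []
for shift, coef in [(0.0, 1.0), (2.0, -0.5), (-1.0, 0.3)]:
    n = 200
    X = rng.normal(size=(n, 2))
    g = rng.binomial(2, 0.3, size=n).astype(float)
    y = shift + X @ np.array([coef, 0.2]) + 0.5 * g + rng.normal(size=n)
    summaries.append(site_summary(g, y, X))           # only (U, V) leaves the site

print("combined score statistic:", combine(summaries))
```

Because only the pair (U, V) leaves each site, no individual-level records are shared, and the sites never need to exchange a second round of messages.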