Efficient storage and regression computation for population-scale genome sequencing studies

Manuel A Rivas; Christopher Chang

doi:10.1093/bioinformatics/btaf067

Bioinformatics. 2025 Mar 4;41(3):btaf067. doi: 10.1093/bioinformatics/btaf067.

Authors

Manuel A Rivas¹, Christopher Chang²

Affiliations

¹ Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States.
² Grail Inc., Menlo Park, CA 94025, United States.

Abstract

Motivation: The growing availability of large-scale population biobanks has the potential to significantly advance our understanding of human health and disease. However, the massive computational and storage demands of whole genome sequencing (WGS) data pose serious challenges, particularly for underfunded institutions or researchers in developing countries. This disparity in resources can limit equitable access to cutting-edge genetic research.

Results: We present novel algorithms and regression methods that dramatically reduce both computation time and storage requirements for WGS studies, with particular attention to rare variant representation. By integrating these approaches into PLINK 2.0, we demonstrate substantial gains in efficiency without compromising analytical accuracy. In an exome-wide association analysis of 19.4 million variants for the body mass index phenotype in 125 077 individuals (All of Us project data), we reduced runtime from 695.35 min (about 11.6 h) on a single machine to 1.57 min with 30 GB of memory and 50 threads (or 8.67 min with 4 threads). Additionally, the framework supports multi-phenotype analyses, further enhancing its flexibility.
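To put the reported runtimes in proportion, the short sketch below works out the implied speedups from the three figures quoted above. The function and variable names are ours, chosen for illustration. The abstract does not state how many threads the 695.35 min baseline used, so these ratios should be read as overall wall-clock improvements, not as a separation of algorithmic gains from parallelism.

```python
# Runtimes in minutes, copied from the Results paragraph of the abstract.
BASELINE_MIN = 695.35
OPTIMIZED_50_THREADS_MIN = 1.57
OPTIMIZED_4_THREADS_MIN = 8.67


def speedup(baseline_min: float, optimized_min: float) -> float:
    """Return how many times shorter the optimized runtime is than the baseline."""
    return baseline_min / optimized_min


# Convert the baseline to hours and compute both wall-clock ratios.
baseline_hours = BASELINE_MIN / 60
speedup_50 = speedup(BASELINE_MIN, OPTIMIZED_50_THREADS_MIN)
speedup_4 = speedup(BASELINE_MIN, OPTIMIZED_4_THREADS_MIN)

print(f"baseline:   {baseline_hours:.2f} h")
print(f"50 threads: about {speedup_50:.0f}x shorter runtime")
print(f"4 threads:  about {speedup_4:.0f}x shorter runtime")
```

Under these figures the 50-thread run is roughly 443 times faster than the baseline, and the 4-thread run roughly 80 times faster.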

Availability and implementation: Our optimized methods are fully integrated into PLINK 2.0 and can be accessed at: https://www.cog-genomics.org/plink/2.0/.
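As a rough illustration of how such an analysis might be launched once PLINK 2.0 is installed, the sketch below assembles a `plink2` command line in Python without executing it. The flags used (`--pfile`, `--pheno`, `--covar`, `--glm`, `--threads`, `--memory`, `--out`) are standard PLINK 2.0 options, but all file names are hypothetical placeholders and the settings are not taken from the paper; in particular, the 30000 MiB workspace is only our approximation of the 30 GB quoted in the abstract.

```python
import shlex


def build_glm_command(
    pfile_prefix: str,
    pheno_path: str,
    covar_path: str,
    out_prefix: str,
    threads: int = 4,
    memory_mib: int = 30000,
) -> list[str]:
    """Assemble (but do not run) a plink2 association-test command line.

    All paths are hypothetical placeholders supplied by the caller.
    """
    return [
        "plink2",
        "--pfile", pfile_prefix,       # prefix of the .pgen/.pvar/.psam fileset
        "--pheno", pheno_path,         # phenotype file; may hold several phenotype columns
        "--covar", covar_path,         # covariate file
        "--glm", "hide-covar",         # regression per variant, omitting covariate rows from output
        "--threads", str(threads),     # cap on worker threads
        "--memory", str(memory_mib),   # workspace size in MiB
        "--out", out_prefix,           # prefix for result files
    ]


# Hypothetical inputs, for illustration only.
argv = build_glm_command("cohort", "bmi.pheno", "covariates.txt", "bmi_assoc")
print(shlex.join(argv))
```

Building the argument list separately from execution keeps the example inspectable; passing `argv` to `subprocess.run` would launch the job on a machine where `plink2` and the input files exist.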

MeSH terms

Algorithms
Computational Biology / methods
Genome, Human*
Genome-Wide Association Study
Humans
Regression Analysis
Software
Whole Genome Sequencing* / methods
