Efficient storage and regression computation for population-scale genome sequencing studies

Bioinformatics. 2025 Mar 4;41(3):btaf067. doi: 10.1093/bioinformatics/btaf067.

Abstract

Motivation: The growing availability of large-scale population biobanks has the potential to significantly advance our understanding of human health and disease. However, the massive computational and storage demands of whole genome sequencing (WGS) data pose serious challenges, particularly for underfunded institutions or researchers in developing countries. This disparity in resources can limit equitable access to cutting-edge genetic research.

Results: We present novel algorithms and regression methods that dramatically reduce both computation time and storage requirements for WGS studies, with particular attention to rare variant representation. By integrating these approaches into PLINK 2.0, we demonstrate substantial gains in efficiency without compromising analytical accuracy. In an exome-wide association analysis of 19.4 million variants for the body mass index phenotype in 125 077 individuals (All of Us project data), we reduced runtime from 695.35 min (about 11.6 h) on a single machine to 1.57 min with 30 GB of memory and 50 threads (or 8.67 min with 4 threads). Additionally, the framework supports multi-phenotype analyses, further enhancing its flexibility.
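To put the reported runtimes in perspective, the short Python sketch below computes the implied speedup factors. The three runtime figures come from the abstract; the variable and function names are illustrative only. The abstract does not state the thread count or memory used for the 695.35 min baseline, so these ratios are best read as rough wall-clock comparisons rather than same-resource benchmarks.

```python
# Runtimes in minutes, as reported in the abstract.
BASELINE_MIN = 695.35                 # original run on a single machine
OPTIMIZED_MIN = {50: 1.57, 4: 8.67}   # thread count -> optimized runtime


def speedup(baseline: float, optimized: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline / optimized


# Speedup factor for each reported thread count.
speedups = {threads: speedup(BASELINE_MIN, minutes)
            for threads, minutes in OPTIMIZED_MIN.items()}

for threads, factor in speedups.items():
    print(f"{threads} threads: about {factor:.0f}x faster than the baseline")
```

Under these figures, the 50-thread run is roughly 443 times faster than the baseline and the 4-thread run roughly 80 times faster.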

Availability and implementation: Our optimized methods are fully integrated into PLINK 2.0 and can be accessed at: https://www.cog-genomics.org/plink/2.0/.

MeSH terms

  • Algorithms
  • Computational Biology / methods
  • Genome, Human*
  • Genome-Wide Association Study
  • Humans
  • Regression Analysis
  • Software
  • Whole Genome Sequencing* / methods