Rapid advances in the depth and throughput of untargeted mass-spectrometry-based proteomic technologies enable large-scale cohort proteomic and proteogenomic analyses. As such, the data infrastructure and search engines required to process these data must also scale. This challenge is amplified for search engines that rely on library-free match-between-runs (MBR) searching, which enables enhanced depth per sample and data completeness. However, to date, no MBR-based search has been able to scale to cohorts of thousands or more individuals. Here, we present a strategy to deploy search engines in a distributed cloud environment without source code modification, thereby enhancing resource scalability and throughput. Additionally, we present an algorithm, Scalable MBR, that replicates the MBR procedure of the popular DIA-NN software and scales to thousands of samples. We demonstrate that Scalable MBR can search thousands of MS raw files in a few hours, compared with the days required by the original DIA-NN MBR procedure, and that its results are almost indistinguishable from those of DIA-NN native MBR. We additionally show that the empirical spectra generated by Scalable MBR approximate DIA-NN native MBR more closely than semiempirical alternatives such as ID-RT-IM MBR, preserving the user's choice to use empirical libraries in large-cohort analysis. The method has been tested to scale to over 15,000 injections and is available in the Proteograph Analysis Suite.
Keywords: bioinformatics; cloud search; computational biology; data-independent acquisition; mass spectrometry; match-between-runs; population proteomics; proteograph; proteomics; scalable computing.