Rapid advances in the depth and throughput of untargeted mass-spectrometry-based proteomic technologies enable large-scale cohort proteomic and proteogenomic analyses. As such, the data infrastructure and search engines required to process these data must also scale. This challenge is amplified for search engines that rely on library-free match-between-runs (MBR) searching, which enables enhanced depth per sample and data completeness. However, to date, no MBR-based search has been able to scale to cohorts of thousands or more individuals. Here, we present a strategy to deploy search engines in a distributed cloud environment without source code modification, thereby enhancing resource scalability and throughput. Additionally, we present an algorithm, Scalable MBR, that replicates the MBR procedure of the popular DIA-NN software and scales to thousands of samples. We demonstrate that Scalable MBR can search thousands of MS raw files in a few hours, compared with the days required by the original DIA-NN MBR procedure, and that its results are almost indistinguishable from those of DIA-NN native MBR. We additionally show that the empirical spectra generated by Scalable MBR approximate DIA-NN native MBR more closely than semiempirical alternatives such as ID-RT-IM MBR, preserving the user's choice to use empirical libraries in large-cohort analysis. The method has been tested to scale to over 15,000 injections and is available in the Proteograph Analysis Suite.
Keywords: bioinformatics; cloud search; computational biology; data-independent acquisition; mass spectrometry; match-between-runs; population proteomics; proteograph; proteomics; scalable computing.