Scaling Structure Aware Virtual Screening to Billions of Molecules with SPRINT

Andrew T McNutt; Abhinav K Adduri; Caleb N Ellington; Monica T Dayao; Eric P Xing; Hosein Mohimani; David R Koes

ArXiv [Preprint]. 2025 Jan 20:arXiv:2411.15418v2.

Authors

Andrew T McNutt¹, Abhinav K Adduri², Caleb N Ellington², Monica T Dayao², Eric P Xing²,³,⁴, Hosein Mohimani², David R Koes¹

Affiliations

¹ Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA.
² Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA.
³ Mohamed Bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi.
⁴ Petuum Inc., Pittsburgh, PA.

PMID: 39975427
PMCID: PMC11838698

Abstract

Virtual screening of small molecules against protein targets can accelerate drug discovery and development by predicting drug-target interactions (DTIs). However, structure-based methods such as molecular docking are too slow to allow for broad proteome-scale screens, limiting their application in screening for off-target effects or new molecular mechanisms. Recently, vector-based methods using protein language models (PLMs) have emerged as a complementary approach that bypasses explicit 3D structure modeling. Here, we develop SPRINT, a vector-based approach for screening entire chemical libraries against whole proteomes for DTIs and novel mechanisms of action. SPRINT improves on prior work by using a self-attention-based architecture and structure-aware PLMs to learn a co-embedding space for drugs and targets, enabling efficient binder prediction, search, and retrieval. SPRINT achieves state-of-the-art enrichment factors in virtual screening on LIT-PCBA, DTI classification benchmarks, and binding affinity prediction benchmarks, while providing interpretability in the form of residue-level attention maps. In addition to being both accurate and interpretable, SPRINT is ultra-fast: querying the whole human proteome against the ENAMINE Real Database (6.7B drugs) for the 100 most likely binders per protein takes 16 minutes. SPRINT promises to enable virtual screening at an unprecedented scale, opening up new opportunities for in silico drug repurposing and development. SPRINT is available on the web as ColabScreen: https://bit.ly/colab-screen.
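
The retrieval step described in the abstract (finding the most likely binders per protein in a shared embedding space) can be illustrated with a minimal brute-force sketch. This is not SPRINT's implementation: the function name `top_k_binders`, the use of cosine similarity as the score, the embedding dimension, and the random vectors standing in for learned embeddings are all assumptions made for illustration.

```python
import numpy as np


def top_k_binders(target_vecs, drug_vecs, k):
    """Return (indices, scores) of the k highest-scoring drugs per target.

    Hypothetical helper: scores are cosine similarities between rows of
    target_vecs (n_targets, dim) and drug_vecs (n_drugs, dim).
    """
    # Normalize rows so the dot product equals cosine similarity.
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    d = drug_vecs / np.linalg.norm(drug_vecs, axis=1, keepdims=True)
    scores = t @ d.T  # shape: (n_targets, n_drugs)

    # Select the k best columns per row without sorting the full row.
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    part_scores = np.take_along_axis(scores, part, axis=1)

    # Sort only the k selected candidates, best first.
    order = np.argsort(-part_scores, axis=1)
    idx = np.take_along_axis(part, order, axis=1)
    return idx, np.take_along_axis(scores, idx, axis=1)


# Random placeholders standing in for learned embeddings (assumption).
rng = np.random.default_rng(0)
targets = rng.normal(size=(5, 64))    # 5 hypothetical protein embeddings
drugs = rng.normal(size=(1000, 64))   # 1000 hypothetical drug embeddings

top_idx, top_scores = top_k_binders(targets, drugs, k=10)
print(top_idx.shape)  # (5, 10)
```

Because scoring reduces to a matrix product followed by a partial sort, the drug library can be embedded once and reused for every target. A library with billions of entries would not fit in a single score matrix, so a practical version would presumably process the library in chunks or use an approximate nearest-neighbor index.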

Keywords: protein language models; structure-based drug discovery; virtual screening.

Publication types

Preprint
