A large-scale curated and filterable dataset for cryo-EM foundation model pre-training

Sci Data. 2025 Jun 7;12(1):960. doi: 10.1038/s41597-025-05179-2.

Abstract

Cryo-electron microscopy (cryo-EM) is a transformative imaging technology that enables near-atomic resolution 3D reconstruction of target biomolecules, playing a critical role in structural biology and drug discovery. However, cryo-EM data have an extremely low signal-to-noise ratio (SNR), which makes data processing particularly complex. Foundation models, which have shown great potential in other biological imaging domains, could help address this challenge, but their application in cryo-EM has been limited by the lack of large-scale, high-quality datasets. To fill this gap, we introduce CryoCRAB, the first large-scale dataset for cryo-EM foundation models. CryoCRAB covers 746 proteins and comprises 152,385 sets of raw movie frames (116.8 TB in total). To tackle the high-noise nature of cryo-EM data, each movie is split into odd and even frames to generate paired micrographs for denoising tasks. The dataset is stored in HDF5 chunked format, which significantly improves random sampling efficiency and training speed. CryoCRAB offers diverse data support for cryo-EM foundation models, enabling advancements in image denoising and general-purpose feature extraction for downstream tasks.
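
The odd/even split described in the abstract can be illustrated with a minimal sketch. It is not the CryoCRAB pipeline: the function name `split_odd_even`, the synthetic movie, and the plain summation of frames are assumptions made for illustration (a real workflow would typically apply motion correction before summing). The idea is that the two half-sums share the same underlying signal but carry independent noise, which is what makes them usable as a training pair for denoising.

```python
import numpy as np

def split_odd_even(movie):
    """Sum the odd- and even-numbered frames of a movie separately.

    movie: array of shape (n_frames, height, width).
    Returns two (height, width) micrographs. Frames are counted from 1,
    so "odd" means array indices 0, 2, 4, ...
    NOTE: hypothetical helper; frame alignment is deliberately omitted.
    """
    movie = np.asarray(movie, dtype=np.float32)
    odd = movie[0::2].sum(axis=0)   # frames 1, 3, 5, ...
    even = movie[1::2].sum(axis=0)  # frames 2, 4, 6, ...
    return odd, even

# Synthetic stand-in for a raw movie: a fixed signal plus heavy per-frame noise.
rng = np.random.default_rng(0)
signal = rng.random((64, 64)).astype(np.float32)
noise = rng.normal(0.0, 5.0, size=(40, 64, 64)).astype(np.float32)
movie = signal + noise  # shape (40, 64, 64)

odd, even = split_odd_even(movie)
```

The two outputs together contain every frame exactly once, so their sum equals the full-movie sum; each half alone is noisier than the full micrograph, which is the trade-off this pairing accepts.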

Publication types

  • Dataset

MeSH terms

  • Cryoelectron Microscopy*
  • Image Processing, Computer-Assisted
  • Imaging, Three-Dimensional
  • Proteins
  • Signal-To-Noise Ratio

Substances

  • Proteins