Machine learning enabled protein secondary structure characterization using drop-coating deposition Raman spectroscopy

J Pharm Biomed Anal. 2025 Jul 15:259:116762. doi: 10.1016/j.jpba.2025.116762. Epub 2025 Feb 25.

Abstract

Protein structure characterization is critical for therapeutic protein drug development and production. Drop-coating deposition Raman (DCDR) spectroscopy offers rapid and cost-effective acquisition of vibrational spectral data characteristic of protein secondary structures. The amide I region (1600-1700 cm⁻¹) and amide II region (1500-1600 cm⁻¹) of DCDR spectra measured for model proteins of varying molecular size and structural distribution were first analyzed by peak fitting to determine their proportions of six secondary structure motifs: α-helices, 3₁₀-helices, β-sheets, turns (β-turns and γ-turns), bends, and random coil. The high spectral resolution and superior signal-to-noise ratio of DCDR spectra made it possible to estimate all six structural motifs with accuracy comparable to X-ray crystallographic measurements. The ease of DCDR measurement was further explored by introducing a machine learning algorithm into the spectroscopic data analysis. Partial Least Squares (PLS) regression modeling was used as a machine learning tool to predict the protein secondary structural composition from the amide I band of model proteins. Once developed on a training sample set, the PLS model was tested on a sample set that had not been used for model development. Low prediction errors of 1.36 %, 0.78 %, 0.42 %, 0.41 %, 0.81 %, and 0.52 % were achieved for the six structural components: α-helix, β-sheet, 3₁₀-helix, random coil, turns, and bends, respectively. The PLS model was further tested on an independent sample set containing three IgG proteins. The proportions of α-helix, β-sheet, and 3₁₀-helix were estimated with errors of 3.1 %, 2.3 %, and 2.8 %, respectively.
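The train-then-test PLS workflow described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single-response PLS (PLS1) fitted with the NIPALS algorithm, and it uses synthetic amide I spectra built from two Gaussian bands. The function names (`gaussian_band`, `fit_pls1`, `predict_pls1`), the band centers (1655 and 1670 cm⁻¹), the band width, and the training fractions are all hypothetical choices made for illustration only.

```python
import math

def gaussian_band(center, width, axis):
    """Unit-height Gaussian band sampled on a wavenumber axis."""
    return [math.exp(-0.5 * ((x - center) / width) ** 2) for x in axis]

def fit_pls1(X, y, n_components, tol=1e-12):
    """Fit single-response PLS (PLS1) with NIPALS on mean-centered data."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # X residual
    f = [v - y_mean for v in y]                                # y residual
    comps = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the y residual.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < tol:  # nothing left to explain, stop early
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y by the part explained by this component.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps

def predict_pls1(model, x):
    """Predict the response for one spectrum by sequential deflation."""
    x_mean, y_mean, comps = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat

# Synthetic amide I spectra (assumed band positions, illustration only).
axis = list(range(1600, 1701, 2))
helix = gaussian_band(1655, 8.0, axis)
sheet = gaussian_band(1670, 8.0, axis)

def mixture(frac_helix):
    return [frac_helix * h + (1 - frac_helix) * s for h, s in zip(helix, sheet)]

train_fracs = [0.1, 0.3, 0.5, 0.7, 0.9]
model = fit_pls1([mixture(fr) for fr in train_fracs], train_fracs, n_components=1)

true_frac = 0.35  # held-out sample, not used for training
predicted = predict_pls1(model, mixture(true_frac))
error_pct = abs(predicted - true_frac) * 100  # error in percentage points
print(f"predicted helix fraction: {predicted:.4f}, error: {error_pct:.4f} %")
```

Real spectra contain noise, baseline drift, and overlapping contributions from all six motifs, so a practical model would need several components, one model (or a multi-response PLS2 model) per motif, and cross-validation to choose the component count.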

Keywords: Drop-coating deposition; Machine learning; Partial least squares; Peak fitting; Protein secondary structure; Raman spectroscopy.

MeSH terms

  • Algorithms
  • Least-Squares Analysis
  • Machine Learning*
  • Protein Structure, Secondary
  • Proteins* / chemistry
  • Spectrum Analysis, Raman* / methods

Substances

  • Proteins