PLMSearch and PLMAlign: Protein Language Model (PLM)-Based Homologous Protein Sequence Search and Alignment

Methods Mol Biol. 2025:2941:227-241. doi: 10.1007/978-1-0716-4623-6_14.

Abstract

Homologous protein search and sequence alignment are fundamental tasks in computational biology. While structure search methods can provide valuable insights, inferring distant evolutionary relationships from sequences alone remains challenging. Here, we introduce PLMSearch and PLMAlign, two protein language model (PLM)-based tools for searching and aligning remote homologous sequences. PLMSearch is a homologous protein search method that takes only sequences as input. It leverages deep representations from a pretrained protein language model and trains its similarity prediction model on a large number of real structure similarities, enabling it to detect remote homologies that are not apparent from the sequences themselves. PLMAlign is a protein sequence alignment tool designed to accurately align the identified homologous pairs. We offer separate Web servers for PLMSearch and PLMAlign, freely available at https://dmiip.sjtu.edu.cn/PLMSearch and https://dmiip.sjtu.edu.cn/PLMAlign, respectively. We demonstrate how to operate the Web servers and interpret their outputs. Furthermore, we offer a wide range of free and open resources, including open-source code, datasets, and containers, all aimed at advancing scientific research. We also demonstrate how these resources can be used to run PLMSearch and PLMAlign either locally or in the cloud.
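
To make the idea of embedding-based search concrete, the following is a minimal sketch, not the authors' implementation. It assumes each protein has already been converted to a fixed-length embedding vector by some protein language model, and it ranks targets by plain cosine similarity as a stand-in for PLMSearch's trained similarity prediction model, which the abstract does not specify. The names `cosine_similarity`, `rank_targets`, and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_targets(query_embedding, target_embeddings, top_k=2):
    """Return the top_k (name, score) pairs, highest similarity first.

    Assumption: cosine similarity stands in for a learned similarity
    predictor; a real system would score pairs with a trained model.
    """
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in target_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (hypothetical; real PLM embeddings
# typically have hundreds or thousands of dimensions).
toy_targets = {
    "target_A": [0.9, 0.1, 0.0],
    "target_B": [0.0, 1.0, 0.0],
    "target_C": [0.7, 0.7, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

hits = rank_targets(toy_query, toy_targets, top_k=2)
for name, score in hits:
    print(f"{name}\t{score:.3f}")
```

In this toy setting the query ranks `target_A` first and `target_C` second. Pairs retrieved this way would then be passed to an alignment step, which is the role the abstract assigns to PLMAlign.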

Keywords: Homologous protein search; Large-scale vector database search; Protein language model; Protein sequence alignment; Protein structure similarity prediction.

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Computational Biology* / methods
  • Databases, Protein
  • Internet
  • Proteins* / chemistry
  • Proteins* / genetics
  • Sequence Alignment* / methods
  • Sequence Analysis, Protein* / methods
  • Sequence Homology, Amino Acid
  • Software*

Substances

  • Proteins