iALP: Identification of Allergenic Proteins Based on Large Language Model and Gate Linear Unit

Bing Zhang; Jianping Zhao; Yannan Bin; Junfeng Xia

doi:10.1007/s12539-025-00734-2


Interdiscip Sci. 2025 Jul 13. doi: 10.1007/s12539-025-00734-2. Online ahead of print.

Authors

Bing Zhang¹, Jianping Zhao², Yannan Bin³, Junfeng Xia⁴

Affiliations

¹ College of Mathematics and System Sciences, Xinjiang University, Ürümqi, 830000, China.
² College of Mathematics and System Sciences, Xinjiang University, Ürümqi, 830000, China. jpzhao@xju.edu.cn.
³ Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230000, China.
⁴ Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230000, China. jfxia@ahu.edu.cn.

PMID: 40652417
DOI: 10.1007/s12539-025-00734-2

Abstract

The rising incidence of allergic disorders has become a pressing public health issue worldwide, underscoring the need for intensified research and effective interventions. Accurate identification of allergenic proteins (ALPs) is essential for preventing allergic reactions and mitigating health risks at the individual level. Although machine learning and deep learning techniques have been widely applied to ALP identification, existing methods are often limited in their ability to capture the complex features of ALPs. In response, we introduce iALP, a novel method that combines the large language model ProtT5 with the gated linear unit (GLU) to identify ALPs. The features extracted by ProtT5 enable an in-depth analysis of the complex characteristics of ALPs, while the GLU captures the intricate nonlinear features hidden within these proteins. iALP achieves an accuracy and F1-score of 0.957 on the test set, and it outperforms the leading predictors on a separate dataset. We also provide a detailed discussion of model performance on protein sequences shorter than 100 amino acids. We hope that iALP will facilitate accurate ALP prediction, thereby supporting the prevention of allergy symptoms and the implementation of allergen prevention and treatment strategies. The iALP source code and datasets for prediction tasks are available from the GitHub repository at https://github.com/xialab-ahu/iALP.git.
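For readers unfamiliar with the gating mechanism named in the abstract, the following is a minimal sketch of a generic gated linear unit, which multiplies a linear projection of the input elementwise by a sigmoid gate computed from the same input. It is not the authors' implementation (see the repository for that). The function names, the toy dimensions, and the random vector standing in for a pooled ProtT5 embedding are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear(x, weights, bias):
    """Affine map of one input vector; weights holds one row per output unit."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]


def glu(x, W, b, V, c):
    """Generic gated linear unit: (xW + b) * sigmoid(xV + c), elementwise."""
    values = linear(x, W, b)                       # candidate features
    gates = [sigmoid(g) for g in linear(x, V, c)]  # gate values in (0, 1)
    return [v * g for v, g in zip(values, gates)]


# Toy sizes (assumption): real protein embeddings are much wider.
random.seed(0)
embed_dim, hidden_dim = 8, 3

# Hypothetical stand-in for one pooled protein embedding.
x = [random.gauss(0.0, 1.0) for _ in range(embed_dim)]

W = [[random.gauss(0.0, 0.1) for _ in range(embed_dim)] for _ in range(hidden_dim)]
V = [[random.gauss(0.0, 0.1) for _ in range(embed_dim)] for _ in range(hidden_dim)]
b = [0.0] * hidden_dim
c = [0.0] * hidden_dim

h = glu(x, W, b, V, c)
print(len(h), h)
```

The gate lets the layer scale each projected feature between fully suppressed and fully passed, depending on the input. In a classifier, an output such as `h` would typically feed further layers that produce the allergen or non-allergen prediction.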

Keywords: Allergenic protein; Deep learning; Gated linear unit; Large language model; Sequence feature.
