Detecting interspecific positive selection using convolutional neural networks

Charlotte West; Conor R Walker; Shayesteh Arasti; Viacheslav Vasilev; Xingze Xu; Nicola De Maio; Nick Goldman

doi:10.1093/molbev/msaf154

Mol Biol Evol. 2025 Jun 30:msaf154. doi: 10.1093/molbev/msaf154. Online ahead of print.

Authors

Charlotte West¹, Conor R Walker¹,², Shayesteh Arasti¹, Viacheslav Vasilev¹, Xingze Xu¹, Nicola De Maio¹, Nick Goldman¹

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, United Kingdom.
² Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, United Kingdom.

PMID: 40587487
DOI: 10.1093/molbev/msaf154

Abstract

Traditional statistical methods using maximum likelihood and Bayesian inference can detect positive selection from an interspecific phylogeny and a codon sequence alignment based on model assumptions, but they are prone to false positives due to alignment errors and can lack power. These problems are particularly pronounced when faced with high levels of indels and divergence. To address these issues, we trained and tested convolutional neural network (CNN) models on simulated data and achieved higher accuracy in detecting selection across a specific range of phylogenetic scenarios and evolutionary modes. This advantage is particularly evident when performing inference on noisy data prone to misalignments. Our method shows some ability to account for these errors, where most statistical frameworks fail to do so in a tractable manner. We explore the generalisability of our CNN models to unseen evolutionary scenarios and identify future avenues to achieve broader utility. Once trained, our CNN model is faster at test time, making it a scalable alternative to traditional statistical methods for large-scale, multi-gene analyses. In addition to binary classification (inference of the presence or absence of positive selection during the evolution of the sequences), we use saliency maps to understand what the model learns and observe how this could be leveraged for sitewise inference of positive selection.
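
To make the saliency-map idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a one-hot nucleotide encoding with a gap channel (the abstract does not specify the input encoding), replaces the trained CNN with a hypothetical logistic stand-in (`toy_predict`), and approximates the gradient of the output with respect to the input by central finite differences rather than backpropagation. All function names and weights are illustrative.

```python
import math

CHANNELS = "ACGT-"  # assumed encoding: four nucleotides plus a gap channel


def one_hot_alignment(alignment):
    """Encode equal-length aligned sequences as nested lists
    indexed [sequence][column][channel]; unknown symbols become all zeros."""
    return [[[1.0 if ch == c else 0.0 for c in CHANNELS] for ch in seq]
            for seq in alignment]


def toy_predict(x, col_weights=(0.1, 2.0, 0.1), bias=-1.0):
    """Hypothetical stand-in for a trained classifier (3-column inputs only):
    a logistic score over the 'A' channel, weighted per column."""
    score = bias + sum(col_weights[j] * row[j][0]
                       for row in x for j in range(len(row)))
    return 1.0 / (1.0 + math.exp(-score))


def column_saliency(predict, x, eps=1e-4):
    """Sum |d predict / d input| over sequences and channels for each column,
    estimating each partial derivative by a central finite difference."""
    n_cols = len(x[0])
    saliency = [0.0] * n_cols
    for s in range(len(x)):
        for j in range(n_cols):
            for c in range(len(CHANNELS)):
                original = x[s][j][c]
                x[s][j][c] = original + eps
                up = predict(x)
                x[s][j][c] = original - eps
                down = predict(x)
                x[s][j][c] = original  # restore the input entry
                saliency[j] += abs(up - down) / (2 * eps)
    return saliency


alignment = ["ACG", "AAG", "ATG"]  # toy alignment: 3 sequences, 3 columns
encoded = one_hot_alignment(alignment)
saliency = column_saliency(toy_predict, encoded)
most_salient_column = max(range(len(saliency)), key=saliency.__getitem__)
print(most_salient_column, [round(v, 3) for v in saliency])
```

In this toy setting the middle column receives the largest saliency because the stand-in model weights it most heavily. With a real trained network, the same per-column aggregation of input gradients would indicate which alignment sites most influence the prediction, which is the sense in which saliency could inform sitewise inference.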