Accelerated prediction of molecular properties for per- and polyfluoroalkyl substances using graph neural networks with adjacency-free message passing

Environ Pollut. 2025 Jun 30:382:126705. doi: 10.1016/j.envpol.2025.126705. Online ahead of print.

Abstract

The chemical space of molecular contaminants is vast, necessitating methods and tools that accelerate the computation of molecular properties, support the study of interactions, and ultimately aid the engineering of technological solutions for environmental remediation and exposome reduction. Graph neural networks (GNNs) offer a promising approach because their structure mirrors that of molecular graphs and because they can learn complex relationships through graph-based representations. However, training GNN-based models can be computationally expensive, especially for large molecular datasets. In this work, we evaluated the predictive performance of a novel Graph-Enhanced multilayer perceptron (GE-MLP) on molecular properties of per- and polyfluoroalkyl substances (PFAS) and compared it against two traditional GNN-based architectures, namely Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT). The GE-MLP architecture, which incorporates structural information into a dense neural network framework, was trained and validated on a dataset of 15,000 PFAS, generated using tight-binding methods and calibrated against experimental results. The targeted properties were electron affinity (EA), ionization potential (IP), and HOMO-LUMO gap (HL). In contrast to traditional graph-based architectures, GE-MLP operates in a purely feedforward manner, embedding structural information through molecular fingerprints and node-level descriptors in place of adjacency-based message passing. Our findings reinforce the usefulness of graph-based architectures, compared with traditional machine learning (ML) models, in predicting molecular properties of complex contaminants such as PFAS. Furthermore, the GE-MLP emerged as a strong GNN-based contender, demonstrating the highest predictive performance for IP, which suggests that integrating structural information into dense neural networks via atomic and fingerprint-based molecular descriptors offers a viable alternative to adjacency-based message passing methods. Finally, our GE-MLP provides a computationally efficient alternative to other GNN-based methods owing to savings in model training, offering a scalable, message-passing-free approach to molecular property prediction while retaining structural awareness. Future work includes expanding the dataset to 3.5 million fluorinated compounds to improve generalization, as well as architectural improvements, including transfer learning, topological embeddings, and hybrid models, to further advance predictive accuracy and applicability.
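
The adjacency-free idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how node-level descriptors are aggregated, so the mean and max pooling used here, the layer sizes, the random untrained weights, and all names (`build_input`, `init_mlp`, `mlp_forward`) are assumptions made purely for illustration. The point is only that per-atom descriptors and a molecular fingerprint can be combined into a fixed-length vector and passed through a dense network without any adjacency matrix.

```python
import numpy as np

def build_input(node_features, fingerprint):
    """Build a fixed-length input vector without using an adjacency matrix.

    node_features: (n_atoms, n_descriptors) array of per-atom descriptors.
    fingerprint:   (n_bits,) molecular fingerprint vector.
    Assumption: atoms are summarized by mean and max pooling.
    """
    pooled = np.concatenate([node_features.mean(axis=0),
                             node_features.max(axis=0)])
    return np.concatenate([pooled, fingerprint])

def init_mlp(sizes, seed=0):
    """Create random (untrained) weights for a dense network."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Plain feedforward pass: ReLU hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

# Two hypothetical molecules with different atom counts (synthetic data).
rng = np.random.default_rng(0)
mol_a = rng.normal(size=(5, 4))    # 5 atoms, 4 descriptors per atom
mol_b = rng.normal(size=(12, 4))   # 12 atoms, 4 descriptors per atom
fp_a = rng.integers(0, 2, size=16).astype(float)
fp_b = rng.integers(0, 2, size=16).astype(float)

x_a = build_input(mol_a, fp_a)
x_b = build_input(mol_b, fp_b)

# Three outputs standing in for the three targets (EA, IP, HL).
params = init_mlp([x_a.size, 32, 3])
pred_a = mlp_forward(params, x_a)
pred_b = mlp_forward(params, x_b)
print(x_a.shape, x_b.shape, pred_a.shape)
```

Because pooling is done over the atom axis, molecules of different sizes map to inputs of the same length, and reordering the atoms leaves the input unchanged. In a real setting the weights would be trained against reference EA, IP, and HL values rather than left at their random initial values.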

Keywords: Accelerated predictions; Adjacency-free; Contaminants; GNN; Molecular properties; PFAS.