Accurate prediction of energy consumption is crucial for optimizing wastewater treatment plant (WWTP) operations. However, imbalanced data caused by variable influent conditions often compromises machine learning (ML) model accuracy. This study proposes a novel ML framework that addresses this imbalanced regression problem with three temporal difference-weighted resampling (TDWR) methods: threshold under-sampling (TUS), stochastic under-sampling (SUS), and inverse histogram under-sampling (IHS). Internal validation used an 80/20 training/testing split within each dataset, and external validation involved cross-testing among the different resampled and original datasets to ensure a robust assessment. Among the methods, SUS with a sampling factor of 6 (SUS-6) performed best. Combined with XGBoost, it attained an R² of 0.9998, an RMSE of 0.0833, and a MAPE of 0.14 %. Compared with the original data, R² improved by up to 27.6 %, RMSE fell by nearly 87 %, and MAPE fell by 96.07 %. The 95 % confidence interval of the residuals narrowed to (-1.24, 1.25), a reduction of approximately 70 %. Comparable narrowing was observed for support vector regression (84 %), artificial neural network (45 %), and random forest (63 %) models. Interpretability analysis based on SHAP (SHapley Additive exPlanations) revealed that aeration-related features such as BOD, COD, and NH₃-N were the main contributors to energy consumption, providing practical guidance for process optimization. Overall, the proposed TDWR framework enhances both prediction accuracy and interpretability, offering an effective tool for intelligent, low-carbon energy management in WWTPs.
Keywords: Data imbalance; Energy consumption; Machine learning; Temporal difference-weighted resampling; Wastewater treatment plants.
Copyright © 2025 Elsevier Ltd. All rights reserved.