Comparative investigation of bagging enhanced machine learning for early detection of HCV infections using class imbalance technique with feature selection

Ekramul Haque Tusher; Mohd Arfian Ismail; Abdullah Akib; Lubna A Gabralla; Ashraf Osman Ibrahim; Hafizan Mat Som; Muhammad Akmal Remli

doi:10.1371/journal.pone.0326488

PLoS One. 2025 Jun 26;20(6):e0326488. doi: 10.1371/journal.pone.0326488. eCollection 2025.

Authors

Ekramul Haque Tusher¹, Mohd Arfian Ismail¹ ², Abdullah Akib³, Lubna A Gabralla⁴, Ashraf Osman Ibrahim⁵ ⁶, Hafizan Mat Som⁵ ⁶, Muhammad Akmal Remli⁷ ⁸

Affiliations

¹ Faculty of Computing, Universiti Malaysia Pahang Al-Sultan Abdullah, Pahang, Malaysia.
² Center of Excellence for Artificial Intelligence & Data Science, Universiti Malaysia Pahang Al-Sultan Abdullah, Lebuhraya Tun Razak, Gambang, Malaysia.
³ Industrial Engineering, Lamar University, Beaumont, Texas, United States of America.
⁴ Department of Computer Science, Applied College, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
⁵ Department of Computing, Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia.
⁶ Positive Computing Research Center, Emerging & Digital Technologies Institute, Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia.
⁷ Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, Kota Bharu, Kelantan, Malaysia.
⁸ Faculty of Data Science and Computing, Universiti Malaysia Kelantan, Kota Bharu, Kelantan, Malaysia.

Abstract

Around 1.5 million new cases of Hepatitis C Virus (HCV) are diagnosed globally each year (World Health Organization, 2023). Consequently, there is a pressing need for early diagnostic methods for HCV. This study investigates the prognostic accuracy of several ensemble machine learning (ML) models for diagnosing HCV infection. The study utilizes a dataset comprising demographic information of 615 individuals suspected of having HCV infection. Additionally, the research employs oversampling and undersampling techniques to address class imbalances in the dataset and conducts feature reduction using the F-test in one-way analysis of variance. Ensemble ML methods, including Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), Logistic Regression (LR), Random Forest (RF), Naïve Bayes (NB), and Decision Tree (DT), are used to predict HCV infection. The performance of these ensemble methods is evaluated using metrics such as accuracy, recall, precision, F1 score, G-mean, balanced accuracy, cross-validation (CV), area under the curve (AUC), standard deviation, and error rate. Compared with previous studies, the Bagging k-NN model demonstrated superior performance under oversampling conditions, achieving 98.37% accuracy, 98.23% CV score, 97.67% precision, 97.93% recall, 98.18% selectivity, 97.79% F1 score, 98.06% balanced accuracy, 98.05% G-mean, a 1.63% error rate, 0.98 AUC, and a standard deviation of 0.192. This study highlights the potential of ensemble ML approaches in improving the diagnosis of HCV. The findings provide a foundation for developing accurate predictive methods for HCV diagnosis.
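The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes a binary confusion matrix (infected vs. not infected) and assumes that "selectivity" refers to specificity, the true negative rate; neither assumption is stated in the abstract. The function name binary_metrics and the example counts are hypothetical and are not taken from the study.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives, and false negatives (hypothetical inputs here).
    """
    total = tp + fp + tn + fn
    accuracy = _ratio(tp + tn, total)
    recall = _ratio(tp, tp + fn)        # sensitivity, true positive rate
    specificity = _ratio(tn, tn + fp)   # assumed meaning of "selectivity"
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        # Balanced accuracy: arithmetic mean of recall and specificity.
        "balanced_accuracy": (recall + specificity) / 2,
        # G-mean: geometric mean of recall and specificity.
        "g_mean": math.sqrt(recall * specificity),
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions the reported figures are mutually consistent: the mean of 97.93% recall and 98.18% selectivity is approximately 98.06% (the reported balanced accuracy), their geometric mean is approximately 98.05% (the reported G-mean), and the 1.63% error rate is the complement of the 98.37% accuracy.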

Copyright: © 2025 Tusher et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Publication types

Comparative Study

MeSH terms

Adult
Area Under Curve
Bayes Theorem
Decision Trees
Early Diagnosis
Female
Hepacivirus
Hepatitis C* / diagnosis
Humans
Machine Learning*
Male
Middle Aged
Support Vector Machine