Improving Lung Cancer Risk Prediction Using Machine Learning: A Comparative Analysis of Stacking Models and Traditional Approaches

Huakang Tu; Yunfeng Zhao; Jiameng Cui; Wanzhu Lu; Gege Sun; Xiaohang Xu; Qingfeng Hu; Kejia Hu; Ming Wu; Xifeng Wu

doi:10.3390/cancers17101651

Cancers (Basel). 2025 May 13;17(10):1651. doi: 10.3390/cancers17101651.

Authors

Huakang Tu¹, Yunfeng Zhao¹, Jiameng Cui¹, Wanzhu Lu¹, Gege Sun¹, Xiaohang Xu¹, Qingfeng Hu¹, Kejia Hu¹, Ming Wu², Xifeng Wu¹ ³ ⁴

Affiliations

¹ Center of Clinical Big Data and Analytics of the Second Affiliated Hospital and School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058, China.
² Department of Thoracic Surgery, The Second Affiliated Hospital, Zhejiang University School of Medicine, 88 Jiefang Rd., Hangzhou 310009, China.
³ National Institute for Data Science in Health and Medicine, Zhejiang University, Hangzhou 310058, China.
⁴ Zhejiang Key Laboratory of Intelligent Preventive Medicine, Hangzhou 310058, China.

Abstract

Background: Lung cancer is a leading cause of cancer-related mortality worldwide and is often diagnosed at an advanced stage, which makes early detection critical. This study aimed to evaluate the performance of various machine learning models in predicting lung cancer risk from epidemiological questionnaire data and to compare them with traditional logistic regression models.

Methods: A retrospective case-control study was conducted using data from 5421 lung cancer cases and 10,831 matched controls. The dataset included a wide range of demographic, clinical, and behavioral risk factors from epidemiological questionnaires. We developed and compared multiple machine learning algorithms, including LightGBM and stacking ensemble models, alongside logistic regression for predicting lung cancer risk. Model performance was evaluated using accuracy, area under the curve (AUC), and recall.
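The three evaluation metrics named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made for illustration only, and none of the numbers come from the study. AUC is computed here by its pairwise definition, the probability that a randomly chosen case receives a higher score than a randomly chosen control.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true cases (label 1) that were predicted as cases."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as P(score of a case > score of a control); ties count as 0.5."""
    cases = [s for t, s in zip(y_true, y_score) if t == 1]
    controls = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data (NOT from the study): 3 cases, 5 controls.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

# Assumed decision threshold of 0.5 to turn scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

toy_accuracy = accuracy(y_true, y_pred)
toy_recall = recall(y_true, y_pred)
toy_auc = auc(y_true, y_score)
print(f"accuracy={toy_accuracy:.3f} recall={toy_recall:.3f} AUC={toy_auc:.3f}")
```

Accuracy and recall depend on the chosen threshold, whereas AUC is computed from the ranking of the scores alone, which is one reason the three metrics are commonly reported together.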

Results: The stacking model outperformed traditional logistic regression, achieving an AUC of 0.887 (0.870-0.903) compared to 0.858 (0.839-0.878) for logistic regression. LightGBM also performed well, with an AUC of 0.884 (0.867-0.901). The stacking model achieved an accuracy of 81.2% and a recall of 0.755; its accuracy exceeded the 79.4% achieved by the logistic regression model. Compared to classical lung cancer prediction models (LLP and PLCO), the logistic regression and ML models improved AUC by 12% to 27%.

Conclusions: Integrating machine learning models into lung cancer screening programs can significantly enhance early detection efforts. Machine learning approaches, such as LightGBM and stacking, offer improved accuracy and predictive power over traditional models. However, efforts to enhance model interpretability through explainable AI techniques are necessary for broader clinical adoption.

Keywords: artificial intelligence; lung cancer; machine learning; prediction; stacking models.
