Risk factors and prediction of distant metastasis (DM) of colon adenocarcinoma: a logistic regression and machine learning study based on surveillance, epidemiology, and end results (SEER) database

BMC Cancer. 2025 Jul 1;25(1):1047. doi: 10.1186/s12885-025-14329-z.

Abstract

Background: Traditional imaging examinations have limitations in detecting distant metastasis (DM), such as low sensitivity. This study aimed to identify pathological and laboratory risk factors and to establish models predicting DM in patients with colon adenocarcinoma (CA).

Methods: CA patients diagnosed between 2018 and 2021 were retrieved from SEER. Logistic regression was used to identify independent risk factors (IRFs) of DM. Twelve models were established and evaluated on a training set and a test set (7:3 split) of the retrieved patients: BNB (Bernoulli naïve Bayes), DT (decision tree), GBC (gradient boosting classifier), GNB (Gaussian naïve Bayes), KNN (K-nearest neighbor), LDA (linear discriminant analysis), LR (logistic regression), MLP (multi-layer perceptron classifier), MNB (multinomial naïve Bayes), QDA (quadratic discriminant analysis), RFC (random forest classifier) and SVC (support vector machine). Additionally, CA patient data were collected from Jincheng People’s Hospital (JCPH) as an external validation set to assess the predictive efficacy of the models.
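The 7:3 partition described in the Methods can be sketched as follows. This is a minimal illustration only: the abstract does not state whether the split was stratified by DM status or which software was used, so the stratification, the function name `stratified_split`, the seed, and the synthetic labels are all assumptions rather than the authors' procedure.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts.

    Stratification is an assumption; the abstract only reports a 7:3 split.
    """
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels (1 = DM, 0 = no DM); synthetic, not SEER data.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class keeps the proportion of DM cases the same in the training and test sets, which matters when the outcome is relatively rare.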

Results: A total of 7,000 and 83 CA patients were retrieved from SEER and JCPH, respectively. Eight IRFs were identified: age 60–79 (OR = 0.589, 95% CI: 0.391–0.887) and age > 80 (OR = 0.456, 95% CI: 0.287–0.722); primary site – cecum (OR = 1.305, 95% CI: 1.023–1.664); TNM stage – T3 (OR = 8.869, 95% CI: 2.151–36.569) and T4 (OR = 15.912, 95% CI: 3.839–65.955); TNM stage – N1 (OR = 3.853, 95% CI: 2.919–5.087) and N2 (OR = 8.480, 95% CI: 6.322–11.374); number of regional nodes examined > 12 (OR = 0.439, 95% CI: 0.326–0.591); tumor deposits (OR = 1.989, 95% CI: 1.639–2.414); carcinoembryonic antigen (CEA) level (OR = 4.552, 95% CI: 3.747–5.530); and perineural invasion (OR = 1.352, 95% CI: 1.112–1.643). LR showed the best predictive efficacy on both the test set (AUC = 0.892, sensitivity = 0.825, specificity = 0.801) and the external validation set (AUC = 0.868, sensitivity = 1.000, specificity = 0.727).
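For readers unfamiliar with how such figures are derived, the sketch below shows the standard relationship between a logistic regression coefficient and its odds ratio with a Wald 95% CI, and how sensitivity and specificity follow from binary predictions. The coefficient and standard error are back-calculated here from the reported CEA result (OR = 4.552, 95% CI: 3.747–5.530) purely for illustration; they are not values given in the abstract, and the function names and the small label vectors are hypothetical.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald 95% CI from a log-odds coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Assumption: beta and SE are reconstructed from the reported CEA odds ratio
# and CI limits; the abstract itself does not report coefficients.
beta = math.log(4.552)
se = (math.log(5.530) - math.log(3.747)) / (2 * 1.96)
cea_or, cea_low, cea_high = odds_ratio_ci(beta, se)

# Hypothetical predictions (1 = DM, 0 = no DM); not study data.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

Because the interval is symmetric on the log-odds scale, an OR below 1 with an upper limit below 1 (as for the older age groups) indicates lower odds of DM relative to the reference category, while an interval entirely above 1 indicates higher odds.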

Conclusions: Machine learning is a promising approach to assist in detecting DM in CA patients.

Keywords: Colon adenocarcinoma; Distant metastasis; Machine learning; Risk factor.