Outlier identification method based on multi-model weighted consensus in conjunction with Monte Carlo Cross-Validation

Yujing Wang; Zhengguang Chen; Jinming Liu; He Wang

doi:10.1093/jaoacint/qsaf061

J AOAC Int. 2025 Jun 23:qsaf061. doi: 10.1093/jaoacint/qsaf061. Online ahead of print.

Authors

Yujing Wang¹, Zhengguang Chen¹, Jinming Liu¹, He Wang¹

Affiliation

¹ College of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing, 163319 China.

PMID: 40581617
DOI: 10.1093/jaoacint/qsaf061

Abstract

Background: The accurate identification and removal of outliers are fundamental to the development of a robust model.

Objective: Relying solely on a single model for outlier identification may fail to identify all outliers accurately, potentially leading to false positives, false negatives, and model dependence. This study aims to overcome these limitations.

Methods: This study introduces a method for outlier identification that combines Monte Carlo cross-validation with a weighted consensus of multiple models (MCWC). The proposed method integrates Monte Carlo random sampling with three distinct modeling methods: Partial Least Squares Regression (PLSR), Gaussian Process Regression (GPR), and Support Vector Regression (SVR). The predictions of the three models are then combined to identify outliers.
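The general idea can be illustrated with a minimal sketch. The abstract does not give the weighting formula, the error statistic, or the cutoff, so everything below is an assumption for illustration: the function names (`mcwc_scores`, `flag_outliers`), the inverse-median-error weights, and the cutoff of three times the median score are hypothetical. Three toy one-dimensional learners stand in for PLSR, GPR, and SVR so that the example needs only the Python standard library; in practice each `fit` function would wrap a real regression model.

```python
import math
import random
import statistics

# Toy 1-D learners standing in for PLSR, GPR and SVR (assumption: any
# fit(xs, ys) -> predict(q) callable can be plugged in instead).
def fit_linear(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return lambda q: my + slope * (q - mx)

def fit_knn(xs, ys, k=3):
    pairs = list(zip(xs, ys))
    def predict(q):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - q))[:k]
        return statistics.fmean([p[1] for p in nearest])
    return predict

def fit_kernel(xs, ys, bandwidth=1.5):
    def predict(q):
        w = [math.exp(-((a - q) ** 2) / (2 * bandwidth ** 2)) for a in xs]
        return sum(wi * b for wi, b in zip(w, ys)) / sum(w)
    return predict

def mcwc_scores(x, y, fitters, n_runs=200, train_frac=0.8, seed=0):
    """Return (per-sample consensus scores, per-model weights)."""
    rng = random.Random(seed)
    n = len(y)
    idx = list(range(n))
    cut = int(train_frac * n)
    # errors[m][i] collects held-out absolute errors of sample i under model m.
    errors = [[[] for _ in range(n)] for _ in fitters]
    for _ in range(n_runs):
        rng.shuffle(idx)  # Monte Carlo split: random train/validation partition
        train, valid = idx[:cut], idx[cut:]
        xs, ys = [x[i] for i in train], [y[i] for i in train]
        for m, fit in enumerate(fitters):
            predict = fit(xs, ys)
            for i in valid:
                errors[m][i].append(abs(predict(x[i]) - y[i]))
    mean_err = [[statistics.fmean(e) if e else 0.0 for e in per_model]
                for per_model in errors]
    # Assumed weighting: inverse of each model's median held-out error.
    typical = [statistics.median(me) + 1e-12 for me in mean_err]
    inv = [1.0 / t for t in typical]
    weights = [v / sum(inv) for v in inv]
    # Consensus score: weighted sum of each model's median-scaled error.
    scores = [sum(w * me[i] / t for w, me, t in zip(weights, mean_err, typical))
              for i in range(n)]
    return scores, weights

def flag_outliers(scores, factor=3.0):
    """Assumed cutoff: flag samples scoring above factor * median score."""
    cutoff = factor * statistics.median(scores)
    return [i for i, s in enumerate(scores) if s > cutoff]

# Synthetic demo data (not from the study): a noisy line with one planted outlier.
x_demo = [float(i) for i in range(30)]
y_demo = [2.0 * v + 1.0 + 0.5 * math.sin(v) for v in x_demo]
y_demo[10] += 50.0
scores, weights = mcwc_scores(x_demo, y_demo, [fit_linear, fit_knn, fit_kernel])
flagged = flag_outliers(scores)
```

On this synthetic data the planted sample at index 10 receives the highest consensus score. Its neighbors may also score above the crude cutoff, because the local learners are contaminated whenever the outlier falls in the training split, which is one reason a real implementation would choose the threshold more carefully.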

Results: A dataset of 305 sorghum samples served as the experimental basis. Predictive models for sorghum protein were built on the data after outliers had been removed by a single-model method and by the MCWC method, respectively. The results indicate that a dataset from which outliers were removed using a single modeling method is appropriate for further modeling with that same method but, because of model dependence, is not suitable for modeling with other methods. After outlier removal with the MCWC method, the average R² of the models on the prediction set was 0.8525. In contrast, after outlier removal with the Monte Carlo method combined exclusively with PLSR, the average R² on the prediction set was 0.8037.
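For reference, the coefficient of determination reported above can be computed as in the short sketch below. The abstract does not state how the average is formed; averaging the prediction-set R² over the models built on the cleaned data is an assumption here, and the function names and numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def average_r_squared(y_true, predictions_by_model):
    """Mean R² over several models' predictions for the same samples."""
    values = [r_squared(y_true, p) for p in predictions_by_model.values()]
    return sum(values) / len(values)

# Toy numbers only (not from the study).
reference = [1.0, 2.0, 3.0]
toy_predictions = {
    "model_a": [1.0, 2.0, 3.0],  # perfect predictions
    "model_b": [2.0, 2.0, 2.0],  # always predicts the mean
}
avg_r2 = average_r_squared(reference, toy_predictions)
```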

Conclusion: The MCWC method exhibits superior accuracy in identifying outliers and effectively addresses the problems of false positives, false negatives, and model dependence in the identification of near-infrared spectral outliers. This enhances the overall predictive performance of the calibration model for spectral quantitative analysis.

Highlights: A multi-model dynamic weighted consensus outlier identification method for NIRS data is proposed. The dynamic weighting effectively addresses the biases that can occur with simple averaging. Data from which outliers have been removed using consensus methods are suitable for modeling with a wider range of models.

© The Author(s) 2025. Published by Oxford University Press on behalf of AOAC INTERNATIONAL. All rights reserved.