To Impute or Not To Impute in Untargeted Metabolomics─That is the Compositional Question

Dennis D Krutkin; Sydney Thomas; Simone Zuffa; Prajit Rajkumar; Rob Knight; Pieter C Dorrestein; Scott T Kelley

doi:10.1021/jasms.4c00434

J Am Soc Mass Spectrom. 2025 Apr 2;36(4):742-759. doi: 10.1021/jasms.4c00434. Epub 2025 Feb 25.

Authors

Dennis D Krutkin¹,², Sydney Thomas³, Simone Zuffa³,⁴, Prajit Rajkumar³, Rob Knight⁵,⁶,⁷, Pieter C Dorrestein³,⁴,⁶,⁸, Scott T Kelley²

Affiliations

¹ School of Biological Sciences, University of California San Diego, La Jolla, California 92037, United States.
² Department of Biology, San Diego State University, San Diego, California 92182, United States.
³ Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92037, United States.
⁴ Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, California 92037, United States.
⁵ Department of Computer Science and Engineering, University of California San Diego, La Jolla, California 92037, United States.
⁶ Department of Pediatrics and Shu Chien-Gene Lay Department of Engineering, University of California San Diego, La Jolla, California 92037, United States.
⁷ Halıcıoǧlu Data Science Institute, University of California San Diego, La Jolla, California 92037, United States.
⁸ Center for Microbiome Innovation, University of California San Diego, La Jolla, California 92037, United States.

Abstract

Untargeted metabolomics often produces large datasets with missing values. These missing values arise from biological or technical factors and can undermine statistical analyses and lead to biased biological interpretations. Imputation methods, such as k-Nearest Neighbors (kNN) and Random Forest (RF) regression, are commonly used, but their effects vary depending on the type of missing data, e.g., Missing Completely At Random (MCAR) and Missing Not At Random (MNAR). Here, we determined the impacts of the degree and type of missing data on the accuracy of kNN and RF imputation using two datasets: a targeted metabolomic dataset with spiked-in standards and an untargeted metabolomic dataset. We also assessed the effect of compositional data analysis (CoDA) approaches, such as the centered log-ratio (CLR) transform, on data interpretation, since these methods are increasingly being used in metabolomics. Overall, we found that kNN and RF performed more accurately when the proportion of missing data across samples for a metabolic feature was low. However, these methods could not handle MNAR data: they generated wildly inflated values or imputed values where none should exist. Furthermore, we show that the proportion of missing values had a strong impact on the accuracy of imputation, which in turn affected the interpretation of the results. Our results suggest that imputation should be used with extreme caution, even with modest levels of missing data and especially when the type of missingness is unknown.
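The distinction between MCAR and MNAR, and the CLR transform mentioned above, can be illustrated with a small toy sketch. This is not the authors' pipeline and does not implement kNN or RF imputation. The simulated log-normal intensities, the 30% missing fraction, and the helper names (`clr`, `mask_mcar`, `mask_mnar`, `observed_mean`) are all assumptions made for illustration. MNAR is modeled here as a simple detection limit that removes the lowest intensities, which is only one possible MNAR mechanism.

```python
import math
import random


def clr(composition):
    """Centered log-ratio: log of each part minus the mean of the logs.

    Assumes all parts are strictly positive (no zeros or missing values).
    """
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def mask_mcar(values, fraction, rng):
    """Hide a random subset of values, independent of their intensity."""
    n_drop = int(round(fraction * len(values)))
    drop = set(rng.sample(range(len(values)), n_drop))
    return [None if i in drop else v for i, v in enumerate(values)]


def mask_mnar(values, fraction):
    """Hide the lowest-intensity values, mimicking a detection limit."""
    n_drop = int(round(fraction * len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i])
    drop = set(order[:n_drop])
    return [None if i in drop else v for i, v in enumerate(values)]


def observed_mean(values):
    """Mean of the values that remain observed after masking."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)


# Hypothetical intensities for one metabolic feature across 200 samples.
rng = random.Random(0)
intensities = [rng.lognormvariate(10, 1) for _ in range(200)]
true_mean = sum(intensities) / len(intensities)

mcar = mask_mcar(intensities, 0.3, rng)
mnar = mask_mnar(intensities, 0.3)
mcar_mean = observed_mean(mcar)
mnar_mean = observed_mean(mnar)

print(f"true mean:           {true_mean:.1f}")
print(f"observed mean, MCAR: {mcar_mean:.1f}")
print(f"observed mean, MNAR: {mnar_mean:.1f}")

# CLR of a three-part toy composition; the transformed parts sum to zero.
clr_values = clr([10.0, 100.0, 1000.0])
print("CLR:", [round(v, 3) for v in clr_values])
```

In this sketch, the observed values under MNAR are systematically higher than the full set because only the low intensities were removed. Any imputation method that borrows from observed values would inherit that upward shift, which is in line with the inflated values the abstract reports for MNAR data. Note also that the CLR transform requires strictly positive inputs, so missing values and zeros have to be dealt with in some way before it can be applied.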

MeSH terms

Data Interpretation, Statistical
Humans
Metabolomics* / methods