A practical evaluation of AutoML tools for binary, multiclass, and multilabel classification

Marcelo V C Aragão; Augusto G Afonso; Rafaela C Ferraz; Rairon G Ferreira; Sávio G Leite; Felipe A P de Figueiredo; Samuel B Mafra

doi:10.1038/s41598-025-02149-x

Sci Rep. 2025 May 21;15(1):17682. doi: 10.1038/s41598-025-02149-x.

Authors

Marcelo V C Aragão^#¹, Augusto G Afonso^#², Rafaela C Ferraz^#², Rairon G Ferreira^#², Sávio G Leite^#², Felipe A P de Figueiredo², Samuel B Mafra²

Affiliations

¹ National Institute of Telecommunications (Inatel), Santa Rita do Sapucaí, MG, 37536-001, Brazil. marcelovca90@inatel.br.
² National Institute of Telecommunications (Inatel), Santa Rita do Sapucaí, MG, 37536-001, Brazil.

^# Contributed equally.

Abstract

Selecting the most suitable Automated Machine Learning (AutoML) tool is pivotal for achieving optimal performance in diverse classification tasks, including binary, multiclass, and multilabel scenarios. The wide range of frameworks with distinct features and capabilities complicates this decision, necessitating a systematic evaluation. This study benchmarks sixteen AutoML tools, including AutoGluon, AutoSklearn, TPOT, PyCaret, and Lightwood, across all three classification types using 21 real-world datasets. Unlike prior studies focusing on a subset of classification tasks or a limited number of tools, we provide a unified evaluation of sixteen frameworks, incorporating feature-based comparisons, time-constrained experiments, and multi-tier statistical validation. We also compared our findings with four representative prior benchmarks to contextualize our results within the existing literature. A key contribution of our study is the in-depth assessment of multilabel classification, exploring both native and label-powerset representations and revealing that several tools lack robust multilabel capabilities. Our findings demonstrate that AutoSklearn excels in predictive performance for binary and multiclass settings, albeit at longer training times, while Lightwood and AutoKeras offer faster training at the cost of predictive performance on complex datasets. AutoGluon emerges as the best overall solution, balancing predictive accuracy with computational efficiency. Our statistical analysis (at the per-dataset, across-datasets, and all-datasets levels) confirms significant performance differences among tools, highlighting accuracy-speed trade-offs in AutoML. These insights underscore the importance of aligning tool selection with specific problem characteristics and resource constraints. The open-source code and reproducible experimental protocols further ensure the study's value as a robust resource for researchers and practitioners.

Keywords: AutoML; Classification; Hyperparameter optimization; Machine learning; Neural architecture search.
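
The abstract refers to a label-powerset representation for multilabel classification. As a minimal illustration of that general technique (not the authors' code; the function names `fit_label_powerset`, `to_powerset`, and `from_powerset` and the toy label matrix are hypothetical), each distinct combination of labels can be mapped to a single class, which turns a multilabel problem into a multiclass one that any multiclass-capable tool could in principle consume:

```python
from typing import Dict, List, Sequence, Tuple

LabelCombo = Tuple[int, ...]


def fit_label_powerset(Y: Sequence[Sequence[int]]) -> Dict[LabelCombo, int]:
    """Assign one class id to each distinct label combination seen in Y."""
    mapping: Dict[LabelCombo, int] = {}
    for row in Y:
        combo = tuple(row)
        if combo not in mapping:
            mapping[combo] = len(mapping)  # ids follow first-seen order
    return mapping


def to_powerset(Y: Sequence[Sequence[int]],
                mapping: Dict[LabelCombo, int]) -> List[int]:
    """Encode each binary label vector as its single multiclass id."""
    return [mapping[tuple(row)] for row in Y]


def from_powerset(classes: Sequence[int],
                  mapping: Dict[LabelCombo, int]) -> List[List[int]]:
    """Decode multiclass ids back into binary label vectors."""
    inverse = {class_id: combo for combo, class_id in mapping.items()}
    return [list(inverse[c]) for c in classes]


# Toy multilabel targets (hypothetical): 4 samples, 3 labels.
Y = [
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
]

mapping = fit_label_powerset(Y)
y_multiclass = to_powerset(Y, mapping)          # one class id per sample
Y_restored = from_powerset(y_multiclass, mapping)
```

A known limitation of this encoding is that a label combination absent from the training data has no class id, so a model trained on the transformed target cannot predict it.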
