SyntheVAEiser: augmenting traditional machine learning methods with VAE-based gene expression sample generation for improved cancer subtype predictions

Brian Karlberg; Raphael Kirchgaessner; Jordan Lee; Matthew Peterkort; Liam Beckman; Jeremy Goecks; Kyle Ellrott

doi:10.1186/s13059-024-03431-3

Genome Biol. 2024 Dec 18;25(1):309. doi: 10.1186/s13059-024-03431-3.

Authors

Brian Karlberg¹, Raphael Kirchgaessner¹, Jordan Lee¹, Matthew Peterkort¹, Liam Beckman¹, Jeremy Goecks¹,², Kyle Ellrott³

Affiliations

¹ Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA.
² Department of Machine Learning, Moffitt Cancer Center, Tampa, USA.
³ Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA. ellrott@ohsu.edu.

Abstract

The accuracy of machine learning methods is often limited by the amount of available training data. We propose improving machine learning training regimes by augmenting datasets with synthetically generated samples. We present a method for synthesizing gene expression samples and test its ability to improve the accuracy of categorical prediction of cancer subtypes. We developed SyntheVAEiser, a variational autoencoder-based tool that was trained and tested on over 8000 cancer samples. We show that this technique can augment machine learning tasks and improve recognition performance for underrepresented cohorts.
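
The general idea of VAE-based augmentation described in the abstract can be illustrated with a minimal sketch. This is not the SyntheVAEiser implementation: the linear `toy_decoder`, the `augment_minority` helper, and the choice to balance every class up to the size of the largest one are all assumptions made for illustration. The sketch assumes a trained encoder has already produced a latent mean and log-variance for each real sample, draws new latent points with the reparameterization trick, and decodes them into synthetic expression vectors for the underrepresented classes.

```python
import math
import random
from collections import Counter


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def toy_decoder(z, weights):
    """Hypothetical linear decoder standing in for a trained VAE decoder.

    Each row of `weights` produces one output feature (one gene).
    """
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def augment_minority(latents, labels, weights, rng):
    """Create synthetic samples until every class matches the largest class.

    latents: list of (mu, log_var) pairs, one per real sample (assumed to
             come from a trained encoder; here they are supplied by hand).
    labels:  class label of each real sample.
    Returns the synthetic samples and their labels; real data is untouched.
    """
    counts = Counter(labels)
    target = max(counts.values())
    synthetic, synthetic_labels = [], []
    for label, n in counts.items():
        pool = [lat for lat, lab in zip(latents, labels) if lab == label]
        for _ in range(target - n):
            mu, log_var = rng.choice(pool)
            z = sample_latent(mu, log_var, rng)
            synthetic.append(toy_decoder(z, weights))
            synthetic_labels.append(label)
    return synthetic, synthetic_labels


if __name__ == "__main__":
    rng = random.Random(0)
    # Five samples of subtype "A", two of the underrepresented subtype "B".
    latents = [([0.0, 1.0], [-2.0, -2.0])] * 5 + [([3.0, -1.0], [-2.0, -2.0])] * 2
    labels = ["A"] * 5 + ["B"] * 2
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 2 latent dims -> 3 genes
    samples, sample_labels = augment_minority(latents, labels, weights, rng)
    print(len(samples), Counter(sample_labels))
```

In this toy setup only subtype "B" receives synthetic samples, because subtype "A" is already the largest class. In practice the synthetic samples would be appended to the training set of a downstream classifier, while evaluation would use real samples only.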

Keywords: Cancer subtyping; Data augmentation; Feature engineering; Gene expression; Generative modeling; Molecular subtyping; Sample synthesis; Synthetic data; Transcriptomics; Variational autoencoder.

MeSH terms

Algorithms
Gene Expression Profiling / methods
Gene Expression Regulation, Neoplastic
Humans
Machine Learning*
Neoplasms* / genetics
Software
