High-dimensional datasets, in which the number of variables p is much larger than the number of samples n, are ubiquitous and often render standard classification techniques unreliable because of overfitting. An important research problem is feature selection, which ranks candidate variables by their relevance to the outcome variable and retains those that satisfy a chosen criterion. This article proposes a computationally efficient variable selection method based on principal component analysis, tailored to binary classification problems and case-control studies. The method is easy to apply and well suited to the analysis of high-dimensional datasets. We demonstrate its superior performance through extensive simulations. A semi-real gene expression dataset, a challenging childhood acute lymphoblastic leukemia gene expression study, and a genome-wide association study (GWAS) seeking single-nucleotide polymorphisms (SNPs) associated with rice grain length further demonstrate its usefulness in genomic applications. We expect our method to identify important features accurately and to reduce the false discovery rate (FDR) by accounting for the correlation between variables and by de-noising the data in the training phase, which also makes it robust to mild outliers in the training data. Our method is almost as fast as univariate filters, and it permits valid statistical inference. The ability to make such inferences sets it apart from most current multivariate statistical tools designed for today's high-dimensional data.
Keywords: GWAS; QTL mapping; binary classification; case-control study; feature selection; gene-expression data.
© 2025 John Wiley & Sons Ltd.
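To illustrate the general idea described above (the abstract does not give the authors' exact algorithm), the following is a minimal, hypothetical sketch of PCA-based de-noising followed by feature ranking for a binary outcome: the data are projected onto their leading principal components, reconstructed, and each variable is scored by a t-like standardized mean difference between the two classes. The function name, the number of components, and the scoring rule are all assumptions for illustration only.

```python
import numpy as np

def pca_denoise_rank(X, y, n_components=5):
    """Illustrative sketch (not the paper's actual method):
    de-noise X via its top principal components, then rank
    features by |standardized mean difference| between classes."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data gives the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, len(S))
    X_dn = U[:, :k] * S[:k] @ Vt[:k, :]          # low-rank (de-noised) reconstruction
    g0, g1 = X_dn[y == 0], X_dn[y == 1]
    diff = g1.mean(axis=0) - g0.mean(axis=0)
    se = np.sqrt(g0.var(axis=0, ddof=1) / len(g0)
                 + g1.var(axis=0, ddof=1) / len(g1)) + 1e-12
    scores = np.abs(diff / se)                   # t-like relevance score per feature
    return np.argsort(scores)[::-1]              # feature indices, most relevant first

# Toy p >> n example: 60 samples, 500 variables, a correlated
# informative block in the first 50 columns (hypothetical data).
rng = np.random.default_rng(0)
n, p = 60, 500
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :50] += 2.0 * y[:, None]                    # class signal spread over 50 features
ranking = pca_denoise_rank(X, y, n_components=10)
```

Because the signal is shared across correlated variables, it concentrates in the leading components, which is why a PCA step can suppress noise before univariate-style scoring.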