Long-read transcriptomics of a diverse human cohort reveals widespread ancestry bias in gene annotation

bioRxiv [Preprint]. 2025 Mar 17:2025.03.14.643250. doi: 10.1101/2025.03.14.643250.

Abstract

Accurate gene annotations are fundamental for interpreting genetic variation, cellular function, and disease mechanisms. However, current human gene annotations are largely derived from transcriptomic data of individuals of European ancestry, introducing potential biases that remain uncharacterized. Here, we generate over 800 million full-length reads with long-read RNA-seq in 43 lymphoblastoid cell line samples from eight genetically diverse human populations and build a cross-ancestry gene annotation. We show that transcripts from non-European samples are underrepresented in reference gene annotations, leading to systematic biases in allele-specific transcript usage analyses. Furthermore, we show that personal genome assemblies enhance transcript discovery compared to the generic GRCh38 reference assembly, even though genomic regions unique to each individual are heavily depleted of genes. These findings underscore the urgent need for a more inclusive gene annotation framework that accurately represents global transcriptome diversity.

Publication types

  • Preprint