Consistently processed RNA sequencing data from 50 sources enriched for pediatric data

Holly C Beale; Katrina Learned; Ellen T Kephart; A Geoffrey Lyle; Anouk van den Bout; Molly McCabe; Kathryn Echandia-Monroe; Mansi J Khare; Elise Y Huang; Sneha Jariwala; Reyna Antilla; Allison Cheney; Alex G Lee; Leanne C Sayles; Stanley G Leung; Yvonne A Vasquez; Lauren Sanders; David Haussler; Sofie R Salama; E Alejandro Sweet-Cordero; Olena M Vaske

doi:10.1038/s41597-025-05376-z

Sci Data. 2025 Jul 2;12(1):1134. doi: 10.1038/s41597-025-05376-z.

Authors

Holly C Beale^#¹,²; Katrina Learned^#³; Ellen T Kephart^#³; A Geoffrey Lyle⁴,³; Anouk van den Bout⁴,³; Molly McCabe⁴,³; Kathryn Echandia-Monroe⁴,³; Mansi J Khare⁴,³; Elise Y Huang⁴,³; Sneha Jariwala⁴,³; Reyna Antilla⁴,³; Allison Cheney⁴,³; Alex G Lee⁵; Leanne C Sayles⁵; Stanley G Leung⁶; Yvonne A Vasquez⁴,³; Lauren Sanders⁴,³; David Haussler³,⁷; Sofie R Salama⁴,³; E Alejandro Sweet-Cordero⁵; Olena M Vaske⁸,⁹

Affiliations

¹ Department of Molecular, Cell and Developmental Biology, University of California Santa Cruz, Santa Cruz, California, USA. hcbeale@ucsc.edu.
² Genomics Institute, University of California Santa Cruz, Santa Cruz, California, USA. hcbeale@ucsc.edu.
³ Genomics Institute, University of California Santa Cruz, Santa Cruz, California, USA.
⁴ Department of Molecular, Cell and Developmental Biology, University of California Santa Cruz, Santa Cruz, California, USA.
⁵ Division of Pediatric Oncology, University of California San Francisco, San Francisco, California, USA.
⁶ Division of Radiation Oncology, University of California San Francisco, San Francisco, California, USA.
⁷ Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, USA.
⁸ Department of Molecular, Cell and Developmental Biology, University of California Santa Cruz, Santa Cruz, California, USA. olena@ucsc.edu.
⁹ Genomics Institute, University of California Santa Cruz, Santa Cruz, California, USA. olena@ucsc.edu.

^# Contributed equally.

Abstract

Larger cohorts improve the power of tumor gene expression analysis, but the signal is muddied if datasets are processed using different methods or have inaccurate metadata. Here we present five compendia containing consistently processed gene expression data derived from 16,446 diverse RNA sequencing datasets. To create the compendia, we obtained access to RNA sequence data from repositories containing public data as well as clinical partners with access to non-published data. We then assessed the quality, quantified gene expression, harmonized clinical metadata, and released the expression values and metadata without access restrictions. These datasets have been used for diverse projects ranging from identifying similarities between tumor types to assessing how well cell lines recapitulate tumors. They have also been used for n-of-1 analysis to identify genes with unusual expression patterns in a single sample and to infer molecular diagnosis. The comparison to new data is enabled by our dockerized, freely available pipeline. The compendia have been cited in at least 20 publications.

Publication types

Dataset

MeSH terms

Child
Humans
Metadata
Neoplasms* / genetics
Sequence Analysis, RNA*
