Larger cohorts improve the power of tumor gene expression analysis, but the signal is muddied if datasets are processed using different methods or have inaccurate metadata. Here we present five compendia containing consistently processed gene expression data derived from 16,446 diverse RNA sequencing datasets. To create the compendia, we obtained access to RNA sequence data from repositories containing public data as well as clinical partners with access to non-published data. We then assessed the quality, quantified gene expression, harmonized clinical metadata, and released the expression values and metadata without access restrictions. These datasets have been used for diverse projects ranging from identifying similarities between tumor types to assessing how well cell lines recapitulate tumors. They have also been used for n-of-1 analysis to identify genes with unusual expression patterns in a single sample and to infer molecular diagnosis. The comparison to new data is enabled by our dockerized, freely available pipeline. The compendia have been cited in at least 20 publications.
© 2025. The Author(s).