Reference genome and annotation updates lead to contradictory prognostic predictions in gene expression signatures: a case study of resected stage I lung adenocarcinoma

Brief Bioinform. 2021 May 20;22(3):bbaa081. doi: 10.1093/bib/bbaa081.

Abstract

RNA-sequencing enables accurate and low-cost transcriptome-wide detection. However, expression estimates vary as reference genomes and gene annotations are updated, confounding existing expression-based prognostic signatures. Herein, a prognostic 9-gene pair signature (9-GPS) was applied to 197 patients with stage I lung adenocarcinoma, using the previous and latest versions of data from The Cancer Genome Atlas (TCGA), which were processed with different reference genomes and annotations. For 9-GPS, 6.6% of patients exhibited discordant risk classifications between the two TCGA versions. Similar results were observed for other prognostic signatures, including IRGPI, 15-gene and ORACLE. We found that conflicting annotations for gene length and overlap were the major cause of these discordant risk classifications. Therefore, we constructed a prognostic 40-GPS based on genes that were stable across GENCODE v20-v30 and validated it using public data from 471 stage I samples (log-rank P < 0.0010). Risk classification remained stable when RNA-sequencing data were processed with the newest GENCODE v32 rather than GENCODE v20-v30. In addition, 40-GPS could predict survival for 30 stage I samples from formalin-fixed paraffin-embedded tissues (log-rank P = 0.0177). In conclusion, this method overcomes the vulnerability of existing prognostic signatures to reference genome and annotation updates. 40-GPS may offer individualized clinical applications owing to its prognostic accuracy and classification stability.
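
To make the relative expression order idea concrete, the following minimal Python sketch classifies a sample by comparing expression within gene pairs and then measures how often the classification changes between two processings of the same samples. The majority-vote rule, the 0.5 threshold, the gene names (G1 to G4), the sample names and all expression values are assumptions chosen for illustration; they are not taken from the study, and the abstract does not specify the decision rule used by 9-GPS or 40-GPS.

```python
from typing import Dict, List, Tuple

GenePair = Tuple[str, str]
Expression = Dict[str, float]


def classify_risk(expr: Expression, pairs: List[GenePair],
                  threshold: float = 0.5) -> str:
    """Assumed rule: a pair votes high-risk when gene A > gene B;
    the sample is high-risk if the vote fraction exceeds the threshold."""
    votes = sum(1 for a, b in pairs if expr[a] > expr[b])
    return "high" if votes / len(pairs) > threshold else "low"


def discordance_rate(old: Dict[str, Expression],
                     new: Dict[str, Expression],
                     pairs: List[GenePair]) -> float:
    """Fraction of shared samples whose risk class differs between
    two processings (e.g. two annotation versions) of the same data."""
    shared = sorted(set(old) & set(new))
    changed = sum(
        1 for s in shared
        if classify_risk(old[s], pairs) != classify_risk(new[s], pairs)
    )
    return changed / len(shared)


# Hypothetical signature and expression values (illustration only).
pairs = [("G1", "G2"), ("G3", "G4"), ("G1", "G4")]
old_version = {
    "S1": {"G1": 5.0, "G2": 3.0, "G3": 2.0, "G4": 1.0},
    "S2": {"G1": 1.0, "G2": 4.0, "G3": 1.0, "G4": 3.0},
}
new_version = {
    # S1: re-annotation flips two within-pair orderings.
    "S1": {"G1": 5.0, "G2": 6.0, "G3": 0.5, "G4": 1.0},
    # S2: values shift slightly but all orderings are preserved.
    "S2": {"G1": 1.2, "G2": 3.9, "G3": 1.1, "G4": 2.8},
}

rate = discordance_rate(old_version, new_version, pairs)
print(f"Discordant classifications: {rate:.1%}")
```

In this toy example, sample S2 keeps its classification because only the within-sample ordering of each pair matters, whereas S1 changes class because the re-processed values reverse two of the three orderings. This is the kind of instability that restricting a signature to genes with stable annotations is intended to reduce.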

Keywords: RNA-sequencing; conflicting annotation; discordant risk classification; lung adenocarcinoma; relative expression order.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adenocarcinoma / genetics
  • Adenocarcinoma / pathology*
  • Adenocarcinoma / surgery
  • Formaldehyde
  • Gene Expression Profiling*
  • Humans
  • Kaplan-Meier Estimate
  • Lung Neoplasms / genetics
  • Lung Neoplasms / pathology*
  • Lung Neoplasms / surgery
  • Molecular Sequence Annotation
  • Paraffin Embedding
  • Prognosis
  • Sequence Analysis, RNA / methods
  • Tissue Fixation
  • Transcriptome

Substances

  • Formaldehyde