Facilitate integrated analysis of single cell multiomic data by binarizing gene expression values

Nat Commun. 2025 Jul 1;16(1):5763. doi: 10.1038/s41467-025-60899-8.

Abstract

A cell type's identity can be revealed by its transcriptome and epigenome profiles, both of which can be in flux temporally and spatially, leading to distinct cell states or subtypes. The standard workflow for single cell RNA-seq (scRNA-seq) data analysis applies feature selection, dimensionality reduction, and clustering to gene expression values quantified by read counts, but alternative approaches that simply classify each gene as "on" or "off" (i.e., binarization of gene expression) have been proposed for clustering cells and for other downstream analyses. Here, we demonstrate that, when the two data modalities are collected using a paired multiomic technology, a direct concatenation of the binarized scRNA-seq data and the standard single cell ATAC-seq (scATAC-seq) data is sufficient and effective for vertically integrated clustering analysis, after applying term frequency-inverse document frequency (TF-IDF) and singular value decomposition (also called latent semantic indexing, LSI) algorithms to the combined data. This proposed approach avoids the need to convert scATAC-seq data to gene activity scores for combined analysis. Furthermore, it enables a direct investigation into the contribution of each data type to resolving cell type or subtype identity.
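The workflow described above (binarize, concatenate, TF-IDF, SVD) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the "count > 0" binarization rule, the particular TF-IDF variant, the toy matrices, and the function names `binarize`, `tfidf`, and `lsi_embed` are all assumptions made for illustration.

```python
import numpy as np


def binarize(counts):
    """Mark a feature as "on" (1.0) when its count is above zero (assumed rule)."""
    return (np.asarray(counts) > 0).astype(float)


def tfidf(binary):
    """One common TF-IDF variant for a cells x features binary matrix (assumed)."""
    # Term frequency: each cell's row is scaled by its number of "on" features.
    tf = binary / np.maximum(binary.sum(axis=1, keepdims=True), 1.0)
    # Inverse document frequency: features "on" in fewer cells get more weight.
    n_cells = binary.shape[0]
    idf = np.log1p(n_cells / np.maximum(binary.sum(axis=0), 1.0))
    return tf * idf


def lsi_embed(rna_counts, atac_counts, n_components=2):
    """Concatenate binarized modalities, weight with TF-IDF, reduce with SVD."""
    # Rows must be the same cells in the same order (paired multiomic data).
    combined = np.hstack([binarize(rna_counts), binarize(atac_counts)])
    u, s, _ = np.linalg.svd(tfidf(combined), full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy paired data: 4 cells, 4 genes (RNA) and 5 peaks (ATAC); values are made up.
rna = np.array([[5, 0, 2, 0],
                [3, 0, 1, 0],
                [0, 4, 0, 7],
                [0, 2, 0, 1]])
atac = np.array([[1, 1, 0, 0, 0],
                 [2, 1, 0, 0, 0],
                 [0, 0, 1, 3, 1],
                 [0, 0, 0, 1, 1]])

embedding = lsi_embed(rna, atac, n_components=2)
```

The resulting `embedding` (one row per cell) would then be passed to whatever clustering method is preferred. Because genes and peaks sit in separate column blocks of the combined matrix, each block can also be dropped or examined on its own to gauge how much each modality contributes.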

MeSH terms

  • Algorithms
  • Animals
  • Cluster Analysis
  • Gene Expression Profiling* / methods
  • Humans
  • Mice
  • RNA-Seq / methods
  • Sequence Analysis, RNA / methods
  • Single-Cell Analysis* / methods
  • Transcriptome* / genetics