MEM-based pangenome indexing for k-mer queries

Stephen Hwang; Nathaniel K Brown; Omar Y Ahmed; Katharine M Jenike; Sam Kovaka; Michael C Schatz; Ben Langmead

doi:10.1186/s13015-025-00272-y

Algorithms Mol Biol. 2025 Mar 1;20(1):3. doi: 10.1186/s13015-025-00272-y.

Authors

Stephen Hwang¹, Nathaniel K Brown², Omar Y Ahmed², Katharine M Jenike², Sam Kovaka², Michael C Schatz², Ben Langmead³

Affiliations

¹ XDBio Program, Johns Hopkins University, Baltimore, MD, USA.
² Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
³ Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. langmea@cs.jhu.edu.

Abstract

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO supports two kinds of query: membership queries, which test k-mer presence/absence, and conservation queries, which count the number of genomes containing each k-mer in a window. MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5× faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.

Keywords: Comparative genomics; Compressed indexing; Pangenomics.
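
The two query types named in the abstract can be illustrated with a deliberately naive Python sketch. This is not MEMO's MEM-based index or its algorithm: it stores plain k-mer sets, which is the fixed-k approach the abstract contrasts MEMO with. It is included only to make the query definitions concrete. The function names, the toy sequences, and the choice to ignore reverse complements are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def membership_query(window, genome, k):
    """For each k-mer start position in window, report whether that k-mer occurs in genome."""
    present = kmers(genome, k)
    return [window[i:i + k] in present for i in range(len(window) - k + 1)]


def conservation_query(window, genomes, k):
    """For each k-mer start position in window, count the genomes containing that k-mer."""
    counts = [0] * max(len(window) - k + 1, 0)
    for genome in genomes:
        for i, hit in enumerate(membership_query(window, genome, k)):
            counts[i] += hit  # True adds 1, False adds 0
    return counts


# Hypothetical toy data: a query window and three "genomes".
window = "ACGTAC"
genomes = ["ACGTTT", "TTACGT", "GGGTAC"]

print(membership_query(window, genomes[0], 3))  # [True, True, False, False]
print(conservation_query(window, genomes, 3))   # [2, 2, 1, 2]
```

Because the k-mer sets are rebuilt for each value of k, this baseline has exactly the k-dependence that the abstract says a single MEMO index avoids.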
