Alignment-free viral sequence classification at scale

BMC Genomics. 2025 Apr 18;26(1):389. doi: 10.1186/s12864-025-11554-5.

Abstract

Background: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-free (AF) methods offer a scalable alternative to traditional alignment-based approaches such as BLAST. This study evaluates alignment-free methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.

Results: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3,502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
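
As an illustration of the feature-extraction step described above, the sketch below turns a nucleotide sequence into a fixed-length vector of normalized k-mer frequencies, a common word-based AF representation. The abstract does not name the six techniques or their parameters, so the function name `kmer_frequency_vector`, the default `k=3`, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import product


def kmer_frequency_vector(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    Hypothetical helper for illustration: the name, default k, and the
    handling of ambiguous bases are assumptions, not taken from the paper.
    """
    # All 4**k possible words over the alphabet, in a fixed order, so every
    # sequence maps to a vector of the same length regardless of its own length.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]

    seq = sequence.upper()
    # Slide a window of width k across the sequence and count each word.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Only words made of unambiguous bases contribute; windows containing
    # symbols such as N are ignored (an assumption of this sketch).
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


if __name__ == "__main__":
    vec = kmer_frequency_vector("ACGTACGTNNACGT", k=2)
    print(len(vec), round(sum(vec), 6))
```

Vectors produced this way for many genomes could be stacked into a matrix and passed, together with lineage labels, to a Random Forest implementation such as scikit-learn's `RandomForestClassifier`; that step is omitted here to keep the sketch dependency-free.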

Conclusion: Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing than alignment-based methods and the ability to classify sequences using modest computational resources.

Keywords: Alignment-free; Biological sequences; Feature extraction; Machine learning; Virus classification.

MeSH terms

  • Algorithms
  • COVID-19 / virology
  • Computational Biology* / methods
  • Dengue Virus / genetics
  • Genome, Viral*
  • HIV / genetics
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • SARS-CoV-2* / classification
  • SARS-CoV-2* / genetics
  • Sequence Alignment