A generative language model decodes contextual constraints on codon choice for mRNA design

bioRxiv [Preprint]. 2025 Jun 6:2025.05.13.653614. doi: 10.1101/2025.05.13.653614.

Abstract

The genetic code allows multiple synonymous codons to encode the same amino acid, creating a vast sequence space for protein-coding regions. Codon choice can impact mRNA function and protein output, a consideration newly relevant with advances in mRNA technology. Genomes preferentially use some codons, but simple optimization methods that select preferred codons miss complex contextual patterns. We present Trias, an encoder-decoder language model trained on millions of eukaryotic coding sequences. Trias learns codon usage rules directly from sequence data, integrating local and global dependencies to generate species-specific codon sequences that align with biological constraints. Without explicit training on protein expression, Trias generates sequences and scores that correlate strongly with experimental measurements of mRNA stability, ribosome load, and protein output. The model outperforms commercial codon optimization tools in generating sequences resembling high-expression codon sequence variants. By modeling codon usage in context, Trias offers a data-driven framework for synthetic mRNA design and for understanding the molecular and evolutionary principles behind codon choice.
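To give a sense of how large the synonymous sequence space mentioned above is, the following minimal sketch counts the codon sequences that encode a given peptide under the standard genetic code. It is an illustration only and is not part of Trias; the names `CODON_TABLE`, `SYNONYMS`, and `count_synonymous_sequences` are hypothetical helpers introduced here.

```python
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), written in the
# conventional T, C, A, G ordering of first, second, and third base.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid ('*' marks a stop codon).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def count_synonymous_sequences(protein: str) -> int:
    """Return the number of distinct codon sequences encoding `protein`.

    Each position contributes independently, so the total is the product
    of the number of synonymous codons at each residue. Raises KeyError
    for characters that are not standard one-letter amino acid codes.
    """
    return prod(len(SYNONYMS[aa]) for aa in protein)


if __name__ == "__main__":
    peptide = "MKL"  # Met (1 codon), Lys (2 codons), Leu (6 codons)
    print(peptide, count_synonymous_sequences(peptide))
```

Because the count is a product over residues, it grows exponentially with protein length. A method that picks the single most frequent codon at each residue returns only one point in this space and ignores any dependence between positions, which is the limitation the abstract attributes to simple optimization methods.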

Publication types

  • Preprint