Codon usage bias
Computational biology
Encoding
Genetic code
Coding region
Start codon
Biology
Sequence (biology)
Messenger RNA
Genetics
Computer science
Genome
Gene
Authors
Marjan Faizi,Helen Sakharova,Liana F. Lareau
Identifier
DOI:10.1101/2025.05.13.653614
Abstract
The genetic code allows multiple synonymous codons to encode the same amino acid, creating a vast sequence space for protein-coding regions. Codon choice can impact mRNA function and protein output, a consideration newly relevant with advances in mRNA technology. Genomes preferentially use some codons, but simple optimization methods that select preferred codons miss complex contextual patterns. We present Trias, an encoder-decoder language model trained on millions of eukaryotic coding sequences. Trias learns codon usage rules directly from sequence data, integrating local and global dependencies to generate species-specific codon sequences that align with biological constraints. Without explicit training on protein expression, Trias generates sequences and scores that correlate strongly with experimental measurements of mRNA stability, ribosome load, and protein output. The model outperforms commercial codon optimization tools in generating sequences resembling high-expression codon sequence variants. By modeling codon usage in context, Trias offers a data-driven framework for synthetic mRNA design and for understanding the molecular and evolutionary principles behind codon choice.
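For orientation, the "simple optimization" baseline that the abstract contrasts with Trias can be sketched as follows: count codon usage in a set of coding sequences, then back-translate a protein by always choosing the most frequent synonymous codon. This is a minimal illustration, not the Trias model and not any commercial tool. The function names and the two toy coding sequences are hypothetical and exist only to make the example runnable. Only the standard genetic code table is taken as fact.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codons(coding_sequences):
    """Count in-frame codons across a collection of coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[seq[i:i + 3]] += 1
    return counts


def preferred_codons(counts):
    """Pick the most frequent codon for each amino acid (first wins on ties)."""
    best = {}
    for codon, aa in CODON_TO_AA.items():
        if aa not in best or counts[codon] > counts[best[aa]]:
            best[aa] = codon
    return best


def naive_optimize(protein, best):
    """Back-translate a protein one residue at a time, ignoring all context."""
    return "".join(best[aa] for aa in protein)


# Hypothetical toy coding sequences, used only to derive usage counts.
toy_cds = ["ATGCTGAAGCTGGGCTAA", "ATGAAGCTGGGCCTGTGA"]
counts = count_codons(toy_cds)
best = preferred_codons(counts)
optimized = naive_optimize("MLKG", best)
print(optimized)
```

Because each residue is mapped independently, this baseline cannot represent the local and global dependencies that the abstract says Trias integrates; every leucine receives the same codon regardless of its neighbors or its position in the transcript.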