Arabidopsis
Genome
Computational biology
Genomics
Biology
Computer science
RNA splicing
DNA sequencing
Artificial intelligence
Genetics
DNA
Gene
Mutant
Authors
Jingjing Zhai, Aaron Gokaslan, Yair Schiff, Ana Berthel, Zong-Yan Liu, Wei-Yun Lai, Zachary Miller, Armin Scheben, Michelle C. Stitzer, M. Cinta Romay, Edward S. Buckler, Volodymyr Kuleshov
Identifier
DOI:10.1073/pnas.2421738122
Abstract
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pretrained on large-scale biological sequences can capture evolutionary conservation and, when fine-tuned on limited labeled data, offer better cross-species prediction than supervised models. We introduce PlantCaduceus, a plant DNA LM that learns evolutionary conservation patterns in 16 angiosperm genomes by modeling both DNA strands simultaneously. When fine-tuned on a small set of labeled Arabidopsis data for tasks such as predicting translation initiation/termination sites and splice donor/acceptor sites, PlantCaduceus demonstrated remarkable transferability to maize, which diverged 160 Mya. The model outperformed the best existing DNA language model by 1.45-fold in maize splice donor prediction and 7.23-fold in maize translation initiation site prediction. In variant effect prediction, PlantCaduceus showed performance comparable to state-of-the-art protein LMs. Mutations predicted to be deleterious by PlantCaduceus showed threefold lower average minor allele frequencies compared to those identified by multiple sequence alignment-based methods. Additionally, PlantCaduceus successfully identified well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
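The abstract says the model handles both DNA strands simultaneously but does not describe how. As a rough illustration of the general idea only, the Python sketch below computes a reverse complement and averages a score over the forward and reverse-complement strands, so that the result does not depend on which strand is supplied. The names `reverse_complement`, `strand_symmetric_score`, and `toy_score` are hypothetical, and the averaging scheme is an assumption for illustration, not the PlantCaduceus architecture.

```python
from typing import Callable

# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_symmetric_score(seq: str, score_fn: Callable[[str], float]) -> float:
    """Average a score over both strands (illustrative assumption).

    The result is identical for a sequence and its reverse complement,
    because the two terms simply swap places.
    """
    return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a model: the fraction of adenines."""
    return seq.upper().count("A") / len(seq)


if __name__ == "__main__":
    forward = "AACG"
    reverse = reverse_complement(forward)
    print(reverse)
    print(strand_symmetric_score(forward, toy_score))
    print(strand_symmetric_score(reverse, toy_score))
```

The toy score on its own is strand dependent, while the averaged score is the same whichever strand is given; that invariance is the property this sketch is meant to show.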