Computer science
Benchmarking
Benchmark (surveying)
RNA (ribonucleic acid)
Generalization
Machine learning
Nucleic acid secondary structure
Artificial intelligence
Data science
Data mining
Theoretical computer science
Biology
Marketing
Business
Mathematical analysis
Biochemistry
Mathematics
Geodesy
Gene
Geography
Authors
Luciano I Zablocki,Leandro A. Bugnon,M. Gérard,Leandro E. Di Persia,Georgina Stegmayer,Diego H. Milone
Abstract
In recent years, inspired by the success of large language models (LLMs) for DNA and proteins, several LLMs for RNA have also been developed. These models take massive RNA datasets as input and learn, in a self-supervised way, to represent each RNA base with a semantically rich numerical vector, under the hypothesis that high-quality RNA representations can enhance data-costly downstream tasks such as the fundamental RNA secondary structure prediction problem. However, existing RNA-LLMs have not been evaluated for this task in a unified experimental setup. Since they are pretrained models, assessing their generalization capabilities on new structures is crucial, yet this has been only partially addressed in the literature. In this work we present a comprehensive experimental and comparative analysis of recently proposed pretrained RNA-LLMs. We evaluate the use of their representations for the secondary structure prediction task with a common deep learning architecture, assessing the RNA-LLMs on benchmark datasets of increasing generalization difficulty. Results show that two LLMs clearly outperform the other models, and reveal significant challenges for generalization in low-homology scenarios. We also provide curated benchmark datasets of increasing complexity and a unified experimental setup for this scientific endeavor; source code and datasets are available in the repository: https://github.com/sinc-lab/rna-llm-folding/.
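To make the setup concrete, the sketch below shows one generic way per-base embeddings from an RNA-LLM could feed a secondary structure prediction head: embeddings of every position pair (i, j) are combined and scored, yielding a symmetric base-pairing probability matrix. This is an illustrative minimal sketch only, not the architecture used in the paper; the function `pairing_scores` and the weights `W`, `b` are hypothetical stand-ins for a trained prediction head.

```python
import numpy as np

def pairing_scores(emb, W, b):
    """Score all candidate base pairs from per-base embeddings.

    emb: (L, d) array — one embedding vector per RNA base, as an
         RNA-LLM might produce (shapes are illustrative assumptions).
    W:   (2*d,) weight vector of a toy linear scoring head.
    b:   scalar bias.
    Returns an (L, L) symmetric matrix of pairing probabilities.
    """
    L, d = emb.shape
    # Pairwise features: concatenate the embeddings of positions i and j.
    feat = np.concatenate(
        [np.repeat(emb[:, None, :], L, axis=1),   # (L, L, d) — position i
         np.repeat(emb[None, :, :], L, axis=0)],  # (L, L, d) — position j
        axis=-1,
    )                                             # (L, L, 2d)
    logits = feat @ W + b                         # linear head -> (L, L)
    logits = 0.5 * (logits + logits.T)            # symmetrize: (i, j) == (j, i)
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid -> probabilities

# Toy usage: an 8-base sequence with 16-dim embeddings and random head weights.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
probs = pairing_scores(emb, rng.normal(size=32), 0.0)
```

In practice the linear head would be replaced by the paper's common deep learning architecture and trained on reference structures; the point here is only the data flow from per-base vectors to an L×L pairing matrix.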