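Computer science
Representation (politics)
Benchmark (surveying)
Biomedicine
Artificial intelligence
Field (mathematics)
Machine learning
Function (biology)
Protein structure prediction
Benchmarking
Deep learning
Protein structure
Bioinformatics
Biology
Mathematics
Biochemistry
Geography
Business
Law
Pure mathematics
Political science
Politics
Evolutionary biology
Marketing
Geodesy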
Authors
Serbülent Ünsal,Heval Ataş,Muammer Albayrak,Kemal Turhan,Aybar C. Acar,Tunca Doğan
Identifier
DOI:10.1038/s42256-022-00457-9
Abstract
Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence–structure–function relationships. In this study we conducted a detailed investigation of protein representation learning by first categorizing and explaining each approach, and subsequently benchmarking their performance in predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein–protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method in light of the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers to apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.