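Computer science
Inference
Artificial intelligence
Context (archaeology)
Natural language processing
Task (project management)
Language understanding
Language model
Causal inference
Leakage (economics)
Machine learning
Training set
Exploit
Sequence (biology)
Speech recognition
Synthetic data
Experimental data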
Authors
Joseph Szymborski,Amin Emad
Identifier
DOI:10.1038/s42256-025-01176-7
Abstract
With the growing pervasiveness of pretrained protein language models (pLMs), pLM-based methods are increasingly being put forward for the protein–protein interaction (PPI) inference task. Here we identify and confirm that existing pretrained pLMs are a source of data leakage for the downstream PPI task. We characterize the extent of the data leakage problem by training small and efficient pLMs on a dataset that controls for data leakage (strict) and on one that does not (non-strict), and comparing the two. Although data leakage from pretrained pLMs causes a measurable inflation of testing scores, we find that this does not necessarily extend to other, non-paired biological tasks such as protein keyword annotation. Further, we find no connection between the context lengths of pLMs and the performance of pLM-based PPI inference methods on proteins with sequence lengths that exceed them. We also show that pLM-based and non-pLM-based models fail to generalize in tasks such as prediction of human–SARS-CoV-2 PPIs or of the effect of point mutations on binding affinities. This study demonstrates the importance of extending existing protocols for the evaluation of pLM-based models applied to paired biological datasets and identifies areas of weakness of current pLMs.
The usage of pretrained protein language models (pLMs) is rapidly growing. However, Szymborski and Emad find that pretrained pLMs can be a source of data leakage in the task of protein–protein interaction inference, leading to inflated performance scores.