CheMLT-F: multitask learning in biochemistry through transformer fusion

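Computer science; Machine learning; Modular design; Workflow; Artificial intelligence; Chemical space; Discriminative model; Benchmark; Encoder; Bioinformatics; Transformer; Reuse; Multitask learning; Data mining; Labeled data; Fidelity; Training set; Task (project management); Generalizability theory; Drug discovery; Feature learning; Testbed; Distributed computing; Deep learning; Data stream mining; Overhead (engineering); Task analysis; Data modeling; Single point of failure; Benchmarking; Ensemble forecasting; Supervised learning; Extrapolation; Synthetic data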

Authors

Vladislav Mun, Siamac Fazli

Identifier

DOI: 10.6084/m9.figshare.c.8501611.v1

Abstract

Drug discovery remains a slow and costly process, in part because efficacy, toxicity, and physicochemical liabilities must be screened across a vast chemical space. Stand-alone, single-task predictors can help, but they lead to fragmented workflows and make it hard to reuse learned representations, data processing, and infrastructure across endpoints (i.e., prediction tasks). Here we present CheMLT-F, a compact multitask transformer that fuses encoders for molecular and protein sequences to learn a unified representation spanning 680+ endpoints, including diverse toxicities, physicochemical properties, and drug–target interactions. Across 13 public benchmarks, CheMLT-F matches state-of-the-art toxicity classifiers and sets new performance marks for physicochemical property prediction, while remaining competitive for drug–target affinity (KIBA and Davis). Moreover, CheMLT-F demonstrates competitive performance on an external protein-family benchmark spanning seven target superfamilies, indicating broad generalizability in bioactivity prediction. Multitask parameter sharing keeps the model lightweight and inference-efficient, and its modular heads make extensions to new endpoints straightforward. By replacing many individual models with a single, extensible backbone, CheMLT-F streamlines in silico screening and lowers the barrier to broad, data-driven decision-making in early drug discovery.

Scientific contribution

We introduce a unified transformer architecture that jointly models molecular and protein sequences across hundreds of pharmacologically relevant endpoints spanning toxicity, physicochemical properties, and drug–target interactions. A tailored training strategy that combines partial encoder freezing, global–local loss balancing, and weighted task sampling reduces trainable parameters and deployment complexity while preserving strong cross-domain generalization. Comprehensive evaluation across 13 public datasets, including scaffold-aware and random data splits, demonstrates competitive accuracy with substantially lower operational overhead than maintaining numerous single-task models, establishing a scalable foundation for extensible and holistic predictive modeling in computational drug discovery.
