Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration

Keywords: Computer Science · Artificial Intelligence · Multitask Learning · Machine Learning · Feature Learning · Deep Learning · Generalization · Feature Engineering · Encoder
Authors
Xiaochen Zhang, Chengkun Wu, Jiacai Yi, Xiangxiang Zeng, Canqun Yang, Aiping Lü, Tingjun Hou, Dongsheng Cao
Source
Journal: Research [American Association for the Advancement of Science]
Volume: 2022 · Cited by: 26
Identifier
DOI: 10.34133/research.0004
Abstract

Accurate prediction of the pharmacological properties of small molecules is becoming increasingly important in drug discovery. Traditional feature-engineering approaches rely heavily on handcrafted descriptors and/or fingerprints, which require extensive human expert knowledge. With the rapid progress of artificial intelligence technology, data-driven deep learning methods have shown unparalleled advantages over feature-engineering-based methods. However, existing deep learning methods usually suffer from the scarcity of labeled data and the inability to share information between different tasks when applied to predicting molecular properties, resulting in poor generalization capability. Here, we propose a novel multitask learning BERT (Bidirectional Encoder Representations from Transformers) framework, named MTL-BERT, which leverages large-scale pretraining, multitask learning, and SMILES (simplified molecular input line entry system) enumeration to alleviate the data scarcity problem. MTL-BERT first exploits a large amount of unlabeled data through self-supervised pretraining to mine the rich contextual information in SMILES strings, and then fine-tunes the pretrained model for multiple downstream tasks simultaneously by leveraging their shared information. Meanwhile, SMILES enumeration is used as a data augmentation strategy during the pretraining, fine-tuning, and test phases to substantially increase data diversity and help the model learn the key relevant patterns in complex SMILES strings. The experimental results showed that the pretrained MTL-BERT model, with little additional fine-tuning, can achieve much better performance than state-of-the-art methods on most of the 60 practical molecular datasets. Additionally, the MTL-BERT model leverages attention mechanisms to focus on the SMILES character features essential to target properties, aiding model interpretability.
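The self-supervised pretraining the abstract describes treats a SMILES string as a sequence of character tokens, corrupts part of the sequence, and trains the model to recover the original tokens from context. The sketch below illustrates only that masking step in plain Python; it is not the authors' implementation, and the single-character tokenizer, the `[MASK]`/`[PAD]` tokens, and the 15% masking rate are assumptions borrowed from the standard BERT recipe.

```python
import random

# Special tokens assumed for this sketch (not taken from the paper).
MASK, PAD = "[MASK]", "[PAD]"

def tokenize(smiles):
    """Split a SMILES string into single-character tokens.
    Real tokenizers also handle multi-character atoms like 'Cl' and 'Br'."""
    return list(smiles)

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """BERT-style corruption: replace a random fraction of tokens with
    [MASK]. Returns the corrupted sequence and per-position labels; PAD
    labels mark positions that are ignored in the training loss."""
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            labels.append(tok)   # the model must predict this token
        else:
            corrupted.append(tok)
            labels.append(PAD)   # unmasked position, no loss here
    return corrupted, labels

# Aspirin as an example input; a fixed seed makes the corruption reproducible.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
corrupted, labels = mask_tokens(tokenize(smiles), rng=random.Random(42))
```

During pretraining, many such corrupted sequences are fed to the Transformer encoder, and the cross-entropy loss is computed only at the masked positions; SMILES enumeration would additionally rewrite `smiles` into alternative valid forms before masking, multiplying the diversity of training sequences.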