Improving the Reliability of Molecular String Representations for Generative Chemistry

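String (physics), Generative grammar, Reliability (semiconductor), Computer science, Chemistry, Computational chemistry, Computational biology, Artificial intelligence, Physics, Theoretical physics, Biology, Thermodynamics, Power (physics)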
Authors
Etienne Reboul,Zoe Wefers,Harish Prabakaran,Jérôme Waldispühl,Antoine Taly
Source
Journal: Journal of Chemical Information and Modeling [American Chemical Society]
Identifier
DOI: 10.1021/acs.jcim.4c02261
Abstract

Generative modeling for chemistry has advanced rapidly in recent years, but this surge in popularity raises a foundational question: which molecular representation is best suited for modern machine learning models? Despite not being designed for generative tasks, SMILES remains the most commonly used string-based representation. However, while SMILES follows strict syntactic rules, grammatically correct SMILES strings do not always correspond to valid molecules. SELFIES, an alternative grammar, addresses this limitation by ensuring that every string of SELFIES tokens represents a valid molecule. In this study, we comprehensively evaluate the limitations of both SMILES and SELFIES as representations for generative models. We analyze two key criteria for robust molecular generation: viability, which means that generated strings represent novel, unique molecules with correct valence, and fidelity, where the distribution of physicochemical properties from sampled molecules resembles that of the training data. We find that approximately one-fifth of the molecules generated using RDKit default canonical SMILES are invalid, failing the viability criterion. In contrast, all SELFIES-generated molecules are viable, but they deviate significantly from the training distribution, indicating low fidelity. To address these limitations, we develop data augmentation procedures for both representations. While simplifying the SELFIES grammar yields only modest gains in fidelity, our stochastic augmentation method for SMILES, ClearSMILES, significantly improves both viability and fidelity. ClearSMILES simplifies syntax by reducing the vocabulary size and explicitly encoding aromaticity via Kekule SMILES, making the string representations easier for models to process. Using ClearSMILES, the rate of invalid samples decreases by an order of magnitude, from 20 to 2.2%, and fidelity to the training distribution is also moderately improved.
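
As a rough illustration of the viability criterion described in the abstract (valid, unique, and novel samples), the following minimal sketch tallies those fractions for a batch of generated strings. The names `viability_report` and `toy_is_valid` are hypothetical and do not come from the paper. The validity check here is only a syntactic placeholder (balanced parentheses, paired ring-closure digits) and does not verify valence; a real evaluation would presumably rely on a cheminformatics toolkit's parser instead. The sketch also assumes all strings are already in one canonical form, so that string equality can stand in for molecule identity.

```python
from typing import Callable, Dict, Iterable


def toy_is_valid(smiles: str) -> bool:
    """Placeholder validity check (an assumption for illustration only).

    Checks balanced parentheses and paired single-digit ring closures.
    This is a syntax check, NOT a valence check.
    """
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            # A digit opens a ring the first time and closes it the second.
            open_rings ^= {ch}
    return depth == 0 and not open_rings


def viability_report(
    samples: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Return the fractions of samples that are valid, unique, and viable.

    "viable" counts distinct valid strings that are absent from the
    training set, divided by the total number of samples.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("no samples to evaluate")
    train = set(training_set)
    valid = [s for s in samples if is_valid(s)]
    unique = set(valid)
    novel = unique - train
    n = len(samples)
    return {
        "valid": len(valid) / n,
        "unique": len(unique) / n,
        "viable": len(novel) / n,
    }


# Toy data, invented purely for this example.
samples = ["CCO", "CCO", "C1CCCCC1", "C1CCC", "CC(C", "CCN"]
training = ["CCO"]
report = viability_report(samples, training, toy_is_valid)
print(report)
```

Passing the validity check in as a callable keeps the tallying logic independent of whichever parser is used, so the placeholder can be swapped for a toolkit-backed check without touching `viability_report`.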