Fusing Domain Knowledge with a Fine-Tuned Large Language Model for Enhanced Molecular Property Prediction

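Keywords: Computer science; Property (philosophy); Domain (mathematical analysis); Artificial intelligence; Natural language processing; Data science; Mathematics; Epistemology; Mathematical analysis; Philosophy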

Authors

Liangxu Xie, Ying-Di Jin, Lei Xu, Shan Chang, Xiaojun Xu

Source

Journal: Journal of Chemical Theory and Computation [American Chemical Society]
Date: 2025-07-09 Volume (issue), pages: 21 (14): 6743-6758 Citations: 4

Links

nih.gov, doi.org

Identifiers

DOI: 10.1021/acs.jctc.5c00605

Abstract

Although large language models (LLMs) have flourished in various scientific applications, their applications in the specific task of molecular property prediction have not reached a satisfactory level, even for the specific chemistry LLMs. This work addresses a highly crucial and significant challenge existing in the field of drug discovery: accurately predicting the molecular properties by effectively leveraging LLMs enhanced with profound domain knowledge. We propose a Knowledge-Fused Large Language Model for dual-Modality (KFLM2) learning for molecular property prediction. The aim is to utilize the capabilities of advanced LLMs, strengthened with specialized knowledge in the field of drug discovery. We identified DeepSeek-R1-Distill-Qwen-1.5B as the optimal base model from three DeepSeek-R1 distilled LLMs and one chemistry LLM named ChemDFM, by fine-tuning with the ZINC and ChEMBL datasets. We obtained the SMILES embeddings from the fine-tuned model and subsequently integrated the embeddings with the molecular graph to leverage complementary information for predicting molecular properties. Finally, we trained the hybrid neural network on the combined dual modality inputs and predicted the molecular properties. Through benchmarking on regression and classification tasks, our proposed method can obtain higher prediction performance for nine out of ten datasets in the downstream regression and classification tasks. Visualization of the output of hidden layers indicates that the combination of the embedding with the molecular graph can offer complementary information to further improve the prediction accuracy compared with either the LLM embedding or the molecular graph inputs. Larger models do not inherently guarantee superior performance; instead, their effectiveness hinges on our ability to leverage relevant knowledge from both pretraining and fine-tuning. Implementing LLMs with domain knowledge would be a rational approach to making precise predictions that could potentially revolutionize the process of drug development and discovery.
