Computer Science
Natural Language Processing
Language Models
Artificial Intelligence
Unified Medical Language System
Programming Languages
Authors
Xiaoqiang Tang, Andrew Tran, Jeffrey Too Chuan Tan, Mark Gerstein
Source
Journal: Bioinformatics [Oxford University Press]
Date: 2024-06-28
Volume/issue: 40 (Supplement_1): i357-i368
Identifiers
DOI: 10.1093/bioinformatics/btae260
Abstract
Motivation: The current paradigm of deep learning models for the joint representation of molecules and text relies primarily on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus limits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain.

Results: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal, MolLM demonstrates robust molecular representation capabilities across four downstream tasks: cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance on these downstream tasks.

Availability and implementation: Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.
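The abstract states that contrastive learning between the text encoder's and molecular encoder's outputs serves as the pre-training signal. The sketch below illustrates one common form of such a cross-modal objective, a symmetric InfoNCE loss over paired embeddings; the function name, tensor shapes, and temperature value are assumptions for illustration, not MolLM's actual implementation.

```python
# Illustrative sketch of a cross-modal contrastive objective of the kind the
# abstract describes; names, shapes, and hyperparameters here are hypothetical,
# not taken from the MolLM codebase.
import torch
import torch.nn.functional as F

def contrastive_loss(mol_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired molecule/text embeddings.

    mol_emb, text_emb: (batch, dim) outputs of the molecular and text
    Transformer encoders; row i of each tensor comes from the same
    molecule-text pairing.
    """
    mol = F.normalize(mol_emb, dim=-1)    # unit vectors -> dot product = cosine similarity
    txt = F.normalize(text_emb, dim=-1)
    logits = mol @ txt.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(mol.size(0))   # matched pairs lie on the diagonal
    # Pull matched pairs together and push mismatched pairs apart,
    # averaged over both the molecule-to-text and text-to-molecule directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random tensors standing in for encoder outputs.
mol_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
print(contrastive_loss(mol_emb, text_emb).item())
```

In this formulation, each molecule-text pair in the batch acts as a positive example while every other pairing in the same batch serves as an in-batch negative.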