Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS
Tandem
Computer science
Artificial intelligence
Engineering
Aerospace engineering
Authors
Roman Bushuiev,Anton Bushuiev,Raman Samusevich,Corinna Brungs,Josef Šivic,Tomáš Pluskal
Identifier
DOI:10.26434/chemrxiv-2023-kss3r-v4
Abstract
Characterizing biological and environmental samples at the molecular level relies primarily on tandem mass spectrometry (MS/MS), yet interpreting tandem mass spectra from untargeted metabolomics experiments remains a challenge. Existing computational methods for predictions from mass spectra rely on limited spectral libraries and on hard-coded human expertise. Here, we introduce a transformer-based neural network pre-trained in a self-supervised way on millions of unannotated tandem mass spectra from our GNPS Experimental Mass Spectra (GeMS) dataset, mined from the MassIVE GNPS repository. We show that pre-training our model to predict masked spectral peaks and chromatographic retention orders leads to the emergence of rich representations of molecular structures, which we name Deep Representations Empowering the Annotation of Mass Spectra (DreaMS). Further fine-tuning the neural network yields state-of-the-art performance across a variety of tasks. We make our new dataset and model available to the community and release the DreaMS Atlas, a molecular network of 201 million MS/MS spectra constructed using DreaMS annotations.
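
The masked-peak objective mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `mask_peaks`, the sentinel `MASK_MZ`, the uniform random choice of peaks, the 30% default fraction, and the example peak values are all assumptions made for illustration. The idea is only that some m/z values are hidden from the model input and kept aside as prediction targets.

```python
import random

MASK_MZ = -1.0  # hypothetical sentinel marking a hidden m/z value


def mask_peaks(peaks, mask_fraction=0.3, seed=None):
    """Hide the m/z of a random subset of peaks (illustrative sketch only).

    peaks: list of (mz, intensity) tuples.
    Returns (masked_peaks, targets), where targets maps the index of each
    masked peak to its original m/z value.
    """
    if not peaks:
        return [], {}
    rng = random.Random(seed)
    # Assumption: peaks are chosen uniformly at random, at least one per spectrum.
    n_mask = max(1, round(len(peaks) * mask_fraction))
    masked_idx = set(rng.sample(range(len(peaks)), n_mask))
    masked_peaks = [
        (MASK_MZ, intensity) if i in masked_idx else (mz, intensity)
        for i, (mz, intensity) in enumerate(peaks)
    ]
    targets = {i: peaks[i][0] for i in masked_idx}
    return masked_peaks, targets


# Made-up example spectrum: (m/z, relative intensity) pairs.
spectrum = [(85.03, 0.12), (127.04, 1.00), (145.05, 0.45), (163.06, 0.30), (181.07, 0.08)]
masked, targets = mask_peaks(spectrum, mask_fraction=0.4, seed=0)
```

A model trained on this kind of input would receive `masked` and be scored on how well it recovers the values in `targets`. Intensities are left visible here, which is also an assumption of the sketch.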