OPERA models for predicting physicochemical properties and environmental fate endpoints

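Quantitative structure–activity relationship; Applicability domain; Molecular descriptor; Computer science; Workflow; Data mining; Chemical database; Test set; Set (abstract data type); Predictive modelling; Machine learning; Database; Chemistry; Organic chemistry; Programming language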

Authors

Kamel Mansouri, Chris Grulke, Richard Judson, Antony Williams

Source

Journal: Journal of Cheminformatics [BioMed Central]
Date: 2018-03-08 Volume (issue): 10 (1) Citations: 439

Links

biomedcentral.com osti.gov doaj.org europepmc.org nih.gov doi.org

Identifiers

DOI: 10.1186/s13321-018-0263-1

Abstract

The collection of chemical structure information and associated experimental data for quantitative structure–activity/property relationship (QSAR/QSPR) modeling is facilitated by an increasing number of public databases containing large amounts of useful data. However, the performance of QSAR models depends heavily on the quality of the data and the modeling methodology used. This study aims to develop robust QSAR/QSPR models for chemical properties of environmental interest that can be used for regulatory purposes. This study primarily uses data from the publicly available PHYSPROP database consisting of a set of 13 common physicochemical and environmental fate properties. These datasets have undergone extensive curation using an automated workflow to select only high-quality data, and the chemical structures were standardized prior to calculation of the molecular descriptors. The modeling procedure was developed based on the five Organization for Economic Cooperation and Development (OECD) principles for QSAR models. A weighted k-nearest neighbor approach was adopted using a minimum number of required descriptors calculated using PaDEL, an open-source software. The genetic algorithms selected only the most pertinent and mechanistically interpretable descriptors (2–15, with an average of 11 descriptors). The sizes of the modeled datasets varied from 150 chemicals for biodegradability half-life to 14,050 chemicals for logP, with an average of 3222 chemicals across all endpoints. The optimal models were built on randomly selected training sets (75%) and validated using fivefold cross-validation (CV) and test sets (25%). The CV Q² of the models varied from 0.72 to 0.95, with an average of 0.86, and the test-set R² varied from 0.71 to 0.96, with an average of 0.82. Modeling and performance details are described in QSAR model reporting format and were validated by the European Commission’s Joint Research Center to be OECD compliant.
All models are freely available as an open-source, command-line application called OPEn structure–activity/property Relationship App (OPERA). OPERA models were applied to more than 750,000 chemicals to produce freely available predicted data on the U.S. Environmental Protection Agency’s CompTox Chemistry Dashboard.
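
The validation scheme summarized in the abstract (a weighted k-nearest neighbor model, a random 75%/25% training/test split, fivefold cross-validation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from OPERA: the synthetic two-descriptor data, the inverse-distance weighting, the choice of k = 5, the function names, and the pooled out-of-fold definition of Q². OPERA's own descriptor selection (genetic algorithms over PaDEL descriptors) is not reproduced here.

```python
import math
import random

def weighted_knn_predict(train_X, train_y, x, k=5):
    """Distance-weighted kNN regression: closer neighbors get larger weights."""
    nearest = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y))[:k]
    # Inverse-distance weights (an assumed scheme); epsilon avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * yi for w, (_, yi) in zip(weights, nearest)) / sum(weights)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def cross_validated_q2(X, y, k=5, n_folds=5):
    """Q2 from pooled out-of-fold predictions (a simplified definition)."""
    preds = [0.0] * len(X)
    for fold in range(n_folds):
        held_out = set(range(fold, len(X), n_folds))
        tr_X = [X[i] for i in range(len(X)) if i not in held_out]
        tr_y = [y[i] for i in range(len(X)) if i not in held_out]
        for i in held_out:
            preds[i] = weighted_knn_predict(tr_X, tr_y, X[i], k)
    return r_squared(y, preds)

# Synthetic stand-in for (descriptor vector, property value) pairs.
rng = random.Random(0)
X = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(400)]
y = [a + 0.5 * b * b + rng.gauss(0, 0.1) for a, b in X]

# Random 75% / 25% split into training and test sets.
order = list(range(len(X)))
rng.shuffle(order)
cut = int(0.75 * len(order))
X_train = [X[i] for i in order[:cut]]
y_train = [y[i] for i in order[:cut]]
X_test = [X[i] for i in order[cut:]]
y_test = [y[i] for i in order[cut:]]

# Fivefold CV on the training set, then evaluation on the held-out test set.
q2 = cross_validated_q2(X_train, y_train)
test_pred = [weighted_knn_predict(X_train, y_train, x) for x in X_test]
r2_test = r_squared(y_test, test_pred)
print(f"CV Q2 = {q2:.2f}, test R2 = {r2_test:.2f}")
```

Reporting both numbers mirrors the abstract: Q² is computed only from the training set, while the test-set R² uses chemicals the model never saw during fitting.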
