Seung Kyun Ha, Dipannita Kalyani, Michael S. West, Jessica Xu, Yu‐hong Lam, Thomas J. Struble, Spencer D. Dreher, Shane W. Krska, Stephen L. Buchwald, Klavs F. Jensen
This manuscript presents machine learning models for Pd-catalyzed C-N couplings, built on a large, pharmaceutically relevant, structurally diverse dataset (4204 unique products) generated de novo by high-throughput experimentation. Dataset generation was enabled by the discovery of novel C-N coupling conditions, using LiOTMS as the base, that are compatible with nanomole-scale reactions and amenable to automation. The size of the dataset allowed systematic evaluation of model performance under five data-splitting strategies, each carefully designed to assess the models' ability to both interpolate and extrapolate. The models exhibit high predictive performance across all splits, as gauged by standard metrics. In addition, the models predicted with high accuracy the outcomes of validation libraries that were outside the scope of the training set. Employing these models in medicinal chemistry campaigns should result in significant enrichment of successful C-N couplings.
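The distinction between interpolation and extrapolation splits can be illustrated with a minimal sketch. The abstract does not specify the five strategies, so the code below is not the authors' scheme: it only contrasts a random split (test reactions share components with the training set) with a group-holdout split (every reaction of a held-out component is kept out of training). The field names `amine`, `halide`, and `success`, and the toy records, are hypothetical.

```python
import random
from collections import defaultdict


def random_split(records, test_fraction=0.2, seed=0):
    """Interpolation-style split: shuffle reactions and cut once."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def group_holdout_split(records, key, test_fraction=0.2, seed=0):
    """Extrapolation-style split: hold out whole groups (e.g. all
    reactions of a given amine) so the test set contains components
    the model never saw during training."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    names = sorted(groups)
    rng = random.Random(seed)
    rng.shuffle(names)
    n_test = max(1, int(len(names) * test_fraction))
    held_out = set(names[:n_test])
    train = [r for r in records if r[key] not in held_out]
    test = [r for r in records if r[key] in held_out]
    return train, test


# Hypothetical toy data: 5 amines x 4 halides, with made-up outcomes.
records = [
    {"amine": f"amine_{i}", "halide": f"halide_{j}", "success": (i + j) % 2 == 0}
    for i in range(5)
    for j in range(4)
]

rand_train, rand_test = random_split(records)
group_train, group_test = group_holdout_split(records, key="amine")
```

In the group-holdout case no amine appears on both sides of the split, so a model's test score reflects performance on unseen amines rather than on new combinations of familiar components.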