Abstract
Lipid nanoparticles (LNPs) are highly effective carriers for gene therapies, including mRNA and siRNA delivery, owing to their ability to transport nucleic acids across biological membranes, low cytotoxicity, improved pharmacokinetics, and scalability. A typical approach to formulating LNPs is to establish a quantitative structure-activity relationship (QSAR) between their compositions and their in vitro/in vivo activities, which allows activity to be predicted from molecular structure. However, developing QSAR models for LNPs is challenging because of the complexity of multicomponent formulations, interactions with biological membranes, stability in physiological environments, and diverse physicochemical properties. To address these challenges, we developed a machine-learning (ML) framework to predict the activity and cell viability of LNPs for nucleic acid delivery. We curated data from 6454 LNP formulations reported across 21 independent studies and implemented 11 molecular featurization techniques, ranging from descriptors and fingerprints to graph-based representations, alongside six ML algorithms for binary and multiclass classification. Under scaffold-based 5-fold cross-validation, our models achieved classification accuracies exceeding 90% for both the activity and cell viability prediction tasks. Among all model-feature combinations, descriptor-based features combined with ensemble models such as balanced random forest and extra trees yielded the highest performance. Through SHAP-based feature attribution and interaction analysis, we identified key physicochemical properties and compositional features driving LNP performance, highlighting the importance of synergistic effects among multiple molecular features.
Furthermore, we developed a transfer-learning strategy to bridge the in vitro-to-in vivo prediction gap by incorporating base model predictions along with additional biological attributes, such as particle size, polydispersity index, and ζ potential. Despite the smaller size and inherent class imbalance of the in vivo data set, the transfer-learning models showed promising predictive performance, with accuracies exceeding 82%. Our findings underscore the potential of interpretable ML frameworks to guide rational LNP design and provide a scalable approach to QSAR modeling in complex nanomedicine systems.
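
To make the scaffold-based 5-fold cross-validation mentioned above concrete, the following is a minimal sketch of a scaffold-grouped split, not the implementation used in this work. It assumes each formulation has already been assigned a scaffold key; the function name `scaffold_kfold` and the greedy fold-balancing rule are illustrative assumptions. The point it shows is that formulations sharing a scaffold are never divided between the training and test sets of a fold.

```python
from collections import defaultdict


def scaffold_kfold(scaffolds, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no scaffold spans both sets.

    `scaffolds` holds one scaffold key per formulation. The keys are
    hypothetical labels here; how scaffolds are derived is an assumption
    left outside this sketch.
    """
    # Collect the indices of all formulations that share a scaffold.
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Greedy balancing (an illustrative choice): place the largest scaffold
    # groups first, each into the fold that currently holds the fewest items.
    folds = [[] for _ in range(n_splits)]
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        smallest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[key])

    # Each fold serves once as the test set; the rest form the training set.
    for f in range(n_splits):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(n_splits) if g != f for i in folds[g])
        yield train_idx, test_idx
```

Grouping by scaffold rather than splitting at random gives a stricter estimate of how a model generalizes to structurally new lipids, since close analogues of a test compound cannot appear in the training data.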