Interpretable Yield Prediction of Supercritical CO2 Extraction from Various Essential Oil Sources Using Optimized Machine Learning and PCA-Based Descriptors
Predicting essential oil yield in supercritical CO2 (SC-CO2) extraction remains difficult due to variations in plant composition and process conditions. Conventional models often assume uniform feedstock behavior, which limits their applicability across diverse species. This study develops machine learning models that integrate extraction parameters with principal component analysis (PCA)-based molecular descriptors representing the seven major compounds of each essential oil source. A data set of 1313 experimental records from 42 plant species was compiled to train three algorithms: LightGBM (LGBMR), HistGradientBoosting (HGBR), and Extra Trees (ETR). The models were optimized using four metaheuristic algorithms to improve their predictive accuracy. All models achieved high predictive performance (R2 > 0.97). The ETR model optimized by a genetic algorithm (ETR-3PCs-GA) attained the highest performance (R2 = 0.9808, root-mean-square error (RMSE) = 0.7802), while the HGBR model with two principal components and GA optimization (HGBR-2PCs-GA) demonstrated superior ability to predict dynamic extraction profiles (RMSE = 0.408). SHapley Additive exPlanations (SHAP) analysis identified pressure and selected PCA coordinates as the most influential features, revealing that both process parameters and molecular composition jointly determine extraction efficiency. The model successfully generalized yield prediction across species and reproduced known process trends, such as the positive effects of pressure and flow rate on yield. The findings also indicate a synergistic effect, whereby the entire molecular profile, not just the most abundant compounds, governs the final yield. This approach demonstrates that integrating molecular-level information with process data can provide transferable, interpretable models for optimizing SC-CO2 extraction of essential oils.
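The feature construction described above, PCA-compressed molecular descriptors joined to extraction parameters, can be illustrated with a minimal sketch. It is not the authors' implementation: the descriptor values, the number of descriptors per compound, the choice of four process variables, and their ranges are all assumptions invented for illustration, and PCA is computed directly through a singular value decomposition in NumPy. Only the counts of 42 species, 1313 records, seven major compounds, and three principal components are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# 42 species, 1313 records, and 7 major compounds come from the abstract.
# 5 descriptors per compound is a hypothetical choice for this sketch.
n_species, n_records = 42, 1313
n_compounds, n_desc_per_compound = 7, 5

# Hypothetical descriptor table: one row per plant species (random stand-in data).
descriptors = rng.normal(size=(n_species, n_compounds * n_desc_per_compound))

# PCA via SVD of the standardized descriptor matrix.
z = (descriptors - descriptors.mean(axis=0)) / descriptors.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)
n_pcs = 3
species_pcs = z @ vt[:n_pcs].T                      # PCA coordinates per species
explained = (s ** 2 / np.sum(s ** 2))[:n_pcs]       # variance share per component

# Synthetic extraction records; variables and ranges are assumptions.
species_id = rng.integers(0, n_species, size=n_records)
process = np.column_stack([
    rng.uniform(80.0, 400.0, n_records),   # pressure
    rng.uniform(35.0, 70.0, n_records),    # temperature
    rng.uniform(0.5, 5.0, n_records),      # CO2 flow rate
    rng.uniform(10.0, 300.0, n_records),   # extraction time
])

# Each record gets its process conditions plus the PCA coordinates of its species.
X = np.hstack([process, species_pcs[species_id]])
print(X.shape)
```

A matrix built this way could then be passed, together with measured yields, to a tree-ensemble regressor such as scikit-learn's `ExtraTreesRegressor`. Because the PCA coordinates are looked up per species, every record from the same plant shares the same molecular features, and only the process columns vary within a species.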