Addressing Data Imbalance in Hydrological Machine Learning: Impact of Advanced Sampling Methods on Performance and Interpretability

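Keywords: Interpretability; Random forest; Feature (linguistics); Latin hypercube sampling; Sampling (signal processing); Machine learning; Data set; Set (abstract data type); Computer science; Artificial intelligence; Simple random sampling; Training set; Oversampling; Constraint; Feature vector; Data mining; Data modeling; Hydrological model; Forest cover; Training (meteorology); Plot (graphics); Feature selection; Interpolation (computer graphics); Data space; Land cover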

Authors

Xiaoran Yin, Longcang Shu, Zhe Wang, Long Zhou, Shuyao Niu, Huazhun Ren, Bo Liu, Chengpeng Lu

Source

Journal: Water Resources Research [Wiley]
Date: 2025-10-01; Volume (issue): 61 (10); Citations: 3

Identifier

DOI: 10.1029/2024wr039848

Abstract

Data imbalance poses a severe challenge in hydrological machine learning (ML) applications by limiting model performance and interpretability, yet solutions remain limited. This study evaluates the impact of advanced sampling methods, particularly feature space coverage sampling (FSCS), on model performance in predicting forest cover types and saturated hydraulic conductivity (Ks), the mechanism underlying the efficacy of FSCS, and its impact on model interpretability. Using ML algorithms such as random forest (RF) and LightGBM (LGB) across various training set sizes, we demonstrated that FSCS significantly mitigates data imbalance, enhancing model accuracy, feature importance estimation, and interpretability. Two widely used hydrological data sets were analyzed: a large multiclass forest cover type data set from Roosevelt National Forest (110,393 samples) and a continuous‐value data set of soil properties from the USKSAT database (18,729 samples). In total, 1,720 models were constructed and optimized, combining different sampling methods, training set sizes, and algorithms. Balanced sampling, conditioned Latin hypercube sampling, and FSCS consistently outperformed simple random sampling. Despite using smaller training sets and simpler RF models, FSCS‐trained models matched or surpassed the performance of those using larger data sets or more complex LGB models. SHAP analysis revealed that FSCS clarified feature–target relationships, emphasizing feature interactions and improving model interpretability. These findings highlight the potential of advanced sampling methods not only for addressing data imbalance but also for providing more accurate prior information for model training, thereby enhancing reliability, accuracy, and interpretability in ML for hydrological applications.
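
The abstract does not describe how FSCS selects training samples, so the following is only a minimal sketch under an assumption: that FSCS follows the common formulation in which the standardized feature space is partitioned with k-means and the data point nearest each cluster centre is selected. The function name `fscs_sample` and its parameters are hypothetical and are not taken from the paper; the plain NumPy k-means here is for illustration on small arrays, not an efficient implementation for data sets of the size reported above.

```python
import numpy as np


def fscs_sample(X, n_samples, n_iter=50, seed=0):
    """Return row indices of X picked by a k-means-based coverage sample.

    Hypothetical helper: assumes FSCS = k-means in standardized feature
    space, followed by choosing the point nearest to each cluster centre.
    """
    X = np.asarray(X, dtype=float)
    if not 1 <= n_samples <= len(X):
        raise ValueError("n_samples must be between 1 and the number of rows")
    rng = np.random.default_rng(seed)

    # Standardize so that every feature contributes equally to distances.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std

    # Plain Lloyd's k-means with one cluster per requested sample.
    centers = Z[rng.choice(len(Z), size=n_samples, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(n_samples):
            members = Z[labels == k]
            if len(members):  # keep the old centre if a cluster is empty
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # For each centre, take the nearest data point not already selected.
    dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    chosen = []
    for k in range(n_samples):
        for idx in np.argsort(dist[:, k]):
            if int(idx) not in chosen:
                chosen.append(int(idx))
                break
    return np.array(chosen)


if __name__ == "__main__":
    # Synthetic imbalanced data: a dense region plus a sparse, distant one.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (950, 2)), rng.normal(8, 1, (50, 2))])
    picked = fscs_sample(X, 20)
    print("points selected from the sparse region:", int((picked >= 950).sum()))
```

Because the selection depends on where points lie in feature space rather than on how many points share a region, sparsely populated regions are less likely to be missed than under simple random sampling, which is one plausible reading of why the abstract reports that FSCS mitigates imbalance.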
