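Imputation (statistics)
Missing data
Computer science
Mean squared error
Random forest
Extrapolation
Data mining
Data set
Statistics
Mathematics
Artificial intelligence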
Authors
Chunhui Xie, Rui Li, Yunqi Li, Haibo Xie, Qibin Liu
Identifier
DOI:10.1021/acs.jctc.4c01237
Abstract
Missing data in tabular data sets is ubiquitous in statistical analysis, big data analysis, and machine learning studies. Many strategies have been proposed to impute missing data, but their reliability has not been stringently assessed in materials science. Here, we carried out a benchmark test of six imputation strategies (Mean, MissForest, HyperImpute, Gain, Sinkhorn, and a newly proposed MatImpute) on seven representative data sets in materials science. The imputation-induced errors (IIEs) were evaluated through the difference between imputed and original values, using root mean square error (RMSE), Wasserstein distance (WD), and a newly introduced metric, data set correlation convergence (DCC). Together these measure the difference at three levels: individual data points, column-wise distributions, and the correlation stability of a data set. MatImpute outperformed the others, with the lowest RMSE and WD and the highest DCC. The IIE increases with the data missing ratio and follows the order missing at random < missing completely at random ≤ missing not at random, considering inherent correlations among missing data. A similar trend was observed for the increase of IIE with the central departure distance in units of the standard deviation, which is consistent with the increasing difficulty from interpolation to extrapolation. In further tests of IIE in regression and classification machine learning predictive models, MatImpute also preserved the highest data recovery fidelity. We released the code of MatImpute to facilitate the construction of high-quality data sets in materials science.
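The evaluation idea described in the abstract (mask known values, impute them, then compare imputed against original values) can be illustrated with a minimal sketch. This is not the authors' code or MatImpute: it uses only the Mean baseline on one synthetic Gaussian column with values removed completely at random, and all function names (`mask_mcar`, `impute_mean`, `rmse_on_missing`, `wasserstein_1d`) are hypothetical. It also assumes that RMSE is computed over the masked entries only and that WD is computed between the full original and imputed columns; the abstract does not specify these details.

```python
import math
import random
import statistics


def mask_mcar(column, missing_ratio, rng):
    """Return a copy of column with a random fraction set to None (MCAR)."""
    n_missing = int(round(missing_ratio * len(column)))
    missing_idx = set(rng.sample(range(len(column)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(column)]


def impute_mean(masked):
    """Fill every missing entry with the mean of the observed entries."""
    observed = [v for v in masked if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in masked]


def rmse_on_missing(original, masked, imputed):
    """RMSE between original and imputed values, over masked entries only
    (an assumption; the abstract does not say which entries are scored)."""
    sq_errors = [
        (o - p) ** 2
        for o, m, p in zip(original, masked, imputed)
        if m is None
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D samples:
    the mean absolute difference of their sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


rng = random.Random(0)
# Synthetic stand-in for one numeric column of a tabular data set.
original = [rng.gauss(0.0, 1.0) for _ in range(1000)]

results = {}
for ratio in (0.1, 0.3, 0.5):
    masked = mask_mcar(original, ratio, rng)
    imputed = impute_mean(masked)
    results[ratio] = (
        rmse_on_missing(original, masked, imputed),
        wasserstein_1d(original, imputed),
    )
    print(f"missing ratio {ratio:.1f}: "
          f"RMSE={results[ratio][0]:.3f}, WD={results[ratio][1]:.3f}")
```

In this toy setting the column-wise WD grows with the missing ratio because mean imputation collapses more and more values onto a single point, which distorts the distribution even when the per-value RMSE stays roughly constant. That is one reason a benchmark might report both metrics.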