Computer science
Machine learning
Artificial intelligence
Data-driven
Computational biology
Data science
Precision medicine
Big data
Authors
Li Rong Wang,Limsoon Wong,Wilson Wen Bin Goh
Identifier
DOI:10.1016/j.drudis.2021.10.017
Abstract
Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgangers. Data doppelgangers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelganger effect). Despite the abundance of data doppelgangers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgangers arise, and provide proof of their confounding effects. To mitigate the doppelganger effect, we recommend identifying data doppelgangers before the training-validation split.
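The abstract recommends identifying data doppelgangers before the training-validation split but does not say how. The following is only a minimal sketch of one way this might be done, under two loudly stated assumptions: that similarity between samples is measured by pairwise Pearson correlation, and that a cutoff of 0.95 marks a doppelganger pair. The function names `find_doppelganger_pairs` and `group_labels` and the threshold value are hypothetical illustrations, not taken from the paper.

```python
import numpy as np


def find_doppelganger_pairs(X, threshold=0.95):
    """Return index pairs (i, j), i < j, whose sample profiles correlate above threshold.

    X is a 2-D array with one sample per row. Pearson correlation and the
    default threshold are assumptions made for this sketch.
    """
    corr = np.corrcoef(X)  # rows are variables for corrcoef, so this is sample-by-sample
    i_idx, j_idx = np.triu_indices(corr.shape[0], k=1)  # upper triangle, no diagonal
    mask = corr[i_idx, j_idx] > threshold
    return [(int(i), int(j)) for i, j in zip(i_idx[mask], j_idx[mask])]


def group_labels(n_samples, pairs):
    """Give every sample a group id so that linked samples share one id (union-find)."""
    parent = list(range(n_samples))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i, j in pairs:
        parent[find(i)] = find(j)
    return [find(k) for k in range(n_samples)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(4, 50))                  # four unrelated samples
    near_copy = base[0] + 0.01 * rng.normal(size=50)  # a near-duplicate of sample 0
    X = np.vstack([base, near_copy])

    pairs = find_doppelganger_pairs(X)
    groups = group_labels(len(X), pairs)
    print(pairs)
    print(groups)
```

The resulting group ids could then be handed to a group-aware splitter (for example scikit-learn's `GroupKFold`, via its `groups` argument) so that suspected doppelgangers never land on opposite sides of the training-validation split. Whether correlation is the right similarity measure, and where to set the cutoff, depends on the data set.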