Generating synthetic electronic health record data: a methodological scoping review with benchmarking on phenotype data and open-source software

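Benchmarking; Computer science; Fidelity; Python (programming language); Benchmark (surveying); Baseline (sea); Data mining; Machine learning; Software; Open science; Data science; Open source; Artificial intelligence; Physics; Marketing; Astronomy; Business; Operating system; Telecommunications; Oceanography; Geodesy; Geology; Programming language; Geography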

Authors

X Chen, Zhenke Wu, Xu Shi, Hyunghoon Cho, Bhramar Mukherjee

Source

Journal: Journal of the American Medical Informatics Association [Oxford University Press]
Date: 2025-05-12

Links

nih.gov, doi.org

Identifier

DOI: 10.1093/jamia/ocaf082

Abstract

We conduct a scoping review (ScR) of existing approaches for synthetic Electronic Health Records (EHR) data generation, benchmark major methods, provide open-source software, and offer recommendations for practitioners. We search three academic databases for our scoping review. Methods are benchmarked on open-source EHR datasets, Medical Information Mart for Intensive Care III and IV (MIMIC-III/IV). Seven existing methods covering major categories and two baseline methods are implemented and compared. Evaluation metrics cover data fidelity, downstream utility, privacy protection, and computational cost. Forty-eight studies are identified and classified into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV to assess transportability. Among them, Generative Adversarial Network (GAN)-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III, while rule-based methods excel in privacy protection. Similar findings are observed on MIMIC-IV, except that GAN-based methods further outperform the baseline methods in preserving fidelity. Method choice is governed by the relative importance of the evaluation metrics in downstream use cases. We provide a decision tree to guide the choice among the benchmarked methods. An extensible Python package, "SynthEHRella", is provided to facilitate streamlined evaluations. GAN-based methods excel when distributional shifts exist between the training and testing populations. Otherwise, CorGAN and MedGAN are most suitable for association modeling and predictive modeling, respectively. Future research should prioritize enhancing the fidelity of synthetic data while controlling privacy exposure, as well as comprehensive benchmarking of longitudinal or conditional generation methods.
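
As a rough illustration of what a data-fidelity metric can look like for binary phenotype data, the sketch below compares per-code prevalence between a real and a synthetic patient-by-code matrix. It is a minimal example under stated assumptions, not the paper's evaluation protocol and not the SynthEHRella API: the function names (`prevalence`, `mean_abs_prevalence_gap`, `sample_records`), the toy prevalence rates, and the simulated "real" and "synthetic" cohorts are all hypothetical.

```python
import random


def prevalence(records):
    """Per-code prevalence: fraction of patients with each binary code set."""
    n = len(records)
    return [sum(col) / n for col in zip(*records)]


def mean_abs_prevalence_gap(real, synthetic):
    """Mean absolute difference in per-code prevalence (lower = closer)."""
    p_real = prevalence(real)
    p_syn = prevalence(synthetic)
    return sum(abs(a - b) for a, b in zip(p_real, p_syn)) / len(p_real)


def sample_records(rates, n, rng):
    """Hypothetical toy generator: independent Bernoulli draw per code."""
    return [[1 if rng.random() < r else 0 for r in rates] for _ in range(n)]


rng = random.Random(0)
true_rates = [0.30, 0.10, 0.05, 0.50]  # assumed toy phenotype prevalences

real = sample_records(true_rates, 2000, rng)
good_syn = sample_records(true_rates, 2000, rng)   # matches the real rates
poor_syn = sample_records([0.5] * 4, 2000, rng)    # ignores the real rates

gap_good = mean_abs_prevalence_gap(real, good_syn)
gap_poor = mean_abs_prevalence_gap(real, poor_syn)
print(f"well-matched generator gap: {gap_good:.3f}")
print(f"mismatched generator gap:   {gap_poor:.3f}")
```

A marginal comparison like this captures only one aspect of fidelity; it says nothing about correlations between codes, downstream utility, or privacy exposure, which the abstract lists as separate evaluation dimensions.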
