Exploring Detection Methods for Synthetic Medical Datasets Created With a Large Language Model

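Outliers, Computer science, Artificial intelligence, Statistical model, Anomaly detection, Natural language processing, Machine learning, Data mining, Medicine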

Authors

Andrea Taloni, Giulia Coco, Marco Pellegrini, Matthias Wjst, Niccolò Salgari, Giovanna Carnovale-Scalzo, Vincenzo Scorcia, Massimo Busin, Giuseppe Giannaccare

Source

Journal: JAMA Ophthalmology [American Medical Association]
Date: 2025-04-24; Citations: 1

Links

nih.gov, doi.org

Identifiers

DOI: 10.1001/jamaophthalmol.2025.0834

Abstract

Importance: Recently, it was proved that the large language model Generative Pre-trained Transformer 4 (GPT-4; OpenAI) can fabricate synthetic medical datasets designed to support false scientific evidence.

Objective: To uncover statistical patterns that may suggest fabrication in datasets produced by large language models and to improve these synthetic datasets by attempting to remove detectable marks of nonauthenticity, investigating the limits of generative artificial intelligence.

Design, Setting, and Participants: In this quality improvement study, synthetic datasets were produced for 3 fictional clinical studies designed to compare the outcomes of 2 alternative treatments for specific ocular diseases. Synthetic datasets were produced using the default GPT-4o model and a custom GPT. Data fabrication was conducted in November 2024.

Exposure: Prompts were submitted to GPT-4o to produce 12 “unrefined” datasets, which underwent forensic examination. Based on the outcomes of this analysis, the custom GPT Synthetic Data Creator was built with detailed instructions to generate 12 “refined” datasets designed to evade authenticity checks. Then, forensic analysis was repeated on these enhanced datasets.

Main Outcomes and Measures: Forensic analysis was performed to identify statistical anomalies in demographic data, distribution uniformity, and repetitive patterns of last digits, as well as linear correlations, distribution shape, and outliers of study variables. Datasets were also qualitatively assessed for the presence of unrealistic clinical records.

Results: Forensic analysis identified 103 fabrication marks among 304 tests (33.9%) in unrefined datasets. Notable flaws included mismatch between patient names and gender (n = 12), baseline visits occurring during weekends (n = 12), age calculation errors (n = 9), lack of uniformity (n = 4), and repetitive numerical patterns in last digits (n = 7). Very weak correlations (r < 0.1) were observed between study variables (n = 12). In addition, variables showed a suspicious distribution shape (n = 6). Compared with unrefined datasets, refined ones showed 29.3% (95% CI, 23.5%-35.1%) fewer signs of fabrication (14 of 304 statistical tests performed [4.6%]). Four refined datasets passed forensic analysis as authentic; however, suspicious distribution shape or other issues were found in others.

Conclusions and Relevance: Sufficiently sophisticated custom GPTs can perform complex statistical tasks and may be abused to fabricate synthetic datasets that can pass forensic analysis as authentic.
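
The abstract names several forensic checks but does not say how they were computed. The Python sketch below illustrates how a few of them could be approximated; it is not the authors' procedure. The function names, the choice of a chi-square test for last-digit uniformity, the integer-valued measurements, and the unpooled Wald interval for the difference in proportions are all assumptions made for illustration.

```python
import math
from collections import Counter
from datetime import date

# Critical value of the chi-square distribution with 9 degrees of freedom
# at alpha = 0.05 (10 possible last digits minus 1).
CHI2_CRIT_DF9_ALPHA05 = 16.919


def last_digit_chi2(values):
    """Chi-square statistic for uniformity of last digits (0-9).

    Assumes integer-valued measurements (hypothetical simplification).
    """
    digits = [abs(int(v)) % 10 for v in values]
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


def is_weekend(visit):
    """True if the visit date falls on a Saturday or Sunday."""
    return visit.weekday() >= 5


def age_at(birth, visit):
    """Age in completed years at the visit date."""
    had_birthday = (visit.month, visit.day) >= (birth.month, birth.day)
    return visit.year - birth.year - (0 if had_birthday else 1)


def wald_diff_ci(x1, n1, x2, n2, z=1.96):
    """Difference of two proportions with an unpooled Wald 95% CI."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Every record ending in the same digit is flagged as non-uniform.
    print(last_digit_chi2([15] * 50) > CHI2_CRIT_DF9_ALPHA05)  # True

    # A baseline visit on Saturday, 2 November 2024 is flagged.
    print(is_weekend(date(2024, 11, 2)))  # True

    # Counts reported in the abstract: 103 of 304 vs 14 of 304 tests.
    diff, lo, hi = wald_diff_ci(103, 304, 14, 304)
    print(round(diff * 100, 1), round(lo * 100, 1), round(hi * 100, 1))
    # 29.3 23.5 35.1
```

With the counts reported in the Results, this Wald interval gives 29.3% (23.5%-35.1%), which agrees with the figures in the abstract, although the abstract does not state which interval method was used.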
