JAMA Ophthalmol. 2025 Apr 24. doi: 10.1001/jamaophthalmol.2025.0834. Online ahead of print.
ABSTRACT
IMPORTANCE: It was recently demonstrated that the large language model Generative Pre-trained Transformer 4 (GPT-4; OpenAI) can fabricate synthetic medical datasets designed to support false scientific evidence.
OBJECTIVE: To identify statistical patterns suggestive of fabrication in datasets produced by large language models, and to refine these synthetic datasets by removing detectable marks of nonauthenticity, thereby probing the limits of generative artificial intelligence.
DESIGN, SETTING, AND PARTICIPANTS: In this quality improvement study, synthetic datasets were produced for 3 fictional clinical studies designed to compare the outcomes of 2 alternative treatments for specific ocular diseases. Synthetic datasets were produced using the default GPT-4o model and a custom GPT. Data fabrication was conducted in November 2024.
EXPOSURE: Prompts were submitted to GPT-4o to produce 12 “unrefined” datasets, which underwent forensic examination. Based on the outcomes of this analysis, the custom GPT Synthetic Data Creator was built with detailed instructions to generate 12 “refined” datasets designed to evade authenticity checks. Then, forensic analysis was repeated on these enhanced datasets.
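For illustration only, a request of this kind could also be issued programmatically with the OpenAI Python SDK. The sketch below is a hypothetical stand-in (the prompt wording and dataset schema are assumptions), not the study's actual workflow, which used the ChatGPT interface and a custom GPT.

```python
# Illustrative sketch only: the study authors worked through the ChatGPT
# interface and a custom GPT, not necessarily the API call shown here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the study's actual prompts are not reproduced here.
prompt = (
    "Generate a CSV dataset of 250 fictional patients comparing two "
    "treatments for an ocular disease, with columns for name, sex, date "
    "of birth, baseline visit date, and best-corrected visual acuity."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```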
MAIN OUTCOMES AND MEASURES: Forensic analysis was performed to identify statistical anomalies in demographic data, distribution uniformity, and repetitive patterns of last digits, as well as linear correlations, distribution shape, and outliers of study variables. Datasets were also qualitatively assessed for the presence of unrealistic clinical records.
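As a rough sketch of how such checks might be implemented, the snippet below uses standard SciPy tests (chi-square for last-digit uniformity, Pearson correlation between related variables, Shapiro-Wilk for distribution shape). Thresholds, function names, and the simulated data are illustrative assumptions, not the study's protocol.

```python
# Minimal sketch of the kinds of forensic checks described above.
# Thresholds and variable names are assumptions, not the study's protocol.
import numpy as np
from scipy import stats

def last_digit_nonuniform(values, alpha=0.05):
    """Chi-square test: last digits of authentic measurements are typically
    close to uniform; a strong deviation can suggest fabrication."""
    digits = np.abs(np.round(values)).astype(int) % 10
    observed = np.bincount(digits, minlength=10)
    _, p = stats.chisquare(observed)  # expected counts default to uniform
    return p < alpha  # True flags a suspicious departure from uniformity

def implausible_correlation(x, y, r_min=0.1):
    """Flag near-zero correlation between variables that should be
    clinically related (e.g., pre- and post-treatment measurements)."""
    r, _ = stats.pearsonr(x, y)
    return abs(r) < r_min

def suspicious_shape(values, alpha=0.05):
    """Shapiro-Wilk normality test as a crude distribution-shape check."""
    _, p = stats.shapiro(values)
    return p < alpha

# Simulated example: correlated pre/post values, as real data would show.
rng = np.random.default_rng(0)
pre = rng.normal(50, 10, 200)
post = pre + rng.normal(-5, 3, 200)
print(last_digit_nonuniform(pre), implausible_correlation(pre, post),
      suspicious_shape(post))
```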
RESULTS: Forensic analysis identified 103 fabrication marks among 304 tests (33.9%) in the unrefined datasets. Notable flaws included mismatches between patient names and gender (n = 12), baseline visits occurring on weekends (n = 12), age calculation errors (n = 9), lack of uniformity (n = 4), and repetitive numerical patterns in last digits (n = 7). Very weak correlations (r < 0.1) were observed between study variables (n = 12), and variables showed suspicious distribution shapes (n = 6). Compared with the unrefined datasets, refined datasets showed an absolute reduction of 29.3 percentage points (95% CI, 23.5-35.1) in signs of fabrication (14 of 304 statistical tests [4.6%]). Four refined datasets passed forensic analysis as authentic; the remaining 8 still showed suspicious distribution shapes or other issues.
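The record-level flaws listed above (weekend baseline visits, age miscalculations) correspond to simple plausibility checks. A minimal sketch follows; the field names and example record are hypothetical and not drawn from the study.

```python
# Sketch of record-level plausibility checks analogous to the flaws listed
# above. Field names are hypothetical, not the study's schema.
from datetime import date

def visit_on_weekend(visit: date) -> bool:
    """Baseline visits on Saturday/Sunday are implausible for most clinics."""
    return visit.weekday() >= 5  # Monday=0 ... Sunday=6

def age_inconsistent(birth: date, visit: date, recorded_age: int) -> bool:
    """Recompute age at the visit date and compare with the recorded value."""
    age = visit.year - birth.year - (
        (visit.month, visit.day) < (birth.month, birth.day)
    )
    return age != recorded_age

record = {"dob": date(1956, 3, 14), "visit": date(2024, 6, 8), "age": 69}
print(visit_on_weekend(record["visit"]))   # True: 2024-06-08 is a Saturday
print(age_inconsistent(record["dob"], record["visit"], record["age"]))
# True: the recomputed age at the visit is 68, not 69
```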
CONCLUSIONS AND RELEVANCE: Sufficiently sophisticated custom GPTs can perform complex statistical tasks and may be abused to fabricate synthetic datasets that can pass forensic analysis as authentic.
PMID:40272814 | DOI:10.1001/jamaophthalmol.2025.0834