Eur Radiol. 2026 Apr 23. doi: 10.1007/s00330-026-12543-2. Online ahead of print.
ABSTRACT
OBJECTIVE: To systematically evaluate training sample size adequacy in externally validated machine learning (ML)-based radiomics models published in high-impact journals and quantify the gap between current practice and theoretical minimum requirements.
MATERIALS AND METHODS: This study followed a prespecified and publicly archived protocol. Original research articles published between January 2023 and August 2025 in first-quartile (Q1) journals were evaluated. Study selection followed a randomized dynamic screening protocol with an a priori, power-calculated stopping rule to determine the final cohort. Included studies developed binary prediction models using ML algorithms other than logistic regression and reported external validation. A sample size framework originally developed for logistic regression was applied as a conservative lower-bound benchmark. Minimum required sample sizes were calculated from reported training performance, outcome prevalence, and feature dimensionality.
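[Illustrative note, not part of the authors' methods: the abstract does not name the framework, but the description (logistic regression origin; inputs of training performance, outcome prevalence, and feature count) matches the minimum sample size criteria of Riley et al. for binary-outcome prediction models. A minimal Python sketch of those three criteria follows, under that assumption; in practice the anticipated Cox-Snell R² would first be derived from the reported training discrimination (e.g., the C-statistic), a conversion step omitted here.]

    import math

    def riley_min_n_binary(p, prevalence, r2_cs, shrinkage=0.90, delta=0.05):
        """Minimum training sample size for a binary prediction model under
        the three criteria of Riley et al. (assumed benchmark, see note).

        p          -- number of candidate predictor parameters (e.g., radiomic features)
        prevalence -- anticipated outcome proportion (phi) in the target population
        r2_cs      -- anticipated Cox-Snell R^2 of the model
        shrinkage  -- target expected shrinkage factor S (criterion 1; default 0.90)
        delta      -- target margin of error for the overall risk estimate (criterion 3)
        """
        phi = prevalence
        # Maximum attainable Cox-Snell R^2 for a binary outcome with prevalence phi
        r2_max = 1.0 - math.exp(2.0 * (phi * math.log(phi)
                                       + (1.0 - phi) * math.log(1.0 - phi)))
        # Criterion 1: expected shrinkage of predictor effects of at most 10%
        n1 = p / ((shrinkage - 1.0) * math.log(1.0 - r2_cs / shrinkage))
        # Criterion 2: small optimism (<= 0.05 on the Nagelkerke scale, i.e.,
        # 0.05 * r2_max on the Cox-Snell scale), expressed via the implied shrinkage
        s2 = r2_cs / (r2_cs + 0.05 * r2_max)
        n2 = p / ((s2 - 1.0) * math.log(1.0 - r2_cs / s2))
        # Criterion 3: estimate the overall outcome risk to within +/- delta
        n3 = (1.96 / delta) ** 2 * phi * (1.0 - phi)
        return math.ceil(max(n1, n2, n3))

    # Hypothetical inputs: 20 features, 30% prevalence, anticipated R^2_cs = 0.15
    print(riley_min_n_binary(p=20, prevalence=0.30, r2_cs=0.15))  # -> 1097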
RESULTS: Of 64 full-text records assessed, 16 (25%) were unassessable because essential parameters required for sample size estimation (e.g., feature counts) were not reported. In the assessable final cohort (n = 28), observed training sample sizes were consistently inadequate, with a median deficit of 195.5 training instances. Only three studies (10.7%) met all criteria for stable prediction model development, even under these charitable assumptions. Most studies also failed basic heuristics such as 10 events per predictor, with a median events-per-predictor deficit of 5.8 (worked example below).
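[Illustrative note, with hypothetical numbers not drawn from the included studies: the heuristic is EPP = E / p, where E is the number of outcome events in the training set and p the number of candidate predictors. A model screened over p = 20 radiomic features therefore needs E >= 10 x 20 = 200 events under the 10-EPP rule; at an outcome prevalence of 30%, that corresponds to roughly 200 / 0.30 ≈ 667 training cases.]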
CONCLUSION: The vast majority of externally validated radiomics models in high-impact journals are trained on datasets statistically insufficient to support their algorithmic complexity. This systemic data deficit renders models prone to overfitting and instability, potentially explaining the field’s reproducibility crisis.
KEY POINTS:
Question: Do externally validated machine learning-based radiomics models in high-impact (Q1) journals have sufficient training sample sizes to support their reported model complexity?
Findings: Nearly 90% of externally validated radiomics models were trained on statistically insufficient datasets, with a median deficit of approximately 200 training instances.
Clinical relevance: Insufficient training sample sizes undermine model stability and contribute to the reproducibility crisis in radiomics, allowing externally validated models to appear robust while generating unreliable predictions that may misinform clinical decision-making.
PMID:42020624 | DOI:10.1007/s00330-026-12543-2