Critical Artifacts Improve Reproducibility of Protein-Ligand Binding Affinity Prediction Models on CASF-2016

J Chem Inf Model. 2026 Jun 24. doi: 10.1021/acs.jcim.6c01192. Online ahead of print.

ABSTRACT

Protein-ligand binding affinity prediction (PLBAP) models are routinely benchmarked on the CASF-2016 data set, with the Pearson correlation coefficient (PCC) as a common measure of scoring power. Published PCC values are frequently reused as baselines for cross-study comparisons. This practice implicitly assumes that published pipelines remain runnable and that reported metrics can be independently verified. To examine this assumption, we conducted a systematic reproducibility audit of 50 PLBAP models published between 2021 and 2024 that reported CASF-2016 scoring power. For each model, we attempted to reproduce the authors’ CASF-2016 inference using only publicly available code, documentation, and pretrained weights. To scaffold this audit and to offer a reusable resource for the community, we introduce a minimal five-item reproducibility checklist for PLBAP pipelines, organized around the artifacts a researcher requires to independently rerun inference: (1) a license; code for (2) preprocessing and featurization, (3) training, and (4) inference; and (5) pretrained model weights. We find that only 17 of the 50 pipelines satisfied every checklist item and could be run consistently. Of those 17 runnable models, only nine were statistically reproducible (53% of the runnable models, or 18% of all 50). We propose the checklist as a lightweight community standard for future PLBAP releases, document common gaps, and highlight practices that most reliably enabled independent reproduction.
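
As an illustration only, the five checklist items and the PCC metric described above could be sketched in Python as follows. This is not the authors' audit code: the class name, field names, and helper functions are hypothetical, and the abstract's criterion for "statistically reproducible" is not specified here, so it is not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from math import sqrt


@dataclass(frozen=True)
class ReleaseChecklist:
    """The five artifacts listed in the abstract (field names are hypothetical)."""

    license: bool
    preprocessing_code: bool  # preprocessing and featurization code
    training_code: bool
    inference_code: bool
    pretrained_weights: bool

    def missing(self) -> list[str]:
        """Return the names of the checklist items that are not satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_complete(self) -> bool:
        """True only when all five items are present."""
        return not self.missing()


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between predicted and measured affinities."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined when either sequence is constant")
    return cov / sqrt(var_x * var_y)


# Example: a hypothetical release that ships everything except weights.
release = ReleaseChecklist(
    license=True,
    preprocessing_code=True,
    training_code=True,
    inference_code=True,
    pretrained_weights=False,
)
print(release.is_complete(), release.missing())

# Arithmetic behind the reported fractions: 9 of 17 runnable, 9 of 50 overall.
print(f"{9 / 17:.0%} of runnable models, {9 / 50:.0%} of all audited models")
```

Recording the checklist as explicit boolean fields makes the gaps in a release directly reportable, in the spirit of the "common gaps" the abstract mentions.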

PMID:42341287 | DOI:10.1021/acs.jcim.6c01192

By Nevin Manimala