Chemistry. 2026 Feb 4:e03299. doi: 10.1002/chem.202503299. Online ahead of print.
ABSTRACT
Machine learning (ML) models are increasingly used in quantum chemistry, but their reliability hinges on uncertainty quantification (UQ). In this study, we compare two prominent UQ paradigms, deep evidential regression (DER) and deep ensembles, on the QM9 and WS22 datasets, with a specific emphasis on the role of post hoc calibration. Raw uncertainties from both methods were systematically miscalibrated: DER produced uncertainty estimates where data noise and model uncertainty were not cleanly separated, while ensembles produced sharper yet underconfident estimates. Applying calibration techniques such as isotonic regression (ISR), standard scaling, and GP-Normal corrected these deficiencies, aligning predicted variances with observed errors. On QM9, calibration enabled DER to filter high-confidence predictions more effectively than ensembles. On WS22, calibrated ensembles not only improved statistical reliability but also delivered substantial computational savings in active learning, reducing redundant ab initio evaluations by more than 20%. These results demonstrate that post hoc calibration is essential to transform uncertainty estimates from descriptive metrics into actionable signals, ensuring both trustworthy predictions and resource-efficient molecular modeling.
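As an illustration of what post hoc calibration of predicted variances can look like, the following is a minimal sketch of a single-factor scaling recalibration on synthetic data. The abstract does not specify how the study's standard scaling was implemented, so the formula used here (rescaling predicted standard deviations so that standardized residuals have unit mean square on a held-out calibration set) is an assumption, as are the function names `fit_variance_scale` and `coverage` and all data values.

```python
import numpy as np


def fit_variance_scale(y_true, mu, sigma):
    """Fit one scalar s so that (y - mu) / (s * sigma) has unit mean square.

    Assumed form of scaling-based recalibration; not taken from the paper.
    """
    z = (np.asarray(y_true) - np.asarray(mu)) / np.asarray(sigma)
    return float(np.sqrt(np.mean(z ** 2)))


def coverage(y_true, mu, sigma, k=1.0):
    """Fraction of targets lying within k predicted standard deviations."""
    return float(np.mean(np.abs(y_true - mu) <= k * sigma))


# Synthetic stand-in for model outputs (hypothetical, not QM9 or WS22 data).
rng = np.random.default_rng(0)
n = 4000
true_sigma = rng.uniform(0.5, 2.0, size=n)   # per-sample noise level
mu = rng.normal(size=n)                      # predicted means
y = mu + true_sigma * rng.normal(size=n)     # observed targets
raw_sigma = 0.5 * true_sigma                 # predicted std, too small by 2x

# Fit the scale on a calibration split, evaluate on a disjoint test split.
cal, test = slice(0, 2000), slice(2000, None)
s = fit_variance_scale(y[cal], mu[cal], raw_sigma[cal])

cov_raw = coverage(y[test], mu[test], raw_sigma[test])
cov_cal = coverage(y[test], mu[test], s * raw_sigma[test])
print(f"scale={s:.3f}  1-sigma coverage raw={cov_raw:.3f}  calibrated={cov_cal:.3f}")
```

Because the synthetic predicted standard deviations are deliberately half the true noise level, the fitted factor should land near 2, and the one-sigma coverage on the test split should move toward the roughly 68% expected for Gaussian errors. A single scalar can only correct a uniform mis-scaling; a monotone map such as isotonic regression is the more flexible option when the miscalibration varies with the predicted uncertainty.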
PMID:41635978 | DOI:10.1002/chem.202503299