Multitask Pretraining Framework for Improving Predictivity of Machine Learning Chemical Bioactivity Models for Low-Data Endpoints

Chem Res Toxicol. 2026 Mar 26. doi: 10.1021/acs.chemrestox.6c00143. Online ahead of print.

ABSTRACT

Computational models are crucial for rapid hazard screening of novel chemicals when time and resources are not available for laboratory assessment. The rise of machine learning (ML) methods powering quantitative structure-activity relationship (QSAR) models has enabled data-driven development of predictive models for health effects screening. However, these models are typically single-task, meaning that they are trained on a single toxicological endpoint and lack transferability to similar tasks, i.e., the ability to predict chemicals’ effects on related endpoints. Thus, when predictions are needed for another endpoint, a new model must be trained from scratch. Further, single-task ML models are typically trained on very large, homogeneous data sets, which are not available for most adverse outcome endpoints. Effective hazard screening would benefit from approaches that can handle multiple small, noisy data sets recording complex chemical and biological mechanisms. To that end, we trained an ML model simultaneously on multiple tasks curated from moderate-sized (∼1000 observations) ToxCast data sets. To predict novel tasks from small (∼100 observations) ToxCast data sets, we combined our pretrained multitask model with a task-specific predictor, either a random forest or a neural network. These two components comprise a novel ML pipeline that generates and uses molecular representations from our multitask model. Compared to a common ML approach using standard chemical representations, our pipeline performed statistically better on a majority of tasks, regardless of the choice of downstream predictor. The advantage of the molecular representations from our multitask model, over those from a single-task model, is that they combine information on multiple effects to provide a model of chemical space that captures generalizable information. This work contributes to efforts to improve the utility of ML QSAR methods for predicting chemicals’ bioactivity on low-data toxicological endpoints.
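
The two-stage pipeline described in the abstract (multitask pretraining, then a task-specific predictor trained on the learned molecular representations) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture, input features, or training procedure. The sketch assumes binary fingerprint inputs, a single shared tanh layer with one logistic head per pretraining task, masked labels for unmeasured compound/task pairs, and a small logistic head standing in for the downstream neural network predictor (a random forest could be fitted on the same embeddings). All data are synthetic, and every name (`MultitaskEncoder`, `fit_logistic_head`, the array names) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MultitaskEncoder:
    """Hypothetical encoder: one shared tanh layer, one logistic head per task."""

    def __init__(self, n_features, n_hidden, n_tasks, rng):
        self.W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_tasks))
        self.b2 = np.zeros(n_tasks)

    def embed(self, X):
        # Shared molecular representation reused by downstream predictors.
        return np.tanh(X @ self.W1 + self.b1)

    def fit(self, X, Y, mask, lr=0.5, epochs=200):
        # Y holds 0/1 labels; mask is 1 where a label was actually measured.
        for _ in range(epochs):
            H = self.embed(X)
            P = sigmoid(H @ self.W2 + self.b2)
            # Gradient of the masked mean binary cross-entropy w.r.t. the logits.
            G = (P - Y) * mask / max(mask.sum(), 1.0)
            dH = (G @ self.W2.T) * (1.0 - H ** 2)  # backprop through tanh
            self.W2 -= lr * (H.T @ G)
            self.b2 -= lr * G.sum(axis=0)
            self.W1 -= lr * (X.T @ dH)
            self.b1 -= lr * dH.sum(axis=0)
        return self

def fit_logistic_head(H, y, lr=0.5, epochs=300):
    """Task-specific predictor trained on frozen embeddings (stand-in head)."""
    w, b = np.zeros(H.shape[1]), 0.0
    for _ in range(epochs):
        g = (sigmoid(H @ w + b) - y) / len(y)
        w -= lr * (H.T @ g)
        b -= lr * g.sum()
    return w, b

# Synthetic stand-ins for the data (assumed sizes: ~1000 and ~100 compounds).
n_bits, n_tasks = 64, 5
X_pre = rng.integers(0, 2, (1000, n_bits)).astype(float)
A = rng.normal(size=(n_bits, n_tasks))            # hidden task structure
Y_pre = ((X_pre - 0.5) @ A > 0).astype(float)
mask = (rng.random(Y_pre.shape) < 0.7).astype(float)  # ~30% labels missing

# Stage 1: pretrain the shared encoder on all tasks at once.
encoder = MultitaskEncoder(n_bits, 16, n_tasks, rng).fit(X_pre, Y_pre, mask)

# Stage 2: fit a small predictor for a novel, low-data task on the embeddings.
X_new = rng.integers(0, 2, (100, n_bits)).astype(float)
y_new = ((X_new - 0.5) @ A.mean(axis=1) > 0).astype(float)
w, b = fit_logistic_head(encoder.embed(X_new), y_new)
probs = sigmoid(encoder.embed(X_new) @ w + b)
```

The encoder weights are frozen in the second stage, so only the small head is fitted on the roughly 100 labeled compounds. A real evaluation would hold out test compounds and compare against a baseline trained on the raw features, which this sketch does not attempt.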

PMID:41887802 | DOI:10.1021/acs.chemrestox.6c00143

By Nevin Manimala