Data-driven frameworks to robustly predict solubility parameter of diverse polymers

Sci Rep. 2025 Aug 25;15(1):31157. doi: 10.1038/s41598-025-12758-1.

ABSTRACT

This study aims to predict the solubility parameter of diverse polymers using machine learning models that capture the complex relationships between the target variable and a set of input features: molecular weight, melting point, boiling point, liquid molar volume, radius of gyration, dielectric constant, dipole moment, refractive index, van der Waals area and reduced volume, and parachor. A range of algorithms was applied, from basic methods such as Linear Regression to more advanced techniques: Ridge Regression, Lasso Regression, Elastic Net, Support Vector Machines (SVMs), K-Nearest Neighbors (KNN), Decision Trees, Random Forests (RFs), Gradient Boosting Machines (GBM), Light Gradient Boosting Machine (LightGBM), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), Artificial Neural Networks (ANNs), and Convolutional Neural Networks (CNNs). Model performance was evaluated using statistical metrics (R², RMSE, and MRD%) together with visual tools including cross-plots, deviation plots, and SHAP analysis to support interpretability and predictive reliability. To check the reliability of the dataset, which consists of 1,799 datapoints on the solubility parameter of polymers, a Monte Carlo outlier detection algorithm was applied, confirming that the data were suitable for model training and evaluation.
Results indicated that CatBoost, ANN, and CNN outperformed the other techniques, achieving the highest R² values and the lowest error rates. Sensitivity analysis showed that every input feature affected the target variable, while SHAP analysis identified dielectric constant as the most influential factor for the solubility parameter of polymers. These results demonstrate the effectiveness of the machine learning methods used and the importance of the input parameters in determining the solubility parameter. Beyond confirming predictive accuracy, the approach offers insight into how the input features affect the solubility parameter of polymers, improving model interpretability and scientific understanding.
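
For readers unfamiliar with the three metrics named in the abstract, the sketch below shows one common way to compute them. The abstract does not define MRD%; the code assumes it means the mean absolute relative deviation expressed as a percentage, which may differ from the paper's exact formula. The function names and sample values are hypothetical and are not data from the study.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def mrd_percent(actual, predicted):
    """Mean relative deviation in percent.

    Assumed definition (not confirmed by the abstract):
    100 / N * sum(|predicted - actual| / |actual|).
    """
    relative = [abs((p - a) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(relative) / len(actual)


# Hypothetical solubility-parameter values, for illustration only.
actual = [18.0, 20.0, 22.0, 24.0]
predicted = [18.5, 19.5, 22.5, 23.5]

print(f"R^2  = {r_squared(actual, predicted):.3f}")
print(f"RMSE = {rmse(actual, predicted):.3f}")
print(f"MRD% = {mrd_percent(actual, predicted):.2f}")
```

A higher R² and lower RMSE and MRD% indicate a closer fit, which is the sense in which the abstract ranks CatBoost, ANN, and CNN above the other models.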

PMID:40851024 | DOI:10.1038/s41598-025-12758-1

By Nevin Manimala