J Environ Qual. 2026 Jan-Feb;55(1):e70141. doi: 10.1002/jeq2.70141.
ABSTRACT
High total phosphorus (TP) concentrations in fresh waters such as streams, rivers, and lakes remain a global issue, causing eutrophication and poor ecological conditions. We extracted annual flow-weighted TP concentration data (N = 3144) from 207 monitored Danish headwater streams (catchment area < 50 km²) for the period 1990-2019 as the basis for developing a machine learning (ML) model of diffuse phosphorus in streams, using 22 potential predictor variables. We tested 70 different algorithms on an AI platform, applying a random split strategy with a holdout dataset (20%), a validation dataset (16%), and four training datasets (4 × 16%) in a fivefold cross-validation procedure across all 207 stream stations. The ML algorithm “eXtreme Gradient Boosted trees regressor: XGBoost” was identified as the best-performing model, based on 13 predictor variables, with relatively high explanatory power for the training dataset (R² = 0.68), the cross-validation dataset (R² = 0.71), and the holdout dataset (R² = 0.66). The most important catchment characteristics among the 13 predictor variables were paved area, tile-drained area, clay content in subsoil, farmed area, and TP loss from bank erosion. An external test of model transferability on a dataset from an additional 142 stream stations (N = 1261) revealed somewhat lower predictive power for the final model (R² = 0.41). The final model was applied to simulate annual diffuse flow-weighted TP concentrations for 3200 unique headwater catchments (ca. 15 km²), and this analysis now supports the calculation of annual TP loadings to surface waters from otherwise ungauged near-coastal areas in Denmark.
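The data partition described above (20% holdout, with the remaining 80% divided into five folds of 16% each) can be illustrated with a minimal sketch. This is not the authors' AI platform or code: the function names `partition_records` and `cv_rounds`, the seed, and the choice to split at the level of individual annual records rather than whole stations are all assumptions made for illustration.

```python
import random


def partition_records(n_records, holdout_frac=0.20, n_folds=5, seed=0):
    """Randomly split record indices into one holdout set and n_folds folds.

    Hypothetical helper: assumes records (not stations) are the split unit.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_records * holdout_frac)
    holdout = indices[:n_holdout]
    remaining = indices[n_holdout:]
    # Deal the remaining indices round-robin so fold sizes differ by at most one.
    folds = [remaining[i::n_folds] for i in range(n_folds)]
    return holdout, folds


def cv_rounds(folds):
    """Yield (train, validation) index lists, using each fold once for validation."""
    for k, validation in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation


# N = 3144 annual records, as reported in the abstract.
holdout, folds = partition_records(3144)
for train, validation in cv_rounds(folds):
    # In each round, four folds (4 x 16%) train the model and one fold (16%) validates it.
    pass
```

In each round four folds are used for training and one for validation, and the holdout set is left untouched until the final evaluation. If records from the same station should not be shared between partitions, the same logic could be applied to station identifiers instead of record indices.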
PMID:41618754 | DOI:10.1002/jeq2.70141