Environ Monit Assess. 2026 May 27;198(6):657. doi: 10.1007/s10661-026-15505-9.
ABSTRACT
Continuous, uninterrupted air quality monitoring is essential for environmental management and public policy formulation, and it depends on complete, good-quality measurements. However, a variety of factors (local power outages, data transmission failures, instrument calibration, preventive maintenance, weather conditions, etc.) frequently cause measurement gaps of varying duration in historical air quality data. This work addresses the problem of missing data in air quality monitoring time series, which compromises the quality of information and hinders decision-making related to air pollution. Carbon monoxide (CO) data were imputed in artificially generated gaps (from 24 to 72 h) for a monitoring station located in Salvador, Bahia (Brazil). Three dynamic modeling strategies with different architectures and learning algorithms were applied: XGBoost and two recurrent neural networks (LSTM and RNN). The results showed that, although XGBoost presented the lowest medians associated with RMSE and MAE distributions (0.1028 and 0.1266 ppm, respectively), the difference compared to the neural networks was not statistically significant. The statistical analysis of the predictions showed that the mean of the residuals does not differ significantly from zero, indicating an absence of systematic bias and suggesting that the imputed values preserve the dominant dynamics and seasonal patterns of the original series. The percentages of gaps consistently described by the models were 82.0% (XGBoost) and 91.3% (LSTM and RNN recurrent neural networks).
The results demonstrate that the adopted model structures (boosted decision trees and recurrent neural networks), combined with a systematic approach to analyzing and preparing the training sample (identification of input variables, mapping of existing gaps in the historical data of the measurement station, and generation of artificial gaps, among others), enabled the imputation of dynamic CO data, preserving the dominant behavior of the time series and ensuring the validity of environmental monitoring.
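The evaluation protocol summarized above (artificial gaps of 24 to 72 h, per-gap RMSE and MAE, and a check of the mean residual) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hourly CO series is synthetic, the helper names `make_gaps`, `impute_linear`, and `gap_scores` are hypothetical, and plain linear interpolation stands in for the XGBoost, LSTM, and RNN models, which are not reproduced.

```python
import numpy as np

def make_gaps(n, n_gaps, min_len=24, max_len=72, rng=None):
    """Return a boolean mask with n_gaps non-overlapping artificial gaps."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros(n, dtype=bool)
    gaps, attempts = [], 0
    while len(gaps) < n_gaps and attempts < 10_000:
        attempts += 1
        length = int(rng.integers(min_len, max_len + 1))  # 24..72 h inclusive
        start = int(rng.integers(1, n - length - 1))
        # Require an observed sample on both sides so gaps never touch.
        if mask[start - 1:start + length + 1].any():
            continue
        mask[start:start + length] = True
        gaps.append((start, start + length))
    return mask, sorted(gaps)

def impute_linear(y, mask):
    """Stand-in imputer (assumption): linear interpolation over masked samples."""
    idx = np.arange(y.size)
    filled = y.copy()
    filled[mask] = np.interp(idx[mask], idx[~mask], y[~mask])
    return filled

def gap_scores(y_true, y_hat, gaps):
    """Per-gap RMSE, MAE, and mean residual (imputed minus observed)."""
    rows = []
    for a, b in gaps:
        res = y_hat[a:b] - y_true[a:b]
        rows.append((np.sqrt(np.mean(res ** 2)), np.mean(np.abs(res)), np.mean(res)))
    return np.array(rows)

# Synthetic hourly CO series in ppm (assumption): daily cycle plus noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 90)
co = 0.5 + 0.2 * np.sin(2 * np.pi * hours / 24) + 0.05 * rng.standard_normal(hours.size)

mask, gaps = make_gaps(co.size, n_gaps=10, rng=rng)
filled = impute_linear(co, mask)
scores = gap_scores(co, filled, gaps)

median_rmse, median_mae = np.median(scores[:, :2], axis=0)
print(f"median RMSE = {median_rmse:.4f} ppm, median MAE = {median_mae:.4f} ppm")
print(f"mean residual over all gaps = {np.mean((filled - co)[mask]):.4f} ppm")
```

Because the true values inside each artificial gap are known, the error distributions can be built gap by gap; a significance test on the residual mean (for example a one-sample t-test) would be a natural next step but is left out of this sketch.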
PMID:42192051 | DOI:10.1007/s10661-026-15505-9