Sci Rep. 2026 Apr 21. doi: 10.1038/s41598-026-49819-y. Online ahead of print.
ABSTRACT
Grain number estimation plays a crucial role in agriculture, serving as a key indicator for crop yield and quality assessment. With advances in computer vision, automatic grain detection has become a significant research area, where deep learning methods have shown remarkable promise. This study applies the Swin Transformer, a vision transformer that leverages hierarchical attention across shifted windows to capture both local and global features of grains in complex imagery. The model achieves an accuracy of 98%, the highest among the evaluated approaches, outperforming a traditional CNN baseline (ResNet-50) and DINO on grain counting tasks. To support and validate model performance, explainable AI (XAI) techniques, namely Grad-CAM and LIME, are employed, demonstrating that the model attends to relevant grain regions and improving interpretability. Furthermore, a comprehensive empirical analysis using multiple statistical tests evaluates the model's robustness and generalizability across various grain morphological parameters, establishing the Swin Transformer as a powerful and interpretable solution for intelligent grain counting in agricultural data analytics.
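The abstract does not include code; as an illustration of the kind of pipeline it describes, the sketch below loads a pretrained Swin Transformer and classifies a grain image. The model variant, the number of output classes, the preprocessing, and the file path are assumptions for illustration only, not the authors' configuration.

```python
# Hypothetical sketch: using a pretrained Swin Transformer for grain-image
# classification. Model variant, class count, and preprocessing are assumed,
# not taken from the paper.
import torch
import timm
from torchvision import transforms
from PIL import Image

NUM_GRAIN_CLASSES = 10  # assumed number of grain count/morphology classes

# Load a pretrained Swin Transformer from timm and replace its classification head.
model = timm.create_model(
    "swin_tiny_patch4_window7_224",  # assumed variant; the paper's choice may differ
    pretrained=True,
    num_classes=NUM_GRAIN_CLASSES,
)
model.eval()

# Standard ImageNet-style preprocessing at the model's 224x224 input size.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def predict_grain_class(image_path: str) -> int:
    """Return the predicted grain class index for a single image."""
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)
    with torch.no_grad():
        logits = model(batch)               # shape: (1, NUM_GRAIN_CLASSES)
    return int(logits.argmax(dim=1).item())

# Example usage (path is hypothetical):
# print(predict_grain_class("grain_sample.jpg"))
```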
PMID:42014763 | DOI:10.1038/s41598-026-49819-y