Deep Learning for Brain Tumour Analysis: A Systematic Review of CNN-Transformer Hybrids in Multimodal Imaging

Int J Biomed Imaging. 2026 Jun 16;2026:4763936. doi: 10.1155/ijbi/4763936. eCollection 2026.

ABSTRACT

BACKGROUND: Brain tumour detection and analysis using medical imaging requires the extraction of both local spatial features and global contextual representations. Although convolutional neural networks (CNNs) excel at capturing local spatial patterns and Transformer-based architectures model long-range dependencies effectively, the optimal architectural paradigm for clinical deployment remains unresolved. This systematic review and meta-analysis evaluates hybrid CNN-Transformer architectures for brain tumour detection, focusing on the integration of local and global feature learning, diagnostic accuracy and computational efficiency. The roles of generative adversarial networks (GANs) for addressing data scarcity and multimodal imaging fusion for diagnostic completeness are also critically examined.

METHODS: A systematic search was conducted across IEEE Xplore, PubMed, Scopus and Google Scholar for studies published between January 2021 and May 2025. From 1876 initially identified articles, 94 met the prespecified inclusion criteria following quality assessment using the QUADAS-2 and ROBINS-I frameworks. A random-effects meta-analysis of diagnostic accuracy was performed using the DerSimonian-Laird estimator, with statistical heterogeneity quantified using I² and publication bias assessed using funnel plot asymmetry and Egger’s test. Computational efficiency was standardised to GigaFLOPs using a reference input of 240 × 240 × 155 voxels (BraTS benchmark). FLOP estimates were taken from primary publications where available and otherwise bounded using theoretical complexity formulas; estimated values are explicitly distinguished throughout.
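
For readers unfamiliar with the pooling step named above, the following is a minimal sketch of DerSimonian-Laird random-effects pooling with Cochran's Q and I². It is not the review's actual analysis code. It assumes raw accuracy proportions with binomial variance p(1 - p)/n (published analyses often pool on a logit scale instead), and the four studies in the usage example are hypothetical values chosen only for illustration.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool per-study effects with the DerSimonian-Laird random-effects estimator.

    effects   -- per-study effect estimates (here: accuracy as a proportion)
    variances -- per-study sampling variances of those estimates
    """
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    # Method-of-moments estimate of between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each study's variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    # I^2: share of total variability attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "Q": q,
        "I2": i2,
    }

if __name__ == "__main__":
    # Hypothetical studies (accuracy, test-set size); NOT data from the review.
    accuracies = [0.91, 0.94, 0.96, 0.93]
    sizes = [200, 150, 300, 120]
    variances = [p * (1 - p) / n for p, n in zip(accuracies, sizes)]
    print(dersimonian_laird(accuracies, variances))
```

The truncation of the between-study variance at zero is part of the standard estimator; when Q falls below its degrees of freedom, the result reduces to the fixed-effect estimate.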

RESULTS: Across all 94 included studies, the pooled diagnostic accuracy was 93.5% (95% CI: 92.7%-94.4%); however, confirmed publication bias (Egger’s p = 0.043) indicates this represents an upper-bound approximation rather than an unbiased population estimate. Because subgroup study counts were insufficient for formal random-effects pooling (CNN-only: n = 3; Transformer-only: n = 2; CNN-Transformer hybrid: n = 4; minimum recommended n = 10 per subgroup), no subgroup meta-analysis was performed. Instead, descriptive mean accuracies are reported as hypothesis-generating observations only: CNN-only models 91.7%, Transformer-only models 93.6% and CNN-Transformer hybrid models 94.6%. These figures must not be interpreted as pooled meta-analytic estimates; they reflect mean observed accuracy across a small number of included studies and are reported solely to illustrate directional trends consistent with the mechanistic rationale for hybridisation. Substantial heterogeneity was observed (I² = 78.3%; p < 0.001). Three integration paradigms were identified: sequential (45% of models; 93.8% accuracy; 1.8 GFLOPs), parallel (32%; 94.3%; 2.8 GFLOPs) and hierarchical (23%; 94.9%; 3.5 GFLOPs). Parallel architectures demonstrated optimal clinical viability, balancing accuracy with a mean inference time of 2.1 s. GAN-based augmentation improved rare tumour class detection by 7%-10%, with conditional GANs outperforming vanilla architectures. Multimodal MRI + PET fusion achieved 94.2% accuracy at 2.8 GFLOPs, whereas triple-modality integration yielded marginal additional gains (95.1%) at substantially elevated computational cost (9.1 GFLOPs). 
Notably, 65% of included studies used the BraTS benchmark exclusively. Hybrid model accuracy declined from 94.6% on high-grade gliomas to 88.3% on low-grade gliomas, and hybrid architectures exhibited 2.3× greater susceptibility to Gaussian noise than CNN-only equivalents. These limitations constrain generalisation to real-world clinical settings.
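
The accuracy and cost figures reported above can be related by simple arithmetic. The sketch below computes accuracy points gained per additional GFLOP between two configurations, using only the mean values stated in this abstract. The function name `marginal_gain` is an illustrative assumption, and the ratio is a descriptive aid rather than a metric used by the review; it ignores inference time, which the review also weighs.

```python
# (mean accuracy in %, mean cost in GFLOPs), as reported in the abstract.
PARADIGMS = {
    "sequential": (93.8, 1.8),
    "parallel": (94.3, 2.8),
    "hierarchical": (94.9, 3.5),
}
FUSION = {
    "mri_pet": (94.2, 2.8),
    "triple_modality": (95.1, 9.1),
}

def marginal_gain(base, other):
    """Accuracy points gained per extra GFLOP when moving from base to other."""
    (acc_base, gflops_base), (acc_other, gflops_other) = base, other
    extra_cost = gflops_other - gflops_base
    if extra_cost <= 0:
        raise ValueError("other must cost more GFLOPs than base")
    return (acc_other - acc_base) / extra_cost

if __name__ == "__main__":
    steps = [
        ("sequential -> parallel",
         PARADIGMS["sequential"], PARADIGMS["parallel"]),
        ("parallel -> hierarchical",
         PARADIGMS["parallel"], PARADIGMS["hierarchical"]),
        ("MRI + PET -> triple modality",
         FUSION["mri_pet"], FUSION["triple_modality"]),
    ]
    for label, base, other in steps:
        print(f"{label}: {marginal_gain(base, other):.3f} points per GFLOP")
```

Because these are descriptive means from heterogeneous studies, the ratios illustrate direction only and carry no confidence intervals.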

CONCLUSIONS: Descriptive comparison of mean observed accuracies, based on study counts insufficient for confirmatory meta-analysis, suggests that hybrid CNN-Transformer architectures may offer diagnostic accuracy advantages over CNN-only and Transformer-only approaches; this observation is hypothesis-generating only and requires validation in a larger, more balanced evidence base. Among integration strategies, parallel architectures demonstrated the most favourable accuracy-efficiency balance in the reviewed evidence. GANs and multimodal imaging function as essential architectural enablers, addressing data scarcity and diagnostic incompleteness, respectively. Significant challenges remain in computational efficiency, noise robustness and generalisation to rare tumour subtypes, representing priority directions for future research.

PMID:42312286 | PMC:PMC13270495 | DOI:10.1155/ijbi/4763936

By Nevin Manimala